dangit, my inner data structure has a structural fault because of rust ownership semantics.Conceptually, the inner struct is a binary tree where inner nodes can carry a value, and leaves can carry a value or a child tree. If you need a leaf to hold bot...
-
dangit, my inner data structure has a structural fault because of rust ownership semantics.
Conceptually, the inner struct is a binary tree where inner nodes can carry a value, and leaves can carry a value or a child tree. If you need a leaf to hold both a child and a value, you store the child and move the value to the child's root node.
Conceptually again, lookups walk down this tree-of-trees looking for the node representing the lookup key, and nearest self-or-parent value is the result.
-
But to make lookups fast, "empty" nodes under a value node hold a reference to their nearest parent value (again conceptually), and the tree is laid out such that you can look up the leaf nodes in a single memory access.
So, as you traverse down, you do leaf lookups only, and the answer you get is one of: nothing here; a value; a reference to a value somewhere up-tree; or a child tree.
-
To be able to answer the "nearest parent" query a lookup is asking, as you're walking down the tree-of-trees, whenever you see a child tree and continue down, you also remember which value _would_ be the answer if there was no child tree. That way you can traverse downwards a single time, with no backtracking, and if you end up in a sub-tree where you get a "nothing here" answer, the last value you saw in another tree on the way down is your answer.
-
In C, this is all straightforward to implement: it's all raw pointers, so you just sprinkle the pointers into the various trees where they need to be and that's all fine.
In rust, I implemented the tree nodes as an enum with the 4 alternatives, which is fine... Except for what happens at leaves with children.
Say you had a tree with a value somewhere in the middle, such that a leaf node is a reference to that value. (again reference conceptually, in rust specifically it's an Indirect(usize)
-
But then as you're inserting a new value, you discover that this leaf needs to be a child tree. That's fine, the rule from earlier is that if you need a child tree at a place that isn't empty, you attach the child tree and move the previous value into the child's root node.
And this is where the rust police kicks down the door, because in the scenario above, the child's root node now references a node in its parent tree, which you can't express cleanly in safe rust.
-
It's _possible_ to express what's happening here, but fundamentally the C-ish representation of this data structure demands a bunch of mutable pointers that span multiple data structures, forming a graph that basic Rust ownership cannot express. You have to start reaching for Rcs and Cells and other things to transpose the algorithm directly, and I'm trying to avoid that.
-
The way I implemented this in Go was: leaf nodes in the inner trees can have _both_ a value and a child. So, in rust terms, the leaf node looks something like:
enum Value {
Direct(V),
Indirect(OtherNodeID),
}struct Leaf {
value: Option<Value>,
child: Option<Box<Table>>
}Which is fine, that works, but also it doubles the memory footprint of nodes, something I was specifically trying to avoid in the rust impl.
-
@danderson yeah, we hear that. like, we do think those ownership rules exist for a reason, lifetime bugs in the C versions of things like this can be very subtle, but... yeah
if you can't remove the up-pointers, you pretty much have to turn them into weak pointers, we don't see another way
-
So... Need to ponder some more. This data structure is already quite memory hungry. It's the core tradeoff it makes - memory cost scales down poorly, but for large numbers of values it amortizes decently and in return all operations are very fast indeed.
If I have to double the node size (as I did in Go), the overall type ends up being 1.5-2x more expensive in memory than it needs to be, and that really sucks.
-
@irenes The obvious other way is to replace my current Indirect(usize) values (which tell you what other index of an inner array to look at for the actual value) with a naked *Value, and start using unsafe blocks to manage that.
All the unsafety would be hidden way inside the type, and that's the most efficient straight-line way to achieve equivalent performance to the C version, and it's _possible_ to maintain the invariants... But I'm trying to avoid unsafe because noob.
-
@danderson You can't express it with a `&` reference to the parent, but you are free to express it with a `ReferenceToValueInParentNode(usize)` struct or enum variant. Perhaps this is another reason you need a type to represent the state of the tree traversal — if you captured that state, you could use it to interpret that kind of reference.
-
@willglynn Yeah, adding one more variant to the enum is on my list of options here, but I don't love it because it forces lookups to carry more state and branches, and extremely fast/cheap lookups is one of the key benefits of this type.
But, short of embracing unsafe and stashing raw pointers as the indirection mechanism (and take on all the risk of enforcing the invariants), I don't see many other options.
-
@danderson we very much think it's appropriate to have guard rails no matter how experienced you are.
-
Fundamentally, if I'm sticking to the algorithms as defined, I have 3 options: leaf nodes become 2x larger (carry both a value and a child, both optional), lookups become slower (need to carry enough state to resolve non-ptr cross-tree references), or segfaults (use unsafe and raw ptrs as references, which the struct _can_ hide and still obey the rules of unsafety, but I may have a skill issue).
Or, figure out if a tweak to the algorithms could save me some trouble...
-
@danderson I think I'd have to see it to make better suggestions.
What makes this a branch in Rust and not a branch in C? Are they all pointers to values in C, some of which are owned and others borrowed, depending on data structure semantics invisible to the Rust compiler?
-
@irenes Oh certainly, and there could be some type wrapping going on to keep the unsafety very constrained and surrounded by asserts.
... but, data structures that need to do cursed things to go fast _is_ one of the places where judicious unsafety is most common, for a reason
-
@danderson can’t the leaf also be an enum
-
@willglynn In C, nodes that carry values and "indirects" are both just pointers to the value, and the overall tree-of-trees data structure is responsible for not screwing up ownership and cleanup of this distributed state. It also uses pointer tagging to distinguish between pointer-to-value and pointer-to-child. So, the struct is just a sea of tagged pointers, and hopefully the programmer implemented stuff correctly
And yes, the ownership in C is who knows good luck don't segfault
-
@danderson another approach I’ve used is to give up on using pointers for representing edges in graph algorithms, and using indices into a backing array instead. This can work well as long as you never need to return a language-native pointer to an element. No idea of that’s an easy option here — I haven’t read the algorithm.
-
@danderson This sounds like the data structure does not need to distinguish between pointer-to-value-that-I-own and pointer-to-value-that-I-reference. Value pointers are value pointers, direct or indirect.
If that's the case, can you just… `clone()` and store the value twice instead of storing explicit indirection?