I think people really don't appreciate just how incomplete Linux kernel API docs are, and how Rust solves the problem.
-
I wrote a pile of Rust abstractions for various subsystems. For practically every single one, I had to read the C source code to understand how to use its API.
Simply reading the function signature and associated doc comment (if any) or explicit docs (if you're lucky and they exist) almost never fully tells you how to safely use the API. Do you need to hold a lock? Does a ref counted arg transfer the ref or does it take its own ref?
When a callback is called are any locks held or do you need to acquire your own? What about free callbacks, are they special? What's the intended locking order? Are there special cases where some operations might take locks in some cases but not others?
Is a NULL argument allowed and valid usage, or not? What happens to reference counts in the error case? Is a returned ref counted pointer already incremented, or is it an implied borrow from a reference owned by a passed argument?
Is the return value always a valid pointer? Can it be NULL? Or maybe it's an ERR_PTR? Maybe both? What about pointers returned via indirect arguments, are those cleared to NULL on error or left alone? Is it valid to pass a NULL ** if you don't need that return pointer?
-
Asahi Lina (朝日リナ) // nullptr::live replied to Asahi Lina (朝日リナ) // nullptr::live
Sometimes these requirements were reasonable, just unwritten. Sometimes they were a bit too flexible/wild and I had to make some opinionated decisions when writing the Rust abstractions to narrow it down to a safe usage.
Sometimes I had to add extra locking inside the abstraction in order to make it practical to make safe. Sometimes I had to make some small changes to the C side to make it more orthogonal or logical and usable, e.g. to expose an unlocked function to be used with a lock taken.
Sometimes the locking was subtle enough that while I was able to write a safe Rust abstraction, it came with a big doc comment explaining how you have to be careful with usage and drop order to avoid deadlocks (deadlocks are "safe", Rust doesn't inherently protect against them).
Sometimes it was a lost cause without making the C side more reasonable (drm_sched only, really).
However, most of the time the compromises made when writing the Rust abstraction point at issues with the C side design and how it could be improved.
In general the approach is "write the Rust side making as few changes as possible to the C side first to avoid conflict, then maybe propose changes to the C side based on lessons learned" (we haven't really gotten to the second part yet at all).
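The drop-order caveat mentioned above can be illustrated in user space. This is only a sketch using `std::sync::Mutex`, not kernel code: `Queue`, `Job`, `submit`, `pending`, and `safe_order` are hypothetical names invented for the illustration. A `Job` takes the queue lock in its destructor, so dropping it while the caller still holds that lock would hang, even though nothing here is `unsafe`.

```rust
use std::sync::Mutex;

/// Hypothetical queue; the job list can only be reached through the lock.
pub struct Queue {
    jobs: Mutex<Vec<u32>>,
}

/// Hypothetical handle that removes itself from the queue when dropped.
pub struct Job<'a> {
    queue: &'a Queue,
    id: u32,
}

impl Queue {
    pub fn new() -> Self {
        Queue { jobs: Mutex::new(Vec::new()) }
    }

    /// Registers a job and hands back the handle that owns it.
    pub fn submit(&self, id: u32) -> Job<'_> {
        self.jobs.lock().unwrap().push(id);
        Job { queue: self, id }
    }

    /// Number of jobs currently registered.
    pub fn pending(&self) -> usize {
        self.jobs.lock().unwrap().len()
    }
}

impl Drop for Job<'_> {
    fn drop(&mut self) {
        // The destructor takes the queue lock. If the caller still holds a
        // guard on `jobs` at this point, std's Mutex will deadlock or panic.
        let mut jobs = self.queue.jobs.lock().unwrap();
        jobs.retain(|&j| j != self.id);
    }
}

/// Safe ordering: the guard is released before the job is dropped.
pub fn safe_order(queue: &Queue) -> usize {
    let job = queue.submit(7);
    {
        let guard = queue.jobs.lock().unwrap();
        assert_eq!(guard.len(), 1);
    } // `guard` goes out of scope here, releasing the lock
    drop(job); // now the destructor can take the lock itself
    queue.pending()
}
```

Moving `drop(job)` inside the inner block would still compile, which is the point: the type system rules out the memory-safety bugs, but the ordering rule still has to live in a doc comment.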
-
Asahi Lina (朝日リナ) // nullptr::live replied to Asahi Lina (朝日リナ) // nullptr::live
But the end result of all this is that you CAN, in fact, just look at the Rust API and know how to use it correctly for the most part. You never have to worry about reference counts, about NULL pointers, about forgetting to check results, about dropping refs in error cases.
You never have to worry about holding the right locks, about accidentally forgetting to take a ref or dropping it twice. You never have to wonder how error returns are encoded.
Because if you make a mistake with these things, your code won't compile.
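As a rough user-space sketch of how a signature can carry those rules: everything below (`Registry`, `Device`, `Error`, `register`, `lookup`) is hypothetical and only stands in for the kind of abstraction described, with `std::sync::Arc` and `Mutex` in place of the kernel's own types. Taking `Arc<Device>` by value says the reference is transferred, `Result` says how errors are reported, `Option` says the result may be absent, and the map sitting inside the `Mutex` means it cannot be touched without the lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical error type standing in for an errno-style return.
#[derive(Debug, PartialEq)]
pub enum Error {
    Inval,
    Exists,
}

/// Hypothetical reference-counted object.
#[derive(Debug)]
pub struct Device {
    pub name: String,
}

/// Hypothetical registry. The map lives inside the Mutex, so there is
/// no way to read or modify it without holding the lock.
pub struct Registry {
    devices: Mutex<HashMap<u32, Arc<Device>>>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { devices: Mutex::new(HashMap::new()) }
    }

    /// `dev` is passed by value: the caller's reference is handed over.
    /// On success the registry keeps it; on error it is dropped on return,
    /// so the error path cannot leak it.
    pub fn register(&self, id: u32, dev: Arc<Device>) -> Result<(), Error> {
        if dev.name.is_empty() {
            return Err(Error::Inval);
        }
        let mut map = self.devices.lock().unwrap();
        if map.contains_key(&id) {
            return Err(Error::Exists);
        }
        map.insert(id, dev);
        Ok(())
    }

    /// `Option` makes "may be absent" explicit. The returned `Arc` is a new
    /// reference owned by the caller, not a borrow from the registry.
    pub fn lookup(&self, id: u32) -> Option<Arc<Device>> {
        self.devices.lock().unwrap().get(&id).cloned()
    }
}
```

A caller that ignores the `Result` gets a compiler warning, and one that uses the `Option` without handling `None` does not compile.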
Of course you can still misuse APIs, but the worst that will happen is that you'll get an error return, or maybe a deadlock (deadlocks are easy to debug with lockdep and I wrote a really neat Arc<> integration to catch potential drop/decref related locking errors).
Even with APIs that mostly are fairly rigorously documented (OpenFirmware/Device Tree comes to mind), following all the rules in C is often tedious and error prone. Look at some random OF code in a driver and there's a good chance it leaks references.
(This doesn't really matter for most systems since they don't compile kernels with OF_DYNAMIC so ref counts are ignored, so this never gets noticed and fixed.)
But with my OF Rust abstractions? They do ref counting for you. You can just forget about it.
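To show the general idea (this is not the actual OF abstraction, which isn't shown in this thread), here is a small user-space sketch. `RawNode` stands in for a C object with a manual reference count, and `NodeRef`, `get`, `compatible`, and `probe` are hypothetical names. The wrapper increments the count when created or cloned and decrements it in `Drop`, so an early error return releases every reference without any cleanup code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for a C object with a manual reference count (hypothetical).
pub struct RawNode {
    refs: AtomicUsize,
    compatible: &'static str,
}

impl RawNode {
    pub fn new(compatible: &'static str) -> Self {
        RawNode { refs: AtomicUsize::new(0), compatible }
    }

    /// Current number of outstanding references.
    pub fn refs(&self) -> usize {
        self.refs.load(Ordering::SeqCst)
    }
}

/// Owning handle: holding one means holding exactly one reference.
pub struct NodeRef<'a> {
    raw: &'a RawNode,
}

impl<'a> NodeRef<'a> {
    /// Takes a new reference (the "get" side of a get/put pair).
    pub fn get(raw: &'a RawNode) -> Self {
        raw.refs.fetch_add(1, Ordering::SeqCst);
        NodeRef { raw }
    }

    pub fn compatible(&self) -> &str {
        self.raw.compatible
    }
}

impl Clone for NodeRef<'_> {
    /// Cloning takes an additional reference.
    fn clone(&self) -> Self {
        NodeRef::get(self.raw)
    }
}

impl Drop for NodeRef<'_> {
    /// Dropping releases the reference (the "put" side), on every path.
    fn drop(&mut self) {
        self.raw.refs.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Hypothetical probe routine with an early error return.
/// On success it reports how many references were live at that point.
pub fn probe(raw: &RawNode) -> Result<usize, String> {
    let node = NodeRef::get(raw);
    let _alias = node.clone();
    if node.compatible() != "vendor,widget" {
        // Both handles are dropped here; no manual cleanup is needed.
        return Err(format!("unsupported: {}", node.compatible()));
    }
    Ok(raw.refs())
}
```

In the equivalent C, each exit path would need its own matching put calls, which is where leaks in driver code tend to come from.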
-
Asahi Lina (朝日リナ) // nullptr::live replied to Asahi Lina (朝日リナ) // nullptr::live
In the end, coding kernel code in Rust is a huge change from coding C. With C you have two options:
- Wing it and either hope reviewers catch it or suffer debugging subtle oopses
- Spend hours understanding the code before you dare use it, and hope you caught everything.
This adds extra reviewer and maintainer workload too! It means that they need to review submissions to ensure they follow all these hidden rules that aren't documented. Sometimes they miss things. Sometimes the problem is major enough the code needs a big refactor.
All that just goes away with Rust. Poof. Gone. If it compiles it's safe and won't oops or leak references (except unsafe code, but then you only have to review THAT and the rule is it has to be carefully documented).
Of course we still need code reviews, and help from experts in specific subsystems. Rust doesn't magically make code perfect.
But it does get rid of all the silly low level problems and mistakes, so you can focus on the high level ones.
-
Asahi Lina (朝日リナ) // nullptr::live replied to Asahi Lina (朝日リナ) // nullptr::live
To be clear, I don't blame Linux developers for the incomplete docs. For better or worse, the Linux kernel is very complex and has to deal with a lot of subtlety. Most userspace APIs have much simpler rules you have to follow to use them safely. Kernels are hard!
Even experienced kernel developers get these things wrong all the time. It's not a skill issue. It's simply not possible for humans to keep all of these complex rules in their head and get them right, every single time. We are not built for that.
We need tooling to help us.
The solution is called Rust. Encode all the rules in the code and type system once, and never have to worry about them again.
Just like the solution to coding style arguments is to encode all the rules in an auto formatter and never have to worry about them again (hint hint! ^^)
And then we can stop worrying about all the low-level safety, ownership, and locking problems, and start worrying about more important things like high-level driver and subsystem design.
-
Erin 💽✨ replied to Asahi Lina (朝日リナ) // nullptr::live
@lina honestly, I do blame them for the state of the documentation because it's abnormally bad.
The number of times I've tried to understand the behaviour of a major area of the network stack, found it completely undocumented, and ended up reading the source code is far too high. It really is shocking how often fancy features get introduced with zero documentation, so that to use them you have to reverse engineer either the kernel or the one user space application that uses them.
In a clean C codebase, there are often aspects you have questions about or which aren't initially clear. The Linux kernel isn't a clean codebase; it's a pig sty.
And this isn't an inherent requirement of anything that the developers are building. The BSDs are *not like this*.
-
@lina I've seen some of the developers concerned about the lack of fresh blood in subsystem maintenance and similar roles and honestly it doesn't take much looking at the code to understand why, even if you ignore the culture problems.