Cancellations in async Rust

(sunshowers.io)

229 points | by todsacerdoti a day ago ago

77 comments

ossopite a day ago ago
I think the send/recv with a timeout example is very interesting, because in a language where futures start running immediately without being polled, I think the situation is likely to be the opposite way around. send with a timeout is probably safe (you may still send if the timeout happened, which you might be sad about, but the message isn't lost), while recv with a timeout is probably unsafe, because you might read the message out of the channel but then discard it because you selected the timeout completion instead. And the fix is similar, you want to select either the timeout or 'something is available' from the channel, and if you select the latter you can peek to get the available data.
[-]
- lionkor 14 hours ago ago
  Isn't this exactly what cancellation-safety is all about?
- sunshowers a day ago ago
  Thanks, that is a great point.
Matthias247 a day ago ago
Some other material that has been written by me on that topic:
- Proposal from 2020 about async functions which are forced to run to completion (and thereby would use graceful cancellation if necessary). Quite old, but I still feel that no better idea has come up so far. https://github.com/Matthias247/rfcs/pull/1
- Proposal for unified cancellation between sync and async Rust ("A case for CancellationTokens" - https://gist.github.com/Matthias247/354941ebcc4d2270d07ff0c6...)
- Exploration of an implementation of the above: https://github.com/Matthias247/min_cancel_token
alembic_fumes a day ago ago
I'm not understanding what the supposed problem with these futures getting cancelled is. Since futures are not tasks, as the post itself acknowledges, does it not logically follow that one should not expect futures to complete if the future is not driven to completion, for one reason or another? What else could even be expected to happen?
The examples presented for "cancel unsafe" futures seem to me like the root of the problem is some sort of misalignment of expectations to the reality:
Example 1: one future cancelled on error in the other
let res = tokio::try_join!( do_stuff_async(), more_async_work(), );
Example 2: data not written out on cancellation
let buffer: &[u8] = /* ... */; writer.write_all(buffer)?;
Both of these cases are claimed to not be cancel-safe, because the work gets interrupted and so not driven to completion. But again, what else is supposed to happen? If you want the work to finish regardless of the async context being cancelled, then don't put it in the same async context but spawn a task instead.
I feel like I must be missing something obvious that keeps me from understanding the author's issue here. I thought work getting dropped on cancellation is exactly how futures are supposed to work. What's the nuance that I'm missing?
[-]
- sunshowers a day ago ago
  You're absolutely right! The problem is that this has introduced many bugs in our experience at Oxide. If you've already fully internalized the idea that futures are passive and can be cancelled at any await point, the talk is just a bunch of details.
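  To make the "passive" point concrete, here is a minimal sketch using only the standard library. `YieldOnce`, `two_steps`, and `run_and_cancel` are hypothetical names invented for this illustration, and the future is polled by hand with a no-op waker rather than by a real runtime.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing; enough for polling by hand.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical helper: `Pending` on the first poll, `Ready` afterwards.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

async fn two_steps(log: &RefCell<Vec<&'static str>>) {
    log.borrow_mut().push("step 1");
    YieldOnce(false).await; // an await point: the caller may stop polling here
    log.borrow_mut().push("step 2");
}

fn run_and_cancel() -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    {
        let mut fut = Box::pin(two_steps(&log));
        // Creating the future ran nothing.
        assert!(log.borrow().is_empty());
        // One poll runs the body up to the first await point.
        assert!(fut.as_mut().poll(&mut cx).is_pending());
        // `fut` is dropped at the end of this block: that is the cancellation.
    }
    log.into_inner()
}
```

  Calling `run_and_cancel()` yields a log containing only "step 1": nothing after the await point runs, and nothing reports that it was skipped.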
  [-]
  - alembic_fumes a day ago ago
    I see. Do you suppose that the origin of these bugs is more about the difficulty of reasoning about the execution of deep async stacks, or does it come down to the developers holding an incorrect mental model of the Rust futures in their minds?
    I am asking because I've noticed that many developers with previous experience from "task-based" languages (specifically the JS/TS world) tend to grasp the basics of Rust async quickly enough, but then run into expectation-misalignment related problems similar to the examples that you used in your post. That in turn has made me want to understand whether it is the Rust futures that are themselves difficult or strange, or whether it's a case of the Rust futures appearing simple and familiar, even though they are completely different in very subtle ways. I suppose that it's a combination of both.
    [-]
    - sunshowers a day ago ago
      Yeah, it's a combination of both in my experience. I think even to experienced async Rust programmers, things like Tokio mutexes being really hard to use correctly can be a bit surprising.
      Also, as another comment on the thread points out [1], languages where futures are active by default can have the opposite problem.
      [1] https://news.ycombinator.com/item?id=45467188
CodeBrad a day ago ago
This was one of my favorite talks from RustConf this year! The distinction between cancel safety and cancel correctness is really helpful.
Glad to see it converted to a blog post. Talks are great, but blogs are much easier to share and reference.
[-]
- pornel a day ago ago
  "Cancel correctness" makes a lot of sense, because it puts the cancellation in some context.
  I don't like the "cancel safety" term. Not only is it unrelated to Rust's concept of safety, it's also unnecessarily judgemental.
  Safe/unsafe implies there's a better or worse behavior, but what is desirable for cancellation to do is highly context-dependent.
  Futures awaiting spawned tasks are called "cancellation safe", because they won't stop the task when dropped. But that's not an inherently safe behavior – leaving tasks running after their spawner has been cancelled could be a bug: piling up work that won't be used, and even interfering with the rest of the program by keeping locks locked or ports used. OTOH a spawn handle that stops the task when dropped would be called "cancellation unsafe", despite being a very useful construct specifically for propagating cleanup to dependent tasks.
- sunshowers a day ago ago
  Thanks! I definitely prefer reading blog posts over watching talks as well.
foota a day ago ago
https://sunshowers.io/posts/cancelling-async-rust/#the-pain-... was the most interesting part of this for me, as I can totally see making mistakes like this.
[-]
- schmichael a day ago ago
  I'm a Go developer and this was still useful for me! Obviously Rust devs are more accustomed to more assistance from their tools than Go devs, but just about every gotcha listed is something that can happen in Go with goroutines, channels, select, and other shared concurrency primitives.
  [-]
  - jcgrillo 9 hours ago ago
    !m schmichael
Animats a day ago ago
In the initial example, it's not clear what the desired behavior is. If the queue is full, the basic options are drop something, block and wait, or panic. Timing out on a block is usually deadlock detection. He writes "It turns out that this code is often incorrect, because not all messages make their way to the channel." Well, yes. You're out of resources. Now what?
What's he trying to do? Get a clean program shutdown? That's moderately difficult in threaded programs, and async has problems, too. The use case here is unclear.
The real use cases involve when you're sending messages back and forth to a remote site, and the remote site goes away. Now you need to dispose of the state on your end.
[-]
- sunshowers a day ago ago
  Ideally, what you would like to do is buffer up the message until there's space in the channel. I cover this later in the talk under "What can be done".
  [-]
  - Animats a day ago ago
    The double-loop thing effectively creates a blocking operation. Something you can do directly. Why all the complexity?
    [-]
    - sunshowers a day ago ago
      Agreed that in the narrow case of a timeout it doesn't buy you much (and things like network sockets often let you do timeouts in synchronous code). But often you do want the power to do selects and more complex state machines. I wrote a blog post a couple years ago talking about why a project I'm the author of, cargo-nextest, switched from sync Rust to async. https://sunshowers.io/posts/nextest-and-tokio/
      To this day I'm not aware of a better way to express what's become a set of increasingly complex state machines (the most recent improvement being to make the state machines responsive to user input). Nextest's runner loop is structured mostly like a GUI event loop, but without explicit state machines. It's quite nice being able to write code that's this complex in a bug-free manner.
  - ajross a day ago ago
    Is that ideal, though? I mean, the channel is the buffer. If you need more buffer, it should have been bigger to start with. Generally this reflects a resource exhaustion failure, which you don't handle by adding code. Fix the resource allocation.
    [-]
    - sunshowers a day ago ago
      It depends on how tolerant you are to losing messages under backpressure. In some cases at work we set a large channel size, and then panic if it's exceeded.
- leoedin a day ago ago
  It's in the example isn't it? The example is logging "No space for 5 seconds". It's just a helpful diagnostic that subtly turned into data loss.
  Maybe it's a bit contrived, but it's also the kind of code you'd sprinkle through your system in response to "nothing seems to be happening and I don't know why".
  [-]
  - sunshowers a day ago ago
    It's definitely a bit contrived, but to me it's also emblematic of the issues with async Rust.
    The note on mpsc::Sender::send losing the message on drop [1] was actually added by me [2], after I wrote the Oxide RFD on cancellations [3] that this talk is a distilled form of. So even the great folks on the Tokio project hadn't documented this particular landmine.
    [1] https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.h...
    [2] https://github.com/tokio-rs/tokio/pull/5947
    [3] https://rfd.shared.oxide.computer/rfd/0400
- AceJohnny2 a day ago ago
  > He writes
  They go by they/she
  https://sunshowers.io/about/
Panzerschrek a day ago ago
One should always keep in mind that await is always a potential return point. So avoid placing an await between two actions that must always be performed together.
[-]
- cogman10 a day ago ago
  Wait, how does this work in practice?
  Let's say my code looks like this
```
    async fn a() {
      b().await
    }

    async fn b() {
      c().await;
      d().await;
    }

    async fn c() {
    }

    async fn d() {
    }
```
  Where does an issue occur which causes `d` not to be called? Is it some sort of cancellation in c? Or some upstream action in a?
  [-]
  - cogman10 a day ago ago
    Ah, I see it now in the article. I just missed it.
    `d` not being called would happen because of actions in `a`.
    If `a` were rewritten as
```
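    // Sketch: assumes b, c, and d are changed to return Result<(), E>.
    async fn a() -> Result<((), (), ()), E> {
      tokio::try_join!(b(), c(), d())
    }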
```
    Then if `c` ends up failing in the try_join, processing of `b` will be halted and thus the `d` in `b` won't be executed.
    [-]
    - alfiedotwtf 11 hours ago ago
      Maybe I’m thick, but I’m not seeing what is the problem in your first codeblock?
      [-]
      - cogman10 5 hours ago ago
        There's nothing wrong in my first comment, it's the second that clarifies adding a `try_join` at the top of the stack can break things below (which is what I was trying to figure out in my initial comment).
        Because Rust is ultimately constructing a state machine which is run by the caller, the execution of that state machine can be interrupted or partially executed at any of the `await` points. Or more accurately the caller can simply not advance the state machine.
        So, the `try_join` macro can start work on the various functions and if any of them fail, the others are ultimately cancelled. Which can happen before those functions finish fully executing.
        This is particularly bad if there's a partial state change.
        I'm not entirely sure what that means for memory allocation.
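        The partial state change can be sketched with the standard library alone. Everything named here (`Ledger`, `transfer`, `try_join_sketch`, `YieldOnce`) is hypothetical, and `try_join_sketch` is a deliberately simplified stand-in for a try-join, not Tokio's implementation. On the memory question: dropping a future runs the destructors of the locals it was holding at that await point, so ordinary owned memory is freed; what gets left behind is any logical state the future had half-updated.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical helper: `Pending` on the first poll, `Ready` afterwards.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

#[derive(Default)]
struct Ledger {
    debited: bool,
    credited: bool,
}

type Outcome = Result<(), &'static str>;

async fn transfer(ledger: &RefCell<Ledger>) -> Outcome {
    ledger.borrow_mut().debited = true;
    YieldOnce(false).await; // cancellation can land between the two writes
    ledger.borrow_mut().credited = true;
    Ok(())
}

async fn fails_immediately() -> Outcome {
    Err("boom")
}

/// Simplified stand-in for a try-join: poll both futures in turn and return
/// on the first error, which drops whichever future is still in flight.
fn try_join_sketch(
    a: impl Future<Output = Outcome>,
    b: impl Future<Output = Outcome>,
) -> Outcome {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let (mut a, mut b) = (Box::pin(a), Box::pin(b));
    let (mut a_done, mut b_done) = (false, false);
    while !(a_done && b_done) {
        if !a_done {
            if let Poll::Ready(res) = a.as_mut().poll(&mut cx) {
                res?;
                a_done = true;
            }
        }
        if !b_done {
            if let Poll::Ready(res) = b.as_mut().poll(&mut cx) {
                res?; // early return: `a` is dropped mid-flight
                b_done = true;
            }
        }
    }
    Ok(())
}

fn partial_state() -> (bool, bool) {
    let ledger = RefCell::new(Ledger::default());
    let result = try_join_sketch(transfer(&ledger), fails_immediately());
    assert_eq!(result, Err("boom"));
    let ledger = ledger.into_inner();
    (ledger.debited, ledger.credited)
}
```

        `partial_state()` returns `(true, false)`: the debit happened, the credit never did, and the only error reported is the unrelated "boom".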
- Spivak a day ago ago
  That… seems bad? Like I guess it is what it is and you just have to deal with it but what if your "critical section" has two await calls? The code can be paused between them but it's such that it must eventually resume. Say making a change in the database and emitting an audit edit for that change. Is your only option to either not do that or put a big do not cancel sign on the function docs?
  [-]
  - setr a day ago ago
    Even if you guaranteed the calling code would always logically continue running the function till completion, you wouldn’t have the guarantee the code would actually resume — eg the program crashes between the two calls, network dies, etc.
    If you want to tie multiple actions together as an atomic unit, you need the other side to have some concept of transactions, and you need to utilize it.
  - diarrhea a day ago ago
    A DB action and audit emission have to run transactionally anyway.
    So on cancellation, the transaction times out and nothing is written. Bad but safe.
    The problem is the same on other platforms. For example, what if writing to the DB throws an exception if you’re on Python? Your app just dies, the transaction times out. Unfortunate but safe.
    If it does not run transactionally you have a problem in any execution scenario.
    [-]
    - sunshowers a day ago ago
      So, regarding transactions, absolutely you can throw them away on cancellation. But there's an interesting wrinkle here: if you use a connection pool like most users, and you were going to do the ROLLBACK at the end of your future on error, then that ROLLBACK wouldn't run if the future is cancelled! Then future operations reusing the same connection would be stuck in transaction la-la land.
      (This is related to the fact that Rust doesn't have async drop — you can't run async code on drop, other than spawning a new task to do the cleanup.)
      This is prong 3 of my cancel correctness framework (that the cancellation violates a system property, in this case a cleanup property.) The solution here is to ensure the connection is in a pristine state before handing it out the next time it's used.
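      One way to picture "ensure the connection is in a pristine state before handing it out" is the synchronous sketch below. `Conn`, `Pool`, and their methods are hypothetical types made up for the illustration; they are not the API of any real pool or database driver, and a real pool would also need to handle a failed reset.

```rust
/// Hypothetical connection that only records what was executed on it.
#[derive(Debug, Default)]
struct Conn {
    in_transaction: bool,
    log: Vec<&'static str>,
}

impl Conn {
    fn execute(&mut self, sql: &'static str) {
        match sql {
            "BEGIN" => self.in_transaction = true,
            "COMMIT" | "ROLLBACK" => self.in_transaction = false,
            _ => {}
        }
        self.log.push(sql);
    }
}

/// Hypothetical pool that cleans connections on the way out instead of
/// trusting the previous user to have cleaned up on the way in.
struct Pool {
    idle: Vec<Conn>,
}

impl Pool {
    fn checkout(&mut self) -> Option<Conn> {
        let mut conn = self.idle.pop()?;
        if conn.in_transaction {
            // The previous user was cancelled mid-transaction.
            conn.execute("ROLLBACK");
        }
        Some(conn)
    }

    fn checkin(&mut self, conn: Conn) {
        self.idle.push(conn);
    }
}

fn checkout_after_cancel() -> Vec<&'static str> {
    let mut pool = Pool { idle: vec![Conn::default()] };

    let mut conn = pool.checkout().expect("one idle connection");
    conn.execute("BEGIN");
    conn.execute("INSERT ...");
    // Pretend the future was cancelled here: no COMMIT, no ROLLBACK.
    pool.checkin(conn);

    let conn = pool.checkout().expect("connection was returned");
    assert!(!conn.in_transaction);
    conn.log
}
```

      Here the second checkout finds the dangling transaction and rolls it back, so the log reads BEGIN, INSERT ..., ROLLBACK and the next user starts clean.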
  - sunshowers a day ago ago
    In general I think people end up gravitating towards using message passing or the actor model for this.
tison a day ago ago
Rust's Future is somewhat like move semantics in C++, where you may leave a Future in an invalid state after it finishes. Besides, Rust adopts a stackless coroutine design, so you need to maintain the state in your struct if you would like to implement a poll-based async structure manually.
These are all common traps. And now cancellations in async Rust are a new complement to state management in async Rust (Futures).
When I'm developing the mea (Make Easy Async) [1] library, I document the cancel safety attribute when it's non-trivial.
Additionally, I recall [2] an instance where a thoughtless async cancellation can disrupt the IO stack.
[1] https://github.com/fast/mea
[2] https://www.reddit.com/r/rust/comments/1gfi5r1/comment/luido...
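As a rough illustration of keeping the state in your own struct, here is a hand-written future in plain std Rust. `Countdown` and `drive` are hypothetical names for this sketch and have nothing to do with the mea library; the "invalid after it finishes" point shows up as the panic in the `Finished` arm.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hand-written future: each variant is one resumption point, and all state
/// that must survive between polls lives in the enum itself.
enum Countdown {
    Running { remaining: u32 },
    Finished,
}

impl Future for Countdown {
    type Output = &'static str;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // `Countdown` holds no self-references, so it is `Unpin`.
        let this = self.get_mut();
        match *this {
            Countdown::Running { remaining: 0 } => {
                *this = Countdown::Finished;
                Poll::Ready("liftoff")
            }
            Countdown::Running { remaining } => {
                *this = Countdown::Running { remaining: remaining - 1 };
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            // Polling again after completion is a caller bug.
            Countdown::Finished => panic!("polled after completion"),
        }
    }
}

/// Polls a future in a loop and reports how many polls it took.
fn drive<F: Future + Unpin>(mut fut: F) -> (F::Output, u32) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = Pin::new(&mut fut).poll(&mut cx) {
            return (out, polls);
        }
    }
}
```

`drive(Countdown::Running { remaining: 2 })` returns `("liftoff", 3)`. If the caller stops after the first or second poll, the countdown simply stays wherever the enum says it is, which is the cancellation behaviour the thread is about.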
arifalkner a day ago ago
Great talk! One thing that would have been nice to call out for n00bs like myself is how in SOP Futures can't be cancelled. I knew that .await took ownership of the future so that drop() could not be called on it, so given how futures are lazy, it wasn't clear to me how to cancel a future after .await had been called. I later researched how select! and Abortable() did this, but it could be nice to include a callout at the beginning of your talk if you ever do it again. Otherwise, nice work!
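For readers with the same question, the idea behind an abortable wrapper can be sketched in std-only Rust. This is an assumption-laden toy, not the `futures` crate's actual `Abortable`: `AbortableSketch`, `Aborted`, `YieldOnce`, and `slow_job` are hypothetical names, and a real implementation would also wake the task when the abort is requested. The trick is that whoever awaits the wrapper owns it, while a shared flag lets someone else tell it to stop polling the inner future.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical helper: `Pending` on the first poll, `Ready` afterwards.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

#[derive(Debug, PartialEq)]
struct Aborted;

/// Toy wrapper: checks a shared flag before every poll of the inner future.
struct AbortableSketch<F> {
    inner: Pin<Box<F>>,
    aborted: Arc<AtomicBool>,
}

impl<F: Future> Future for AbortableSketch<F> {
    type Output = Result<F::Output, Aborted>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.aborted.load(Ordering::SeqCst) {
            return Poll::Ready(Err(Aborted));
        }
        self.inner.as_mut().poll(cx).map(Ok)
    }
}

async fn slow_job(finished: &Cell<bool>) {
    YieldOnce(false).await;
    finished.set(true);
}

/// Returns (wrapper reported Aborted, inner job ran to completion).
fn abort_demo() -> (bool, bool) {
    let finished = Cell::new(false);
    let aborted = Arc::new(AtomicBool::new(false));
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);

    let mut fut = AbortableSketch {
        inner: Box::pin(slow_job(&finished)),
        aborted: Arc::clone(&aborted),
    };
    assert!(Pin::new(&mut fut).poll(&mut cx).is_pending());
    aborted.store(true, Ordering::SeqCst); // someone else requests the abort
    let outcome = Pin::new(&mut fut).poll(&mut cx);
    drop(fut); // the inner future is dropped without ever finishing
    (outcome == Poll::Ready(Err(Aborted)), finished.get())
}
```

`abort_demo()` returns `(true, false)`: the awaiting side sees `Err(Aborted)`, and the inner job never reaches its last line.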
[-]
- sunshowers a day ago ago
  Thanks! What does SOP mean in this context?
bryanlarsen a day ago ago
Timely! Was grumbling about this today as I added a "this function is cancel safe" to a new function's doc comment.
I really hope we get async drop soon.
[-]
- 0x1ceb00da 18 hours ago ago
  I'm curious. Can you talk a little about that function?
  [-]
  - bryanlarsen 5 hours ago ago
    Most common scenario in article: select!. I split out a "wait for X to be ready" from "X" so that the former could be on the left side of a select arm, and the rest on the right side.
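    The split between "wait for X to be ready" and "X" can be sketched as follows, using only std. `Queue`, `HasSpace`, and `send` are hypothetical stand-ins for a bounded channel, and the "timeout" is simulated by dropping the pending future by hand. Tokio's `mpsc::Sender::reserve` serves a similar purpose for real channels.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical single-threaded bounded queue, standing in for a channel.
struct Queue {
    items: RefCell<VecDeque<String>>,
    capacity: usize,
}

/// Completes once the queue has room. It owns no data, so dropping it
/// loses nothing.
struct HasSpace<'a>(&'a Queue);

impl Future for HasSpace<'_> {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0.items.borrow().len() < self.0.capacity {
            Poll::Ready(())
        } else {
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Owns `msg` while waiting: dropping this future drops the message too.
async fn send(queue: &Queue, msg: String) {
    HasSpace(queue).await;
    queue.items.borrow_mut().push_back(msg);
}

fn full_queue() -> Queue {
    Queue {
        items: RefCell::new(VecDeque::from([String::from("old")])),
        capacity: 1,
    }
}

fn poll_once<F: Future>(fut: Pin<&mut F>) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    fut.poll(&mut cx)
}

/// The timeout fires while `send` is pending.
fn cancelled_send() -> Vec<String> {
    let queue = full_queue();
    let msg = String::from("new");
    let mut fut = Box::pin(send(&queue, msg));
    assert!(poll_once(fut.as_mut()).is_pending());
    drop(fut); // "timeout": `msg` is gone, nothing is left to retry with
    queue.items.borrow_mut().pop_front(); // consumer frees a slot, too late
    queue.items.into_inner().into()
}

/// The timeout fires while only the readiness wait is pending.
fn cancelled_wait_then_retry() -> Vec<String> {
    let queue = full_queue();
    let msg = String::from("new");
    let mut wait = HasSpace(&queue);
    assert!(poll_once(Pin::new(&mut wait)).is_pending());
    drop(wait); // "timeout": only the wait was cancelled, `msg` is still ours
    queue.items.borrow_mut().pop_front(); // consumer frees a slot
    assert!(poll_once(Pin::new(&mut HasSpace(&queue))).is_ready());
    queue.items.borrow_mut().push_back(msg); // no await between check and push
    queue.items.into_inner().into()
}
```

    `cancelled_send()` ends with an empty queue because the message died with the future, while `cancelled_wait_then_retry()` ends with `["new"]` because the caller kept ownership until there was room.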
dxxvi a day ago ago
My Rust knowledge is too low to understand. However, thank you very much for the article. I hope AI will learn it and explain to me in a few years :-)
nofriend a day ago ago
doesn't rust have raii?
[-]
- sunshowers a day ago ago
  It does, but you can only run synchronous code on drop. This is what "async drop" is supposed to handle — things like issuing ROLLBACK statements to the database on cancellation.
  It also wouldn't help when you have no valid state to restore to, as in the mutex example in the post.
  [-]
  - 0x1ceb00da 18 hours ago ago
    tokio-postgres handles this by just dispatching the "ROLLBACK" command in impl Drop and ignoring the response. https://github.com/rust-postgres/rust-postgres/blob/a7a49a90...
    Is this not enough? What could go wrong? If the network connection dies or the task is cancelled, I'm assuming the database server cleans up the connection state and does a rollback automatically.
    And adding async Drop will probably add a whole new set of footguns.
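    The "dispatch ROLLBACK in Drop and ignore the response" approach might look roughly like the std-only sketch below. It is not tokio-postgres's code: `Transaction`, `update_row`, and `YieldOnce` are hypothetical, and an unbounded std channel stands in for the connection's outgoing command queue. It shows that a synchronous Drop does run when the future holding the guard is cancelled, and also that it can only enqueue the command, not wait to learn whether it succeeded.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical helper: `Pending` on the first poll, `Ready` afterwards.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Hypothetical transaction guard.
struct Transaction {
    commands: Sender<&'static str>,
    finished: bool,
}

impl Transaction {
    fn begin(commands: Sender<&'static str>) -> Self {
        let _ = commands.send("BEGIN");
        Transaction { commands, finished: false }
    }

    fn commit(mut self) {
        let _ = self.commands.send("COMMIT");
        self.finished = true;
    }
}

impl Drop for Transaction {
    fn drop(&mut self) {
        // Drop is synchronous: it can enqueue ROLLBACK but cannot await a reply.
        if !self.finished {
            let _ = self.commands.send("ROLLBACK");
        }
    }
}

async fn update_row(commands: Sender<&'static str>) {
    let tx = Transaction::begin(commands);
    YieldOnce(false).await; // stand-in for awaiting the database
    tx.commit();
}

fn cancel_mid_transaction() -> Vec<&'static str> {
    let (sender, receiver) = channel();
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);

    let mut fut = Box::pin(update_row(sender));
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut); // cancellation: `tx` is dropped, so its Drop impl runs

    let sent: Vec<&'static str> = receiver.try_iter().collect();
    sent
}
```

    `cancel_mid_transaction()` returns BEGIN followed by ROLLBACK. Whether the ROLLBACK actually reaches the server before the connection is reused is exactly the part a synchronous Drop cannot confirm.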
    [-]
    - _davide_ 16 hours ago ago
      > What could go wrong?
      LoL, an insane amount of things. TCP connections are an illusion of safety. For the purpose of database commits, use UDP packets as a model instead; it'll be much closer to reality.
      [-]
      - 0x1ceb00da 13 hours ago ago
        > an insane amount of things
        List a couple
        > TCP connections are an illusion of safety
        Why?
CaptainOfCoit a day ago ago
Less clickbaity title: Cancellations in async Rust.
It's really not about "cancelling async Rust" which is what I expected, even if it didn't make much sense.
[-]
- sunshowers a day ago ago
  As the author of the talk/blog post, I was definitely going for a bit of a moral valence in the title, in the sense that future cancellations are very hard to reason about and what I call the least Rusty part of Rust. But it admittedly is a bit clickbaity too.
  [-]
  - acedTrex a day ago ago
    I initially skipped reading it because i thought it was another drama post about maintainers a la all the nixos stuff lately.
    [-]
    - bigiain 21 hours ago ago
      To balance the universe a bit, I read it expecting a drama post - then read it right through because it was at least as interesting as the drama post I'd expected. I also discovered Oxide through this, which looks interesting, except for the complete lack of pricing I can find on their site - which puts it in my head as probably in the "If you need to ask the price you can't afford it" category...
      [-]
      - jgord 17 hours ago ago
        I think Oxide should be renting out time on their hardware racks, as well as selling them to big orgs.
        Oxide looks to be superb engineering up and down the whole stack, and if it drives more rust code into linux all the better.
        Now that linode has been consumed by Akamai, we need an alternative.
- happytoexplain a day ago ago
  As in the pop-culture concept of cancelling? That's what you assumed the topic "cancelling async <language name>" was going to be about??
  Or am I missing context?
  [-]
  - fulafel 17 hours ago ago
    Async in Rust isn't exactly universally loved, partly because async Rust is perceived to spread progressively to Rust libraries making it less optional to use. See eg https://bitbashing.io/async-rust.html, "Async Rust Is A Bad Language"
  - hansvm a day ago ago
    That's what I assumed. I know languages have to handle cancellations in async code, but Rust has had a fair amount of drama over the years, and I assumed the title was accurate and reflected that some drama was happening.
    [-]
    - sunshowers a day ago ago
      Appreciate the feedback here — definitely don't want the title to overshadow the work itself. Will keep this in mind for next time.
  - thehamkercat a day ago ago
    they probably assumed something like some_running_async_task.cancel()
- benatkin a day ago ago
  It made sense to me, because I imagine a thread or coroutine as something that runs code as though it were interpreting something like pseudocode, whether it's doing that or not. So from my point of view an instance of async Rust is being cancelled - not the feature of the Rust project, but instances of code.
  This abstraction has served me well and facilitates stepping through code in a debugger, though I jump out of thinking it at that level when I need to think of it at a lower level.
- binary132 a day ago ago
  if only
  [-]
  - zackmorris a day ago ago
    +1 this.
    IMHO async is an anti-pattern, and probably the final straw that will prevent me from ever finishing learning Rust. Once one learns pass-by-value and copy-on-write semantics (Clojure, PHP arrays), the world starts looking like a spreadsheet instead of spaghetti code. I feel that a Rust-like language could be built with no borrow checker, simply by allocating twice the memory. Since that gets ever-less expensive, I'm just not willing to die on the hill of efficiency anymore. I predict that someday Rust will be relegated to porting scripting languages to a bare-metal runtime, but will not be recommended for new work.
    That said, I think that Rust would make a great teaching tool in an academic setting, as the epitome of imperative languages. Maybe something great will come of it, like Swift from Objective-C or Kotlin from Java. And having grown up on C++, I have a soft spot in my heart for solving the hard problems in the fastest way possible. Maybe a voxel game in Rust, I dunno.
    [-]
    - vlovich123 a day ago ago
      > Since that gets ever-less expensive,
      That kind of thinking made sense in the 90s when things followed Moore’s law. But DRAM was one of the first things to fail to keep up: https://ourworldindata.org/grapher/historical-cost-of-comput... and barely gets cheaper anymore. That’s why mobile phones still only have 16 GB of memory despite having 4 GB a decade ago.
      And there’s all sorts of problems that Rust isn’t necessarily a great fit for. But Rust’s target marketplace is where you’d otherwise use a low level language like C or C++. If you can just heap allocate everything and aggressively create copies all over the place, then why would you ever use those languages in the first place?
      And for what it’s worth Rust is finding a lot of success even replacing all the tooling in other language ecosystems like Ruby, Python, and JS precisely because the tools in those ecosystems written in the native language end up being horribly slow. And memory allocation and randomly deep copying arrays are the kinds of things that add up and make things slow (in addition to GC pauses, slow startups, interpreter costs etc).
      And you can always choose not to do async in Rust although personally I’m a huge fan as it makes it really clear where you have sprinkled in I/O in places you shouldn’t have.
      [-]
      - koito17 a day ago ago
        Before adopting Rust, I also found it silly for high-level tasks where e.g. Clojure or Java would suffice. However, the results of using Rust changed my mind.
        I used to write web backends in Clojure, and justified it with the fact that the JVM has some of the best profiling tools available (I still believe this), and the JVM itself exposes lots of knobs to not only fine-tune the GC, but even choose a GC! (This cannot be overstated; garbage collectors tend to be deeply integrated into a language's runtime, and it's amazing to me that the Java platform manages to ship several garbage collectors, each of which is optimal in its own specific situations).
        After rewriting an NLP-heavy web app in Rust, I saw massive performance gains over the original Clojure version, even though both aggressively copy data and the Rust version is full of atomic refcounts (atomic refcounting is not the fastest GC out there...)
        The binary emitted by rustc is also much smaller. ~10 MB static binary vs. GraalVM's ~80 MB native images (and longer build times, since classpath analysis and reflection scanning require a lot of work)
        What surprised me the most is how high-level Rust feels in practice. I can use pattern matching, async/await, functional programming idioms, etc., and it ends up being fast anyway. Coming from Clojure, Rust syntax trying its best to be expression-oriented is a key differentiator from other languages in its target domain (notably, C++). I sometimes miss TypeScript's anonymous enums, but Rust's type system can express a lot of runtime behavior, and it's partly why many jokingly state "if it compiles, it's likely correct". Then there's the little things, like how Rust's Futures don't immediately start in the background. In contrast, JavaScript Promises are immediately pushed to a microtask queue, so cancelling a Promise is impossible by design.
        Overall, it's the little things like this -- and the toolchain (cargo, clippy, rustfmt) -- that have kept me using Rust. I can write high-level code and still compile down to a ~5 MB binary and outperform idiomatic code in other languages I'm familiar with (e.g. Clojure, Java, and TypeScript).
        [-]
        - sunshowers a day ago ago
          Speaking personally, that is what first attracted me to Rust — that you can write high-level idiomatic code and still get roughly optimal performance.
      - rafram 20 hours ago ago
        It isn’t as dramatic a decrease as other types of storage, but $4,000 to $1,000 per terabyte in a decade is still a big drop.
    - sunshowers a day ago ago
      Author here -- I'd recommend reading my blog post about how cargo-nextest uses Tokio + async Rust to handle very complex state machines: https://sunshowers.io/posts/nextest-and-tokio/
    - wongarsu a day ago ago
      The rust ecosystem is very invested into making every library that touches the network async. But if the program you are writing doesn't touch the network you don't have to think about async. Or you can banish network code onto one thread with an async runtime, and communicate via flume queues/channels with it from normal threaded code running in another thread
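      The "banish network code onto one thread" layout might be sketched like this. To stay self-contained it uses `std::sync::mpsc` instead of flume and fakes the network call with a formatted string; `Request`, `spawn_io_thread`, and `fetch_blocking` are hypothetical names. In a real program the spawned thread would start an async runtime and do the actual I/O there.

```rust
use std::sync::mpsc;
use std::thread;

/// Messages the synchronous side sends to the I/O thread.
enum Request {
    Fetch {
        url: String,
        reply: mpsc::Sender<String>,
    },
    Shutdown,
}

/// Starts the dedicated I/O thread and returns a handle for talking to it.
fn spawn_io_thread() -> (mpsc::Sender<Request>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // Assumption: a real program would run an async runtime here and
        // perform the network request inside it.
        for request in rx {
            match request {
                Request::Fetch { url, reply } => {
                    // The caller may have given up; ignore a closed reply channel.
                    let _ = reply.send(format!("response for {url}"));
                }
                Request::Shutdown => break,
            }
        }
    });
    (tx, handle)
}

/// Plain blocking call: no async in the caller's code at all.
fn fetch_blocking(io: &mpsc::Sender<Request>, url: &str) -> String {
    let (reply, response) = mpsc::channel();
    io.send(Request::Fetch { url: url.to_string(), reply })
        .expect("I/O thread is running");
    response.recv().expect("I/O thread replied")
}
```

      A caller would use `spawn_io_thread()` once at startup, call `fetch_blocking(&io, "example")` from ordinary threaded code, and send `Request::Shutdown` before joining the handle.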
      [-]
      - bigstrat2003 a day ago ago
        > The rust ecosystem is very invested into making every library that touches the network async.
        Right, and that is one of the absolute worst things about the Rust ecosystem. Most programs don't benefit from async, and should use plain old threads because they are much easier to work with.
        [-]
        - sunshowers a day ago ago
          There is a very reasonable argument that an entire language feature shouldn't be oriented towards making high-complexity state machines easy to write, since they're relatively rare in production. But speaking purely selfishly, I'm happy I can write something like cargo-nextest using async Rust in a bug-free manner.
    - xmodem a day ago ago
      In your view, which languages / ecosystems have a better general approach for handling task cancellations than async rust?
      [-]
      - neillyons a day ago ago
        Elixir via the Task module https://hexdocs.pm/elixir/Task.html
      - GhosT078 a day ago ago
        Ada has very well thought out and proven tasking features, including clean methods of task cancellation.
    - sertsa a day ago ago
      There is a voxel game in Rust, btw: https://veloren.net/
    - airstrike a day ago ago
      This reads hella uninformed
    - dlahoda 15 hours ago ago
      lean4.
      it analyses code. if it finds raii/linearity/single-ownership, it does exactly like rust mem mgmt.
      but if it is not, it does rc.
      so it does what rust does, but automagically without polluting code.
      so cow or pbw or 2mem are not the only options to improve rust.
    - 63stack a day ago ago
      By allocating twice the memory of ...?
      [-]
      - SV_BubbleTime 19 hours ago ago
        Everything, everywhere, all the time! It’s so simple, why has no one ever thought of just increasing a finite resource!?
  - moggers123 17 hours ago ago
    I just don't really know where the ecosystem is going with async these days. I see a lot of changes in the language, many of which seem more complex than are typical for the justification, some of which have broader utility but generally wouldn't be done if it weren't for them being necessary for async... A hydra of complexity and honestly, where does it end? When will async be "solved"? What will the language look like when it is? Is it really all justified? Did we know that the road would be this long when we started it? For me, you could resolve this complexity by just letting me flip a switch and have things operate in a really basic blocking mode with no state machine or runtime and I'll sling this stuff on a thread if I have to, the way I used to. My use cases never needed this. I can get by just fine solving the circumstances where I would need something like this with a purpose-built mechanism, not a language-native feature trying to solve all the different flavors of asynchronous problems I have + a million I don't.
    Of course... it's obviously not as simple as "just give me a way to turn it off", but more importantly, I just don't see this concern being addressed by the Powers That Be. Am I just not looking hard enough? Did I miss the rust blog post titled "hey - so you didn't want to use async but the libraries that you did want to use ship with async so you're up shit creek.. Here's what our plan for that is"?
    I'm sorry. I generally lurk because I don't consider myself up to the caliber of others on this website, but nonetheless the few posts I make do end up being about async because it does make me feel quite hopeless at times. Hopefully someone can look past my ignorance/incompetence/selfishness/immaturity and tell me it's all going to be okay.
beanjuiceII a day ago ago
i am honestly glad i don't write rust anymore
nijaar a day ago ago
for a sec i thought DEI was going too far
is the title like that on purpose?