I really enjoyed Debloat your async Rust from Tweede Golf.
One example shows that removing one .await point from an async function can reduce the size of the
generated assembly code by 11.5%.
However, I thought we could go further. Spoiler: I couldn’t write a better-optimized version!
Unoptimized version
This is the original unoptimized version, with three .await points. The generated state machine therefore has 6 states: one per .await point, plus three default ones.
```rust
pub async fn process_command() {
    match get_command().await {
        CommandId::A => send_response(123).await,
        CommandId::B => send_response(456).await,
    }
}
```
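The snippet assumes a few supporting items that are not shown here. Below is a minimal, std-only sketch of what they could look like (the CommandId enum and the get_command/send_response stubs are my guesses, not the article's real definitions), together with a tiny hand-rolled poll loop so the whole thing runs without an executor crate:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical supporting items; the real ones live in the original code.
enum CommandId {
    A,
    B,
}

async fn get_command() -> CommandId {
    CommandId::A // stub: always returns A
}

async fn send_response(response: u32) {
    println!("sent {response}"); // stub: just print
}

// Repeated from above so this block is self-contained.
pub async fn process_command() {
    match get_command().await {
        CommandId::A => send_response(123).await,
        CommandId::B => send_response(456).await,
    }
}

// A waker that does nothing: enough to poll futures that never actually yield.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Minimal "executor": poll in a loop until the future is ready.
fn drive<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    drive(process_command()); // prints "sent 123"
}
```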
Here is the output of cargo asm for a test binary built to call this function:
```text
0 "core::ops::function::FnOnce::call_once{{vtable.shim}}" [20]
1 "core::ptr::drop_in_place<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future+Output = ()>>>" [87]
2 "core::ptr::drop_in_place<debloating_async_rust_more::variant::process_command::{{closure}}>" [6]
3 "debloating_async_rust_more::common::send_response" [7]
4 "debloating_async_rust_more::main" [90]
5 "debloating_async_rust_more::variant::process_command" [8]
6 "debloating_async_rust_more::variant::process_command::{{closure}}" [186]
7 "main" [18]
8 "std::rt::lang_start" [28]
9 "std::rt::lang_start::{{closure}}" [19]
10 "std::sys::backtrace::__rust_begin_short_backtrace" [21]
```
We will use it for comparison in the next section.
First optimization: removing one .await
Each .await point creates an additional state; removing one removes one state.
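Conceptually, the compiler lowers the unoptimized function to an enum-based state machine. The sketch below is illustrative only: the variant names are invented, and the real generated type is anonymous and stores the live locals of each suspension point:

```rust
// Illustrative shape of the state machine rustc generates for the
// unoptimized process_command. Three "default" states plus one per .await.
enum ProcessCommandState {
    // the three default states
    Unresumed,          // created, never polled
    Returned,           // completed
    Panicked,           // poisoned by a panic
    // one extra state per .await point
    AwaitingGetCommand, // suspended at get_command().await
    AwaitingSendA,      // suspended at send_response(123).await
    AwaitingSendB,      // suspended at send_response(456).await
}

fn state_name(s: &ProcessCommandState) -> &'static str {
    match s {
        ProcessCommandState::Unresumed => "Unresumed",
        ProcessCommandState::Returned => "Returned",
        ProcessCommandState::Panicked => "Panicked",
        ProcessCommandState::AwaitingGetCommand => "AwaitingGetCommand",
        ProcessCommandState::AwaitingSendA => "AwaitingSendA",
        ProcessCommandState::AwaitingSendB => "AwaitingSendB",
    }
}

fn main() {
    // 6 states in total; merging the two send_response awaits into one
    // (the first optimization) would leave 5.
    println!("{}", state_name(&ProcessCommandState::Unresumed));
}
```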
Here is the optimized code:
```rust
pub async fn process_command() {
    let response = match get_command().await {
        CommandId::A => 123,
        CommandId::B => 456,
    };
    send_response(response).await;
}
```
This brings the number of states down to 5 (from 6).
Now, we can diff cargo asm’s output:
```diff
8c8
< 6 "debloating_async_rust_more::variant::process_command::{{closure}}" [186]
---
> 6 "debloating_async_rust_more::variant::process_command::{{closure}}" [148]
```
We can see that our function goes from 186 bytes to 148 bytes.
We can also look at the size of the output of cargo asm when inspecting these two closures:
| Variant | Line count |
|---|---|
| Unoptimized | 104 |
| Optimized | 82 |
Note that this is not an instruction count: it is just the length of cargo asm’s output, which includes the assembly body. I could count the instructions themselves, but I think this is enough to show that removing one .await point really does simplify the generated code.
Attempting a final optimization
Example from the original article
In the original article, we see that we can sometimes remove async
and make the function return a wrapped Future instead.
Here is the example from the article:
Before:
```rust
async fn foo() -> i32 {
    let a = quux();
    let num = bar(a).await;
    num * 2
}
```
After:
```rust
use futures::future::FutureExt;

fn foo() -> impl Future<Output = i32> {
    let a = quux();
    bar(a).map(|num| num * 2)
}
```
As you can see, we are using .map from the FutureExt trait, which wraps the future returned by bar.
The non-obvious side effect is that in the optimized version, quux is eagerly evaluated (i.e. it runs when foo is called, not when the returned future is first polled).
This is more efficient because, instead of generating a whole new state machine, we wrap the future’s poll implementation so that the |num| num * 2 closure is called on its output.
Under the hood, .map() creates a futures::future::Map, which implements Future with a simpler poll implementation: it polls the inner future and transforms its output with the closure passed by the user, in our case |num| num * 2.
Here is the source code of futures::future::Map::poll:
```rust
impl<Fut, F, T> Future for Map<Fut, F>
where
    Fut: Future,
    F: FnOnce1<Fut::Output, Output = T>,
{
    type Output = T;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        match self.as_mut().project() {
            MapProj::Incomplete { future, .. } => {
                let output = ready!(future.poll(cx));
                match self.project_replace(Self::Complete) {
                    MapProjReplace::Incomplete { f, .. } => Poll::Ready(f.call_once(output)),
                    MapProjReplace::Complete => unreachable!(),
                }
            }
            MapProj::Complete => {
                panic!("Map must not be polled after it returned `Poll::Ready`")
            }
        }
    }
}
```
(link)
It looks a bit complicated, but the main parts are:
- resolving the wrapped future: `let output = ready!(future.poll(cx))`
- calling the closure on its output and returning it: `Poll::Ready(f.call_once(output))`
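To see the mechanics without the pin_project machinery, here is a stripped-down, std-only analogue of Map. This is my own simplified version, not the real futures code: it requires Unpin to sidestep pinning, whereas the real Map also handles !Unpin futures:

```rust
use std::future::Future;
use std::mem;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Simplified Map: hold the inner future and the closure until the
// inner future resolves, then apply the closure exactly once.
enum MiniMap<Fut, F> {
    Incomplete { future: Fut, f: F },
    Complete,
}

impl<Fut, F, T> Future for MiniMap<Fut, F>
where
    Fut: Future + Unpin,
    F: FnOnce(Fut::Output) -> T + Unpin,
{
    type Output = T;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        // Move the current state out, leaving Complete in its place.
        match mem::replace(&mut *self, MiniMap::Complete) {
            MiniMap::Incomplete { mut future, f } => {
                match Pin::new(&mut future).poll(cx) {
                    // Inner future done: run the closure on its output.
                    Poll::Ready(output) => Poll::Ready(f(output)),
                    // Not done yet: put the state back and stay pending.
                    Poll::Pending => {
                        *self = MiniMap::Incomplete { future, f };
                        Poll::Pending
                    }
                }
            }
            MiniMap::Complete => panic!("MiniMap polled after completion"),
        }
    }
}

// No-op waker so we can poll by hand.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let fut = MiniMap::Incomplete {
        future: std::future::ready(21),
        f: |n: i32| n * 2,
    };
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = std::pin::pin!(fut);
    // Equivalent of ready(21).map(|n| n * 2): resolves to 42 on first poll.
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Ready(42));
}
```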
Applying it to our code
Now that we understand this optimization in depth, we can continue optimizing the original function from this article.
Before
```rust
pub async fn process_command() {
    let response = match get_command().await {
        CommandId::A => 123,
        CommandId::B => 456,
    };
    send_response(response).await;
}
```
After
```rust
use futures::future::FutureExt;

pub fn process_command() -> impl Future<Output = ()> {
    get_command().then(|cmd| send_response(match cmd {
        CommandId::A => 123,
        CommandId::B => 456,
    }))
}
```
What’s happening here?
Similarly to the last example, we are:
- removing the `async` keyword
- explicitly stating that we return a `Future` with `impl Future<Output = ()>`
- using a `FutureExt` helper (`then`) to wrap the initial future
If we strip the indices from cargo asm’s output for both the first optimization and this final one, sort them, and diff, we get:
```diff
< "core::ptr::drop_in_place<debloating_async_rust_more::variant::process_command::{{closure}}>" [6]
< "debloating_async_rust_more::common::send_response" [7]
---
> "core::ptr::drop_in_place<futures_util::future::future::Then<debloating_async_rust_more::variant::get_command::{{closure}},debloating_async_rust_more::variant::send_response::{{closure}},debloating_async_rust_more::variant::process_command::{{closure}}>>" [6]
6,7c5,7
< "debloating_async_rust_more::variant::process_command" [8]
< "debloating_async_rust_more::variant::process_command::{{closure}}" [148]
---
> "debloating_async_rust_more::variant::process_command" [10]
> "debloating_async_rust_more::variant::send_response" [7]
> "<futures_util::future::future::Then<Fut1,Fut2,F> as core::future::future::Future>::poll" [132]
```
We save 14 bytes in total, although the binary is arguably a little more complex now that futures_util::future::future::Then<xxx>::poll is pulled in. When it comes to pure code size, then, this final optimized version wins.
However, what about raw performance? I wrote a quick benchmark to compare the three versions, and here are the results:
| Variant | Time per iter | vs unoptimized |
|---|---|---|
| Unoptimized | 10.1 ns | — |
| Optimized v1 | 6.3 ns | −38% |
| Optimized v2 | 8.7 ns | −14% |
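The benchmark itself isn’t reproduced here, but its shape can be sketched with std only: drive each variant to completion with a no-op waker and time a fixed number of iterations. This is not the actual harness, and get_command/send_response are stubbed, so absolute numbers will differ from the table:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
use std::time::Instant;

enum CommandId {
    A,
    B,
}

// Stubbed dependencies (hypothetical; the real ones do more work).
async fn get_command() -> CommandId {
    CommandId::A
}
async fn send_response(_response: u32) {}

// Unoptimized variant shown for illustration; v1 and v2 would be
// timed with the same loop.
async fn process_command() {
    match get_command().await {
        CommandId::A => send_response(123).await,
        CommandId::B => send_response(456).await,
    }
}

fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Poll to completion with the no-op waker.
fn drive<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    const ITERS: u32 = 1_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        drive(process_command());
    }
    println!("{:?} per iter", start.elapsed() / ITERS);
}
```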
As you can see, the plain async version with two .await points wins!
Apparently, Then introduces noticeable overhead: it makes the code a little smaller, but in the end it is not faster.
Lessons learned
futures::future::FutureExt::then is not faster than an async state machine with two .await points.
I think the main lesson is that it’s important to run benchmarks rather than blindly follow best practices.
I wasn’t sure whether to publish this, since the results are not positive, but I figured it could be useful to someone else.