kaashif's blog

Programming, with some mathematics on the side

macroexpand-1 for C++ Coroutines

2024-07-27

I went to a talk recently about C++ coroutines, and I don't think it was very good. The talk went through some examples of C++ coroutines and had a surface-level handwavy explanation of what "really happens" when compilers see a coroutine.

But a non-handwavy explanation is really easy - you can just look at what the compiler does to coroutine code to see what's really happening. No analogy, no handwave, just looking at real code.

How do we do that without looking at LLVM IR or something? Easy - compile the binary then decompile it into normal C++ and see what it looks like! So let's do it!

The title is a reference to macroexpand-1 from Lisp. Looking at the source code resulting from a coroutine kind of reminds me of expanding a macro in Lisp.

This post is not intended to actually be accessible to beginners or readable for anyone, but it does illustrate an approach to demystifying coroutines that I like.

Minimal coroutine example

A minimal coroutine example needs to demonstrate at least suspending and resuming execution. It's not interesting if we just have a single co_return and it's all optimized away.

Here's my example in its entirety. First, the coroutine type.

#include <iostream>
#include <coroutine>

struct Coroutine {
    struct promise_type;
    std::coroutine_handle<promise_type> handle;

    Coroutine(std::coroutine_handle<promise_type>&& x) : handle(x) {}

    struct promise_type {
        Coroutine get_return_object() { return Coroutine{std::coroutine_handle<promise_type>::from_promise(*this)}; }
        void unhandled_exception() noexcept {}
        void return_void() noexcept { }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
    };
};

This post is not a coroutine tutorial, so I won't explain this code in depth. The point of this post is to look at the real code you get. A few points:

  • The coroutine_handle is what you can call resume() on to resume the execution of the coroutine. It refers to the coroutine's saved state, which must record where the coroutine was when it was suspended.

  • The other methods like return_void are just required by the standard and the compiler, but we don't really do anything interesting in them.

Here's the coroutine itself:

Coroutine test_coroutine() {
    std::cout << "started!\n";
    co_await std::suspend_always{};
    std::cout << "returning!\n";
    co_return;
}

Very simple conceptually - when we call the coroutine, we print something, suspend, then when we resume we print something else and return.

suspend_always is an awaitable whose await_ready always returns false, so using co_await on it always suspends the coroutine.

Finally, main:

int main() {
    auto coro = test_coroutine();
    std::cout << "main\n";
    coro.handle.resume();
    return 0;
}

Again, very simple - call the coroutine, it suspends, we print something to demonstrate that the coroutine really was suspended in the middle, then we resume it.

Decompiling

This is the point at which someone might be tempted to wax lyrical about state machines, pseudocode, draw an analogy, etc etc. No.

Let's compile the above and feed it into https://dogbolt.org/ which lets you run various decompilers on any executable you upload. I'll do that and walk through the nicest looking output I can find.

Compile and run the example:

$ g++ -o a.out -std=c++20 coro.cpp
$ ./a.out
started!
main
returning!

Perfect! It works. Let's decompile it. Upload a.out to https://dogbolt.org/ to follow along. I looked at the output of all of the decompilers and I think dewolf is the most informative and readable for this particular case.

Dewolf can be found here: https://github.com/fkie-cad/dewolf.

What's really happening?

Now we can walk through the decompiled code, which looks like C-ish code and not C++20 coroutine code. You'll notice some name mangling but it's actually very readable!

Let's start at main:

int main() {
    int var_0;
    long var_3;
    long var_4;
    var_4 = test_coroutine(/* frame_ptr */ var_0);
    std::operator<<<std::char_traits<char>_>(/* __out */ std::cout, /* __s */ "main\n");
    var_3 = var_4;
    std::__n4861::coroutine_handle<Coroutine::promise_type>::resume(/* this */ &var_3);
    return 0;
}

Something small and hard to notice happened here - where we wrote test_coroutine to take no arguments, here it appears to take an argument frame_ptr.

This hints at how coroutines work - the compiler rewrites test_coroutine around a coroutine frame. The argument itself looks like a decompiler artifact: main passes an uninitialized variable, and the decompiled test_coroutine below never reads frame_ptr but allocates the frame itself. In this coroutine frame, we keep track of where the coroutine was when it was suspended, and local variables. This allows us to resume the coroutine with the same state, from the same place as when it was suspended.

Let's look at test_coroutine:

Coroutine test_coroutine(_Z14test_coroutinev.Frame * frame_ptr) {
    long(void *) ** var_0;
    var_0 = operator_new(/* sz */ 40UL);
    *(var_0 + 34L) = 0x1;
    *var_0 = test_coroutine_actor;
    *(var_0 + 8L) = test_coroutine_destroy;
    *(var_0 + 32L) = 0x0;
    test_coroutine_actor(var_0);
    return Coroutine::promise_type::get_return_object(/* this */ var_0 + 16L);
}

This was rewritten significantly. Notice that no calls to std::cout appear here. The real work has been moved to test_coroutine_actor.

What's left in test_coroutine is just setting up the coroutine frame with:

  • A pointer to the function that does the real work, test_coroutine_actor.

  • A pointer to the cleanup function, test_coroutine_destroy.

  • The initial state of the coroutine, 0x0.
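Putting the offsets from the decompiled code together, the 40-byte frame might look roughly like the struct below. This is a guess, not what the compiler calls anything: GuessedFrame and all of its field names are made up, only the offsets come from the decompiler output, and the sketch assumes a 64-bit target with 8-byte pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reconstruction of the coroutine frame. Field names are guesses;
// the offsets are the ones that appear in the decompiled code.
struct GuessedFrame {
    void (*resume_fn)(GuessedFrame*);   // offset 0:  set to test_coroutine_actor
    void (*destroy_fn)(GuessedFrame*);  // offset 8:  set to test_coroutine_destroy
    unsigned char promise;              // offset 16: the empty promise_type
    void* self_handle;                  // offset 24: looks like a handle to the frame itself
    std::uint16_t resume_index;         // offset 32: the state, initially 0
    bool frame_needs_free;              // offset 34: set to 1, checked before operator_delete
    bool initial_await_resumed;         // offset 35: set to 0, then 1 in case 2
    unsigned char initial_awaiter;      // offset 36: suspend_never from initial_suspend
    unsigned char body_awaiter;         // offset 37: suspend_always from the co_await
    unsigned char final_awaiter;        // offset 38: suspend_never from final_suspend
};

// These only hold under the 64-bit assumption stated above.
static_assert(sizeof(void*) == 8, "sketch assumes 8-byte pointers");
static_assert(offsetof(GuessedFrame, promise) == 16, "promise follows the two function pointers");
static_assert(offsetof(GuessedFrame, resume_index) == 32, "state index lives at offset 32");
static_assert(sizeof(GuessedFrame) == 40, "matches operator_new(40)");
```

The empty structs (the promise and the three awaiters) still take up a byte each, which is why the frame ends up holding more fields than you might expect from such a small coroutine.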

We then call test_coroutine_actor, which is where the real work is. This is where the handwaving about a state machine ends and we can actually look at the real state machine the compiler gives us.

long test_coroutine_actor(void * arg1) {
    long var_9;
    void * var_0;
    void * var_1;
    void * var_2;
    void * var_3;
    void * var_4;
    void * var_5;
    var_1 = arg1 + 32L;
    if ((*var_1 & 1) == 0) {
        var_2 = arg1 + 38L;
        if (((unsigned short)*var_1 <= 6) && ((unsigned short)*var_1 != 6)) {
            var_3 = arg1 + 37L;
            var_4 = arg1 + 16L;
        }
        if (((unsigned short)*var_1 <= 4) && ((unsigned short)*var_1 != 4)) {
            var_0 = arg1 + 24L;
            var_2 = arg1 + 35L;
            var_5 = arg1 + 36L;
        }
        switch((unsigned short) *(var_1)) {
        case 0:
            *var_0 = data_0x1904(/* __a */ arg1);
            *var_2 = 0x0;
            Coroutine::promise_type::initial_suspend(/* this */ var_4);
            std::__n4861::suspend_never::await_ready(/* this */ var_5);
        case 2:
            *var_2 = 0x1;
            std::__n4861::suspend_never::await_resume(/* this */ var_5);
            std::operator<<<std::char_traits<char>_>(/* __out */ std::cout, /* __s */ "started!\n");
            std::__n4861::suspend_always::await_ready(/* this */ var_3);
            *var_1 = 0x4;
            data_0x18de(/* this */ var_0);
            std::__n4861::suspend_always::await_suspend(/* this */ var_3);
            return sub_158e(&var_9);
            break;
        case 4:
            std::__n4861::suspend_always::await_resume(/* this */ var_3);
            std::operator<<<std::char_traits<char>_>(/* __out */ std::cout, /* __s */ "returning!\n");
            Coroutine::promise_type::return_void(/* this */ var_4);
            *arg1 = 0x0;
            Coroutine::promise_type::final_suspend(/* this */ var_4);
            std::__n4861::suspend_never::await_ready(/* this */ var_2);
            break;
        }
        std::__n4861::suspend_never::await_resume(/* this */ var_2);
    }
    if (((*var_1 & 1) == 0) || ((unsigned short)*var_1 == 7) || ((unsigned short)*var_1 == 1) || ((unsigned short)*var_1 == 5) || ((unsigned short)*var_1 == 3)) {
        if ((unsigned char)*(arg1 + 34L) != 0) {
            operator_delete(/* ptr */ arg1);
        }
        arg1 = var_0 + 40L;
        return *arg1 - *arg1;
    }
}

This looks exactly like a standard switch/case state machine you might find in any C or C++ codebase, except with really poorly named variables.

We can see that the frame pointer arg1 has a member at arg1 + 32L (aliased as var_1) which indicates the point the coroutine has reached. Initially it's 0 so we execute the first case.

Case by case:

  • 0: When writing the coroutine class, we set initial_suspend to return suspend_never - calling await_ready on that returns true and thus we don't suspend in the first case. No break means we fall through.

  • 2: suspend_never does nothing on resume, we print started!, then we need to construct a coroutine handle to return when we suspend. This is critical - saving our state is what lets us resume. Our handle really only needs to keep track of one thing - where to resume. We want to resume at 4, which is the next state. That's what *var_1 = 0x4 saves.

    We then return the coroutine handle.

    This is where the first call to test_coroutine_actor ends.

  • 4: When main calls handle.resume, handle.resume calls test_coroutine_actor with the saved frame from earlier, with state 4. That means the switch/case skips straight to case 4 and we print returning!. Next are a few lines of uninteresting cleanup.

The compiler isn't doing anything complex or clever here, it's just transforming your coroutine that uses co_* into a function that takes a state argument and has a switch/case.
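For comparison, here is a minimal hand-written version of the same shape, with no co_* keywords at all. It is only a sketch: HandFrame, HandHandle, hand_actor, start and run_hand_written are names invented here, the frame lives on the caller's stack instead of being allocated with operator new, and output goes to a string stream instead of std::cout so it can be checked.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical hand-written frame: a resume function pointer plus the state.
struct HandFrame {
    void (*resume_fn)(HandFrame*);
    int state;
    bool done;
    std::ostringstream* out;  // stands in for std::cout
};

// Plays the role of test_coroutine_actor: one switch over the saved state.
void hand_actor(HandFrame* frame) {
    switch (frame->state) {
    case 0:
        *frame->out << "started!\n";
        frame->state = 4;  // remember where to resume, like *var_1 = 0x4
        return;            // "suspend": hand control back to the caller
    case 4:
        *frame->out << "returning!\n";
        frame->done = true;
        return;
    }
}

// Plays the role of coroutine_handle: just a pointer to the frame.
struct HandHandle {
    HandFrame* frame;
    void resume() const { frame->resume_fn(frame); }
};

// Plays the role of test_coroutine: set up the frame, then run to the first suspend.
HandHandle start(HandFrame& frame, std::ostringstream& out) {
    frame = HandFrame{&hand_actor, 0, false, &out};
    hand_actor(&frame);
    return HandHandle{&frame};
}

// Plays the role of main, returning the output instead of printing it.
std::string run_hand_written() {
    std::ostringstream out;
    HandFrame frame{};
    HandHandle handle = start(frame, out);
    out << "main\n";
    handle.resume();
    return out.str();
}
```

run_hand_written() returns started!, main, returning! on three lines, the same order the real coroutine prints. The state values 0 and 4 are copied from the decompiled switch purely to make the correspondence obvious; any distinct values would do.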

Not magic at all!

Conclusion

Learning always benefits from motivation. When writing a state machine by hand to e.g. parse input or do something while waiting for non-blocking I/O, programmers often want something like language support for coroutines.

Problems with coroutines aside, I think a good coroutine talk would go something like this:

Start with the problem (whatever it is), show a pre-C++20 solution, then show a C++20 solution, and finally show that C++20 coroutines are actually totally equivalent to something you could write yourself - coroutines are just syntactic sugar.

You can obviously write coroutines and use non-blocking I/O or state machines even in C++98! It's just easier (sometimes) in C++20.

Lots of coroutine talks do look like this, but some are beset by padding and nonsense that expand 20 minutes of content into an hour.

Again, no magic here. Don't handwave. You don't need to be a compiler engineer to understand this stuff and claiming otherwise is disingenuous. If you handwave and can't answer deeper questions, I'll lose respect for you. If half of your talk is filler but you can answer questions, that's fine but it's annoying.