Recently Peter Kofler wrote a blog post in which he looks at the Design of a Separable Transition-Diagram Compiler paper by Melvin E. Conway (which was published in 1963 and considered to be the first paper mentioning coroutines) and reimplements a basic coroutine in the modern assembly for Windows. Since I don’t have access to any Windows machines, Peter kindly agreed to pair up with me on porting the code from his blog post to macOS and 64-bit assembly. While doing this I ended up reading the paper and then cloning and refactoring the code to make it a bit simpler at the expense of the code not matching the paper anymore.
Below you will find my interpretation of the problem as described in the first two pages of the paper and solutions with/without coroutines in Kotlin and NASM (see this github repo for full source code and tests). It’s worth mentioning that NASM solutions and implementation of basic coroutines are based on Peter’s code.
Imagine a simple task of replacing all occurrences of
** (two asterisks) with
^ in a stream of data. For example, given
***abc, the output should be
^*abc. In the paper this process is called “squashing” and can be implemented in Kotlin as a one-liner:
However, there is an additional constraint which makes the problem a bit more interesting. In particular, squasher can only consume or output one character at a time. In Kotlin this can be expressed as an
Iterator<Char> which wraps another
Iterator<Char>. (In the paper the input is being read infinitely from punch cards but it’s safe to skip these details.)
Given the constraint, a straightforward squasher implementation might look like this:
Because of the requirement to output only one char at a time, the code has additional non-essential complexity of keeping the state of the iterator between invocations in
lastChar properties. In a way, we can think about the code block guarded by
hasLastChar as a callback or even as a state machine with only two states.
This is exactly the situation when coroutines can help by “returning” values in the middle of a function and resuming from that point later. The squasher implementation below doesn’t need
hasLastChar flag since this logic is already embedded in the control flow of the code.
I don’t think there is anything particularly revealing in the examples above but it feels nice to be able to translate such an old problem to a modern language and it, hopefully, explains the problem domain before diving into assembly code.
NASM implementation (with functions)
From this point, I’ll assume that you are familiar with x86 assembler enough to be able to get the gist of the following code snippets (which are written using NASM and can be found in this github repo). If not, you can try this intro, this guide or this article on 64-bit assembly.
For simplicity, the NASM implementation doesn’t attempt to mimic an
Iterator but rather reads a fixed size input from stdin.
main function is just a loop which for each input char calls
squasher and prints to stdout one character at a time:
squasher function uses
next_char to get the next input character and puts output character into
rax register before returning to
main. Similar to the Kotlin version without coroutines,
squasher has to keep global state and includes additional logic to return
As you may have noticed, the code above uses
_ret macros (instead of built-in
ret). This is to show implementation details of simplistic function calling convention.
NASM implementation (with coroutines)
Since x86 assembler doesn’t have coroutines, let’s implement them. The idea is to reserve additional memory for each coroutine to store address of its resuming point (initially set to the first instruction of coroutine) so when coroutine is called again, it continues from the where it left off.
To pass control to a coroutine we can use the following
co_call macro. The macro expects to be called from a coroutine so it updates resuming point of the current coroutine (in case the coroutine gets control back some time later). Then the macro looks up resuming point of the coroutine it’s about to call and jumps to that address. No doubt, this is a toy implementation which doesn’t save/restore registers or stack when switching between coroutines but, hopefully, it’s simple enough to illustrate the concept.
main function is the same as the version without coroutines except for the use of
co_call macro and having to prepare stack so the macro thinks it’s called from a coroutine.
There are more changes in
squasher though. The most interesting change is that there are no “returns” because conceptually
squasher is not any different from the
main coroutine and there is nowhere to return to, so
squasher can only call sub-functions or pass control to another coroutine. As a consequence the
squasher now is an infinite loop (see
jmp squasher instructions). Similar to the Kotlin version with coroutines,
squasher doesn’t need
hasLastChar variable anymore because the branching is stored as resuming point.
You can find the full source code for 64-bit Linux and macOS in this github repo.
How is that different from threads?
Fundamentally, OS kernel switching between threads has no magic and is based on the instruction pointer manipulation similar to the macro above (for example, see this stackoverflow question). The main difference compared to coroutines is that conceptually switching between threads is preemptive, i.e. the function which is being executed doesn’t decide when to switch to another function, instead, thread scheduler does it. Another difference is that thread scheduler normally doesn’t provide any guarantees about which CPU core thread will run on and, therefore, the need for “thread safety” which really is all about “multi-core safety”. (As an interesting side note there are no special assembly instructions for switching between CPU cores. Instead, several CPUs just run instructions independently with access to common shared memory. See The Unofficial SMP FAQ). Overall, even though the underlying mechanism might be similar, coroutines give you more control over context switch compared to threads.
Kotlin being a high-level language has a more sophisticated coroutine implementation, than the one showed above, with a lot of details related to other language features and type system but at the core it’s the same trick of storing the latest entry point of a coroutine and, when the coroutine is resumed, jumping to the right part of the code (you can find the implementation in CoroutineTransformerMethodVisitor which uses tableswitch instruction to do the jump).
Hopefully, this blog helps with understanding basic coroutine implementation details. For a description of coroutine flavours from the point of view of programming language user, you can try reading the following posts: