Squasher
Recently Peter Kofler wrote a blog post in which he looks at the Design of a Separable Transition-Diagram Compiler paper by Melvin E. Conway (which was published in 1963 and considered to be the first paper mentioning coroutines) and reimplements a basic coroutine in the modern assembly for Windows. Since I don’t have access to any Windows machines, Peter kindly agreed to pair up with me on porting the code from his blog post to macOS and 64-bit assembly. While doing this I ended up reading the paper and then cloning and refactoring the code to make it a bit simpler at the expense of the code not matching the paper anymore.
Below you will find my interpretation of the problem as described in the first two pages of the paper and solutions with/without coroutines in Kotlin and NASM (see this github repo for full source code and tests). It’s worth mentioning that NASM solutions and implementation of basic coroutines are based on Peter’s code.
The problem
Imagine a simple task of replacing all occurrences of **
(two asterisks) with ^
in a stream of data. For example, given ***abc
, the output should be ^*abc
. In the paper this process is called “squashing” and can be implemented in Kotlin as a one-liner:
However, there is an additional constraint which makes the problem a bit more interesting. In particular, squasher can only consume or output one character at a time. In Kotlin this can be expressed as an Iterator<Char>
which wraps another Iterator<Char>
. (In the paper the input is being read infinitely from punch cards but it’s safe to skip these details.)
Kotlin implementation
Given the constraint, a straightforward squasher implementation might look like this:
Because of the requirement to output only one char at a time, the code has additional non-essential complexity of keeping the state of the iterator between invocations in hasLastChar
and lastChar
properties. In a way, we can think about the code block guarded by hasLastChar
as a callback or even as a state machine with only two states.
This is exactly the situation when coroutines can help by “returning” values in the middle of a function and resuming from that point later. The squasher implementation below doesn’t need the hasLastChar
flag since this logic is already embedded in the control flow of the code.
I don’t think there is anything particularly revealing in the examples above but it feels nice to be able to translate such an old problem to a modern language and it, hopefully, explains the problem domain before diving into assembly code.
NASM implementation (with functions)
From this point, I’ll assume that you are familiar with x86 assembler enough to be able to get the gist of the following code snippets (which are written using NASM and can be found in this github repo). If not, you can try this intro, this guide or this article on 64-bit assembly.
For simplicity, the NASM implementation doesn’t attempt to mimic an Iterator
but rather reads a fixed size input from stdin.
The main
function is just a loop which for each input char calls squasher
and prints to stdout one character at a time:
The squasher
function uses next_char
to get the next input character and puts the output character into the rax
register before returning to main
. Similar to the Kotlin version without coroutines, squasher
has to keep global state and includes additional logic to return lastChar
.
As you may have noticed, the code above uses _call
and _ret
macros (instead of built-in call
and ret
). This is to show implementation details of simplistic function calling convention.
NASM implementation (with coroutines)
Since x86 assembler doesn’t have coroutines, let’s implement them. The idea is to reserve additional memory for each coroutine to store the address of its resuming point (initially set to the first instruction of coroutine) so when coroutine is called again, it continues from where it left off.
To pass control to a coroutine we can use the following co_call
macro. The macro expects to be called from a coroutine so it updates the resuming point of the current coroutine (in case the coroutine gets control back some time later). Then the macro looks up the resuming point of the coroutine it’s about to call and jumps to that address. No doubt, this is a toy implementation which doesn’t save/restore registers or stack when switching between coroutines but, hopefully, it’s simple enough to illustrate the concept.
This main
function is the same as the version without coroutines except for the use of co_call
macro and having to prepare the stack so the macro thinks it’s called from a coroutine.
There are more changes in squasher
though. The most interesting change is that there are no “returns” because conceptually squasher
is not any different from the main
coroutine and there is nowhere to return to, so squasher
can only call sub-functions or pass control to another coroutine. As a consequence the squasher
now is an infinite loop (see jmp squasher
instructions). Similar to the Kotlin version with coroutines, squasher
doesn’t need the hasLastChar
variable anymore because the branching is stored as a resuming point.
You can find the full source code for 64-bit Linux and macOS in this github repo.
How is that different from threads?
Fundamentally, OS kernel switching between threads has no magic and is based on the instruction pointer manipulation similar to the macro above (for example, see this stackoverflow question). The main difference compared to coroutines is that conceptually switching between threads is preemptive, i.e. the function which is being executed doesn’t decide when to switch to another function, instead, thread scheduler does it. Another difference is that thread scheduler normally doesn’t provide any guarantees about which CPU core thread will run on and, therefore, the need for “thread safety” which really is all about “multi-core safety”. (As an interesting side note there are no special assembly instructions for switching between CPU cores. Instead, several CPUs just run instructions independently with access to common shared memory. See The Unofficial SMP FAQ). Overall, even though the underlying mechanism might be similar, coroutines give you more control over context switches compared to threads.
Summary
Kotlin being a high-level language has a more sophisticated coroutine implementation, than the one shown above, with a lot of details related to other language features and type system but at the core it’s the same trick of storing the latest entry point of a coroutine and, when the coroutine is resumed, jumping to the right part of the code (you can find the implementation in CoroutineTransformerMethodVisitor which uses tableswitch instruction to do the jump).
Hopefully, this blog helps with understanding basic coroutine implementation details. For a description of coroutine flavours from the point of view of programming language user, you can try reading the following posts: