Recently Peter Kofler wrote a blog post in which he looks at the Design of a Separable Transition-Diagram Compiler paper by Melvin E. Conway (which was published in 1963 and considered to be the first paper mentioning coroutines) and reimplements a basic coroutine in the modern assembly for Windows. Since I don’t have access to any Windows machines, Peter kindly agreed to pair up with me on porting the code from his blog post to macOS and 64-bit assembly. While doing this I ended up reading the paper and then cloning and refactoring the code to make it a bit simpler at the expense of the code not matching the paper anymore.

Below you will find my interpretation of the problem as described in the first two pages of the paper and solutions with/without coroutines in Kotlin and NASM (see this github repo for full source code and tests). It’s worth mentioning that NASM solutions and implementation of basic coroutines are based on Peter’s code.

The problem

Imagine a simple task of replacing all occurrences of ** (two asterisks) with ^ in a stream of data. For example, given ***abc, the output should be ^*abc. In the paper this process is called “squashing” and can be implemented in Kotlin as a one-liner:

fun String.squash() = this.replace("**", "^")

However, there is an additional constraint which makes the problem a bit more interesting. In particular, squasher can only consume or output one character at a time. In Kotlin this can be expressed as an Iterator<Char> which wraps another Iterator<Char>. (In the paper the input is being read infinitely from punch cards but it’s safe to skip these details.)

Kotlin implementation

Given the constraint, a straightforward squasher implementation might look like this:

fun squasher(input: Iterator<Char>) = object : Iterator<Char> { var hasLastChar = false var lastChar = 0.toChar() override fun next() = if (hasLastChar) { hasLastChar = false lastChar } else { var char = if (char == '*') { lastChar = if (lastChar == '*') char = '^' else hasLastChar = true } char } override fun hasNext() = input.hasNext() || hasLastChar }

Because of the requirement to output only one char at a time, the code has additional non-essential complexity of keeping the state of the iterator between invocations in hasLastChar and lastChar properties. In a way, we can think about the code block guarded by hasLastChar as a callback or even as a state machine with only two states.

This is exactly the situation when coroutines can help by “returning” values in the middle of a function and resuming from that point later. The squasher implementation below doesn’t need the hasLastChar flag since this logic is already embedded in the control flow of the code.

fun squasher(input: Iterator<Char>): Iterator<Char> = sequence { while (input.hasNext()) { val char = if (char != '*') yield(char) else { val lastChar = if (lastChar == '*') { yield('^') } else { yield(char) yield(lastChar) } } } }.iterator()

I don’t think there is anything particularly revealing in the examples above but it feels nice to be able to translate such an old problem to a modern language and it, hopefully, explains the problem domain before diving into assembly code.

NASM implementation (with functions)

From this point, I’ll assume that you are familiar with x86 assembler enough to be able to get the gist of the following code snippets (which are written using NASM and can be found in this github repo). If not, you can try this intro, this guide or this article on 64-bit assembly.

For simplicity, the NASM implementation doesn’t attempt to mimic an Iterator but rather reads a fixed size input from stdin. The main function is just a loop which for each input char calls squasher and prints to stdout one character at a time:

main: _call read_input mov qword [hasLastChar], FALSE .loop: _call squasher ; squash char from input and put result into rax _call write ; print char from rax mov rax, [i] cmp rax, INPUT_SIZE jne .loop ; jump back if i != INPUT_SIZE

The squasher function uses next_char to get the next input character and puts the output character into the rax register before returning to main. Similar to the Kotlin version without coroutines, squasher has to keep global state and includes additional logic to return lastChar.

squasher: mov rax, [hasLastChar] cmp rax, FALSE je .no_lastChar mov qword [hasLastChar], FALSE mov rax, [lastChar] _ret .no_lastChar: _call next_char ; puts return value into rax cmp rax, '*' je .check_second_asterisk _ret .check_second_asterisk: mov rbx, rax ; temporary save first char to rbx _call next_char ; puts return value into rax cmp rax, '*' je .do_squashing mov [lastChar], rax mov qword [hasLastChar], TRUE ; remember to write lastChar next time mov rax, rbx ; load first char from rbx _ret .do_squashing: mov rax, '^' _ret

As you may have noticed, the code above uses _call and _ret macros (instead of built-in call and ret). This is to show implementation details of simplistic function calling convention.

%macro _call 1 mov rdx, %%_end push rdx ; save returning point on stack jmp %1 ; call sub-function %%_end: nop ; returning point %endmacro %macro _ret 0 pop rdx ; load returning point from stack jmp rdx ; jump back into caller function %endmacro

NASM implementation (with coroutines)

Since x86 assembler doesn’t have coroutines, let’s implement them. The idea is to reserve additional memory for each coroutine to store the address of its resuming point (initially set to the first instruction of coroutine) so when coroutine is called again, it continues from where it left off.

section .data instruction_at_main: dq main ; stores resuming point of main instruction_at_squasher: dq squasher ; stores resuming point of squasher

To pass control to a coroutine we can use the following co_call macro. The macro expects to be called from a coroutine so it updates the resuming point of the current coroutine (in case the coroutine gets control back some time later). Then the macro looks up the resuming point of the coroutine it’s about to call and jumps to that address. No doubt, this is a toy implementation which doesn’t save/restore registers or stack when switching between coroutines but, hopefully, it’s simple enough to illustrate the concept.

%macro co_call 1 ; update resuming point of the current coroutine pop rdx ; load address of the store mov rcx, %%_end ; mov [rdx], rcx ; update resuming point in the store ; pass control to the next coroutine mov rdx, instruction_at_%1 ; get address of the store push rdx ; save address of the store for later jmp [rdx] ; resume the next coroutine %%_end: nop ; resuming point %endmacro

This main function is the same as the version without coroutines except for the use of co_call macro and having to prepare the stack so the macro thinks it’s called from a coroutine.

main: call read_input mov rax, instruction_at_main push rax ; prepare stack for coroutine call .loop: co_call squasher ; squash char from input and put result into rax call write ; print char from rax mov rax, [i] cmp rax, INPUT_SIZE jne .loop ; jump back if i != INPUT_SIZE

There are more changes in squasher though. The most interesting change is that there are no “returns” because conceptually squasher is not any different from the main coroutine and there is nowhere to return to, so squasher can only call sub-functions or pass control to another coroutine. As a consequence the squasher now is an infinite loop (see jmp squasher instructions). Similar to the Kotlin version with coroutines, squasher doesn’t need the hasLastChar variable anymore because the branching is stored as a resuming point.

squasher: call next_char cmp rax, '*' je .check_second_asterisk co_call main jmp squasher .check_second_asterisk: mov rbx, rax ; temporary save first char to rbx call next_char cmp rax, '*' je .do_squashing mov [lastChar], rax ; save rax so its not erased by main mov rax, rbx ; load first char from rbx co_call main mov rax, [lastChar] co_call main jmp squasher .do_squashing: mov rax, '^' co_call main jmp squasher

You can find the full source code for 64-bit Linux and macOS in this github repo.

How is that different from threads?

Fundamentally, OS kernel switching between threads has no magic and is based on the instruction pointer manipulation similar to the macro above (for example, see this stackoverflow question). The main difference compared to coroutines is that conceptually switching between threads is preemptive, i.e. the function which is being executed doesn’t decide when to switch to another function, instead, thread scheduler does it. Another difference is that thread scheduler normally doesn’t provide any guarantees about which CPU core thread will run on and, therefore, the need for “thread safety” which really is all about “multi-core safety”. (As an interesting side note there are no special assembly instructions for switching between CPU cores. Instead, several CPUs just run instructions independently with access to common shared memory. See The Unofficial SMP FAQ). Overall, even though the underlying mechanism might be similar, coroutines give you more control over context switches compared to threads.


Kotlin being a high-level language has a more sophisticated coroutine implementation, than the one shown above, with a lot of details related to other language features and type system but at the core it’s the same trick of storing the latest entry point of a coroutine and, when the coroutine is resumed, jumping to the right part of the code (you can find the implementation in CoroutineTransformerMethodVisitor which uses tableswitch instruction to do the jump).

Hopefully, this blog helps with understanding basic coroutine implementation details. For a description of coroutine flavours from the point of view of programming language user, you can try reading the following posts:

  1. coroutines as threads
  2. yielding generators
  3. async await
  4. call with current continuation