Advanced Systems Programming H (2021-2022)

Lecture 8: Coroutines and Asynchronous Programming

Lecture 8 discusses coroutines and asynchronous programming. The limitations of multithreaded applications with blocking I/O are reviewed, and the use of asynchronous functions (coroutines) to multiplex several concurrent non-blocking I/O operations onto a single thread is discussed. The resulting asynchronous programming model is then described, along with its benefits and limitations.

Part 1: Motivation

The first part of the lecture discusses the motivation for using coroutines and asynchronous programming. It reviews the problems caused by blocking I/O operations, how these lead to the use of multi-threaded code, and how they affect the structure of programs.

Slides for part 1


00:00:00.366 In this lecture,

00:00:01.300 I’ll move on from talking about concurrency,

00:00:03.466 and talk about something closely related,

00:00:05.566 which is coroutines and asynchronous programming.


00:00:09.566 In this part, I’ll start by talking

00:00:11.400 about why we might want asynchronous programming,

00:00:13.633 talk about some of the motivation.


00:00:15.966 In the second part, I’ll talk about

00:00:17.433 the idea of coroutines, and how they're

00:00:19.466 implemented in terms of the async and

00:00:21.433 await primitives, and finally, in the last

00:00:24.066 part of the lecture, I’ll talk about

00:00:25.533 some design patterns for asynchronous code.


00:00:29.866 So what's the motivation? Why do we want coroutines?

00:00:34.100 Why do we want asynchronous code?


00:00:36.533 It’s all to do with overlapping I/O

00:00:38.833 and computation, it’s all to do with

00:00:40.566 avoiding multi-threading, and

00:00:43.200 building non-blocking alternatives to the I/O primitives.


00:00:49.566 If we look at regular I/O code,

00:00:53.633 you tend to have functions which look

00:00:56.066 like the example we have on the

00:00:58.300 top of the slide; the read_exact() function.


00:01:02.000 What this function is doing, is reading

00:01:04.166 an exact amount of data from some

00:01:06.466 input stream, and storing it in a buffer.


00:01:09.466 For example, it might be reading from

00:01:11.166 a file, or it might be reading from a socket.


00:01:14.233 If it's reading from a TCP socket, for example,

00:01:18.300 then it's possible, depending on the


00:01:22.200 congestion control, depending on what the sender

00:01:24.200 is doing, it’s possible that the socket

00:01:26.100 doesn't have the requested amount of data

00:01:27.766 available to read.


00:01:31.000 And so this function has to work

00:01:32.633 in a loop, repeatedly reading from the

00:01:35.733 socket until it's received the amount of

00:01:37.766 data that's been requested.


00:01:42.300 And this works, and it's simple and it's straightforward.
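
The read_exact() function on the slide isn't reproduced in the transcript, but the loop it describes can be sketched in Python (read_exact here is an illustrative stand-in for the slide's Rust function, not the lecture's actual code):

```python
import socket

def read_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because a single recv() may
    return fewer bytes than requested (a "short read")."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))   # may block for a long time
        if not chunk:
            raise EOFError("stream closed before enough data arrived")
        buf.extend(chunk)
    return bytes(buf)
```

Each recv() call may return less than was asked for, which is why the loop is needed; and each call may block, which is the problem discussed next.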


00:01:46.866 The problem with this type of code, though, is twofold.


00:01:52.266 Firstly, the I/O operations can be very slow.

00:01:56.033 The function can block for a

00:01:58.733 long time, while reading from the file descriptor.


00:02:03.166 And that could just be because the

00:02:04.633 disk is slow, or it could be

00:02:06.166 because the network is slow, or it

00:02:08.066 could be because it's blocked because it's

00:02:09.766 reading from the network and there's nothing

00:02:11.600 available to read and it has to

00:02:13.133 wait for the sender to send more data.


00:02:16.333 So calls to the read() function can

00:02:18.033 take many millions of cycles. They can

00:02:20.600 be incredibly slow.


00:02:24.333 The other issue here is that when

00:02:27.566 the read() function is called, it blocks the thread.


00:02:31.833 The thread of execution stops while waiting

00:02:35.166 for that period of time, while the

00:02:37.666 read() call is completing.


00:02:39.766 And this prevents other computations from running,

00:02:42.700 and if the thread is also part

00:02:45.000 of your user interface, it disrupts the user experience.


00:02:48.966 So, ideally, we want to be able

00:02:50.633 to overlap the I/O and the computation.

00:02:52.966 We want to be able to run them concurrently.


00:02:56.233 Ideally, we want to be able to allow

00:02:57.933 multiple concurrent I/O operations.


00:03:02.966 The way this has traditionally been solved,

00:03:05.966 of course, is by using multiple threads.


00:03:08.533 The usual solution to this, is to

00:03:10.800 move the blocking I/O operations out of

00:03:13.400 the main thread, and put them into

00:03:15.400 a separate thread which does the I/O,

00:03:17.500 and then reports back once the data is available.


00:03:22.066 And, in Rust, the way you might do this

00:03:25.066 is outlined in the code on the

00:03:27.833 slide. You create a channel, and you

00:03:30.800 spawn a thread to perform I/O.

00:03:33.133 And that thread then sends the results

00:03:34.733 back down the channel, and the rest

00:03:36.833 of the program continues, and overlaps with

00:03:39.500 the execution of the I/O operation.
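
The channel-and-thread pattern described here can be illustrated in Python, with queue.Queue standing in for the Rust channel (an illustrative sketch, not the slide's code; spawn_reader is a hypothetical name):

```python
import queue
import socket
import threading

def spawn_reader(sock: socket.socket, n: int) -> queue.Queue:
    """Move blocking I/O onto a separate thread; the result comes
    back over a channel once the data is available."""
    chan: queue.Queue = queue.Queue()

    def worker() -> None:
        buf = bytearray()
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))   # blocks, but only this thread
            if not chunk:
                break
            buf.extend(chunk)
        chan.put(bytes(buf))                  # send the result down the channel

    threading.Thread(target=worker, daemon=True).start()
    return chan
```

The main thread continues with other computation and later calls chan.get() to collect the data, overlapping the computation with the I/O.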


00:03:44.900 And this is relatively simple, in that

00:03:48.333 it doesn't really require any new language

00:03:50.900 or runtime features. It’s the same blocking

00:03:54.033 calls, it's the same multithreading functions,

00:03:56.100 we have anyway.


00:03:58.033 And it doesn’t really change the way we do I/O.

00:04:02.066 It's not changing the fundamental model of

00:04:04.933 either I/O or threading.


00:04:08.566 But it does mean we have to

00:04:10.100 move the I/O code to a separate thread,

00:04:12.866 which is some programming overhead.


00:04:17.800 It also has the advantage, though,

00:04:19.633 that it can run in parallel if

00:04:20.900 the system really does have multiple cores.

00:04:23.233 And it's safe, especially in Rust,

00:04:25.333 because the ownership rules prevent data races

00:04:27.866 so it's easy, and relatively straightforward,

00:04:30.633 to push the code into a separate thread.


00:04:36.433 The disadvantages

00:04:39.233 are that it adds complexity.


00:04:43.233 Creating a thread in Rust isn't difficult.

00:04:46.400 But it's harder than not creating a thread.


00:04:51.033 Spawning a thread, partitioning the I/O operations

00:04:55.666 into a separate thread, isn't conceptually difficult

00:04:58.733 in Rust, but it complicates the code.


00:05:02.400 It's a more complex programming model.

00:05:05.366 It obfuscates the structure of the code.


00:05:09.800 It's relatively resource heavy.


00:05:14.700 You have the overheads of context switching

00:05:17.333 to the separate threads, and each thread

00:05:20.500 has its own stack, its own memory overhead.


00:05:24.133 And that's not an enormous overhead,

00:05:26.333 but it is an overhead. And it's

00:05:28.400 a much heavier overhead than just running

00:05:31.233 the read() call within the main thread.


00:05:34.066 And the fact that the threads can run concurrently,

00:05:37.466 the fact that the threads can run

00:05:38.800 in parallel if you have multicore hardware,

00:05:41.166 is actually a relatively limited benefit,

00:05:43.066 because the thread spends most of its

00:05:44.833 time blocked waiting for I/O anyway.


00:05:47.366 So it's a bit of a waste

00:05:49.333 starting a new thread, just to call

00:05:51.800 a read() function which then blocks for 90% of its lifetime.


00:06:00.466 So we can certainly perform

00:06:02.533 blocking I/O using multiple threads.


00:06:08.033 But it's problematic.


00:06:10.700 It’s high overhead in a lot of

00:06:13.066 cases. You've got context switch overheads,

00:06:15.766 you've got the memory overheads due to the separate stack,

00:06:19.466 you've got the scheduler overheads, and there’s

00:06:22.900 not that much benefit of parallelism anyway,

00:06:26.200 because it blocks for most of the time.


00:06:30.800 And some systems allow you to avoid

00:06:32.966 this. In Erlang, for example, the threading

00:06:37.533 is pretty lightweight, but in most systems

00:06:39.366 this is not the case, and it's

00:06:41.066 a relatively high overhead to start multiple

00:06:43.066 threads to perform I/O.


00:06:48.033 We'd like something more lightweight

00:06:51.033 as an alternative. We'd like to somehow

00:06:52.900 be able to multiplex I/O operations in

00:06:55.833 a single thread,

00:06:57.633 we'd like to somehow

00:07:00.033 allow I/O operations to complete asynchronously,

00:07:02.866 without having to have separate threads of control.


00:07:06.666 So, it would be desirable to provide

00:07:09.233 a mechanism which allows us to start asynchronous I/O,

00:07:12.666 start an I/O operation, and let it

00:07:14.966 run in the background, and somehow allow

00:07:18.033 us to poll the kernel to see

00:07:19.500 if it has finished yet,

00:07:21.000 all running within a single application thread.


00:07:25.100 The idea would be that you start

00:07:26.933 the I/O, and then it continues in

00:07:29.800 the background while the

00:07:31.366 program continues performing other computations. And then,

00:07:36.433 at some point, once the data is available,

00:07:40.233 either there's a callback which gets executed

00:07:43.933 to provide the data, or the main

00:07:47.266 thread can just poll it and say

00:07:48.533 “has it finished?”, and if it has,

00:07:50.100 it can pull the data in.


00:07:55.166 And this is also a reasonably common

00:07:57.866 abstraction. If you’re a C programmer,

00:08:01.100 this is the select() function in the

00:08:03.266 Berkeley sockets API, for example.


00:08:08.233 And there's a bunch of new,

00:08:11.933 higher performance, versions of this, such as

00:08:14.600 epoll(), if you're a Linux or an

00:08:16.666 Android programmer, or the kqueue abstraction in

00:08:19.900 FreeBSD, or macOS, or on the iPhone.


00:08:23.266 Or, if you're a Windows programmer,

00:08:25.433 I/O completion ports do something very similar.


00:08:28.000 And this tends to get wrapped in

00:08:29.300 libraries, such as libevent, or libev,

00:08:33.266 or libuv, which try to provide

00:08:35.200 common portable APIs for them.


00:08:38.566 And, in Rust, the mio library also

00:08:43.166 provides a portable abstraction for this.


00:08:47.566 The functionality that these things provide,

00:08:49.866 is the ability to trigger non-blocking operations.


00:08:52.533 They provide you a way of saying

00:08:54.800 “read asynchronously” or “write asynchronously” from a

00:08:57.366 file or a socket,

00:08:59.300 and they provide a poll() abstraction,

00:09:01.433 so you can periodically check to see

00:09:02.933 if it completed and retrieve the data.


00:09:07.533 And they're actually pretty efficient.


00:09:10.366 They meet the goals of efficiency,

00:09:14.333 they meet the goals of only running

00:09:16.033 in a single thread, and

00:09:17.900 they build on features of the operating

00:09:19.666 system kernels that provide asynchronous I/O.


00:09:23.433 The problem with them, is that they

00:09:25.600 require, again, restructuring the code to avoid blocking.


00:09:31.966 And this is an example. This is

00:09:35.500 network code using sockets and select() function

00:09:39.900 in C. And what we see here

00:09:43.733 is the select() call, where you pass it


00:09:46.833 the three parameters, the set of readable

00:09:49.433 file descriptors, the set of writeable file

00:09:51.533 descriptors, the set of file descriptors that might

00:09:53.766 deliver errors, and a timeout.


00:09:57.600 You have to bundle up all the

00:09:59.866 file descriptors that may have outstanding asynchronous

00:10:02.666 I/O, fill them into these parameters,

00:10:04.900 call it, and then poll each of

00:10:07.500 these in turn, using the

00:10:08.900 FD_ISSET calls, to see which of those

00:10:11.166 different file descriptors

00:10:12.566 have data available to read or write.


00:10:16.533 And it's a relatively low-level API,

00:10:18.900 which is reasonably well suited to C programming,

00:10:22.066 but it does require quite a restructuring

00:10:24.633 of the code. This is no longer

00:10:26.700 as simple as just calling read() as

00:10:28.466 part of a loop. It’s restructuring the

00:10:30.700 whole program as an event loop,

00:10:33.266 where you poll on different file descriptors,

00:10:35.266 different sockets.
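
The event-loop restructuring described here can be sketched with Python's select module, which wraps the same Berkeley sockets select() call (an illustrative sketch; event_loop is a hypothetical helper, not code from the lecture):

```python
import select
import socket

def event_loop(socks, handle):
    """A minimal select()-based event loop: bundle the file
    descriptors into the readable and error sets, poll, and
    dispatch whichever sockets have data available."""
    socks = list(socks)
    while socks:
        readable, _, errored = select.select(socks, [], socks, 1.0)
        for s in errored:
            socks.remove(s)
        for s in readable:
            data = s.recv(4096)       # guaranteed not to block here
            if not data:
                socks.remove(s)       # peer closed the connection
            else:
                handle(s, data)
```

Note how the program's structure is now inverted: instead of a simple read loop per connection, a single loop polls all the sockets and dispatches to handlers.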


00:10:37.633 And the alternative libraries I mentioned,

00:10:40.433 such as libuv for C programming,

00:10:43.366 such as mio for Rust programming,

00:10:46.333 make this more portable, and they're a

00:10:49.533 little bit higher level, and they remove

00:10:51.833 some of the boilerplate, but conceptually you're

00:10:53.866 doing the same thing.


00:10:55.300 Conceptually you have to restructure the program

00:10:57.733 as an event loop, where you trigger

00:11:00.733 the asynchronous I/O, and every so often

00:11:03.233 you poll it to see if it's

00:11:04.400 completed. And it involves restructuring the code.


00:11:12.166 Now, these approaches have the advantage that

00:11:15.633 they're very efficient.


00:11:17.233 Because the asynchrony is handled by the

00:11:20.600 operating system kernel,

00:11:24.066 a single thread can

00:11:26.833 very efficiently handle multiple sockets.

00:11:29.433 All it does is trigger the asynchronous operation,

00:11:31.800 and a kernel thread

00:11:34.366 handles all the rest.


00:11:37.266 The mechanisms to run these operations concurrently

00:11:42.266 are built into the kernel, and they're pretty efficient.


00:11:46.500 But it requires us to rewrite the application code.


00:11:50.400 It requires us to restructure the application

00:11:53.300 as something which looks different. As something

00:11:56.366 which has an event loop, which polls

00:11:59.000 the data sources and reassembles the data.


00:12:02.566 So, fundamentally, we have two choices.


00:12:06.166 We can structure the code as a

00:12:08.466 set of multiple threads, which involves spawning

00:12:11.933 a thread and restructuring the code as

00:12:14.366 a set of multiple threads which pass

00:12:16.633 the data back once we've successfully read the data.


00:12:20.000 Or we can restructure the code using

00:12:23.000 the asynchronous I/O primitives, which involves

00:12:28.433 turning it into an event loop with

00:12:31.666 polling and so on. Again, both ways

00:12:35.266 involve a fairly fundamental rewrite of the

00:12:39.533 code to get these efficiency gains.


00:12:43.866 What we would like is to be

00:12:45.233 able to get this efficiency, get the

00:12:47.000 efficiency of non-blocking I/O, in a much

00:12:49.333 more usable manner.


00:12:52.166 And the idea is that coroutines and

00:12:54.600 asynchronous code are one way of doing

00:12:57.266 that, so that's what I'll talk about

00:12:58.800 in the next part.


00:13:01.866 So.


00:13:03.766 How do we overlap I/O and computation?


00:13:06.500 Well, if we want to do it

00:13:07.833 today, we have to spawn multiple threads,

00:13:10.133 or use the non-blocking I/O primitives provided

00:13:13.400 by the kernel through functions such as

00:13:15.300 select(), or mio, or libuv.


00:13:18.933 And these all work, but they introduce

00:13:21.000 complexity into the programming model.


00:13:23.533 Is there a better way?


00:13:25.800 Maybe. This is what asynchronous I/O and

00:13:28.600 coroutines are about, which we'll talk about

00:13:30.633 in the next part.

Part 2: async and await

The second part of the lecture discusses how coroutines and asynchronous code can be used to support I/O multiplexing on a single thread. It reviews what a coroutine is, and the way coroutines can execute concurrently with their caller by repeatedly yielding results; and it reviews how this is implemented. The way this can be used to support asynchronous I/O operations is outlined, leading to a description of asynchronous functions in Python and Rust. The need for runtime support is also discussed.

Slides for part 2


00:00:00.500 In this part, I want to talk

00:00:03.000 about coroutines and asynchronous code, and the

00:00:04.900 runtime support needed to execute asynchronous code.


00:00:10.000 As we discussed in the previous part,

00:00:12.066 blocking I/O is problematic.


00:00:15.000 Code, like we see in the example

00:00:16.533 at the top of the slide,

00:00:18.533 that calls blocking I/O functions, such as

00:00:20.900 read(), stalls the execution of the program

00:00:23.700 while waiting for those calls to complete.


00:00:27.000 And the work-arounds for this, using multiple

00:00:29.300 threads or asynchronous I/O functions, such as

00:00:31.933 select(), require extensive restructuring of the code.


00:00:36.633 The goal of using coroutines with asynchronous

00:00:39.133 code is to allow I/O and computation

00:00:42.066 to be performed concurrently on a single

00:00:44.200 thread, without restructuring the code. The hope

00:00:48.533 is that this will avoid the overheads of multithreading,

00:00:51.200 while retaining the original code structure.


00:00:54.233 Essentially we hope to transform the blocking

00:00:56.800 code shown at the top of the

00:00:58.333 slide, into the asynchronous, non-blocking, code such

00:01:02.033 as that shown at the bottom.

00:01:04.166 And provide the language runtime with the

00:01:05.900 ability to execute those asynchronous functions concurrently

00:01:09.933 with low-overhead asynchronous I/O operations.


00:01:16.000 The programming model we’re considering structures the

00:01:18.633 code as a set of concurrent coroutines

00:01:21.333 that accept data from I/O sources and

00:01:23.800 yield in place of blocking.


00:01:27.000 The coroutines execute concurrently, overlapping I/O and

00:01:30.366 computation, all within a single thread of

00:01:33.200 execution, and so avoid the overhead of multithreading.


00:01:37.700 To understand this programming model, though,

00:01:40.266 we must first ask what is a coroutine?


00:01:44.266 Well, if we consider a normal function,

00:01:47.300 we see that it’s called, executes for

00:01:49.466 a while, and returns a result.


00:01:53.000 A coroutine, in contrast, has the ability

00:01:55.466 to pause its execution. Rather than return

00:01:58.866 a single value, or even a list

00:02:00.866 of values, it lazily generates a sequence of values.


00:02:05.800 The slide shows an example, written in

00:02:07.766 Python. In this case, the countdown() function

00:02:11.500 is a coroutine that yields a sequence

00:02:13.633 of integers, counting down to zero from

00:02:16.566 the value given as its argument.


00:02:19.100 We see that calling countdown() with the

00:02:21.033 parameter 5, yields the values 5,

00:02:23.833 4, 3, 2, and 1, and that

00:02:26.966 these can be processed by a for loop.
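
The countdown() coroutine described here is, in outline, something like the following (a reconstruction of the slide's Python example):

```python
def countdown(n):
    """A generator: each call to next() resumes execution until the
    yield statement, producing values lazily rather than building a list."""
    while n > 0:
        yield n
        n -= 1

# The for loop drives the generator by repeatedly calling next() on it:
for i in countdown(5):
    print(i)        # prints 5, 4, 3, 2, 1
```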


00:02:30.000 Importantly, the countdown() function is not returning

00:02:32.933 a single value, comprising a list of

00:02:35.033 numbers counting down. Rather, calling countdown() returns

00:02:39.166 a generator object.


00:02:42.000 By itself, a generator object does nothing.


00:02:46.000 But the generator object implements a next()

00:02:48.133 method. And the for loop protocol in

00:02:51.066 Python takes a generator object and repeatedly

00:02:53.900 calls that next() method.


00:02:56.900 And each time next() is called,

00:02:58.800 the function executes until it reaches the

00:03:00.766 yield statement, then returns the next value.


00:03:04.400 Or it executes until the function ends,

00:03:07.433 when it returns None to indicate that

00:03:09.466 the generator has completed.


00:03:12.766 Essentially, the function is turned into a

00:03:14.666 heap allocated generator object that maintains state,

00:03:18.466 executes lazily in response to calls to

00:03:20.900 next(), and repeatedly yields the different values.


00:03:27.000 Coroutines in Python do nothing until the

00:03:29.666 next() function is called on the generator

00:03:31.566 object representing the coroutine. Normally this happens

00:03:35.433 automatically, as part of the operation of

00:03:37.900 a for loop, but we can also

00:03:39.966 call it manually, as we see here.


00:03:43.200 In this example, the grep() function is

00:03:45.600 a coroutine, and the call to

00:03:48.066 g = grep(“python”) instantiates the generator object.


00:03:53.000 But instantiating the generator doesn’t cause it to run.


00:03:56.866 The print() call at the

00:03:58.266 start of the grep() function doesn’t execute

00:04:00.566 until we call the next() method,

00:04:02.166 for example, forcing the coroutine to run

00:04:05.733 until it yields.


00:04:08.000 In this case, the function yields to

00:04:09.966 consume a value, so we call send()

00:04:11.966 rather than next(), and pass in a

00:04:13.800 value. And we do this repeatedly,

00:04:16.433 passing in different values each time,

00:04:19.000 and each time causing a single iteration

00:04:21.100 of the while loop in the grep()

00:04:22.766 function to execute.
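
The grep() coroutine being described is, in outline, something like this (a reconstruction of the slide's example; the exact print statements are assumptions):

```python
def grep(pattern):
    """A consumer coroutine: 'line = yield' pauses until a value is
    sent in, then one iteration of the while loop runs."""
    print(f"searching for {pattern}")
    while True:
        line = yield              # pauses here until send() is called
        if pattern in line:
            print(line)

g = grep("python")          # instantiates the generator; nothing runs yet
next(g)                     # runs up to the first yield, printing the banner
g.send("no match here")     # resumes one loop iteration; prints nothing
g.send("python is fun")     # prints "python is fun"
```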


00:04:27.000 We see that the coroutine is a

00:04:28.633 function that executes concurrently with the rest

00:04:30.733 of the code.


00:04:32.533 It’s event driven. It only executes when

00:04:35.700 the runtime calls its next() or send()

00:04:37.833 method, which causes its execution to resume

00:04:40.933 until it next yields, at which point

00:04:42.900 control passes back to the runtime.


00:04:46.266 In the examples, we’ve only had a

00:04:48.233 single coroutine executing at once, but it’s

00:04:51.400 entirely possible to start several different coroutines,

00:04:54.466 and have the runtime loop, calling their

00:04:56.566 next() methods. This will cause the different

00:04:59.700 coroutines to execute concurrently, each one executing

00:05:02.566 for a while until it yields to the next.
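
This runtime loop over several coroutines can be sketched as follows (an illustrative toy scheduler; task and run_all are hypothetical names, not from the lecture):

```python
def task(name, steps):
    for i in range(steps):
        print(f"{name}: step {i}")
        yield                      # cooperative "context switch"

def run_all(coros):
    """A toy round-robin scheduler: call next() on each coroutine in
    turn; each runs until it yields, then control passes to the next."""
    coros = list(coros)
    while coros:
        for c in coros[:]:
            try:
                next(c)
            except StopIteration:
                coros.remove(c)    # this coroutine has finished

run_all([task("A", 2), task("B", 2)])   # interleaves: A0, B0, A1, B1
```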


00:05:06.033 It’s an approach that’s sometimes known as

00:05:08.100 cooperative multitasking.


00:05:10.800 The system context switches each time a

00:05:12.966 coroutine yields the processor, and if it

00:05:15.533 doesn’t yield, it keeps running.


00:05:18.400 This is how Microsoft Windows 3.1,

00:05:21.033 and the Macintosh System 7, handled multitasking.


00:05:26.233 But it gives us a basis for

00:05:27.800 efficient I/O handling within a single thread.


00:05:32.000 We structure the code to execute within

00:05:33.966 a thread as a set of coroutines

00:05:36.766 that trigger an asynchronous I/O operation,

00:05:39.433 and yield rather than blocking on I/O.


00:05:43.566 And we label the functions as being

00:05:46.100 async. This is a label that tells

00:05:49.233 the language runtime that those functions are

00:05:51.033 coroutines that call asynchronous I/O operations.


00:05:55.700 And I/O operations that would normally block

00:05:59.066 are labelled in the code with an await tag.


00:06:02.800 This causes the coroutine to

00:06:04.333 trigger the asynchronous version of that I/O

00:06:06.400 operation, then yield, passing control to another

00:06:10.066 coroutine while the I/O is performed.


00:06:14.000 This provides concurrent I/O, without parallelism.


00:06:18.166 The coroutines operate concurrently,

00:06:20.566 but within a single thread.


00:06:23.000 The calls to await tell the kernel

00:06:25.566 to start an asynchronous I/O operation,

00:06:28.233 and yield the file descriptor representing that

00:06:30.600 operation to the runtime.


00:06:33.000 The runtime operates in a loop.


00:06:36.033 It repeatedly calls select(), and adds the

00:06:39.033 coroutines that yielded any readable or writable

00:06:42.166 file descriptors to the end of a

00:06:44.400 run queue. Then, it resumes the coroutine

00:06:47.666 at the head of the run queue,

00:06:49.000 calling its next() or send() method as appropriate.


00:06:52.300 And when that coroutine yields, and returns

00:06:55.033 a file descriptor, it’s moved to the

00:06:57.500 list of blocked tasks, and its file

00:07:00.233 descriptor is added to the set to be polled in future.


00:07:04.000 And the loop continues, calling select() again.
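
The run-queue loop just described can be sketched in miniature (a toy illustration, not the lecture's actual runtime: here coroutines yield the socket they are waiting on, and a hypothetical run() function plays the role of the runtime):

```python
import select
from collections import deque

def run(coros):
    """A toy runtime loop: a coroutine yields a socket when it would
    block; the runtime parks it in the blocked set, select()s over the
    parked sockets, and moves tasks whose sockets became readable back
    onto the run queue."""
    ready = deque(coros)
    blocked = {}                        # socket -> parked coroutine
    while ready or blocked:
        while ready:
            coro = ready.popleft()
            try:
                sock = coro.send(None)  # resume until the next yield
            except StopIteration:
                continue                # this task finished
            blocked[sock] = coro
        if blocked:
            readable, _, _ = select.select(list(blocked), [], [])
            for sock in readable:
                ready.append(blocked.pop(sock))

def reader(sock, out):
    yield sock                 # "await" readability of this socket
    out.append(sock.recv(100))
```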


00:07:09.400 An async function is therefore a function

00:07:11.966 that performs asynchronous I/O operations and that

00:07:15.566 can operate as a coroutine. The slide

00:07:18.666 shows an example in Python.


00:07:21.500 The async functions are executed asynchronously by

00:07:24.400 the runtime, in response to I/O events,

00:07:27.333 and the program is written as a

00:07:29.233 set of async functions.


00:07:32.000 The main() function calls into the runtime

00:07:34.533 giving it the top-level async function to

00:07:36.366 run. This starts the asynchronous runtime,

00:07:39.733 starts it polling for I/O, and runs

00:07:42.133 until that async function completes.
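
In Python's asyncio, for example, this model looks like the following (an illustrative sketch, not the slide's code; fetch is a hypothetical async function that simulates I/O with a sleep):

```python
import asyncio

async def fetch(delay, value):
    await asyncio.sleep(delay)      # yields to the runtime while "I/O" runs
    return value

async def main():
    # Both operations are in flight at once, so the awaits overlap.
    a, b = await asyncio.gather(fetch(0.01, "a"), fetch(0.02, "b"))
    return a + b

result = asyncio.run(main())        # hand the top-level async function to the runtime
```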


00:07:45.566 It’s a widely supported model. Coroutines,

00:07:49.166 in the form of async functions,

00:07:51.133 exist in Python, JavaScript, C#, Rust,

00:07:54.466 and in other languages.


00:07:58.833 Within an async function, await statements cause

00:08:01.933 the function to yield control to the

00:08:03.566 runtime while an asynchronous I/O operation is performed.


00:08:08.800 Executing an await statement yields control to

00:08:11.733 the runtime. It puts the coroutine into

00:08:14.666 a queue to be woken at some

00:08:16.333 later time, when the I/O operation has completed.


00:08:20.300 If another coroutine is ready to execute,

00:08:23.100 then the runtime schedules the yielding function

00:08:25.800 to wake-up once the I/O completes,

00:08:28.333 and control passes to that other coroutine.


00:08:31.866 Otherwise, the runtime blocks until either this,

00:08:34.600 or some other, I/O operation becomes ready,

00:08:37.333 then passes control back to the corresponding

00:08:39.466 async function.


00:08:44.000 The resulting asynchronous code

00:08:46.666 follows the structure of the blocking code.


00:08:50.000 If we look at the async version

00:08:52.433 of the read_exact() function in Rust,

00:08:54.500 for example, we see that the only

00:08:56.900 differences are the async and await annotations,

00:08:59.366 and that the input is declared to

00:09:01.066 be something that implements the AsyncRead trait,

00:09:03.033 rather than the Read trait.


00:09:06.000 The code structure remains unchanged, aside from

00:09:09.300 the call to main that wraps the

00:09:10.733 async functions into a call to start

00:09:12.333 the runtime. And the compiler and runtime

00:09:15.300 work together to generate code that efficiently

00:09:17.766 executes the asynchronous I/O operations.
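
The same transformation can be seen in Python's asyncio, by analogy with the Rust example on the slide (an illustrative sketch: only the async and await annotations, and the reader type, differ from the blocking loop):

```python
import asyncio

async def read_exact(reader, n):
    """Async version of the earlier loop: same shape, but each read
    yields control to the runtime instead of blocking the thread."""
    buf = bytearray()
    while len(buf) < n:
        chunk = await reader.read(n - len(buf))   # yields instead of blocking
        if not chunk:
            raise EOFError("stream closed before enough data arrived")
        buf.extend(chunk)
    return bytes(buf)
```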


00:09:24.000 How is this implemented in Rust?


00:09:27.000 Well, in the Python code we saw

00:09:28.966 earlier, the coroutine was instantiated as a

00:09:31.566 generator object with a next() method that

00:09:34.200 allowed it to run and yield the next value.


00:09:37.933 Rust does something similar. The async functions

00:09:42.166 are compiled into instances of structs that

00:09:44.966 maintain the function state, and that implement

00:09:47.333 a trait known as Future. The Future

00:09:51.400 trait has an associated type that describes

00:09:54.300 the return value, and defines a poll()

00:09:56.966 function that runs the function until it

00:09:59.533 yields an instance of an enum.

00:10:01.266 And that’s either Ready, with the yielded

00:10:03.533 value, or Pending to indicate that the

00:10:05.966 async function is waiting for I/O.


00:10:09.000 The details differ between Rust and Python,

00:10:11.866 as you might expect, but the concepts are the same.
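
The polling model can be mimicked by hand in Python (a toy illustration of the concept, not the real Rust Future trait; Poll and Countdown are hypothetical names):

```python
from enum import Enum, auto

class Poll(Enum):
    PENDING = auto()
    READY = auto()

class Countdown:
    """Mimics a compiled async function: its state lives in the object,
    and each poll() advances it, returning Pending until the final
    value is Ready (cf. Rust's Future::poll)."""
    def __init__(self, n):
        self.n = n
    def poll(self):
        if self.n > 0:
            self.n -= 1               # one step of progress per poll
            return (Poll.PENDING, None)
        return (Poll.READY, "done")

fut = Countdown(2)
polls = 0
while True:                           # what a runtime's loop would do
    polls += 1
    state, value = fut.poll()
    if state is Poll.READY:
        break
```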


00:10:19.000 And that concludes this discussion of coroutines

00:10:21.066 and asynchronous code.


00:10:23.000 In the next part, I’ll talk briefly

00:10:25.100 about how to use async functions,

00:10:27.100 and about the advantages and disadvantages of

00:10:29.700 this approach to structuring code.

Part 3: Design Patterns for Asynchronous Code

The final part of the lecture discusses how to structure and compose asynchronous functions, and reviews the need to avoid blocking operations and long-running calculations in asynchronous functions. The benefits and problems of the asynchronous programming model are discussed.

Slides for part 3


00:00:00.366 In this final part, I want to

00:00:02.000 talk about some design patterns for asynchronous code,

00:00:04.900 and about the advantages and disadvantages

00:00:06.966 of asynchronous programming.


00:00:09.700 I’ll start by talking about some of the design patterns.

00:00:12.100 I’ll talk about how you compose future values,

00:00:14.900 about the need to avoid blocking I/O,

00:00:17.000 and the need to avoid long-running computations

00:00:19.400 in asynchronous code.


00:00:23.233 So in writing async functions, I think

00:00:25.800 the goal should be to make these

00:00:27.266 functions as small, and as limited in scope, as possible.


00:00:30.966 An async function should perform a single,

00:00:33.366 well-defined task. It should read and parse

00:00:36.400 a file, or it should read,

00:00:38.200 parse, process, and respond to a network

00:00:41.100 request, for example.


00:00:43.333 And if functions are structured in this

00:00:45.100 way, they tend to be fairly straightforward

00:00:47.566 and written in a fairly natural style,

00:00:49.900 and compose pretty straightforwardly.


00:00:53.866 The Rust async libraries provide some combinator

00:00:57.900 functions that can help compose futures.

00:01:02.400 That can help combine future values

00:01:04.600 and produce a new value.


00:01:06.766 And there are functions, such as read_exact(),

00:01:09.400 which allow it to read an

00:01:11.500 exact number of bytes, such as select()

00:01:14.333 to allow it to respond to different

00:01:17.866 Futures which

00:01:21.366 are operating concurrently, and functions such as

00:01:23.833 for_each() and and_then(), which can ease the

00:01:26.266 composition of the asynchronous functions.


00:01:29.433 And sometimes these are helpful, sometimes they

00:01:31.600 just obfuscate the code. But there are

00:01:34.366 a number of functions that can work

00:01:36.600 with, and combine, Futures in the cases that they’re useful.
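
By analogy, Python's asyncio offers similar combinators (an illustrative sketch using hypothetical fetch/main functions: gather plays a join-like role, and sequential awaits give and_then-style chaining):

```python
import asyncio

async def fetch(n):
    await asyncio.sleep(0.01)
    return n

async def main():
    # join-style composition: run two futures concurrently, collect both
    a, b = await asyncio.gather(fetch(1), fetch(2))
    # and_then-style chaining: feed the results into a follow-up future
    return await fetch(a + b)

result = asyncio.run(main())
```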


00:01:42.733 When writing asynchronous code, there’s two fundamental

00:01:47.166 constraints that you need to be aware of.


00:01:51.266 The first, due to the nature of

00:01:54.533 asynchronous code, is that it multiplexes multiple

00:01:58.600 I/O operations onto a single thread.


00:02:02.333 And the runtime has to provide asynchronous-aware

00:02:05.300 versions of all of these different I/O operations.


00:02:09.366 It has to provide asynchronous reads and

00:02:11.833 writes to files, asynchronous reads and writes

00:02:14.800 to the network, TCP, UDP, and Unix

00:02:17.966 sockets, and to other types of network protocols.


00:02:21.500 And in all of these cases,

00:02:22.833 it has to provide a non-blocking version

00:02:24.866 of the I/O operation, that returns a

00:02:27.100 Future that can interact with the runtime.


00:02:29.900 Rather than natively calling the blocking function

00:02:34.066 provided by the operating system, it has

00:02:36.900 to wrap the underlying operating system

00:02:39.300 provided async I/O operations,

00:02:42.600 and wrap them into Futures.


00:02:47.633 And, importantly,

00:02:48.666 it doesn't interact well with blocking I/O.


00:02:52.233 If you call the synchronous version of

00:02:55.700 read(), for example, to read from a

00:02:57.600 file, and that blocks, it will block

00:02:59.933 the entire runtime, because the runtime is

00:03:02.800 operating within the context of a single thread,

00:03:06.500 and the underlying system doesn't know about Futures.


00:03:11.533 The programmer has to have the discipline

00:03:15.233 to avoid calling the blocking functions of

00:03:17.600 the code, otherwise the entire asynchronous runtime

00:03:20.766 grinds to a halt while that blocking

00:03:23.733 operation completes.


00:03:28.466 And this, to some extent, runs the

00:03:30.566 risk of fragmenting the ecosystem. It means

00:03:33.433 that libraries which are supposed to be

00:03:35.166 used in an asynchronous context have to

00:03:38.233 be written to use asynchronous functions,

00:03:40.533 have to be written to call await(),

00:03:42.266 have to be written to use the

00:03:43.833 asynchronous versions of the I/O libraries.


00:03:46.333 And if anyone, any of the library

00:03:49.333 authors, implementing any of those libraries forgets,

00:03:51.966 then you run the risk of that

00:03:54.000 blocking the runtime,

00:03:55.600 losing concurrency, losing the performance.


00:03:59.900 And this is a potential source of

00:04:01.666 bugs, because the Rust compiler, the Rust

00:04:05.500 language, can't catch this sort of behaviour.


00:04:09.033 The programmer is required to have the

00:04:11.333 discipline to avoid the blocking I/O operations.


00:04:16.600 Similarly, the programmer has to avoid long

00:04:20.600 running computations.


00:04:23.500 Control passing between different Futures, between different

00:04:27.933 async functions, is explicit. Control passing happens

00:04:31.833 when you call await, to wait

00:04:34.400 for an I/O operation.


00:04:36.200 At that point, the next runnable Future,

00:04:38.500 the next runnable async function, is scheduled.
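
To make that explicit control transfer concrete, here is a hand-written future, sketched with std only, that yields exactly once: returning `Poll::Pending` is how an async task hands control back so the next runnable future can be scheduled. The `YieldNow` and `NoopWaker` names are illustrative, not part of any library.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// A waker that does nothing: we poll by hand in a loop below.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// A future that returns Pending on its first poll. Awaiting it is an
// explicit scheduling point: the task hands control back to the runtime.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // ask the runtime to poll us again
            Poll::Pending
        }
    }
}

fn main() {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(YieldNow { yielded: false });

    // First poll: the future yields, so a real runtime would now run
    // some other ready task before polling this one again.
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Pending);
    // Second poll: the future completes.
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Ready(()));
    println!("yielded once, then completed");
}
```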


00:04:44.466 But in the same way that calling

00:04:46.133 blocking functions is problematic, because it causes

00:04:48.900 the runtime to stop while that function

00:04:51.166 blocks, if instead you don't call await,

00:04:54.866 and you perform some long running computation,

00:04:58.266 the runtime won't know, won’t be able

00:05:01.466 to switch away from that computation,

00:05:04.133 and it will starve the other tasks from running.


00:05:07.533 If you have a long running computation,

00:05:09.900 you need to spawn a separate thread,

00:05:13.766 and explicitly pass messages to and from

00:05:16.900 that thread, to avoid

00:05:19.466 starving the other asynchronous computations.
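
A minimal sketch of that pattern, using only std threads and channels (runtimes such as Tokio also provide a dedicated facility, `spawn_blocking`, for exactly this): the long-running computation runs on its own OS thread and sends its result back over a channel, so it cannot starve the tasks multiplexed onto the runtime's thread.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();

    // Run the compute-heavy work on its own OS thread, where the
    // kernel scheduler can pre-empt it.
    thread::spawn(move || {
        let sum: u64 = (1..=1_000_000).sum();
        tx.send(sum).expect("receiver dropped");
    });

    // The main thread (standing in for the async runtime) picks up
    // the result from the channel when it is ready.
    let sum = rx.recv().expect("worker thread panicked");
    assert_eq!(sum, 500_000_500_000);
    println!("sum = {sum}");
}
```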


00:05:23.200 And again, the language, the runtime,

00:05:25.700 doesn't help with this. The programmer needs

00:05:28.366 to be aware that you must avoid

00:05:30.233 long running computations, as well as blocking

00:05:32.500 I/O, in the context of an asynchronous runtime.


00:05:36.433 So we're getting good performance, but we're

00:05:38.666 getting good performance by limiting what the

00:05:41.666 programmer can do in the async runtime,

00:05:44.600 and by requiring them to have the

00:05:46.100 discipline to make sure that they follow those limitations.


00:05:54.400 So, is the asynchronous approach to programming

00:05:57.833 a good approach?


00:05:59.633 Should we all be switching to asynchronous

00:06:01.700 code in the future?


00:06:06.100 Well…


00:06:09.900 The use of async and await lets

00:06:12.933 us structure the code in a way

00:06:17.733 that allows us to efficiently multiplex large

00:06:20.466 numbers of I/O operations on a single thread.


00:06:25.566 And this can give a very natural

00:06:27.433 programming model when you're performing operations which

00:06:30.533 are very heavily I/O bound.


00:06:33.400 It lets us structure code, which performs

00:06:35.866 asynchronous non-blocking I/O, in a way that

00:06:38.933 looks very similar to code that uses blocking I/O.


00:06:42.366 That can efficiently multiplex multiple I/O operations

00:06:46.500 onto a single thread,

00:06:49.066 efficiently allow them to run concurrently,

00:06:51.666 without the overheads of starting-up multiple threads.


00:06:56.566 So for I/O bound tasks, this can

00:06:59.333 be very, very efficient, and very natural.


00:07:05.166 But.


00:07:07.600 It's problematic, if there are blocking operations,

00:07:11.233 as we saw a minute ago,

00:07:13.266 because the blocking operations lock-up the entire

00:07:16.000 runtime, and not just that one task.


00:07:19.000 And it means all the libraries,

00:07:21.500 that use blocking calls need to be

00:07:23.033 updated to use these asynchronous I/O operations.


00:07:27.933 And this either means everything has to

00:07:30.100 use asynchronous I/O, or people need to

00:07:32.633 build two versions of all the libraries.

00:07:34.666 Some that are synchronous, and use the

00:07:36.466 blocking operations, and some which are asynchronous.


00:07:40.666 It’s also problematic when it comes to

00:07:42.666 long running computations. Again, as we saw,

00:07:45.966 they starve the other tasks, because the

00:07:48.133 runtime only switches away when you call

00:07:51.466 asynchronous operations, when you call await.


00:07:55.433 And so, if you're trying to mix

00:07:57.800 long-running, compute-heavy,

00:07:59.100 functions with asynchronous I/O functions

00:08:01.800 and I/O-bound tasks, it gets to be

00:08:04.700 problematic, and tends to starve the I/O tasks.


00:08:08.866 And this is a problem which is

00:08:10.666 familiar to anyone who wrote code on

00:08:12.300 Windows 3.1 or on the Macintosh System 7.


00:08:17.300 And it led to real interactivity problems

00:08:20.600 with those applications, where applications would just

00:08:25.366 not yield the CPU, and it would

00:08:27.833 prevent the multitasking from working for a while.


00:08:31.166 I worry that by promoting asynchronous code,

00:08:35.466 we're just introducing the same hard to debug

00:08:39.433 problems with task starvation, into the next

00:08:43.200 generation of applications.


00:08:46.600 The asynchronous I/O works really well when

00:08:49.500 you have a lot of

00:08:51.833 I/O-bound tasks. It doesn't mix well with

00:08:54.966 compute-bound tasks.


00:09:01.733 And, to some extent, I wonder whether

00:09:03.800 we really need the asynchronous I/O? Whether

00:09:07.033 we really need this for performance?


00:09:10.000 It's certainly true that threads are more

00:09:12.766 expensive than async functions, and async tasks,

00:09:17.466 in the runtime.


00:09:20.966 But threads are not that expensive.


00:09:24.600 A properly configured modern machine can run

00:09:28.066 many, many thousands of threads,

00:09:29.633 without any great difficulty.


00:09:32.966 The laptop I’m recording this lecture on,

00:09:35.233 for example, which is a low-end MacBook

00:09:38.533 Air with a Core i5 processor,

00:09:41.566 is running about 2200 threads in normal everyday use.


00:09:47.700 And if you look up the documentation

00:09:50.233 for, for example, the Varnish web cache,

00:09:53.000 which is a caching web proxy that’s

00:09:56.000 quite popular in data centres,

00:09:59.033 the documentation says it's common to configure

00:10:02.566 this with 500 or 1000 threads,

00:10:05.066 at minimum, but they rarely recommend running

00:10:08.800 more than 5000 threads.


00:10:12.766 And, I think, unless you're doing something

00:10:14.866 very, very unusual, it's likely you can

00:10:17.866 just spawn a thread, or use a

00:10:20.100 pre-configured thread pool, and perform blocking I/O,

00:10:25.666 and just communicate using channels, and the

00:10:28.166 performance will be just fine.


00:10:30.833 Even if this means spawning up thousands of threads.


00:10:35.866 Modern servers can run thousands, or tens

00:10:39.500 of thousands, of simultaneous threads without any

00:10:42.000 great difficulty.


00:10:44.833 Threading is not that expensive these days.


00:10:49.400 And asynchronous I/O can give a performance benefit.


00:10:54.000 But that performance benefit is usually not that great.


00:10:59.533 So my recommendation is,

00:11:01.600 choose asynchronous programming because you prefer the

00:11:04.300 programming style, if you like.


00:11:06.800 But don't choose it for performance reasons,

00:11:12.066 unless you're really sure that it will

00:11:13.766 improve performance. Threading is not as expensive

00:11:16.433 as you think.


00:11:21.466 And that concludes our discussion of coroutines

00:11:24.200 and asynchronous I/O.


00:11:26.566 As we've seen, blocking I/O can be

00:11:29.166 problematic. It's problematic because it forces you

00:11:33.000 to write multi-threaded code, which has

00:11:35.333 a reputation as being high overhead.


00:11:38.400 Or it's problematic because you have to

00:11:40.500 structure your code as a select loop.


00:11:43.000 And in both of these cases there's

00:11:45.466 a restructuring of the code needed.


00:11:49.033 The use of coroutines and asynchronous code,

00:11:52.100 in the best cases, can give you

00:11:55.100 a structure for I/O-heavy code, which allows

00:11:59.066 you to perform asynchronous I/O operations without

00:12:02.166 greatly restructuring the code, and just involves

00:12:05.333 some small number of annotations,

00:12:07.900 and switching to use the async version of the runtime.


00:12:12.733 And, in those cases it's quite a

00:12:14.566 natural programming model, and it works very well.


00:12:18.500 It does, though, run the risk of

00:12:20.133 fragmenting the ecosystem into async aware,

00:12:22.666 and non-async aware functions, and I think

00:12:26.000 it runs the risk of

00:12:28.100 introducing hard to find bugs with blocking

00:12:30.733 code, and with CPU hogging code.


00:12:35.833 Is it worth it?


00:12:38.600 I don't know, maybe.


00:12:41.033 It gives very natural code in some

00:12:42.866 cases, and very high performance code in

00:12:45.600 some cases, I think it can work

00:12:47.600 very well in those cases.


00:12:50.100 In other cases, in other applications,

00:12:53.533 the multi-threaded, blocking, version of the code

00:12:56.966 is just as natural to write,

00:12:59.666 and also scales very well.


00:13:02.933 Different applications have different requirements, and I

00:13:05.966 would encourage experimentation, rather than just diving

00:13:09.500 straight in to using asynchronous functions and await.


00:13:17.166 In the next lecture, we’ll move on,

00:13:19.233 and instead of talking about concurrency,

00:13:21.466 we'll move on to talk about security.


Lecture 8 discussed coroutines and asynchronous programming. It started by explaining why blocking I/O is problematic, because I/O operations are slow and block execution of the thread while they're performed. The lecture then outlined the two traditional approaches to addressing these concerns: either spawning an additional thread to perform the I/O, or introducing asynchronous I/O primitives, backed by kernel threads, to perform I/O on behalf of the user program. It noted that both of these approaches work, but threads have relatively high overhead, and both approaches require significant code restructuring.

The second part of the lecture then discussed the use of coroutines and asynchronous code, in the form of the async and await primitives. These allow asynchronous execution of I/O operations with low overhead and only limited restructuring of the code, to label functions as being able to execute as asynchronous coroutines and to label scheduling points where the system may block due to I/O. The result is widely used, efficient, and, in many cases, requires only limited code changes.

The lecture also discussed, briefly, the runtime support needed to support asynchronous code in the form of coroutines, and outlined how this is implemented in Rust using polling and a runtime such as Tokio. It noted the limitations around avoiding blocking operations and long-running computations, and noted the similarity of these to the cooperative multitasking used in Mac System 7 and Windows 3.1.

Finally, the lecture considered the overheads and costs of using asynchronous I/O versus multiple threads.

Discussion will focus on the need for asynchronous I/O operations using coroutines, and the claimed performance benefits versus the costs of restructuring the code.