Advanced Systems Programming H (2021-2022)
Lecture 8: Coroutines and Asynchronous Programming
Lecture 8 discusses coroutines and asynchronous programming. The limitations of multithreaded applications with blocking I/O are reviewed, and the use of asynchronous functions, coroutines, to multiplex several concurrent non-blocking I/O operations onto a single thread is discussed. The resulting asynchronous programming model is described, along with its benefits and limitations.
Part 1: Motivation
The first part of the lecture discusses the motivation for using coroutines and asynchronous programming. It reviews the problems caused by blocking I/O operations, how these lead to the use of multi-threaded code, and how they affect the structure of programs.
00:00:00.366 In this lecture,
00:00:01.300 I’ll move on from talking about concurrency,
00:00:03.466 and talk about something closely related,
00:00:05.566 which is coroutines and asynchronous programming.
00:00:09.566 In this part, I’ll start by talking
00:00:11.400 about why we might want asynchronous programming,
00:00:13.633 talk about some of the motivation.
00:00:15.966 In the second part, I’ll talk about
00:00:17.433 the idea of coroutines, and how they're
00:00:19.466 implemented in terms of the async and
00:00:21.433 await primitives, and finally, in the last
00:00:24.066 part of the lecture, I’ll talk about
00:00:25.533 some design patterns for asynchronous code.
00:00:29.866 So what's the motivation? Why do we want coroutines?
00:00:34.100 Why do we want asynchronous code?
00:00:36.533 It’s all to do with overlapping I/O
00:00:38.833 and computation, it’s all to do with
00:00:40.566 avoiding multi-threading, and
00:00:43.200 building non-blocking alternatives to the I/O primitives.
00:00:49.566 If we look at regular I/O code,
00:00:53.633 you tend to have functions which look
00:00:56.066 like the example we have on the
00:00:58.300 top of the slide; the read_exact() function.
00:01:02.000 What this function is doing, is reading
00:01:04.166 an exact amount of data from some
00:01:06.466 input stream, and storing it in a buffer.
00:01:09.466 For example, it might be reading from
00:01:11.166 a file, or it might be reading from a socket.
00:01:14.233 If it's reading from a TCP socket, for example,
00:01:18.300 then it's possible, depending on the
00:01:22.200 congestion control, depending on what the sender
00:01:24.200 is doing, it’s possible that the socket
00:01:26.100 doesn't have the requested amount of data
00:01:27.766 available to read.
00:01:31.000 And so this function has to work
00:01:32.633 in a loop, repeatedly reading from the
00:01:35.733 socket until it's received the amount of
00:01:37.766 data that's been requested.
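The slide's read_exact() code isn't reproduced in the transcript; as a rough stand-in, a minimal blocking version might look like this sketch in Python (the name and the error handling are assumptions, not the slide's exact code):

```python
import io

def read_exact(stream, n):
    # Repeatedly read until exactly n bytes have arrived; each call to
    # read() may legitimately return fewer bytes than were requested.
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:  # stream closed before n bytes arrived
            raise EOFError("stream ended early")
        buf.extend(chunk)
    return bytes(buf)

data = read_exact(io.BytesIO(b"hello world"), 5)
print(data)  # → b'hello'
```

Each iteration may block inside stream.read(), which is exactly the problem discussed next.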
00:01:42.300 And this works, and it's simple and it's straightforward.
00:01:46.866 The problem with this type of code, though, is twofold.
00:01:52.266 Firstly, the I/O operations can be very slow.
00:01:56.033 The function can block for a
00:01:58.733 long time, while reading from the file descriptor.
00:02:03.166 And that could just be because the
00:02:04.633 disk is slow, or it could be
00:02:06.166 because the network is slow, or it
00:02:08.066 could be because it's blocked because it's
00:02:09.766 reading from the network and there's nothing
00:02:11.600 available to read and it has to
00:02:13.133 wait for the sender to send more data.
00:02:16.333 So calls to the read() function can
00:02:18.033 take many millions of cycles. They can
00:02:20.600 be incredibly slow.
00:02:24.333 The other issue here is that when
00:02:27.566 the read() function is called, it blocks the thread.
00:02:31.833 The thread of execution stops while waiting
00:02:35.166 for that period of time, while the
00:02:37.666 read() call is completing.
00:02:39.766 And this prevents other computations from running,
00:02:42.700 and if the thread is also part
00:02:45.000 of your user interface, it disrupts the user experience.
00:02:48.966 So, ideally, we want to be able
00:02:50.633 to overlap the I/O and the computation.
00:02:52.966 We want to be able to run them concurrently.
00:02:56.233 Ideally, we want to be able to allow
00:02:57.933 multiple concurrent I/O operations.
00:03:02.966 The way this has traditionally been solved,
00:03:05.966 of course, is by using multiple threads.
00:03:08.533 The usual solution to this, is to
00:03:10.800 move the blocking I/O operations out of
00:03:13.400 the main thread, and put them into
00:03:15.400 a separate thread which does the I/O,
00:03:17.500 and then reports back once the data is available.
00:03:22.066 And, in Rust, the way you might do this
00:03:25.066 is outlined in the code on the
00:03:27.833 slide. You create a channel, and you
00:03:30.800 spawn a thread to perform I/O.
00:03:33.133 And that thread then sends the results
00:03:34.733 back down the channel, and the rest
00:03:36.833 of the program continues, and overlaps with
00:03:39.500 the execution of the I/O operation.
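The Rust code on the slide isn't included in the transcript; the same channel-and-thread pattern can be sketched in Python (the file name and function names here are illustrative, not from the slide):

```python
import os
import queue
import tempfile
import threading

def read_in_background(path, channel):
    # Worker thread: perform the blocking read, then report back
    # down the channel once the data is available.
    with open(path, "rb") as f:
        channel.put(f.read())

# Create a file for the worker to read (illustrative setup).
fd, path = tempfile.mkstemp()
os.write(fd, b"payload")
os.close(fd)

channel = queue.Queue()
threading.Thread(target=read_in_background, args=(path, channel)).start()
# ... the main thread is free to do other work here ...
data = channel.get()   # rendezvous with the I/O thread
os.remove(path)
print(data)  # → b'payload'
```

The blocking call is unchanged; it has simply been moved onto a thread of its own, with a channel carrying the result back.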
00:03:44.900 And this is relatively simple, in that
00:03:48.333 it doesn't really require any new language
00:03:50.900 or runtime features. It’s the same blocking
00:03:54.033 calls, it's the same multithreading functions
00:03:56.100 we have anyway.
00:03:58.033 And it doesn’t really change the way we do I/O.
00:04:02.066 It's not changing the fundamental model of
00:04:04.933 either I/O or threading.
00:04:08.566 But it does mean we have to
00:04:10.100 move the I/O code to a separate thread,
00:04:12.866 which is some programming overhead.
00:04:17.800 It also has the advantage, though,
00:04:19.633 that it can run in parallel if
00:04:20.900 the system really does have multiple cores.
00:04:23.233 And it's safe, especially in Rust,
00:04:25.333 because the ownership rules prevent data races
00:04:27.866 so it's easy, and relatively straightforward,
00:04:30.633 to push the code into a separate thread.
00:04:36.433 The disadvantages
00:04:39.233 are that it adds complexity.
00:04:43.233 Creating a thread in Rust isn't difficult.
00:04:46.400 But it's harder than not creating a thread.
00:04:51.033 Spawning a thread, partitioning the I/O operations
00:04:55.666 into a separate thread, isn't conceptually difficult
00:04:58.733 in Rust, but it complicates the code.
00:05:02.400 It's a more complex programming model.
00:05:05.366 It obfuscates the structure of the code.
00:05:09.800 It's relatively resource heavy.
00:05:14.700 You have the overheads of context switching
00:05:17.333 to the separate threads, and each thread
00:05:20.500 has its own stack, its own memory overhead.
00:05:24.133 And that's not an enormous overhead,
00:05:26.333 but it is an overhead. And it's
00:05:28.400 a much heavier overhead than just running
00:05:31.233 the read() call within the main thread.
00:05:34.066 And the fact that the threads can run concurrently,
00:05:37.466 the fact that the threads can run
00:05:38.800 in parallel if you have multicore hardware,
00:05:41.166 is actually a relatively limited benefit,
00:05:43.066 because the thread spends most of its
00:05:44.833 time blocked waiting for I/O anyway.
00:05:47.366 So it's a bit of a waste
00:05:49.333 starting a new thread, just to call
00:05:51.800 a read() function which then blocks 90% of its lifetime.
00:06:00.466 So we can certainly perform
00:06:02.533 blocking I/O using multiple threads.
00:06:08.033 But it's problematic.
00:06:10.700 It’s high overhead in a lot of
00:06:13.066 cases. You've got context switch overheads,
00:06:15.766 you've got the memory overheads due to the separate stack,
00:06:19.466 you've got the scheduler overheads, and there’s
00:06:22.900 not that much benefit of parallelism anyway,
00:06:26.200 because it blocks for most of the time.
00:06:30.800 And some systems allow you to avoid
00:06:32.966 this. In Erlang, for example, the threading
00:06:37.533 is pretty lightweight, but in most systems
00:06:39.366 this is not the case, and it's
00:06:41.066 a relatively high overhead to start multiple
00:06:43.066 threads to perform I/O.
00:06:48.033 We'd like something more lightweight
00:06:51.033 as an alternative. We'd like to somehow
00:06:52.900 be able to multiplex I/O operations in
00:06:55.833 a single thread,
00:06:57.633 we'd like to somehow
00:07:00.033 allow I/O operations to complete asynchronously,
00:07:02.866 without having to have separate threads of control.
00:07:06.666 So, it would be desirable to provide
00:07:09.233 a mechanism which allows us to start asynchronous I/O,
00:07:12.666 start an I/O operation, and let it
00:07:14.966 run in the background, and somehow allow
00:07:18.033 us to poll the kernel to see
00:07:19.500 if it has finished yet,
00:07:21.000 all running within a single application thread.
00:07:25.100 The idea would be that you start
00:07:26.933 the I/O, and then it continues in
00:07:29.800 the background while the
00:07:31.366 program continues performing other computations. And then,
00:07:36.433 at some point, once the data is available,
00:07:40.233 either there's a callback which gets executed
00:07:43.933 to provide the data, or the main
00:07:47.266 thread can just poll it and say
00:07:48.533 “has it finished?”, and if it has,
00:07:50.100 it can pull the data in.
00:07:55.166 And this is also a reasonably common
00:07:57.866 abstraction. If you’re a C programmer,
00:08:01.100 this is the select() function in the
00:08:03.266 Berkeley sockets API, for example.
00:08:08.233 And there's a bunch of new,
00:08:11.933 higher performance, versions of this, such as
00:08:14.600 epoll(), if you're a Linux or an
00:08:16.666 Android programmer, or the kqueue abstraction in
00:08:19.900 FreeBSD, or macOS, or on the iPhone.
00:08:23.266 Or, if you're a Windows programmer,
00:08:25.433 I/O completion ports do something very similar.
00:08:28.000 And this tends to get wrapped in
00:08:29.300 libraries, such as libevent, or libev,
00:08:33.266 or libuv, which try to provide
00:08:35.200 common, portable APIs for them.
00:08:38.566 And, in Rust, the mio library also
00:08:43.166 provides a portable abstraction for this.
00:08:47.566 The functionality that these things provide,
00:08:49.866 is the ability to trigger non-blocking operations.
00:08:52.533 They provide you a way of saying
00:08:54.800 “read asynchronously” or “write asynchronously” from a
00:08:57.366 file or a socket,
00:08:59.300 and they provide a poll() abstraction,
00:09:01.433 so you can periodically check to see
00:09:02.933 if it completed and retrieve the data.
00:09:07.533 And they're actually pretty efficient.
00:09:10.366 They meet the goals of efficiency,
00:09:14.333 they meet the goals of only running
00:09:16.033 in a single thread, and
00:09:17.900 they build on features of the operating
00:09:19.666 system kernels that provide asynchronous I/O.
00:09:23.433 The problem with them, is that they
00:09:25.600 require, again, restructuring the code to avoid blocking.
00:09:31.966 And this is an example. This is
00:09:35.500 network code using sockets and select() function
00:09:39.900 in C. And what we see here
00:09:43.733 is the select() call, where you pass it
00:09:46.833 the three parameters, the set of readable
00:09:49.433 file descriptors, the set of writeable file
00:09:49.433 descriptors, the set of file descriptors that might
00:09:51.533 deliver errors, and a timeout.
00:09:57.600 You have to bundle up all the
00:09:59.866 file descriptors that may have outstanding asynchronous
00:10:02.666 I/O, fill them into these parameters,
00:10:04.900 call it, and then poll each of
00:10:07.500 these in turn, using the
00:10:08.900 FD_ISSET calls, to see which of those
00:10:11.166 different file descriptors
00:10:12.566 have data available to read or write.
00:10:16.533 And it's a relatively low-level API,
00:10:18.900 which is reasonably well suited to C programming,
00:10:22.066 but it does require quite a restructure
00:10:24.633 of the code. This is no longer
00:10:26.700 as simple as just calling read() as
00:10:28.466 part of a loop. It’s restructuring the
00:10:30.700 whole program as an event loop,
00:10:33.266 where you poll on different file descriptors,
00:10:35.266 different sockets.
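Python's select module wraps the same Berkeley select() call, which makes the shape of the API easy to see; a minimal sketch of the polling step:

```python
import select
import socket

# A connected pair of sockets; b sends, so a has data pending.
a, b = socket.socketpair()
b.send(b"hello")

# As with the C API: pass the candidate readable, writable, and error
# sets plus a timeout; get back the subsets that are actually ready.
readable, writable, errored = select.select([a, b], [], [], 1.0)
print(a in readable, b in readable)  # → True False
```

An event loop built on this would now call recv() on each readable socket and dispatch the data, then loop back around to select() again.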
00:10:37.633 And the alternative libraries I mentioned,
00:10:40.433 such as libuv for C programming,
00:10:43.366 such as mio for Rust programming,
00:10:46.333 make this more portable, and they're a
00:10:49.533 little bit higher level, and they remove
00:10:51.833 some of the boilerplate, but conceptually you're
00:10:53.866 doing the same thing.
00:10:55.300 Conceptually you have to restructure the program
00:10:57.733 as an event loop, where you trigger
00:11:00.733 the asynchronous I/O, and every so often
00:11:03.233 you poll it to see if it's
00:11:04.400 completed. And it involves restructuring the code.
00:11:12.166 Now, these approaches have the advantage that
00:11:15.633 they're very efficient.
00:11:17.233 Because the asynchrony is handled by the
00:11:20.600 operating system kernel,
00:11:24.066 a single thread can
00:11:26.833 very efficiently handle multiple sockets.
00:11:29.433 All it does is trigger the asynchronous operation,
00:11:31.800 and a kernel thread
00:11:34.366 handles all the rest.
00:11:37.266 The mechanisms to run these operations concurrently
00:11:42.266 are built into the kernel, and they're pretty efficient.
00:11:46.500 But it requires us to rewrite the application code.
00:11:50.400 It requires us to restructure the application
00:11:53.300 as something which looks different. As something
00:11:56.366 which has an event loop, which polls
00:11:59.000 the data sources and reassembles the data.
00:12:02.566 So, fundamentally, we have two choices.
We can structure the code as a
00:12:08.466 set of multiple threads, which involves spawning
00:12:11.933 a thread that performs the blocking I/O
00:12:14.366 and passes the data back
00:12:16.633 once we've successfully read the data.
00:12:20.000 Or we can restructure the code using
00:12:23.000 the asynchronous I/O primitives, which involves
00:12:28.433 turning it into an event loop with
00:12:31.666 polling and so on. Again, both ways
00:12:35.266 involve a fairly fundamental rewrite of the
00:12:39.533 code to get these efficiency gains.
00:12:43.866 What we would like is to be
00:12:45.233 able to get this efficiency, get the
00:12:47.000 efficiency of non-blocking I/O, in a much
00:12:49.333 more usable manner.
00:12:52.166 And the idea is that coroutines and
00:12:54.600 asynchronous code are one way of doing
00:12:57.266 that, so that's what I'll talk about
00:12:58.800 in the next part.
00:13:03.766 How do we overlap I/O and computation?
00:13:06.500 Well, if we want to do it
00:13:07.833 today, we have to spawn multiple threads,
00:13:10.133 or use the non-blocking I/O primitives provided
00:13:13.400 by the kernel through functions such as
00:13:15.300 select(), or libraries such as mio or libuv.
00:13:18.933 And these all work, but they introduce
00:13:21.000 complexity into the programming model.
00:13:23.533 Is there a better way?
00:13:25.800 Maybe. This is what asynchronous I/O and
00:13:28.600 coroutines are about, which we'll talk about
00:13:30.633 in the next part.
Part 2: async and await
The second part of the lecture discusses how coroutines and asynchronous code can be used to support I/O multiplexing on a single thread. It reviews what is a coroutine, and the way coroutines can execute concurrently to their caller by repeatedly yielding results; and it reviews how this is implemented. The way this can be used to support asynchronous I/O operations is outlined, leading to a description of asynchronous functions in Python and Rust. The need for runtime support is outlined.
00:00:00.500 In this part, I want to talk
00:00:03.000 about coroutines and asynchronous code, and the
00:00:04.900 runtime support needed to execute asynchronous code.
00:00:10.000 As we discussed in the previous part,
00:00:12.066 blocking I/O is problematic.
00:00:15.000 Code, like we see in the example
00:00:16.533 at the top of the slide,
00:00:18.533 that calls blocking I/O functions, such as
00:00:20.900 read(), stalls the execution of the program
00:00:23.700 while waiting for those calls to complete.
00:00:27.000 And the work-arounds for this, using multiple
00:00:29.300 threads or asynchronous I/O functions, such as
00:00:31.933 select(), require extensive restructuring of the code.
00:00:36.633 The goal of using coroutines with asynchronous
00:00:39.133 code is to allow I/O and computation
00:00:42.066 to be performed concurrently on a single
00:00:44.200 thread, without restructuring the code. The hope
00:00:48.533 is that this will avoid the overheads of multithreading,
00:00:51.200 while retaining the original code structure.
00:00:54.233 Essentially we hope to transform the blocking
00:00:56.800 code shown at the top of the
00:00:58.333 slide, into the asynchronous, non-blocking, code such
00:01:02.033 as that shown at the bottom.
00:01:04.166 And provide the language runtime with the
00:01:05.900 ability to execute those asynchronous functions concurrently
00:01:09.933 with low-overhead asynchronous I/O operations.
00:01:16.000 The programming model we’re considering structures the
00:01:18.633 code as a set of concurrent coroutines
00:01:21.333 that accept data from I/O sources and
00:01:23.800 yield in place of blocking.
00:01:27.000 The coroutines execute concurrently, overlapping I/O and
00:01:30.366 computation, all within a single thread of
00:01:33.200 execution, and so avoid the overhead of multithreading.
00:01:37.700 To understand this programming model, though,
00:01:40.266 we must first ask what is a coroutine?
00:01:44.266 Well, if we consider a normal function,
00:01:47.300 we see that it’s called, executes for
00:01:49.466 a while, and returns a result.
00:01:53.000 A coroutine, in contrast, has the ability
00:01:55.466 to pause its execution. Rather than return
00:01:58.866 a single value, or even a list
00:02:00.866 of values, it lazily generates a sequence of values.
00:02:05.800 The slide shows an example, written in
00:02:07.766 Python. In this case, the countdown() function
00:02:11.500 is a coroutine that yields a sequence
00:02:13.633 of integers, counting down to zero from
00:02:16.566 the value given as its argument.
00:02:19.100 We see that calling countdown() with the
00:02:21.033 parameter 5, yields the values 5,
00:02:23.833 4, 3, 2, and 1, and that
00:02:26.966 these can be processed by a for loop.
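A sketch of the countdown() coroutine as described (the slide's exact code may differ slightly):

```python
def countdown(n):
    # A coroutine: each value is yielded lazily, one per resumption,
    # rather than building and returning a complete list.
    while n > 0:
        yield n
        n -= 1

values = list(countdown(5))   # the for-loop protocol, driven by list()
print(values)  # → [5, 4, 3, 2, 1]
```

Calling countdown(5) runs none of the body; it just builds the generator object, and each next() call executes up to the following yield.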
00:02:30.000 Importantly, the countdown() function is not returning
00:02:32.933 a single value, comprising a list of
00:02:35.033 numbers counting down. Rather, calling countdown() returns
00:02:39.166 a generator object.
00:02:42.000 By itself, a generator object does nothing.
00:02:46.000 But the generator object implements a next()
00:02:48.133 method. And the for loop protocol in
00:02:51.066 Python takes a generator object and repeatedly
00:02:53.900 calls that next() method.
00:02:56.900 And each time next() is called,
00:02:58.800 the function executes until it reaches the
00:03:00.766 yield statement, then returns the next value.
00:03:04.400 Or it executes until the function ends,
00:03:07.433 when it returns None to indicate that
00:03:09.466 the generator has completed.
00:03:12.766 Essentially, the function is turned into a
00:03:14.666 heap allocated generator object that maintains state,
00:03:18.466 executes lazily in response to calls to
00:03:20.900 next(), and repeatedly yields the different values.
00:03:27.000 Coroutines in Python do nothing until the
00:03:29.666 next() function is called on the generator
00:03:31.566 object representing the coroutine. Normally this happens
00:03:35.433 automatically, as part of the operation of
00:03:37.900 a for loop, but we can also
00:03:39.966 call it manually, as we see here.
00:03:43.200 In this example, the grep() function is
00:03:45.600 a coroutine, and the call to
00:03:48.066 g = grep(“python”) instantiates the generator object.
00:03:53.000 But instantiating the generator doesn’t cause it to run.
00:03:56.866 The print() call at the
00:03:58.266 start of the grep() function doesn’t execute
00:04:00.566 until we call the next() method,
00:04:02.166 for example, forcing the coroutine to run
00:04:05.733 until it yields.
00:04:08.000 In this case, the function yields to
00:04:09.966 consume a value, so we call send()
00:04:11.966 rather than next(), and pass in a
00:04:13.800 value. And we do this repeatedly,
00:04:16.433 passing in different values each time,
00:04:19.000 and each time causing a single iteration
00:04:21.100 of the while loop in the grep()
00:04:22.766 function to execute.
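A sketch of the grep() example just described; the slide's version presumably prints its matches, but here they are also collected into a list so the effect is easy to observe (the matches parameter is an addition for illustration, not part of the original):

```python
def grep(pattern, matches):
    # The body runs only when driven: next() advances to the first
    # yield, and each send() delivers one line for it to consume.
    print("looking for", pattern)
    while True:
        line = yield
        if pattern in line:
            matches.append(line)

found = []
g = grep("python", found)     # instantiates the generator: nothing runs yet
next(g)                       # advance to the first yield; prints the banner
g.send("no snakes here")      # consumed, no match
g.send("python generators")   # consumed and matched
print(found)  # → ['python generators']
```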
00:04:27.000 We see that the coroutine is a
00:04:28.633 function that executes concurrently to the rest
00:04:30.733 of the code.
00:04:32.533 It’s event driven. It only executes when
00:04:35.700 the runtime calls its next() or send()
00:04:37.833 method, which causes its execution to resume
00:04:40.933 until it next yields, at which point
00:04:42.900 control passes back to the runtime.
00:04:46.266 In the examples, we’ve only had a
00:04:48.233 single coroutine executing at once, but it’s
00:04:51.400 entirely possible to start several different coroutines,
00:04:54.466 and have the runtime loop, calling their
00:04:56.566 next() methods. This will cause the different
00:04:59.700 coroutines to execute concurrently, each one executing
00:05:02.566 for a while until it yields to the next.
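A toy illustration of this, assuming simple generators that yield values until exhausted; the runtime is just a loop over their next() methods:

```python
def worker(name, count):
    # A coroutine that yields a sequence of values, then finishes.
    for i in range(count):
        yield (name, i)

def run_round_robin(tasks):
    # Resume each coroutine in turn: it runs until it yields, then
    # control passes to the next coroutine in the queue.
    order = []
    while tasks:
        task = tasks.pop(0)
        try:
            order.append(next(task))
            tasks.append(task)        # still running: requeue it
        except StopIteration:
            pass                      # finished: drop it
    return order

order = run_round_robin([worker("a", 2), worker("b", 2)])
print(order)  # → [('a', 0), ('b', 0), ('a', 1), ('b', 1)]
```

The interleaved output shows the two coroutines executing concurrently within one thread, each running only between yields.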
00:05:06.033 It’s an approach that’s sometimes known as
00:05:08.100 cooperative multitasking.
00:05:10.800 The system context switches each time a
00:05:12.966 coroutine yields the processor, and if it
00:05:15.533 doesn’t yield, it keeps running.
00:05:18.400 This is how Microsoft Windows 3.1,
00:05:21.033 and the Macintosh System 7, handled multitasking.
00:05:26.233 But it gives us a basis for
00:05:27.800 efficient I/O handling within a single thread.
00:05:32.000 We structure the code to execute within
00:05:33.966 a thread as a set of coroutines
00:05:36.766 that trigger an asynchronous I/O operation,
00:05:39.433 and yield rather than blocking on I/O.
00:05:43.566 And we label the functions as being
00:05:46.100 async. This is a label that tells
00:05:49.233 the language runtime that those functions are
00:05:51.033 coroutines that call asynchronous I/O operations.
00:05:55.700 And I/O operations that would normally block
00:05:59.066 are labelled in the code with an await tag.
00:06:02.800 This causes the coroutine to
00:06:04.333 trigger the asynchronous version of that I/O
00:06:06.400 operation, then yield, passing control to another
00:06:10.066 coroutine while the I/O is performed.
00:06:14.000 This provides concurrent I/O, without parallelism.
00:06:18.166 The coroutines operate concurrently,
00:06:20.566 but within a single thread.
00:06:23.000 The calls to await tell the kernel
00:06:25.566 to start an asynchronous I/O operation,
00:06:28.233 and yield the file descriptor representing that
00:06:30.600 operation to the runtime.
00:06:33.000 The runtime operates in a loop.
00:06:36.033 It repeatedly calls select(), and adds the
00:06:39.033 coroutines that yielded any readable or writable
00:06:42.166 file descriptors to the end of a
00:06:44.400 run queue. Then, it resumes the coroutine
00:06:47.666 at the head of the run queue,
00:06:49.000 calling its next() or send() method as appropriate.
00:06:52.300 And when that coroutine yields, and returns
00:06:55.033 a file descriptor, it’s moved to the
00:06:57.500 list of blocked tasks, and its file
00:07:00.233 descriptor is added to the set to be polled in future.
00:07:04.000 And the loop continues, calling select() again.
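The loop just described can be sketched as a toy runtime, in which coroutines yield the socket they are waiting on; this is a simplification of what real runtimes do (they distinguish reads from writes, track timeouts, and so on):

```python
import select
import socket

def run(coros):
    # Toy runtime: resume each runnable coroutine until it yields a
    # socket, then select() on the blocked sockets and move the ready
    # ones back onto the run queue.
    runnable = list(coros)
    blocked = {}                      # socket -> coroutine waiting on it
    while runnable or blocked:
        while runnable:
            coro = runnable.pop(0)    # resume the head of the run queue
            try:
                sock = next(coro)     # runs until the coroutine yields
                blocked[sock] = coro  # now waiting on that socket
            except StopIteration:
                pass                  # the coroutine has finished
        if blocked:
            readable, _, _ = select.select(list(blocked), [], [])
            for sock in readable:
                runnable.append(blocked.pop(sock))

received = []

def reader(sock):
    yield sock                        # "await": wake me when readable
    received.append(sock.recv(1024))

a, b = socket.socketpair()
b.send(b"ping")
run([reader(a)])
print(received)  # → [b'ping']
```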
00:07:09.400 An async function is therefore a function
00:07:11.966 that performs asynchronous I/O operations and that
00:07:15.566 can operate as a coroutine. The slide
00:07:18.666 shows an example in Python.
00:07:21.500 The async functions are executed asynchronously by
00:07:24.400 the runtime, in response to I/O events,
00:07:27.333 and the program is written as a
00:07:29.233 set of async functions.
00:07:32.000 The main() function calls into the runtime
00:07:34.533 giving it the top-level async function to
00:07:36.366 run. This starts the asynchronous runtime,
00:07:39.733 starts it polling for I/O, and runs
00:07:42.133 until that async function completes.
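In Python this model is provided by the standard asyncio module; a minimal sketch, where asyncio.sleep stands in for an asynchronous I/O operation:

```python
import asyncio

async def fetch(name, delay):
    # await yields control to the runtime while the operation
    # (here just a timer) is pending.
    await asyncio.sleep(delay)
    return name

async def main():
    # Run two async functions concurrently, on a single thread.
    return await asyncio.gather(fetch("a", 0.02), fetch("b", 0.01))

results = asyncio.run(main())   # start the runtime; run until done
print(results)  # → ['a', 'b']
```

asyncio.run() plays the role described on the slide: main() hands the top-level async function to the runtime, which polls for I/O until it completes.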
00:03:45.566 It’s a widely supported model. Coroutines,
00:03:49.166 in the form of async functions,
00:03:54.466 exist in Python, in Rust, and in other languages.
00:07:58.833 Within an async function, await statements cause
00:08:01.933 the function to yield control to the
00:08:03.566 runtime while an asynchronous I/O operation is performed.
00:08:08.800 Executing an await statement yields control to
00:08:11.733 the runtime. It puts the coroutine into
00:08:14.666 a queue to be woken at some
00:08:16.333 later time, when the I/O operation has completed.
00:08:20.300 If another coroutine is ready to execute,
00:08:23.100 then the runtime schedules the yielding function
00:06:25.800 to wake up once the I/O completes,
00:08:28.333 and control passes to that other coroutine.
00:06:31.866 Otherwise, the runtime blocks until either this,
00:08:34.600 or some other, I/O operation becomes ready,
00:08:37.333 then passes control back to the corresponding
00:08:39.466 async function.
00:08:44.000 The resulting asynchronous code
00:08:46.666 follows the structure of the blocking code.
00:08:50.000 If we look at the async version
00:08:52.433 of the read_exact() function in Rust,
00:08:54.500 for example, we see that the only
00:08:56.900 differences are the async and await annotations,
00:08:59.366 and that the input is declared to
00:09:01.066 be something that implements the AsyncRead trait,
00:09:03.033 rather than the Read trait.
00:09:06.000 The code structure remains unchanged, aside from
00:09:09.300 the call to main that wraps the
00:09:10.733 async functions into a call to start
00:09:12.333 the runtime. And the compiler and runtime
00:09:15.300 work together to generate code that efficiently
00:09:17.766 executes the asynchronous I/O operations.
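The transcript describes the Rust slide; the same transformation can be illustrated in Python, where the loop is unchanged apart from the async and await annotations (asyncio.StreamReader is used here purely to feed in some test data):

```python
import asyncio

async def read_exact(reader, n):
    # Same loop as the blocking version: the only changes are the
    # async annotation and the await on each read, which yields to
    # the runtime instead of blocking the thread.
    buf = bytearray()
    while len(buf) < n:
        chunk = await reader.read(n - len(buf))
        if not chunk:
            raise EOFError("stream ended early")
        buf.extend(chunk)
    return bytes(buf)

async def demo():
    reader = asyncio.StreamReader()
    reader.feed_data(b"hello world")
    reader.feed_eof()
    return await read_exact(reader, 5)

data = asyncio.run(demo())
print(data)  # → b'hello'
```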
00:09:24.000 How is this implemented in Rust?
00:09:27.000 Well, in the Python code we saw
00:09:28.966 earlier, the coroutine was instantiated as a
00:09:31.566 generator object with a next() method that
00:09:34.200 allowed it to run and yield the next value.
00:09:37.933 Rust does something similar. The async functions
00:09:42.166 are compiled into instances of structs that
00:09:44.966 maintain the function state, and that implement
00:09:47.333 a trait known as Future. The Future
00:09:51.400 trait has a member type that describes
00:09:54.300 the return value, and defines a poll()
00:09:56.966 function that runs the function until it
00:09:59.533 yields an instance of an enum.
00:10:01.266 And that’s either Ready, with the yielded
00:10:03.533 value, or Pending to indicate that the
00:10:05.966 async function is waiting for I/O.
00:10:09.000 The details differ between Rust and Python,
00:10:11.866 as you might expect, but the concepts are the same.
00:10:19.000 And that concludes this discussion of coroutines
00:10:21.066 and asynchronous code.
00:10:23.000 In the next part, I’ll talk briefly
00:10:25.100 about how to use async functions,
00:10:27.100 and about the advantages and disadvantages of
00:10:29.700 this approach to structuring code.
Part 3: Design Patterns for Asynchronous Code
The final part of the lecture discusses how to structure and compose asynchronous functions, and reviews the need to avoid blocking operations and long-running calculations in asynchronous functions. The benefits and problems of the asynchronous programming model are discussed.
00:00:00.366 In this final part, I want to
00:00:02.000 talk about some design patterns for asynchronous code,
00:00:04.900 and about the advantages and disadvantages
00:00:06.966 of asynchronous programming.
00:00:09.700 I’ll start by talking about some of the design patterns.
00:00:12.100 I’ll talk about how you compose future values,
00:00:14.900 about the need to avoid blocking I/O,
00:00:17.000 and the need to avoid long-running computations
00:00:19.400 in asynchronous code.
00:00:23.233 So in writing async functions, I think
00:00:25.800 the goal should be to make these
00:00:27.266 functions as small, and as limited scope, as possible.
00:00:30.966 An async function should perform a single,
00:00:33.366 well-defined task. It should read and parse
00:00:36.400 a file, or it should read,
00:00:38.200 parse, process, and respond to a network
00:00:41.100 request, for example.
00:00:43.333 And if functions are structured in this
00:00:45.100 way, they tend to be fairly straightforward,
00:00:47.566 written in a fairly natural style,
00:00:49.900 and compose pretty straightforwardly.
00:00:53.866 The Rust async libraries provide some combinator
00:00:57.900 functions that can help compose futures.
00:01:02.400 That can help combine future values
00:01:04.600 and produce a new value.
00:01:06.766 And there are functions, such as read_exact(),
00:01:09.400 which allow it to read an
00:01:11.500 exact number of bytes, such as select()
00:01:14.333 to allow it to respond to different
00:01:17.866 Futures which
00:01:21.366 are operating concurrently, and functions such as
00:01:23.833 for_each() and and_then(), which can ease the
00:01:26.266 composition of the asynchronous functions.
00:01:29.433 And sometimes these are helpful, sometimes they
00:01:31.600 just obfuscate the code. But there are
00:01:34.366 a number of functions that can work
00:01:36.600 with, and combine, Futures in the cases that they’re useful.
00:01:42.733 When writing asynchronous code, there’s two fundamental
00:01:47.166 constraints that you need to be aware of.
00:01:51.266 The first, due to the nature of
00:01:54.533 asynchronous code, is that it multiplexes multiple
00:01:58.600 I/O operations onto a single thread.
00:02:02.333 And the runtime has to provide asynchronous-aware
00:02:05.300 versions of all of these different I/O operations.
00:02:09.366 It has to provide asynchronous reads and
00:02:11.833 writes to files, asynchronous reads and writes
00:02:14.800 to the network, TCP, UDP, and Unix
00:02:17.966 sockets, and to other types of network protocols.
00:02:21.500 And in all of these cases,
00:02:22.833 it has to provide a non-blocking version
00:02:24.866 of the I/O operation, that returns a
00:02:27.100 Future that can interact with the runtime.
00:02:29.900 Rather than natively calling the blocking function
00:02:34.066 provided by the operating system, it has
00:02:36.900 to take the underlying operating-system-provided
00:02:39.300 asynchronous I/O operations,
00:02:42.600 and wrap them into Futures.
00:02:47.633 And, importantly,
00:02:48.666 it doesn't interact well with blocking I/O.
00:02:52.233 If you call the synchronous version of
00:02:55.700 read(), for example, to read from a
00:02:57.600 file, and that blocks, it will block
00:02:59.933 the entire runtime, because the runtime is
00:03:02.800 operating within the context of a single thread,
00:03:06.500 and the underlying system doesn't know about Futures.
00:03:11.533 The programmer has to have the discipline
00:03:15.233 to avoid calling the blocking functions of
00:03:17.600 the code, otherwise the entire asynchronous runtime
00:03:20.766 grinds to a halt while that blocking
00:03:23.733 operation completes.
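A small demonstration of the problem, using a synchronous time.sleep() as a stand-in for a blocking read():

```python
import asyncio
import time

async def ticker(ticks):
    # A well-behaved coroutine: awaits between pieces of work.
    await asyncio.sleep(0.01)
    ticks.append(time.monotonic())

async def blocking_task():
    # A badly-behaved coroutine: this synchronous sleep stands in for
    # a blocking read(), and stalls the entire event loop while it runs.
    time.sleep(0.1)

async def main():
    ticks = []
    start = time.monotonic()
    await asyncio.gather(ticker(ticks), blocking_task())
    # The ticker's 0.01s timer cannot fire until the blocking call
    # releases the thread, so its tick lands after roughly 0.1s.
    return ticks[0] - start

print(asyncio.run(main()))
```

Even though the ticker only asked to sleep for 10ms, it is delayed for the full duration of the blocking call, because the runtime never gets control back in the meantime.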
00:03:28.466 And this, to some extent, runs the
00:03:30.566 risk of fragmenting the ecosystem. It means
00:03:33.433 that libraries which are supposed to be
00:03:35.166 used in an asynchronous context have to
00:03:38.233 be written to use asynchronous functions,
00:03:40.533 have to be written to use await,
00:03:42.266 have to be written to use the
00:03:43.833 asynchronous versions of the I/O libraries.
00:03:46.333 And if anyone, any of the library
00:03:49.333 authors, implementing any of those libraries forgets,
00:03:51.966 then you run the risk of that
00:03:54.000 blocking the runtime,
00:03:55.600 losing concurrency and losing performance.
00:03:59.900 And this is a potential source of
00:04:01.666 bugs, because the Rust compiler, the Rust
00:04:05.500 language, can't catch this sort of behaviour.
00:04:09.033 The programmer is required to have the
00:04:11.333 discipline to avoid the blocking I/O operations.
00:04:16.600 Similarly, the programmer has to avoid long
00:04:20.600 running computations.
00:04:23.500 Control passing between different Futures, between different
00:04:27.933 async functions, is explicit. Control passing happens
00:04:31.833 when you call await, to wait
00:04:34.400 for an I/O operation.
00:04:36.200 At that point, the next runnable Future,
00:04:38.500 the next runnable async function, is scheduled.
00:04:44.466 But in the same way that calling
00:04:46.133 blocking functions is problematic, because it causes
00:04:48.900 the runtime to stop while that function
00:04:51.166 blocks, if instead you don't call await,
00:04:54.866 and you perform some long running computation,
00:04:58.266 the runtime won't know, won’t be able
00:05:01.466 to switch away from that computation,
00:05:04.133 and it will starve the other tasks from running.
00:05:07.533 If you have a long running computation,
00:05:09.900 you need to spawn a separate thread,
00:05:13.766 and explicitly pass messages to and from
00:05:16.900 that thread, to avoid
00:05:19.466 starving the other asynchronous computations.
00:05:23.200 And again, the language, the runtime,
00:05:25.700 doesn't help with this. The programmer needs
00:05:28.366 to be aware that you must avoid
00:05:30.233 long running computations, as well as blocking
00:05:32.500 I/O, in the context of an asynchronous runtime.
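That offloading pattern can be sketched with only std threads and channels. The `expensive_sum` function here is a hypothetical stand-in for any CPU-heavy work; a real async program would receive the result through an async-aware channel (or use a helper such as Tokio's `spawn_blocking`), but the shape is the same:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for a long-running, CPU-bound computation
// that would starve an async runtime if run directly in a task.
fn expensive_sum(n: u64) -> u64 {
    (1..=n).sum()
}

// Run the computation on its own OS thread, receiving the result
// over a channel, so the calling thread stays free for other work.
fn offload_sum(n: u64) -> u64 {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        tx.send(expensive_sum(n)).unwrap();
    });
    rx.recv().unwrap()
}

fn main() {
    println!("{}", offload_sum(1_000_000)); // 500000500000
}
```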
00:05:36.433 So we're getting good performance, but we're
00:05:38.666 getting good performance by limiting what the
00:05:41.666 programmer can do in the async runtime,
00:05:44.600 and by requiring them to have the
00:05:46.100 discipline to make sure that they follow those limitations.
00:05:54.400 So, is the asynchronous approach to programming
00:05:57.833 a good approach?
00:05:59.633 Should we all be switching to asynchronous
00:06:01.700 code in the future?
00:06:09.900 The use of async and await lets
00:06:12.933 us structure the code in a way
00:06:17.733 that allows us to efficiently multiplex large
00:06:20.466 numbers of I/O operations on a single thread.
00:06:25.566 And this can give a very natural
00:06:27.433 programming model when you're performing operations which
00:06:30.533 are very heavily I/O bound.
00:06:33.400 It lets us structure code, which performs
00:06:35.866 asynchronous non-blocking I/O, in a way that
00:06:38.933 looks very similar to code that uses blocking I/O.
00:06:42.366 That can efficiently multiplex multiple I/O operations
00:06:46.500 onto a single thread,
00:06:49.066 efficiently allow them to run concurrently,
00:06:51.666 without the overheads of starting up multiple threads.
00:06:56.566 So for I/O bound tasks, this can
00:06:59.333 be very, very efficient, and very natural.
00:07:07.600 It's problematic, if there are blocking operations,
00:07:11.233 as we saw a minute ago,
00:07:13.266 because the blocking operations lock-up the entire
00:07:16.000 runtime, and not just that one task.
00:07:19.000 And it means all the libraries,
00:07:21.500 that use blocking calls need to be
00:07:23.033 updated to use these asynchronous I/O operations.
00:07:27.933 And this either means everything has to
00:07:30.100 use asynchronous I/O, or people need to
00:07:32.633 build two versions of all the libraries.
00:07:34.666 Some that are synchronous, and use the
00:07:36.466 blocking operations, and some which are asynchronous.
00:07:40.666 It’s also problematic when it comes to
00:07:42.666 long running computations. Again, as we saw,
00:07:45.966 they starve the other tasks, because the
00:07:48.133 runtime only switches away when you call
00:07:51.466 asynchronous operations, when you call await.
00:07:55.433 And so, if you're trying to mix
00:07:57.800 long-running, compute-heavy,
00:07:59.100 functions with asynchronous I/O functions
00:08:01.800 and I/O-bound tasks, it gets to be
00:08:04.700 problematic, and tends to starve the I/O tasks.
00:08:08.866 And this is a problem which is
00:08:10.666 familiar to anyone who wrote code on
00:08:12.300 Windows 3.1 or on the Macintosh System 7.
00:08:17.300 And it led to real interactivity problems
00:08:20.600 with those applications, where applications would just
00:08:25.366 not yield the CPU, and it would
00:08:27.833 prevent the multitasking from working for a while.
00:08:31.166 I worry that by promoting asynchronous code,
00:08:35.466 we're just introducing the same hard to debug
00:08:39.433 problems with task starvation, into the next
00:08:43.200 generation of applications.
00:08:46.600 The asynchronous I/O works really well when
00:08:49.500 you have a lot of
00:08:51.833 I/O-bound tasks. It doesn't mix well with
00:08:54.966 compute-bound tasks.
00:09:01.733 And, to some extent, I wonder whether
00:09:03.800 we really need the asynchronous I/O? Whether
00:09:07.033 we really need this for performance?
00:09:10.000 It's certainly true that threads are more
00:09:12.766 expensive than async functions, and async tasks,
00:09:17.466 in the runtime.
00:09:20.966 But threads are not that expensive.
00:09:24.600 A properly configured modern machine can run
00:09:28.066 many, many thousands of threads,
00:09:29.633 without any great difficulty.
00:09:32.966 The laptop I’m recording this lecture on,
00:09:35.233 for example, which is a low-end MacBook
00:09:38.533 Air with a Core i5 processor,
00:09:41.566 is running about 2200 threads in normal everyday use.
00:09:47.700 And if you look up the documentation
00:09:50.233 for, for example, the Varnish web cache,
00:09:53.000 which is a caching web proxy that’s
00:09:56.000 quite popular in data centres,
00:09:59.033 the documentation says it's common to configure
00:10:02.566 this with 500 or 1000 threads,
00:10:05.066 at minimum, but they rarely recommend running
00:10:08.800 more than 5000 threads.
00:10:12.766 And, I think, unless you're doing something
00:10:14.866 very, very unusual, it's likely you can
00:10:17.866 just spawn a thread, or use a
00:10:20.100 pre-configured thread pool, and perform blocking I/O,
00:10:25.666 and just communicate using channels, and the
00:10:28.166 performance will be just fine.
00:10:30.833 Even if this means spawning thousands of threads.
00:10:35.866 Modern servers can run thousands, or tens
00:10:39.500 of thousands, of simultaneous threads without any
00:10:42.000 great difficulty.
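A minimal sketch of that thread-per-task, blocking style, again using only std: each task blocks independently on its own thread and reports back over a shared channel. The `sleep` is a stand-in for a blocking socket `read()`:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// One thread per "connection": each blocks independently, without
// affecting the others, and reports back over a shared channel.
fn serve(connections: usize) -> usize {
    let (tx, rx) = mpsc::channel();
    for id in 0..connections {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(5)); // stand-in for a blocking read()
            tx.send(id).unwrap();
        });
    }
    drop(tx); // close our copy so rx.iter() ends once all threads finish
    rx.iter().count()
}

fn main() {
    println!("{}", serve(1000)); // all 1000 "connections" serviced
}
```

No async runtime, no discipline about await points: a thread that blocks or computes for a long time delays only its own task.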
00:10:44.833 Threading is not that expensive these days.
00:10:49.400 And asynchronous I/O can give a performance benefit.
00:10:54.000 But that performance benefit is usually not that great.
00:10:59.533 So my recommendation is,
00:11:01.600 choose asynchronous programming because you prefer the
00:11:04.300 programming style, if you like.
00:11:06.800 But don't choose it for performance reasons,
00:11:12.066 unless you're really sure that it will
00:11:13.766 improve performance. Threading is not as expensive
00:11:16.433 as you think.
00:11:21.466 And that concludes our discussion of coroutines
00:11:24.200 and asynchronous I/O.
00:11:26.566 As we've seen, blocking I/O can be
00:11:29.166 problematic. It's problematic because it forces you
00:11:33.000 to write multi-threaded code, which has
00:11:35.333 a reputation for being high overhead.
00:11:38.400 Or it's problematic because you have to
00:11:40.500 structure your code as a select loop.
00:11:43.000 And in both of these cases there's
00:11:45.466 a restructuring of the code needed.
00:11:49.033 The use of coroutines and asynchronous code,
00:11:52.100 in the best cases, can give you
00:11:55.100 a structure for I/O-heavy code, which allows
00:11:59.066 you to perform asynchronous I/O operations without
00:12:02.166 greatly restructuring the code, and just involves
00:12:05.333 some small number of annotations,
00:12:07.900 and switching to use the async version of the runtime.
00:12:12.733 And, in those cases it's quite a
00:12:14.566 natural programming model, and it works very well.
00:12:18.500 It does, though, run the risk of
00:12:20.133 fragmenting the ecosystem into async-aware,
00:12:22.666 and non-async-aware functions, and I think
00:12:26.000 it runs the risk of
00:12:28.100 introducing hard to find bugs with blocking
00:12:30.733 code, and with CPU hogging code.
00:12:35.833 Is it worth it?
00:12:38.600 I don't know, maybe.
00:12:41.033 It gives very natural code in some
00:12:42.866 cases, and very high performance code in
00:12:45.600 some cases, I think it can work
00:12:47.600 very well in those cases.
00:12:50.100 In other cases, in other applications,
00:12:53.533 the multi-threaded, blocking, version of the code
00:12:56.966 is just as natural to write,
00:12:59.666 and also scales very well.
00:13:02.933 Different applications have different requirements, and I
00:13:05.966 would encourage experimentation, rather than just diving
00:13:09.500 straight in to using asynchronous functions and await.
00:13:17.166 In the next lecture, we’ll move on,
00:13:19.233 and instead of talking about concurrency,
00:13:21.466 we'll move on to talk about security.
Lecture 8 discussed coroutines and asynchronous programming. It started by explaining why blocking I/O is problematic: I/O operations are slow and block execution of the thread while they're performed. The lecture then outlined the two traditional approaches to addressing these concerns: either spawning an additional thread to perform the I/O, or introducing asynchronous I/O primitives, backed by kernel threads, to perform I/O on behalf of the user program. It noted that both of these approaches work, but threads have relatively high overhead, and both approaches require significant code restructuring.
The second part of the lecture then discussed the use of coroutines and asynchronous code, in the form of the async and await primitives. These allow asynchronous execution of I/O operations with low overhead and only limited restructuring of the code: to label functions as being able to execute as asynchronous coroutines, and to label scheduling points where the system may block due to I/O. The result is widely used, efficient, and, in many cases, requires only limited code changes.
The lecture also discussed, briefly, the runtime support needed to support asynchronous code in the form of coroutines, and outlined how this is implemented in Rust using polling and a runtime such as Tokio. It noted the limitations around avoiding blocking operations and long-running computations, and noted the similarity of these to the cooperative multitasking used in Mac System 7 and Windows 3.1.
Finally, the lecture considered the overheads and costs of using asynchronous I/O versus multiple threads.
Discussion will focus on the need for asynchronous I/O operations using coroutines, and the claimed performance benefits versus the costs of restructuring the code.