Networked Systems H (2021-2022)
Lecture 6: Lowering Latency
This lecture discusses some of the factors that affect the latency of a TCP connection. It considers TCP congestion control, the TCP Reno and Cubic congestion control algorithms, and their behaviour and performance in terms of throughput and latency. It then considers alternative congestion control algorithms, such as TCP Vegas and BBR, and the use of explicit congestion notification (ECN), as options to lower latency. Finally, it considers the impact of sub-optimal Internet paths on latency, and the rationale for deploying low-Earth orbit satellite constellations to reduce the latency of Internet paths.
Part 1: TCP Congestion Control
This first part of the lecture outlines the principle of congestion control. It discusses packet loss as a congestion signal, conservation of packets in flight, and the additive increase, multiplicative decrease requirements for stability.
00:00:00.566 In this lecture I’d like to move
00:00:02.400 on from talking about how to transfer
00:00:04.800 data reliably, and talk about mechanisms and
00:00:07.866 means by which transport protocols go about
00:00:10.366 lowering the latency of the communication.
00:00:15.466 One of the key limiting factors of
00:00:17.966 performance of network systems, as we've discussed
00:00:20.633 in some of the previous lectures, is latency.
00:00:25.000 Part of that is the latency for
00:00:26.800 establishing connections, and we've spoken about that
00:00:29.166 in detail already, where a lot of
00:00:31.566 the issue is the number of round
00:00:33.933 trip times needed to set up a connection.
00:00:37.400 And, especially when secure connections are in
00:00:40.700 use, if you're using TCP and TLS,
00:00:43.766 for example, as we discussed, there’s a
00:00:46.033 large number of round trips needed to
00:00:47.766 actually get to the point where you
00:00:49.366 can establish a connection, negotiate security parameters,
00:00:53.266 and start to exchange data.
00:00:55.366 And we've already spoken about how the
00:00:58.166 QUIC Transport Protocol
00:01:00.566 has been developed to try and improve
00:01:03.233 latency in terms of establishing a connection.
00:01:06.166 The other aspect of latency, and reducing
00:01:08.566 the latency of communications, is actually in
00:01:10.966 terms of data transfer.
00:01:13.133 How you deliver data across the network
00:01:15.833 in a way which doesn't lead to
00:01:19.233 excessive delays, and how you can gradually
00:01:23.033 find ways of reducing the latency,
00:01:25.733 and making the network better suited to
00:01:29.200 real time applications, such as telephony,
00:01:31.833 and video conferencing, and gaming, and high
00:01:34.500 frequency trading, and
00:01:38.000 Internet of Things, and control applications.
00:01:43.666 A large aspect of that is in
00:01:44.966 terms of how you go about building
00:01:46.866 congestion control, and a lot of the
00:01:48.800 focus in this lecture is going to
00:01:50.700 be on how TCP
00:01:52.300 congestion control works, and how other protocols
00:01:54.700 perform congestion control, to deliver data in a
00:01:57.533 low latency manner.
00:01:59.400 But I’ll also talk a bit about
00:02:01.666 explicit congestion notification, and changes to the
00:02:04.466 way queuing happens in the network,
00:02:07.600 and about services such as SpaceX’s StarLink
00:02:09.700 which are changing the way the network
00:02:12.500 is built to reduce latency.
00:02:17.500 I want to start by talking about congestion control,
00:02:20.300 and TCP congestion control in particular.
00:02:26.800 And, what I want to do in
00:02:28.666 this part, is talk about some of
00:02:30.566 the principles of congestion
00:02:32.233 control. And talk about what is the
00:02:34.333 problem that's being solved, and how can
00:02:36.700 we go about adapting the rate at
00:02:39.533 which a TCP connection delivers data over
00:02:42.133 the network
00:02:43.633 to make best use of the network
00:02:45.366 capacity, and to do so in a
00:02:47.400 way which doesn't build up queues in
00:02:49.600 the network and induce too much latency.
00:02:52.300 So in this part I’ll talk about
00:02:54.566 congestion control principles. In the next part
00:02:57.066 I move on to talk about loss-based
00:02:59.433 congestion control, and talk about TCP Reno
00:03:02.566 and TCP Cubic,
00:03:04.233 which are ways of making very effective
00:03:06.400 use of the overall network capacity,
00:03:08.900 and then move on to talk about
00:03:11.200 ways of lowering latency.
00:03:13.533 I’ll talk about latency reducing congestion control
00:03:15.866 algorithms, such as TCP Vegas or Google's
00:03:18.933 TCP BBR proposal. And then I’ll finish
00:03:22.033 up by talking a little bit about
00:03:24.433 Explicit Congestion Notification
00:03:26.400 in one of the later parts of the lecture.
00:03:31.166 TCP is a
00:03:33.666 complex and very highly optimised protocol,
00:03:38.400 especially when it comes to congestion control
00:03:41.666 and loss recovery mechanisms.
00:03:44.733 I'm going to attempt to give you
00:03:47.000 a flavour of the way congestion control
00:03:49.500 works in this lecture, but be aware
00:03:52.033 that this is a very simplified review
00:03:54.333 of some quite complex issues.
00:03:56.833 The document listed on the slide is
00:03:59.966 entitled “A roadmap for TCP Specification Documents”,
00:04:03.900 and it's the latest IETF standard that describes
00:04:08.833 how TCP works, and points to the
00:04:11.700 details of the different proposals.
00:04:15.966 This is a very long and complex
00:04:19.733 document. It’s about, if I remember right,
00:04:22.133 60 or 70 pages long.
00:04:24.133 And all it is, is a list
00:04:26.066 of references to other specifications, with one
00:04:28.533 paragraph about each one describing why that
00:04:30.833 specification is important.
00:04:32.666 And the complete specification for TCP is
00:04:35.100 several thousand pages of text. This is
00:04:37.800 a complex protocol with a lot of
00:04:40.933 features in it, and I’m necessarily giving
00:04:43.700 a simplified overview.
00:04:47.400 I’m going to talk about TCP.
00:04:50.066 I’m not going to talk much,
00:04:52.066 if at all, about QUIC in this lecture.
00:04:54.666 That's not because QUIC isn't interesting,
00:04:57.533 it's because QUIC essentially adopts the same
00:05:00.433 congestion control mechanisms as TCP.
00:05:03.433 The QUIC version one standard says to
00:05:07.233 use the same congestion
00:05:10.166 control algorithm as TCP Reno.
00:05:13.300 And, in practice, most of the QUIC
00:05:15.500 implementations use the Cubic or the BBR
00:05:19.033 congestion control algorithms,
00:05:20.933 which we'll talk about later on.
00:05:22.500 QUIC is basically adopting the same mechanisms
00:05:24.566 as TCP, and for that reason
00:05:27.366 I’m not going to talk about
00:05:30.433 it much separately.
00:05:36.966 So what is the goal of congestion
00:05:39.666 control? What are the principles of congestion control?
00:05:43.633 Well, the idea of congestion control is
00:05:46.600 to find the right transmission rate for
00:05:49.966 a connection.
00:05:51.466 We're trying to find the fastest sending
00:05:53.600 rate which you can send at to
00:05:56.100 match the capacity of the network,
00:05:58.233 and to do so in a way
00:05:59.833 that doesn't build up queues, doesn't overload,
00:06:02.766 doesn't congest the network.
00:06:05.066 So we're looking to adapt the transmission
00:06:07.333 rate of a flow of TCP traffic
00:06:09.933 over the network, to match the available
00:06:12.100 network capacity.
00:06:13.800 And as the network capacity changes,
00:06:16.100 perhaps because other flows of traffic start
00:06:19.466 up, or perhaps because you're on a
00:06:21.533 mobile device and you move into an
00:06:24.333 area with different radio coverage,
00:06:26.500 the speed at which the TCP is
00:06:29.800 delivering the data should adapt to match
00:06:31.600 the changes in available capacity.
00:06:35.966 The fundamental principles of congestion control,
00:06:41.433 as applied in TCP,
00:06:43.300 were first described by Van Jacobson,
00:06:46.500 who we see on the picture on
00:06:49.200 the top right of the slide,
00:06:51.033 in the paper “Congestion Avoidance and Control”.
00:06:56.366 And those principles are that TCP responds
00:06:59.166 to packet loss as a congestion signal.
00:07:01.933 Because the Internet is a best effort
00:07:04.966 packet network, it discards
00:07:07.300 packets if it can't deliver them,
00:07:11.900 and TCP treats that discard, that loss
00:07:14.700 of a packet, as a congestion signal,
00:07:16.900 a signal that it's sending
00:07:19.233 too fast and should slow down.
00:07:21.866 It relies on the principle of conservation
00:07:23.833 of packets. It tries to keep the
00:07:26.133 number of packets, which are traversing the
00:07:28.433 network roughly constant,
00:07:29.800 assuming nothing changes in the network.
00:07:32.966 And it relies on the principles of
00:07:34.966 additive increase, multiplicative decrease.
00:07:37.166 If it has to increase its sending
00:07:39.233 rate, it does so relatively slowly,
00:07:41.333 an additive increase in the rate.
00:07:43.466 And if it has to reduce its
00:07:44.666 sending rate, it does so quickly, a multiplicative decrease.
00:07:49.466 And these are the fundamental principles that
00:07:51.833 Van Jacobson elucidated for TCP congestion control,
00:07:55.633 and for congestion control in general.
00:07:58.800 And it was Van Jacobson who did
00:08:01.866 the initial implementation of these into TCP
00:08:05.366 in the late 1980s, about 1987, ’88, or so.
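The additive increase, multiplicative decrease rule is simple enough to sketch in a few lines. This is an illustrative sketch in Python, not code from any real TCP implementation; the function name is invented, and the constants (grow by one packet per round trip, halve on loss) are the classic values just described.

```python
# Illustrative AIMD update, applied once per round trip time.
# Hypothetical names and simplified units (window measured in packets).

def aimd_update(cwnd, packet_lost):
    """Return the new congestion window (in packets) after one RTT."""
    if packet_lost:
        return max(1.0, cwnd / 2)   # multiplicative decrease: halve quickly
    return cwnd + 1.0               # additive increase: grow by one packet

cwnd = 10.0
cwnd = aimd_update(cwnd, packet_lost=False)   # grows to 11 packets
cwnd = aimd_update(cwnd, packet_lost=True)    # loss: halves to 5.5 packets
```

The asymmetry is the point: losses cut the rate quickly, while growth is slow, which is what keeps competing flows stable.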
00:08:12.300 Since then, the algorithms, the congestion control
00:08:16.333 algorithms, for TCP in general have been
00:08:18.500 maintained by a large number of people.
00:08:20.900 A lot of people have developed this.
00:08:23.600 Probably one of the leading people in
00:08:26.733 this space for the last 20 years
00:08:30.700 or so, is Sally Floyd who was
00:08:33.166 very much responsible for taking
00:08:35.533 the TCP standards, making them robust,
00:08:39.166 pushing them through the IETF to get
00:08:41.533 them standardised, and making sure they
00:08:43.400 work and get really high performance.
00:08:46.600 And she very much drove the development
00:08:48.800 to make these robust, and effective,
00:08:51.100 and high performance standards, and to make
00:08:53.766 TCP work as well as it does today.
00:08:57.266 And Sally sadly passed away a year
00:09:00.900 or so back, which is a tremendous
00:09:03.933 shame, but we're grateful for her legacy
00:09:08.766 in moving things forward.
00:09:13.833 So to go back to the principles.
00:09:17.366 The first principle of congestion control in
00:09:20.233 the Internet, and in TCP, is that
00:09:22.833 packet loss is an indication that the
00:09:24.700 network is congested.
00:09:28.500 Data flowing across the Internet flows from
00:09:31.433 the sender to the receiver through a
00:09:33.666 series of routers. The IP routers connect
00:09:37.866 together the different links that comprise the network.
00:09:41.766 And routers perform two functions:
00:09:44.500 they perform a routing function, and a forwarding function.
00:09:50.166 The purpose of the routing function is
00:09:52.566 to figure out how packets should get
00:09:55.166 to their destination. They receive a packet
00:09:57.766 from some network link, look at the
00:09:59.733 destination IP address, and decide which direction
00:10:02.333 to forward that packet. They’re responsible for
00:10:05.100 finding the right path through the network.
00:10:08.500 But they're also responsible for forwarding,
00:10:10.566 which is actually putting the packets into
00:10:13.233 the queue of outgoing traffic for the
00:10:15.900 link, and managing that queue of packets
00:10:18.566 to actually transmit the packets across the network.
00:10:22.033 And routers in the network have a
00:10:25.366 set of different links; the whole point
00:10:28.133 of a router is to connect different
00:10:30.266 links. And at each link, they have
00:10:32.200 a queue of packets, which are enqueued
00:10:34.100 to be delivered on that link.
00:10:36.900 And, perhaps obviously, if packets are arriving
00:10:39.333 faster than the link can deliver those
00:10:41.933 packets, then the queue gradually builds up.
00:10:44.466 More and more packets get enqueued in
00:10:47.200 the router waiting to be delivered.
00:10:48.800 And if packets are arriving slower than
00:10:51.433 they can be forwarded,
00:10:54.000 then the queue gradually empties as the
00:10:57.133 packets get transmitted.
00:11:00.066 Obviously the router has a limited amount
00:11:02.133 of memory, and at some point it's
00:11:04.633 going to run out of space to
00:11:06.200 enqueue packets. So, if packets are
00:11:08.300 arriving at the router faster
00:11:12.833 than they can be delivered down the
00:11:14.633 link, the queue will build up and
00:11:16.500 gradually fill, until it reaches its maximum
00:11:18.833 size. At that point, the router has
00:11:21.133 no space to keep the newly arrived
00:11:23.666 packets, and so it discards the packets.
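The drop-tail behaviour just described, where a full queue forces the router to discard newly arriving packets, can be illustrated with a toy queue. The class, its capacity, and its names are invented for this sketch; real routers manage queues in hardware, per outgoing link.

```python
from collections import deque

# Toy drop-tail queue for a router's outgoing link (illustrative only).

class DropTailQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, packet):
        if len(self.queue) >= self.capacity:
            self.dropped += 1       # queue full: discard the packet
            return False            # this loss is TCP's congestion signal
        self.queue.append(packet)
        return True

    def dequeue(self):
        """Transmit the packet at the head of the queue, if any."""
        return self.queue.popleft() if self.queue else None

q = DropTailQueue(capacity=3)
for p in range(5):
    q.enqueue(p)          # packets arrive faster than they are sent
print(q.dropped)          # → 2: packets 3 and 4 were discarded
```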
00:11:28.133 And this is what TCP is using
00:11:30.333 as the congestion signal. It’s using the
00:11:32.666 fact that the queue of packets on
00:11:35.100 an outgoing link at a router has
00:11:37.333 filled up. When the queue fills up,
00:11:41.066 a packet gets
00:11:43.566 lost, and it uses that packet loss as
00:11:45.566 an indication that it's sending too fast.
00:11:47.933 It’s sending faster than the packets can
00:11:50.300 be delivered, and as a result the
00:11:52.666 queue has overflowed, a packet has been
00:11:55.000 lost, and so it needs to slow down.
00:11:57.966 And that's the fundamental congestion signal in
00:12:00.666 the network. Packet loss is interpreted as
00:12:03.533 a sign that devices are sending too
00:12:06.366 fast, and should go slower. And if
00:12:10.133 they slow down, the queues will gradually
00:12:12.066 empty, and packets will stop being lost.
00:12:15.366 So that's the first fundamental principle.
00:12:21.033 The second principle is that
00:12:24.800 we want to keep the number of
00:12:27.000 packets in the network roughly constant.
00:12:31.000 TCP, as we saw in the last
00:12:33.266 lecture, sends acknowledgments for packets. When a
00:12:35.866 packet is transmitted it has a sequence
00:12:38.366 number, and the response will come back
00:12:40.500 from the receiver acknowledging receipt of that
00:12:42.566 sequence number.
00:12:44.733 The general approach for TCP, once the
00:12:47.766 connection has got going, is that every
00:12:50.866 time it gets an acknowledgement, it uses
00:12:53.733 that as a signal that a packet
00:12:55.666 has been received.
00:12:57.533 And if a packet has been received,
00:12:59.233 something has left the network. One of
00:13:01.333 the packets sent into the network has
00:13:03.466 reached the other side, and has been
00:13:05.466 removed from the network at the receiver.
00:13:07.833 That means there should be space to
00:13:10.866 put another packet into the network.
00:13:13.900 And it's an approach that’s called ACK
00:13:15.866 clocking. Every time a packet arrives at
00:13:18.133 the receiver, and you get an acknowledgement
00:13:20.833 back saying it was received, that indicates
00:13:22.733 you can put another packet in.
00:13:24.766 So the total number of packets in
00:13:27.000 transit across the network ends up being
00:13:28.766 roughly constant. One packet out, you put
00:13:31.866 another packet in.
00:13:34.433 And it has the advantage that if
00:13:38.300 you're clocking out new packets in receipt
00:13:41.266 of acknowledgments, if, for some reason,
00:13:44.200 the network gets congested, and it takes
00:13:46.966 longer for acknowledgments to come back,
00:13:49.166 because it's taking longer for them to
00:13:50.700 work their way across the network,
00:13:53.566 then that will automatically slow down the
00:13:56.066 rate at which you send. Because it
00:13:58.466 takes longer for the next acknowledgment to
00:14:00.266 come back, therefore it's longer before you
00:14:02.066 send your next packet.
00:14:03.466 So, as the network starts to get
00:14:05.266 busy, as the queue starts to build
00:14:07.333 up, but before the queue has overflowed,
00:14:09.933 it takes longer for the acknowledgments to
00:14:12.233 come back, because the packets are queued
00:14:14.733 up in the intermediate links, and that
00:14:17.133 gradually slows down the behaviour of TCP.
00:14:20.166 It reduces the rate at which you can send.
00:14:23.366 So it’s, to at least some extent,
00:14:25.500 self adjusting. The network gets busier,
00:14:28.133 the ACKs come back slower, therefore you
00:14:30.066 send a little bit slower.
00:14:31.933 And that's the second principle: conservation of
00:14:34.600 packets. One out, one in.
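The one-out, one-in behaviour can be sketched as a toy sender. The class and its names are purely illustrative; a real TCP stack tracks this with byte counts and a congestion window rather than a list of sequence numbers.

```python
# Illustrative ACK-clocked sender: an acknowledgement means one packet
# has left the network, so exactly one replacement is sent.

class AckClockedSender:
    def __init__(self, window):
        self.window = window
        self.in_flight = []      # sequence numbers currently in the network
        self.next_seq = 1

    def start(self):
        # Fill the initial window of packets.
        while len(self.in_flight) < self.window:
            self.in_flight.append(self.next_seq)
            self.next_seq += 1

    def on_ack(self, seq):
        # One packet has left the network at the receiver...
        self.in_flight.remove(seq)
        # ...so one new packet is clocked out: conservation of packets.
        self.in_flight.append(self.next_seq)
        self.next_seq += 1

sender = AckClockedSender(window=6)
sender.start()
sender.on_ack(1)                 # ACK for packet 1 releases packet 7
print(len(sender.in_flight))     # → 6: the count in flight stays constant
```

If ACKs arrive more slowly because the network is getting busy, `on_ack` fires less often and the sending rate falls automatically, which is the self-adjusting behaviour described above.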
00:14:41.300 And the principle of conservation of packets
00:14:44.866 is great, provided the network is in
00:14:48.333 the steady state.
00:14:50.166 But you also need to be able
00:14:51.733 to adapt the rate at which you're sending.
00:14:54.633 The way TCP adapts is very much
00:14:57.300 focused on starting slowly and gradually increasing.
00:15:04.900 When it needs to increase its sending
00:15:07.100 rate, TCP increases linearly. It adds a
00:15:10.866 small amount to the sending rate each round trip time.
00:15:15.433 So it just gradually, slowly, increases the
00:15:18.000 sending rate. It gradually
00:15:19.500 pushes up the rate
00:15:23.233 until it spots a loss. Until it
00:15:26.566 loses a packet. Until it overflows a queue.
00:15:29.633 And then it responds to congestion by
00:15:32.166 rapidly decreasing its rate. If a congestion
00:15:36.233 event happens, if a packet is lost,
00:15:38.500 TCP halves its rate. It responds faster
00:15:42.266 than it increases, it slows down faster than it increases.
00:15:45.966 And this is the final principle,
00:15:48.166 what’s known as additive increase, multiplicative decrease.
00:15:50.866 The goal is to keep the network
00:15:53.066 stable. The goal is to not overload the network.
00:15:57.733 If you can, keep going at a
00:16:00.366 steady rate. Follow the ACK clocking approach.
00:16:03.300 Gradually, just slowly, increase the rate a
00:16:05.900 bit. Keep pushing, just in case there’s
00:16:08.366 more capacity than you think. So just
00:16:10.800 gradually keep probing to increase the rate.
00:16:13.733 If you overload the network, if you
00:16:16.333 cause congestion, if you overflow the queues,
00:16:18.566 cause a packet to be lost,
00:16:19.766 slow down rapidly. Halve your sending rate,
00:16:22.733 and gradually build up again.
00:16:24.866 The fact that you slow down faster
00:16:27.000 than you speed up, the fact that
00:16:28.966 you follow the one in, one out approach,
00:16:31.900 keeps the network stable. It makes sure
00:16:34.433 it doesn't overload the network, and it
00:16:36.166 means that if the network does overload,
00:16:38.000 it responds and recovers quickly. The goal
00:16:40.866 is to keep the traffic moving.
00:16:42.500 And TCP is very effective at doing this.
00:16:47.366 So those are the fundamental principles of
00:16:49.800 TCP congestion control. Packet loss as an
00:16:52.866 indication of congestion.
00:16:54.800 Conservation of packets, and ACK clocking.
00:16:57.500 One in, one out, where possible.
00:17:00.366 If you need to increase the sending
00:17:03.233 rate, increase slowly. If a problem happens,
00:17:06.100 decrease quickly. And that will keep the network stable.
00:17:10.200 In the next part I’ll talk about
00:17:12.466 TCP Reno, which is one of the
00:17:15.100 more popular approaches for doing this in practice.
Part 2: TCP Reno
The second part of the lecture discusses TCP Reno congestion control. It outlines the principles of window based congestion control, and describes how they are implemented in TCP. The choice of initial window, and how the recommended initial window has changed over time, is discussed, along with the slow start algorithm for finding the path capacity and the congestion avoidance algorithm for adapting the congestion window.
00:00:00.666 In the previous part, I spoke about
00:00:02.366 the principles of TCP congestion control in
00:00:04.733 general terms. I spoke about the idea
00:00:07.666 of packet loss as a congestion signal,
00:00:10.300 about the conservation of packets, and about
00:00:12.800 the idea of additive increase multiplicative decrease
00:00:15.500 – increase slowly, decrease the sending rate
00:00:17.666 quickly as a way of achieving stability.
00:00:20.333 In this part I want to talk
00:00:21.900 about TCP Reno, and some of the
00:00:24.066 details of how TCP congestion control works in practice.
00:00:27.500 I’ll talk about the basic TCP congestion
00:00:29.866 control algorithm, how the sliding window algorithm
00:00:33.033 works to adapt the sending rate,
00:00:36.100 and the slow start and congestion avoidance
00:00:39.566 phases of congestion control.
00:00:44.600 TCP is what's known as a window
00:00:48.166 based congestion control protocol.
00:00:51.100 That is, it maintains what's known as
00:00:54.633 a sliding window of data which is
00:00:56.700 available to be sent over the network.
00:00:59.800 And the sliding window determines what range
00:01:02.100 of sequence numbers can be sent by
00:01:04.500 TCP onto the network.
00:01:06.933 It uses the additive increase multiplicative decrease
00:01:11.100 approach to grow and shrink the window.
00:01:13.300 And that determines, at any point,
00:01:15.666 how much data the TCP sender can send
00:01:17.700 onto the network.
00:01:19.666 It augments these with algorithms known as
00:01:22.000 slow start and congestion avoidance. Slow start
00:01:24.933 being the approach TCP uses to get
00:01:28.900 a connection going in a safe way,
00:01:31.533 and congestion avoidance being the approach it
00:01:33.800 uses to maintain the sending rate once
00:01:36.733 the flow has got started.
00:01:39.633 The fundamental goal of TCP is that
00:01:42.233 if you have several TCP flows sharing
00:01:45.533 a link, sharing a bottleneck link in the network,
00:01:50.300 each of those flows should get an
00:01:52.733 approximately equal share of the bandwidth.
00:01:55.900 So, if you have four TCP flows
00:01:57.666 sharing a link, they should each get
00:01:59.733 approximately one quarter of the capacity of that link.
00:02:03.866 And TCP does this reasonably well.
00:02:06.666 It’s not perfect. It, to some extent,
00:02:10.100 biases against long distance flows,
00:02:13.433 and shorter flows tend to win out
00:02:15.900 a little over long distance flows.
00:02:18.066 But, in general, it works pretty well,
00:02:19.900 and does give flows a roughly
00:02:22.866 equal share of the bandwidth.
00:02:26.100 The basic algorithm it uses to do
00:02:28.066 this, the basic congestion control algorithm,
00:02:30.600 is an approach known as TCP Reno.
00:02:32.233 And this is the state of the
00:02:35.200 art in TCP as of about 1990.
00:02:42.866 TCP is an ACK based protocol.
00:02:46.700 You send a packet, and sometime later
00:02:48.933 an acknowledgement comes back telling you that
00:02:52.000 the packet arrived, and indicating the sequence
00:02:54.366 number of the next packet which is expected.
00:02:59.300 The simplest way you might think that
00:03:01.300 would work, is you send a packet.
00:03:03.466 You wait for the acknowledgment. You send
00:03:05.533 another packet. You wait for the acknowledgement. And so on.
00:03:09.500 The problem with that, is that it
00:03:11.166 tends to perform very poorly.
00:03:14.200 It takes a certain amount of time
00:03:16.366 to send a packet down a link.
00:03:18.666 That depends on the size of the
00:03:20.033 packet, and the link bandwidth.
00:03:23.566 The size of the packet is expressed
00:03:25.600 as some number of bits to be sent.
00:03:27.700 The link bandwidth is expressed in some
00:03:29.800 number of bits it can deliver each
00:03:31.300 second. And if you divide the
00:03:33.233 packet size by the bandwidth, that gives
00:03:35.300 you the number of seconds it takes to send each packet.
00:03:39.166 It takes a certain amount of time
00:03:41.300 for that packet to propagate down the
00:03:43.733 link to the receiver, and for the
00:03:45.733 acknowledgment to come back to you, depending on
00:03:48.900 the round trip time of the link.
00:03:51.100 And you can measure the round trip time of the link.
00:03:54.833 And you can divide one by the other.
00:03:57.533 You can take the time it takes to send a packet, and the
00:04:00.333 time it takes for the acknowledgment to
00:04:02.066 come back, and divide one by the
00:04:03.900 other, to get the link utilisation.
00:04:07.200 And, ideally, you want that fraction to be
00:04:08.900 close to one. You want to be
00:04:11.166 spending most of the time sending packets,
00:04:13.466 and not much time waiting for the
00:04:15.166 acknowledgments to come back before you can
00:04:16.900 send the next packet.
00:04:19.866 The problem is that's often not the case.
00:04:23.566 For example, if we assume we're trying
00:04:25.566 to send data, and we have a
00:04:27.366 gigabit link, which is connecting the machine
00:04:30.100 we're sending data from, and we’re trying
00:04:31.800 to go from Glasgow to London.
00:04:33.966 And this might be the case you would find if you had one
00:04:37.133 of the machines in the Boyd Orr
00:04:39.233 labs, which is connected to the University's
00:04:42.166 gigabit Ethernet, and the University has a
00:04:44.666 10 gigabit per second link to the
00:04:46.666 rest of the Internet, so the bottleneck is that Ethernet.
00:04:51.100 If you're talking to a machine in London,
00:04:54.100 let's make some assumptions on how long this will take.
00:04:59.166 You’re sending using Ethernet, and the biggest
00:05:01.533 packet an Ethernet can deliver is 1500
00:05:03.566 bytes. So 1500 bytes, multiplied by eight
00:05:06.833 bits per byte, gives you a number
00:05:09.066 of bits in the packet. And it’s
00:05:11.366 a gigabit Ethernet, so it's sending a
00:05:13.233 billion bits per second.
00:05:15.400 So 1500 bytes, times eight bits,
00:05:17.866 divided by a billion bits per second.
00:05:21.133 It will take 12 microseconds, 0.000012 of
00:05:26.866 a second, to send a
00:05:29.266 packet down the link. And that’s just
00:05:31.800 the time it takes to physically serialise
00:05:34.066 1500 bytes down a gigabit per second link.
00:05:39.400 The round trip time to London, if you measure it, is about
00:05:44.566 a 100th of a second, about 10 milliseconds.
00:05:47.833 If you divide one by the other,
00:05:50.200 you find that the utilisation is 0.0012.
00:05:54.700 0.12% of the link is in use.
00:05:59.266 The time it takes to send a
00:06:00.933 packet is tiny compared to the time
00:06:02.833 it takes to get a response.
00:06:04.566 So if you're just sending one packet,
00:06:06.433 and waiting for a response, the link
00:06:08.166 is idle 99.9% of the time.
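The arithmetic from this Glasgow-to-London example can be checked directly. The numbers (a 1500-byte packet, a gigabit link, a 10 millisecond round trip) are the ones used above; the variable names are just for illustration.

```python
# Serialisation time of one packet versus the round trip time.

packet_size_bits = 1500 * 8          # a 1500-byte Ethernet frame, in bits
link_bandwidth = 1_000_000_000       # gigabit Ethernet: 10**9 bits per second
rtt = 0.010                          # measured round trip to London: 10 ms

serialisation_time = packet_size_bits / link_bandwidth
utilisation = serialisation_time / rtt

print(serialisation_time)        # ≈ 1.2e-05 s: 12 microseconds per packet
print(f"{utilisation:.4f}")      # ≈ 0.0012: only 0.12% of the link in use
```

Sending one packet per round trip leaves the link idle for the remaining 99.9% of the time, which is what motivates the sliding window.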
00:06:14.166 The idea of a sliding window protocol
00:06:16.733 is to not just send one packet
00:06:18.566 and wait for an acknowledgement.
00:06:20.133 It’s to send several packets,
00:06:22.566 and wait for the acknowledgments. And the
00:06:25.466 window is the number of packets that
00:06:27.266 can be outstanding before the acknowledgement comes back.
00:06:31.000 The idea is, you can start several
00:06:33.133 packets going, and eventually the acknowledgement comes
00:06:36.566 back, and that starts triggering the next
00:06:38.466 packets to be clocked out. The idea
00:06:40.233 is to improve the utilisation by sending
00:06:42.500 more than one packet before you get an acknowledgment.
00:06:47.200 And this is the fundamental approach to
00:06:49.266 sliding window protocols. The sender starts sending
00:06:51.833 data packets, and there's what's known as
00:06:54.433 a congestion window that specifies how
00:06:57.000 many packets it’s allowed to send
00:06:59.600 before it gets an acknowledgement.
00:07:02.033 And, in this example, the congestion window is six packets.
00:07:06.133 And the sender starts. It sends the
00:07:08.300 first data packet, and that gets sent
00:07:11.100 and starts its way traveling down the link.
00:07:14.533 And at some point later it sends
00:07:16.566 the next packet, and then the next packet, and so.
00:07:20.433 After a certain amount of time that
00:07:22.366 first packet arrives at the receiver,
00:07:24.400 and the receiver generates the acknowledgments which
00:07:26.800 comes back towards the sender.
00:07:28.933 And while this is happening, the sender
00:07:30.633 is sending more of the packets from its window.
00:07:33.966 And the receiver’s gradually receiving those and
00:07:36.266 sending the acknowledgments. And, at some point later,
00:07:38.966 the acknowledgement makes it back to the sender.
00:07:42.666 And in this case we've set the
00:07:44.733 window size to be six packets.
00:07:46.700 And it just so happens that the
00:07:48.500 acknowledgement for the first packet arrives back
00:07:51.733 at the sender, just as it has finished sending packet six.
00:07:57.700 And that triggers the window to
00:07:59.866 slide along.
00:08:02.066 So instead of being allowed to send packets one through six,
00:08:05.533 we're now allowed to send packets two
00:08:07.366 through seven. Because one packet has arrived,
00:08:09.833 that's opened up the window to allow
00:08:11.400 us to send one more packet.
00:08:13.733 And the acknowledgement indicates that packet one
00:08:16.200 has arrived. So just as we'd run
00:08:19.133 out of packets to send, just as
00:08:20.600 we've sent our six packets which are
00:08:22.533 allowed by the window, the acknowledgement arrives,
00:08:25.033 slides the window along one,
00:08:26.566 tells us we can now send one more.
00:08:29.566 And the idea is that you size
00:08:31.600 the window such that you send just
00:08:33.366 enough packets that by the time the
00:08:35.600 acknowledgement comes back, you're ready to slide
00:08:37.800 the window along. You've sent everything that
00:08:40.000 was in your window.
00:08:41.766 And each acknowledgement releases the next packet
00:08:44.833 for transmission, if you get the window sized right.
00:08:48.766 And if there's a problem, if the acknowledgments
00:08:51.233 don't come back because something got lost,
00:08:54.033 then it stalls. You haven't sent too
00:08:56.966 many excess packets, you don't just keep
00:08:59.200 sending without getting acknowledgments,
00:09:01.466 you're just sending enough
00:09:02.933 that the acknowledgments come back, just as
00:09:04.966 you run out of things to send.
00:09:06.966 And everything just keeps it sort-of balanced.
00:09:09.300 Every acknowledgement triggers the next packet to
00:09:11.466 be sent, and it rolls along.
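The window behaviour in this example, where the ACK for packet one slides the allowed range from packets one through six to two through seven, can be sketched as follows. The function names are invented for illustration; real TCP tracks the window in bytes, not packet numbers.

```python
# Toy sliding window: at most `window_size` packets may be outstanding,
# and each acknowledgement slides the window along, releasing one more.

window_size = 6
window_start = 1                      # lowest unacknowledged sequence number

def allowed_packets():
    """Sequence numbers the sender may currently have outstanding."""
    return list(range(window_start, window_start + window_size))

def on_ack(acked_seq):
    """Slide the window past the acknowledged packet."""
    global window_start
    window_start = acked_seq + 1

print(allowed_packets())    # → [1, 2, 3, 4, 5, 6]
on_ack(1)                   # the ACK for packet 1 arrives...
print(allowed_packets())    # → [2, 3, 4, 5, 6, 7]: window slid along by one
```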
00:09:14.500 How big should the window be? Well,
00:09:17.166 it should be sized to match the
00:09:18.466 bandwidth times the delay on the path.
00:09:20.900 And you work it out in bytes.
00:09:23.266 It's the bandwidth of the path,
00:09:24.600 a gigabit in the previous example,
00:09:26.933 times the latency,
00:09:28.300 100th of a second, and you multiply
00:09:30.833 those together and that tells you how
00:09:32.366 many bytes can be in flight.
00:09:33.733 And you divide that by the packet
00:09:35.366 size, and that tells you how many packets you can send.
00:09:39.633 The problem is, the sender doesn't know
00:09:42.333 the bandwidth of the path, and it
00:09:44.700 doesn't know that latency. It doesn't know
00:09:46.633 the round trip time.
00:09:49.366 It can measure the round trip time,
00:09:51.433 but not until after it started sending.
00:09:53.733 Once it’s sent a packet, it can
00:09:55.800 wait for an acknowledgement to come back
00:09:57.566 and get an estimate of the round
00:09:58.966 trip time. But it can't do that
00:10:00.566 at the point where it starts sending.
00:10:02.566 And it can't know what the
00:10:04.300 bandwidth is. It knows the bandwidth of the
00:10:06.500 link it's connected to, but it doesn't
00:10:08.033 know the bandwidth of the rest of
00:10:09.500 the links throughout the network.
00:10:11.366 It doesn't know how many other TCP
00:10:13.566 flows it's sharing the path with,
00:10:15.133 so it doesn't know how much of
00:10:16.600 that capacity it's got available.
00:10:19.000 And this is the problem with
00:10:21.366 sliding window algorithms. If you get
00:10:24.033 the window size right,
00:10:26.100 it allows you to do the ACK
00:10:27.900 clocking, it allows you to clock out
00:10:29.566 the packets at the right time,
00:10:31.100 just in time for the next packet to become available.
00:10:34.166 But, in order to pick the right
00:10:35.500 window size, you need to know the
00:10:36.833 bandwidth and the delay, and you don't
00:10:38.666 know either of those at the start of the connection.
00:10:44.333 TCP follows the sliding window approach.
00:10:47.700 TCP Reno is very much a sliding
00:10:51.266 window protocol, and it's optimised for not
00:10:53.966 knowing what the window sizes are.
00:10:58.466 And the challenge with TCP is to
00:11:01.033 pick what should be the initial window.
00:11:02.933 To pick how many packets you should
00:11:04.566 send, before you know anything about the
00:11:06.600 round trip time, or anything about bandwidth.
00:11:09.700 And how to find the path capacity,
00:11:11.633 how to figure out at what point
00:11:13.700 you've got the right size window.
00:11:15.866 And then how to adapt the window
00:11:18.833 to cope with changes in the capacity.
00:11:23.600 So there's two fundamental problems with TCP
00:11:26.766 Reno congestion control. Picking the initial window size
00:11:31.666 for the first set of packets you send.
00:11:34.833 And then, adapting that initial window size
00:11:37.500 to find the bottleneck capacity, and to
00:11:39.733 adapt to changes in that bottleneck capacity.
00:11:42.366 If you get the window size right,
00:11:44.500 you can make effective use of the
00:11:46.033 network capacity. If you get it wrong
00:11:48.633 you’ll either send too slowly, and end
00:11:50.900 up wasting capacity. Or you'll send too
00:11:53.033 quickly, and overload the network, and cause
00:11:55.200 packets to be lost because the queues fill.
00:12:01.800 So, how does TCP find the initial window?
00:12:05.966 Well, to start with, you have no
00:12:07.766 information. When you're making a TCP connection
00:12:10.900 to a host you haven't communicated with
00:12:12.966 before, you don't know the round trip
00:12:15.100 time to that host, you don’t know
00:12:16.633 how long it will take to get
00:12:17.800 a response, and you don't know the network capacity.
00:12:21.133 So you have no information to know
00:12:23.666 what an appropriately sized window should be.
00:12:27.966 The only safe thing you can do,
00:12:30.600 the only thing which is safe in
00:12:32.133 all circumstances, is to send one packet,
00:12:34.766 and see if it arrives, see if you get an ACK.
00:12:38.433 And if it works, send a little
00:12:39.933 bit faster next time.
00:12:42.500 And then gradually increase the rate at which you send.
00:12:46.100 The only safe thing to do
00:12:48.033 is to start at the lowest possible rate,
00:12:50.400 equivalent of stop-and-wait, and then gradually
00:12:53.700 increase your rate from there, once you know that it works.
00:12:58.366 The problem is, of course, that's pessimistic,
00:13:00.433 in most cases.
00:13:02.000 Most links are not the slowest possible link.
00:13:04.500 Most links, you can send faster than that.
00:13:09.233 What TCP has traditionally done, and the
00:13:12.466 traditional approach in TCP Reno, is declared
00:13:15.300 the initial window to be three packets.
00:13:18.533 So you can send three packets,
00:13:20.300 without getting any acknowledgments back.
00:13:23.300 And, by the time the third packet
00:13:24.800 has been sent, you should be just
00:13:27.033 about to get the acknowledgement back,
00:13:28.566 which will open it up for you to send the fourth.
00:13:30.933 And at that point, it starts ACK clocking.
00:13:34.700 And why is it three packets?
00:13:37.066 Because someone did some measurements,
00:13:38.933 and decided that was what was safe.
00:13:42.500 More recently, I guess, about 10 years
00:13:45.666 ago now, Nandita Dukkipati and her group
00:13:49.333 at Google did another set of measurements,
00:13:52.333 and showed that was actually pessimistic.
00:13:55.066 The networks had gotten a lot faster
00:13:57.233 in the time since TCP was first
00:13:59.833 standardised, and they came to the conclusion,
00:14:02.900 based on the measurements of browsers accessing
00:14:05.733 the Google site, that about 10 packets
00:14:08.600 was a good starting point.
00:14:11.500 And the idea here is that 10
00:14:13.133 packets, you can send 10 packets at
00:14:15.533 the start of a connection, and after
00:14:18.500 you’ve sent 10 packets you should have
00:14:20.266 got an acknowledgement back.
00:14:22.666 Why ten?
00:14:24.633 Again, it's a balance between safety and
00:14:27.233 performance. If you send too many packets
00:14:31.633 onto a network which can't cope with
00:14:33.333 them, those packets will get queued up
00:14:35.533 and, in the best case, it’ll just
00:14:37.566 add latency because they're all queued up
00:14:39.666 somewhere. And in the worst case they'll
00:14:41.466 overflow the queues, and cause packet loss,
00:14:43.500 and you'll have to re-transmit them.
00:14:45.900 So you don't want to send too
00:14:47.733 fast. Equally, you don't want to send
00:14:49.700 too slow, because that just wastes capacity.
00:14:52.733 And the measurements that Google came up with
00:14:56.000 at this point, which was around 10
00:14:58.133 years ago, was that about 10 packets
00:15:00.433 was a good starting point for most connections.
00:15:03.466 It was unlikely to cause congestion in
00:15:06.800 most cases, and was also unlikely to
00:15:08.966 waste too much bandwidth.
00:15:11.900 And I think what we'd expect to
00:15:13.333 see, is that over time the initial
00:15:14.900 window will gradually increase, as network connections
00:15:17.233 around the world gradually get faster.
00:15:19.566 And it's balancing making good use of
00:15:22.766 connections in well-connected
00:15:25.133 first-world parts of the world, where there’s
00:15:28.633 good infrastructure,
00:15:30.800 against not overloading connections in parts of
00:15:34.333 the world where the infrastructure is less well developed.
00:15:40.233 The initial window lets you send something.
00:15:43.266 With a modern TCP, it lets you send 10 packets.
00:15:48.266 And you can send those 10 packets,
00:15:50.166 or whatever the initial window is,
00:15:52.200 without waiting for an acknowledgement to come back.
00:15:55.733 But it's probably not the right size;
00:15:58.333 it’s probably not the right window size.
00:16:01.300 If you're on a very fast connection,
00:16:04.200 in a well-connected part of the world,
00:16:06.033 you probably want a much bigger window than 10 packets.
00:16:09.033 And if you're on a poor quality
00:16:11.500 mobile connection, or in a part of
00:16:13.433 the world where the infrastructure is less
00:16:15.133 well developed, you probably want a smaller window.
00:16:18.433 So you need to somehow adapt
00:16:20.000 to match the network capacity.
00:16:23.466 And there's two parts to this.
00:16:25.700 What's called slow start, where you try
00:16:28.200 to quickly find the appropriate initial window,
00:16:32.366 where starting from initial window, you quickly
00:16:34.900 converge on what the right window is.
00:16:37.266 And congestion avoidance, where you adapt in
00:16:39.800 the long term to match changes in
00:16:42.633 capacity once the thing is running.
00:16:47.300 So how does slow start work?
00:16:49.400 Well, this is the phase at the beginning of the connection.
00:16:52.766 It's easiest to illustrate if you assume
00:16:55.066 that the initial window is one packet.
00:16:57.600 If the initial window is one packet,
00:16:59.966 you send one packet, and at some
00:17:02.066 point later an acknowledgement comes back.
00:17:05.200 And the way slow start works is
00:17:07.066 that each acknowledgment you get back
00:17:09.433 increases the window by one.
00:17:13.733 So if you send one packet,
00:17:15.833 and get one packet back, that increases
00:17:18.466 the window from one to two,
00:17:20.133 so you can send two packets the next time.
00:17:23.133 And you send those two packets,
00:17:25.333 and you get two acknowledgments back.
00:17:27.066 And each acknowledgment increases the window by
00:17:29.233 one, so it goes to three,
00:17:30.800 and then to four. So you can
00:17:32.166 send four packets the next time.
00:17:35.233 And then you get four acknowledgments back,
00:17:37.666 each of which increases the window,
00:17:39.433 so your window is now eight.
00:17:42.133 And, as we are all, I think,
00:17:45.400 painfully aware after the pandemic, this is
00:17:47.966 exponential growth.
00:17:50.233 The window is doubling each time.
00:17:52.300 So it's called slow start because it
00:17:54.366 starts very slow, with one packet or
00:17:56.500 three packets or 10 packets, depending on
00:17:58.600 the version of TCP you have.
00:18:00.466 But each round trip time the window doubles.
00:18:03.666 It doubles its sending rate each time.
00:18:06.866 And this carries on until it loses
00:18:09.533 a packet. This carries on until it
00:18:11.766 fills the queues and overflows the capacity
00:18:14.300 of the network somewhere.
00:18:16.333 At which point it halves back to
00:18:18.266 its previous value, and drops out of
00:18:19.866 the slow start phase.
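A minimal sketch of that doubling behaviour, counting the window at the start of each round trip (an illustrative simulation, not real TCP code):

```python
# Slow start: every acknowledgement opens the window by one packet,
# so a full window of ACKs doubles the window each round trip.

def slow_start_windows(initial_window: int, rounds: int) -> list[int]:
    windows, cwnd = [], initial_window
    for _ in range(rounds):
        windows.append(cwnd)
        cwnd += cwnd   # one increment per ACK in the window -> doubling
    return windows

print(slow_start_windows(1, 5))   # -> [1, 2, 4, 8, 16]
```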
00:18:23.733 If we look at this graphically,
00:18:26.133 what we see on the graph at
00:18:27.800 the bottom of the slide, we have
00:18:29.600 time on the X axis, and the
00:18:31.666 congestion window, the size of the congestion
00:18:33.800 window, on the y axis.
00:18:35.700 And we're assuming an initial window of
00:18:37.433 one packet. We see that, on the
00:18:39.933 first round trip it sends the one
00:18:41.566 packet, gets the acknowledgement back. The second
00:18:44.766 round trip it sends two packets.
00:18:46.700 And then four, and then eight,
00:18:48.166 and then 16. And each time it
00:18:50.400 doubles its sending rate.
00:18:52.366 So you have this exponential growth phase,
00:18:54.500 starting at whatever the initial window is,
00:18:57.466 and doubling each time until it reaches
00:18:59.366 the network capacity.
00:19:01.500 And eventually it fills the network.
00:19:03.600 Eventually some queue, somewhere in the network,
00:19:05.766 is full. And it overflows and the packet gets lost.
00:19:10.266 At that point the connection halves its
00:19:12.200 rate, back to the value just before
00:19:14.466 it last increased. In this example,
00:19:17.233 we see that it got up to
00:19:19.333 a window of 16, and then
00:19:21.900 something got lost, and then it halved
00:19:23.433 back down to a window of eight.
00:19:26.266 At that point TCP enters what's known
00:19:28.466 as the congestion avoidance phase.
00:19:33.500 The goal of congestion avoidance is to
00:19:37.500 adapt to changes in capacity.
00:19:41.300 After the slow start phase, you know
00:19:43.366 you've got approximately the right size window
00:19:45.466 for the path. It's telling you roughly
00:19:47.366 how many packets you should be sending
00:19:48.900 each round trip time. The goal,
00:19:51.266 once you’re in congestion avoidance, is to adapt to changes.
00:19:55.666 Maybe the capacity of the path changes.
00:19:58.900 Maybe you're on a mobile device,
00:20:00.900 with a wireless connection, and the quality
00:20:04.033 of the wireless connection changes.
00:20:06.400 Maybe the amount of cross traffic changes.
00:20:09.466 Maybe additional people start sharing the link
00:20:12.266 with you, and you have less capacity
00:20:14.033 because you’re sharing with more TCP flows.
00:20:16.666 Or maybe some of the cross traffic
00:20:18.033 goes away, and the amount of capacity
00:20:20.100 you have available increases because there's less
00:20:22.133 competing traffic.
00:20:24.433 And the congestion avoidance phase follows an
00:20:27.200 additive increase, multiplicative decrease,
00:20:29.300 approach to adapting
00:20:30.633 the congestion window when that happens.
00:20:34.866 So, in congestion avoidance,
00:20:38.166 if it successfully manages to send a
00:20:40.466 complete window of packets, and gets acknowledgments
00:20:43.300 back for each of those packets.
00:20:45.333 So it's sent out
00:20:47.900 eight packets, for example, and gets eight
00:20:50.600 acknowledgments back,
00:20:52.366 it knows the network can support that sending rate.
00:20:55.766 So it increases its window by one.
00:20:59.133 So the next time, it sends out nine packets
00:21:02.600 and expects to get nine acknowledgments back
00:21:05.333 over the next round trip cycle.
00:21:08.233 And if it successfully does that,
00:21:09.966 it increases the window again.
00:21:12.500 And it sends 10 packets, and expects
00:21:15.400 to get 10 acknowledgments back.
00:21:17.800 And we see that each round trip
00:21:20.000 it gradually increases the sending rate by
00:21:22.166 one. So it sends 8 packets,
00:21:24.566 then 9, then 10, then 11,
00:21:26.333 and 12, and keeps gradually, linearly,
00:21:29.166 increasing its rate.
00:21:31.900 Up until the point that something gets lost.
00:21:36.966 And if a packet gets lost?
00:21:40.300 You’ll be able to detect that because,
00:21:43.100 as we saw in the previous lecture,
00:23:01.266 you'll get a triple duplicate acknowledgement.
00:21:46.833 And that indicates that one of the
00:21:49.433 packets got lost, but the rest of
00:21:50.933 the data in the window was received.
00:21:54.666 And what you do at that point,
00:21:56.500 is you do a multiplicative decrease in
00:21:58.566 the window. You halve the window.
00:22:02.300 So, in this case, the sender was
00:22:04.533 sending with a window of
00:22:07.133 12 packets, and it successfully sent that.
00:22:10.200 And then it tried to send,
00:22:13.500 tried to increase its rate, realised it
00:22:17.066 didn't work, realised something got lost,
00:22:19.133 and so it halved its window back down to six.
00:22:23.500 And then it switches back,
00:22:25.466 and goes back to
00:22:27.400 the gradual additive increase.
00:22:29.733 And it follows this sawtooth pattern.
00:22:32.433 Gradual linear increase, one packet more each
00:22:35.666 round trip time.
00:22:37.633 Until it sends too fast, causes a
00:22:40.166 packet to be lost because it overflows
00:22:41.966 a queue, halves its sending rate,
00:22:44.133 and then gradually starts increasing it again.
00:22:47.833 It follows this sawtooth pattern. Gradual increase,
00:22:51.500 quick back-off; gradual increase, quick back-off.
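That sawtooth can be sketched as a simple simulation. The rounds on which loss occurs here are invented purely to show the shape:

```python
# Additive increase, multiplicative decrease in congestion avoidance:
# grow the window by one packet per round trip, halve it on loss.
# The loss_rounds set is a made-up loss pattern for illustration.

def aimd_trace(cwnd: int, loss_rounds: set[int], rounds: int) -> list[int]:
    trace = []
    for r in range(rounds):
        if r in loss_rounds:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease
        else:
            cwnd += 1                  # additive increase
        trace.append(cwnd)
    return trace

print(aimd_trace(8, {4, 9}, 10))   # -> [9, 10, 11, 12, 6, 7, 8, 9, 10, 5]
```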
00:22:57.433 The other way TCP can detect the
00:22:59.633 loss is by what’s known as a
00:23:01.266 time out. It’s sending the packets,
00:23:04.500 and suddenly the acknowledgements stop coming back entirely.
00:23:09.633 And this means that either the receiver
00:23:11.833 has crashed, the receiving system has gone
00:23:14.933 away, or perhaps more likely the network has failed.
00:23:18.733 And the data it's sending is either
00:23:21.600 not reaching the receiver, or the reverse path has failed,
00:23:24.766 and the acknowledgments are not coming back.
00:23:29.200 At that point, after nothing has come back for a while,
00:23:33.333 it assumes a timeout has happened,
00:23:37.466 and resets the window down to the initial window.
00:23:41.833 And in the example we see on
00:23:43.866 the slide, at time 14 we've got
00:23:45.933 a timeout, and it resets, and the
00:23:48.500 window goes back to the initial window of one packet.
00:23:51.566 At that point, it re-enters slow start.
00:23:53.633 It starts again from the beginning.
00:23:55.966 And whether your initial window is one
00:23:58.066 packet, or three packets, or ten packets,
00:24:00.233 it starts in the beginning, and it
00:24:02.066 re-enters slow start, and it tries again
00:24:04.100 for the connection.
00:24:06.466 And if this was a transient failure,
00:24:08.500 that will probably succeed. If it wasn’t,
00:24:11.366 it may end up in yet another
00:24:13.900 timeout, while it takes time for the
00:24:15.600 network to recover, or
00:24:17.933 for the system you're talking to,
00:24:19.866 to recover, and it will be a
00:24:21.266 while before it can successfully send a
00:24:22.966 packet. But, when it does, when the
00:24:24.766 network recovers, it starts sending again,
00:24:26.866 and resets the connection from the beginning.
00:24:30.366 How long should the timeout be?
00:24:33.533 Well, the standard says the larger of
00:24:37.200 one second, or the average round trip
00:24:39.900 time plus four times the statistical variance
00:24:42.200 in the round trip time.
00:24:45.200 And, if you're a statistician, you’ll recognise
00:24:47.666 that the RTT plus four times the
00:24:49.766 variance, if you're assuming a normal distribution of
00:24:54.233 round trip time samples, accounts for 99%
00:24:57.733 of the samples falling within range.
00:25:01.266 So it's finding the 99th percentile of
00:25:04.466 the expected time to get an acknowledgement back.
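That timeout rule can be sketched as follows. The smoothing constants (1/8 for the average, 1/4 for the variation) are the usual TCP values from RFC 6298; they aren't stated in the lecture.

```python
# Retransmission timeout: the larger of one second and the smoothed
# round trip time plus four times the measured round trip variation.

def update_rto(srtt: float, rttvar: float, sample: float):
    """Fold one RTT measurement into the estimates; return (srtt, rttvar, rto)."""
    rttvar = 0.75 * rttvar + 0.25 * abs(srtt - sample)
    srtt = 0.875 * srtt + 0.125 * sample
    rto = max(1.0, srtt + 4 * rttvar)
    return srtt, rttvar, rto

# A 120 ms sample against a 100 ms smoothed RTT: srtt + 4*rttvar is
# well under a second, so the one-second floor applies.
srtt, rttvar, rto = update_rto(0.100, 0.010, 0.120)
print(rto)   # -> 1.0
```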
00:25:12.700 Now, TCP follows this sawtooth behaviour,
00:25:16.866 with gradual additive increase in the sending
00:25:19.466 rate, and then a back-off, halving its
00:25:22.333 sending rate, and then a gradual increase again.
00:25:25.633 And we see this in the top
00:25:27.166 graph on the slide which is showing a
00:25:29.766 measured congestion window for a real TCP flow.
00:25:34.166 And, after dynamics of the slow start
00:25:36.266 at the beginning, we see it follows this sawtooth pattern.
00:25:41.366 How does that affect the rest of the network?
00:25:45.033 Well, the packets are, at some point,
00:25:48.133 getting queued up at whatever the bottleneck link is.
00:25:53.733 And the second graph we see on
00:25:55.466 the left, going down, is the size of the queue.
00:25:58.866 And we see that as the sending
00:26:00.766 rate increases, the queue gradually builds up.
00:26:04.200 Initially the queue is empty, and as
00:26:06.566 it starts sending faster, the queue gradually gets fuller.
00:26:11.333 And at some point the queue gets full, and overflows.
00:26:17.866 And when the queue gets full,
00:26:19.633 when the queue overflows, when packets gets
00:26:21.800 lost, TCP halves its sending rate.
00:26:24.700 And that causes the queue to rapidly
00:26:27.166 empty, because there's less packets coming in,
00:26:29.566 so the queue drains.
00:26:31.466 But what we see is that just
00:26:33.266 as the queue is getting to empty,
00:26:35.666 the rate is starting to increase again.
00:26:38.566 Just as the queue gets to the point
00:26:40.200 where it would have nothing to send,
00:26:41.833 the rate starts picking up, such that
00:26:44.033 the queue starts to gradually refill.
00:26:46.600 So the queues in the routers also
00:26:48.600 follow a sawtooth pattern. They gradually fill
00:26:51.500 up until they are full.
00:26:55.200 And then the rate halves, the queue
00:26:58.433 empties rapidly because
00:27:00.133 there's much less traffic coming in,
00:27:02.133 and as it's emptying the rate at
00:27:04.233 which the sender is sending is gradually
00:27:06.500 increasing, and the queue size oscillates.
00:27:09.266 And we see the same thing happens
00:27:11.066 with the round trip time, in the
00:27:13.766 third of the graphs, as the queue gradually
00:27:17.000 fills up, the round trip time goes
00:27:18.900 up, and up, and up, it's taking
00:27:20.733 longer for the packets because they're queued up somewhere.
00:27:23.366 And then the rate reduces, the queue
00:27:26.266 drops, the round trip time drops.
00:27:28.733 And it gradually, as the rate picks up afterwards
00:27:33.066 back into congestion avoidance, the queue gradually
00:27:35.666 fills, the round trip time gradually increases.
00:27:38.466 So, both window size, and the queue
00:27:40.666 size, and the round trip time,
00:27:42.266 all follow this characteristic sawtooth pattern.
00:27:47.066 What's interesting though, if we look at
00:27:50.100 the fourth graph down on the left,
00:27:52.800 is we're looking at the rate at
00:27:54.333 which packets are arriving at the receiver.
00:27:56.966 And we see that the rate at
00:27:58.800 which packets are arriving at the receiver
00:28:00.533 is pretty much constant.
00:28:03.300 What's happening is that the packets are
00:28:05.266 being queued up at the link,
00:28:07.400 and as the queue fills there's more
00:28:09.833 and more packets queued up
00:28:11.900 at the bottleneck link. And when TCP
00:28:15.366 backs-off, when it reduces its window,
00:28:19.000 that lets the queue drain. But the
00:28:21.866 queue never quite empties. We just see
00:28:25.133 very occasional drops where the queue gets
00:28:27.566 empty, but typically the queue always has
00:28:30.033 something in it.
00:28:31.800 It's emptying rapidly, it’s getting less and
00:28:34.166 less data in it, but the queue,
00:28:37.666 if the buffer is sized right,
00:28:39.866 if the window is chosen right, never quite empties.
00:28:43.800 So the TCP sender is following this
00:28:46.433 sawtooth pattern, with its sending window,
00:28:49.600 which is gradually filling up the queues.
00:28:51.966 And then the queues are gradually draining
00:28:53.966 when TCP backs-off and halves its rate,
00:28:56.933 but the queue never quite empties.
00:28:58.933 It always has some data to send,
00:29:00.633 so the receiver is always receiving data.
00:29:03.700 So, even though the sender's following the
00:29:05.766 sawtooth pattern, the receiver receives constant rate
00:29:08.266 data the whole time,
00:29:10.233 at approximately the bottleneck bandwidth.
00:29:13.866 And that's the genius of TCP.
00:29:16.566 It manages, by following this additive increase,
00:29:20.066 multiplicative decrease, approach, it manages to adapt
00:29:24.333 the rate such that the buffer never
00:29:27.200 quite empties, and the data continues to be delivered.
00:29:32.233 And for that to work, it needs
00:29:34.433 the router to have enough buffering capacity
00:29:37.400 in it. And the amount of buffering
00:29:39.600 the router needs, is the bandwidth times
00:29:42.166 the delay of the path. Too
00:29:44.333 little buffering in the router
00:29:47.033 means the queue empties completely
00:29:49.933 when the sender backs off, and it can't
00:29:52.633 quite sustain the full rate. Too much,
00:29:55.500 and you just get what's known as buffer bloat.
00:29:59.366 It's safe, I mean in terms of
00:30:00.700 throughput, it keeps receiving the data.
00:30:02.766 But the queues get very big,
00:30:04.800 and they never get anywhere near empty,
00:30:07.466 so the amount of data queued up
00:30:09.766 increases, and you just get increased latency.
00:30:15.033 So that's TCP Reno. It's really effective
00:30:18.100 at keeping the bottleneck fully utilised.
00:30:20.466 But it trades latency for throughput.
00:30:22.866 It tries to fill the queue,
00:30:24.766 it's continually pushing, it’s continually queuing up data.
00:30:28.066 Making sure the queue is never empty.
00:30:30.800 Making sure the queue is never empty,
00:30:32.500 so provided there’s enough buffering in the
00:30:34.800 network there are always packets being delivered.
00:30:37.566 And that's great, if your goal is
00:30:39.966 to maximise the rate at which information
00:30:42.400 is delivered. TCP is really good at
00:30:45.466 keeping the bottleneck link fully utilised.
00:30:47.800 It’s really, really good at delivering data
00:30:49.900 as fast as the network can support it.
00:30:52.333 But it trades that off for latency.
00:30:56.500 It's also really good at making sure
00:30:59.166 there are queues in the network,
00:31:01.066 and making sure that the network is
00:31:03.466 not operating at its lowest possible latency.
00:31:06.300 There's always some data queued up.
00:31:11.733 There are two other limitations,
00:31:13.966 other than increased latency.
00:31:16.700 First, is that TCP assumes that losses
00:31:19.066 are due to congestion.
00:31:21.600 And historically that's been true. Certainly in
00:31:24.466 wired links, packet loss is almost always
00:31:27.566 caused by a queue filling up,
00:31:30.433 overflowing, and a router not having space
00:31:34.133 to enqueue a packet.
00:31:36.666 In certain types of wireless links,
00:31:39.366 in 4G or in WiFi links,
00:31:41.500 that's not always the case, and you
00:31:43.733 do get packet loss due to corruption.
00:31:46.533 And TCP will treat this as a
00:31:49.000 signal to slow down. Which means that
00:31:51.166 TCP sometimes behaves sub-optimally on wireless links.
00:31:55.366 And there's a mechanism called Explicit Congestion
00:31:57.966 Notification, which we'll talk about in one
00:32:00.400 of the later parts of this lecture,
00:32:01.900 which tries to address that.
00:32:04.400 The other, is that the congestion avoidance
00:32:07.433 phase can take a long time to ramp up.
00:32:10.600 On very long distance links, very high capacity
00:32:16.133 links, it can take a long time
00:32:17.666 to get up to, after packet loss,
00:32:20.300 it can take a very long time
00:32:21.433 to get back up to an appropriate rate.
00:32:23.766 And there are some occasions with very
00:32:26.333 fast long distance links, where it performs
00:32:28.300 poorly, because of the way the congestion
00:32:31.066 avoidance works.
00:32:32.933 And there's an algorithm known as TCP
00:32:34.800 Cubic, which I'll talk about in the
00:32:36.500 next part, which tries to address that.
00:32:40.333 And that's the basics of TCP.
00:32:42.600 The basic TCP congestion control algorithm is
00:32:45.333 a sliding window algorithm, where the window
00:32:48.500 indicates how many packets you’re allowed to
00:32:50.800 send before getting an acknowledgement.
00:32:53.766 The goal of the slow start and
00:32:56.333 the congestion avoidance phases, and the additive
00:32:59.266 increase, multiplicative decrease, is to adapt the
00:33:02.166 size of the window to match the network capacity.
00:33:05.133 It always tries to match the size
00:33:07.166 of the window exactly to the capacity,
00:33:09.633 so it's making the most use of the network resources.
00:33:14.733 In the next part, I’ll move on
00:33:16.933 and talk about an extension to the
00:33:20.033 TCP Reno algorithm, known as TCP Cubic,
00:33:23.066 which is intended to improve performance on
00:33:25.533 very fast and long distance networks.
00:33:27.966 And then, in the later parts,
00:33:29.466 we'll talk about extensions to reduce latency,
00:33:32.600 and to work on wireless links where
00:33:35.933 there are non-congestive losses.
Part 3: TCP Cubic
The third part of the lecture talks about the TCP Cubic congestion control algorithm, a widely used extension to TCP that improves its performance on fast, long-distance, networks. The lecture discusses the limitations of TCP Reno that led to the development of Cubic, and outlines how Cubic congestion control improves performance but retains fairness with Reno.
00:00:00.833 In the previous part, I spoke about TCP Reno.
00:00:04.133 TCP Reno is the default congestion control
00:00:07.033 algorithm for TCP, but it's actually not
00:00:09.566 particularly widely used in practice these days.
00:00:12.566 What most modern TCP versions use is,
00:00:14.966 instead, an algorithm known as TCP Cubic.
00:00:18.600 And the goal of TCP Cubic is
00:00:20.666 to improve TCP performance on fast long distance networks.
00:00:26.033 So the problem with TCP Reno
00:00:27.966 is that its performance can be comparatively
00:00:30.133 poor on networks with large bandwidth-delay products.
00:00:33.933 That is, networks where the product,
00:00:36.333 what you get when you multiply the
00:00:37.900 bandwidth of the network, in number of
00:00:39.766 bits per second, and the delay,
00:00:42.100 the round trip time of the network, is large.
00:00:45.833 Now, this is not a problem that
00:00:48.066 most people, have most of the time.
00:00:50.466 But, it's a problem that began to
00:00:52.400 become apparent in the early 2000s when
00:00:55.733 people working at organisations like CERN were
00:00:58.500 trying to transfer very large data files
00:01:01.033 across fast long distance
00:01:05.800 networks between CERN and the universities that
00:01:08.933 were analysing the data.
00:01:11.233 For example, CERN is based at Geneva,
00:01:13.800 in Switzerland, and some of the big
00:01:16.566 sites for analysing the data are based
00:01:19.533 at, for example, Fermilab just outside Chicago in the US.
00:01:23.900 And in order to get the data
00:01:26.166 from CERN to Fermilab, from Geneva to Chicago,
00:01:31.366 they put in place multi-gigabit transatlantic links.
00:01:37.566 And if you think about the congestion window needed to
00:01:42.666 make good use of a link like
00:01:44.666 that, you realise it actually becomes quite large.
00:01:48.066 If you assume the link is 10
00:01:50.766 gigabit per second, which was cutting edge
00:01:54.033 in the early 2000s, but it is
00:01:55.833 now relatively common for high-end links these days,
00:01:59.033 and assume 100 milliseconds round trip time,
00:02:02.100 which is possibly even slightly an under-estimate
00:02:04.933 for the path from Geneva to Chicago,
00:02:08.900 in order to make good use
00:02:11.166 of that, you need a congestion window
00:02:12.866 which equals the bandwidth times the delay.
00:02:15.200 And 10 gigabits per second, times 100
00:02:17.633 milliseconds, gives you a congestion window of
00:02:20.233 about 100,000 packets.
00:02:24.166 And, partly, it takes TCP a long
00:02:28.066 time, a comparatively long time, to slow
00:02:31.333 start up to a 100,000 packet window.
00:02:34.266 But that's not such a big issue,
00:02:36.533 because that only happens once at the
00:02:38.066 start of the connection. The issue,
00:02:40.166 though, is in congestion avoidance.
00:02:42.800 If one packet is lost on the
00:02:44.766 link, out of a window of 100,000,
00:02:47.266 that will cause TCP to back-off and
00:02:49.800 halve its window. And it then increases
00:02:53.066 sending rate again, by one packet every round trip time.
00:02:57.300 And backing off from 100,000 packet window
00:03:00.033 to a 50,000 packet window, and then
00:03:02.433 increasing by one each time, means it
00:03:04.766 takes 50,000 round trip times to recover
00:03:07.500 back up to the full window.
00:03:10.400 50,000 round trip times, when the round
00:03:13.000 trip time is 100 milliseconds, is about 1.4 hours.
00:03:17.600 So it takes TCP about one-and-a-half hours
00:03:20.966 to recover from a single packet loss.
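The arithmetic behind that figure can be sketched directly, using the lecture's rounded window of 100,000 packets:

```python
# Reno recovery time: after a loss the window halves, then grows back
# by one packet per round trip, so regaining the full window takes
# window/2 round trips.

def recovery_time_hours(window_packets: int, rtt_seconds: float) -> float:
    rtts_to_recover = window_packets / 2    # +1 packet per round trip
    return rtts_to_recover * rtt_seconds / 3600

# 100,000-packet window, 100 ms round trip time
print(round(recovery_time_hours(100_000, 0.100), 2))   # -> 1.39 hours
```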
00:03:24.300 And, with a window of 100,000 packets,
00:03:27.666 you're sending enough data, at 10 gigabits per second,
00:03:32.033 that the imperfections in the optical fibre,
00:03:35.433 and imperfections in the equipment that are
00:03:37.333 transmitting the packets, become significant.
00:03:40.233 And you're likely to just see occasional
00:03:43.300 random packet losses, just because of imperfections
00:03:46.100 in the transmission medium, even if there's
00:03:48.166 no congestion. And this was becoming a
00:03:50.466 limiting factor, this was becoming a bottleneck
00:03:52.666 in the transmission.
00:03:54.366 It was becoming impossible to build
00:03:56.400 a network that was reliable enough
00:03:58.733 that it never lost any packets while
00:04:01.433 transferring several hundreds of billions of packets
00:04:03.966 of data
00:04:05.100 between CERN and
00:04:11.500 the sites which were doing the analysis.
00:04:14.600 TCP cubic is one of a range
00:04:16.733 of algorithms which were developed to try
00:04:19.200 and address this problem. To try and
00:04:22.000 recover much faster than TCP Reno would,
00:04:24.466 in the case when you had very
00:04:26.400 large congestion windows, and small amounts of packet loss.
00:04:32.033 So the idea of TCP cubic,
00:04:34.866 is that it changes the way the
00:04:36.866 congestion control works in the congestion avoidance phase.
00:04:41.200 So, in congestion avoidance, TCP cubic will
00:04:46.033 increase the congestion window faster than TCP
00:04:49.000 Reno would, in cases where the window is large.
00:04:54.366 In cases where the window is relatively
00:04:56.700 small, in the types of networks where
00:04:59.233 Reno has good performance, TCP cubic behaves
00:05:03.800 in a very similar way.
00:05:05.466 But as the windows get bigger,
00:05:07.066 as it gets to a regime where
00:05:09.033 TCP Reno doesn't work effectively, TCP cubic
00:05:11.900 gets more aggressive in adapting its congestion
00:05:15.200 window, and increases the congestion window much
00:05:17.700 more quickly in response to loss.
00:05:21.833 However, as the window approaches
00:05:25.500 the value it had before the loss,
00:05:29.500 it slows its
00:05:31.333 rate of increase: it starts increasing
00:05:33.833 rapidly, then slows its rate of increase
00:05:36.000 as it approaches the previous value.
00:05:38.533 And if it then successfully manages to
00:05:41.666 send at that rate, if it successfully
00:05:44.166 moves above the previous sending rate,
00:05:47.600 then it gradually increases sending rate again.
00:05:51.800 It’s called TCP Cubic because it follows
00:05:54.733 a cubic equation to do this.
00:05:56.333 The shape of the equation, the shape
00:06:00.200 of the curve, we see on the
00:06:01.600 slide for TCP cubic is following a cubic graph.
00:06:05.600 The paper listed on the slide,
00:06:08.466 the paper shown on the slide,
00:06:09.900 from Injong Rhee and his collaborators,
00:06:13.633 is the paper which describes the algorithm in detail.
00:06:16.666 And it was eventually specified in IETF
00:06:19.833 RFC 8312 in 2018, although it's been
00:06:24.366 probably the most widely used TCP variant
00:06:27.666 for a number of years before that.
00:06:31.200 The details of how it works:
00:06:33.566 TCP cubic is a somewhat more complex
00:06:36.066 algorithm than Reno.
00:06:38.966 There are two parts to the behaviour.
00:06:42.066 If a packet is lost when a
00:06:44.866 TCP cubic sender is in the congestion avoidance phase,
00:06:49.233 it does a multiplicative decrease.
00:06:52.133 However, unlike TCP Reno, which does a
00:06:55.300 multiplicative decrease by multiplying by a factor
00:06:58.766 of 0.5, that is, it halves its
00:07:01.566 sending rate if a single packet is lost,
00:07:04.533 TCP cubic multiplies its rate by 0.7.
00:07:09.500 So, instead of dropping back down to
00:07:11.200 50% of its previous sending rate,
00:07:13.400 it drops down to 70% of the sending rate.
00:07:17.233 It backs-off less, it's more aggressive.
00:07:19.600 It’s more aggressive at using bandwidth.
00:07:23.300 It reduces its sending rate in response
00:07:25.733 to loss, but by a smaller fraction.
00:07:31.866 After it's backed-off, TCP cubic also changes
00:07:36.233 the way in which it increases its sending rate in future.
00:07:40.733 So we saw in the previous slide,
00:07:42.500 TCP Reno increases its congestion window by
00:07:46.100 one, for every round trip when it
00:07:48.600 successfully sends data.
00:07:50.800 So if the window backs off to
00:07:53.033 10, then it goes to 11 the
00:07:54.900 next round trip time, then 12,
00:07:56.700 and 13, and so on, with a
00:07:58.466 linear increase in the window.
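A minimal sketch of that additive increase, assuming no losses occur during the window shown (an illustration only, not a full Reno implementation):

```python
# Additive increase in congestion avoidance: the window grows by one
# packet per round trip while no loss occurs.
def reno_congestion_avoidance(cwnd, rtts):
    """Window size (packets) after each of `rtts` loss-free round trips."""
    return [cwnd + i for i in range(rtts + 1)]

print(reno_congestion_avoidance(10, 3))   # [10, 11, 12, 13]
```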
00:08:02.000 TCP cubic, on the other hand,
00:08:04.033 sets the window as we see in
00:08:06.766 the equation on the slide. It sets
00:08:08.766 the window to be a constant,
00:08:11.233 C, times (T minus K) cubed, plus Wmax.
00:08:17.100 Where the constant, C, is set to
00:08:19.766 0.4, a value which controls
00:08:22.800 how fair it is to TCP Reno,
00:08:25.266 and was determined experimentally.
00:08:28.033 T is the time since the packet
00:08:29.933 loss. K is the time it will
00:08:32.200 take to increase the window back up to
00:08:36.266 the maximum it was before the packet
00:08:40.066 loss, and Wmax is the maximum window
00:08:42.633 size it reached before the loss.
00:08:45.200 And this gives the cubic growth function,
00:08:47.866 which we saw on the previous slide,
00:08:49.600 where the window starts to increase quickly,
00:08:52.033 the growth slows as it approaches that previous value
00:08:55.433 it reached just before the loss,
00:08:57.933 and if it successfully passes through that
00:09:00.033 point, the rate of growth increases again.
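The growth function just described can be written down directly from RFC 8312; the Wmax of 100 packets below is an arbitrary example value:

```python
# CUBIC window growth after a loss, per RFC 8312:
#   W(t) = C * (t - K)^3 + Wmax,   K = cbrt(Wmax * (1 - beta) / C)
C = 0.4      # scaling constant, determined experimentally for Reno-fairness
BETA = 0.7   # multiplicative decrease factor: back off to 70% of Wmax

def cubic_window(t, wmax):
    """Congestion window (packets) t seconds after a loss at window wmax."""
    k = (wmax * (1 - BETA) / C) ** (1 / 3)   # time to climb back to wmax
    return C * (t - k) ** 3 + wmax

wmax = 100.0
k = (wmax * (1 - BETA) / C) ** (1 / 3)
print(cubic_window(0.0, wmax))   # just after the loss: BETA * Wmax = 70 packets
print(cubic_window(k, wmax))     # at t = K the curve plateaus at Wmax
```

Past t = K the cubic term turns positive again, giving the renewed growth described above.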
00:09:03.766 Now, that's the high-level version. And we
00:09:06.666 can already see it's more complex than
00:09:09.266 the TCP Reno equation. The algorithm on
00:09:13.766 the right of the slide, which is
00:09:16.433 intentionally presented in a way which is
00:09:18.933 completely unreadable here,
00:09:21.166 shows the full details. The point is
00:09:24.233 that there's a lot of complexity here.
00:09:27.300 The basic equation, the basic back-off to
00:09:30.766 0.7 times and then follow the cubic
00:09:33.133 equation, to increase rapidly, slow the rate
00:09:36.666 of increase, and then increase rapidly again
00:09:39.100 if it successfully gets past the previous bottleneck point,
00:09:43.133 is enough to illustrate the key principle.
00:09:46.300 The rest of the details are there
00:09:48.133 to make sure it's fair with TCP
00:09:50.066 Reno on links which are slower,
00:09:52.366 or where the round trip time is shorter.
00:09:55.600 And so, in the regime where TCP
00:09:57.733 Reno can successfully make use of the
00:09:59.833 link, TCP Cubic behaves the same way.
00:10:02.866 And, as you get into a regime
00:10:05.000 where Reno can't effectively make use of
00:10:07.666 the capacity, because it can't sustain a
00:10:09.466 large enough congestion window,
00:10:11.133 then cubic starts to behave differently,
00:10:14.433 and starts to switch to the cubic
00:10:16.666 equation. And that allows it to recover
00:10:19.700 from losses more quickly, and to more
00:10:21.833 effectively continue to make use of higher
00:10:23.800 bandwidths and higher latency paths.
00:10:29.200 TCP cubic is the default in most
00:10:33.200 modern operating systems. It’s the default in
00:10:36.866 Linux, it's the default in FreeBSD,
00:10:39.733 I believe it's the default in macOS
00:10:42.733 and iPhones.
00:10:44.666 Microsoft Windows has an algorithm called Compound
00:10:48.566 TCP which is a different algorithm,
00:10:50.900 but has a similar effect.
00:10:54.166 It’s much more complex than TCP Reno.
00:10:56.900 The core response, the back off to
00:11:00.033 70% and then follow the characteristic cubic
00:11:03.900 curve, is conceptually relatively straightforward, but once
00:11:07.733 you start looking at the details of
00:11:09.966 how it behaves, there gets to be a lot of complexity.
00:11:13.833 And most of that is in there
00:11:16.333 to make sure it's reasonably fair to
00:11:19.433 TCP, to TCP Reno, in the regime
00:11:22.833 where Reno typically works. But it improves
00:11:26.233 performance for networks with longer round trip
00:11:28.366 times and higher bandwidths.
00:11:32.033 Both TCP Cubic, and TCP Reno,
00:11:35.933 use congestion control, use packet loss as
00:11:39.800 a congestion signal. And they both eventually
00:11:42.733 fill the router buffers.
00:11:44.533 And TCP cubic does so more aggressively
00:11:47.133 than Reno. So, in both cases,
00:11:49.400 they're trading off latency for throughput.
00:11:51.666 They're trying to make sure the buffers are full.
00:11:53.933 They're trying to make sure
00:11:56.166 the buffers in the intermediate routers are full.
00:11:58.866 And they're both making sure that they
00:12:02.066 keep the congestion window large enough to
00:12:04.433 keep the buffers fully utilised, so packets
00:12:08.633 keep arriving at the receiver at all times.
00:12:11.300 And that's very good for achieving high
00:12:13.033 throughput, but it pushes the latency up.
00:12:16.300 So, again, they’re trading-off increased latency for
00:12:19.933 good performance, for good throughput.
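As a rough illustration of that latency cost: the queuing delay added by a full buffer is just the buffer size divided by the link rate. The 12.5 MB buffer and 1 Gb/s link below are assumed example figures; real router buffers vary widely.

```python
# Queuing delay added by a full router buffer: delay = buffer size / link rate.
buffer_bytes = 12.5e6      # assumed: 12.5 MB of buffered packets
link_rate_bps = 1e9        # assumed: 1 gigabit per second bottleneck link

queuing_delay_ms = buffer_bytes * 8 / link_rate_bps * 1000
print(f"a full buffer adds {queuing_delay_ms:.0f} ms to every round trip")
```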
00:12:25.333 And that's what I want to say
00:12:26.666 about Cubic. Again, the goal is to
00:12:29.566 use a different response function to improve
00:12:32.333 throughput on very fast, long distance, links,
00:12:36.100 multi-gigabit per second transatlantic links, being the
00:12:39.833 common example.
00:12:42.300 And the goal is to make good
00:12:44.966 use of the available capacity.
00:12:47.633 In the next part I’ll talk about
00:12:50.600 alternatives which, rather than focusing on throughput,
00:12:53.800 focus on keeping latency bounded whilst achieving
00:12:57.533 reasonable throughput.
Part 4: Delay-based Congestion Control
The 4th part of the lecture discusses how both the Reno and Cubic algorithms impact latency. It shows how their loss-based response to congestion inevitably causes router queues to fill, increasing path latency, and discusses how this is unavoidable with loss-based congestion control. It introduces the idea of delay-based congestion control and the TCP Vegas algorithm, and highlights its potential benefits and deployment challenges. Finally, TCP BBR is briefly introduced as an experimental extension that aims to achieve some of the benefits of delay-based congestion control, in a deployable manner.
00:00:00.566 In the previous parts, I’ve spoken about
00:00:02.700 TCP Reno and TCP cubic. These are
00:00:05.866 the standard, loss based, congestion control algorithms
00:00:08.966 that most TCP implementations use to adapt
00:00:11.933 their sending rate. These are the standard
00:00:14.933 congestion control algorithms for TCP.
00:00:17.566 What I want to do in this
00:00:19.100 part is recap, why these algorithms cause
00:00:23.033 additional latency in the network, and talk
00:00:25.933 about two alternatives which try to adapt
00:00:29.966 the sending rate of TCP without building
00:00:32.933 up queues, and without
00:00:34.800 overloading the network and causing too much latency.
00:00:40.400 So, as I mentioned, TCP Cubic and
00:00:42.900 TCP Reno both aim to fill up the network.
00:00:46.466 They use packet loss as a congestion signal.
00:00:50.300 So the way they work is they
00:00:52.733 gradually increase their sending rate, they’re in
00:00:55.900 either slow start or congestion avoidance phase,
00:00:58.900 and they’re always gradually increasing the sending
00:01:01.433 rates, gradually filling up the queues in
00:01:03.766 the network, until those queues overflow.
00:01:07.333 At that point a packet is lost.
00:01:09.733 The TCP backs-off its sending rate,
00:01:13.466 it backs-off its window, which allows the
00:01:16.133 queue to drain, but as the queue
00:01:18.200 is draining, both
00:01:19.766 Reno and Cubic are increasing their sending
00:01:22.533 rate, are increasing the sending window,
00:01:25.366 so are to gradually start filling up
00:01:27.833 the queue again.
00:01:29.266 As we saw, the queues in the
00:01:31.400 network oscillate, but they never quite empty.
00:01:34.333 And for both Reno and Cubic, the goal
00:01:36.866 is to keep some packets queued up
00:01:39.766 in the network, make sure there's always
00:01:42.233 some data queued up, so they can
00:01:44.000 keep delivering data.
00:01:47.366 And, no matter how big a queue
00:01:50.300 you put in the network, no matter
00:01:52.200 how much memory you give the routers
00:01:53.866 in the network, TCP Reno and TCP
00:01:57.266 cubic will eventually cause it to overflow.
00:02:00.800 They will keep sending, they'll keep increasing
00:02:04.233 the sending rate, until whatever queue is
00:02:06.866 in the network is full, and it overflows.
00:02:10.333 And the more memory in the routers,
00:02:12.133 the more buffer in the routers,
00:02:13.900 the longer that queue will get and
00:02:15.833 the worse the latency will be.
00:02:18.433 But in all cases, in order to
00:02:21.366 achieve very high throughput, in order to
00:02:23.533 keep the network busy, keep the bottleneck
00:02:25.433 link busy, TCP Reno and TCP cubic
00:02:29.033 queue some data up.
00:02:31.100 And this adds latency.
00:02:34.300 It means that, whenever there’s TCP Reno,
00:02:37.866 whenever there’s TCP cubic flows, using the
00:02:40.300 network, the queues will have data queued up.
00:02:45.800 There’ll always be data queued up for
00:02:47.800 delivery. There's always packets waiting for delivery.
00:02:50.933 So it forces the network to work
00:02:53.133 in a regime where there's always some
00:02:56.566 excess latency.
00:03:01.333 Now, this is a problem for real-time
00:03:05.066 applications. It’s a problem if you're running
00:03:07.233 a video conferencing tool, or a telephone
00:03:11.366 application, or a game, or a real
00:03:13.766 time control application, because you want low
00:03:16.633 latency for those applications.
00:03:19.133 So it would be desirable if we
00:03:21.166 could have an alternative to TCP
00:03:23.600 Reno or TCP cubic that can achieve
00:03:25.800 good throughput for TCP, without forcing the
00:03:28.400 queues to be full.
00:03:31.433 One attempt at doing this was a proposal called TCP Vegas.
00:03:37.366 And the insight from TCP Vegas is that
00:03:42.800 you can watch the rate of growth,
00:03:45.800 or increase, of the queue, and use
00:03:48.633 that to infer whether you're sending faster,
00:03:50.700 or slower, than the network can support.
00:03:54.233 The insight was, if you're sending,
00:03:56.166 if a TCP is sending, faster than
00:03:58.366 the maximum capacity a network can deliver
00:04:00.933 at, the queue will gradually fill up.
00:04:03.500 And as the queue gradually fills up,
00:04:05.533 the latency, the round trip time, will gradually increase.
00:04:10.066 TCP Cubic, and TCP Reno, wait until
00:04:13.933 the queue overflows, wait until there's no
00:04:16.133 more space to put new packets in,
00:04:18.066 and a packet is lost, and at
00:04:19.800 that point they slow down.
00:04:22.666 The insight for TCP Vegas was to
00:04:25.300 watch as the delay increases, and as
00:04:28.500 it sees the delay increasing, it slows
00:04:31.300 down before the queue overflows.
00:04:34.533 So it uses the gradual increase in
00:04:36.366 the round trip time, as an indication
00:04:38.500 that it should send slower.
00:04:40.800 And as the round-trip time reduces,
00:04:43.033 as the round-trip time starts to drop,
00:04:45.066 it treats that as an indication that
00:04:46.933 the queue is draining, which means it can send faster.
00:04:50.766 It wants a constant round trip time.
00:04:53.366 And, if the round trip time increases,
00:04:55.300 it reduces its rate; and if the
00:04:57.933 round-trip time decreases, it increases its rate.
00:05:00.200 So, it's trying to balance its rate
00:05:03.033 with the round trip time, and not
00:05:04.866 build or shrink the queues.
00:05:08.333 And because you can detect the queue
00:05:10.966 building up before it overflows, you can
00:05:14.233 take action before the queue is completely
00:05:16.133 full. And that means the queue is
00:05:18.466 running with lower occupancy, so you have
00:05:21.000 lower latency across the network.
00:05:23.666 It also means that because packets are
00:05:25.533 not being lost, you don't need to
00:05:27.866 re-transmit as many packets. So it improves
00:05:30.700 the throughput that way, because you're not
00:05:32.600 resending data that you've already sent and that has been lost.
00:05:36.633 And that's the fundamental idea of TCP
00:05:38.966 Vegas. It doesn't change the slow start behaviour at all.
00:05:42.566 But, once you're into congestion avoidance,
00:05:44.900 it looks at the variation in round
00:05:47.100 trip time rather than looking at packet
00:05:49.200 loss, and uses that to drive the
00:05:51.366 variation in the speed at which it’s sending.
00:05:56.566 The details of how it works.
00:05:59.466 Well, first, it tries to estimate what
00:06:01.766 it calls the base round trip time.
00:06:04.766 So every time it sends a packet,
00:06:07.033 it measures how long it takes to
00:06:08.733 get a response. And it tries to
00:06:10.800 find the smallest possible response time.
00:06:14.166 The idea being that the smallest time
00:06:17.366 it gets a response, would be the
00:06:18.833 time when the queue is at its emptiest.
00:06:21.766 It may not get the actual,
00:06:23.466 completely empty, queue, but by taking the
00:06:26.066 smallest response time, it's trying to estimate the
00:06:29.866 time a packet takes when there's nothing else queued in the network.
00:06:34.066 And anything on top of that indicates
00:06:36.233 that there is data queued up somewhere in the network.
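A minimal sketch of that base round trip time estimate; the RTT samples are invented example values:

```python
# TCP Vegas base RTT: the smallest round trip time observed so far, taken
# as an estimate of the path delay when the queues are (nearly) empty.
def base_rtt(samples):
    """Return the minimum RTT sample seen so far, in seconds."""
    return min(samples)

rtt_samples = [0.112, 0.104, 0.131, 0.100, 0.125]  # invented measurements
print(base_rtt(rtt_samples))   # 0.1 -- anything above this is queuing delay
```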
00:06:41.133 Then it calculates an expected sending rate.
00:06:45.266 It takes the window size, which indicates
00:06:48.033 how many packets it's supposed to send
00:06:50.533 in that round-trip time,
00:06:52.533 how many bytes of data it’s supposed
00:06:54.366 to send in that round-trip time,
00:06:56.066 and it divides it by the base
00:06:57.433 round trip time. So if you divide
00:07:00.633 number of bytes by time, you get
00:07:03.166 a bytes per second, and that gives
00:07:05.566 you the rate at which it should be sending data.
00:07:09.333 And if the network can
00:07:12.033 support sending at that rate, it should
00:07:14.366 be able to deliver that window of
00:07:17.800 packets within a complete round trip time.
00:07:20.866 And, if it can’t, it will take
00:07:22.566 longer than a round trip time to
00:07:24.300 deliver that window of packets, and the
00:07:25.866 queues will be gradually building up. Alternatively,
00:07:28.866 if it takes less than a round
00:07:30.333 trip time, this is an indication that
00:07:31.900 the queues are decreasing.
00:07:35.500 And it measures the actual rate at
00:07:37.100 which it sends the packets.
00:07:39.466 And it compares them.
00:07:41.600 And if the actual rate at which
00:07:43.166 it's sending packets is less than the
00:07:45.466 expected rate, if it's taking longer than
00:07:47.733 a round-trip time to deliver the complete
00:07:49.633 window worth of packets, this is a
00:07:51.700 sign that the packets can’t all be delivered.
00:07:56.866 And it, you know, it's trying to send too
00:07:59.966 much. It’s trying to send at too
00:08:01.666 fast a rate, and it should reduce
00:08:03.166 its rate and let the queues drop.
00:08:05.900 Equally, in the other case it should
00:08:08.333 increase its rate, and measuring the difference
00:08:10.966 between the actual and the expected rates,
00:08:13.800 it can measure whether the queue is growing or shrinking.
00:08:18.733 And TCP Vegas compares the expected rate
00:08:21.966 with the actual rate it manages to send at,
00:08:24.566 the rate at which it gets
00:08:26.566 the acknowledgments back.
00:08:30.600 And it adjusts the window.
00:08:34.333 And if the expected rate, minus the
00:08:37.700 actual rate, is less than some threshold,
00:08:40.700 that indicates that it should increase its
00:08:43.666 window. And if the expected rate,
00:08:45.933 minus the actual rate, is greater than
00:08:48.000 some other threshold, then it should decrease the window.
00:08:51.266 That is, if data is arriving at
00:08:53.633 the expected rate, or very close to
00:08:56.200 it, this is probably a sign that
00:08:58.366 the network can support a higher rate,
00:09:00.533 and you should try sending a little bit faster.
00:09:03.566 Alternatively, if data is arriving slower
00:09:06.133 than it's being sent,
00:09:07.133 this is a sign that you're sending too fast and you
00:09:09.233 should slow down.
00:09:10.833 And the two thresholds, R1 and R2,
00:09:12.933 determine how close you have to be
00:09:15.033 to the expected rate, and how far
00:09:16.866 away from it you have to be in order to slow down.
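Putting those pieces together, here is a sketch of the per-round-trip Vegas decision. The threshold values are assumptions, in the range commonly used in the Vegas literature (where they are usually called alpha and beta rather than R1 and R2):

```python
# TCP Vegas congestion-avoidance decision, once per round trip (a sketch).
# The thresholds are measured in packets of "extra" data queued in the network.
R1, R2 = 2, 4   # assumed values, common choices in the Vegas literature

def vegas_update(cwnd, base_rtt, actual_rtt):
    """Return the new window given the base and most recent RTT (seconds)."""
    expected = cwnd / base_rtt             # rate if no queuing (packets/s)
    actual = cwnd / actual_rtt             # rate actually achieved
    diff = (expected - actual) * base_rtt  # extra packets queued in the network
    if diff < R1:
        return cwnd + 1    # queues near empty: probe for more bandwidth
    if diff > R2:
        return cwnd - 1    # queues building: back off before any loss
    return cwnd            # in balance: hold the rate steady

print(vegas_update(20, 0.100, 0.101))  # little queuing: window grows to 21
print(vegas_update(20, 0.100, 0.150))  # queue building: window shrinks to 19
```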
00:09:20.733 And the result is that TCP Vegas
00:09:24.700 follows a much smoother transmission rate.
00:09:28.300 Unlike TCP Reno, which follows the characteristic
00:09:31.700 sawtooth pattern, or TCP cubic which follows the
00:09:35.866 cubic equation to change its rate,
00:09:39.533 both of which adapt quite abruptly whenever
00:09:43.233 there's a packet loss,
00:09:44.933 TCP Vegas makes a gradual change.
00:09:47.266 It gradually increases, or decreases, it’s sending
00:09:50.466 rate in line with the variations in
00:09:52.833 the queues. So, it’s a much smoother
00:09:54.900 algorithm, which doesn't continually build up and
00:09:58.266 empty the queues.
00:10:01.166 Because the queues are not continually building
00:10:03.966 up, not continually being filled, this keeps
00:10:08.366 the latency down
00:10:09.400 while still achieving reasonably good performance.
00:10:15.833 TCP Vegas is a good idea in principle.
00:10:21.633 This idea is known as delay-based congestion
00:10:24.600 control, and I think it's actually a
00:10:26.500 really good idea in principle. It reduces
00:10:29.666 the latency, because it doesn't fill the queues.
00:10:33.100 It reduces the packet loss, because it's
00:10:35.300 not causing, it's not pushing the queues
00:10:38.133 to overflow and causing packets to be
00:10:39.833 lost. So the only packet losses you
00:10:42.233 get are those caused by transmission problems.
00:10:45.433 And this reduces unnecessary retransmissions,
00:10:48.766 because you're not forcing the
00:10:50.633 network into overload, and forcing it to
00:10:52.633 lose the packets, and it reduces the latency.
00:10:57.200 The problem with TCP Vegas is that
00:11:00.600 it doesn't interwork with
00:11:03.900 TCP Reno or TCP cubic.
00:11:07.833 If you have any TCP Reno or
00:11:10.200 Cubic flows on the network, they will
00:11:12.300 aggressively increase their sending rate and try
00:11:15.300 to fill the queues, and push
00:11:17.300 the queues into overload.
00:11:19.966 And this will increase the round-trip time,
00:11:22.966 reduce the rate at which Vegas can
00:11:26.300 send, and it will force TCP Vegas to slow down.
00:11:30.033 Because TCP Vegas sees the queues increasing,
00:11:33.033 because Cubic and Reno are intentionally trying
00:11:36.266 to fill those queues, and if the
00:11:38.333 queues increase, this causes Vegas to slow down.
00:11:41.200 That gradually means there's more space in
00:11:44.200 the queues, which Cubic and Reno will
00:11:46.633 gradually fill-up, which causes Vegas to slow
00:11:49.200 down, and they end up in a
00:11:50.900 spiral, where the TCP Vegas flows get
00:11:52.800 pushed down to zero, and the Reno
00:11:55.700 or Cubic flows use all of the capacity.
00:11:59.333 So if we only have TCP Vegas
00:12:01.400 in the network, I think it would
00:12:03.466 behave really nicely, and we get really
00:12:05.500 good, low latency, behaviour from the network.
00:12:08.900 Unfortunately we're in a world where Reno,
00:12:11.933 and Cubic, have been deployed everywhere.
00:12:14.733 And without a step change, without an
00:12:18.933 overnight switch where we turn off Cubic,
00:12:21.966 and we turn off Reno, and we
00:12:23.366 turn on Vegas everywhere, we can't deploy
00:12:25.900 TCP Vegas, because it always loses out to
00:12:28.866 Reno and Cubic.
00:12:31.166 So, it's a good idea in principle,
00:12:33.233 but in practice it can't be used
00:12:35.033 because of the deployment challenge.
00:12:40.600 As I say, it's a good idea
00:12:42.733 in principle, and the idea of using
00:12:45.433 delay as a congestion signal is a
00:12:47.766 good idea in principle, because we can
00:12:50.066 get something which achieves lower latency.
00:12:54.866 Is it possible to deploy a different
00:12:57.733 algorithm? Maybe the problem is not the principle,
00:13:00.266 maybe the problem is the algorithm in TCP Vegas?
00:13:05.466 Well, people are trying alternatives which are delay based.
00:13:10.233 And the most recent attempt at this
00:13:12.966 is an algorithm called TCP BBR,
00:13:15.200 Bottleneck Bandwidth and Round-trip time.
00:13:18.466 And again, this is a proposal that
00:13:20.533 came out of Google. And one of
00:13:23.133 the co-authors, if you look at the
00:13:25.533 paper on the right, is Van Jacobson,
00:13:28.033 who was the original designer of TCP
00:13:30.300 congestion control. So there's clearly some smart
00:13:32.833 people behind this.
00:13:34.600 The idea is that it tries to explicitly
00:13:36.966 measure the round-trip time as it sends
00:13:39.500 the packets. It tries to explicitly measure
00:13:42.133 the sending rate in much the same way that
00:13:45.666 TCP Vegas does. And, based on those
00:13:48.233 measurements, and some probes where it varies
00:13:51.533 its rate to try and find if
00:13:53.400 it's got more capacity, or try and
00:13:55.400 sense if there is other traffic on the network,
00:13:58.533 it tries to directly set a congestion
00:14:01.066 window that matches the network capacity,
00:14:04.066 based on those measurements.
00:14:06.533 And, because this came out of Google,
00:14:08.600 it got a lot of press,
00:14:10.666 and Google turned it on for a
00:14:13.533 lot of their traffic. I know they
00:14:15.433 were running it for YouTube for a
00:14:16.866 while, and a lot of people saw
00:14:18.966 this, and jumped on the bandwagon.
00:14:21.333 And, for a while, it was starting
00:14:23.100 to get a reasonable amount of deployments.
00:14:27.100 The problem is, it turns out not to work very well.
00:14:31.066 And Justine Sherry at Carnegie Mellon University,
00:14:36.733 and her PhD student Ranysha Ware,
00:14:39.500 did a really nice bit of work
00:14:41.533 that showed that it is incredibly unfair to
00:14:44.400 regular TCP traffic.
00:14:46.766 And, it's unfair in kind-of the opposite
00:14:49.633 way to Vegas. Whereas TCP Reno and
00:14:53.600 TCP Cubic would force TCP Vegas flows
00:14:56.400 down to nothing, TCP BBR is unfair
00:14:59.766 in the opposite way, and it demolishes
00:15:02.600 Reno and Cubic flows, and causes tremendous
00:15:05.266 amounts of packet loss for those flows.
00:15:08.266 So it's really much more aggressive than
00:15:11.133 the other flows in certain cases,
00:15:13.233 and this leads to really quite severe unfairness problems.
00:15:17.533 And the Vimeo link on the slide is a link to the talk at
00:15:24.133 the Internet Measurement Conference, where Ranysha talks
00:15:28.233 through that, and demonstrates really clearly that
00:15:30.966 TCP BBR is really quite problematic, and
00:15:36.033 not very safe to deploy on the current network.
00:15:41.066 And there's a variant called
00:15:43.100 BBR v2, which is under development,
00:15:46.266 and seems to be changing,
00:15:48.566 certainly on a monthly basis, which is
00:15:51.433 trying to solve these problems. And this
00:15:53.866 is very much an active research area,
00:15:55.833 where people are looking to find better alternatives.
00:16:01.966 So that's the principle of delay-based congestion control.
00:16:05.400 Traditional TCP, the Reno algorithm and the
00:16:09.100 Cubic algorithms, intentionally try to fill the
00:16:12.166 queues, they intentionally try to cause latency.
00:16:16.633 TCP Vegas is one well-known algorithm which
00:16:20.833 tries to solve this, and
00:16:24.200 doesn't work in practice, but in principle
00:16:27.766 is a good idea, it just has
00:16:30.033 some deployment challenges, given the installed base
00:16:32.800 of Reno and Cubic.
00:16:35.366 And there are new algorithms, like TCP
00:16:38.200 BBR, which don't currently work well,
00:16:41.466 but have potential to solve this problem.
00:16:44.466 And, hopefully, in the future, a future
00:16:47.166 variant of BBR will work effectively,
00:16:51.800 and we'll be able to transition to
00:16:53.633 a lower latency version of TCP.
Part 5: Explicit Congestion Notification
The use of delay-based congestion control is one way of reducing network latency. Another is to keep Reno and Cubic-style congestion control, but to move away from using packet loss as an implicit congestion signal, and instead provide an explicit congestion notification from the network to the applications. This part of the lecture introduces the ECN extension to TCP/IP that provides such a feature, and discusses its operation and deployment.
00:00:00.433 In the previous parts of the lecture,
00:00:02.166 I’ve discussed TCP congestion control. I’ve discussed
00:00:05.566 how TCP tries to measure what the
00:00:07.700 network's doing and, based on those measurements,
00:00:10.266 adapt its sending rate to match the
00:00:12.433 available network capacity.
00:00:14.466 In this part, I want to talk
00:00:15.866 about an alternative technique, known as Explicit
00:00:18.300 Congestion Notification, which allows the network to
00:00:20.733 directly tell TCP when it's sending too
00:00:22.966 fast, and needs to reduce its transmission rate.
00:00:28.500 So, as we've discussed, TCP infers the
00:00:31.833 presence of congestion in the network through measurement.
00:00:36.066 If you're using TCP Reno or TCP
00:00:39.066 Cubic, like most TCP flows in the
00:00:42.466 network today, then the way it infers
00:00:45.500 that is because there's packet loss.
00:00:48.033 TCP Reno and TCP Cubic keep gradually
00:00:51.400 increasing their sending rates, trying to cause
00:00:54.333 the queues to overflow.
00:00:56.200 And they cause a queue overflow,
00:00:58.366 cause a packet to be lost,
00:00:59.800 and use that packet loss as the
00:01:01.366 signal that the network is busy,
00:01:04.200 that they've reached the network capacity,
00:01:05.966 and they should reduce the sending rate.
00:01:09.066 And this is problematic for two reasons.
00:01:11.866 First, is because it increases delay.
00:01:15.266 It's continually pushing the queues to be
00:01:18.266 full, which means the network’s operating with
00:01:20.833 full queues, with its maximum possible delay.
00:01:24.400 And the second is because it makes
00:01:27.066 it difficult to distinguish loss which is
00:01:29.533 caused because the queues overflowed, from loss
00:01:32.766 caused because of a transmission error on
00:01:35.900 a link, so called non-congestive loss,
00:01:38.533 which you might get due to interference on a wireless link.
00:01:43.766 The other approach people have discussed,
00:01:45.666 is the approach in TCP Vegas,
00:01:48.233 where you look at variation in queuing latency
00:01:51.500 and use that as an indication of congestion.
00:01:54.400 So, rather than pushing the queue until
00:01:56.333 it overflows, and detecting the overflow,
00:01:58.866 you watch to see as the queue
00:02:00.733 starts to get bigger, and use that
00:02:02.633 as an indication that you should reduce
00:02:04.233 your sending rate. Or, equally, you spot
00:02:07.300 the queue getting smaller, and use that
00:02:08.900 as an indication that you should maybe
00:02:10.466 increase your sending rate.
00:02:12.700 And this is conceptually a good idea,
00:02:14.566 as we discussed in the last part,
00:02:16.733 because it lets you run TCP with
00:02:18.866 lower latency. But it's difficult to deploy,
00:02:21.833 because it interacts poorly with TCP Cubic
00:02:25.333 and TCP Reno, both of which try
00:02:27.833 to fill the queues.
00:02:31.966 As a result, we're stuck with using
00:02:34.333 Reno and Cubic, and we're stuck with
00:02:36.333 full queues in the network. But we'd
00:02:38.900 like to avoid this, we'd like to
00:02:40.466 go for a lower latency way of
00:02:42.666 using TCP, and make the network work
00:02:45.533 without filling the queues.
00:02:49.300 So one way you might go about
00:02:50.766 doing this is, rather than have TCP
00:02:54.200 push the queues to overflow,
00:02:56.966 have the network rather tell TCP when
00:02:59.866 it's sending too fast.
00:03:02.433 Have something in the network tell the
00:03:04.933 TCP connections that they are congesting the
00:03:07.666 network, and they need to slow down.
00:03:11.233 And this thing is called Explicit Congestion Notification.
00:03:17.333 Explicit Congestion Notification, the ECN bits,
00:03:21.733 are present in the IP header.
00:03:25.266 The slide shows an IPv4 header with
00:03:27.833 the ECN bits indicated in red.
00:03:30.333 The same bits are also present in
00:03:32.500 IPv6, and they're located in the same
00:03:34.766 place in the packet in the IPv6 header.
00:03:38.066 The way these are used is as follows.
00:03:40.233 If the sender doesn't support ECN,
00:03:42.866 it sets these bits to zero when
00:03:44.700 it transmits the packet. And they stay
00:03:46.866 at zero, nothing touches them at that point.
00:03:50.233 However, if the sender does support ECN,
00:03:52.933 it sets these bits to have
00:03:54.700 the value 01, so it sets bit
00:03:57.400 15 of the header to be 1,
00:04:00.433 and it transmits the IP packet as
00:04:02.933 normal, except with this one bit set
00:04:05.066 to indicate that the sender understands ECN.
00:04:10.000 If congestion occurs in the network,
00:04:12.966 if some queue in the network is
00:04:16.333 beginning to get full, it’s not yet
00:04:19.266 at the point of overflow but it's
00:04:20.733 beginning to get full, such that some
00:04:22.800 router in the network thinks it's about
00:04:24.833 to start experiencing congestion,
00:04:27.200 then that router changes those bits
00:04:30.100 in the IP header of some of
00:04:32.433 the packets going past,
00:04:34.233 and sets both of the ECN bits to one.
00:04:38.266 This is known as an ECN Congestion Experienced mark.
00:04:42.333 It's a signal. It's a signal from
00:04:44.966 the network to the endpoints, that the
00:04:47.500 network thinks it's getting busy, and the
00:04:49.266 endpoint should slow down.
00:04:53.266 And that's all it does. It monitors
00:04:55.466 the occupancy in the queues, and if
00:04:57.766 the queue occupancy is higher than some
00:04:59.466 threshold, it sets the ECN bits in
00:05:01.666 the packets going past, to indicate that
00:05:04.766 the threshold has been reached and the network
00:05:06.766 is starting to get busy.
00:05:09.233 If the queue overflows,
00:05:11.133 if the endpoints keep sending faster and
00:05:13.866 the queue overflows, then it drops
00:05:15.466 packets as normal. The only difference
00:05:17.433 is that there's some intermediate point where
00:05:19.766 the network is starting to get busy,
00:05:21.500 but the queue has not yet overflowed.
00:05:23.966 And at that point, the network marks
00:05:25.666 the packets to indicate that it's getting busy.
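The router-side behaviour described above can be sketched in a few lines. This is a hypothetical illustration, not router source code: the codepoint names follow the lecture's description of the two-bit ECN field, and the marking threshold is an assumed, illustrative value.

```python
# Hypothetical sketch of router-side ECN marking. The codepoint values follow
# the two-bit ECN field described in the lecture; MARK_THRESHOLD is an
# illustrative assumption, not a standard value.

NOT_ECT = 0b00  # sender does not support ECN; bits left at zero
ECT_1   = 0b01  # ECN-capable transport (the 01 value mentioned in the lecture)
ECT_0   = 0b10  # ECN-capable transport (the alternative codepoint)
CE      = 0b11  # Congestion Experienced mark, set by a busy router

MARK_THRESHOLD = 20  # packets queued before marking starts (assumed)

def process_packet(ecn_bits: int, queue_depth: int, queue_limit: int):
    """Return (action, ecn_bits) for one arriving packet."""
    if queue_depth >= queue_limit:
        # Queue overflow: drop, exactly as a non-ECN router would
        return ("drop", ecn_bits)
    if queue_depth >= MARK_THRESHOLD and ecn_bits in (ECT_0, ECT_1):
        # Getting busy, but not yet overflowing: mark instead of dropping
        return ("forward", CE)
    return ("forward", ecn_bits)
```

Note that packets from non-ECN senders pass through unchanged; they only ever see the drop behaviour.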
00:05:32.100 A receiver might get a TCP packet,
00:05:35.133 a TCP segment, delivered within an IP
00:05:37.866 packet, where that IP packet has the
00:05:40.700 ECN Congestion Experienced mark set. Where the
00:05:43.666 network has changed those two bits in
00:05:45.766 the IP header to 11, to indicate
00:05:48.800 that it's experiencing congestion.
00:05:52.366 What it does at that point
00:05:54.666 is it sets a bit in
00:05:58.100 the TCP header of the acknowledgement packet
00:06:01.600 it sends back to the sender.
00:06:04.266 That bit’s known as the ECN Echo
00:06:06.866 field, the ECE field. It sets this
00:06:09.933 bit in the TCP header equal to
00:06:12.633 one on the next packet it sends
00:06:15.600 back to the sender, after it received
00:06:18.033 the IP packet, containing the TCP segment,
00:06:21.400 where that IP packet was marked Congestion Experienced.
00:06:26.133 So the receiver doesn't really do anything
00:06:28.833 with the Congestion Experienced mark, other than
00:06:31.233 set the equivalent mark in the
00:06:33.533 packet it sends back to the sender.
00:06:35.866 So it's telling the sender, “I got
00:06:37.733 a Congestion Experienced mark in one of
00:06:39.900 the packets you sent”.
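The receiver's role, as described above, is just to reflect the mark back. A minimal sketch, with illustrative field names (the real TCP header encodes these as flag bits, not a dictionary):

```python
# Minimal sketch of the receiver behaviour: if the IP packet carrying a TCP
# segment arrived with the CE (11) mark, set the ECN Echo (ECE) bit on the
# acknowledgement sent back. The dictionary representation is illustrative.

CE = 0b11  # Congestion Experienced codepoint in the IP header

def build_ack(ip_ecn_bits: int) -> dict:
    """Return simplified TCP header flags for the acknowledgement."""
    return {"ACK": True, "ECE": ip_ecn_bits == CE}
```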
00:06:43.600 When that packet gets to the sender,
00:06:46.600 the sender sees this bit in the
00:06:48.866 TCP header, the ECN Echo bit set
00:06:52.133 to one, and it realises that the
00:06:54.200 data it was sending
00:06:56.433 caused a router on the path to
00:07:00.000 set the ECN Congestion Experienced mark,
00:07:03.000 which the receiver has then fed back to it.
00:07:07.333 And what it does at that point,
00:07:09.100 is it reduces its congestion window.
00:07:11.800 It acts as-if a packet had been
00:07:15.000 lost, in terms of how it changes its congestion window.
00:07:19.066 So if it's a TCP Reno sender,
00:07:21.733 it will halve its congestion window,
00:07:24.200 the same way it would if a packet was lost.
00:07:27.000 If it's a TCP Cubic sender,
00:07:29.200 it will back off its congestion window
00:07:31.533 to 70%, and then follow the
00:07:35.533 cubic equation for changing its congestion window.
00:07:41.033 After it does that, it sets another
00:07:43.900 bit in the header of the next
00:07:47.366 TCP segment it sends out. It sets
00:07:49.900 the CWR bit, the Congestion Window Reduced
00:07:52.533 bit, in the header to tell the
00:07:54.533 network and the receiver that it's done it.
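The sender's reaction can be sketched as follows. The back-off multipliers match the lecture (Reno halves its window, Cubic backs off to 70%); the function shape and names are illustrative assumptions, not real kernel code.

```python
# Illustrative sketch of the sender-side ECN reaction: on seeing the ECN Echo
# (ECE) bit in an acknowledgement, reduce the congestion window as-if a packet
# had been lost, and set CWR on the next outgoing segment. Multipliers follow
# the lecture; everything else is an assumption for illustration.

def on_ack(cwnd: float, ece_set: bool, algorithm: str = "reno"):
    """Return (new_cwnd, set_cwr_on_next_segment)."""
    if not ece_set:
        return (cwnd, False)          # no congestion signal: window unchanged
    if algorithm == "reno":
        return (cwnd / 2, True)       # halve, as for a packet loss
    if algorithm == "cubic":
        return (cwnd * 0.7, True)     # back off to 70%, then cubic growth resumes
    raise ValueError(f"unknown algorithm: {algorithm}")
```

The key point is that the window change is the same as for a loss; only the trigger (a mark rather than a drop) differs, so no retransmission is needed.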
00:07:59.200 So the end result of this,
00:08:00.933 is that rather than a packet being lost
00:08:03.900 because the queue overflowed, and then the
00:08:06.500 acknowledgments coming back indicating, via the triple
00:08:09.466 duplicate ACK, that a packet had been
00:08:11.166 lost, and then TCP reducing its congestion
00:08:14.266 window and re-transmitting that lost packet.
00:08:17.866 What happens is,
00:08:20.633 the IP packets, TCP packets, in the
00:08:24.366 outbound direction get a Congestion Experienced mark
00:08:27.400 set, to indicate that the network is
00:08:29.566 starting to get full.
00:08:31.633 The ECN Echo bit is set on
00:08:33.500 the reply, and at that point the
00:08:35.666 sender reduces its window,
00:08:37.733 as-if the loss had occurred.
00:08:42.700 And then carries on sending with the
00:08:44.633 CWR bit set to one on that
00:08:46.533 next packet. So it has the same
00:08:49.000 effect, in terms of reducing the congestion window, as would
00:08:52.600 dropping a packet, but without dropping a
00:08:54.766 packet. So there's no actual packet loss
00:08:56.933 here, there’s just a mark to indicate
00:08:58.833 that the network was getting busy.
00:09:00.500 So it doesn't have to retransmit data,
00:09:02.666 and this happens before the queue is
00:09:04.500 full, so you get lower latency.
00:09:08.300 So ECN is a mechanism to allow
00:09:11.766 TCP to react to congestion before packet loss occurs.
00:09:16.600 It allows routers in the network to
00:09:18.700 signal congestion before the queue overflows.
00:09:21.866 It allows routers in the network to
00:09:23.500 say to TCP, “if you don't slow
00:09:25.566 down, this queue is going to overflow,
00:09:27.900 and I’m going to throw your packets away”.
00:09:31.533 It's independent of how TCP then responds;
00:09:34.366 whether it follows Reno or Cubic or
00:09:37.466 Vegas doesn't really matter, it's just
00:09:39.600 an indication that it needs to slow
00:09:41.266 down because the queues are starting to
00:09:43.166 build up, and will overflow soon if it doesn't.
00:09:47.466 And if TCP reacts to that,
00:09:49.400 reacts to the ECN Echo bit going
00:09:51.566 back, and the sender reduces its rate,
00:09:53.966 the queues will empty, the router will
00:09:55.700 stop marking the packets, and everything will
00:09:57.900 settle down at a slightly slower rate
00:10:00.300 without causing any packet loss.
00:10:02.733 And the system will adapt, and it
00:10:05.600 will achieve the same sort
00:10:07.800 of throughput, it will just react earlier,
00:10:11.100 so you have smaller queues and lower latency.
00:10:14.500 And this gives you the same throughput
00:10:16.966 as you would with TCP Reno or
00:10:20.400 TCP Cubic, but with low latency,
00:10:22.333 which means it's better for competing video
00:10:25.100 conferencing or gaming traffic.
00:10:28.433 And I’ve described the mechanism for TCP,
00:10:31.066 but there are similar ECN extensions for
00:10:33.833 QUIC and for RTP, which is the
00:10:36.566 video conferencing protocol, all designed to achieve
00:10:39.933 the same goal.
00:10:44.400 So ECN, I think, is unambiguously a
00:10:47.100 good thing. It’s a signal from the
00:10:48.866 network to the endpoints that the network
00:10:50.966 is starting to get congested, and the
00:10:52.866 endpoints should slow down.
00:10:54.500 And if the endpoints believe it,
00:10:56.666 if they back off,
00:10:58.500 they reduce their sending rate before the
00:11:00.900 network is overloaded, and we end up
00:11:03.966 in a world where we still
00:11:06.966 achieve good congestion control, good throughput,
00:11:11.133 but with lower latency.
00:11:13.100 And, if the endpoints don't believe it,
00:11:15.200 well, eventually, the queues
00:11:17.233 overflow and packets are lost, and we’re
00:11:19.100 no worse-off than we are now.
00:11:22.133 In order to deploy ECN, though,
00:11:25.600 we need to make changes. We need
00:11:27.900 to change the endpoints, to change the
00:11:29.700 end systems, to support these bits in
00:11:31.766 the IP header, and to support,
00:11:33.766 to add support for this into TCP.
00:11:36.500 And we need to update the routers,
00:11:38.666 to actually mark the packets when they're
00:11:40.333 starting to get overloaded.
00:11:44.333 Updating the end points has pretty much
00:11:47.066 been done by now.
00:11:49.200 I think every TCP implementation,
00:11:54.100 implemented in the last 15-20 years or
00:11:57.200 so, supports ECN, and these days,
00:12:00.000 most of them have it turned on by default.
00:12:04.266 And I think we actually have Apple
00:12:06.866 to thank for this.
00:12:09.033 ECN, for a long time, was implemented
00:12:12.900 but turned off by default, because there’d
00:12:15.233 been problems with some old firewalls which
00:12:17.900 reacted badly to it, 20 or so years ago.
00:12:22.233 And, relatively recently, Apple decided that they
00:12:25.666 wanted these lower latency benefits, and they
00:12:29.833 thought ECN should be deployed. So they
00:12:32.566 started turning it on by default in the iPhone.
00:12:37.100 And they kind-of followed an interesting approach.
00:12:40.100 In that, for iOS 9, a random
00:12:43.133 subset of 5% of iPhones would turn
00:12:46.233 on ECN for some of their connections.
00:12:51.433 And they measured what happened. And they
00:12:54.233 found out that in the overwhelming majority
00:12:56.433 of cases this worked fine, and occasionally
00:12:59.133 it would fail.
00:13:01.400 And they would call up the network
00:13:03.966 operators whose networks were showing problems,
00:13:07.433 and they would say “your network doesn't
00:13:10.333 work with iPhones; and currently it's not
00:13:12.800 working well with 5% of iPhones but
00:13:15.233 we're going to increase that number,
00:13:16.933 and maybe you should fix it”.
00:13:19.600 And then, a year later, when iOS
00:13:21.633 10 came out, they did this 50%
00:13:24.066 of connections made by iPhones. And then
00:13:26.933 a year later, for all of the connections.
00:13:30.000 And it's amazing what impact a
00:13:34.200 popular vendor calling up a network operator can
00:13:41.433 have on getting them to fix the equipment.
00:13:45.066 And, as a result,
00:13:47.200 ECN is now widely enabled by default
00:13:50.500 in the phones, and the network seems
00:13:53.333 to support it just fine.
00:13:56.300 Most of the routers also support ECN.
00:13:58.833 Although currently relatively few of them seem
00:14:01.400 to enable it by default. So most
00:14:04.066 of the endpoints are now
00:14:05.633 at the stage of sending ECN enabled
00:14:08.166 traffic, and are able to react to
00:14:10.900 the ECN marks, but most of the
00:14:13.400 networks are not currently setting the ECN marks.
00:14:16.933 This is, I think, starting to change.
00:14:19.533 Some of the recent DOCSIS, which is
00:14:22.266 the cable modem standards, are starting to
00:14:26.400 support ECN. We’re starting to see
00:14:29.500 cable modems, cable Internet connections, which enable
00:14:33.566 ECN by default.
00:14:35.866 And, we're starting to see interest from
00:14:38.900 3GPP, which is the mobile phone standards
00:14:41.100 body, to enable this in 5G
00:14:43.933 and 6G networks, so I think it's coming,
00:14:47.100 but it's going to take time.
00:14:49.066 And, I think, as it comes,
00:14:51.233 as ECN gradually gets deployed, we’ll gradually
00:14:53.766 see a reduction in latency across the
00:14:56.000 networks. It’s not going to be dramatic.
00:14:59.400 It's not going to suddenly transform the
00:15:01.300 way the network behaves, but hopefully over
00:15:04.033 the next 5 or 10 years we’ll
00:15:06.166 gradually see the latency reducing as ECN
00:15:09.433 gets more widely deployed.
00:15:13.900 So that's what I want to say
00:15:15.800 about ECN. It’s a mechanism by which
00:15:17.966 the network can signal to the applications
00:15:20.133 that the network is starting to get
00:15:22.033 overloaded, and allow the applications to back
00:15:24.433 off more quickly, in a way which
00:15:26.966 reduces latency and reduces packet loss.
Part 6: Light Speed?
The final part of the lecture moves on from congestion control and queueing, and discusses another factor that affects latency: the network propagation delay. It outlines what the propagation delay is, and ways in which it can be reduced, including more direct paths and the use of low-Earth orbit satellite constellations.
00:00:00.433 In this final part of the lecture,
00:00:02.100 I want to move on from talking
00:00:03.600 about congestion control, and the impact of
00:00:05.733 queuing delays on latency, and talk instead
00:00:08.233 about the impact of propagation delays.
00:00:12.300 So, if you think about the latency
00:00:15.166 for traffic being delivered across the network,
00:00:17.433 there are two factors which impact that latency.
00:00:21.433 The first is the time packets spent
00:00:23.933 queued up at various routers within the network.
00:00:28.033 As we've seen in the previous parts
00:00:29.733 of this lecture, this is highly influenced
00:00:32.033 by the choice of TCP congestion control,
00:00:35.100 and whether Explicit Congestion Notification
00:00:37.566 is enabled or not.
00:00:39.533 The other factor, that we've not really
00:00:41.900 discussed to date, is the time it
00:00:44.066 takes the packets to actually propagate down
00:00:46.333 the links between the routers. This depends
00:00:48.700 on the speed at which the signal
00:00:50.500 propagates down the transmission medium.
00:00:53.233 If you're using an optical fibre to
00:00:55.233 transmit the packets, it depends on the
00:00:57.333 speed at which the light propagates through the fibre.
00:01:00.700 If you're using electrical signals in a
00:01:03.133 cable, it depends on the speed at
00:01:04.933 which the electrical signal propagates down the cable.
00:01:07.600 And if you're using radio signals,
00:01:09.366 it depends on the speed of light,
00:01:11.100 the speed at which the radio signals
00:01:12.666 propagate through the air.
00:01:17.000 As you might expect, physically shorter links
00:01:21.100 have lower propagation delays.
00:01:23.533 A lot of the time it takes
00:01:25.600 a packet to get down a long
00:01:27.233 distance link is just the time it
00:01:29.400 takes the signal to physically transmit along
00:01:32.133 the link. If you make the link
00:01:33.633 shorter it takes less time.
00:01:37.300 And what is perhaps not so obvious,
00:01:40.500 though, is that you can actually get
00:02:43.000 significant latency benefits on certain paths,
00:01:48.166 because the existing network links follow quite
00:01:51.533 indirect routes.
00:01:53.766 For example, if you look at the
00:01:55.566 path the network links take, if you're
00:01:58.066 sending data from Europe to Japan.
00:02:01.066 Quite often, that data goes from Europe,
00:02:03.900 across the Atlantic to, for example,
00:02:06.533 New York or Boston, or somewhere like
00:02:08.900 that, across the US to
00:02:12.866 San Francisco, or Los Angeles, or Seattle,
00:02:17.000 or somewhere along those lines, and then
00:02:19.600 from there, in a cable across the
00:02:21.966 Pacific to Japan.
00:02:25.133 Or alternatively, it goes from Europe through
00:02:27.733 the Mediterranean, the Suez Canal and the
00:02:30.433 Middle East, and across India, and so
00:02:32.800 on, until it eventually reaches Japan the
00:02:35.600 other way around. But neither of these
00:02:38.100 is a particularly direct route.
00:02:40.666 And it turns out that there is
00:02:42.933 a much more direct, a much faster
00:02:44.900 route, to get from Europe to Japan,
00:02:48.033 which is to lay an optical fibre
00:02:51.233 through the Northwest Passage, across Northern Canada,
00:02:55.233 through the Arctic Ocean, and down through
00:02:57.733 the Bering Strait, and past Russia to
00:02:59.866 get directly to Japan. It's much closer
00:03:03.200 to the great circle route around the
00:03:04.966 globe, and it's much shorter than the
00:03:07.066 route that the networks currently take.
00:03:10.000 And, historically, this hasn't been possible because
00:03:12.566 of the ice in the Arctic.
00:03:14.666 But, with global warming, the Northwest Passage
00:03:17.800 is now ice-free for enough of the
00:03:20.766 year that people are starting to talk
00:03:23.100 about laying optical fibres along that route,
00:03:26.266 because they can get a noticeable latency
00:03:28.733 reduction, for certain amounts of traffic,
00:03:31.400 by just following the physically shorter route.
00:03:38.400 Another factor which influences the propagation delay
00:03:42.600 is the speed of light in the transmission media.
00:03:47.433 Now, if you're sending data using radio links,
00:03:52.000 or using lasers in a vacuum,
00:03:57.033 then these propagate at the speed of light in the vacuum.
00:04:01.100 Which is about 300 million meters per second.
00:04:05.700 The speed of light in optical fibre,
00:04:07.733 though, is slower. The speed at which
00:04:09.900 light propagates down a fibre,
00:04:12.633 the speed at which light propagates through
00:04:14.566 glass, is only about 200,000
00:04:17.000 kilometres per second, 200 million meters per
00:04:19.466 second. So it’s about two thirds of
00:04:21.800 the speed at which it propagates in a vacuum.
00:04:25.633 And this is the reason for systems
00:04:28.100 such as StarLink, which SpaceX is deploying.
00:04:32.900 And the idea of these systems is
00:04:34.700 that, rather than sending the Internet signals
00:04:38.000 down an optical fibre,
00:04:40.133 you send them 100, or a couple
00:04:42.300 of hundred miles, up to a satellite,
00:04:44.733 and they then go around between various
00:04:47.466 satellites in the constellation, in low earth
00:04:50.033 orbit, and then down to a receiver
00:04:53.700 near the destination.
00:04:55.833 And by propagating through vacuum, rather than
00:04:58.833 through optical fibre, the speed of light
00:05:02.800 in vacuum is significantly faster, it's about
00:05:05.300 50% faster than the speed of light
00:05:07.966 in fibre, and this can reduce the latency.
00:05:11.166 And the estimates show that if you
00:05:14.533 have a large enough constellation of satellites,
00:05:17.300 and SpaceX is planning on deploying around
00:05:19.666 4000 satellites, I believe, and with careful
00:05:23.133 routing, you can get about a
00:05:25.800 40 to 50% reduction in latency.
00:05:28.566 Just because the signals are transmitting via
00:05:31.866 radio waves, and via inter-satellite laser links,
00:05:35.733 which are in a vacuum, rather than
00:05:39.700 being transmitted through a fibre optic cable.
00:05:42.166 Just because of the differences in the
00:05:44.100 speed of light between the two mediums.
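A back-of-the-envelope calculation illustrates the point. The speeds below are the approximate figures from the lecture; the ~9,000 km path length is an assumed, illustrative distance (roughly the great-circle distance from Europe to Japan), not a figure given in the lecture.

```python
# Back-of-the-envelope propagation delay: fibre versus vacuum for an assumed
# ~9,000 km great-circle path. Speeds are the approximate lecture figures.

C_VACUUM = 300_000_000  # metres per second, speed of light in vacuum (approx.)
C_FIBRE  = 200_000_000  # metres per second, roughly two thirds of c in glass

def one_way_delay_ms(distance_m: float, speed_m_s: float) -> float:
    """One-way propagation delay in milliseconds, ignoring queuing."""
    return distance_m / speed_m_s * 1000

distance = 9_000_000  # ~9,000 km, an assumed illustrative path length

fibre_ms  = one_way_delay_ms(distance, C_FIBRE)
vacuum_ms = one_way_delay_ms(distance, C_VACUUM)
```

For this assumed path, the fibre delay works out to about 45 ms one-way against about 30 ms through vacuum, which is the kind of gap that motivates the satellite constellations discussed here, before even accounting for the fibre routes being physically longer than the great-circle path.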
00:05:47.100 And the link on the slide points
00:05:49.733 to some simulations of the StarLink network,
00:05:52.333 which try and demonstrate how this would
00:05:54.966 work, and how it can achieve
00:05:57.366 both network paths that closely follow the
00:06:01.266 great circle routes, and
00:06:03.366 how it can reduce the latency because
00:06:07.566 of the use of satellites.
00:06:13.433 So, what we see is that people
00:06:15.133 are clearly going to some quite extreme
00:06:17.100 lengths to reduce latency.
00:06:19.500 I mean, what we spoke about in
00:06:21.933 the previous part was the use of
00:06:24.366 ECN marking to reduce latency by reducing
00:06:26.766 the amount of queuing. And that's just
00:06:29.200 a configuration change, it’s a software change
00:06:31.466 to some routers. And that seems to
00:06:33.666 me like a reasonable approach to reducing latency.
00:06:36.900 But some people are clearly willing to
00:06:39.633 go to the effort of
00:06:41.833 launching thousands of satellites, or
00:06:44.666 perhaps the slightly less extreme case of
00:06:49.033 laying new optical fibres through the Arctic Ocean.
00:06:53.000 So why are people doing this? Why
00:06:54.933 do people care so much about reducing
00:06:57.100 latency, that they're willing to spend billions
00:06:59.900 of dollars launching thousands of satellites,
00:07:02.833 or running new undersea cables, to do this?
00:07:06.833 Well, you'll be surprised to hear that
00:07:09.233 this is not to improve your gaming
00:07:11.166 experience. And this is not to improve
00:07:13.500 the experience of your zoom calls.
00:07:16.033 Why are people doing this? High frequency share trading.
00:07:20.800 Share traders believe they can make a
00:07:23.600 lot of money, by getting a few milliseconds worth
00:07:27.900 of latency reduction compared to their competitors.
00:07:33.600 Whether that's a good use of a
00:07:35.833 few billion dollars, I'll let you decide.
00:07:38.800 But the end result may be,
00:07:41.433 hopefully, that we will get lower latency
00:07:43.866 for the rest of us as well.
00:07:48.733 And that concludes this lecture.
00:07:52.433 There are a bunch of reasons why
00:07:54.566 we have latency in the network.
00:07:56.600 Some of this is due to propagation
00:07:59.200 delays. Some of this, perhaps most of
00:08:01.166 it, in many cases, is due to
00:08:02.866 queuing at intermediate routers.
00:08:05.733 The propagation delays are driven by the speed of light.
00:08:09.200 And unless you can launch many satellites,
00:08:12.966 or lay more optical fibres, that's pretty
00:08:17.500 much a fixed constant, and there's not
00:08:19.833 much we can do about it.
00:08:22.966 Queuing delays, though, are things which we
00:08:25.833 can change. And a lot of the
00:08:28.066 queuing delays in the network are caused
00:08:30.000 because of TCP Reno and TCP Cubic,
00:08:34.400 which push for the queues to be full.
00:08:37.733 Hopefully, we will see improved TCP congestion
00:08:41.366 control algorithms. And TCP Vegas was one
00:08:44.600 attempt in this direction, which unfortunately proved
00:08:48.066 not to be deployable in practice,
00:08:50.833 TCP BBR was another attempt which
00:08:54.233 was problematic for other reasons, because of
00:08:57.433 its unfairness. But people are certainly working
00:09:00.066 on alternative algorithms in this space,
00:09:02.866 and hopefully we'll see things deployed before too long.
Lecture 6 discussed TCP congestion control and its impact on latency. It discussed the principles of congestion control (e.g., the sliding window algorithm, AIMD, conservation of packets), and their realisation in TCP Reno. It reviewed the choice of TCP initial window, slow start, and the congestion avoidance phase, and the response of TCP to packet loss as a congestion signal.
The lecture noted that TCP Reno cannot effectively make use of fast and long distance paths (e.g., gigabit per second flows, running on transatlantic links). It discussed the TCP Cubic algorithm, that changes the behaviour of TCP in the congestion avoidance phase to make more effective use of such paths.
And it noted that both TCP Reno and TCP Cubic will try to increase their sending rate until packet loss occurs, and will use that loss as a signal to slow down. This fills the in-network queues at routers on the path, causing latency.
The lecture briefly discussed TCP Vegas, and the idea of using delay changes as a congestion signal instead of packet loss, and it noted that TCP Vegas is not deployable in parallel with TCP Reno or Cubic. It highlighted ongoing research with TCP BBR to address some of the limitations of TCP Vegas.
Finally, the lecture highlighted the possible use of Explicit Congestion Notification as a way of signalling congestion to the endpoints, and of causing TCP to reduce its sending rate, before the in-network queues overflow. This potentially offers a way to reduce latency.
Discussion will focus on the behaviour of TCP Reno congestion control, and of understanding how this leads to increased latency. It will discuss the applicability and ease of deployment of ways of reducing that latency.