csperkins.org

Networked Systems H (2022-2023)

Lecture 6: Lowering Latency

This lecture discusses some of the factors that affect the latency of a TCP connection. It considers TCP congestion control, the TCP Reno and Cubic congestion control algorithms, and their behaviour and performance in terms of throughput and latency. It then considers alternative congestion control algorithms, such as TCP Vegas and BBR, and the use of explicit congestion notification (ECN), as options to lower latency. Finally, it considers the impact of sub-optimal Internet paths on latency, and the rationale for deploying low-Earth orbit satellite constellations to reduce the latency of Internet paths.

Part 1: TCP Congestion Control

This first part of the lecture outlines the principle of congestion control. It discusses packet loss as a congestion signal, conservation of packets in flight, and the additive increase, multiplicative decrease requirements for stability.

Slides for part 1

 

00:00:00.566 In this lecture I’d like to move

00:00:02.400 on from talking about how to transfer

00:00:04.800 data reliably, and talk about mechanisms and

00:00:07.866 means by which transport protocols go about

00:00:10.366 lowering the latency of the communication.

 

00:00:15.466 One of the key limiting factors of

00:00:17.966 performance of network systems, as we've discussed

00:00:20.633 in some of the previous lectures, is latency.

 

00:00:25.000 Part of that is the latency for

00:00:26.800 establishing connections, and we've spoken about that

00:00:29.166 in detail already, where a lot of

00:00:31.566 the issue is the number of round

00:00:33.933 trip times needed to set up a connection.

 

00:00:37.400 And, especially when secure connections are in

00:00:40.700 use, if you're using TCP and TLS,

00:00:43.766 for example, as we discussed, there’s a

00:00:46.033 large number of round trips needed to

00:00:47.766 actually get to the point where you

00:00:49.366 can establish a connection, negotiate security parameters,

00:00:53.266 and start to exchange data.

 

00:00:55.366 And we've already spoken about how the

00:00:58.166 QUIC Transport Protocol

00:01:00.566 has been developed to try and improve

00:01:03.233 latency in terms of establishing a connection.

 

00:01:06.166 The other aspects of latency, and reducing

00:01:08.566 the latency of communications, is actually in

00:01:10.966 terms of data transfer.

 

00:01:13.133 How you deliver data across the network

00:01:15.833 in a way which doesn't lead to

00:01:19.233 excessive delays, and how you can gradually

00:01:23.033 find ways of reducing the latency,

00:01:25.733 and making the network better suited to

00:01:29.200 real time applications, such as telephony,

00:01:31.833 and video conferencing, and gaming, and high

00:01:34.500 frequency trading, and

00:01:38.000 Internet of Things, and control applications.

 

00:01:43.666 A large aspect of that is in

00:01:44.966 terms of how you go about building

00:01:46.866 congestion control, and a lot of the

00:01:48.800 focus in this lecture is going to

00:01:50.700 be on how TCP

00:01:52.300 congestion control works, and how other protocols

00:01:54.700 do congestion control, to deliver data in a

00:01:57.533 low latency manner.

 

00:01:59.400 But I’ll also talk a bit about

00:02:01.666 explicit congestion notification, and changes to the

00:02:04.466 way queuing happens in the network,

00:02:07.600 and about services such as SpaceX’s StarLink

00:02:09.700 which are changing the way the network

00:02:12.500 is built to reduce latency.

 

00:02:17.500 I want to start by talking about congestion control,

00:02:20.300 and TCP congestion control in particular.

 

00:02:26.800 And, what I want to do in

00:02:28.666 this part, is talk about some of

00:02:30.566 the principles of congestion

00:02:32.233 control. And talk about what is the

00:02:34.333 problem that's being solved, and how can

00:02:36.700 we go about adapting the rate at

00:02:39.533 which a TCP connection delivers data over

00:02:42.133 the network

00:02:43.633 to make best use of the network

00:02:45.366 capacity, and to do so in a

00:02:47.400 way which doesn't build up queues in

00:02:49.600 the network and induce too much latency.

 

00:02:52.300 So in this part I’ll talk about

00:02:54.566 congestion control principles. In the next part

00:02:57.066 I move on to talk about loss-based

00:02:59.433 congestion control, and talk about TCP Reno

00:03:02.566 and TCP Cubic,

00:03:04.233 which are ways of making very effective

00:03:06.400 use of the overall network capacity,

00:03:08.900 and then move on to talk about

00:03:11.200 ways of lowering latency.

00:03:13.533 I’ll talk about latency reducing congestion control

00:03:15.866 algorithms, such as TCP Vegas or Google's

00:03:18.933 TCP BBR proposal. And then I’ll finish

00:03:22.033 up by talking a little bit about

00:03:24.433 Explicit Congestion Notification

00:03:26.400 in one of the later parts of the lecture.

 

00:03:31.166 TCP is a

00:03:33.666 complex and very highly optimised protocol,

00:03:38.400 especially when it comes to congestion control

00:03:41.666 and loss recovery mechanisms.

 

00:03:44.733 I'm going to attempt to give you

00:03:47.000 a flavour of the way congestion control

00:03:49.500 works in this lecture, but be aware

00:03:52.033 that this is a very simplified review

00:03:54.333 of some quite complex issues.

 

00:03:56.833 The document listed on the slide is

00:03:59.966 entitled “A roadmap for TCP Specification Documents”,

00:04:03.900 and it's the latest IETF standard that describes

00:04:08.833 how TCP works, and points to the

00:04:11.700 details of the different proposals.

 

00:04:15.966 This is a very long and complex

00:04:19.733 document. It’s about, if I remember right,

00:04:22.133 60 or 70 pages long.

00:04:24.133 And all it is, is a list

00:04:26.066 of references to other specifications, with one

00:04:28.533 paragraph about each one describing why that

00:04:30.833 specification is important.

 

00:04:32.666 And the complete specification for TCP is

00:04:35.100 several thousand pages of text. This is

00:04:37.800 a complex protocol with a lot of

00:04:40.933 features in it, and I’m necessarily giving

00:04:43.700 a simplified overview.

 

00:04:47.400 I’m going to talk about TCP.

00:04:50.066 I’m not going to talk much,

00:04:52.066 if at all, about QUIC in this lecture.

 

00:04:54.666 That's not because QUIC isn't interesting,

00:04:57.533 it's because QUIC essentially adopts the same

00:05:00.433 congestion control mechanisms as TCP.

 

00:05:03.433 The QUIC version one standard says to

00:05:07.233 use TCP Reno, use the same congestion

00:05:10.166 control algorithm as TCP Reno.

 

00:05:13.300 And, in practice, most of the QUIC

00:05:15.500 implementations use the Cubic or the BBR

00:05:19.033 congestion control algorithms,

00:05:20.933 which we'll talk about later on.

00:05:22.500 QUIC is basically adopting the same mechanisms

00:05:24.566 as does TCP, and for that reason

00:05:27.366 I'm not going to talk about

00:05:30.433 them too much separately.

 

00:05:36.966 So what is the goal of congestion

00:05:39.666 control? What are the principles of congestion control?

 

00:05:43.633 Well, the idea of congestion control is

00:05:46.600 to find the right transmission rate for

00:05:49.966 a connection.

 

00:05:51.466 We're trying to find the fastest sending

00:05:53.600 rate which you can send at to

00:05:56.100 match the capacity of the network,

00:05:58.233 and to do so in a way

00:05:59.833 that doesn't build up queues, doesn't overload,

00:06:02.766 doesn't congest the network.

 

00:06:05.066 So we're looking to adapt the transmission

00:06:07.333 rate of a flow of TCP traffic

00:06:09.933 over the network, to match the available

00:06:12.100 network capacity.

 

00:06:13.800 And as the network capacity changes,

00:06:16.100 perhaps because other flows of traffic start

00:06:19.466 up, or perhaps because you're on a

00:06:21.533 mobile device and you move into an

00:06:24.333 area with different radio coverage,

00:06:26.500 the speed at which the TCP is

00:06:29.800 delivering the data should adapt to match

00:06:31.600 the changes in available capacity.

 

00:06:35.966 The fundamental principles of congestion control,

00:06:41.433 as applied in TCP,

00:06:43.300 were first described by Van Jacobson,

00:06:46.500 who we see on the picture on

00:06:49.200 the top right of the slide,

00:06:51.033 in the paper “Congestion Avoidance and Control”.

 

00:06:56.366 And those principles are that TCP responds

00:06:59.166 to packet loss as a congestion signal.

 

00:07:01.933 It treats the loss of a packet,

 

00:07:04.966 because the Internet is a best effort

00:07:07.300 packet network, and it loses, it discards

00:07:09.666 packets, if it can't deliver them,

00:07:11.900 and TCP treats that discard, that loss

00:07:14.700 of a packet, as a congestion signal,

00:07:16.900 and as a signal that it's sending

00:07:19.233 too fast and should slow down.

 

00:07:21.866 It relies on the principle of conservation

00:07:23.833 of packets. It tries to keep the

00:07:26.133 number of packets, which are traversing the

00:07:28.433 network roughly constant,

00:07:29.800 assuming nothing changes in the network.

 

00:07:32.966 And it relies on the principles of

00:07:34.966 additive increase, multiplicative decrease.

 

00:07:37.166 If it has to increase its sending

00:07:39.233 rate, it does so relatively slowly,

00:07:41.333 an additive increase in the rate.

00:07:43.466 And if it has to reduce its

00:07:44.666 sending rate, it does so quickly, a multiplicative decrease.

 

00:07:49.466 And these are the fundamental principles that

00:07:51.833 Van Jacobson elucidated for TCP congestion control,

00:07:55.633 and for congestion control in general.

 

00:07:58.800 And it was Van Jacobson who did

00:08:01.866 the initial implementation of these into TCP

00:08:05.366 in the late 1980s, about 1987, ’88, or so.

 

00:08:12.300 Since then, the algorithms, the congestion control

00:08:16.333 algorithms, for TCP in general have been

00:08:18.500 maintained by a large number of people.

00:08:20.900 A lot of people have developed this.

 

00:08:23.600 Probably one of the leading people in

00:08:26.733 this space for the last 20 years

00:08:30.700 or so, is Sally Floyd who was

00:08:33.166 very much responsible for taking

00:08:35.533 the TCP standards, making them robust,

00:08:39.166 pushing them through the IETF to get

00:08:41.533 them standardised, and making sure they work,

00:08:43.400 and get really high performance.

 

00:08:46.600 And she very much drove the development

00:08:48.800 to make these robust, and effective,

00:08:51.100 and high performance standards, and to make

00:08:53.766 TCP work as well as it does today.

 

00:08:57.266 And Sally sadly passed away a year

00:09:00.900 or so back, which is a tremendous

00:09:03.933 shame, but we're grateful for her legacy

00:09:08.766 in moving things forward.

 

00:09:13.833 So to go back to the principles.

 

00:09:17.366 The first principle of congestion control in

00:09:20.233 the Internet, and in TCP, is that

00:09:22.833 packet loss is an indication that the

00:09:24.700 network is congested.

 

00:09:28.500 Data flowing across the Internet flows from

00:09:31.433 the sender to the receiver through a

00:09:33.666 series of routers. The IP routers connect

00:09:37.866 together the different links that comprise the network.

 

00:09:41.766 And routers perform two functions:

00:09:44.500 they perform a routing function, and a forwarding function.

 

00:09:50.166 The purpose of the routing function is

00:09:52.566 to figure out how packets should get

00:09:55.166 to their destination. They receive a packet

00:09:57.766 from some network link, look at the

00:09:59.733 destination IP address, and decide which direction

00:10:02.333 to forward that packet. They’re responsible for

00:10:05.100 finding the right path through the network.

 

00:10:08.500 But they're also responsible for forwarding,

00:10:10.566 which is actually putting the packets into

00:10:13.233 the queue of outgoing traffic for the

00:10:15.900 link, and managing that queue of packets

00:10:18.566 to actually transmit the packets across the network.

 

00:10:22.033 And routers in the network have a

00:10:25.366 set of different links; the whole point

00:10:28.133 of a router is to connect different

00:10:30.266 links. And at each link, they have

00:10:32.200 a queue of packets, which are enqueued

00:10:34.100 to be delivered on that link.

 

00:10:36.900 And, perhaps obviously, if packets are arriving

00:10:39.333 faster than the link can deliver those

00:10:41.933 packets, then the queue gradually builds up.

00:10:44.466 More and more packets get enqueued in

00:10:47.200 the router waiting to be delivered.

 

00:10:48.800 And if packets are arriving slower than

00:10:51.433 they can be forwarded,

 

00:10:54.000 then the queue gradually empties as the

00:10:57.133 packets get transmitted.

 

00:11:00.066 Obviously the router has a limited amount

00:11:02.133 of memory, and at some point it's

00:11:04.633 going to run out of space to

00:11:06.200 enqueue packets. So, if packets are

00:11:08.300 arriving at the router faster

00:11:10.200 than they can be delivered

00:11:12.833 down the

00:11:14.633 link, the queue will build up and

00:11:16.500 gradually fill, until it reaches its maximum

00:11:18.833 size. At that point, the router has

00:11:21.133 no space to keep the newly arrived

00:11:23.666 packets, and so it discards the packets.

 

00:11:28.133 And this is what TCP is using

00:11:30.333 as the congestion signal. It’s using the

00:11:32.666 fact that the queue of packets on

00:11:35.100 an outgoing link at a router has

00:11:37.333 filled up. When the queue fills up,

00:11:41.066 a packet gets lost, and TCP uses

00:11:43.566 that packet loss as an indication

00:11:45.566 that it's sending too fast.

00:11:47.933 It’s sending faster than the packets can

00:11:50.300 be delivered, and as a result the

00:11:52.666 queue has overflowed, a packet has been

00:11:55.000 lost, and so it needs to slow down.

 

00:11:57.966 And that's the fundamental congestion signal in

00:12:00.666 the network. Packet loss is interpreted as

00:12:03.533 a sign that devices are sending too

00:12:06.366 fast, and should go slower. And if

00:12:10.133 they slow down, the queues will gradually

00:12:12.066 empty, and packets will stop being lost.

 

00:12:15.366 So that's the first fundamental principle.

 

00:12:21.033 The second principle is that

00:12:24.800 we want to keep the number of

00:12:27.000 packets in the network roughly constant.

 

00:12:31.000 TCP, as we saw in the last

00:12:33.266 lecture, sends acknowledgments for packets. When a

00:12:35.866 packet is transmitted it has a sequence

00:12:38.366 number, and the response will come back

00:12:40.500 from the receiver acknowledging receipt of that

00:12:42.566 sequence number.

 

00:12:44.733 The general approach for TCP, once the

00:12:47.766 connection has got going, is that every

00:12:50.866 time it gets an acknowledgement, it uses

00:12:53.733 that as a signal that a packet

00:12:55.666 has been received.

 

00:12:57.533 And if a packet has been received,

00:12:59.233 something has left the network. One of

00:13:01.333 the packets sent into the network has

00:13:03.466 reached the other side, and has been

00:13:05.466 removed from the network at the receiver.

 

00:13:07.833 That means there should be space to

00:13:10.866 put another packet into the network.

 

00:13:13.900 And it's an approach that’s called ACK

00:13:15.866 clocking. Every time a packet arrives at

00:13:18.133 the receiver, and you get an acknowledgement

00:13:20.833 back saying it was received, that indicates

00:13:22.733 you can put another packet in.

 

00:13:24.766 So the total number of packets in

00:13:27.000 transit across the network ends up being

00:13:28.766 roughly constant. One packet out, you put

00:13:31.866 another packet in.

 

00:13:34.433 And it has the advantage that if

00:13:38.300 you're clocking out new packets in receipt

00:13:41.266 of acknowledgments, if, for some reason,

00:13:44.200 the network gets congested, and it takes

00:13:46.966 longer for acknowledgments to come back,

00:13:49.166 because it's taking longer for them to

00:13:50.700 work their way across the network,

00:13:53.566 then that will automatically slow down the

00:13:56.066 rate at which you send. Because it

00:13:58.466 takes longer for the next acknowledgment to

00:14:00.266 come back, therefore it's longer before you

00:14:02.066 send your next packet.

00:14:03.466 So, as the network starts to get

00:14:05.266 busy, as the queue starts to build

00:14:07.333 up, but before the queue has overflowed,

00:14:09.933 it takes longer for the acknowledgments to

00:14:12.233 come back, because the packets are queued

00:14:14.733 up in the intermediate links, and that

00:14:17.133 gradually slows down the behaviour of TCP.

 

00:14:20.166 It reduces the rate at which you can send.

 

00:14:23.366 So it’s, to at least some extent,

00:14:25.500 self adjusting. The network gets busier,

00:14:28.133 the ACKs come back slower, therefore you

00:14:30.066 send a little bit slower.

 

00:14:31.933 And that's the second principle: conservation of

00:14:34.600 packets. One out, one in.
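
As a rough illustration of this "one out, one in" idea, here is a minimal sketch of ACK clocking, in Python. It is not from the lecture: the names and the fixed window of six packets are invented for illustration, and real TCP counts bytes rather than packets.

# Minimal sketch of ACK clocking: keep a fixed number of packets in
# flight, and only send a new packet when an ACK shows that an old one
# has left the network.

WINDOW = 6                 # packets allowed in flight at once (illustrative)
in_flight = set()          # sequence numbers currently in the network
next_seq = 1

def send_packet(seq):
    print(f"send packet {seq}")
    in_flight.add(seq)

def on_ack(acked_seq):
    """An acknowledgement arrived: one packet has left the network,
    so there is room to put exactly one more in."""
    global next_seq
    in_flight.discard(acked_seq)
    send_packet(next_seq)  # one out, one in
    next_seq += 1

# Fill the initial window, then let arriving ACKs clock out new packets.
while len(in_flight) < WINDOW:
    send_packet(next_seq)
    next_seq += 1

for seq in (1, 2, 3):      # ACKs for the first three packets arrive
    on_ack(seq)            # each one releases exactly one new packet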

 

00:14:41.300 And the principle of conservation of packets

00:14:44.866 is great, provided the network is in

00:14:48.333 the steady state.

 

00:14:50.166 But you also need to be able

00:14:51.733 to adapt the rate at which you're sending.

 

00:14:54.633 The way TCP adapts is very much

00:14:57.300 focused on starting slowly and gradually increasing.

 

00:15:04.900 When it needs to increase its sending

00:15:07.100 rate, TCP increases linearly. It adds a

00:15:10.866 small amount to the sending rate each round trip time.

 

00:15:15.433 So it just gradually, slowly, increases the

00:15:18.000 sending rate. It gradually

00:15:19.500 pushes up the rate

00:15:23.233 until it spots a loss. Until it

00:15:26.566 loses a packet. Until it overflows a queue.

 

00:15:29.633 And then it responds to congestion by

00:15:32.166 rapidly decreasing its rate. If a congestion

00:15:36.233 event happens, if a packet is lost,

00:15:38.500 TCP halves its rate. It responds faster

00:15:42.266 than it increases, it slows down faster than it increases.

 

00:15:45.966 And this is the final principle,

00:15:48.166 what’s known as additive increase, multiplicative decrease.

 

00:15:50.866 The goal is to keep the network

00:15:53.066 stable. The goal is to not overload the network.

 

00:15:57.733 If you can, keep going at a

00:16:00.366 steady rate. Follow the ACK clocking approach.

 

00:16:03.300 Gradually, just slowly, increase the rate a

00:16:05.900 bit. Keep pushing, just in case there’s

00:16:08.366 more capacity than you think. So just

00:16:10.800 gradually keep probing to increase the rate.

 

00:16:13.733 If you overload the network, if you

00:16:16.333 cause congestion, if you overflow the queues,

00:16:18.566 cause a packet to be lost,

00:16:19.766 slow down rapidly. Halve your sending rate,

00:16:22.733 and gradually build up again.

 

00:16:24.866 The fact that you slow down faster

00:16:27.000 than you speed up, the fact that

00:16:28.966 you follow the one in, one out approach,

00:16:31.900 keeps the network stable. It makes sure

00:16:34.433 it doesn't overload the network, and it

00:16:36.166 means that if the network does overload,

00:16:38.000 it responds and recovers quickly. The goal

00:16:40.866 is to keep the traffic moving.

00:16:42.500 And TCP is very effective at doing this.
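
The additive increase, multiplicative decrease behaviour can be written as a tiny update rule. This is a sketch, not the lecture's code; the constants (add one packet per round trip, halve on loss) are the ones TCP Reno uses, as described in the next part.

def aimd_update(cwnd, loss_detected):
    """One congestion window update per round trip time."""
    if loss_detected:
        return max(1.0, cwnd / 2)   # multiplicative decrease: back off quickly
    return cwnd + 1                 # additive increase: probe for capacity slowly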

 

00:16:47.366 So those are the fundamental principles of

00:16:49.800 TCP congestion control. Packet loss as an

00:16:52.866 indication of congestion.

00:16:54.800 Conservation of packets, and ACK clocking.

00:16:57.500 One in, one out, where possible.

 

00:17:00.366 If you need to increase the sending

00:17:03.233 rate, increase slowly. If a problem happens,

00:17:06.100 decrease quickly. And that will keep the network stable.

 

00:17:10.200 In the next part I’ll talk about

00:17:12.466 TCP Reno, which is one of the

00:17:15.100 more popular approaches for doing this in practice.

Part 2: TCP Reno

The second part of the lecture discusses TCP Reno congestion control. It outlines the principles of window based congestion control, and describes how they are implemented in TCP. The choice of initial window, and how the recommended initial window has changed over time, is discussed, along with the slow start algorithm for finding the path capacity and the congestion avoidance algorithm for adapting the congestion window.

Slides for part 2

 

00:00:00.666 In the previous part, I spoke about

00:00:02.366 the principles of TCP congestion control in

00:00:04.733 general terms. I spoke about the idea

00:00:07.666 of packet loss as a congestion signal,

00:00:10.300 about the conservation of packets, and about

00:00:12.800 the idea of additive increase multiplicative decrease

00:00:15.500 – increase slowly, decrease the sending rate quite

00:00:17.666 quickly as a way of achieving stability.

 

00:00:20.333 In this part I want to talk

00:00:21.900 about TCP Reno, and some of the

00:00:24.066 details of how TCP congestion control works in practice.

 

00:00:27.500 I’ll talk about the basic TCP congestion

00:00:29.866 control algorithm, how the sliding window algorithm

00:00:33.033 works to adapt the sending rate,

00:00:36.100 and the slow start and congestion avoidance

00:00:39.566 phases of congestion control.

 

00:00:44.600 TCP is what's known as a window

00:00:48.166 based congestion control protocol.

 

00:00:51.100 That is, it maintains what's known as

00:00:54.633 a sliding window of data which is

00:00:56.700 available to be sent over the network.

 

00:00:59.800 And the sliding window determines what range

00:01:02.100 of sequence numbers can be sent by

00:01:04.500 TCP onto the network.

 

00:01:06.933 It uses the additive increase multiplicative decrease

00:01:11.100 approach to grow and shrink the window.

00:01:13.300 And that determines, at any point,

00:01:15.666 how much data the TCP sender can send

00:01:17.700 onto the network.

 

00:01:19.666 It augments these with algorithms known as

00:01:22.000 slow start and congestion avoidance. Slow start

00:01:24.933 being the approach TCP uses to get

00:01:28.900 a connection going in a safe way,

00:01:31.533 and congestion avoidance being the approach it

00:01:33.800 uses to maintain the sending rate once

00:01:36.733 the flow has got started.

 

00:01:39.633 The fundamental goal of TCP is that

00:01:42.233 if you have several TCP flows sharing

00:01:45.533 a link, sharing a bottleneck link in the network,

00:01:50.300 each of those flows should get an

00:01:52.733 approximately equal share of the bandwidth.

 

00:01:55.900 So, if you have four TCP flows

00:01:57.666 sharing a link, they should each get

00:01:59.733 approximately one quarter of the capacity of that link.

 

00:02:03.866 And TCP does this reasonably well.

 

00:02:06.666 It’s not perfect. It, to some extent,

00:02:10.100 biases against long distance flows,

00:02:13.433 and shorter flows tend to win out

00:02:15.900 a little over long distance flows.

 

00:02:18.066 But, in general, it works pretty well,

00:02:19.900 and does give flows a roughly

00:02:22.866 equal share of the bandwidth.

 

00:02:26.100 The basic algorithm it uses to do

00:02:28.066 this, the basic congestion control algorithm,

00:02:30.600 is an approach known as TCP Reno.

00:02:32.233 And this is the state of the

00:02:35.200 art in TCP as of about 1990.

 

00:02:42.866 TCP is an ACK based protocol.

 

00:02:46.700 You send a packet, and sometime later

00:02:48.933 an acknowledgement comes back telling you that

00:02:52.000 the packet arrived, and indicating the sequence

00:02:54.366 number of the next packet which is expected.

 

00:02:59.300 The simplest way you might think that

 

00:03:01.300 would work, is you send a packet.

00:03:03.466 You wait for the acknowledgment. You send

00:03:05.533 another packet. You wait for the acknowledgement. And so on.

 

00:03:09.500 The problem with that, is that it

00:03:11.166 tends to perform very poorly.

 

00:03:14.200 It takes a certain amount of time

00:03:16.366 to send a packet down a link.

00:03:18.666 That depends on the size of the

00:03:20.033 packet, and the link bandwidth.

 

00:03:23.566 The size of the packet is expressed

00:03:25.600 as some number of bits to be sent.

00:03:27.700 The link bandwidth is expressed in some

00:03:29.800 number of bits it can deliver each

00:03:31.300 second. And if you divide the

00:03:33.233 packet size by the bandwidth, that gives

00:03:35.300 you the number of seconds it takes to send each packet.

 

00:03:39.166 It takes a certain amount of time

00:03:41.300 for that packet to propagate down the

00:03:43.733 link to the receiver, and for the

00:03:45.733 acknowledgment to come back to you, depending on

00:03:48.900 the round trip time of the link.

 

00:03:51.100 And you can measure the round trip time of the link.

 

00:03:54.833 And you can divide one by the other.

00:03:57.533 You can take the time it takes to send a packet, and the

00:04:00.333 time it takes for the acknowledgment to

00:04:02.066 come back, and divide one by the

00:04:03.900 other, to get the link utilisation.

 

00:04:07.200 And, ideally, you want that fraction to be

00:04:08.900 close to one. You want to be

00:04:11.166 spending most of the time sending packets,

00:04:13.466 and not much time waiting for the

00:04:15.166 acknowledgments to come back before you can

00:04:16.900 send the next packet.

 

00:04:19.866 The problem is that's often not the case.

 

00:04:23.566 For example, if we assume we're trying

00:04:25.566 to send data, and we have a

00:04:27.366 gigabit link, which is connecting the machine

00:04:30.100 we're sending data from, and we’re trying

00:04:31.800 to go from Glasgow to London.

00:04:33.966 And this might be the case you would find if you had one

00:04:37.133 of the machines in the Boyd Orr

00:04:39.233 labs, which is connected to the University's

00:04:42.166 gigabit Ethernet, and the University has a

00:04:44.666 10 gigabit per second link to the

00:04:46.666 rest of the Internet, so the bottleneck is that Ethernet.

 

00:04:51.100 If you're talking to a machine in London,

00:04:54.100 let's make some assumptions on how long this will take.

 

00:04:59.166 You’re sending using Ethernet, and the biggest

00:05:01.533 packet an Ethernet can deliver is 1500

00:05:03.566 bytes. So 1500 bytes, multiplied by eight

00:05:06.833 bits per byte, gives you a number

00:05:09.066 of bits in the packet. And it’s

00:05:11.366 a gigabit Ethernet, so it's sending a

00:05:13.233 billion bits per second.

 

00:05:15.400 So 1500 bytes, times eight bits,

00:05:17.866 divided by a billion bits per second.

 

00:05:21.133 It will take 12 microseconds, 0.000012 of

00:05:26.866 a second, 12 microseconds to send a

00:05:29.266 packet down the link. And that’s just

00:05:31.800 the time it takes to physically serialise

00:05:34.066 1500 bytes down a gigabit per second link.

 

00:05:39.400 The round trip time to London, if you measure it, is about

00:05:44.566 a 100th of a second, about 10 milliseconds.

 

00:05:47.833 If you divide one by the other,

00:05:50.200 you find that the utilisation is 0.0012.

00:05:54.700 0.12% of the link is in use.

 

00:05:59.266 The time it takes to send a

00:06:00.933 packet is tiny compared to the time

00:06:02.833 it takes to get a response.

00:06:04.566 So if you're just sending one packet,

00:06:06.433 and waiting for a response, the link

00:06:08.166 is idle 99.9% of the time.
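
The arithmetic for that example, written out as a short Python calculation (the 1 Gb/s rate, 1500 byte packets, and roughly 10 ms round trip time are the assumptions used above):

packet_bits = 1500 * 8          # largest Ethernet packet, in bits
link_rate = 1e9                 # gigabit Ethernet, in bits per second
rtt = 0.010                     # ~10 ms Glasgow to London

serialisation_time = packet_bits / link_rate    # 12 microseconds
utilisation = serialisation_time / rtt          # ~0.0012

print(f"{serialisation_time * 1e6:.0f} microseconds to send one packet")
print(f"utilisation {utilisation:.4f}, i.e. {utilisation:.2%} of the link")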

 

00:06:14.166 The idea of a sliding window protocol

00:06:16.733 is to not just send one packet

00:06:18.566 and wait for an acknowledgement.

00:06:20.133 It’s to send several packets,

00:06:22.566 and wait for the acknowledgments. And the

00:06:25.466 window is the number of packets that

00:06:27.266 can be outstanding before the acknowledgement comes back.

 

00:06:31.000 The idea is, you can start several

00:06:33.133 packets going, and eventually the acknowledgement comes

00:06:36.566 back, and that starts triggering the next

00:06:38.466 packets to be clocked out. This idea

00:06:40.233 is to improve the utilisation by sending

00:06:42.500 more than one packet before you get an acknowledgment.

 

00:06:47.200 And this is the fundamental approach to

00:06:49.266 sliding window protocols. The sender starts sending

00:06:51.833 data packets, and there's what's known as

00:06:54.433 a congestion window that specifies how

00:06:57.000 many packets it's allowed to send

00:06:59.600 before it gets an acknowledgement.

 

00:07:02.033 And, in this example, the congestion window is six packets.

 

00:07:06.133 And the sender starts. It sends the

00:07:08.300 first data packet, and that gets sent

00:07:11.100 and starts its way traveling down the link.

 

00:07:14.533 And at some point later it sends

00:07:16.566 the next packet, and then the next packet, and so.

 

00:07:20.433 After a certain amount of time that

00:07:22.366 first packet arrives at the receiver,

00:07:24.400 and the receiver generates the acknowledgments which

00:07:26.800 comes back towards the sender.

 

00:07:28.933 And while this is happening, the sender

00:07:30.633 is sending more of the packets from its window.

 

00:07:33.966 And the receiver’s gradually receiving those and

00:07:36.266 sending the acknowledgments. And, at some point later,

00:07:38.966 the acknowledgement makes it back to the sender.

 

00:07:42.666 And in this case we've set the

00:07:44.733 window size to be six packets.

00:07:46.700 And it just so happens that the

00:07:48.500 acknowledgement for the first packet arrives back

00:07:51.733 at the sender, just as it has finished sending packet six.

 

00:07:57.700 And that triggers the window to increase.

00:07:59.866 That triggers the window to slide along.

00:08:02.066 So instead of being allowed to send packets one through six,

00:08:05.533 we're now allowed to send packets two

00:08:07.366 through seven. Because one packet has arrived,

00:08:09.833 that's opened up the window to allow

00:08:11.400 us to send one more packet.

 

00:08:13.733 And the acknowledgement indicates that packet one

00:08:16.200 has arrived. So just as we'd run

00:08:19.133 out of packets to send, just as

00:08:20.600 we've sent our six packets which are

00:08:22.533 allowed by the window, the acknowledgement arrives,

00:08:25.033 slides the window along one,

00:08:26.566 tells us we can now send one more.

 

00:08:29.566 And the idea is that you size

00:08:31.600 the window such that you send just

00:08:33.366 enough packets that by the time the

00:08:35.600 acknowledgement comes back, you're ready to slide

00:08:37.800 the window along. You've sent everything that

00:08:40.000 was in your window.

 

00:08:41.766 And each acknowledgement releases the next packet

00:08:44.833 for transmission, if you get the window sized right.

 

00:08:48.766 And if there's a problem, if the acknowledgments

00:08:51.233 don't come back because something got lost,

00:08:54.033 then it stalls. You haven't sent too

00:08:56.966 many excess packets, you don't just keep

00:08:59.200 sending without getting acknowledgments,

00:09:01.466 you're just sending enough

00:09:02.933 that the acknowledgments come back, just as

00:09:04.966 you run out of things to send.

00:09:06.966 And everything just keeps it sort-of balanced.

 

00:09:09.300 Every acknowledgement triggers the next packet to

00:09:11.466 be sent, and it rolls along.
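
A minimal sketch of that sliding window, matching the example above: a window of six packets, and a cumulative acknowledgement that slides the window along by one. The names are invented for illustration, and real TCP tracks byte ranges rather than packet numbers.

cwnd = 6             # congestion window, in packets
lowest_unacked = 1   # oldest packet not yet acknowledged

def sendable():
    """Packets the window currently allows to be outstanding."""
    return list(range(lowest_unacked, lowest_unacked + cwnd))

def on_ack(next_expected):
    """Cumulative ACK: everything before `next_expected` has arrived,
    so the window slides along and one more packet may be sent."""
    global lowest_unacked
    lowest_unacked = next_expected

print(sendable())    # [1, 2, 3, 4, 5, 6]
on_ack(2)            # packet 1 arrived; the receiver expects packet 2 next
print(sendable())    # [2, 3, 4, 5, 6, 7]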

 

00:09:14.500 How big should the window be? Well,

00:09:17.166 it should be sized to match the

00:09:18.466 bandwidth times the delay on the path.

 

00:09:20.900 And you work it out in bytes.

00:09:23.266 It's the bandwidth of the path,

00:09:24.600 a gigabit in the previous example,

00:09:26.933 times the latency,

00:09:28.300 100th of a second, and you multiply

00:09:30.833 those together and that tells you how

00:09:32.366 many bytes can be in flight.

 

00:09:33.733 And you divide that by the packet

00:09:35.366 size, and that tells you how many packets you can send.
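
Using the earlier Glasgow-to-London numbers again (an assumed 1 Gb/s bottleneck, a ~10 ms round trip time, and 1500 byte packets), this sketch works out the bandwidth-delay product and the window it implies:

bandwidth = 1e9        # bits per second
rtt = 0.010            # seconds
packet_size = 1500     # bytes

bdp_bytes = (bandwidth / 8) * rtt          # bytes that can be in flight
window_packets = bdp_bytes / packet_size   # packets that can be in flight

print(f"{bdp_bytes / 1e6:.2f} MB in flight, about {window_packets:.0f} packets")
# -> 1.25 MB in flight, about 833 packets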

 

00:09:39.633 The problem is, the sender doesn't know

00:09:42.333 the bandwidth of the path, and it

00:09:44.700 doesn't know that latency. It doesn't know

00:09:46.633 the round trip time.

 

00:09:49.366 It can measure the round trip time,

00:09:51.433 but not until after it started sending.

00:09:53.733 Once it’s sent a packet, it can

00:09:55.800 wait for an acknowledgement to come back

00:09:57.566 and get an estimate of the round

00:09:58.966 trip time. But it can't do that

00:10:00.566 at the point where it starts sending.

 

00:10:02.566 And it can't know what is the

00:10:04.300 bandwidth. It knows the bandwidth of the

00:10:06.500 link it's connected to, but it doesn't

00:10:08.033 know the bandwidth for the rest of

00:10:09.500 the links throughout the network.

 

00:10:11.366 It doesn't know how many other TCP

00:10:13.566 flows it’s sharing the traffic with,

00:10:15.133 so it doesn't know how much of

00:10:16.600 that capacity it's got available.

 

00:10:19.000 And this is the problem with

00:10:21.366 the sliding window algorithms. If you get

00:10:24.033 the window size right,

00:10:26.100 it allows you to do the ACK

00:10:27.900 clocking, it allows you to clock out

00:10:29.566 the packets at the right time,

00:10:31.100 just in time for the next packet to become available.

 

00:10:34.166 But, in order to pick the right

00:10:35.500 window size, you need to know the

00:10:36.833 bandwidth and the delay, and you don't

00:10:38.666 know either of those at the start of the connection.

 

00:10:44.333 TCP follows the sliding window approach.

 

00:10:47.700 TCP Reno is very much a sliding

00:10:51.266 window protocol, and it's designed to cope with not

00:10:53.966 knowing what the window size should be.

 

00:10:58.466 And the challenge with TCP is to

00:11:01.033 pick what should be the initial window.

00:11:02.933 To pick how many packets you should

00:11:04.566 send, before you know anything about the

00:11:06.600 round trip time, or anything about bandwidth.

 

00:11:09.700 And how to find the path capacity,

00:11:11.633 how to figure out at what point

00:11:13.700 you've got the right size window.

00:11:15.866 And then how to adapt the window

00:11:18.833 to cope with changes in the capacity.

 

00:11:23.600 So there's two fundamental problems with TCP

00:11:26.766 Reno congestion control. Picking the initial window size

00:11:31.666 for the first set of packets you send.

 

00:11:34.833 And then, adapting that initial window size

00:11:37.500 to find the bottleneck capacity, and to

00:11:39.733 adapt to changes in that bottleneck capacity.

 

00:11:42.366 If you get the window size right,

00:11:44.500 you can make effective use of the

00:11:46.033 network capacity. If you get it wrong

00:11:48.633 you’ll either send too slowly, and end

00:11:50.900 up wasting capacity. Or you'll send too

00:11:53.033 quickly, and overload the network, and cause

00:11:55.200 packets to be lost because the queues fill.

 

00:12:01.800 So, how does TCP find the initial window?

 

00:12:05.966 Well, to start with, you have no

00:12:07.766 information. When you're making a TCP connection

00:12:10.900 to a host you haven't communicated with

00:12:12.966 before, you don't know the round trip

00:12:15.100 time to that host, you don’t know

00:12:16.633 how long it will take to get

00:12:17.800 a response, and you don't know the network capacity.

 

00:12:21.133 So you have no information to know

00:12:23.666 what an appropriately sized window should be.

 

00:12:27.966 The only safe thing you can do.

00:12:30.600 The only thing which is safe in

00:12:32.133 all circumstances, is to send one packet,

00:12:34.766 and see if it arrives, see if you get an ACK.

 

00:12:38.433 And if it works, send a little

00:12:39.933 bit faster next time.

 

00:12:42.500 And then gradually increase the rate at which you send.

 

00:12:46.100 The only safe thing to do

00:12:48.033 is to start at the lowest possible rate,

00:12:50.400 the equivalent of stop-and-wait, and then gradually

00:12:53.700 increase your rate from there, once you know that it works.

 

00:12:58.366 The problem is, of course, that's pessimistic,

00:13:00.433 in most cases.

 

00:13:02.000 Most links are not the slowest possible link.

00:13:04.500 Most links, you can send faster than that.

 

00:13:09.233 What TCP has traditionally done, and the

00:13:12.466 traditional approach in TCP Reno, is to declare

00:13:15.300 the initial window to be three packets.

 

00:13:18.533 So you can send three packets,

00:13:20.300 without getting any acknowledgments back.

 

00:13:23.300 And, by the time the third packet

00:13:24.800 has been sent, you should be just

00:13:27.033 about to get the acknowledgement back,

00:13:28.566 which will open it up for you to send the fourth.

00:13:30.933 And at that point, it starts ACK clocking.

 

00:13:34.700 And why is it three packets?

00:13:37.066 Because someone did some measurements,

00:13:38.933 and decided that was what was safe.

 

00:13:42.500 More recently, I guess, about 10 years

00:13:45.666 ago now, Nandita Dukkipati and her group

00:13:49.333 at Google did another set of measurements,

00:13:52.333 and showed that was actually pessimistic.

00:13:55.066 The networks had gotten a lot faster

00:13:57.233 in the time since TCP was first

00:13:59.833 standardised, and they came to the conclusion,

00:14:02.900 based on the measurements of browsers accessing

00:14:05.733 the Google site, that about 10 packets

00:14:08.600 was a good starting point.

 

00:14:11.500 And the idea here is that 10

00:14:13.133 packets, you can send 10 packets at

00:14:15.533 the start of a connection, and after

00:14:18.500 you’ve sent 10 packets you should have

00:14:20.266 got an acknowledgement back.

 

00:14:22.666 Why ten?

 

00:14:24.633 Again, it's a balance between safety and

00:14:27.233 performance. If you send too many packets

00:14:31.633 onto a network which can't cope with

00:14:33.333 them, those packets will get queued up

00:14:35.533 and, in the best case, it’ll just

00:14:37.566 add latency because they're all queued up

00:14:39.666 somewhere. And in the worst case they'll

00:14:41.466 overflow the queues, and cause packet loss,

00:14:43.500 and you'll have to re-transmit them.

 

00:14:45.900 So you don't want to send too

00:14:47.733 fast. Equally, you don't want to send

00:14:49.700 too slow, because that just wastes capacity.

 

00:14:52.733 And the measurements that Google came up with

00:14:56.000 at this point, which was around 10

00:14:58.133 years ago, was that about 10 packets

00:15:00.433 was a good starting point for most connections.

 

00:15:03.466 It was unlikely to cause congestion in

00:15:06.800 most cases, and was also unlikely to

00:15:08.966 waste too much bandwidth.

 

00:15:11.900 And I think what we'd expect to

00:15:13.333 see, is that over time the initial

00:15:14.900 window will gradually increase, as network connections

00:15:17.233 around the world gradually get faster.

 

00:15:19.566 And it's balancing making good use of

00:15:22.766 connections in well-connected

00:15:25.133 first-world parts of the world, where there’s

00:15:28.633 good infrastructure,

00:15:30.800 against not overloading connections in parts of

00:15:34.333 the world where the infrastructure is less well developed.

 

00:15:40.233 The initial window lets you send something.

 

00:15:43.266 With a modern TCP, it lets you send 10 packets.

 

00:15:48.266 And you can send those 10 packets,

00:15:50.166 or whatever the initial window is,

00:15:52.200 without waiting for an acknowledgement to come back.

 

00:15:55.733 But it's probably not the right size;

00:15:58.333 it’s probably not the right window size.

 

00:16:01.300 If you're on a very fast connection,

00:16:04.200 in a well-connected part of the world,

00:16:06.033 you probably want a much bigger window than 10 packets.

 

00:16:09.033 And if you're on a poor quality

00:16:11.500 mobile connection, or in a part of

00:16:13.433 the world where the infrastructure is less

00:16:15.133 well developed, you probably want a smaller window.

 

00:16:18.433 So you need to somehow adapt

00:16:20.000 to match the network capacity.

 

00:16:23.466 And there's two parts to this.

00:16:25.700 What's called slow start, where you try

00:16:28.200 to quickly find the appropriate initial window,

00:16:32.366 where, starting from the initial window, you quickly

00:16:34.900 converge on what the right window is.

 

00:16:37.266 And congestion avoidance, where you adapt in

00:16:39.800 the long term to match changes in

00:16:42.633 capacity once the thing is running.

 

00:16:47.300 So how does slow start work?

 

00:16:49.400 Well, this is the phase at the beginning of the connection.

 

00:16:52.766 It's easiest to illustrate if you assume

00:16:55.066 that the initial window is one packet.

 

00:16:57.600 If the initial window is one packet,

00:16:59.966 you send one packet, and at some

00:17:02.066 point later an acknowledgement comes back.

 

00:17:05.200 And the way slow start works is

00:17:07.066 that each acknowledgment you get back

00:17:09.433 increases the window by one.

 

00:17:13.733 So if you send one packet,

00:17:15.833 and get one packet back, that increases

00:17:18.466 the window from one to two,

00:17:20.133 so you can send two packets the next time.

 

00:17:23.133 And you send those two packets,

00:17:25.333 and you get two acknowledgments back.

00:17:27.066 And each acknowledgments increases the window by

00:17:29.233 one, so it goes to three,

00:17:30.800 and then to four. So you can

00:17:32.166 send four packets the next time.

 

00:17:35.233 And then you get four acknowledgments back,

00:17:37.666 each of which increases the window,

00:17:39.433 so your window is now eight.

 

00:17:42.133 And, as we are all, I think,

00:17:45.400 painfully aware after the pandemic, this is

00:17:47.966 exponential growth.

 

00:17:50.233 The window is doubling each time.

00:17:52.300 So it's called slow start because it

00:17:54.366 starts very slow, with one packet or

00:17:56.500 three packets or 10 packets, depending on

00:17:58.600 the version of TCP you have.

00:18:00.466 But each round trip time the window doubles.

00:18:03.666 It doubles its sending rate each time.

 

00:18:06.866 And this carries on until it loses

00:18:09.533 a packet. This carries on until it

00:18:11.766 fills the queues and overflows the capacity

00:18:14.300 of the network somewhere.

 

00:18:16.333 At which point it halves back to

00:18:18.266 its previous value, and drops out of

00:18:19.866 the slow start phase.
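
A sketch of the slow start growth just described, counting the window in packets and assuming one acknowledgement per packet (a simplification; real TCP counts bytes and may delay ACKs):

cwnd = 1                          # initial window, in packets
for rtt_round in range(1, 6):
    print(f"round trip {rtt_round}: send {cwnd} packets")
    acks = cwnd                   # assume every packet is acknowledged
    cwnd += acks                  # +1 per ACK, so the window doubles per RTT
# prints 1, 2, 4, 8, 16 packets per round trip: exponential growth,
# which continues until a packet is lost.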

 

00:18:23.733 If we look at this graphically,

00:18:26.133 what we see on the graph at

00:18:27.800 the bottom of the slide, we have

00:18:29.600 time on the X axis, and the

00:18:31.666 congestion window, the size of the congestion

00:18:33.800 window, on the y axis.

 

00:18:35.700 And we're assuming an initial window of

00:18:37.433 one packet. We see that, on the

00:18:39.933 first round trip it sends the one

00:18:41.566 packet, gets the acknowledgement back. The second

00:18:44.766 round trip it sends two packets.

00:18:46.700 And then four, and then eight,

00:18:48.166 and then 16. And each time it

00:18:50.400 doubles its sending rate.

 

00:18:52.366 So you have this exponential growth phase,

00:18:54.500 starting at whatever the initial window is,

00:18:57.466 and doubling each time until it reaches

00:18:59.366 the network capacity.

 

00:19:01.500 And eventually it fills the network.

00:19:03.600 Eventually some queue, somewhere in the network,

00:19:05.766 is full. And it overflows and the packet gets lost.

 

00:19:10.266 At that point the connection halves its

00:19:12.200 rate, back to the value just before

00:19:14.466 it last increased. In this example,

00:19:17.233 we see that it got up to

00:19:19.333 a window of 16, and then

00:19:21.900 something got lost, and then it halved

00:19:23.433 back down to a window of eight.

 

00:19:26.266 At that point TCP enters what's known

00:19:28.466 as the congestion avoidance phase.

 

00:19:33.500 The goal of congestion avoidance is to

00:19:37.500 adapt to changes in capacity.

 

00:19:41.300 After the slow start phase, you know

00:19:43.366 you've got approximately the right size window

00:19:45.466 for the path. It's telling you roughly

00:19:47.366 how many packets you should be sending

00:19:48.900 each round trip time. The goal,

00:19:51.266 once you’re in congestion avoidance, is to adapt to changes.

 

00:19:55.666 Maybe the capacity of the path changes.

00:19:58.900 Maybe you're on a mobile device,

00:20:00.900 with a wireless connection, and the quality

00:20:04.033 of the wireless connection changes.

 

00:20:06.400 Maybe the amount of cross traffic changes.

00:20:09.466 Maybe additional people start sharing the link

00:20:12.266 with you, and you have less capacity

00:20:14.033 because you’re sharing with more TCP flows.

 

00:20:16.666 Or maybe some of the cross traffic

00:20:18.033 goes away, and the amount of capacity

00:20:20.100 you have available increases because there's less

00:20:22.133 competing traffic.

 

00:20:24.433 And the congestion avoidance phase follows an

00:20:27.200 additive increase, multiplicative decrease,

00:20:29.300 approach to adapting

00:20:30.633 the congestion window when that happens.

 

00:20:34.866 So, in congestion avoidance,

00:20:38.166 if it successfully manages to send a

00:20:40.466 complete window of packets, and gets acknowledgments

00:20:43.300 back for each of those packets.

00:20:45.333 So it's sent out

00:20:47.900 eight packets, for example, and gets eight

00:20:50.600 acknowledgments back,

00:20:52.366 it knows the network can support that sending rate.

 

00:20:55.766 So it increases its window by one.

 

00:20:59.133 So the next time, it sends out nine packets

00:21:02.600 and expects to get nine acknowledgments back

00:21:05.333 over the next round trip cycle.

 

00:21:08.233 And if it successfully does that,

00:21:09.966 it increases the window again.

 

00:21:12.500 And it sends 10 packets, and expects

00:21:15.400 to get 10 acknowledgments back.

 

00:21:17.800 And we see that each round trip

00:21:20.000 it gradually increases the sending rate by

00:21:22.166 one. So it sends 8 packets,

00:21:24.566 then 9, then 10, then 11,

00:21:26.333 and 12, and keeps gradually, linearly,

00:21:29.166 increasing its rate.

 

00:21:31.900 Up until the point that something gets lost.

 

00:21:36.966 And if a packet gets lost?

 

00:21:40.300 You’ll be able to detect that because,

00:21:43.100 as we saw in the previous lecture,

00:21:44.733 you'll get a triple duplicate acknowledgement.

00:21:46.833 And that indicates that one of the

00:21:49.433 packets got lost, but the rest of

00:21:50.933 the data in the window was received.

 

00:21:54.666 And what you do at that point,

00:21:56.500 is you do a multiplicative decrease in

00:21:58.566 the window. You halve the window.

 

00:22:02.300 So, in this case, the sender was

00:22:04.533 sending with a window of

00:22:07.133 12 packets, and it successfully sent that.

00:22:10.200 And then it tried to send,

00:22:13.500 tried to increase its rate, realised it

00:22:17.066 didn't work, realised something got lost,

00:22:19.133 and so it halved its window back down to six.

 

00:22:23.500 And then it gradually switches back,

00:22:25.466 it switches back, and goes back to

00:22:27.400 the gradual additive increase.

 

00:22:29.733 And it follows this sawtooth pattern.

 

00:22:32.433 Gradual linear increase, one packet more each

00:22:35.666 round trip time.

 

00:22:37.633 Until it sends too fast, causes a

00:22:40.166 packet to be lost because it overflows

00:22:41.966 a queue, halves its sending rate,

00:22:44.133 and then gradually starts increasing it again.

 

00:22:47.833 It follows this sawtooth pattern. Gradual increase,

00:22:51.500 quick back-off; gradual increase, quick back-off.
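
The sawtooth can be seen in a small simulation. This is a sketch, not real TCP code: the window is counted in packets, the point at which the queue overflows is an invented constant, and loss is assumed to be detected by a triple duplicate ACK rather than a timeout.

cwnd = 8.0          # window after slow start has ended, in packets
capacity = 12       # assumed window at which the bottleneck queue overflows

for rtt_round in range(15):
    if cwnd > capacity:
        cwnd = cwnd / 2        # loss detected: multiplicative decrease
    else:
        cwnd = cwnd + 1        # whole window acknowledged: additive increase
    print(f"round trip {rtt_round}: cwnd = {cwnd:.1f}")

# The window climbs by one each round trip, halves once it exceeds the
# assumed capacity, and climbs again: the sawtooth in the graph above.
# A timeout, rather than a triple duplicate ACK, would instead reset the
# window to the initial window and re-enter slow start.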

 

00:22:57.433 The other way TCP can detect the

00:22:59.633 loss is by what’s known as a

00:23:01.266 time out. It’s sending the packets,

00:23:04.500 and suddenly the acknowledgements stop coming back entirely.

 

00:23:09.633 And this means that either the receiver

00:23:11.833 has crashed, the receiving system has gone

00:23:14.933 away, or perhaps more likely the network has failed.

 

00:23:18.733 And the data it’s sending is either

00:23:21.600 not reaching the receiver, or the reverse path has failed,

00:23:24.766 and the acknowledgments are not coming back.

 

00:23:29.200 At that point, after nothing has come back for a while,

00:23:33.333 it assumes a timeout has happened,

00:23:37.466 and resets the window down to the initial window.

 

00:23:41.833 And in the example we see on

00:23:43.866 the slide, at time 14 we've got

00:23:45.933 a timeout, and it resets, and the

00:23:48.500 window goes back to the initial window of one packet.

 

00:23:51.566 At that point, it re-enters slow start.

00:23:53.633 It starts again from the beginning.

 

00:23:55.966 And whether your initial window is one

00:23:58.066 packet, or three packets, or ten packets,

00:24:00.233 it starts from the beginning, and it

00:24:02.066 re-enters slow start, and it tries again

00:24:04.100 for the connection.

 

00:24:06.466 And if this was a transient failure,

00:24:08.500 that will probably succeed. If it wasn’t,

00:24:11.366 it may end up in yet another

00:24:13.900 timeout, while it takes time for the

00:24:15.600 network to recover, or

00:24:17.933 for the system you're talking to,

00:24:19.866 to recover, and it will be a

00:24:21.266 while before it can successfully send a

00:24:22.966 packet. But, when it does, when the

00:24:24.766 network recovers, it starts sending again,

00:24:26.866 and resets the connection from the beginning.

 

00:24:30.366 How long, should the timeout be?

 

00:24:33.533 Well, the standard says to take the maximum of

00:24:37.200 one second, or the average round trip

00:24:39.900 time plus four times the statistical variance

00:24:42.200 in the round trip time.

 

00:24:45.200 And, if you're a statistician, you’ll recognise

00:24:47.666 that the RTT plus four times the

00:24:49.766 variance, if you're assuming a normal distribution of

00:24:54.233 round trip time samples, accounts for 99%

00:24:57.733 of the samples falling within range.

 

00:25:01.266 So it's finding the 99th percentile of

00:25:04.466 the expected time to get an acknowledgement back.
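
A sketch of that timeout calculation. The smoothing constants below are the usual ones from the TCP retransmission timer specification (RFC 6298); treating them as the ones the lecture has in mind is an assumption.

srtt = None      # smoothed round trip time estimate, in seconds
rttvar = None    # smoothed variation in the round trip time

def rto_after_sample(rtt_sample):
    """Update the RTT estimate with a new measurement and return the
    retransmission timeout."""
    global srtt, rttvar
    if srtt is None:                       # first measurement
        srtt = rtt_sample
        rttvar = rtt_sample / 2
    else:
        rttvar = 0.75 * rttvar + 0.25 * abs(srtt - rtt_sample)
        srtt = 0.875 * srtt + 0.125 * rtt_sample
    # The larger of one second, or the smoothed RTT plus four times the
    # variation: roughly the 99th percentile mentioned above.
    return max(1.0, srtt + 4 * rttvar)

print(rto_after_sample(0.010))   # a 10 ms sample gives the one second floor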

 

00:25:12.700 Now, TCP follows this sawtooth behaviour,

00:25:16.866 with gradual additive increase in the sending

00:25:19.466 rate, and then a back-off, halving its

00:25:22.333 sending rate, and then a gradual increase again.

 

00:25:25.633 And we see this in the top

00:25:27.166 graph on the slide which is showing a

00:25:29.766 measured congestion window for a real TCP flow.

 

00:25:34.166 And, after dynamics of the slow start

00:25:36.266 at the beginning, we see it follows this sawtooth pattern.

 

00:25:41.366 How does that affect the rest of the network?

 

00:25:45.033 Well, the packets are, at some point,

00:25:48.133 getting queued up at whatever the bottleneck link is.

 

00:25:53.733 And the second graph we see on

00:25:55.466 the left, going down, is the size of the queue.

 

00:25:58.866 And we see that as the sending

00:26:00.766 rate increases, the queue gradually builds up.

 

00:26:04.200 Initially the queue is empty, and as

00:26:06.566 it starts sending faster, the queue gradually gets fuller.

 

00:26:11.333 And at some point the queue gets full, and overflows.

 

00:26:17.866 And when the queue gets full,

00:26:19.633 when the queue overflows, when packets get

00:26:21.800 lost, TCP halves its sending rate.

 

00:26:24.700 And that causes the queue to rapidly

00:26:27.166 empty, because there's less packets coming in,

00:26:29.566 so the queue drains.

 

00:26:31.466 But what we see is that just

00:26:33.266 as the queue is getting to empty,

00:26:35.666 the rate is starting to increase again.

 

00:26:38.566 Just as the queue gets to the point

00:26:40.200 where it would have nothing to send,

00:26:41.833 the rate starts picking up, such that

00:26:44.033 the queue starts to gradually refill.

 

00:26:46.600 So the queues in the routers also

00:26:48.600 follow a sawtooth pattern. They gradually fill

00:26:51.500 up until they get to a full point,

 

00:26:55.200 And then the rate halves, the queue

00:26:58.433 empties rapidly because

00:27:00.133 there's much less traffic coming in,

00:27:02.133 and as it's emptying the rate at

00:27:04.233 which the sender is sending is gradually

00:27:06.500 filling up, and the queue size oscillates.

 

00:27:09.266 And we see the same thing happens

00:27:11.066 with the round trip time, in the

00:27:13.766 third of the graphs, as the queue gradually

00:27:17.000 fills up, the round trip time goes

00:27:18.900 up, and up, and up, it's taking

00:27:20.733 longer for the packets to arrive because they're queued up somewhere.

 

00:27:23.366 And then the rate reduces, the queue

00:27:26.266 drops, the round trip time drops.

00:27:28.733 And then, as the rate picks up afterwards

00:27:33.066 back into congestion avoidance, the queue gradually

00:27:35.666 fills, and the round trip time gradually increases.

 

00:27:38.466 So the window size, and the queue

00:27:40.666 size, and the round trip time,

00:27:42.266 all follow this characteristic sawtooth pattern.

 

00:27:47.066 What's interesting though, if we look at

00:27:50.100 the fourth graph down on the left,

00:27:52.800 is we're looking at the rate at

00:27:54.333 which packets are arriving at the receiver.

 

00:27:56.966 And we see that the rate at

00:27:58.800 which packets are arriving at the receiver

00:28:00.533 is pretty much constant.

 

00:28:03.300 What's happening is that the packets are

00:28:05.266 being queued up at the link,

00:28:07.400 and as the queue fills there's more

00:28:09.833 and more packets queued up

00:28:11.900 at the bottleneck link. And when TCP

00:28:15.366 backs-off, when it reduces its window,

00:28:19.000 that lets the queue drain. But the

00:28:21.866 queue never quite empties. We just see

00:28:25.133 very occasional drops where the queue gets

00:28:27.566 empty, but typically the queue always has

00:28:30.033 something in it.

 

00:28:31.800 It's emptying rapidly, it’s getting less and

00:28:34.166 less data in it, but the queue,

00:28:37.666 if the buffer is sized right,

00:28:39.866 if the window is chosen right, never quite empties.

 

00:28:43.800 So the TCP sender is following this

00:28:46.433 sawtooth pattern, with its sending window,

00:28:49.600 which is gradually filling up the queues.

00:28:51.966 And then the queues are gradually draining

00:28:53.966 when TCP backs-off and halves its rate,

00:28:56.933 but the queue never quite empties.

00:28:58.933 It always has some data to send,

00:29:00.633 so the receiver is always receiving data.

 

00:29:03.700 So, even though the sender's following the

00:29:05.766 sawtooth pattern, the receiver receives constant rate

00:29:08.266 data the whole time,

00:29:10.233 at approximately the bottleneck bandwidth.

 

00:29:13.866 And that's the genius of TCP.

 

00:29:16.566 It manages, by following this additive increase,

00:29:20.066 multiplicative decrease, approach, it manages to adapt

00:29:24.333 the rate such that the buffer never

00:29:27.200 quite empties, and the data continues to be delivered.

 

00:29:32.233 And for that to work, it needs

00:29:34.433 the router to have enough buffering capacity

00:29:37.400 in it. And the amount of buffering

00:29:39.600 the router needs, is the bandwidth times

00:29:42.166 the delay of the path. And too

00:29:44.333 little buffering in the router

00:29:47.033 leads to

00:29:49.933 the queue overflowing, and it not quite

00:29:52.633 managing to sustain the rate. Too much,

00:29:55.500 you just get what’s known as buffer bloat.
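
 

As a rough illustration of that buffer sizing rule, here is a small sketch (not from the lecture; the link speed, round trip time, and 1500-byte packet size are illustrative assumptions) showing how the bandwidth-delay product translates into a buffer size:

    # Hypothetical example: buffer sizing by the bandwidth-delay product rule.
    # The link speed, RTT, and packet size below are illustrative assumptions.

    def bdp_buffer(bandwidth_bps: float, rtt_seconds: float, packet_bytes: int = 1500):
        """Return the bandwidth-delay product in bytes and in full-sized packets."""
        bdp_bits = bandwidth_bps * rtt_seconds      # bits "in flight" on the path
        bdp_bytes = bdp_bits / 8
        return bdp_bytes, bdp_bytes / packet_bytes

    bytes_needed, packets_needed = bdp_buffer(1e9, 0.05)   # 1 Gb/s link, 50 ms RTT
    print(f"Buffer of roughly {bytes_needed/1e6:.1f} MB, about {packets_needed:.0f} packets")
    # Prints roughly 6.2 MB, about 4167 packets.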

 

00:29:59.366 It's safe in terms of

00:30:00.700 throughput: the receiver keeps receiving the data.

00:30:02.766 But the queues get very big,

00:30:04.800 and they never get anywhere near empty,

00:30:07.466 so the amount of data queued up

00:30:09.766 increases, and you just get increased latency.

 

00:30:15.033 So that's TCP Reno. It's really effective

00:30:18.100 at keeping the bottleneck fully utilised.

00:30:20.466 But it trades latency for throughput.

00:30:22.866 It tries to fill the queue,

00:30:24.766 it's continually pushing, it’s continually queuing up data.

 

00:30:28.066 Making sure the queue is never empty.

 

00:30:30.800 Making sure the queue is never empty,

00:30:32.500 so provided there’s enough buffering in the

00:30:34.800 network there are always packets being delivered.

 

00:30:37.566 And that's great, if your goal is

00:30:39.966 to maximise the rate at which information

00:30:42.400 is delivered. TCP is really good at

00:30:45.466 keeping the bottleneck link fully utilised.

00:30:47.800 It’s really, really good at delivering data

00:30:49.900 as fast as the network can support it.

 

00:30:52.333 But it trades that off for latency.

 

00:30:56.500 It's also really good at making sure

00:30:59.166 there are queues in the network,

00:31:01.066 and making sure that the network is

00:31:03.466 not operating at its lowest possible latency.

00:31:06.300 There's always some data queued up.

 

00:31:11.733 There are two other limitations,

00:31:13.966 other than increased latency.

 

00:31:16.700 First, is that TCP assumes that losses

00:31:19.066 are due to congestion.

 

00:31:21.600 And historically that's been true. Certainly in

00:31:24.466 wired links, packet loss is almost always

00:31:27.566 caused by a queue filling up,

00:31:30.433 overflowing, and a router not having space

00:31:34.133 to enqueue a packet.

 

00:31:36.666 In certain types of wireless links,

00:31:39.366 in 4G or in WiFi links,

00:31:41.500 that's not always the case, and you

00:31:43.733 do get packet loss due to corruption.

 

00:31:46.533 And TCP will treat this as a

00:31:49.000 signal to slow down. Which means that

00:31:51.166 TCP sometimes behaves sub-optimally on wireless links.

 

00:31:55.366 And there's a mechanism called Explicit Congestion

00:31:57.966 Notification, which we'll talk about in one

00:32:00.400 of the later parts of this lecture,

00:32:01.900 which tries to address that.

 

00:32:04.400 The other, is that the congestion avoidance

00:32:07.433 phase can take a long time to ramp up.

 

00:32:10.600 On very long distance links, very high capacity

00:32:16.133 links, it can take a long time

00:32:17.666 to get up to, after packet loss,

00:32:20.300 it can take a very long time

00:32:21.433 to get back up to an appropriate rate.

 

00:32:23.766 And there are some occasions with very

00:32:26.333 fast long distance links, where it performs

00:32:28.300 poorly, because of the way the congestion

00:32:31.066 avoidance works.

 

00:32:32.933 And there's an algorithm known as TCP

00:02:34.800 Cubic, which I'll talk about in the

00:32:36.500 next part, which tries to address that.

 

00:32:40.333 And that's the basics of TCP.

 

00:32:42.600 The basic TCP congestion control algorithm is

00:32:45.333 a sliding window algorithm, where the window

00:32:48.500 indicates how many packets you’re allowed to

00:32:50.800 send before getting an acknowledgement.

 

00:32:53.766 The goal of the slow start and

00:32:56.333 the congestion avoidance phases, and the additive

00:32:59.266 increase, multiplicative decrease, is to adapt the

00:33:02.166 size of the window to match the network capacity.

 

00:33:05.133 It always tries to match the size

00:33:07.166 of the window exactly to the capacity,

00:33:09.633 so it's making the most use of the network resources.
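
 

To make that recap concrete, the following is a minimal sketch, in Python, of the additive increase, multiplicative decrease rule during congestion avoidance. The class name and the initial window of 10 packets are my own illustrative choices, not something defined in the lecture:

    # Simplified AIMD congestion-avoidance sketch (illustrative only):
    # +1 packet per RTT on success, halve the window on loss.

    class RenoLikeWindow:
        def __init__(self, initial_cwnd: float = 10.0):
            self.cwnd = initial_cwnd          # congestion window, in packets

        def on_rtt_without_loss(self):
            self.cwnd += 1.0                  # additive increase: one packet per RTT

        def on_loss(self):
            self.cwnd = max(1.0, self.cwnd / 2.0)   # multiplicative decrease: halve

    w = RenoLikeWindow()
    for rtt in range(20):
        w.on_rtt_without_loss()
    w.on_loss()
    print(w.cwnd)   # window grows to 30 packets, then is halved to 15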

 

00:33:14.733 In the next part, I’ll move on

00:33:16.933 and talk about an extension to the

00:33:20.033 TCP Reno algorithm, known as TCP Cubic,

00:33:23.066 which is intended to improve performance on

00:33:25.533 very fast and long distance networks.

 

00:33:27.966 And then, in the later parts,

00:33:29.466 we'll talk about extensions to reduce latency,

00:33:32.600 and to work on wireless links where

00:33:35.933 there are non-congestive losses.

Part 3: TCP Cubic

The third part of the lecture talks about the TCP Cubic congestion control algorithm, a widely used extension to TCP that improves its performance on fast, long-distance, networks. The lecture discusses the limitations of TCP Reno that led to the development of Cubic, and outlines how Cubic congestion control improves performance but retains fairness with Reno.

Slides for part 3

 

00:00:00.833 In the previous part, I spoke about TCP Reno.

 

00:00:04.133 TCP Reno is the default congestion control

00:00:07.033 algorithm for TCP, but it's actually not

00:00:09.566 particularly widely used in practice these days.

 

00:00:12.566 What most modern TCP versions use is,

00:00:14.966 instead, an algorithm known as TCP Cubic.

 

00:00:18.600 And the goal of TCP cubic is

00:00:20.666 to improve TCP performance on fast long distance networks.

 

00:00:26.033 So the problem with TCP Reno,

00:00:27.966 is that it’s performance can be comparatively

00:00:30.133 poor on networks with large bandwidth-delay products.

 

00:00:33.933 That is, networks where the product,

00:00:36.333 what you get when you multiply the

00:00:37.900 bandwidth of the network, in number of

00:00:39.766 bits per second, and the delay,

00:00:42.100 the round trip time of the network, is large.

 

00:00:45.833 Now, this is not a problem that

00:00:48.066 most people have, most of the time.

00:00:50.466 But, it's a problem that began to

00:00:52.400 become apparent in the early 2000s when

00:00:55.733 people working at organisations like CERN were

00:00:58.500 trying to transfer very large data files

00:01:01.033 across fast long distance

00:01:05.800 networks between CERN and the universities that

00:01:08.933 were analysing the data.

 

00:01:11.233 For example, CERN is based in Geneva,

00:01:13.800 in Switzerland, and some of the big

00:01:16.566 sites for analysing the data are based

00:01:19.533 at, for example, Fermilab just outside Chicago in the US.

 

00:01:23.900 And in order to get the data

00:01:26.166 from CERN to Fermilab, from Geneva to Chicago,

00:01:31.366 they put in place multi-gigabit transatlantic links.

 

00:01:37.566 And if you think about the congestion window needed to

00:01:42.666 make good use of a link like

00:01:44.666 that, you realise it actually becomes quite large.

 

00:01:48.066 If you assume the link is 10

00:01:50.766 gigabit per second, which was cutting edge

00:01:54.033 in the early 2000s, but it is

00:01:55.833 now relatively common for high-end links,

00:01:59.033 and assume 100 milliseconds round trip time,

00:02:02.100 which is possibly even a slight under-estimate

00:02:04.933 for the path from Geneva to Chicago,

00:02:08.900 in order to make good use

00:02:11.166 of that, you need a congestion window

00:02:12.866 which equals the bandwidth times the delay.

00:02:15.200 And 10 gigabits per second, times 100

00:02:17.633 milliseconds, gives you a congestion window of

00:02:20.233 about 100,000 packets.
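
 

As a quick back-of-the-envelope check of that figure, assuming 1500-byte packets (an assumption, since the packet size isn't stated here):

    # Back-of-the-envelope: window needed to fill a 10 Gb/s, 100 ms path,
    # assuming 1500-byte packets (an assumption, not stated on the slide).
    bandwidth_bps = 10e9
    rtt_s = 0.100
    packet_bytes = 1500

    window_packets = bandwidth_bps * rtt_s / (8 * packet_bytes)
    print(f"{window_packets:.0f} packets")   # about 83,000, i.e. roughly 100,000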

 

00:02:24.166 And, partly, it takes TCP a long

00:02:28.066 time, a comparatively long time, to slow

00:02:31.333 start up to a 100,000 packet window.

00:02:34.266 But that's not such a big issue,

00:02:36.533 because that only happens once at the

00:02:38.066 start of the connection. The issue,

00:02:40.166 though, is in congestion avoidance.

 

00:02:42.800 If one packet is lost on the

00:02:44.766 link, out of a window of 100,000,

00:02:47.266 that will cause TCP to back-off and

00:02:49.800 halve it’s window. And it then increases

00:02:53.066 sending rate again, by one packet every round trip time.

 

00:02:57.300 And backing off from 100,000 packet window

00:03:00.033 to a 50,000 packet window, and then

00:03:02.433 increasing by one each time, means it

00:03:04.766 takes 50,000 round trip times to recover

00:03:07.500 back up to the full window.

 

00:03:10.400 50,000 round trip times, when the round

00:03:13.000 trip time is 100 milliseconds, is about 1.4 hours.
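
 

That recovery time follows directly from the congestion avoidance rule of one extra packet per round trip; a quick check of the arithmetic:

    # Reno recovery time after halving a 100,000 packet window:
    # the window regains one packet per round trip time.
    window_before = 100_000
    window_after = window_before // 2          # multiplicative decrease
    rtt_s = 0.100

    rtts_to_recover = window_before - window_after    # 50,000 round trips
    print(rtts_to_recover * rtt_s / 3600)              # about 1.4 hours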

 

00:03:17.600 So it takes TCP about one-and-a-half hours

00:03:20.966 to recover from a single packet loss.

 

00:03:24.300 And, with a window of 100,000 packets,

00:03:27.666 you're sending enough data, at 10 gigabits per second,

00:03:32.033 that the imperfections in the optical fibre,

00:03:35.433 and imperfections in the equipment that are

00:03:37.333 transmitting the packets, become significant.

 

00:03:40.233 And you're likely to just see occasional

00:03:43.300 random packet losses, just because of imperfections

00:03:46.100 in the transmission medium, even if there's

00:03:48.166 no congestion. And this was becoming a

00:03:50.466 limiting factor, this was becoming a bottleneck

00:03:52.666 in the transmission.

 

00:03:54.366 It was becoming impossible to build

00:03:56.400 a network that was reliable enough,

00:03:58.733 that it never lost any packets in

00:04:01.433 transferring several hundreds of billions of packets

00:04:03.966 of data,

00:04:05.100 to exchange the data between CERN and

00:04:11.500 the sites which were doing the analysis.

 

00:04:14.600 TCP cubic is one of a range

00:04:16.733 of algorithms which were developed to try

00:04:19.200 and address this problem. To try and

00:04:22.000 recover much faster than TCP Reno would,

00:04:24.466 in the case when you had very

00:04:26.400 large congestion windows, and small amounts of packet loss.

 

00:04:32.033 So the idea of TCP cubic,

00:04:34.866 is that it changes the way the

00:04:36.866 congestion control works in the congestion avoidance phase.

 

00:04:41.200 So, in congestion avoidance, TCP cubic will

00:04:46.033 increase the congestion window faster than TCP

00:04:49.000 Reno would, in cases where the window is large.

 

00:04:54.366 In cases where the window is relatively

00:04:56.700 small, in the types of networks where

00:04:59.233 Reno has good performance, TCP cubic behaves

00:05:03.800 in a very similar way.

00:05:05.466 But as the windows get bigger,

00:05:07.066 as it gets to a regime where

00:05:09.033 TCP Reno doesn't work effectively, TCP cubic

00:05:11.900 gets more aggressive in adapting its congestion

00:05:15.200 window, and increases the congestion window much

00:05:17.700 more quickly in response to loss.

 

00:05:21.833 However, as the window approaches the

00:05:25.500 value it was at before the loss,

00:05:29.500 it slows its rate of increase:

00:05:31.333 it starts increasing rapidly, then slows

00:05:33.833 its rate of increase

00:05:36.000 as it approaches the previous value.

 

00:05:38.533 And if it then successfully manages to

00:05:41.666 send at that rate, if it successfully

00:05:44.166 moves above the previous sending rate,

00:05:47.600 then it gradually increases sending rate again.

 

00:05:51.800 It’s called TCP Cubic because it follows

00:05:54.733 a cubic equation to do this.

 

00:05:56.333 The shape of the equation, the shape

00:06:00.200 of the curve, we see on the

00:06:01.600 slide for TCP cubic is following a cubic graph.

 

00:06:05.600 The paper listed on the slide,

00:06:08.466 the paper shown on the slide,

00:06:09.900 from Injong Rhee and his collaborators,

 

00:06:13.633 is the paper which describes the algorithm in detail.

 

00:06:16.666 And it was eventually specified in IETF

00:06:19.833 RFC 8312 in 2018, although it's been

00:06:24.366 probably the most widely used TCP variant

00:06:27.666 for a number of years before that.

 

00:06:31.200 The details of how it works:

00:06:33.566 TCP cubic is a somewhat more complex

00:06:36.066 algorithm than Reno.

 

00:06:38.966 There are two parts to the behaviour.

 

00:06:42.066 If a packet is lost when a

00:06:44.866 TCP cubic sender is in the congestion avoidance phase,

00:06:49.233 it does a multiplicative decrease.

 

00:06:52.133 However, unlike TCP Reno, which does a

00:06:55.300 multiplicative decrease by multiplying by a factor

00:06:58.766 of 0.5, that is, it halves its

00:07:01.566 sending rate if a single packet is lost,

00:07:04.533 TCP cubic multiplies its rate by 0.7.

 

00:07:09.500 So, instead of dropping back down to

00:07:11.200 50% of its previous sending rate,

00:07:13.400 it drops down to 70% of the sending rate.

 

00:07:17.233 It backs-off less, it's more aggressive.

00:07:19.600 It’s more aggressive at using bandwidth.

00:07:23.300 It reduces it’s sending rate in response

00:07:25.733 to loss, but by smaller fraction.

 

00:07:31.866 After it's backed-off, TCP cubic also changes

00:07:36.233 the way in which it increases its sending rate in future.

 

00:07:40.733 So we saw in the previous slide,

00:07:42.500 TCP Reno increases its congestion window by

00:07:46.100 one, for every round trip when it

00:07:48.600 successfully sends data.

 

00:07:50.800 So if the window backs off to

00:07:53.033 10, then it goes to 11 the

00:07:54.900 next round trip time, then 12,

00:07:56.700 and 13, and so on, with a

00:07:58.466 linear increase in the window.

 

00:08:02.000 TCP cubic, on the other hand,

00:08:04.033 sets the window as we see in

00:08:06.766 the equation on the slide. It sets

00:08:08.766 the window to be a constant,

00:08:11.233 C, times (T − K) cubed, plus Wmax.

 

00:08:17.100 Where the constant, C, is set to

00:08:19.766 0.4, which is a constant that controls

00:08:22.800 how fair it is to TCP Reno,

00:08:25.266 and was determined experimentally.

 

00:08:28.033 T is the time since the packet

00:08:29.933 loss. K is the time it will

00:08:32.200 take to increase the window back up to

00:08:36.266 the maximum it was before the packet

00:08:40.066 loss, and Wmax is the maximum window

00:08:42.633 size it reached before the loss.

 

00:08:45.200 And this gives the cubic growth function,

00:08:47.866 which we saw on the previous slide,

00:08:49.600 where the window starts to increase quickly,

00:08:52.033 the growth slows as it approaches that previous value

00:08:55.433 it reached just before the loss,

00:08:57.933 and if it successfully passes through that

00:09:00.033 point, the rate of growth increases again.
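
 

Putting those pieces together, below is a minimal sketch of the Cubic window growth function, following the shape described above and in RFC 8312. It is an illustration only: it computes K so that the curve starts from the reduced window, and it ignores the TCP-friendly and fast-convergence details that the full algorithm adds.

    # Sketch of the Cubic congestion-avoidance window (following RFC 8312's shape).
    # Simplified: ignores the TCP-friendly region and fast convergence.
    C = 0.4        # aggressiveness constant mentioned on the slide
    BETA = 0.7     # multiplicative decrease factor

    def cubic_window(t_since_loss: float, w_max: float) -> float:
        """Window (in packets) t seconds after a loss, given the pre-loss maximum."""
        # K is the time at which the window returns to w_max:
        # solve w_max * BETA = C * (0 - K)**3 + w_max  for K.
        k = ((w_max * (1 - BETA)) / C) ** (1.0 / 3.0)
        return C * (t_since_loss - k) ** 3 + w_max

    w_max = 1000.0
    print(cubic_window(0.0, w_max))    # 700.0: starts at 70% of the old maximum
    k = ((w_max * (1 - BETA)) / C) ** (1.0 / 3.0)
    print(cubic_window(k, w_max))      # approximately 1000.0: back at w_max at t = K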

 

00:09:03.766 Now, that's the high-level version. And we

00:09:06.666 can already see it's more complex than

00:09:09.266 the TCP Reno equation. The algorithm on

00:09:13.766 the right of the slide, which is

00:09:16.433 intentionally presented in a way which is

00:09:18.933 completely unreadable here,

00:09:21.166 shows the full details. The point is

00:09:24.233 that there's a lot of complexity here.

 

00:09:27.300 The basic equation, the basic back-off to

00:09:30.766 0.7 times and then follow the cubic

00:09:33.133 equation, to increase rapidly, slow the rate

00:09:36.666 of increase, and then increase rapidly again

00:09:39.100 if it successfully gets past the previous bottleneck point,

00:09:43.133 is enough to illustrate the key principle.

 

00:09:46.300 The rest of the details are there

00:09:48.133 to make sure it's fair with TCP

00:09:50.066 Reno on links which are slower,

00:09:52.366 or where the round trip time is shorter.

 

00:09:55.600 And so, in the regime where TCP

00:09:57.733 Reno can successfully make use of the

00:09:59.833 link, TCP Cubic behaves the same way.

 

00:10:02.866 And, as you get into a regime

00:10:05.000 where Reno can't effectively make use of

00:10:07.666 the capacity, because it can't sustain a

00:10:09.466 large enough congestion window,

00:10:11.133 then cubic starts to behave differently,

00:10:14.433 and starts to switch to the cubic

00:10:16.666 equation. And that allows it to recover

00:10:19.700 from losses more quickly, and to more

00:10:21.833 effectively continue to make use of higher

00:10:23.800 bandwidths and higher latency paths.

 

00:10:29.200 TCP cubic is the default in most

00:10:33.200 modern operating systems. It’s the default in

00:10:36.866 Linux, it's the default in FreeBSD,

00:10:39.733 I believe it's the default in macOS

00:10:42.733 and iPhones.

 

00:10:44.666 Microsoft Windows has an algorithm called Compound

00:10:48.566 TCP which is a different algorithm,

00:10:50.900 but has a similar effect.

 

00:10:54.166 It’s much more complex than TCP Reno.

 

00:10:56.900 The core response, the back off to

00:11:00.033 70% and then follow the characteristic cubic

00:11:03.900 curve, is conceptually relatively straightforward, but once

00:11:07.733 you start looking at the details of

00:11:09.966 how it behaves, there gets to be a lot of complexity.

 

00:11:13.833 And most of that is in there

00:11:16.333 to make sure it's reasonably fair to

00:11:19.433 TCP, to TCP Reno, in the regime

00:11:22.833 where Reno typically works. But it improves

00:11:26.233 performance for networks with longer round trip

00:11:28.366 times and higher bandwidths.

 

00:11:32.033 Both TCP Cubic and TCP Reno

00:11:35.933 use packet loss as

00:11:39.800 a congestion signal. And they both eventually

00:11:42.733 fill the router buffers.

00:11:44.533 And TCP cubic does so more aggressively

00:11:47.133 than Reno. So, in both cases,

00:11:49.400 they're trading off latency for throughput.

00:11:51.666 They're trying to make sure the buffers are full.

00:11:53.933 They're trying to make sure

00:11:56.166 the buffers in the intermediate routers are full.

00:11:58.866 And they're both making sure that they

00:12:02.066 keep the congestion window large enough to

00:12:04.433 keep the buffers fully utilised, so packets

00:12:08.633 keep arriving at the receiver at all times.

 

00:12:11.300 And that's very good for achieving high

00:12:13.033 throughput, but it pushes the latency up.

 

00:12:16.300 So, again, they’re trading-off increased latency for

00:12:19.933 good performance, for good throughput.

 

00:12:25.333 And that's what I want to say

00:12:26.666 about Cubic. Again, the goal is to

00:12:29.566 use a different response function to improve

00:12:32.333 throughput on very fast, long distance, links,

00:12:36.100 multi-gigabit per second transatlantic links, being the

00:12:39.833 common example.

00:12:42.300 And the goal is to make good

00:12:44.966 use of the available capacity.

 

00:12:47.633 In the next part I’ll talk about

00:12:50.600 alternatives which, rather than focusing on throughput,

00:12:53.800 focus on keeping latency bounded whilst achieving

00:12:57.533 reasonable throughput.

Part 4: Delay-based Congestion Control

The 4th part of the lecture discussed how both the Reno and Cubic algorithms impact latency. It shows how their loss-based response to congestion inevitably causes router queues to fill, increasing path latency, and discusses how this is unavoidable with loss-based congestion control. It introduces the idea of delay-based congestion control and the TCP Vegas algorithm, highlights its potential benefits and deployment challenges. Finally, TCP BBR is briefly introduced as an experimental extension that aims to achieve some of the benefits of delay-based congestion control, in a deployable manner.

Slides for part 4

 

00:00:00.566 In the previous parts, I’ve spoken about

00:00:02.700 TCP Reno and TCP cubic. These are

00:00:05.866 the standard, loss based, congestion control algorithms

00:00:08.966 that most TCP implementations use to adapt

00:00:11.933 their sending rate. These are the standard

00:00:14.933 congestion control algorithms for TCP.

 

00:00:17.566 What I want to do in this

00:00:19.100 part is to recap why these algorithms cause

00:00:23.033 additional latency in the network, and talk

00:00:25.933 about two alternatives which try to adapt

00:00:29.966 the sending rate of TCP without building

00:00:32.933 up queues, and without

00:00:34.800 overloading the network and causing too much latency.

 

00:00:40.400 So, as I mentioned, TCP Cubic and

00:00:42.900 TCP Reno both aim to fill up the network.

 

00:00:46.466 They use packet loss as a congestion signal.

 

00:00:50.300 So the way they work is they

00:00:52.733 gradually increase their sending rate, they’re in

00:00:55.900 either slow start or congestion avoidance phase,

00:00:58.900 and they’re always gradually increasing the sending

00:01:01.433 rates, gradually filling up the queues in

00:01:03.766 the network, until those queues overflow.

 

00:01:07.333 At that point a packet is lost.

 

00:01:09.733 TCP then backs-off its sending rate,

00:01:13.466 it backs-off its window, which allows the

00:01:16.133 queue to drain, but as the queue

00:01:18.200 is draining, both

00:01:19.766 Reno and Cubic are increasing their sending

00:01:22.533 rate, are increasing the sending window,

00:01:25.366 and so gradually start filling up

00:01:27.833 the queue again.

 

00:01:29.266 As we saw, the queues in the

00:01:31.400 network oscillate, but they never quite empty.

 

00:01:34.333 And for both Reno and Cubic, the goal

00:01:36.866 is to keep some packets queued up

00:01:39.766 in the network, make sure there's always

00:01:42.233 some data queued up, so they can

00:01:44.000 keep delivering data.

 

00:01:47.366 And, no matter how big a queue

00:01:50.300 you put in the network, no matter

00:01:52.200 how much memory you give the routers

00:01:53.866 in the network, TCP Reno and TCP

00:01:57.266 cubic will eventually cause it to overflow.

 

00:02:00.800 They will keep sending, they'll keep increasing

00:02:04.233 the sending rate, until whatever queue is

00:02:06.866 in the network is full, and it overflows.

 

00:02:10.333 And the more memory in the routers,

00:02:12.133 the more buffer in the routers,

00:02:13.900 the longer that queue will get and

00:02:15.833 the worse the latency will be.

 

00:02:18.433 But in all cases, in order to

00:02:21.366 achieve very high throughput, in order to

00:02:23.533 keep the network busy, keep the bottleneck

00:02:25.433 link busy, TCP Reno and TCP cubic

00:02:29.033 queue some data up.

 

00:02:31.100 And this adds latency.

 

00:02:34.300 It means that, whenever there’s TCP Reno,

00:02:37.866 whenever there’s TCP cubic flows, using the

00:02:40.300 network, the queues will have data queued up.

 

00:02:45.800 There’ll always be data queued up for

00:02:47.800 delivery. There's always packets waiting for delivery.

00:02:50.933 So it forces the network to work

00:02:53.133 in a regime where there's always some

00:02:56.566 excess latency.

 

00:03:01.333 Now, this is a problem for real-time

00:03:05.066 applications. It’s a problem if you're running

00:03:07.233 a video conferencing tool, or a telephone

00:03:11.366 application, or a game, or a real

00:03:13.766 time control application, because you want low

00:03:16.633 latency for those applications.

 

00:03:19.133 So it would be desirable if we

00:03:21.166 could have an alternative to TCP

00:03:23.600 Reno or TCP cubic that can achieve

00:03:25.800 good throughput for TCP, without forcing the

00:03:28.400 queues to be full.

 

00:03:31.433 One attempt at doing this was a proposal called TCP Vegas.

 

00:03:37.366 And the insight from TCP Vegas is that

00:03:42.800 you can watch the rate of growth,

00:03:45.800 or increase, of the queue, and use

00:03:48.633 that to infer whether you're sending faster,

00:03:50.700 or slower, than the network can support.

 

00:03:54.233 The insight was, if you're sending,

00:03:56.166 if a TCP is sending, faster than

00:03:58.366 the maximum capacity a network can deliver

00:04:00.933 at, the queue will gradually fill up.

00:04:03.500 And as the queue gradually fills up,

00:04:05.533 the latency, the round trip time, will gradually increase.

 

00:04:10.066 TCP Cubic, and TCP Reno, wait until

00:04:13.933 the queue overflows, wait until there's no

00:04:16.133 more space to put new packets in,

00:04:18.066 and a packet is lost, and at

00:04:19.800 that point they slow down.

 

00:04:22.666 The insight for TCP Vegas was to

00:04:25.300 watch as the delay increases, and as

00:04:28.500 it sees the delay increasing, it slows

00:04:31.300 down before the queue overflows.

 

00:04:34.533 So it uses the gradual increase in

00:04:36.366 the round trip time, as an indication

00:04:38.500 that it should send slower.

 

00:04:40.800 And as the round-trip time reduces,

00:04:43.033 as the round-trip time starts to drop,

00:04:45.066 it treats that as an indication that

00:04:46.933 the queue is draining, which means it can send faster.

 

00:04:50.766 It wants a constant round trip time.

00:04:53.366 And, if the round trip time increases,

00:04:55.300 it reduces its rate; and if the

00:04:57.933 round-trip time decreases, it increases its rate.

00:05:00.200 So, it's trying to balance its rate

00:05:03.033 with the round trip time, and not

00:05:04.866 build or shrink the queues.

 

00:05:08.333 And because you can detect the queue

00:05:10.966 building up before it overflows, you can

00:05:14.233 take action before the queue is completely

00:05:16.133 full. And that means the queue is

00:05:18.466 running with lower occupancy, so you have

00:05:21.000 lower latency across the network.

 

00:05:23.666 It also means that because packets are

00:05:25.533 not being lost, you don't need to

00:05:27.866 re-transmit as many packets. So it improves

00:05:30.700 the throughput that way, because you're not

00:05:32.600 resending data that you've already sent and that has been lost.

 

00:05:36.633 And that's the fundamental idea of TCP

00:05:38.966 Vegas. It doesn't change the slow start behaviour at all.

 

00:05:42.566 But, once you're into congestion avoidance,

00:05:44.900 it looks at the variation in round

00:05:47.100 trip time rather than looking at packet

00:05:49.200 loss, and uses that to drive the

00:05:51.366 variation in the speed at which it’s sending.

 

00:05:56.566 The details of how it works.

 

00:05:59.466 Well, first, it tries to estimate what

00:06:01.766 it calls the base round trip time.

 

00:06:04.766 So every time it sends a packet,

00:06:07.033 it measures how long it takes to

00:06:08.733 get a response. And it tries to

00:06:10.800 find the smallest possible response time.

 

00:06:14.166 The idea being that the smallest time

00:06:17.366 it gets a response, would be the

00:06:18.833 time when the queue is at its emptiest.

 

00:06:21.766 It may never observe a completely

00:06:23.466 empty queue but, by taking the smallest

00:06:26.066 response time, it's trying to estimate the

00:06:29.866 time a packet takes when there's nothing else queued in the network.

 

00:06:34.066 And anything on top of that indicates

00:06:36.233 that there is data queued up somewhere in the network.

 

00:06:41.133 Then it calculates an expected sending rate.

00:06:45.266 It takes the window size, which indicates

00:06:48.033 how many packets it's supposed to send

00:06:50.533 in that round-trip time,

00:06:52.533 how many bytes of data it’s supposed

00:06:54.366 to send in that round-trip time,

00:06:56.066 and it divides it by the base

00:06:57.433 round trip time. So if you divide

00:07:00.633 number of bytes by time, you get

00:07:03.166 a bytes per second, and that gives

00:07:05.566 you the rate at which it should be sending data.

 

00:07:09.333 And if the network can

00:07:12.033 support sending at that rate, it should

00:07:14.366 be able to deliver that window of

00:07:17.800 packets within a complete round trip time.

 

00:07:20.866 And, if it can’t, it will take

00:07:22.566 longer than a round trip time to

00:07:24.300 deliver that window of packets, and the

00:07:25.866 queues will be gradually building up. Alternatively,

00:07:28.866 if it takes less than a round

00:07:30.333 trip time, this is an indication that

00:07:31.900 the queues are decreasing.

 

00:07:35.500 And it measures the actual rate at

00:07:37.100 which it sends the packets.

 

00:07:39.466 And it compares them.

 

00:07:41.600 And if the actual rate at which

00:07:43.166 it's sending packets is less than the

00:07:45.466 expected rate, if it's taking longer than

00:07:47.733 a round-trip time to deliver the complete

00:07:49.633 window worth of packets, this is a

00:07:51.700 sign that the packets can’t all be delivered.

 

00:07:56.866 And it, you know, it's trying to send too

00:07:59.966 much. It’s trying to send at too

00:08:01.666 fast a rate, and it should reduce

00:08:03.166 its rate and let the queues drop.

 

00:08:05.900 Equally, in the other case it should

00:08:08.333 increase its rate, and measuring the difference

00:08:10.966 between the actual and the expected rates,

00:08:13.800 it can measure whether the queues growing or shrinking.

 

00:08:18.733 And TCP Vegas compares the expected rate

00:08:21.966 with the actual rate it manages to send at,

00:08:24.566 the rate at which it gets

00:08:26.566 the acknowledgments back.

 

00:08:30.600 And it adjusts the window.

 

00:08:34.333 And if the expected rate, minus the

00:08:37.700 actual rate, is less than some threshold,

00:08:40.700 that indicates that it should increase its

00:08:43.666 window. And if the expected rate,

00:08:45.933 minus the actual rate, is greater than

00:08:48.000 some other threshold, then it should decrease the window.

 

00:08:51.266 That is, if data is arriving at

00:08:53.633 the expected rate, or very close to

00:08:56.200 it, this is probably a sign that

00:08:58.366 the network can support a higher rate,

00:09:00.533 and you should try sending a little bit faster.

 

00:09:03.566 Alternatively, if data is arriving slower

00:09:06.133 than it's being sent,

00:09:07.133 this is a sign that you're sending too fast and you

00:09:09.233 should slow down.

 

00:09:10.833 And the two thresholds, R1 and R2,

00:09:12.933 determine how close you have to be

00:09:15.033 to the expected rate, and how far

00:09:16.866 away from it you have to be in order to slow down.
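
 

A minimal sketch of that Vegas-style decision is shown below. The function name and the threshold values for R1 and R2 are illustrative placeholders (the lecture doesn't give numbers, and real implementations express the thresholds in packets of queued data); it is only meant to show the shape of the comparison between expected and actual rates.

    # Illustrative sketch of the TCP Vegas congestion-avoidance decision.
    # The thresholds r1 and r2 are placeholders for the R1 and R2 on the slide.

    def vegas_adjust(cwnd: float, base_rtt: float, measured_rtt: float,
                     r1: float = 1.0, r2: float = 3.0) -> float:
        """Return the new window given the base and currently measured RTT."""
        expected_rate = cwnd / base_rtt        # what we'd send if nothing were queued
        actual_rate = cwnd / measured_rtt      # what the network actually delivered
        # diff estimates how many of our packets are sitting in queues.
        diff = (expected_rate - actual_rate) * base_rtt
        if diff < r1:
            return cwnd + 1.0                  # little queuing: probe a bit faster
        elif diff > r2:
            return cwnd - 1.0                  # queue building up: back off gently
        return cwnd                            # in between: hold the rate steady

    print(vegas_adjust(cwnd=100, base_rtt=0.050, measured_rtt=0.0502))  # 101.0: almost no queuing, so grow
    print(vegas_adjust(cwnd=100, base_rtt=0.050, measured_rtt=0.060))   # 99.0: queue building up, so shrink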

 

00:09:20.733 And the result is that TCP Vegas

00:09:24.700 follows a much smoother transmission rate.

 

00:09:28.300 Unlike TCP Reno, which follows the characteristic

00:09:31.700 sawtooth pattern, or TCP cubic which follows the

00:10:35.866 cubic equation to change its rate,

00:09:39.533 both of which adapt quite abruptly whenever

00:09:43.233 there's a packet loss,

00:09:44.933 TCP Vegas makes a gradual change.

00:09:47.266 It gradually increases, or decreases, its sending

00:09:50.466 rate in line with the variations in

00:09:52.833 the queues. So, it’s a much smoother

00:09:54.900 algorithm, which doesn't continually build up and

00:09:58.266 empty the queues.

 

00:10:01.166 Because the queues are not continually building

00:10:03.966 up, not continually being filled, this keeps

00:10:08.366 the latency down

00:10:09.400 while still achieving reasonably good performance.

 

00:10:15.833 TCP Vegas is a good idea in principle.

 

00:10:21.633 This idea is known as delay-based congestion

00:10:24.600 control, and I think it's actually a

00:10:26.500 really good idea in principle. It reduces

00:10:29.666 the latency, because it doesn't fill the queues.

 

00:10:33.100 It reduces the packet loss, because it's

00:10:35.300 not pushing the queues

00:10:38.133 to overflow and causing packets to be

00:10:39.833 lost. So the only packet losses you

00:10:42.233 get are those caused by transmission problems.

 

00:10:45.433 And this reduces unnecessary retransmissions, because

00:10:48.766 you're not forcing the

00:10:50.633 network into overload, and forcing it to

00:10:52.633 lose the packets, and it reduces the latency.

 

00:10:57.200 The problem with TCP Vegas is that

00:11:00.600 it doesn't interwork with

00:11:03.900 TCP Reno or TCP cubic.

 

00:11:07.833 If you have any TCP Reno or

00:11:10.200 Cubic flows on the network, they will

00:11:12.300 aggressively increase their sending rate and try

00:11:15.300 to fill the queues, and push

00:11:17.300 the queues into overload.

 

00:11:19.966 And this will increase the round-trip time,

00:11:22.966 reduce the rate at which Vegas can

00:11:26.300 send, and it will force TCP Vegas to slow down.

 

00:11:30.033 Because TCP Vegas sees the queues increasing,

00:11:33.033 because Cubic and Reno are intentionally trying

00:11:36.266 to fill those queues, and if the

00:11:38.333 queues increase, this causes Vegas to slow down.

 

00:11:41.200 That gradually means there's more space in

00:11:44.200 the queues, which Cubic and Reno will

00:11:46.633 gradually fill-up, which causes Vegas to slow

00:11:49.200 down, and they end up in a

00:11:50.900 spiral, where the TCP Vegas flows get

00:11:52.800 pushed down to zero, and the Reno

00:11:55.700 or Cubic flows use all of the capacity.

 

00:11:59.333 So if we only have TCP Vegas

00:12:01.400 in the network, I think it would

00:12:03.466 behave really nicely, and we get really

00:12:05.500 good, low latency, behaviour from the network.

 

00:12:08.900 Unfortunately we're in a world where Reno,

00:12:11.933 and Cubic, have been deployed everywhere.

 

00:12:14.733 And without a step change, without an

00:12:18.933 overnight switch where we turn off Cubic,

00:12:21.966 and we turn off Reno, and we

00:12:23.366 turn on Vegas everywhere, we can't deploy

00:12:25.900 TCP Vegas because it always loses out to

00:12:28.866 Reno and Cubic.

 

00:12:31.166 So, it's a good idea in principle,

00:12:33.233 but in practice it can't be used

00:12:35.033 because of the deployment challenge.

 

00:12:40.600 As I say, it's a good idea

00:12:42.733 in principle, and the idea of using

00:12:45.433 delay as a congestion signal is a

00:12:47.766 good idea in principle, because we can

00:12:50.066 get something which achieves lower latency.

 

00:12:54.866 Is it possible to deploy a different

00:12:57.733 algorithm? Maybe the problem is not the principle,

00:13:00.266 maybe the problem is the algorithm in TCP Vegas?

 

00:13:05.466 Well, people are trying alternatives which are delay based.

 

00:13:10.233 And the most recent attempt at this

00:13:12.966 is an algorithm called TCP BBR,

00:13:15.200 Bottleneck Bandwidth and Round-trip time.

 

00:13:18.466 And again, this is a proposal that

00:13:20.533 came out of Google. And one of

00:13:23.133 the co-authors, if you look at the

00:13:25.533 paper on the right, is Van Jacobson,

00:13:28.033 who was the original designer of TCP

00:13:30.300 congestion control. So there's clearly some smart

00:13:32.833 people behind this.

 

00:13:34.600 The idea is that it tries to explicitly

00:13:36.966 measure the round-trip time as it sends

00:13:39.500 the packets. It tries to explicitly measure

00:13:42.133 the sending rate in much the same way that

00:13:45.666 TCP Vegas does. And it makes some

00:13:48.233 probes where it varies

00:13:51.533 its rate to try and find if

00:13:53.400 it's got more capacity, or to try and

00:13:55.400 sense if there is other traffic on the network.

 

00:13:58.533 It tries to directly set a congestion

00:14:01.066 window that matches the network capacity,

00:14:04.066 based on those measurements.
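
 

As a very rough sketch of that core idea (my own simplification, not the actual BBR algorithm, which adds pacing, probing cycles, and state machines), the window is sized from the measured bottleneck bandwidth and the minimum observed round trip time:

    # Very rough sketch of the BBR idea: size the data in flight from the
    # measured bottleneck bandwidth and minimum RTT. The real algorithm adds
    # pacing, probing phases, and many other details omitted here.

    def bbr_like_cwnd(bottleneck_bw_bps: float, min_rtt_s: float,
                      gain: float = 2.0, packet_bytes: int = 1500) -> float:
        """Congestion window (packets) targeting roughly one bandwidth-delay product."""
        bdp_bytes = bottleneck_bw_bps * min_rtt_s / 8
        return gain * bdp_bytes / packet_bytes   # gain leaves headroom for delayed ACKs

    print(bbr_like_cwnd(100e6, 0.020))   # 100 Mb/s, 20 ms path: a few hundred packets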

 

00:14:06.533 And, because this came out of Google,

00:14:08.600 it got a lot of press,

00:14:10.666 and Google turned it on for a

00:14:13.533 lot of their traffic. I know they

00:14:15.433 were running it for YouTube for a

00:14:16.866 while, and a lot of people saw

00:14:18.966 this, and jumped on the bandwagon.

00:14:21.333 And, for a while, it was starting

00:14:23.100 to get a reasonable amount of deployments.

 

00:14:27.100 The problem is, it turns out not to work very well.

 

00:14:31.066 And Justine Sherry at Carnegie Mellon University,

00:14:36.733 and her PhD student Ranysha Ware,

00:14:39.500 did a really nice bit of work

00:14:41.533 that showed that it is incredibly unfair to

00:14:44.400 regular TCP traffic.

 

00:14:46.766 And, it's unfair in kind-of the opposite

00:14:49.633 way to Vegas. Whereas TCP Reno and

00:14:53.600 TCP Cubic would force TCP Vegas flows

00:14:56.400 down to nothing, TCP BBR is unfair

00:14:59.766 in the opposite way, and it demolishes

00:15:02.600 Reno and Cubic flows, and causes tremendous

00:15:05.266 amounts of packet loss for those flows.

 

00:15:08.266 So it's really much more aggressive than

00:15:11.133 the other flows in certain cases,

00:15:13.233 and this leads to really quite severe unfairness problems.

 

00:15:17.533 And the Vimeo link on the slide is a link to the talk at

00:15:24.133 the Internet Measurement Conference, where Ranysha talks

00:15:28.233 through that, and demonstrates really clearly that

00:15:30.966 TCP BBR is really quite problematic, and

00:15:36.033 not very safe to deploy on the current network.

 

00:15:41.066 And there's a variant called

00:15:43.100 BBR v2, which is under development,

00:15:46.266 and seems to be changing,

00:15:48.566 certainly on a monthly basis, which is

00:15:51.433 trying to solve these problems. And this

00:15:53.866 is very much an active research area,

00:15:55.833 where people are looking to find better alternatives.

 

00:16:01.966 So that's the principle of delay-based congestion control.

 

00:16:05.400 Traditional TCP, the Reno algorithm and the

00:16:09.100 Cubic algorithms, intentionally try to fill the

00:16:12.166 queues, they intentionally try to cause latency.

 

00:16:16.633 TCP Vegas is one well-known algorithm which

00:16:20.833 tries to solve this, and

00:16:24.200 doesn't work in practice, but in principle

00:16:27.766 is a good idea, it just has

00:16:30.033 some deployment challenges, given the installed base

00:16:32.800 of Reno and Cubic.

 

00:16:35.366 And there are new algorithms, like TCP

00:16:38.200 BBR, which don't currently work well,

00:16:41.466 but have potential to solve this problem.

 

00:16:44.466 And, hopefully, in the future, a future

00:16:47.166 variant of BBR will work effectively,

00:16:51.800 and we'll be able to transition to

00:16:53.633 a lower latency version of TCP.

Part 5: Explicit Congestion Notification

The use of delay-based congestion control is one way of reducing network latency. Another is to keep Reno and Cubic-style congestion control, but to move away from using packet loss as an implicit congestion signal, and instead provide an explicit congestion notification from the network to the applications. This part of the lecture introduces the ECN extension to TCP/IP that provides such a feature, and discusses its operation and deployment.

Slides for part 5

 

00:00:00.433 In the previous parts of the lecture,

00:00:02.166 I’ve discussed TCP congestion control. I’ve discussed

00:00:05.566 how TCP tries to measure what the

00:00:07.700 network's doing and, based on those measurements,

00:00:10.266 adapt it’s sending rate to match the

00:00:12.433 available network capacity.

 

00:00:14.466 In this part, I want to talk

00:00:15.866 about an alternative technique, known as Explicit

00:00:18.300 Congestion Notification, which allows the network to

00:00:20.733 directly tell TCP when it's sending too

00:00:22.966 fast, and needs to reduce its transmission rate.

 

00:00:28.500 So, as we've discussed, TCP infers the

00:00:31.833 presence of congestion in the network through measurement.

 

00:00:36.066 If you're using TCP Reno or TCP

00:00:39.066 Cubic, like most TCP flows in the

00:00:42.466 network today, then the way it infers

00:00:45.500 that is because there's packet loss.

 

00:00:48.033 TCP Reno and TCP Cubic keep gradually

00:00:51.400 increasing their sending rates, trying to cause

00:00:54.333 the queues to overflow.

 

00:00:56.200 And they cause a queue to overflow,

00:00:58.366 causing a packet to be lost,

00:00:59.800 and use that packet loss as the

00:01:01.366 signal that the network is busy,

00:01:04.200 that they've reached the network capacity,

00:01:05.966 and they should reduce the sending rate.

 

00:01:09.066 And this is problematic for two reasons.

 

00:01:11.866 First, is because it increases delay.

 

00:01:15.266 It's continually pushing the queues to be

00:01:18.266 full, which means the network’s operating with

00:01:20.833 full queues, with its maximum possible delay.

 

00:01:24.400 And the second is because it makes

00:01:27.066 it difficult to distinguish loss which is

00:01:29.533 caused because the queues overflowed, from loss

00:01:32.766 caused because of a transmission error on

00:01:35.900 a link, so called non-congestive loss,

00:01:38.533 which you might get due to interference on a wireless link.

 

00:01:43.766 The other approach people have discussed,

00:01:45.666 is the approach in TCP Vegas,

00:01:48.233 where you look at variation in queuing latency

00:01:51.500 and use that as an indication of congestion.

 

00:01:54.400 So, rather than pushing the queue until

00:01:56.333 it overflows, and detecting the overflow,

00:01:58.866 you watch to see as the queue

00:02:00.733 starts to get bigger, and use that

00:02:02.633 as an indication that you should reduce

00:02:04.233 your sending rate. Or, equally, you spot

00:02:07.300 the queue getting smaller, and use that

00:02:08.900 as an indication that you should maybe

00:02:10.466 increase your sending rate.

 

00:02:12.700 And this is conceptually a good idea,

00:02:14.566 as we discussed in the last part,

00:02:16.733 because it lets you run TCP with

00:02:18.866 lower latency. But it's difficult to deploy,

00:02:21.833 because it interacts poorly with TCP Cubic

00:02:25.333 and TCP Reno, both of which try

00:02:27.833 to fill the queues.

 

00:02:31.966 As a result, we're stuck with using

00:02:34.333 Reno and Cubic, and we're stuck with

00:02:36.333 full queues in the network. But we'd

00:02:38.900 like to avoid this, we'd like to

00:02:40.466 go for a lower latency way of

00:02:42.666 using TCP, and make the network work

00:02:45.533 without filling the queues.

 

00:02:49.300 So one way you might go about

00:02:50.766 doing this is, rather than have TCP

00:02:54.200 push the queues to overflow,

00:02:56.966 have the network rather tell TCP when

00:02:59.866 it's sending too fast.

 

00:03:02.433 Have something in the network tell the

00:03:04.933 TCP connections that they are congesting the

00:03:07.666 network, and they need to slow down.

 

00:03:11.233 And this thing is called Explicit Congestion Notification.

 

00:03:17.333 Explicit Congestion Notification, the ECN bits,

00:03:21.733 are present in the IP header.

 

00:03:25.266 The slide shows an IPv4 header with

00:03:27.833 the ECN bits indicated in red.

00:03:30.333 The same bits are also present in

00:03:32.500 IPv6, and they're located in the same

00:03:34.766 place in the packet in the IPv6 header.

 

00:03:38.066 The way these are used.

00:03:40.233 If the sender doesn't support ECN,

00:03:42.866 it sets these bits to zero when

00:03:44.700 it transmits the packet. And they stay

00:03:46.866 at zero, nothing touches them at that point.

 

00:03:50.233 However, if the sender does support ECN,

00:03:52.933 it sets these bits to have

00:03:54.700 the value 01, so it sets bit

00:03:57.400 15 of the header to be 1,

00:04:00.433 and it transmits the IP packets as

00:04:02.933 normal, except with this one bit set

00:04:05.066 to indicate that the sender understands ECN.

 

00:04:10.000 If congestion occurs in the network,

00:04:12.966 if some queue in the network is

00:04:16.333 beginning to get full, it’s not yet

00:04:19.266 at the point of overflow but it's

00:04:20.733 beginning to get full, such that some

00:04:22.800 router in the network thinks it's about

00:04:24.833 to start experiencing congestion,

00:04:27.200 then that router, that router in the

00:04:30.100 network, changes those bits in the IP

00:04:32.433 packets, of some of the packets going

00:04:34.233 past, and sets both of the ECN bits to one.

 

00:04:38.266 This is known as an ECN Congestion Experienced mark.
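
 

To make the bit patterns concrete, here is a small sketch of the four ECN codepoints and an illustrative marking decision. The queue-occupancy threshold is a made-up value; real routers use active queue management schemes to decide when to mark.

    # ECN codepoints carried in the two low-order bits of the IP TOS/Traffic Class byte.
    NOT_ECT = 0b00   # sender does not support ECN
    ECT_1   = 0b01   # ECN-capable transport (the 01 pattern described above)
    ECT_0   = 0b10   # ECN-capable transport (alternative codepoint)
    CE      = 0b11   # Congestion Experienced, set by a router

    def maybe_mark(ecn_bits: int, queue_occupancy: float, threshold: float = 0.5) -> int:
        """Illustrative router behaviour: mark ECN-capable packets as the queue fills."""
        if ecn_bits in (ECT_0, ECT_1) and queue_occupancy > threshold:
            return CE            # signal congestion instead of waiting for overflow
        return ecn_bits          # otherwise leave the packet alone

    print(maybe_mark(ECT_1, queue_occupancy=0.7))    # 3: marked Congestion Experienced
    print(maybe_mark(NOT_ECT, queue_occupancy=0.7))  # 0: non-ECN traffic is never marked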

 

00:04:42.333 It's a signal. It's a signal from

00:04:44.966 the network to the endpoints, that the

00:04:47.500 network thinks it's getting busy, and the

00:04:49.266 endpoint should slow down.

 

00:04:53.266 And that's all it does. It monitors

00:04:55.466 the occupancy in the queues, and if

00:04:57.766 the queue occupancy is higher than some

00:04:59.466 threshold, it sets the ECN bits in

00:05:01.666 the packets going past, to indicate that

00:05:04.766 threshold has been reached and the network

00:05:06.766 is starting to get busy.

 

00:05:09.233 If the queue overflows,

00:05:11.133 if the endpoints keep sending faster and

00:05:13.866 the queue overflows, then it drops the

00:05:15.466 packet as normal. The only difference

00:05:17.433 is that there's some intermediate point where

00:05:19.766 the network is starting to get busy,

00:05:21.500 but the queue has not yet overflowed.

 

00:05:23.966 And at that point, the network marks

00:05:25.666 the packets to indicate that it's getting busy.

 

00:05:32.100 A receiver might get a TCP packet,

00:05:35.133 a TCP segment, delivered within an IP

00:05:37.866 packet, where that IP packet has the

00:05:40.700 ECN Congestion Experienced mark set. Where the

00:05:43.666 network has changed those two bits in

00:05:45.766 the IP header to 11, to indicate

00:05:48.800 that it's experiencing congestion.

 

00:05:52.366 What it does at that

00:05:54.666 point is set a bit in

00:05:58.100 the TCP header of the acknowledgement packet

00:06:01.600 it sends back to the sender.

 

00:06:04.266 That bit’s known as the ECN Echo

00:06:06.866 field, the ECE field. It sets this

00:06:09.933 bit in the TCP header equal to

00:06:12.633 one on the next packet it sends

00:06:15.600 back to the sender, after it received

00:06:18.033 the IP packet, containing the TCP segment,

00:06:21.400 where that IP packet was marked Congestion Experienced.

 

00:06:26.133 So the receiver doesn't really do anything

00:06:28.833 with the Congestion Experienced mark, other than

00:06:31.233 set the equivalent mark in the

00:06:33.533 packet it sends back to the sender.

 

00:06:35.866 So it's telling the sender, “I got

00:06:37.733 a Congestion Experienced mark in one of

00:06:39.900 the packets you sent”.

 

00:06:43.600 When that packet gets to the sender,

00:06:46.600 the sender sees this bit in the

00:06:48.866 TCP header, the ECN Echo bit set

00:06:52.133 to one, and it realises that the

00:06:54.200 data it was sending

00:06:56.433 caused a router on the path to

00:07:00.000 set the ECN Congestion Experienced mark,

00:07:03.000 which the receiver has then fed back to it.

 

00:07:07.333 And what it does at that point,

00:07:09.100 is it reduces its congestion window.

 

00:07:11.800 It acts as-if a packet had been

00:07:15.000 lost, in terms of how it changes its congestion window.

 

00:07:19.066 So if it's a TCP Reno sender,

00:07:21.733 it will halve its congestion window,

00:07:24.200 the same way it would if a packet was lost.

 

00:07:27.000 If it's a TCP Cubic sender,

00:07:29.200 it will back off its congestion window

00:07:31.533 to 70%, and then enter the weird

00:07:35.533 cubic equation for changing its congestion window.

 

00:07:41.033 After it does that, it sets another

00:07:43.900 bit in the header of the next

00:07:47.366 TCP segment it sends out. It sets

00:07:49.900 the CWR bit, the Congestion Window Reduced

00:07:52.533 bit, in the header to tell the

00:07:54.533 network and the receiver that it's done it.
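
 

A simplified sketch of that sender-side reaction is shown below: on seeing the ECN Echo flag it reduces its window as it would for a loss, and sets CWR on the next segment. This is an illustration of the behaviour just described, not code from a real TCP stack, which also rate-limits the reaction to once per round trip.

    # Illustrative sketch of a TCP sender's reaction to an ECN-Echo (ECE) flag.

    class EcnAwareSender:
        def __init__(self, cwnd: float, beta: float = 0.5):
            self.cwnd = cwnd          # congestion window, in packets
            self.beta = beta          # 0.5 for Reno-style, 0.7 for Cubic-style back-off
            self.send_cwr = False     # whether the next segment should carry CWR

        def on_ack(self, ece_flag: bool):
            if ece_flag:
                self.cwnd = max(1.0, self.cwnd * self.beta)   # react as if a loss occurred
                self.send_cwr = True  # tell the receiver the window has been reduced

        def next_segment_flags(self) -> dict:
            flags = {"CWR": self.send_cwr}
            self.send_cwr = False     # CWR is only set on the first segment after reducing
            return flags

    s = EcnAwareSender(cwnd=40)
    s.on_ack(ece_flag=True)
    print(s.cwnd, s.next_segment_flags())   # 20.0 {'CWR': True}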

 

00:07:59.200 So the end result of this,

00:08:00.933 is that rather than a packet being lost

00:08:03.900 because the queue overflowed, and then the

00:08:06.500 acknowledgments coming back indicating, via the triple

00:08:09.466 duplicate ACK, that a packet had been

00:08:11.166 lost, and then TCP reducing its congestion

00:08:14.266 window and re-transmitting that lost packet.

 

00:08:17.866 What happens is,

00:08:20.633 the IP packets, TCP packets, in the

00:08:24.366 outbound direction get a Congestion Experienced mark

00:08:27.400 set, to indicate that the network is

00:08:29.566 starting to get full.

 

00:08:31.633 The ECN Echo bit is set on

00:08:33.500 the reply, and at that point the

00:08:35.666 sender reduces its window,

00:08:37.733 as-if the loss had occurred.

 

00:08:42.700 And then carries on sending with the

00:08:44.633 CWR bit set to one on that

00:08:46.533 next packet. So it has the same

00:08:49.000 effect, in terms of reducing the congestion window, as would

00:08:52.600 dropping a packet, but without dropping a

00:08:54.766 packet. So there's no actual packet loss

00:08:56.933 here, there’s just a mark to indicate

00:08:58.833 that the network was getting busy.

00:09:00.500 So it doesn't have to retransmit data,

00:09:02.666 and this happens before the queue is

00:09:04.500 full, so you get lower latency.

 

00:09:08.300 So ECN is a mechanism to allow

00:09:11.766 TCP to react to congestion before packet loss occurs.

 

00:09:16.600 It allows routers in the network to

00:09:18.700 signal congestion before the queue overflows.

 

00:09:21.866 It allows routers in the network to

00:09:23.500 say to TCP, “if you don't slow

00:09:25.566 down, this queue is going to overflow,

00:09:27.900 and I’m going to throw your packets away”.

 

00:09:31.533 It's independent of how TCP then responds,

00:09:34.366 whether it follows Reno or Cubic or

00:09:37.466 Vegas doesn't really matter, it's just

00:09:39.600 an indication that it needs to slow

00:09:41.266 down because the queues are starting to

00:09:43.166 build up, and will overflow soon if it doesn't.

 

00:09:47.466 And if TCP reacts to that,

00:09:49.400 reacts to the ECN Echo bit going

00:09:51.566 back, and the sender reduces its rate,

00:09:53.966 the queues will empty, the router will

00:09:55.700 stop marking the packets, and everything will

00:09:57.900 settle down at a slightly slower rate

00:10:00.300 without causing any packet loss.

 

00:10:02.733 And the system will adapt, and it

00:10:05.600 will achieve the same sort

00:10:07.800 of throughput, it will just react earlier,

00:10:11.100 so you have smaller queues and lower latency.

 

00:10:14.500 And this gives you the same throughput

00:10:16.966 as you would with TCP Reno or

00:10:20.400 TCP Cubic, but with low latency,

00:10:22.333 which means it's better for competing video

00:10:25.100 conferencing or gaming traffic.

 

00:10:28.433 And I’ve described the mechanism for TCP,

00:10:31.066 but there are similar ECN extensions for

00:10:33.833 QUIC and for RTP, which is the

00:10:36.566 video conferencing protocol, all designed to achieve

00:10:39.933 the same goal.

 

00:10:44.400 So ECN, I think, is unambiguously a

00:10:47.100 good thing. It’s a signal from the

00:10:48.866 network to the endpoints that the network

00:10:50.966 is starting to get congested, and the

00:10:52.866 endpoints should slow down.

 

00:10:54.500 And if the endpoints believe it,

00:10:56.666 if they back off,

00:10:58.500 they reduce their sending rate before the

00:11:00.900 network is overloaded, and we end up

00:11:03.966 in a world where we still

00:11:06.966 achieve good congestion control, good throughput,

00:11:11.133 but with lower latency.

 

00:11:13.100 And, if the endpoints don't believe it,

00:11:15.200 well, eventually, the queues in the routers

00:11:17.233 overflow and they lose packets, and we’re

00:11:19.100 no worse-off than we are now.

 

00:11:22.133 In order to deploy ECN, though,

00:11:25.600 we need to make changes. We need

00:11:27.900 to change the endpoints, to change the

00:11:29.700 end systems, to support these bits in

00:11:31.766 the IP header, and to support,

00:11:33.766 to add support for this into TCP.

 

00:11:36.500 And we need to update the routers,

00:11:38.666 to actually mark the packets when they're

00:11:40.333 starting to get overloaded.

 

00:11:44.333 Updating the end points has pretty much

00:11:47.066 been done by now.

 

00:11:49.200 I think every TCP implementation,

00:11:54.100 implemented in the last 15-20 years or

00:11:57.200 so, supports ECN, and these days,

00:12:00.000 most of them have it turned on by default.

 

00:12:04.266 And I think we actually have Apple

00:12:06.866 to thank for this.

00:12:09.033 ECN, for a long time, was implemented

00:12:12.900 but turned off by default, because there’d

00:12:15.233 been problems with some old firewalls which

00:12:17.900 reacted badly to it, 20 or so years ago.

 

00:12:22.233 And, relatively recently, Apple decided that they

00:12:25.666 wanted these lower latency benefits, and they

00:12:29.833 thought ECN should be deployed. So they

00:12:32.566 started turning it on by default in the iPhone.

 

00:12:37.100 And they kind-of followed an interesting approach.

00:12:40.100 In that, for iOS 9, a random

00:12:43.133 subset of 5% of iPhones would turn

00:12:46.233 on ECN for some of their connections.

 

00:12:51.433 And they measured what happened. And they

00:12:54.233 found out that in the overwhelming majority

00:12:56.433 of cases this worked fine, and occasionally

00:12:59.133 it would fail.

 

00:13:01.400 And they would call up the network

00:13:03.966 operators, whose networks were showing problems,

00:13:07.433 and they would say “your network doesn't

00:13:10.333 work with iPhones; and currently it's not

00:13:12.800 working well with 5% of iPhones but

00:13:15.233 we're going to increase that number,

00:13:16.933 and maybe you should fix it”.

 

00:13:19.600 And then, a year later, when iOS

00:13:21.633 10 came out, they did this 50%

00:13:24.066 of connections made by iPhones. And then

00:13:26.933 a year later, for all of the connections.

 

00:13:30.000 And it's amazing what impact a

00:13:34.200 popular vendor calling up a network operator can

00:13:41.433 have on getting them to fix the equipment.

 

00:13:45.066 And, as a result,

00:13:47.200 ECN is now widely enabled by default

00:13:50.500 in the phones, and the network seems

00:13:53.333 to support it just fine.

 

00:13:56.300 Most of the routers also support ECN.

00:13:58.833 Although currently relatively few of them seem

00:14:01.400 to enable it by default. So most

00:14:04.066 of the endpoints are now

00:14:05.633 at the stage of sending ECN enabled

00:14:08.166 traffic, and are able to react to

00:14:10.900 the ECN marks, but most of the

00:14:13.400 networks are not currently setting the ECN marks.

 

00:14:16.933 This is, I think, starting to change.

00:14:19.533 Some of the recent DOCSIS, which is

00:14:22.266 the cable modem standards, are starting to

00:14:26.400 support ECN. We’re starting to see

00:14:29.500 cable modems, cable Internet connections, which enable

00:14:33.566 ECN by default.

 

00:14:35.866 And, we're starting to see interest from

00:14:38.900 3GPP, which is the mobile phone standards

00:14:41.100 body, to enable this in 5G and

00:14:43.933 6G networks, so I think it's coming,

00:14:47.100 but it's going to take time.

 

00:14:49.066 And, I think, as it comes,

00:14:51.233 as ECN gradually gets deployed, we’ll gradually

00:14:53.766 see a reduction in latency across the

00:14:56.000 networks. It’s not going to be dramatic.

 

00:14:59.400 It's not going to suddenly transform the

00:15:01.300 way the network behaves, but hopefully over

00:15:04.033 the next 5 or 10 years we’ll

00:15:06.166 gradually see the latency reducing as ECN

00:15:09.433 gets more widely deployed.

 

00:15:13.900 So that's what I want to say

00:15:15.800 about ECN. It’s a mechanism by which

00:15:17.966 the network can signal to the applications

00:15:20.133 that the network is starting to get

00:15:22.033 overloaded, and allow the applications to back

00:15:24.433 off more quickly, in a way which

00:15:26.966 reduces latency and reduces packet loss.
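
To make that summary concrete, here is a minimal sketch, assuming the RFC 3168 sender behaviour: when the receiver echoes a congestion mark back (ECN-Echo), the sender reduces its window just as it would for a loss, but nothing has to be retransmitted, so no data is held up waiting for loss recovery. The class and method names below are invented for illustration; this is Python pseudocode, not a real TCP stack.

    # A minimal sketch (not a real TCP stack) of how an RFC 3168 sender
    # reacts to congestion signals. Class and method names are invented
    # for illustration.

    class EcnAwareSender:
        def __init__(self, mss=1460):
            self.mss = mss
            self.cwnd = 10 * mss        # congestion window, in bytes
            self.ssthresh = 64 * 1024   # slow-start threshold, in bytes

        def on_packet_loss(self):
            # Loss as a congestion signal: halve the window, and the lost
            # segment also has to be retransmitted (not shown here).
            self.ssthresh = max(self.cwnd // 2, 2 * self.mss)
            self.cwnd = self.ssthresh

        def on_ecn_echo(self):
            # ECN as a congestion signal: a router marked a packet instead
            # of dropping it. The sender backs off exactly as for a loss,
            # but nothing needs retransmitting. (A real sender would also
            # set CWR on its next segment, and reduce at most once per RTT.)
            self.ssthresh = max(self.cwnd // 2, 2 * self.mss)
            self.cwnd = self.ssthresh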

Part 6: Light Speed?

The final part of the lecture moves on from congestion control and queueing, and discusses another factor that affects latency: the network propagation delay. It outlines what is the propagation delay and ways in which it can be reduced, including more direct paths and the use of low-Earth orbit satellite constellations.

Slides for part 6

 

00:00:00.433 In this final part of the lecture,

00:00:02.100 I want to move on from talking

00:00:03.600 about congestion control, and the impact of

00:00:05.733 queuing delays on latency, and talk instead

00:00:08.233 about the impact of propagation delays.

 

00:00:12.300 So, if you think about the latency

00:00:15.166 for traffic being delivered across the network,

00:00:17.433 there are two factors which impact that latency.

 

00:00:21.433 The first is the time packets spend

00:00:23.933 queued up at various routers within the network.

 

00:00:28.033 As we've seen in the previous parts

00:00:29.733 of this lecture, this is highly influenced

00:00:32.033 by the choice of TCP congestion control,

00:00:35.100 and whether Explicit Congestion Notification

00:00:37.566 is enabled or not.

 

00:00:39.533 The other factor, that we've not really

00:00:41.900 discussed to date, is the time it

00:00:44.066 takes the packets to actually propagate down

00:00:46.333 the links between the routers. This depends

00:00:48.700 on the speed at which the signal

00:00:50.500 propagates down the transmission medium.

 

00:00:53.233 If you're using an optical fibre to

00:00:55.233 transmit the packets, it depends on the

00:00:57.333 speed at which the light propagates through the fibre.

 

00:01:00.700 If you're using electrical signals in a

00:01:03.133 cable, it depends on the speed at

00:01:04.933 which the electrical field propagates down the cable.

 

00:01:07.600 And if you're using radio signals,

00:01:09.366 it depends on the speed of light,

00:01:11.100 the speed at which the radio signals

00:01:12.666 propagate through the air.

 

00:01:17.000 As you might expect, physically shorter links

00:01:21.100 have lower propagation delays.

 

00:01:23.533 A lot of the time it takes

00:01:25.600 a packet to get down a long

00:01:27.233 distance link is just the time it

00:01:29.400 takes the signal to physically transmit along

00:01:32.133 the link. If you make the link

00:01:33.633 shorter it takes less time.
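
As a back-of-the-envelope illustration, the one-way propagation delay is simply the link length divided by the speed of the signal in the medium. The sketch below assumes an illustrative 6,000 km transatlantic fibre and the roughly 200,000 km/s speed of light in glass discussed later in this part.

    # One-way propagation delay is just link length divided by the speed
    # at which the signal travels in the medium. The 6,000 km figure is an
    # illustrative transatlantic distance, not a measured cable length.

    SPEED_IN_FIBRE_KM_S = 200_000   # roughly the speed of light in glass

    def propagation_delay_ms(length_km, speed_km_s=SPEED_IN_FIBRE_KM_S):
        return 1000.0 * length_km / speed_km_s

    print(propagation_delay_ms(6_000))   # -> 30.0 ms one way, so ~60 ms round trip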

 

00:01:37.300 And what is perhaps not so obvious,

00:01:40.500 though, is that you can actually get

00:01:43.000 significant latency benefits on certain paths,

00:01:48.166 because the existing network links follow quite

00:01:51.533 indirect routes.

 

00:01:53.766 For example, if you look at the

00:01:55.566 path the network links take, if you're

00:01:58.066 sending data from Europe to Japan.

 

00:02:01.066 Quite often, that data goes from Europe,

00:02:03.900 across the Atlantic to, for example,

00:02:06.533 New York or Boston, or somewhere like

00:02:08.900 that, across the US to

00:02:12.866 San Francisco, or Los Angeles, or Seattle,

00:02:17.000 or somewhere along those lines, and then

00:02:19.600 from there, in a cable across the

00:02:21.966 Pacific to Japan.

 

00:02:25.133 Or alternatively, it goes from Europe through

00:02:27.733 the Mediterranean, the Suez Canal and the

00:02:30.433 Middle East, and across India, and so

00:02:32.800 on, until it eventually reaches Japan the

00:02:35.600 other way around. But neither of these

00:02:38.100 is a particularly direct route.

 

00:02:40.666 And it turns out that there is

00:02:42.933 a much more direct, a much faster

00:02:44.900 route, to get from Europe to Japan,

00:02:48.033 which is to lay an optical fibre

00:02:51.233 through the Northwest Passage, across Northern Canada,

00:02:55.233 through the Arctic Ocean, and down through

00:02:57.733 the Bering Strait, and past Russia to

00:02:59.866 get directly to Japan. It's much closer

00:03:03.200 to the great circle route around the

00:03:04.966 globe, and it's much shorter than the

00:03:07.066 route that the networks currently take.

 

00:03:10.000 And, historically, this hasn't been possible because

00:03:12.566 of the ice in the Arctic.

00:03:14.666 But, with global warming, the Northwest Passage

00:03:17.800 is now ice-free for enough of the

00:03:20.766 year that people are starting to talk

00:03:23.100 about laying optical fibres along that route,

00:03:26.266 because they can get a noticeable latency

00:03:28.733 reduction, for certain amounts of traffic,

00:03:31.400 by just following the physically shorter route.
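
To put rough numbers on this, the sketch below applies the same arithmetic to two purely illustrative route lengths; they are not measured cable lengths, just plausible orders of magnitude for a long indirect route and a more direct Arctic route.

    # Purely illustrative route lengths: a long existing route via the US
    # or the Suez Canal, and a shorter Arctic route. Not measured figures.
    SPEED_IN_FIBRE_KM_S = 200_000

    for name, length_km in [("existing route", 21_000), ("Arctic route", 14_000)]:
        print(name, 1000.0 * length_km / SPEED_IN_FIBRE_KM_S, "ms one way")

    # A path ~7,000 km shorter saves about 35 ms one way, ~70 ms round trip.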

 

00:03:38.400 Another factor which influences the propagation delay

00:03:42.600 is the speed of light in the transmission media.

 

00:03:47.433 Now, if you're sending data using radio links,

00:03:52.000 or using lasers in a vacuum,

00:03:57.033 then these propagate at the speed of light in the vacuum.

 

00:04:01.100 Which is about 300,000 kilometres per second, or 300 million metres per second.

 

00:04:05.700 The speed of light in optical fibre,

00:04:07.733 though, is slower. The speed at which

00:04:09.900 light propagates down a fibre,

00:04:12.633 the speed at which light propagates through

00:04:14.566 glass, is only about 200,000

00:04:17.000 kilometres per second, or 200 million metres per

00:04:19.466 second. So it’s about two thirds of

00:04:21.800 the speed at which it propagates in a vacuum.
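
A quick arithmetic check ties together the two ways this comparison is usually stated: light in fibre travelling at about two thirds of its speed in vacuum is the same fact as the vacuum speed being about 50% faster, since the two ratios are reciprocals.

    C_VACUUM_KM_S = 300_000
    C_FIBRE_KM_S  = 200_000

    print(C_FIBRE_KM_S / C_VACUUM_KM_S)   # ~0.67: fibre is about 2/3 of c
    print(C_VACUUM_KM_S / C_FIBRE_KM_S)   # 1.5: vacuum is about 50% faster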

 

00:04:25.633 And this is the reason for systems

00:04:28.100 such as StarLink, which SpaceX is deploying.

 

00:04:32.900 And the idea of these systems is

00:04:34.700 that, rather than sending the Internet signals

00:04:38.000 down an optical fibre,

00:04:40.133 you send them 100, or a couple

00:04:42.300 of hundred miles, up to a satellite,

00:04:44.733 and they then go around between various

00:04:47.466 satellites in the constellation, in low earth

00:04:50.033 orbit, and then down to a receiver

00:04:53.700 near the destination.

 

00:04:55.833 And by propagating through vacuum, rather than

00:04:58.833 through optical fibre, the speed of light

00:05:02.800 in vacuum is significantly faster, it's about

00:05:05.300 50% faster than the speed of light

00:05:07.966 in fibre, and this can reduce the latency.

 

00:05:11.166 And the estimates show that if you

00:05:14.533 have a large enough constellation of satellites,

00:05:17.300 and SpaceX is planning on deploying around

00:05:19.666 4000 satellites, I believe, and with careful

00:05:23.133 routing, you can get about a 40,

00:05:25.800 45, 50% reduction in latency.

 

00:05:28.566 Just because the signals are transmitted via

00:05:31.866 radio waves, and via inter-satellite laser links,

00:05:35.733 which are in a vacuum, rather than

00:05:39.700 being transmitted through a fibre optic cable.

00:05:42.166 Just because of the differences in the

00:05:44.100 speed of light between the two media.
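
A toy calculation gives a feel for the scale of the saving. The distance and altitude below are assumptions chosen only for illustration, and the model ignores inter-satellite routing, ground-station placement, and queueing entirely.

    # Toy comparison of one-way delay over the same great-circle distance:
    # fibre on the ground vs. a radio up-link plus inter-satellite lasers
    # in vacuum. Distance and altitude are illustrative assumptions.

    C_VACUUM_KM_S = 300_000
    C_FIBRE_KM_S  = 200_000

    distance_km = 9_000   # assumed great-circle distance
    altitude_km = 550     # assumed LEO orbital altitude

    fibre_ms = 1000.0 * distance_km / C_FIBRE_KM_S
    leo_ms   = 1000.0 * (distance_km + 2 * altitude_km) / C_VACUUM_KM_S

    print(f"fibre: {fibre_ms:.0f} ms, LEO: {leo_ms:.0f} ms")   # ~45 ms vs ~34 ms
    # That is roughly a 25% saving in this toy model; the larger savings
    # quoted for real deployments also rely on today's fibre paths being
    # much longer than the great-circle distance.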

 

00:05:47.100 And the link on the slide points

00:05:49.733 to some simulations of the StarLink network,

00:05:52.333 which try and demonstrate how this would

00:05:54.966 work, and how it can achieve

00:05:57.366 both network paths that closely follow the

00:06:01.266 great circle routes, and

00:06:03.366 how it can reduce the latency because

00:06:07.566 of the use of satellites.

 

00:06:13.433 So, what we see is that people

00:06:15.133 are clearly going to some quite extreme

00:06:17.100 lengths to reduce latency.

 

00:06:19.500 I mean, what we spoke about in

00:06:21.933 the previous part was the use of

00:06:24.366 ECN marking to reduce latency by reducing

00:06:26.766 the amount of queuing. And that's just

00:06:29.200 a configuration change, it’s a software change

00:06:31.466 to some routers. And that seems to

00:06:33.666 me like a reasonable approach to reducing latency.

 

00:06:36.900 But some people are clearly willing to

00:06:39.633 go to the effort of

00:06:41.833 launching thousands of satellites, or

00:06:44.666 perhaps the slightly less extreme case of

00:06:49.033 laying new optical fibres through the Arctic Ocean.

 

00:06:53.000 So why are people doing this? Why

00:06:54.933 do people care so much about reducing

00:06:57.100 latency, that they're willing to spend billions

00:06:59.900 of dollars launching thousands of satellites,

00:07:02.833 or running new undersea cables, to do this?

 

00:07:06.833 Well, you'll be surprised to hear that

00:07:09.233 this is not to improve your gaming

00:07:11.166 experience. And this is not to improve

00:07:13.500 the experience of your Zoom calls.

 

00:07:16.033 Why are people doing this? High frequency share trading.

 

00:07:20.800 Share traders believe they can make a

00:07:23.600 lot of money, by getting a few milliseconds worth

00:07:27.900 of latency reduction compared to their competitors.

 

00:07:33.600 Whether that's a good use of a

00:07:35.833 few billion dollars, I’ll let you decide.

 

00:07:38.800 But the end result may be,

00:07:41.433 hopefully, that the rest of us will

00:07:43.866 get lower latency as well.

 

00:07:48.733 And that concludes this lecture.

 

00:07:52.433 There are a bunch of reasons why

00:07:54.566 we have latency in the network.

00:07:56.600 Some of this is due to propagation

00:07:59.200 delays. Some of this, perhaps most of

00:08:01.166 it, in many cases, is due to

00:08:02.866 queuing at intermediate routers.

 

00:08:05.733 The propagation delays are driven by the speed of light.

 

00:08:09.200 And unless you can launch many satellites,

00:08:12.966 or lay more optical fibres, that's pretty

00:08:17.500 much a fixed constant, and there's not

00:08:19.833 much we can do about it.

 

00:08:22.966 Queuing delays, though, are things which we

00:08:25.833 can change. And a lot of the

00:08:28.066 queuing delays in the network are caused

00:08:30.000 by TCP Reno and TCP Cubic,

00:08:34.400 which push the queues until they are full.

 

00:08:37.733 Hopefully, we will see improved TCP congestion

00:08:41.366 control algorithms. And TCP Vegas was one

00:08:44.600 attempt in this direction, which unfortunately proved

00:08:48.066 not to be deployable in practice.

 

00:08:50.833 TCP BBR was another attempt which

00:08:54.233 was problematic for other reasons, because of

00:08:57.433 its unfairness. But people are certainly working

00:09:00.066 on alternative algorithms in this space,

00:09:02.866 and hopefully we'll see things deployed before too long.

Discussion

Lecture 6 discussed TCP congestion control and its impact on latency. It discussed the principles of congestion control (e.g., the sliding window algorithm, AIMD, conservation of packets), and their realisation in TCP Reno. It reviewed the choice of TCP initial window, slow start, and the congestion avoidance phase, and the response of TCP to packet loss as a congestion signal.

The lecture noted that TCP Reno cannot effectively make use of fast, long-distance paths (e.g., gigabit per second flows running on transatlantic links). It discussed the TCP Cubic algorithm, which changes the behaviour of TCP in the congestion avoidance phase to make more effective use of such paths.

And it noted that both TCP Reno and TCP Cubic will try to increase their sending rate until packet loss occurs, and will use that loss as a signal to slow down. This fills the in-network queues at routers on the path, causing latency.
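
A minimal sketch of that loss-driven loop, assuming a Reno-style sender (the class below is illustrative, not a real TCP implementation), shows why the bottleneck queue ends up full before the sender backs off.

    # Minimal sketch of a Reno-style sender in congestion avoidance. The
    # window grows by one segment per round trip until a loss occurs, and
    # a loss only occurs once the bottleneck queue has overflowed, so the
    # queue is driven to full before the sender ever slows down.

    class RenoLikeSender:
        def __init__(self, mss=1460):
            self.mss = mss
            self.cwnd = 10 * mss

        def on_rtt_without_loss(self):
            self.cwnd += self.mss                          # additive increase

        def on_loss(self):
            self.cwnd = max(self.cwnd // 2, 2 * self.mss)  # multiplicative decrease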

The lecture briefly discussed TCP Vegas, and the idea of using delay changes as a congestion signal instead of packet loss, and it noted that TCP Vegas is not deployable in parallel with TCP Reno or Cubic. It highlighted ongoing research with TCP BBR, a new proposal that aims to make a deployable congestion controller that is latency sensitive, and some of the fairness problems with BBR v1.

Finally, the lecture highlighted the possible use of Explicit Congestion Notification as a way of signalling congestion to the endpoints, and of causing TCP to reduce its sending rate, before the in-network queues overflow. This potentially offers a way to reduce latency.

Discussion will focus on the behaviour of TCP Reno congestion control, to understand the basic dynamics of TCP, why these are so effective at keeping the network occupied, and understanding how this leads to high latency. We will then discuss the applicability and ease of deployment of several alternatives (Cubic, Vegas, BBR, and ECN) and how they change performance and latency.