Networked Systems H (2021-2022)
Lecture 6: Lowering Latency
This lecture discusses some of the factors that affect the latency of
a TCP connection. It considers TCP congestion control, the TCP Reno
and Cubic congestion control algorithms, and their behaviour and
performance in terms of throughput and latency. It then considers
alternative congestion control, such as the TCP Vegas and BBR
algorithms, and the use of explicit congestion notification (ECN),
as options to lower latency. Finally, it considers the impact of
sub-optimal Internet paths on latency, and the rationale for deploying
low-Earth orbit satellite constellations to reduce latency of Internet
paths.
Part 1: TCP Congestion Control
This first part of the lecture outlines the principle of congestion
control. It discusses packet loss as a congestion signal, conservation
of packets in flight, and the additive increase, multiplicative
decrease requirements for stability.
Slides for part 1
00:00:00.566
In this lecture I’d like to move
00:00:02.400
on from talking about how to transfer
00:00:04.800
data reliably, and talk about mechanisms and
00:00:07.866
means by which transport protocols go about
00:00:10.366
lowering the latency of the communication.
00:00:15.466
One of the key limiting factors of
00:00:17.966
performance of network systems, as we've discussed
00:00:20.633
in some of the previous lectures, is latency.
00:00:25.000
Part of that is the latency for
00:00:26.800
establishing connections, and we've spoken about that
00:00:29.166
in detail already, where a lot of
00:00:31.566
the issue is the number of round
00:00:33.933
trip times needed to set up a connection.
00:00:37.400
And, especially when secure connections are in
00:00:40.700
use, if you're using TCP and TLS,
00:00:43.766
for example, as we discussed, there’s a
00:00:46.033
large number of round trips needed to
00:00:47.766
actually get to the point where you
00:00:49.366
can establish a connection, negotiate security parameters,
00:00:53.266
and start to exchange data.
00:00:55.366
And we've already spoken about how the
00:00:58.166
QUIC Transport Protocol
00:01:00.566
has been developed to try and improve
00:01:03.233
latency in terms of establishing a connection.
00:01:06.166
The other aspects of latency, and reducing
00:01:08.566
the latency of communications, is actually in
00:01:10.966
terms of data transfer.
00:01:13.133
How you deliver data across the network
00:01:15.833
in a way which doesn't lead to
00:01:19.233
excessive delays, and how you can gradually
00:01:23.033
find ways of reducing the latency,
00:01:25.733
and making the network better suited to
00:01:29.200
real time applications, such as telephony,
00:01:31.833
and video conferencing, and gaming, and high
00:01:34.500
frequency trading, and
00:01:38.000
Internet of Things, and control applications.
00:01:43.666
A large aspect of that is in
00:01:44.966
terms of how you go about building
00:01:46.866
congestion control, and a lot of the
00:01:48.800
focus in this lecture is going to
00:01:50.700
be on how TCP
00:01:52.300
congestion control works, and how other protocols
00:01:54.700
do congestion control, to deliver data in a
00:01:57.533
low latency manner.
00:01:59.400
But I’ll also talk a bit about
00:02:01.666
explicit congestion notification, and changes to the
00:02:04.466
way queuing happens in the network,
00:02:07.600
and about services such as SpaceX’s StarLink
00:02:09.700
which are changing the way the network
00:02:12.500
is built to reduce latency.
00:02:17.500
I want to start by talking about congestion control,
00:02:20.300
and TCP congestion control in particular.
00:02:26.800
And, what I want to do in
00:02:28.666
this part, is talk about some of
00:02:30.566
the principles of congestion
00:02:32.233
control. And talk about what is the
00:02:34.333
problem that's being solved, and how can
00:02:36.700
we go about adapting the rate at
00:02:39.533
which a TCP connection delivers data over
00:02:42.133
the network
00:02:43.633
to make best use of the network
00:02:45.366
capacity, and to do so in a
00:02:47.400
way which doesn't build up queues in
00:02:49.600
the network and induce too much latency.
00:02:52.300
So in this part I’ll talk about
00:02:54.566
congestion control principles. In the next part
00:02:57.066
I move on to talk about loss-based
00:02:59.433
congestion control, and talk about TCP Reno
00:03:02.566
and TCP Cubic,
00:03:04.233
which are ways of making very effective
00:03:06.400
use of the overall network capacity,
00:03:08.900
and then move on to talk about
00:03:11.200
ways of lowering latency.
00:03:13.533
I’ll talk about latency reducing congestion control
00:03:15.866
algorithms, such as TCP Vegas or Google's
00:03:18.933
TCP BBR proposal. And then I’ll finish
00:03:22.033
up by talking a little bit about
00:03:24.433
Explicit Congestion Notification
00:03:26.400
in one of the later parts of the lecture.
00:03:31.166
TCP is a
00:03:33.666
complex and very highly optimised protocol,
00:03:38.400
especially when it comes to congestion control
00:03:41.666
and loss recovery mechanisms.
00:03:44.733
I'm going to attempt to give you
00:03:47.000
a flavour of the way congestion control
00:03:49.500
works in this lecture, but be aware
00:03:52.033
that this is a very simplified review
00:03:54.333
of some quite complex issues.
00:03:56.833
The document listed on the slide is
00:03:59.966
entitled “A roadmap for TCP Specification Documents”,
00:04:03.900
and it's the latest IETF standard that describes
00:04:08.833
how TCP works, and points to the
00:04:11.700
details of the different proposals.
00:04:15.966
This is a very long and complex
00:04:19.733
document. It’s about, if I remember right,
00:04:22.133
60 or 70 pages long.
00:04:24.133
And all it is, is a list
00:04:26.066
of references to other specifications, with one
00:04:28.533
paragraph about each one describing why that
00:04:30.833
specification is important.
00:04:32.666
And the complete specification for TCP is
00:04:35.100
several thousand pages of text. This is
00:04:37.800
a complex protocol with a lot of
00:04:40.933
features in it, and I’m necessarily giving
00:04:43.700
a simplified overview.
00:04:47.400
I’m going to talk about TCP.
00:04:50.066
I’m not going to talk much,
00:04:52.066
if at all, about QUIC in this lecture.
00:04:54.666
That's not because QUIC isn't interesting,
00:04:57.533
it's because QUIC essentially adopts the same
00:05:00.433
congestion control mechanisms as TCP.
00:05:03.433
The QUIC version one standard says to
00:05:07.233
use the same congestion
00:05:10.166
control algorithm as TCP Reno.
00:05:13.300
And, in practice, most of the QUIC
00:05:15.500
implementations use the Cubic or the BBR
00:05:19.033
congestion control algorithms,
00:05:20.933
which we'll talk about later on.
00:05:22.500
QUIC is basically adopting the same mechanisms
00:05:24.566
as does TCP, and for that reason
00:05:27.366
I’m not going to talk about
00:05:30.433
them too much separately.
00:05:36.966
So what is the goal of congestion
00:05:39.666
control? What are the principles of congestion control?
00:05:43.633
Well, the idea of congestion control is
00:05:46.600
to find the right transmission rate for
00:05:49.966
a connection.
00:05:51.466
We're trying to find the fastest sending
00:05:53.600
rate which you can send at to
00:05:56.100
match the capacity of the network,
00:05:58.233
and to do so in a way
00:05:59.833
that doesn't build up queues, doesn't overload,
00:06:02.766
doesn't congest the network.
00:06:05.066
So we're looking to adapt the transmission
00:06:07.333
rate of a flow of TCP traffic
00:06:09.933
over the network, to match the available
00:06:12.100
network capacity.
00:06:13.800
And as the network capacity changes,
00:06:16.100
perhaps because other flows of traffic start
00:06:19.466
up, or perhaps because you're on a
00:06:21.533
mobile device and you move into an
00:06:24.333
area with different radio coverage,
00:06:26.500
the speed at which the TCP is
00:06:29.800
delivering the data should adapt to match
00:06:31.600
the changes in available capacity.
00:06:35.966
The fundamental principles of congestion control,
00:06:41.433
as applied in TCP,
00:06:43.300
were first described by Van Jacobson,
00:06:46.500
who we see on the picture on
00:06:49.200
the top right of the slide,
00:06:51.033
in the paper “Congestion Avoidance and Control”.
00:06:56.366
And those principles are that TCP responds
00:06:59.166
to packet loss as a congestion signal.
00:07:01.933
It treats the loss of a packet,
00:07:04.966
because the Internet is a best effort
00:07:07.300
packet network, and it discards
00:07:09.666
packets, if it can't deliver them,
00:07:11.900
and TCP treats that discard, that loss
00:07:14.700
of a packet, as a congestion signal,
00:07:16.900
and as a signal that it's sending
00:07:19.233
too fast and should slow down.
00:07:21.866
It relies on the principle of conservation
00:07:23.833
of packets. It tries to keep the
00:07:26.133
number of packets which are traversing the
00:07:28.433
network roughly constant,
00:07:29.800
assuming nothing changes in the network.
00:07:32.966
And it relies on the principles of
00:07:34.966
additive increase, multiplicative decrease.
00:07:37.166
If it has to increase its sending
00:07:39.233
rate, it does so relatively slowly,
00:07:41.333
an additive increase in the rate.
00:07:43.466
And if it has to reduce its
00:07:44.666
sending rate, it does so quickly, a multiplicative decrease.
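As a rough illustration of these update rules, here is a minimal Python sketch of additive increase, multiplicative decrease, with the congestion window counted in whole packets. Real TCP stacks work in bytes and add many refinements; the function names are made up for the example.

```python
def on_window_acknowledged(cwnd: int) -> int:
    """Additive increase: the whole window was delivered, so grow slowly."""
    return cwnd + 1

def on_congestion_signal(cwnd: int) -> int:
    """Multiplicative decrease: a packet was lost, so back off quickly."""
    return max(1, cwnd // 2)
```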
00:07:49.466
And these are the fundamental principles that
00:07:51.833
Van Jacobson elucidated for TCP congestion control,
00:07:55.633
and for congestion control in general.
00:07:58.800
And it was Van Jacobson who did
00:08:01.866
the initial implementation of these into TCP
00:08:05.366
in the mid-1980s, about 1984, ’85, or so.
00:08:12.300
Since then, the algorithms, the congestion control
00:08:16.333
algorithms, for TCP in general have been
00:08:18.500
maintained by a large number of people.
00:08:20.900
A lot of people have developed this.
00:08:23.600
Probably one of the leading people in
00:08:26.733
this space for the last 20 years
00:08:30.700
or so, is Sally Floyd who was
00:08:33.166
very much responsible for taking
00:08:35.533
the TCP standards, making them robust,
00:08:39.166
pushing them through the IETF to get
00:08:41.533
them standardised, and making sure they work,
00:08:43.400
and get really high performance.
00:08:46.600
And she very much drove the development
00:08:48.800
to make these robust, and effective,
00:08:51.100
and high performance standards, and to make
00:08:53.766
TCP work as well as it does today.
00:08:57.266
And Sally sadly passed away a year
00:09:00.900
or so back, which is a tremendous
00:09:03.933
shame, but we're grateful for her legacy
00:09:08.766
in moving things forward.
00:09:13.833
So to go back to the principles.
00:09:17.366
The first principle of congestion control in
00:09:20.233
the Internet, and in TCP, is that
00:09:22.833
packet loss is an indication that the
00:09:24.700
network is congested.
00:09:28.500
Data flowing across the Internet flows from
00:09:31.433
the sender to the receiver through a
00:09:33.666
series of routers. The IP routers connect
00:09:37.866
together the different links that comprise the network.
00:09:41.766
And routers perform two functions:
00:09:44.500
they perform a routing function, and a forwarding function.
00:09:50.166
The purpose of the routing function is
00:09:52.566
to figure out how packets should get
00:09:55.166
to their destination. They receive a packet
00:09:57.766
from some network link, look at the
00:09:59.733
destination IP address, and decide which direction
00:10:02.333
to forward that packet. They’re responsible for
00:10:05.100
finding the right path through the network.
00:10:08.500
But they're also responsible for forwarding,
00:10:10.566
which is actually putting the packets into
00:10:13.233
the queue of outgoing traffic for the
00:10:15.900
link, and managing that queue of packets
00:10:18.566
to actually transmit the packets across the network.
00:10:22.033
And routers in the network have a
00:10:25.366
set of different links; the whole point
00:10:28.133
of a router is to connect different
00:10:30.266
links. And at each link, they have
00:10:32.200
a queue of packets, which are enqueued
00:10:34.100
to be delivered on that link.
00:10:36.900
And, perhaps obviously, if packets are arriving
00:10:39.333
faster than the link can deliver those
00:10:41.933
packets, then the queue gradually builds up.
00:10:44.466
More and more packets get enqueued in
00:10:47.200
the router waiting to be delivered.
00:10:48.800
And if packets are arriving slower than
00:10:51.433
they can be forwarded,
00:10:54.000
then the queue gradually empties as the
00:10:57.133
packets get transmitted.
00:11:00.066
Obviously the router has a limited amount
00:11:02.133
of memory, and at some point it's
00:11:04.633
going to run out of space to
00:11:06.200
enqueue packets. So, if packets are
00:11:08.300
arriving at the router faster
00:11:12.833
than they can be delivered down the
00:11:14.633
link, the queue will build up and
00:11:16.500
gradually fill, until it reaches its maximum
00:11:18.833
size. At that point, the router has
00:11:21.133
no space to keep the newly arrived
00:11:23.666
packets, and so it discards the packets.
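The forwarding behaviour being described here is essentially a drop-tail queue. The sketch below is a simplified Python illustration, not how any particular router is implemented; the class name and fixed capacity are assumptions made for the example.

```python
from collections import deque

class DropTailQueue:
    """Packets are held until the outgoing link can send them,
    and newly arriving packets are discarded once the buffer is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = deque()

    def enqueue(self, packet) -> bool:
        if len(self.buffer) >= self.capacity:
            return False               # queue full: the packet is dropped
        self.buffer.append(packet)
        return True

    def dequeue(self):
        """Called each time the outgoing link is ready to transmit."""
        return self.buffer.popleft() if self.buffer else None
```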
00:11:28.133
And this is what TCP is using
00:11:30.333
as the congestion signal. It’s using the
00:11:32.666
fact that the queue of packets on
00:11:35.100
an outgoing link at a router has
00:11:37.333
filled up. When the queue fills up,
00:11:41.066
the packet gets lost, and it uses
00:11:43.566
that packet loss as
00:11:45.566
an indication that it's sending too fast.
00:11:47.933
It’s sending faster than the packets can
00:11:50.300
be delivered, and as a result the
00:11:52.666
queue has overflowed, a packet has been
00:11:55.000
lost, and so it needs to slow down.
00:11:57.966
And that's the fundamental congestion signal in
00:12:00.666
the network. Packet loss is interpreted as
00:12:03.533
a sign that devices are sending too
00:12:06.366
fast, and should go slower. And if
00:12:10.133
they slow down, the queues will gradually
00:12:12.066
empty, and packets will stop being lost.
00:12:15.366
So that's the first fundamental principle.
00:12:21.033
The second principle is that
00:12:24.800
we want to keep the number of
00:12:27.000
packets in the network roughly constant.
00:12:31.000
TCP, as we saw in the last
00:12:33.266
lecture, sends acknowledgments for packets. When a
00:12:35.866
packet is transmitted it has a sequence
00:12:38.366
number, and the response will come back
00:12:40.500
from the receiver acknowledging receipt of that
00:12:42.566
sequence number.
00:12:44.733
The general approach for TCP, once the
00:12:47.766
connection has got going, is that every
00:12:50.866
time it gets an acknowledgement, it uses
00:12:53.733
that as a signal that a packet
00:12:55.666
has been received.
00:12:57.533
And if a packet has been received,
00:12:59.233
something has left the network. One of
00:13:01.333
the packets sent into the network has
00:13:03.466
reached the other side, and has been
00:13:05.466
removed from the network at the receiver.
00:13:07.833
That means there should be space to
00:13:10.866
put another packet into the network.
00:13:13.900
And it's an approach that’s called ACK
00:13:15.866
clocking. Every time a packet arrives at
00:13:18.133
the receiver, and you get an acknowledgement
00:13:20.833
back saying it was received, that indicates
00:13:22.733
you can put another packet in.
00:13:24.766
So the total number of packets in
00:13:27.000
transit across the network ends up being
00:13:28.766
roughly constant. One packet out, you put
00:13:31.866
another packet in.
00:13:34.433
And it has the advantage that if
00:13:38.300
you're clocking out new packets on receipt
00:13:41.266
of acknowledgments, if, for some reason,
00:13:44.200
the network gets congested, and it takes
00:13:46.966
longer for acknowledgments to come back,
00:13:49.166
because it's taking longer for them to
00:13:50.700
work their way across the network,
00:13:53.566
then that will automatically slow down the
00:13:56.066
rate at which you send. Because it
00:13:58.466
takes longer for the next acknowledgment to
00:14:00.266
come back, therefore it's longer before you
00:14:02.066
send your next packet.
00:14:03.466
So, as the network starts to get
00:14:05.266
busy, as the queue starts to build
00:14:07.333
up, but before the queue has overflowed,
00:14:09.933
it takes longer for the acknowledgments to
00:14:12.233
come back, because the packets are queued
00:14:14.733
up in the intermediate links, and that
00:14:17.133
gradually slows down the behaviour of TCP.
00:14:20.166
It reduces the rate at which you can send.
00:14:23.366
So it’s, to at least some extent,
00:14:25.500
self adjusting. The network gets busier,
00:14:28.133
the ACKs come back slower, therefore you
00:14:30.066
send a little bit slower.
00:14:31.933
And that's the second principle: conservation of
00:14:34.600
packets. One out, one in.
00:14:41.300
And the principle of conservation of packets
00:14:44.866
is great, provided the network is in
00:14:48.333
the steady state.
00:14:50.166
But you also need to be able
00:14:51.733
to adapt the rate at which you're sending.
00:14:54.633
The way TCP adapts is very much
00:14:57.300
focused on starting slowly and gradually increasing.
00:15:04.900
When it needs to increase its sending
00:15:07.100
rate, TCP increases linearly. It adds a
00:15:10.866
small amount to the sending rate each round trip time.
00:15:15.433
So it just gradually, slowly, increases the
00:15:18.000
sending rate. It gradually
00:15:19.500
pushes up the rate
00:15:23.233
until it spots a loss. Until it
00:15:26.566
loses a packet. Until it overflows a queue.
00:15:29.633
And then it responds to congestion by
00:15:32.166
rapidly decreasing its rate. If a congestion
00:15:36.233
event happens, if a packet is lost,
00:15:38.500
TCP halves its rate. It responds faster
00:15:42.266
than it increases, it slows down faster than it increases.
00:15:45.966
And this is the final principle,
00:15:48.166
what’s known as additive increase, multiplicative decrease.
00:15:50.866
The goal is to keep the network
00:15:53.066
stable. The goal is to not overload the network.
00:15:57.733
If you can, keep going at a
00:16:00.366
steady rate. Follow the ACK clocking approach.
00:16:03.300
Gradually, just slowly, increase the rate a
00:16:05.900
bit. Keep pushing, just in case there’s
00:16:08.366
more capacity than you think. So just
00:16:10.800
gradually keep probing to increase the rate.
00:16:13.733
If you overload the network, if you
00:16:16.333
cause congestion, if you overflow the queues,
00:16:18.566
cause a packet to be lost,
00:16:19.766
slow down rapidly. Halve your sending rate,
00:16:22.733
and gradually build up again.
00:16:24.866
The fact that you slow down faster
00:16:27.000
than you speed up, the fact that
00:16:28.966
you follow the one in, one out approach,
00:16:31.900
keeps the network stable. It makes sure
00:16:34.433
it doesn't overload the network, and it
00:16:36.166
means that if the network does overload,
00:16:38.000
it responds and recovers quickly. The goal
00:16:40.866
is to keep the traffic moving.
00:16:42.500
And TCP is very effective at doing this.
00:16:47.366
So those are the fundamental principles of
00:16:49.800
TCP congestion control. Packet loss as an
00:16:52.866
indication of congestion.
00:16:54.800
Conservation of packets, and ACK clocking.
00:16:57.500
One in, one out, where possible.
00:17:00.366
If you need to increase the sending
00:17:03.233
rate, increase slowly. If a problem happens,
00:17:06.100
decrease quickly. And that will keep the network stable.
00:17:10.200
In the next part I’ll talk about
00:17:12.466
TCP Reno, which is one of the
00:17:15.100
more popular approaches for doing this in practice.
Part 2: TCP Reno
The second part of the lecture discusses TCP Reno congestion control.
It outlines the principles of window based congestion control, and
describes how they are implemented in TCP. The choice of initial
window, and how the recommended initial window has changed over time,
is discussed, along with the slow start algorithm for finding the
path capacity and the congestion avoidance algorithm for adapting
the congestion window.
Slides for part 2
00:00:00.666
In the previous part, I spoke about
00:00:02.366
the principles of TCP congestion control in
00:00:04.733
general terms. I spoke about the idea
00:00:07.666
of packet loss as a congestion signal,
00:00:10.300
about the conservation of packets, and about
00:00:12.800
the idea of additive increase multiplicative decrease
00:00:15.500
– increase slowly, decrease the sending rate quite
00:00:17.666
quickly as a way of achieving stability.
00:00:20.333
In this part I want to talk
00:00:21.900
about TCP Reno, and some of the
00:00:24.066
details of how TCP congestion control works in practice.
00:00:27.500
I’ll talk about the basic TCP congestion
00:00:29.866
control algorithm, how the sliding window algorithm
00:00:33.033
works to adapt the sending rate,
00:00:36.100
and the slow start and congestion avoidance
00:00:39.566
phases of congestion control.
00:00:44.600
TCP is what's known as a window
00:00:48.166
based congestion control protocol.
00:00:51.100
That is, it maintains what's known as
00:00:54.633
a sliding window of data which is
00:00:56.700
available to be sent over the network.
00:00:59.800
And the sliding window determines what range
00:01:02.100
of sequence numbers can be sent by
00:01:04.500
TCP onto the network.
00:01:06.933
It uses the additive increase multiplicative decrease
00:01:11.100
approach to grow and shrink the window.
00:01:13.300
And that determines, at any point,
00:01:15.666
how much data a TCP sender can send
00:01:17.700
onto the network.
00:01:19.666
It augments these with algorithms known as
00:01:22.000
slow start and congestion avoidance. Slow start
00:01:24.933
being the approach TCP uses to get
00:01:28.900
a connection going in a safe way,
00:01:31.533
and congestion avoidance being the approach it
00:01:33.800
uses to maintain the sending rate once
00:01:36.733
the flow has got started.
00:01:39.633
The fundamental goal of TCP is that
00:01:42.233
if you have several TCP flows sharing
00:01:45.533
a link, sharing a bottleneck link in the network,
00:01:50.300
each of those flows should get an
00:01:52.733
approximately equal share of the bandwidth.
00:01:55.900
So, if you have four TCP flows
00:01:57.666
sharing a link, they should each get
00:01:59.733
approximately one quarter of the capacity of that link.
00:02:03.866
And TCP does this reasonably well.
00:02:06.666
It’s not perfect. It, to some extent,
00:02:10.100
biases against long distance flows,
00:02:13.433
and shorter flows tend to win out
00:02:15.900
a little over long distance flows.
00:02:18.066
But, in general, it works pretty well,
00:02:19.900
and does give flows a roughly
00:02:22.866
equal share of the bandwidth.
00:02:26.100
The basic algorithm it uses to do
00:02:28.066
this, the basic congestion control algorithm,
00:02:30.600
is an approach known as TCP Reno.
00:02:32.233
And this is the state of the
00:02:35.200
art in TCP as of about 1990.
00:02:42.866
TCP is an ACK based protocol.
00:02:46.700
You send a packet, and sometime later
00:02:48.933
an acknowledgement comes back telling you that
00:02:52.000
the packet arrived, and indicating the sequence
00:02:54.366
number of the next packet which is expected.
00:02:59.300
The simplest way you might think that
00:03:01.300
would work, is you send a packet.
00:03:03.466
You wait for the acknowledgment. You send
00:03:05.533
another packet. You wait for the acknowledgement. And so on.
00:03:09.500
The problem with that, is that it
00:03:11.166
tends to perform very poorly.
00:03:14.200
It takes a certain amount of time
00:03:16.366
to send a packet down a link.
00:03:18.666
That depends on the size of the
00:03:20.033
packet, and the link bandwidth.
00:03:23.566
The size of the packet is expressed
00:03:25.600
as some number of bits to be sent.
00:03:27.700
The link bandwidth is expressed in some
00:03:29.800
number of bits it can deliver each
00:03:31.300
second. And if you divide the
00:03:33.233
packet size by the bandwidth, that gives
00:03:35.300
you the number of seconds it takes to send each packet.
00:03:39.166
It takes a certain amount of time
00:03:41.300
for that packet to propagate down the
00:03:43.733
link to the receiver, and for the
00:03:45.733
acknowledgment to come back to you, depending on
00:03:48.900
the round trip time of the link.
00:03:51.100
And you can measure the round trip time of the link.
00:03:54.833
And you can divide one by the other.
00:03:57.533
You can take the time it takes to send a packet, and the
00:04:00.333
time it takes for the acknowledgment to
00:04:02.066
come back, and divide one by the
00:04:03.900
other, to get the link utilisation.
00:04:07.200
And, ideally, you want that fraction to be
00:04:08.900
close to one. You want to be
00:04:11.166
spending most of the time sending packets,
00:04:13.466
and not much time waiting for the
00:04:15.166
acknowledgments to come back before you can
00:04:16.900
send the next packet.
00:04:19.866
The problem is that's often not the case.
00:04:23.566
For example, if we assume we're trying
00:04:25.566
to send data, and we have a
00:04:27.366
gigabit link, which is connecting the machine
00:04:30.100
we're sending data from, and we’re trying
00:04:31.800
to go from Glasgow to London.
00:04:33.966
And this might be the case you would find if you had one
00:04:37.133
of the machines in the Boyd Orr
00:04:39.233
labs, which is connected to the University's
00:04:42.166
gigabit Ethernet, and the University has a
00:04:44.666
10 gigabit per second link to the
00:04:46.666
rest of the Internet, so the bottleneck is that Ethernet.
00:04:51.100
If you're talking to a machine in London,
00:04:54.100
let's make some assumptions on how long this will take.
00:04:59.166
You’re sending using Ethernet, and the biggest
00:05:01.533
packet an Ethernet can deliver is 1500
00:05:03.566
bytes. So 1500 bytes, multiplied by eight
00:05:06.833
bits per byte, gives you a number
00:05:09.066
of bits in the packet. And it’s
00:05:11.366
a gigabit Ethernet, so it's sending a
00:05:13.233
billion bits per second.
00:05:15.400
So 1500 bytes, times eight bits,
00:05:17.866
divided by a billion bits per second.
00:05:21.133
It will take 12 microseconds, 0.000012 of
00:05:26.866
a second, 12 microseconds to send a
00:05:29.266
packet down the link. And that’s just
00:05:31.800
the time it takes to physically serialise
00:05:34.066
1500 bytes down a gigabit per second link.
00:05:39.400
The round trip time to London, if you measure it, is about
00:05:44.566
a 100th of a second, about 10 milliseconds.
00:05:47.833
If you divide one by the other,
00:05:50.200
you find that the utilisation is 0.0012.
00:05:54.700
0.12% of the link is in use.
00:05:59.266
The time it takes to send a
00:06:00.933
packet is tiny compared to the time
00:06:02.833
it takes to get a response.
00:06:04.566
So if you're just sending one packet,
00:06:06.433
and waiting for a response, the link
00:06:08.166
is idle 99.9% of the time.
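The arithmetic in this example can be checked directly; a short Python calculation using the numbers quoted above (a 1500 byte packet, a gigabit link, and a round trip time of about 10 ms):

```python
packet_size_bits = 1500 * 8        # 12,000 bits per Ethernet frame
link_rate_bps    = 1_000_000_000   # gigabit Ethernet
rtt_seconds      = 0.010           # roughly Glasgow to London

serialisation_time = packet_size_bits / link_rate_bps   # 12 microseconds
utilisation        = serialisation_time / rtt_seconds   # 0.0012

print(f"{serialisation_time * 1e6:.0f} microseconds per packet, "
      f"utilisation {utilisation:.2%}")                 # -> 12 microseconds, 0.12%
```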
00:06:14.166
The idea of a sliding window protocol
00:06:16.733
is to not just send one packet
00:06:18.566
and wait for an acknowledgement.
00:06:20.133
It’s to send several packets,
00:06:22.566
and wait for the acknowledgments. And the
00:06:25.466
window is the number of packets that
00:06:27.266
can be outstanding before the acknowledgement comes back.
00:06:31.000
The idea is, you can start several
00:06:33.133
packets going, and eventually the acknowledgement comes
00:06:36.566
back, and that starts triggering the next
00:06:38.466
packets to be clocked out. The idea
00:06:40.233
is to improve the utilisation by sending
00:06:42.500
more than one packet before you get an acknowledgment.
00:06:47.200
And this is the fundamental approach to
00:06:49.266
sliding window protocols. The sender starts sending
00:06:51.833
data packets, and there's what's known as
00:06:54.433
a congestion window that specifies how
00:06:57.000
many packets it’s allowed to send
00:06:59.600
before it gets an acknowledgement.
00:07:02.033
And, in this example, the congestion window is six packets.
00:07:06.133
And the sender starts. It sends the
00:07:08.300
first data packet, and that gets sent
00:07:11.100
and starts its way traveling down the link.
00:07:14.533
And at some point later it sends
00:07:16.566
the next packet, and then the next packet, and so on.
00:07:20.433
After a certain amount of time that
00:07:22.366
first packet arrives at the receiver,
00:07:24.400
and the receiver generates the acknowledgment, which
00:07:26.800
comes back towards the sender.
00:07:28.933
And while this is happening, the sender
00:07:30.633
is sending more of the packets from its window.
00:07:33.966
And the receiver’s gradually receiving those and
00:07:36.266
sending the acknowledgments. And, at some point later,
00:07:38.966
the acknowledgement makes it back to the sender.
00:07:42.666
And in this case we've set the
00:07:44.733
window size to be six packets.
00:07:46.700
And it just so happens that the
00:07:48.500
acknowledgement for the first packet arrives back
00:07:51.733
at the sender, just as it has finished sending packet six.
00:07:57.700
And that triggers the window to increase.
00:07:59.866
That triggers the window to slide along.
00:08:02.066
So instead of being allowed to send packets one through six,
00:08:05.533
we're now allowed to send packets two
00:08:07.366
through seven. Because one packet has arrived,
00:08:09.833
that's opened up the window to allow
00:08:11.400
us to send one more packet.
00:08:13.733
And the acknowledgement indicates that packet one
00:08:16.200
has arrived. So just as we'd run
00:08:19.133
out of packets to send, just as
00:08:20.600
we've sent our six packets which are
00:08:22.533
allowed by the window, the acknowledgement arrives,
00:08:25.033
slides the window along one,
00:08:26.566
tells us we can now send one more.
00:08:29.566
And the idea is that you size
00:08:31.600
the window such that you send just
00:08:33.366
enough packets that by the time the
00:08:35.600
acknowledgement comes back, you're ready to slide
00:08:37.800
the window along. You've sent everything that
00:08:40.000
was in your window.
00:08:41.766
And each acknowledgement releases the next packet
00:08:44.833
for transmission, if you get the window sized right.
00:08:48.766
And if there's a problem, if the acknowledgments
00:08:51.233
don't come back because something got lost,
00:08:54.033
then it stalls. You haven't sent too
00:08:56.966
many excess packets, you don't just keep
00:08:59.200
sending without getting acknowledgments,
00:09:01.466
you're just sending enough
00:09:02.933
that the acknowledgments come back, just as
00:09:04.966
you run out of things to send.
00:09:06.966
And everything just stays sort-of balanced.
00:09:09.300
Every acknowledgement triggers the next packet to
00:09:11.466
be sent, and it rolls along.
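A rough Python sketch of this ACK-clocked sliding window is shown below. The send_packet and wait_for_ack callbacks are placeholders rather than real socket calls, and the window is fixed rather than adapted, so this is only an illustration of the bookkeeping:

```python
from collections import deque

def sliding_window_send(packets: deque, window: int, send_packet, wait_for_ack):
    """Allow at most `window` unacknowledged packets; each acknowledgement
    that arrives releases the next packet for transmission."""
    in_flight = 0
    while packets or in_flight > 0:
        # Fill the window: keep sending while fewer than `window` are outstanding.
        while packets and in_flight < window:
            send_packet(packets.popleft())
            in_flight += 1
        # Block until one acknowledgement returns, freeing space in the window.
        wait_for_ack()
        in_flight -= 1
```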
00:09:14.500
How big should the window be? Well,
00:09:17.166
it should be sized to match the
00:09:18.466
bandwidth times the delay on the path.
00:09:20.900
And you work it out in bytes.
00:09:23.266
It's the bandwidth of the path,
00:09:24.600
a gigabit in the previous example,
00:09:26.933
times the latency,
00:09:28.300
100th of a second, and you multiply
00:09:30.833
those together and that tells you how
00:09:32.366
many bytes can be in flight.
00:09:33.733
And you divide that by the packet
00:09:35.366
size, and that tells you how many packets you can send.
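Using the same gigabit, 10 ms example as before, the ideal window works out roughly as follows (ignoring headers and other overheads):

```python
bandwidth_bps = 1_000_000_000   # 1 Gb/s bottleneck
rtt_seconds   = 0.010           # 10 ms round trip time
packet_bytes  = 1500            # bytes per packet

bdp_bytes   = (bandwidth_bps / 8) * rtt_seconds   # 1,250,000 bytes in flight
window_pkts = bdp_bytes / packet_bytes            # roughly 833 packets

print(f"{bdp_bytes:,.0f} bytes in flight, a window of about {window_pkts:.0f} packets")
```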
00:09:39.633
The problem is, the sender doesn't know
00:09:42.333
the bandwidth of the path, and it
00:09:44.700
doesn't know that latency. It doesn't know
00:09:46.633
the round trip time.
00:09:49.366
It can measure the round trip time,
00:09:51.433
but not until after it started sending.
00:09:53.733
Once it’s sent a packet, it can
00:09:55.800
wait for an acknowledgement to come back
00:09:57.566
and get an estimate of the round
00:09:58.966
trip time. But it can't do that
00:10:00.566
at the point where it starts sending.
00:10:02.566
And it can't know what is the
00:10:04.300
bandwidth. It knows the bandwidth of the
00:10:06.500
link it's connected to, but it doesn't
00:10:08.033
know the bandwidth for the rest of
00:10:09.500
the links throughout the network.
00:10:11.366
It doesn't know how many other TCP
00:10:13.566
flows it’s sharing the traffic with,
00:10:15.133
so it doesn't know how much of
00:10:16.600
that capacity it's got available.
00:10:19.000
And this is the problem with
00:10:21.366
the sliding window algorithms. If you get
00:10:24.033
the window size right,
00:10:26.100
it allows you to do the ACK
00:10:27.900
clocking, it allows you to clock out
00:10:29.566
the packets at the right time,
00:10:31.100
just in time for the next packet to become available.
00:10:34.166
But, in order to pick the right
00:10:35.500
window size, you need to know the
00:10:36.833
bandwidth and the delay, and you don't
00:10:38.666
know either of those at the start of the connection.
00:10:44.333
TCP follows the sliding window approach.
00:10:47.700
TCP Reno is very much a sliding
00:10:51.266
window protocol, and it's optimised for not
00:10:53.966
knowing what the window sizes are.
00:10:58.466
And the challenge with TCP is to
00:11:01.033
pick what should be the initial window.
00:11:02.933
To pick how many packets you should
00:11:04.566
send, before you know anything about the
00:11:06.600
round trip time, or anything about bandwidth.
00:11:09.700
And how to find the path capacity,
00:11:11.633
how to figure out at what point
00:11:13.700
you've got the right size window.
00:11:15.866
And then how to adapt the window
00:11:18.833
to cope with changes in the capacity.
00:11:23.600
So there's two fundamental problems with TCP
00:11:26.766
Reno congestion control. Picking the initial window size
00:11:31.666
for the first set of packets you send.
00:11:34.833
And then, adapting that initial window size
00:11:37.500
to find the bottleneck capacity, and to
00:11:39.733
adapt to changes in that bottleneck capacity.
00:11:42.366
If you get the window size right,
00:11:44.500
you can make effective use of the
00:11:46.033
network capacity. If you get it wrong
00:11:48.633
you’ll either send too slowly, and end
00:11:50.900
up wasting capacity. Or you'll send too
00:11:53.033
quickly, and overload the network, and cause
00:11:55.200
packets to be lost because the queues fill.
00:12:01.800
So, how does TCP find the initial window?
00:12:05.966
Well, to start with, you have no
00:12:07.766
information. When you're making a TCP connection
00:12:10.900
to a host you haven't communicated with
00:12:12.966
before, you don't know the round trip
00:12:15.100
time to that host, you don’t know
00:12:16.633
how long it will take to get
00:12:17.800
a response, and you don't know the network capacity.
00:12:21.133
So you have no information to know
00:12:23.666
what an appropriately sized window should be.
00:12:27.966
The only safe thing you can do.
00:12:30.600
The only thing which is safe in
00:12:32.133
all circumstances, is to send one packet,
00:12:34.766
and see if it arrives, see if you get an ACK.
00:12:38.433
And if it works, send a little
00:12:39.933
bit faster next time.
00:12:42.500
And then gradually increase the rate at which you send.
00:12:46.100
The only safe thing to do
00:12:48.033
is to start at the lowest possible rate,
00:12:50.400
equivalent of stop-and-wait, and then gradually
00:12:53.700
increase your rate from there, once you know that it works.
00:12:58.366
The problem is, of course, that's pessimistic,
00:13:00.433
in most cases.
00:13:02.000
Most links are not the slowest possible link.
00:13:04.500
Most links, you can send faster than that.
00:13:09.233
What TCP has traditionally done, and the
00:13:12.466
traditional approach in TCP Reno, is to declare
00:13:15.300
the initial window to be three packets.
00:13:18.533
So you can send three packets,
00:13:20.300
without getting any acknowledgments back.
00:13:23.300
And, by the time the third packet
00:13:24.800
has been sent, you should be just
00:13:27.033
about to get the acknowledgement back,
00:13:28.566
which will open it up for you to send the fourth.
00:13:30.933
And at that point, it starts ACK clocking.
00:13:34.700
And why is it three packets?
00:13:37.066
Because someone did some measurements,
00:13:38.933
and decided that was what was safe.
00:13:42.500
More recently, I guess, about 10 years
00:13:45.666
ago now, Nandita Dukkipati and her group
00:13:49.333
at Google did another set of measurements,
00:13:52.333
and showed that was actually pessimistic.
00:13:55.066
The networks had gotten a lot faster
00:13:57.233
in the time since TCP was first
00:13:59.833
standardised, and they came to the conclusion,
00:14:02.900
based on the measurements of browsers accessing
00:14:05.733
the Google site, that about 10 packets
00:14:08.600
was a good starting point.
00:14:11.500
And the idea here is that 10
00:14:13.133
packets, you can send 10 packets at
00:14:15.533
the start of a connection, and after
00:14:18.500
you’ve sent 10 packets you should have
00:14:20.266
got an acknowledgement back.
00:14:22.666
Why ten?
00:14:24.633
Again, it's a balance between safety and
00:14:27.233
performance. If you send too many packets
00:14:31.633
onto a network which can't cope with
00:14:33.333
them, those packets will get queued up
00:14:35.533
and, in the best case, it’ll just
00:14:37.566
add latency because they're all queued up
00:14:39.666
somewhere. And in the worst case they'll
00:14:41.466
overflow the queues, and cause packet loss,
00:14:43.500
and you'll have to re-transmit them.
00:14:45.900
So you don't want to send too
00:14:47.733
fast. Equally, you don't want to send
00:14:49.700
too slow, because that just wastes capacity.
00:14:52.733
And the measurements that Google came up with
00:14:56.000
at this point, which was around 10
00:14:58.133
years ago, was that about 10 packets
00:15:00.433
was a good starting point for most connections.
00:15:03.466
It was unlikely to cause congestion in
00:15:06.800
most cases, and was also unlikely to
00:15:08.966
waste too much bandwidth.
00:15:11.900
And I think what we'd expect to
00:15:13.333
see, is that over time the initial
00:15:14.900
window will gradually increase, as network connections
00:15:17.233
around the world gradually get faster.
00:15:19.566
And it's balancing making good use of
00:15:22.766
connections in well-connected
00:15:25.133
first-world parts of the world, where there’s
00:15:28.633
good infrastructure,
00:15:30.800
against not overloading connections in parts of
00:15:34.333
the world where the infrastructure is less well developed.
00:15:40.233
The initial window lets you send something.
00:15:43.266
With a modern TCP, it lets you send 10 packets.
00:15:48.266
And you can send those 10 packets,
00:15:50.166
or whatever the initial window is,
00:15:52.200
without waiting for an acknowledgement to come back.
00:15:55.733
But it's probably not the right size;
00:15:58.333
it’s probably not the right window size.
00:16:01.300
If you're on a very fast connection,
00:16:04.200
in a well-connected part of the world,
00:16:06.033
you probably want a much bigger window than 10 packets.
00:16:09.033
And if you're on a poor quality
00:16:11.500
mobile connection, or in a part of
00:16:13.433
the world where the infrastructure is less
00:16:15.133
well developed, you probably want a smaller window.
00:16:18.433
So you need to somehow adapt
00:16:20.000
to match the network capacity.
00:16:23.466
And there's two parts to this.
00:16:25.700
What's called slow start, where you try
00:16:28.200
to quickly find the appropriate initial window,
00:16:32.366
where, starting from the initial window, you quickly
00:16:34.900
converge on what the right window is.
00:16:37.266
And congestion avoidance, where you adapt in
00:16:39.800
the long term to match changes in
00:16:42.633
capacity once the thing is running.
00:16:47.300
So how does slow start work?
00:16:49.400
Well, this is the phase at the beginning of the connection.
00:16:52.766
It's easiest to illustrate if you assume
00:16:55.066
that the initial window is one packet.
00:16:57.600
If the initial window is one packet,
00:16:59.966
you send one packet, and at some
00:17:02.066
point later an acknowledgement comes back.
00:17:05.200
And the way slow start works is
00:17:07.066
that each acknowledgment you get back
00:17:09.433
increases the window by one.
00:17:13.733
So if you send one packet,
00:17:15.833
and get one packet back, that increases
00:17:18.466
the window from one to two,
00:17:20.133
so you can send two packets the next time.
00:17:23.133
And you send those two packets,
00:17:25.333
and you get two acknowledgments back.
00:17:27.066
And each acknowledgments increases the window by
00:17:29.233
one, so it goes to three,
00:17:30.800
and then to four. So you can
00:17:32.166
send four packets the next time.
00:17:35.233
And then you get four acknowledgments back,
00:17:37.666
each of which increases the window,
00:17:39.433
so your window is now eight.
00:17:42.133
And, as we are all, I think,
00:17:45.400
painfully aware after the pandemic, this is
00:17:47.966
exponential growth.
00:17:50.233
The window is doubling each time.
00:17:52.300
So it's called slow start because it
00:17:54.366
starts very slow, with one packet or
00:17:56.500
three packets or 10 packets, depending on
00:17:58.600
the version of TCP you have.
00:18:00.466
But each round trip time the window doubles.
00:18:03.666
It doubles its sending rate each time.
00:18:06.866
And this carries on until it loses
00:18:09.533
a packet. This carries on until it
00:18:11.766
fills the queues and overflows the capacity
00:18:14.300
of the network somewhere.
00:18:16.333
At which point it halves back to
00:18:18.266
its previous value, and drops out of
00:18:19.866
the slow start phase.
00:18:23.733
If we look at this graphically,
00:18:26.133
what we see on the graph at
00:18:27.800
the bottom of the slide, we have
00:18:29.600
time on the X axis, and the
00:18:31.666
congestion window, the size of the congestion
00:18:33.800
window, on the y axis.
00:18:35.700
And we're assuming an initial window of
00:18:37.433
one packet. We see that, on the
00:18:39.933
first round trip it sends the one
00:18:41.566
packet, gets the acknowledgement back. The second
00:18:44.766
round trip it sends two packets.
00:18:46.700
And then four, and then eight,
00:18:48.166
and then 16. And each time it
00:18:50.400
doubles its sending rate.
00:18:52.366
So you have this exponential growth phase,
00:18:54.500
starting at whatever the initial window is,
00:18:57.466
and doubling each time until it reaches
00:18:59.366
the network capacity.
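The doubling can be traced with a tiny Python loop, assuming an initial window of one packet and collapsing the per-ACK bookkeeping into one doubling per round trip:

```python
cwnd = 1                       # initial window of one packet, as in the example
for rtt in range(5):
    print(f"round trip {rtt}: send {cwnd} packet(s)")
    cwnd *= 2                  # each ACK adds one, so the window doubles per RTT
# prints 1, 2, 4, 8, 16: exponential growth until a queue overflows
```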
00:19:01.500
And eventually it fills the network.
00:19:03.600
Eventually some queue, somewhere in the network,
00:19:05.766
is full. And it overflows and the packet gets lost.
00:19:10.266
At that point the connection halves its
00:19:12.200
rate, back to the value just before
00:19:14.466
it last increased. In this example,
00:19:17.233
we see that it got up to
00:19:19.333
a window of 16, and then
00:19:21.900
something got lost, and then it halved
00:19:23.433
back down to a window of eight.
00:19:26.266
At that point TCP enters what's known
00:19:28.466
as the congestion avoidance phase.
00:19:33.500
The goal of congestion avoidance is to
00:19:37.500
adapt to changes in capacity.
00:19:41.300
After the slow start phase, you know
00:19:43.366
you've got approximately the right size window
00:19:45.466
for the path. It's telling you roughly
00:19:47.366
how many packets you should be sending
00:19:48.900
each round trip time. The goal,
00:19:51.266
once you’re in congestion avoidance, is to adapt to changes.
00:19:55.666
Maybe the capacity of the path changes.
00:19:58.900
Maybe you're on a mobile device,
00:20:00.900
with a wireless connection, and the quality
00:20:04.033
of the wireless connection changes.
00:20:06.400
Maybe the amount of cross traffic changes.
00:20:09.466
Maybe additional people start sharing the link
00:20:12.266
with you, and you have less capacity
00:20:14.033
because you’re sharing with more TCP flows.
00:20:16.666
Or maybe some of the cross traffic
00:20:18.033
goes away, and the amount of capacity
00:20:20.100
you have available increases because there's less
00:20:22.133
competing traffic.
00:20:24.433
And the congestion avoidance phase follows an
00:20:27.200
additive increase, multiplicative decrease,
00:20:29.300
approach to adapting
00:20:30.633
the congestion window when that happens.
00:20:34.866
So, in congestion avoidance,
00:20:38.166
if it successfully manages to send a
00:20:40.466
complete window of packets, and gets acknowledgments
00:20:43.300
back for each of those packets.
00:20:45.333
So it's sent out
00:20:47.900
eight packets, for example, and gets eight
00:20:50.600
acknowledgments back,
00:20:52.366
it knows the network can support that sending rate.
00:20:55.766
So it increases its window by one.
00:20:59.133
So the next time, it sends out nine packets
00:21:02.600
and expects to get nine acknowledgments back
00:21:05.333
over the next round trip cycle.
00:21:08.233
And if it successfully does that,
00:21:09.966
it increases the window again.
00:21:12.500
And it sends 10 packets, and expects
00:21:15.400
to get 10 acknowledgments back.
00:21:17.800
And we see that each round trip
00:21:20.000
it gradually increases the sending rate by
00:21:22.166
one. So it sends 8 packets,
00:21:24.566
then 9, then 10, then 11,
00:21:26.333
and 12, and keeps gradually, linearly,
00:21:29.166
increasing its rate.
00:21:31.900
Up until the point that something gets lost.
00:21:36.966
And if a packet gets lost?
00:21:40.300
You’ll be able to detect that because,
00:21:43.100
as we saw in the previous lecture,
00:21:44.733
you'll get a triple duplicate acknowledgement.
00:21:46.833
And that indicates that one of the
00:21:49.433
packets got lost, but the rest of
00:21:50.933
the data in the window was received.
00:21:54.666
And what you do at that point,
00:21:56.500
is you do a multiplicative decrease in
00:21:58.566
the window. You halve the window.
00:22:02.300
So, in this case, the sender was
00:22:04.533
sending with a window of
00:22:07.133
12 packets, and it successfully sent that.
00:22:10.200
And then it tried to send,
00:22:13.500
tried to increase its rate, realised it
00:22:17.066
didn't work, realised something got lost,
00:22:19.133
and so it halved its window back down to six.
00:22:23.500
And then it gradually switches back,
00:22:25.466
and goes back to
00:22:27.400
the gradual additive increase.
00:22:29.733
And it follows this sawtooth pattern.
00:22:32.433
Gradual linear increase, one packet more each
00:22:35.666
round trip time.
00:22:37.633
Until it sends too fast, causes a
00:22:40.166
packet to be lost because it overflows
00:22:41.966
a queue, halves its sending rate,
00:22:44.133
and then gradually starts increasing it again.
00:22:47.833
It follows this sawtooth pattern. Gradual increase,
00:22:51.500
quick back-off; gradual increase, quick back-off.
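The sawtooth in this example can be traced with a few lines of Python; the point at which the loss occurs (a window of 13 packets) is an assumption chosen to match the numbers above:

```python
cwnd = 8                                    # window after slow start, as above
for rtt in range(10):
    if cwnd == 13:                          # the attempted increase overflows a queue
        print(f"round trip {rtt}: cwnd {cwnd} -> loss, halve to {cwnd // 2}")
        cwnd //= 2                          # multiplicative decrease back to 6
    else:
        print(f"round trip {rtt}: cwnd {cwnd}, whole window acknowledged")
        cwnd += 1                           # additive increase of one packet per RTT
```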
00:22:57.433
The other way TCP can detect the
00:22:59.633
loss is by what’s known as a
00:23:01.266
time out. It’s sending the packets,
00:23:04.500
and suddenly the acknowledgements stop coming back entirely.
00:23:09.633
And this means that either the receiver
00:23:11.833
has crashed, the receiving system has gone
00:23:14.933
away, or perhaps more likely the network has failed.
00:23:18.733
And the data it’s sending is either
00:23:21.600
not reaching the receiver, or the reverse path has failed,
00:23:24.766
and the acknowledgments are not coming back.
00:23:29.200
At that point, after nothing has come back for a while,
00:23:33.333
it assumes a timeout has happened,
00:23:37.466
and resets the window down to the initial window.
00:23:41.833
And in the example we see on
00:23:43.866
the slide, at time 14 we've got
00:23:45.933
a timeout, and it resets and the
00:23:48.500
window goes back to the initial window of one packet.
00:23:51.566
At that point, it re-enters slow start.
00:23:53.633
It starts again from the beginning.
00:23:55.966
And whether your initial window is one
00:23:58.066
packet, or three packets, or ten packets,
00:24:00.233
it starts in the beginning, and it
00:24:02.066
re-enters slow start, and it tries again
00:24:04.100
for the connection.
00:24:06.466
And if this was a transient failure,
00:24:08.500
that will probably succeed. If it wasn’t,
00:24:11.366
it may end up in yet another
00:24:13.900
timeout, while it takes time for the
00:24:15.600
network to recover, or
00:24:17.933
for the system you're talking to,
00:24:19.866
to recover, and it will be a
00:24:21.266
while before it can successfully send a
00:24:22.966
packet. But, when it does, when the
00:24:24.766
network recovers, it starts sending again,
00:24:26.866
and resets the connection from the beginning.
00:24:30.366
How long, should the timeout be?
00:24:33.533
Well, the standard says the larger of
00:24:37.200
one second, or the average round trip
00:24:39.900
time plus four times the statistical variance
00:24:42.200
in the round trip time.
00:24:45.200
And, if you're a statistician, you’ll recognise
00:24:47.666
that the RTT plus four times the
00:24:49.766
variance, if you're assuming a normal distribution of
00:24:54.233
round trip time samples, accounts for 99%
00:24:57.733
of the samples falling within range.
00:25:01.266
So it's finding the 99th percentile of
00:25:04.466
the expected time to get an acknowledgement back.
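This description matches the retransmission timeout calculation in the TCP standards (RFC 6298). A simplified Python sketch, using the standard smoothing gains of 1/8 and 1/4 and omitting the initialisation of the smoothed values and the clock-granularity term:

```python
ALPHA, BETA, K = 1/8, 1/4, 4       # smoothing gains and variance multiplier

def update_rto(srtt: float, rttvar: float, rtt_sample: float):
    """Update the smoothed RTT and its variation from a new sample, and
    return the timeout: SRTT + 4 * RTTVAR, but never less than one second."""
    rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - rtt_sample)
    srtt   = (1 - ALPHA) * srtt + ALPHA * rtt_sample
    rto    = max(1.0, srtt + K * rttvar)
    return srtt, rttvar, rto
```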
00:25:12.700
Now, TCP follows this sawtooth behaviour,
00:25:16.866
with gradual additive increase in the sending
00:25:19.466
rate, and then a back-off, halving its
00:25:22.333
sending rate, and then a gradual increase again.
00:25:25.633
And we see this in the top
00:25:27.166
graph on the slide which is showing a
00:25:29.766
measured congestion window for a real TCP flow.
00:25:34.166
And, after dynamics of the slow start
00:25:36.266
at the beginning, we see it follows this sawtooth pattern.
00:25:41.366
How does that affect the rest of the network?
00:25:45.033
Well, the packets are, at some point,
00:25:48.133
getting queued up at whatever the bottleneck link is.
00:25:53.733
And the second graph we see on
00:25:55.466
the left, going down, is the size of the queue.
00:25:58.866
And we see that as the sending
00:26:00.766
rate increases, the queue gradually builds up.
00:26:04.200
Initially the queue is empty, and as
00:26:06.566
it starts sending faster, the queue gradually gets fuller.
00:26:11.333
And at some point the queue gets full, and overflows.
00:26:17.866
And when the queue gets full,
00:26:19.633
when the queue overflows, when packets gets
00:26:21.800
lost, TCP halves its sending rate.
00:26:24.700
And that causes the queue to rapidly
00:26:27.166
empty, because there's less packets coming in,
00:26:29.566
so the queue drains.
00:26:31.466
But what we see is that just
00:26:33.266
as the queue is getting to empty,
00:26:35.666
the rate is starting to increase again.
00:26:38.566
Just as the queue gets the point
00:26:40.200
where it would have nothing to send,
00:26:41.833
the rate starts picking up, such that
00:26:44.033
the queue starts to gradually refill.
00:26:46.600
So the queues in the routers also
00:26:48.600
follow a sawtooth pattern. They gradually fill
00:26:51.500
up until they are full,
00:26:55.200
and then the rate halves, the queue
00:26:58.433
empties rapidly because
00:27:00.133
there's much less traffic coming in,
00:27:02.133
and as it's emptying the rate at
00:27:04.233
which the sender is sending is gradually
00:27:06.500
increasing, and the queue size oscillates.
00:27:09.266
And we see the same thing happens
00:27:11.066
with the round trip time, in the
00:27:13.766
third of the graphs, as the queue gradually
00:27:17.000
fills up, the round trip time goes
00:27:18.900
up, and up, and up, it's taking
00:27:20.733
longer for the packets because they're queued up somewhere.
00:27:23.366
And then the rate reduces, the queue
00:27:26.266
drops, the round trip time drops.
00:27:28.733
And then, as the rate picks up again
00:27:33.066
in congestion avoidance, the queue gradually
00:27:35.666
fills, the round trip time gradually increases.
00:27:38.466
So the window size, and the queue
00:27:40.666
size, and the round trip time,
00:27:42.266
all follow this characteristic sawtooth pattern.
00:27:47.066
What's interesting though, if we look at
00:27:50.100
the fourth graph down on the left,
00:27:52.800
is we're looking at the rate at
00:27:54.333
which packets are arriving at the receiver.
00:27:56.966
And we see that the rate at
00:27:58.800
which packets are arriving at the receiver
00:28:00.533
is pretty much constant.
00:28:03.300
What's happening is that the packets are
00:28:05.266
being queued up at the link,
00:28:07.400
and as the queue fills there's more
00:28:09.833
and more packets queued up
00:28:11.900
at the bottleneck link. And when TCP
00:28:15.366
backs-off, when it reduces its window,
00:28:19.000
that lets the queue drain. But the
00:28:21.866
queue never quite empties. We just see
00:28:25.133
very occasional drops where the queue gets
00:28:27.566
empty, but typically the queue always has
00:28:30.033
something in it.
00:28:31.800
It's emptying rapidly, it’s getting less and
00:28:34.166
less data in it, but the queue,
00:28:37.666
if the buffer is sized right,
00:28:39.866
if the window is chosen right, never quite empties.
00:28:43.800
So the TCP sender is following this
00:28:46.433
sawtooth pattern, with its sending window,
00:28:49.600
which is gradually filling up the queues.
00:28:51.966
And then the queues are gradually draining
00:28:53.966
when TCP backs-off and halves its rate,
00:28:56.933
but the queue never quite empties.
00:28:58.933
It always has some data to send,
00:29:00.633
so the receiver is always receiving data.
00:29:03.700
So, even though the sender's following the
00:29:05.766
sawtooth pattern, the receiver receives constant rate
00:29:08.266
data the whole time,
00:29:10.233
at approximately the bottleneck bandwidth.
00:29:13.866
And that's the genius of TCP.
00:29:16.566
It manages, by following this additive increase,
00:29:20.066
multiplicative decrease, approach, it manages to adapt
00:29:24.333
the rate such that the buffer never
00:29:27.200
quite empties, and the data continues to be delivered.
00:29:32.233
And for that to work, it needs
00:29:34.433
the router to have enough buffering capacity
00:29:37.400
in it. And the amount of buffering
00:29:39.600
the router needs, is the bandwidth times
00:29:42.166
the delay of the path. And too
00:29:44.333
little buffering in the router
00:29:47.033
leads to
00:29:49.933
the queue overflowing, and it not quite
00:29:52.633
managing to sustain the rate. Too much,
00:29:55.500
you just get what’s known as buffer bloat.
00:29:59.366
It's safe, I mean in terms of
00:30:00.700
throughput, it keeps receiving the data.
00:30:02.766
But the queues get very big,
00:30:04.800
and they never get anywhere near empty,
00:30:07.466
so the amount of data queued up
00:30:09.766
increases, and you just get increased latency.
00:30:15.033
So that's TCP Reno. It's really effective
00:30:18.100
at keeping the bottleneck fully utilised.
00:30:20.466
But it trades latency for throughput.
00:30:22.866
It tries to fill the queue,
00:30:24.766
it's continually pushing, it’s continually queuing up data.
00:30:28.066
Making sure the queue is never empty.
00:30:30.800
Making sure the queue is never empty,
00:30:32.500
so provided there’s enough buffering in the
00:30:34.800
network there are always packets being delivered.
00:30:37.566
And that's great, if your goal is
00:30:39.966
to maximise the rate at which information
00:30:42.400
is delivered. TCP is really good at
00:30:45.466
keeping the bottleneck link fully utilised.
00:30:47.800
It’s really, really good at delivering data
00:30:49.900
as fast as the network can support it.
00:30:52.333
But it trades that off for latency.
00:30:56.500
It's also really good at making sure
00:30:59.166
there are queues in the network,
00:31:01.066
and making sure that the network is
00:31:03.466
not operating at its lowest possible latency.
00:31:06.300
There's always some data queued up.
00:31:11.733
There are two other limitations,
00:31:13.966
other than increased latency.
00:31:16.700
First, is that TCP assumes that losses
00:31:19.066
are due to congestion.
00:31:21.600
And historically that's been true. Certainly in
00:31:24.466
wired links, packet loss is almost always
00:31:27.566
caused by a queue filling up,
00:31:30.433
overflowing, and a router not having space
00:31:34.133
to enqueue a packet.
00:31:36.666
In certain types of wireless links,
00:31:39.366
in 4G or in WiFi links,
00:31:41.500
that's not always the case, and you
00:31:43.733
do get packet loss due to corruption.
00:31:46.533
And TCP will treat this as a
00:31:49.000
signal to slow down. Which means that
00:31:51.166
TCP sometimes behaves sub-optimally on wireless links.
00:31:55.366
And there's a mechanism called Explicit Congestion
00:31:57.966
Notification, which we'll talk about in one
00:32:00.400
of the later parts of this lecture,
00:32:01.900
which tries to address that.
00:32:04.400
The other, is that the congestion avoidance
00:32:07.433
phase can take a long time to ramp up.
00:32:10.600
On very long distance links, very high capacity
00:32:16.133
links, it can take a long time
00:32:17.666
to get up to, after packet loss,
00:32:20.300
it can take a very long time
00:32:21.433
to get back up to an appropriate rate.
00:32:23.766
And there are some occasions with very
00:32:26.333
fast long distance links, where it performs
00:32:28.300
poorly, because of the way the congestion
00:32:31.066
avoidance works.
00:32:32.933
And there's an algorithm known as TCP
00:32:34.800
Cubic, which I'll talk about in the
00:32:36.500
next part, which tries to address that.
00:32:40.333
And that's the basics of TCP.
00:32:42.600
The basic TCP congestion control algorithm is
00:32:45.333
a sliding window algorithm, where the window
00:32:48.500
indicates how many packets you’re allowed to
00:32:50.800
send before getting an acknowledgement.
00:32:53.766
The goal of the slow start and
00:32:56.333
the congestion avoidance phases, and the additive
00:32:59.266
increase, multiplicative decrease, is to adapt the
00:33:02.166
size of the window to match the network capacity.
00:33:05.133
It always tries to match the size
00:33:07.166
of the window exactly to the capacity,
00:33:09.633
so it's making the most use of the network resources.
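The window update being described can be summarised in a few lines of Python. This is only a sketch of the additive-increase, multiplicative-decrease idea, with the window counted in packets; the function names are mine, not from the lecture.

def on_ack(cwnd, ssthresh):
    # Slow start: add one packet per ACK, so the window doubles each RTT.
    if cwnd < ssthresh:
        return cwnd + 1
    # Congestion avoidance: additive increase of one packet per RTT.
    return cwnd + 1.0 / cwnd

def on_loss(cwnd):
    # Multiplicative decrease: halve the window (Reno-style) on packet loss.
    ssthresh = max(cwnd / 2.0, 2.0)
    return ssthresh, ssthresh   # new congestion window, new slow-start threshold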
00:33:14.733
In the next part, I’ll move on
00:33:16.933
and talk about an extension to the
00:33:20.033
TCP Reno algorithm, known as TCP Cubic,
00:33:23.066
which is intended to improve performance on
00:33:25.533
very fast and long distance networks.
00:33:27.966
And then, in the later parts,
00:33:29.466
we'll talk about extensions to reduce latency,
00:33:32.600
and to work on wireless links where
00:33:35.933
there are non-congestive losses.
Part 3: TCP Cubic
The third part of the lecture talks about the TCP Cubic congestion
control algorithm, a widely used extension to TCP that improves its
performance on fast, long-distance, networks. The lecture discusses
the limitations of TCP Reno that led to the development of Cubic,
and outlines how Cubic congestion control improves performance but
retains fairness with Reno.
Slides for part 3
00:00:00.833
In the previous part, I spoke about TCP Reno.
00:00:04.133
TCP Reno is the default congestion control
00:00:07.033
algorithm for TCP, but it's actually not
00:00:09.566
particularly widely used in practice these days.
00:00:12.566
What most modern TCP versions use is,
00:00:14.966
instead, an algorithm known as TCP Cubic.
00:00:18.600
And the goal of TCP cubic is
00:00:20.666
to improve TCP performance on fast long distance networks.
00:00:26.033
So the problem with TCP Reno,
00:00:27.966
is that its performance can be comparatively
00:00:30.133
poor on networks with large bandwidth-delay products.
00:00:33.933
That is, networks where the product,
00:00:36.333
what you get when you multiply the
00:00:37.900
bandwidth of the network, in number of
00:00:39.766
bits per second, and the delay,
00:00:42.100
the round trip time of the network, is large.
00:00:45.833
Now, this is not a problem that
00:00:48.066
most people have, most of the time.
00:00:50.466
But, it's a problem that began to
00:00:52.400
become apparent in the early 2000s when
00:00:55.733
people working at organisations like CERN were
00:00:58.500
trying to transfer very large data files
00:01:01.033
across fast long distance
00:01:05.800
networks between CERN and the universities that
00:01:08.933
were analysing the data.
00:01:11.233
For example, CERN is based in Geneva,
00:01:13.800
in Switzerland, and some of the big
00:01:16.566
sites for analysing the data are based
00:01:19.533
at, for example, Fermilab just outside Chicago in the US.
00:01:23.900
And in order to get the data
00:01:26.166
from CERN to Fermilab, from Geneva to Chicago,
00:01:31.366
they put in place multi-gigabit transatlantic links.
00:01:37.566
And if you think about the congestion window needed to
00:01:42.666
make good use of a link like
00:01:44.666
that, you realise it actually becomes quite large.
00:01:48.066
If you assume the link is 10
00:01:50.766
gigabit per second, which was cutting edge
00:01:54.033
in the early 2000s, but it is
00:01:55.833
now relatively common for high-end links these days,
00:01:59.033
and assume 100 milliseconds round trip time,
00:02:02.100
which is possibly even slightly an under-estimate
00:02:04.933
for the path from Geneva to Chicago,
00:02:08.900
in order to make good use
00:02:11.166
of that, you need a congestion window
00:02:12.866
which equals the bandwidth times the delay.
00:02:15.200
And 10 gigabits per second, times 100
00:02:17.633
milliseconds, gives you a congestion window of
00:02:20.233
about 100,000 packets.
00:02:24.166
And, partly, it takes TCP a long
00:02:28.066
time, a comparatively long time, to slow
00:02:31.333
start up to a 100,000 packet window.
00:02:34.266
But that's not such a big issue,
00:02:36.533
because that only happens once at the
00:02:38.066
start of the connection. The issue,
00:02:40.166
though, is in congestion avoidance.
00:02:42.800
If one packet is lost on the
00:02:44.766
link, out of a window of 100,000,
00:02:47.266
that will cause TCP to back-off and
00:02:49.800
halve its window. And it then increases
00:02:53.066
sending rate again, by one packet every round trip time.
00:02:57.300
And backing off from 100,000 packet window
00:03:00.033
to a 50,000 packet window, and then
00:03:02.433
increasing by one each time, means it
00:03:04.766
takes 50,000 round trip times to recover
00:03:07.500
back up to the full window.
00:03:10.400
50,000 round trip times, when the round
00:03:13.000
trip time is 100 milliseconds, is about 1.4 hours.
00:03:17.600
So it takes TCP about one-and-a-half hours
00:03:20.966
to recover from a single packet loss.
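The arithmetic behind those figures can be checked with a short Python calculation. The 1500-byte packet size is my assumption; with it the numbers come out a little lower than the round figures quoted here, but of the same order.

bandwidth = 10e9        # 10 Gb/s link
rtt = 0.100             # 100 ms round-trip time
packet_bits = 1500 * 8  # assumed 1500-byte packets

window = bandwidth * rtt / packet_bits      # bandwidth-delay product in packets
print(round(window), "packets in flight")   # ~83,000, i.e. of the order of 100,000

recovery_rtts = window / 2                  # Reno halves, then adds 1 packet per RTT
print(recovery_rtts * rtt / 3600, "hours to recover")   # roughly 1.2 hours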
00:03:24.300
And, with a window of 100,000 packets,
00:03:27.666
you're sending enough data, at 10 gigabits per second,
00:03:32.033
that the imperfections in the optical fibre,
00:03:35.433
and imperfections in the equipment that are
00:03:37.333
transmitting the packets, become significant.
00:03:40.233
And you're likely to just see occasional
00:03:43.300
random packet losses, just because of imperfections
00:03:46.100
in the transmission medium, even if there's
00:03:48.166
no congestion. And this was becoming a
00:03:50.466
limiting factor, this was becoming a bottleneck
00:03:52.666
in the transmission.
00:03:54.366
It was becoming not possible to build
00:03:56.400
a network that was reliable enough,
00:03:58.733
that it never lost any packets in
00:04:01.433
transferring several hundreds of billions of packets
00:04:03.966
of data,
00:04:05.100
to exchange the data between CERN and
00:04:11.500
the sites which were doing the analysis.
00:04:14.600
TCP cubic is one of a range
00:04:16.733
of algorithms which were developed to try
00:04:19.200
and address this problem. To try and
00:04:22.000
recover much faster than TCP Reno would,
00:04:24.466
in the case when you had very
00:04:26.400
large congestion windows, and small amounts of packet loss.
00:04:32.033
So the idea of TCP cubic,
00:04:34.866
is that it changes the way the
00:04:36.866
congestion control works in the congestion avoidance phase.
00:04:41.200
So, in congestion avoidance, TCP cubic will
00:04:46.033
increase the congestion window faster than TCP
00:04:49.000
Reno would, in cases where the window is large.
00:04:54.366
In cases where the window is relatively
00:04:56.700
small, in the types of networks where
00:04:59.233
Reno has good performance, TCP cubic behaves
00:05:03.800
in a very similar way.
00:05:05.466
But as the windows get bigger,
00:05:07.066
as it gets to a regime where
00:05:09.033
TCP Reno doesn't work effectively, TCP cubic
00:05:11.900
gets more aggressive in adapting its congestion
00:05:15.200
window, and increases the congestion window much
00:05:17.700
more quickly in response to loss.
00:05:21.833
However,
00:05:25.500
as the window approaches the value it
00:05:29.500
was before the loss, it slows its
00:05:31.333
rate of increase: it starts increasing
00:05:33.833
rapidly, then slows its rate of increase
00:05:36.000
as it approaches the previous value.
00:05:38.533
And if it then successfully manages to
00:05:41.666
send at that rate, if it successfully
00:05:44.166
moves above the previous sending rate,
00:05:47.600
then it gradually increases sending rate again.
00:05:51.800
It’s called TCP Cubic because it follows
00:05:54.733
a cubic equation to do this.
00:05:56.333
The shape of the equation, the shape
00:06:00.200
of the curve, we see on the
00:06:01.600
slide for TCP cubic is following a cubic graph.
00:06:05.600
The paper listed on the slide,
00:06:08.466
the paper shown on the slide,
00:06:09.900
from Injong Rhee and his collaborators,
00:06:13.633
is the paper which describes the algorithm in detail.
00:06:16.666
And it was eventually specified in IETF
00:06:19.833
RFC 8312 in 2018, although it's been
00:06:24.366
probably the most widely used TCP variant
00:06:27.666
for a number of years before that.
00:06:31.200
The details of how it works:
00:06:33.566
TCP cubic is a somewhat more complex
00:06:36.066
algorithm than Reno.
00:06:38.966
There are two parts to the behaviour.
00:06:42.066
If a packet is lost when a
00:06:44.866
TCP cubic sender is in the congestion avoidance phase,
00:06:49.233
it does a multiplicative decrease.
00:06:52.133
However, unlike TCP Reno, which does a
00:06:55.300
multiplicative decrease by multiplying by a factor
00:06:58.766
of 0.5, that is, it halves its
00:07:01.566
sending rate if a single packet is lost,
00:07:04.533
TCP Cubic multiplies its rate by 0.7.
00:07:09.500
So, instead of dropping back down to
00:07:11.200
50% of its previous sending rate,
00:07:13.400
it drops down to 70% of the sending rate.
00:07:17.233
It backs-off less, it's more aggressive.
00:07:19.600
It’s more aggressive at using bandwidth.
00:07:23.300
It reduces it’s sending rate in response
00:07:25.733
to loss, but by a smaller fraction.
00:07:31.866
After it's backed-off, TCP cubic also changes
00:07:36.233
the way in which it increases its sending rate in future.
00:07:40.733
So we saw in the previous slide,
00:07:42.500
TCP Reno increases its congestion window by
00:07:46.100
one, for every round trip when it
00:07:48.600
successfully sends data.
00:07:50.800
So if the window backs off to
00:07:53.033
10, then it goes to 11 the
00:07:54.900
next round trip time, then 12,
00:07:56.700
and 13, and so on, with a
00:07:58.466
linear increase in the window.
00:08:02.000
TCP cubic, on the other hand,
00:08:04.033
sets the window as we see in
00:08:06.766
the equation on the slide. It sets
00:08:08.766
the window to be a constant,
00:08:11.233
C, times (T minus K) cubed, plus Wmax.
00:08:17.100
Where the constant, C, is set to
00:08:19.766
0.4, which is a constant which controls
00:08:22.800
how fair it is to TCP Reno,
00:08:25.266
and was determined experimentally.
00:08:28.033
T is the time since the packet
00:08:29.933
loss. K is the time it will
00:08:32.200
take to increase the window back up to
00:08:36.266
the maximum it was before the packet
00:08:40.066
loss, and Wmax is the maximum window
00:08:42.633
size it reached before the loss.
00:08:45.200
And this gives the cubic growth function,
00:08:47.866
which we saw on the previous slide,
00:08:49.600
where the window starts to increase quickly,
00:08:52.033
the growth slows as it approaches that previous value
00:08:55.433
it reached just before the loss,
00:08:57.933
and if it successfully passes through that
00:09:00.033
point, the rate of growth increases again.
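Here is a minimal Python sketch of just that growth curve, using the constants mentioned on the slide, a back-off to 0.7 and C = 0.4. K is derived from the condition that the window starts at 0.7 times Wmax immediately after the loss; this is only the window function, not a full Cubic implementation, and the example numbers are assumptions.

C = 0.4       # fairness/aggressiveness constant from the slide
BETA = 0.7    # multiplicative decrease factor: back off to 70%

def cubic_window(t, w_max):
    # W(t) = C * (t - K)^3 + Wmax, with t in seconds since the loss
    # and windows measured in packets.
    k = ((1 - BETA) * w_max / C) ** (1.0 / 3.0)   # time taken to climb back to Wmax
    return C * (t - k) ** 3 + w_max

# Example: the window was 1000 packets when the loss happened.
for t in range(0, 12, 2):
    print(t, round(cubic_window(t, 1000)))
# Starts at 700, flattens out near 1000 around t = K (about 9 seconds), then grows again.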
00:09:03.766
Now, that's the high-level version. And we
00:09:06.666
can already see it's more complex than
00:09:09.266
the TCP Reno equation. The algorithm on
00:09:13.766
the right of the slide, which is
00:09:16.433
intentionally presented in a way which is
00:09:18.933
completely unreadable here,
00:09:21.166
shows the full details. The point is
00:09:24.233
that there's a lot of complexity here.
00:09:27.300
The basic equation, the basic back-off to
00:09:30.766
0.7 times and then follow the cubic
00:09:33.133
equation, to increase rapidly, slow the rate
00:09:36.666
of increase, and then increase rapidly again
00:09:39.100
if it successfully gets past the previous bottleneck point,
00:09:43.133
is enough to illustrate the key principle.
00:09:46.300
The rest of the details are there
00:09:48.133
to make sure it's fair with TCP
00:09:50.066
Reno on links which are slower,
00:09:52.366
or where the round trip time is shorter.
00:09:55.600
And so, in the regime where TCP
00:09:57.733
Reno can successfully make use of the
00:09:59.833
link, TCP Cubic behaves the same way.
00:10:02.866
And, as you get into a regime
00:10:05.000
where Reno can't effectively make use of
00:10:07.666
the capacity, because it can't sustain a
00:10:09.466
large enough congestion window,
00:10:11.133
then cubic starts to behave differently,
00:10:14.433
and starts to switch to the cubic
00:10:16.666
equation. And that allows it to recover
00:10:19.700
from losses more quickly, and to more
00:10:21.833
effectively continue to make use of higher
00:10:23.800
bandwidths and higher latency paths.
00:10:29.200
TCP cubic is the default in most
00:10:33.200
modern operating systems. It’s the default in
00:10:36.866
Linux, it's the default in FreeBSD,
00:10:39.733
I believe it's the default in macOS
00:10:42.733
and iPhones.
00:10:44.666
Microsoft Windows has an algorithm called Compound
00:10:48.566
TCP which is a different algorithm,
00:10:50.900
but has a similar effect.
00:10:54.166
It’s much more complex than TCP Reno.
00:10:56.900
The core response, the back off to
00:11:00.033
70% and then follow the characteristic cubic
00:11:03.900
curve, is conceptually relatively straightforward, but once
00:11:07.733
you start looking at the details of
00:11:09.966
how it behaves, there gets to be a lot of complexity.
00:11:13.833
And most of that is in there
00:11:16.333
to make sure it's reasonably fair to
00:11:19.433
TCP, to TCP Reno, in the regime
00:11:22.833
where Reno typically works. But it improves
00:11:26.233
performance for networks with longer round trip
00:11:28.366
times and higher bandwidths.
00:11:32.033
Both TCP Cubic, and TCP Reno,
00:11:35.933
use congestion control, use packet loss as
00:11:39.800
a congestion signal. And they both eventually
00:11:42.733
fill the router buffers.
00:11:44.533
And TCP cubic does so more aggressively
00:11:47.133
than Reno. So, in both cases,
00:11:49.400
they're trading off latency for throughput.
00:11:51.666
They're trying to make sure the buffers are full.
00:11:53.933
They're trying to make sure
00:11:56.166
the buffers in the intermediate routers are full.
00:11:58.866
And they're both making sure that they
00:12:02.066
keep the congestion window large enough to
00:12:04.433
keep the buffers fully utilised, so packets
00:12:08.633
keep arriving at the receiver at all times.
00:12:11.300
And that's very good for achieving high
00:12:13.033
throughput, but it pushes the latency up.
00:12:16.300
So, again, they’re trading-off increased latency for
00:12:19.933
good performance, for good throughput.
00:12:25.333
And that's what I want to say
00:12:26.666
about Cubic. Again, the goal is to
00:12:29.566
use a different response function to improve
00:12:32.333
throughput on very fast, long distance, links,
00:12:36.100
multi-gigabit per second transatlantic links, being the
00:12:39.833
common example.
00:12:42.300
And the goal is to make good
00:12:44.966
use of throughput.
00:12:47.633
In the next part I’ll talk about
00:12:50.600
alternatives which, rather than focusing on throughput,
00:12:53.800
focus on keeping latency bounded whilst achieving
00:12:57.533
reasonable throughput.
Part 4: Delay-based Congestion Control
The 4th part of the lecture discussed how both the Reno and Cubic
algorithms impact latency. It shows how their loss-based response
to congestion inevitably causes router queues to fill, increasing
path latency, and discusses how this is unavoidable with loss-based
congestion control. It introduces the idea of delay-based congestion
control and the TCP Vegas algorithm, highlights its potential benefits
and deployment challenges. Finally, TCP BBR is briefly introduced as
an experimental extension that aims to achieve some of the benefits
of delay-based congestion control, in a deployable manner.
Slides for part 4
00:00:00.566
In the previous parts, I’ve spoken about
00:00:02.700
TCP Reno and TCP cubic. These are
00:00:05.866
the standard, loss based, congestion control algorithms
00:00:08.966
that most TCP implementations use to adapt
00:00:11.933
their sending rate. These are the standard
00:00:14.933
congestion control algorithms for TCP.
00:00:17.566
What I want to do in this
00:00:19.100
part is recap, why these algorithms cause
00:00:23.033
additional latency in the network, and talk
00:00:25.933
about two alternatives which try to adapt
00:00:29.966
the sending rate of TCP without building
00:00:32.933
up queues, and without
00:00:34.800
overloading the network and causing too much latency.
00:00:40.400
So, as I mentioned, TCP Cubic and
00:00:42.900
TCP Reno both aim to fill up the network.
00:00:46.466
They use packet loss as a congestion signal.
00:00:50.300
So the way they work is they
00:00:52.733
gradually increase their sending rate, they’re in
00:00:55.900
either slow start or congestion avoidance phase,
00:00:58.900
and they’re always gradually increasing the sending
00:01:01.433
rates, gradually filling up the queues in
00:01:03.766
the network, until those queues overflow.
00:01:07.333
At that point a packet is lost.
00:01:09.733
The TCP backs-off its sending rate,
00:01:13.466
it backs-off its window, which allows the
00:01:16.133
queue to drain, but as the queue
00:01:18.200
is draining, both
00:01:19.766
Reno and Cubic are increasing their sending
00:01:22.533
rate, are increasing the sending window,
00:01:25.366
so they gradually start filling up
00:01:27.833
the queue again.
00:01:29.266
As we saw, the queues in the
00:01:31.400
network oscillate, but they never quite empty.
00:01:34.333
And both Reno and Cubic, the goal
00:01:36.866
is to keep some packets queued up
00:01:39.766
in the network, make sure there's always
00:01:42.233
some data queued up, so they can
00:01:44.000
keep delivering data.
00:01:47.366
And, no matter how big a queue
00:01:50.300
you put in the network, no matter
00:01:52.200
how much memory you give the routers
00:01:53.866
in the network, TCP Reno and TCP
00:01:57.266
cubic will eventually cause it to overflow.
00:02:00.800
They will keep sending, they'll keep increasing
00:02:04.233
the sending rate, until whatever queue is
00:02:06.866
in the network it's full, and it overflows.
00:02:10.333
And the more memory in the routers,
00:02:12.133
the more buffer in the routers,
00:02:13.900
the longer that queue will get and
00:02:15.833
the worse the latency will be.
00:02:18.433
But in all cases, in order to
00:02:21.366
achieve very high throughput, in order to
00:02:23.533
keep the network busy, keep the bottleneck
00:02:25.433
link busy, TCP Reno and TCP cubic
00:02:29.033
queue some data up.
00:02:31.100
And this adds latency.
00:02:34.300
It means that, whenever there’s TCP Reno,
00:02:37.866
whenever there’s TCP cubic flows, using the
00:02:40.300
network, the queues will have data queued up.
00:02:45.800
There’ll always be data queued up for
00:02:47.800
delivery. There's always packets waiting for delivery.
00:02:50.933
So it forces the network to work
00:02:53.133
in a regime where there's always some
00:02:56.566
excess latency.
00:03:01.333
Now, this is a problem for real-time
00:03:05.066
applications. It’s a problem if you're running
00:03:07.233
a video conferencing tool, or a telephone
00:03:11.366
application, or a game, or a real
00:03:13.766
time control application, because you want low
00:03:16.633
latency for those applications.
00:03:19.133
So it would be desirable if we
00:03:21.166
could have an alternative to TCP
00:03:23.600
Reno or TCP cubic that can achieve
00:03:25.800
good throughput for TCP, without forcing the
00:03:28.400
queues to be full.
00:03:31.433
One attempt at doing this was a proposal called TCP Vegas.
00:03:37.366
And the insight from TCP Vegas is that
00:03:42.800
you can watch the rate of growth,
00:03:45.800
or increase, of the queue, and use
00:03:48.633
that to infer whether you're sending faster,
00:03:50.700
or slower, than the network can support.
00:03:54.233
The insight was, if you're sending,
00:03:56.166
if a TCP is sending, faster than
00:03:58.366
the maximum capacity a network can deliver
00:04:00.933
at, the queue will gradually fill up.
00:04:03.500
And as the queue gradually fills up,
00:04:05.533
the latency, the round trip time, will gradually increase.
00:04:10.066
TCP Cubic, and TCP Reno, wait until
00:04:13.933
the queue overflows, wait until there's no
00:04:16.133
more space to put new packets in,
00:04:18.066
and a packet is lost, and at
00:04:19.800
that point they slow down.
00:04:22.666
The insight for TCP Vegas was to
00:04:25.300
watch as the delay increases, and as
00:04:28.500
it sees the delay increasing, it slows
00:04:31.300
down before the queue overflows.
00:04:34.533
So it uses the gradual increase in
00:04:36.366
the round trip time, as an indication
00:04:38.500
that it should send slower.
00:04:40.800
And as the round-trip time reduces,
00:04:43.033
as the round-trip time starts to drop,
00:04:45.066
it treats that as an indication that
00:04:46.933
the queue is draining, which means it can send faster.
00:04:50.766
It wants a constant round trip time.
00:04:53.366
And, if the round trip time increases,
00:04:55.300
it reduces its rate; and if the
00:04:57.933
round-trip time decreases, it increases its rate.
00:05:00.200
So, it's trying to balance its rate
00:05:03.033
with the round trip time, and not
00:05:04.866
build or shrink the queues.
00:05:08.333
And because you can detect the queue
00:05:10.966
building up before it overflows, you can
00:05:14.233
take action before the queue is completely
00:05:16.133
full. And that means the queue is
00:05:18.466
running with lower occupancy, so you have
00:05:21.000
lower latency across the network.
00:05:23.666
It also means that because packets are
00:05:25.533
not being lost, you don't need to
00:05:27.866
re-transmit as many packets. So it improves
00:05:30.700
the throughput that way, because you're not
00:05:32.600
resending data that you've already sent and has gotten lost.
00:05:36.633
And that's the fundamental idea of TCP
00:05:38.966
Vegas. It doesn't change the slow start behaviour at all.
00:05:42.566
But, once you're into congestion avoidance,
00:05:44.900
it looks at the variation in round
00:05:47.100
trip time rather than looking at packet
00:05:49.200
loss, and uses that to drive the
00:05:51.366
variation in the speed at which it’s sending.
00:05:56.566
The details of how it works.
00:05:59.466
Well, first, it tries to estimate what
00:06:01.766
it calls the base round trip time.
00:06:04.766
So every time it sends a packet,
00:06:07.033
it measures how long it takes to
00:06:08.733
get a response. And it tries to
00:06:10.800
find the smallest possible response time.
00:06:14.166
The idea being that the smallest time
00:06:17.366
it gets a response, would be the
00:06:18.833
time when the queue is at its emptiest.
00:06:21.766
It may not get the actual,
00:06:23.466
completely empty, queue, but by taking the smallest
00:06:26.066
response time it's trying to estimate the
00:06:29.866
time it takes when there's nothing else in the network.
00:06:34.066
And anything on top of that indicates
00:06:36.233
that there is data queued up somewhere in the network.
00:06:41.133
Then it calculates an expected sending rate.
00:06:45.266
It takes the window size, which indicates
00:06:48.033
how many packets it's supposed to send
00:06:50.533
in that round-trip time,
00:06:52.533
how many bytes of data it’s supposed
00:06:54.366
to send in that round-trip time,
00:06:56.066
and it divides it by the base
00:06:57.433
round trip time. So if you divide
00:07:00.633
number of bytes by time, you get
00:07:03.166
a bytes per second, and that gives
00:07:05.566
you the rate at which it should be sending data.
00:07:09.333
And if the network can
00:07:12.033
support sending at that rate, it should
00:07:14.366
be able to deliver that window of
00:07:17.800
packets within a complete round trip time.
00:07:20.866
And, if it can’t, it will take
00:07:22.566
longer than a round trip time to
00:07:24.300
deliver that window of packets, and the
00:07:25.866
queues will be gradually building up. Alternatively,
00:07:28.866
if it takes less than a round
00:07:30.333
trip time, this is an indication that
00:07:31.900
the queues are decreasing.
00:07:35.500
And it measures the actual rate at
00:07:37.100
which it sends the packets.
00:07:39.466
And it compares them.
00:07:41.600
And if the actual rate at which
00:07:43.166
it's sending packets is less than the
00:07:45.466
expected rate, if it's taking longer than
00:07:47.733
a round-trip time to deliver the complete
00:07:49.633
window worth of packets, this is a
00:07:51.700
sign that the packets can’t all be delivered.
00:07:56.866
And it, you know, it's trying to send too
00:07:59.966
much. It’s trying to send at too
00:08:01.666
fast a rate, and it should reduce
00:08:03.166
its rate and let the queues drop.
00:08:05.900
Equally, in the other case it should
00:08:08.333
increase its rate, and measuring the difference
00:08:10.966
between the actual and the expected rates,
00:08:13.800
it can measure whether the queues are growing or shrinking.
00:08:18.733
And TCP Vegas compares the expected rate,
00:08:21.966
which actually manages to send at,
00:08:24.566
the expected rate at which it gets
00:08:26.566
the acknowledgments back, with the actual rate.
00:08:30.600
And it adjusts the window.
00:08:34.333
And if the expected rate, minus the
00:08:37.700
actual rate, is less than some threshold,
00:08:40.700
that indicates that it should increase its
00:08:43.666
window. And if the expected rate,
00:08:45.933
minus the actual rate, is greater than
00:08:48.000
some other threshold, then it should decrease the window.
00:08:51.266
That is, if data is arriving at
00:08:53.633
the expected rate, or very close to
00:08:56.200
it, this is probably a sign that
00:08:58.366
the network can support a higher rate,
00:09:00.533
and you should try sending a little bit faster.
00:09:03.566
Alternatively, if data is arriving slower
00:09:06.133
than it's being sent,
00:09:07.133
this is a sign that you're sending too fast and you
00:09:09.233
should slow down.
00:09:10.833
And the two thresholds, R1 and R2,
00:09:12.933
determine how close you have to be
00:09:15.033
to the expected rate, and how far
00:09:16.866
away from it you have to be in order to slow down.
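Here is a hedged Python sketch of that comparison. Vegas-style implementations usually express the two thresholds, called R1 and R2 on the slide, in packets of queued data, commonly around 2 and 4; those values, and the function name, are assumptions for illustration.

R1 = 2   # if fewer than ~2 packets appear to be queued, speed up
R2 = 4   # if more than ~4 packets appear to be queued, slow down

def vegas_update(cwnd, base_rtt, measured_rtt):
    # cwnd in packets, RTTs in seconds; base_rtt is the smallest RTT seen so far.
    expected = cwnd / base_rtt                # rate if nothing were queued
    actual = cwnd / measured_rtt              # rate actually achieved this round trip
    queued = (expected - actual) * base_rtt   # estimate of our packets sitting in the queue
    if queued < R1:
        return cwnd + 1    # queue nearly empty: the network can take a little more
    if queued > R2:
        return cwnd - 1    # queue building up: back off before it overflows
    return cwnd            # about right: hold the rate steady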
00:09:20.733
And the result is that TCP Vegas
00:09:24.700
follows a much smoother transmission rate.
00:09:28.300
Unlike TCP Reno, which follows the characteristic
00:09:31.700
sawtooth pattern, or TCP cubic which follows the
00:09:35.866
cubic equation to change its rate,
00:09:39.533
both of which adapt quite abruptly whenever
00:09:43.233
there's a packet loss,
00:09:44.933
TCP Vegas makes a gradual change.
00:09:47.266
It gradually increases, or decreases, its sending
00:09:50.466
rate in line with the variations in
00:09:52.833
the queues. So, it’s a much smoother
00:09:54.900
algorithm, which doesn't continually build up and
00:09:58.266
empty the queues.
00:10:01.166
Because the queues are not continually building
00:10:03.966
up, not continually being filled, this keeps
00:10:08.366
the latency down
00:10:09.400
while still achieving reasonably good performance.
00:10:15.833
TCP Vegas is a good idea in principle.
00:10:21.633
This idea is known as delay-based congestion
00:10:24.600
control, and I think it's actually a
00:10:26.500
really good idea in principle. It reduces
00:10:29.666
the latency, because it doesn't fill the queues.
00:10:33.100
It reduces the packet loss, because it's
00:10:35.300
not causing, it's not pushing the queues
00:10:38.133
to overflow and causing packets to be
00:10:39.833
lost. So the only packet losses you
00:10:42.233
get are those caused by transmission problems.
00:10:45.433
And this reduces unnecessary retransmissions, reduces you having
00:10:48.766
to retransmit packets, because you forced the
00:10:50.633
network into overload, and forced it to
00:10:52.633
lose the packets, and it reduces the latency.
00:10:57.200
The problem with TCP Vegas is that
00:11:00.600
it doesn't work, doesn’t interwork with,
00:11:03.900
TCP Reno or TCP cubic.
00:11:07.833
If you have any TCP Reno or
00:11:10.200
Cubic flows on the network, they will
00:11:12.300
aggressively increase their sending rate and try
00:11:15.300
to fill the queues, and push
00:11:17.300
the queues into overload.
00:11:19.966
And this will increase the round-trip time,
00:11:22.966
reduce the rate at which Vegas can
00:11:26.300
send, and it will force TCP Vegas to slow down.
00:11:30.033
Because TCP Vegas sees the queues increasing,
00:11:33.033
because Cubic and Reno are intentionally trying
00:11:36.266
to fill those queues, and if the
00:11:38.333
queues increase, this causes Vegas to slow down.
00:11:41.200
That gradually means there's more space in
00:11:44.200
the queues, which Cubic and Reno will
00:11:46.633
gradually fill-up, which causes Vegas to slow
00:11:49.200
down, and they end up in a
00:11:50.900
spiral, where the TCP Vegas flows get
00:11:52.800
pushed down to zero, and the Reno
00:11:55.700
or Cubic flows use all of the capacity.
00:11:59.333
So if we only have TCP Vegas
00:12:01.400
in the network, I think it would
00:12:03.466
behave really nicely, and we get really
00:12:05.500
good, low latency, behaviour from the network.
00:12:08.900
Unfortunately we're in a world where Reno,
00:12:11.933
and Cubic, have been deployed everywhere.
00:12:14.733
And without a step change, without an
00:12:18.933
overnight switch where we turn off Cubic,
00:12:21.966
and we turn off Reno, and we
00:12:23.366
turn on Vegas everywhere, we can't deploy
00:12:25.900
TCP Vegas, because it always loses out to
00:12:28.866
Reno and Cubic.
00:12:31.166
So, it's a good idea in principle,
00:12:33.233
but in practice it can't be used
00:12:35.033
because of the deployment challenge.
00:12:40.600
As I say, it's a good idea
00:12:42.733
in principle, and the idea of using
00:12:45.433
delay as a congestion signal is a
00:12:47.766
good idea in principle, because we can
00:12:50.066
get something which achieves lower latency.
00:12:54.866
Is it possible to deploy a different
00:12:57.733
algorithm? Maybe the problem is not the principle,
00:13:00.266
maybe the problem is the algorithm in TCP Vegas?
00:13:05.466
Well, people are trying alternatives which are delay based.
00:13:10.233
And the most recent attempt at this
00:13:12.966
is an algorithm called TCP BBR,
00:13:15.200
Bottleneck Bandwidth and Round-trip time.
00:13:18.466
And again, this is a proposal that
00:13:20.533
came out of Google. And one of
00:13:23.133
the co-authors, if you look at the
00:13:25.533
paper on the right, is Van Jacobson,
00:13:28.033
who was the original designer of TCP
00:13:30.300
congestion control. So there's clearly some smart
00:13:32.833
people behind this.
00:13:34.600
The idea is that it tries to explicitly
00:13:36.966
measure the round-trip time as it sends
00:13:39.500
the packets. It tries to explicitly measure
00:13:42.133
the sending rate in much the same way that
00:13:45.666
TCP Vegas does. And, based on those
00:13:48.233
measurements, and some probes where it varies
00:13:51.533
its rate to try and find if
00:13:53.400
it's got more capacity, or try and
00:13:55.400
sense if there is other traffic on the network,
00:13:58.533
it tries to directly set a congestion
00:14:01.066
window that matches the network capacity,
00:14:04.066
based on those measurements.
00:14:06.533
And, because this came out of Google,
00:14:08.600
it got a lot of press,
00:14:10.666
and Google turned it on for a
00:14:13.533
lot of their traffic. I know they
00:14:15.433
were running it for YouTube for a
00:14:16.866
while, and a lot of people saw
00:14:18.966
this, and jumped on the bandwagon.
00:14:21.333
And, for a while, it was starting
00:14:23.100
to get a reasonable amount of deployments.
00:14:27.100
The problem is, it turns out not to work very well.
00:14:31.066
And Justine Sherry at Carnegie Mellon University,
00:14:36.733
and her PhD student Ranysha Ware,
00:14:39.500
did a really nice bit of work
00:14:41.533
that showed that it is incredibly unfair to
00:14:44.400
regular TCP traffic.
00:14:46.766
And, it's unfair in kind-of the opposite
00:14:49.633
way to Vegas. Whereas TCP Reno and
00:14:53.600
TCP Cubic would force TCP Vegas flows
00:14:56.400
down to nothing, TCP BBR is unfair
00:14:59.766
in the opposite way, and it demolishes
00:15:02.600
Reno and Cubic flows, and causes tremendous
00:15:05.266
amounts of packet loss for those flows.
00:15:08.266
So it's really much more aggressive than
00:15:11.133
the other flows in certain cases,
00:15:13.233
and this leads to really quite severe unfairness problems.
00:15:17.533
And the Vimeo link on the slide is a link to the talk at
00:15:24.133
the Internet Measurement Conference, where Ranysha talks
00:15:28.233
through that, and demonstrates really clearly that
00:15:30.966
TCP BBR is really quite problematic, and
00:15:36.033
not very safe to deploy on the current network.
00:15:41.066
And there's a variant called
00:15:43.100
BBR v2, which is under development,
00:15:46.266
and seems to be changing,
00:15:48.566
certainly on a monthly basis, which is
00:15:51.433
trying to solve these problems. And this
00:15:53.866
is very much an active research area,
00:15:55.833
where people are looking to find better alternatives.
00:16:01.966
So that's the principle of delay-based congestion control.
00:16:05.400
Traditional TCP, the Reno algorithm and the
00:16:09.100
Cubic algorithms, intentionally try to fill the
00:16:12.166
queues, they intentionally try to cause latency.
00:16:16.633
TCP Vegas is one well-known algorithm which
00:16:20.833
tries to solve this, and
00:16:24.200
doesn't work in practice, but in principle
00:16:27.766
is a good idea, it just has
00:16:30.033
some deployment challenges, given the installed base
00:16:32.800
of Reno and Cubic.
00:16:35.366
And there are new algorithms, like TCP
00:16:38.200
BBR, which don't currently work well,
00:16:41.466
but have potential to solve this problem.
00:16:44.466
And, hopefully, in the future, a future
00:16:47.166
variant of BBR will work effectively,
00:16:51.800
and we'll be able to transition to
00:16:53.633
a lower latency version of TCP.
Part 5: Explicit Congestion Notification
The use of delay-based congestion control is one way of reducing
network latency. Another is to keep Reno and Cubic-style congestion
control, but to move away from using packet loss as an implicit
congestion signal, and instead provide an explicit congestion
notification from the network to the applications. This part of
the lecture introduces the ECN extension to TCP/IP that provides
such a feature, and discusses its operation and deployment.
Slides for part 5
00:00:00.433
In the previous parts of the lecture,
00:00:02.166
I’ve discussed TCP congestion control. I’ve discussed
00:00:05.566
how TCP tries to measure what the
00:00:07.700
network's doing and, based on those measurements,
00:00:10.266
adapt its sending rate to match the
00:00:12.433
available network capacity.
00:00:14.466
In this part, I want to talk
00:00:15.866
about an alternative technique, known as Explicit
00:00:18.300
Congestion Notification, which allows the network to
00:00:20.733
directly tell TCP when it's sending too
00:00:22.966
fast, and needs to reduce its transmission rate.
00:00:28.500
So, as we've discussed, TCP infers the
00:00:31.833
presence of congestion in the network through measurement.
00:00:36.066
If you're using TCP Reno or TCP
00:00:39.066
Cubic, like most TCP flows in the
00:00:42.466
network today, then the way it infers
00:00:45.500
that is because there's packet loss.
00:00:48.033
TCP Reno and TCP Cubic keep gradually
00:00:51.400
increasing their sending rates, trying to cause
00:00:54.333
the queues to overflow.
00:00:56.200
And they cause a queue overflow,
00:00:58.366
cause a packet to be lost,
00:00:59.800
and use that packet loss as the
00:01:01.366
signal that the network is busy,
00:01:04.200
that they've reached the network capacity,
00:01:05.966
and they should reduce the sending rate.
00:01:09.066
And this is problematic for two reasons.
00:01:11.866
First, is because it increases delay.
00:01:15.266
It's continually pushing the queues to be
00:01:18.266
full, which means the network’s operating with
00:01:20.833
full queues, with its maximum possible delay.
00:01:24.400
And the second is because it makes
00:01:27.066
it difficult to distinguish loss which is
00:01:29.533
caused because the queues overflowed, from loss
00:01:32.766
caused because of a transmission error on
00:01:35.900
a link, so called non-congestive loss,
00:01:38.533
which you might get due to interference on a wireless link.
00:01:43.766
The other approach people have discussed,
00:01:45.666
is the approach in TCP Vegas,
00:01:48.233
where you look at the variation in queuing latency
00:01:51.500
and use that as an indication of congestion.
00:01:54.400
So, rather than pushing the queue until
00:01:56.333
it overflows, and detecting the overflow,
00:01:58.866
you watch to see as the queue
00:02:00.733
starts to get bigger, and use that
00:02:02.633
as an indication that you should reduce
00:02:04.233
your sending rate. Or, equally, you spot
00:02:07.300
the queue getting smaller, and use that
00:02:08.900
as an indication that you should maybe
00:02:10.466
increase your sending rate.
00:02:12.700
And this is conceptually a good idea,
00:02:14.566
as we discussed in the last part,
00:02:16.733
because it lets you run TCP with
00:02:18.866
lower latency. But it's difficult to deploy,
00:02:21.833
because it interacts poorly with TCP Cubic
00:02:25.333
and TCP Reno, both of which try
00:02:27.833
to fill the queues.
00:02:31.966
As a result, we're stuck with using
00:02:34.333
Reno and Cubic, and we're stuck with
00:02:36.333
full queues in the network. But we'd
00:02:38.900
like to avoid this, we'd like to
00:02:40.466
go for a lower latency way of
00:02:42.666
using TCP, and make the network work
00:02:45.533
without filling the queues.
00:02:49.300
So one way you might go about
00:02:50.766
doing this is, rather than have TCP
00:02:54.200
push the queues to overflow,
00:02:56.966
have the network rather tell TCP when
00:02:59.866
it's sending too fast.
00:03:02.433
Have something in the network tell the
00:03:04.933
TCP connections that they are congesting the
00:03:07.666
network, and they need to slow down.
00:03:11.233
And this thing is called Explicit Congestion Notification.
00:03:17.333
Explicit Congestion Notification, the ECN bits,
00:03:21.733
are present in the IP header.
00:03:25.266
The slide shows an IPv4 header with
00:03:27.833
the ECN bits indicated in red.
00:03:30.333
The same bits are also present in
00:03:32.500
IPv6, and they're located in the same
00:03:34.766
place in the packet in the IPv6 header.
00:03:38.066
The way these are used.
00:03:40.233
If the sender doesn't support ECN,
00:03:42.866
it sets these bits to zero when
00:03:44.700
it transmits the packet. And they stay
00:03:46.866
at zero, nothing touches them at that point.
00:03:50.233
However, if the sender does support ECN,
00:03:52.933
it sets these bits to have
00:03:54.700
the value 01, so it sets bit
00:03:57.400
15 of the header to be 1,
00:04:00.433
and it transmits the IP packets as
00:04:02.933
normal, except with this one bit set
00:04:05.066
to indicate that the sender understands ECN.
00:04:10.000
If congestion occurs in the network,
00:04:12.966
if some queue in the network is
00:04:16.333
beginning to get full, it’s not yet
00:04:19.266
at the point of overflow but it's
00:04:20.733
beginning to get full, such that some
00:04:22.800
router in the network thinks it's about
00:04:24.833
to start experiencing congestion,
00:04:27.200
then that router, that router in the
00:04:30.100
network, changes those bits in the IP
00:04:32.433
packets, of some of the packets going
00:04:34.233
past, and sets both of the ECN bits to one.
00:04:38.266
This is known as an ECN Congestion Experienced mark.
00:04:42.333
It's a signal. It's a signal from
00:04:44.966
the network to the endpoints, that the
00:04:47.500
network thinks it's getting busy, and the
00:04:49.266
endpoint should slow down.
00:04:53.266
And that's all it does. It monitors
00:04:55.466
the occupancy in the queues, and if
00:04:57.766
the queue occupancy is higher than some
00:04:59.466
threshold, it sets the ECN bits in
00:05:01.666
the packets going past, to indicate that
00:05:04.766
threshold has been reached and the network
00:05:06.766
is starting to get busy.
00:05:09.233
If the queue overflows,
00:05:11.133
if the endpoints keep sending faster and
00:05:13.866
the queue overflows, then it drops the
00:05:15.466
packet as normal. The only difference
00:05:17.433
is that there's some intermediate point where
00:05:19.766
the network is starting to get busy,
00:05:21.500
but the queue has not yet overflowed.
00:05:23.966
And at that point, the network marks
00:05:25.666
the packets to indicate that it's getting busy.
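For reference, these are the four values the two ECN bits can take, together with a deliberately simplified Python sketch of the marking decision just described; the threshold test stands in for whatever queue-management policy a real router uses.

NOT_ECT = 0b00   # sender does not support ECN
ECT_1   = 0b01   # ECN-capable transport, the "01" pattern set by the sender
ECT_0   = 0b10   # the alternative ECN-capable codepoint
CE      = 0b11   # Congestion Experienced, set by a router whose queue is filling

def router_process(ecn_bits, queue_above_threshold):
    # If the packet is ECN-capable and the queue is getting full, mark it
    # instead of waiting for the queue to overflow and dropping it.
    if queue_above_threshold and ecn_bits in (ECT_0, ECT_1):
        return CE
    return ecn_bits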
00:05:32.100
A receiver might get a TCP packet,
00:05:35.133
a TCP segment, delivered within an IP
00:05:37.866
packet, where that IP packet has the
00:05:40.700
ECN Congestion Experienced mark set. Where the
00:05:43.666
network has changed those two bits in
00:05:45.766
the IP header to 11, to indicate
00:05:48.800
that it's experiencing congestion.
00:05:52.366
What it does at that
00:05:54.666
point, is it sets a bit in
00:05:58.100
the TCP header of the acknowledgement packet
00:06:01.600
it sends back to the sender.
00:06:04.266
That bit’s known as the ECN Echo
00:06:06.866
field, the ECE field. It sets this
00:06:09.933
bit in the TCP header equal to
00:06:12.633
one on the next packet it sends
00:06:15.600
back to the sender, after it received
00:06:18.033
the IP packet, containing the TCP segment,
00:06:21.400
where that IP packet was marked Congestion Experienced.
00:06:26.133
So the receiver doesn't really do anything
00:06:28.833
with the Congestion Experienced mark, other than
00:06:31.233
set the equivalent mark in the
00:06:33.533
packet it sends back to the sender.
00:06:35.866
So it's telling the sender, “I got
00:06:37.733
a Congestion Experienced mark in one of
00:06:39.900
the packets you sent”.
00:06:43.600
When that packet gets to the sender,
00:06:46.600
the sender sees this bit in the
00:06:48.866
TCP header, the ECN Echo bit set
00:06:52.133
to one, and it realises that the
00:06:54.200
data it was sending
00:06:56.433
caused a router on the path to
00:07:00.000
set the ECN Congestion Experienced mark,
00:07:03.000
which the receiver has then fed back to it.
00:07:07.333
And what it does at that point,
00:07:09.100
is it reduces its congestion window.
00:07:11.800
It acts as-if a packet had been
00:07:15.000
lost, in terms of how it changes its congestion window.
00:07:19.066
So if it's a TCP Reno sender,
00:07:21.733
it will halve its congestion window,
00:07:24.200
the same way it would if a packet was lost.
00:07:27.000
If it's a TCP Cubic sender,
00:07:29.200
it will back off its congestion window
00:07:31.533
to 70%, and then enter the weird
00:07:35.533
cubic equation for changing its congestion window.
00:07:41.033
After it does that, it sets another
00:07:43.900
bit in the header of the next
00:07:47.366
TCP segment it sends out. It sets
00:07:49.900
the CWR bit, the Congestion Window Reduced
00:07:52.533
bit, in the header to tell the
00:07:54.533
network and the receiver that it's done it.
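Here is a minimal sketch of that TCP-level exchange, not a real TCP stack: the receiver echoes a Congestion Experienced mark as ECE, and the sender reduces its window as if a loss had occurred and then sets CWR. A real implementation reacts to ECE at most once per round trip; this sketch reacts to every marked ACK for simplicity.

CE = 0b11   # Congestion Experienced codepoint in the IP header

def receiver_ack_flags(ip_ecn_bits):
    # Receiver: feed the congestion signal back in the TCP header of the ACK.
    return {"ECE": ip_ecn_bits == CE}

def sender_on_ack(cwnd, ack_flags, cubic=False):
    # Sender: treat ECE like a packet loss for window purposes, then set CWR
    # on the next segment to say the window has been reduced.
    if ack_flags.get("ECE"):
        cwnd = cwnd * 0.7 if cubic else cwnd * 0.5   # Cubic backs off to 70%, Reno halves
        return cwnd, {"CWR": True}
    return cwnd, {"CWR": False}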
00:07:59.200
So the end result of this,
00:08:00.933
is that rather than a packet being lost
00:08:03.900
because the queue overflowed, and then the
00:08:06.500
acknowledgments coming back indicating, via the triple
00:08:09.466
duplicate ACK, that a packet had been
00:08:11.166
lost, and then TCP reducing its congestion
00:08:14.266
window and re-transmitting that lost packet.
00:08:17.866
What happens is,
00:08:20.633
the IP packets, TCP packets, in the
00:08:24.366
outbound direction gets a Congestion Experienced mark
00:08:27.400
set, to indicate that the network is
00:08:29.566
starting to get full.
00:08:31.633
The ECN Echo bit is set on
00:08:33.500
the reply, and at that point the
00:08:35.666
sender reduces its window,
00:08:37.733
as-if the loss had occurred.
00:08:42.700
And then carries on sending with the
00:08:44.633
CWR bit set to one on that
00:08:46.533
next packet. So it has the same
00:08:49.000
effect, in terms of reducing the congestion window, as would
00:08:52.600
dropping a packet, but without dropping a
00:08:54.766
packet. So there's no actual packet loss
00:08:56.933
here, there’s just a mark to indicate
00:08:58.833
that the network was getting busy.
00:09:00.500
So it doesn't have to retransmit data,
00:09:02.666
and this happens before the queue is
00:09:04.500
full, so you get lower latency.
00:09:08.300
So ECN is a mechanism to allow
00:09:11.766
TCP to react to congestion before packet loss occurs.
00:09:16.600
It allows routers in the network to
00:09:18.700
signal congestion before the queue overflows.
00:09:21.866
It allows routers in the network to
00:09:23.500
say to TCP, “if you don't slow
00:09:25.566
down, this queue is going to overflow,
00:09:27.900
and I’m going to throw your packets away”.
00:09:31.533
It's independent of how TCP then responds,
00:09:34.366
whether it follows Reno or Cubic or
00:09:37.466
Vegas that doesn't really matter, it's just
00:09:39.600
an indication that it needs to slow
00:09:41.266
down because the queues are starting to
00:09:43.166
build up, and will overflow soon if it doesn't.
00:09:47.466
And if TCP reacts to that,
00:09:49.400
reacts to the ECN Echo bit going
00:09:51.566
back, and the sender reduces its rate,
00:09:53.966
the queues will empty, the router will
00:09:55.700
stop marking the packets, and everything will
00:09:57.900
settle down at a slightly slower rate
00:10:00.300
without causing any packet loss.
00:10:02.733
And the system will adapt, and it
00:10:05.600
will achieve the same sort
00:10:07.800
of throughput, it will just react earlier,
00:10:11.100
so you have smaller queues and lower latency.
00:10:14.500
And this gives you the same throughput
00:10:16.966
as you would with TCP Reno or
00:10:20.400
TCP Cubic, but with low latency,
00:10:22.333
which means it's better for competing video
00:10:25.100
conferencing or gaming traffic.
00:10:28.433
And I’ve described the mechanism for TCP,
00:10:31.066
but there are similar ECN extensions for
00:10:33.833
QUIC and for RTP, which is the
00:10:36.566
video conferencing protocol, all designed to achieve
00:10:39.933
the same goal.
00:10:44.400
So ECN, I think, is unambiguously a
00:10:47.100
good thing. It’s a signal from the
00:10:48.866
network to the endpoints that the network
00:10:50.966
is starting to get congested, and the
00:10:52.866
endpoints should slow down.
00:10:54.500
And if the endpoints believe it,
00:10:56.666
if they back off,
00:10:58.500
they reduce their sending rate before the
00:11:00.900
network is overloaded, and we end up
00:11:03.966
in a world where we still
00:11:06.966
achieve good congestion control, good throughput,
00:11:11.133
but with lower latency.
00:11:13.100
And, if the endpoints don't believe it
00:11:15.200
well, eventually, the queues in the routers
00:11:17.233
overflow and they lose packets, and we’re
00:11:19.100
no worse-off than we are now.
00:11:22.133
In order to deploy ECN, though,
00:11:25.600
we need to make changes. We need
00:11:27.900
to change the endpoints, to change the
00:11:29.700
end systems, to support these bits in
00:11:31.766
the IP header, and to support,
00:11:33.766
to add support for this into TCP.
00:11:36.500
And we need to update the routers,
00:11:38.666
to actually mark the packets when they're
00:11:40.333
starting to get overloaded.
00:11:44.333
Updating the end points has pretty much
00:11:47.066
been done by now.
00:11:49.200
I think every TCP implementation,
00:11:54.100
implemented in the last 15-20 years or
00:11:57.200
so, supports ECN, and these days,
00:12:00.000
most of them have it turned on by default.
00:12:04.266
And I think we actually have Apple
00:12:06.866
to thank for this.
00:12:09.033
ECN, for a long time, was implemented
00:12:12.900
but turned off by default, because there’d
00:12:15.233
been problems with some old firewalls which
00:12:17.900
reacted badly to it, 20 or so years ago.
00:12:22.233
And, relatively recently, Apple decided that they
00:12:25.666
wanted these lower latency benefits, and they
00:12:29.833
thought ECN should be deployed. So they
00:12:32.566
started turning it on by default in the iPhone.
00:12:37.100
And they kind-of followed an interesting approach.
00:12:40.100
In that, for iOS 9, a random
00:12:43.133
subset of 5% of iPhones would turn
00:12:46.233
on ECN for some of their connections.
00:12:51.433
And they measured what happened. And they
00:12:54.233
found out that in the overwhelming majority
00:12:56.433
of cases this worked fine, and occasionally
00:12:59.133
it would fail.
00:13:01.400
And they would call up the network
00:13:03.966
operators, whose networks were showing problems,
00:13:07.433
and they would say “your network doesn't
00:13:10.333
work with iPhones; and currently it's not
00:13:12.800
working well with 5% of iPhones but
00:13:15.233
we're going to increase that number,
00:13:16.933
and maybe you should fix it”.
00:13:19.600
And then, a year later, when iOS
00:13:21.633
10 came out, they did this for 50%
00:13:24.066
of connections made by iPhones. And then
00:13:26.933
a year later, for all of the connections.
00:13:30.000
And it's amazing what impact a
00:13:34.200
popular vendor calling up a network operator can
00:13:41.433
have on getting them to fix the equipment.
00:13:45.066
And, as a result,
00:13:47.200
ECN is now widely enabled by default
00:13:50.500
in the phones, and the network seems
00:13:53.333
to support it just fine.
00:13:56.300
Most of the routers also support ECN.
00:13:58.833
Although currently relatively few of them seem
00:14:01.400
to enable it by default. So most
00:14:04.066
of the endpoints are now
00:14:05.633
at the stage of sending ECN enabled
00:14:08.166
traffic, and are able to react to
00:14:10.900
the ECN marks, but most of the
00:14:13.400
networks are not currently setting the ECN marks.
00:14:16.933
This is, I think, starting to change.
00:14:19.533
Some of the recent DOCSIS, which is
00:14:22.266
the cable modem standards, are starting to
00:14:26.400
support ECN. We’re starting to see
00:14:29.500
cable modems, cable Internet connections, which enable
00:14:33.566
ECN by default.
00:14:35.866
And, we're starting to see interest from
00:14:38.900
3GPP, which is the mobile phone standards
00:14:41.100
body to enable this in 5G,
00:14:43.933
6G, networks, so I think it's coming.
00:14:47.100
but it's going to take time.
00:14:49.066
And, I think, as it comes,
00:14:51.233
as ECN gradually gets deployed, we’ll gradually
00:14:53.766
see a reduction in latency across the
00:14:56.000
networks. It’s not going to be dramatic.
00:14:59.400
It's not going to suddenly transform the
00:15:01.300
way the network behaves, but hopefully over
00:15:04.033
the next 5 or 10 years we’ll
00:15:06.166
gradually see the latency reducing as ECN
00:15:09.433
gets more widely deployed.
00:15:13.900
So that's what I want to say
00:15:15.800
about ECN. It’s a mechanism by which
00:15:17.966
the network can signal to the applications
00:15:20.133
that the network is starting to get
00:15:22.033
overloaded, and allow the applications to back
00:15:24.433
off more quickly, in a way which
00:15:26.966
reduces latency and reduces packet loss.
Part 6: Light Speed?
The final part of the lecture moves on from congestion control and
queueing, and discusses another factor that affects latency: the
network propagation delay. It outlines what is the propagation delay
and ways in which it can be reduced, including more direct paths and
the use of low-Earth orbit satellite constellations.
Slides for part 6
00:00:00.433
In this final part of the lecture,
00:00:02.100
I want to move on from talking
00:00:03.600
about congestion control, and the impact of
00:00:05.733
queuing delays on latency, and talk instead
00:00:08.233
about the impact of propagation delays.
00:00:12.300
So, if you think about the latency
00:00:15.166
for traffic being delivered across the network,
00:00:17.433
there are two factors which impact that latency.
00:00:21.433
The first is the time packets spent
00:00:23.933
queued up at various routers within the network.
00:00:28.033
As we've seen in the previous parts
00:00:29.733
of this lecture, this is highly influenced
00:00:32.033
by the choice of TCP congestion control,
00:00:35.100
and whether Explicit Congestion Notification
00:00:37.566
is enabled or not.
00:00:39.533
The other factor, that we've not really
00:00:41.900
discussed to date, is the time it
00:00:44.066
takes the packets to actually propagate down
00:00:46.333
the links between the routers. This depends
00:00:48.700
on the speed at which the signal
00:00:50.500
propagates down the transmission medium.
00:00:53.233
If you're using an optical fibre to
00:00:55.233
transmit the packets, it depends on the
00:00:57.333
speed at which the light propagates through the fibre.
00:01:00.700
If you're using electrical signals in a
00:01:03.133
cable, it depends on the speed at
00:01:04.933
which electrical field propagates down the cable.
00:01:07.600
And if you're using radio signals,
00:01:09.366
it depends on the speed of light,
00:01:11.100
the speed at which the radio signals
00:01:12.666
propagate through the air.
00:01:17.000
As you might expect, physically shorter links
00:01:21.100
have lower propagation delays.
00:01:23.533
A lot of the time it takes
00:01:25.600
a packet to get down a long
00:01:27.233
distance link is just the time it
00:01:29.400
takes the signal to physically transmit along
00:01:32.133
the link. If you make the link
00:01:33.633
shorter it takes less time.
00:01:37.300
And what is perhaps not so obvious,
00:01:40.500
though, is that you can actually get
00:01:43.000
significant latency benefits on certain paths,
00:01:48.166
because the existing network links follow quite
00:01:51.533
indirect routes.
00:01:53.766
For example, if you look at the
00:01:55.566
path the network links take, if you're
00:01:58.066
sending data from Europe to Japan.
00:02:01.066
Quite often, that data goes from Europe,
00:02:03.900
across the Atlantic to, for example,
00:02:06.533
New York or Boston, or somewhere like
00:02:08.900
that, across the US to
00:02:12.866
San Francisco, or Los Angeles, or Seattle,
00:02:17.000
or somewhere along those lines, and then
00:02:19.600
from there, in a cable across the
00:02:21.966
Pacific to Japan.
00:02:25.133
Or alternatively, it goes from Europe through
00:02:27.733
the Mediterranean, the Suez Canal and the
00:02:30.433
Middle East, and across India, and so
00:02:32.800
on, until it eventually reaches Japan the
00:02:35.600
other way around. But neither of these
00:02:38.100
is a particularly direct route.
00:02:40.666
And it turns out that there is
00:02:42.933
a much more direct, a much faster
00:02:44.900
route, to get from Europe to Japan,
00:02:48.033
which is to lay an optical fibre
00:02:51.233
through the Northwest Passage, across Northern Canada,
00:02:55.233
through the Arctic Ocean, and down through
00:02:57.733
the Bering Strait, and past Russia to
00:02:59.866
get directly to Japan. It's much closer
00:03:03.200
to the great circle route around the
00:03:04.966
globe, and it's much shorter than the
00:03:07.066
route that the networks currently take.
00:03:10.000
And, historically, this hasn't been possible because
00:03:12.566
of the ice in the Arctic.
00:03:14.666
But, with global warming, the Northwest Passage
00:03:17.800
is now ice-free for enough of the
00:03:20.766
year that people are starting to talk
00:03:23.100
about laying optical fibres along that route,
00:03:26.266
because they can get a noticeable latency
00:03:28.733
reduction, for certain amounts of traffic,
00:03:31.400
by just following the physically shorter route.
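To get a feel for how much the shape of the route matters, the following sketch (not from the lecture; the coordinates are rough city locations rather than cable landing points, and real cable paths are longer than straight great-circle segments) compares a direct Europe-to-Japan great circle with a path routed via New York and Los Angeles, assuming light in fibre at roughly 200,000 km/s:

```python
from math import radians, sin, cos, asin, sqrt

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Approximate coordinates in degrees; illustrative only.
LONDON      = (51.5, -0.1)
NEW_YORK    = (40.7, -74.0)
LOS_ANGELES = (34.1, -118.2)
TOKYO       = (35.7, 139.7)

FIBRE_KM_PER_MS = 200.0   # ~200,000 km/s is 200 km per millisecond

direct_km = great_circle_km(*LONDON, *TOKYO)
via_us_km = (great_circle_km(*LONDON, *NEW_YORK)
             + great_circle_km(*NEW_YORK, *LOS_ANGELES)
             + great_circle_km(*LOS_ANGELES, *TOKYO))

for name, km in (("direct great circle", direct_km),
                 ("via New York and LA", via_us_km)):
    print(f"{name:20}: {km:7.0f} km, ~{km / FIBRE_KM_PER_MS:5.1f} ms one way in fibre")
```

Even before allowing for the extra length of real cable routes, the detour via the US roughly doubles the one-way propagation delay, which is why a more direct Arctic route is attractive.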
00:03:38.400
Another factor which influences the propagation delay
00:03:42.600
is the speed of light in the transmission media.
00:03:47.433
Now, if you're sending data using radio links,
00:03:52.000
or using lasers in a vacuum,
00:03:57.033
then these propagate at the speed of light in a vacuum.
00:04:01.100
Which is about 300 million metres per second.
00:04:05.700
The speed of light in optical fibre,
00:04:07.733
though, is slower. The speed at which
00:04:09.900
light propagates down a fibre,
00:04:12.633
the speed at which light propagates through
00:04:14.566
glass, is only about 200,000
00:04:17.000
kilometres per second, or 200 million metres per
00:04:19.466
second. So it’s about two thirds of
00:04:21.800
the speed at which it propagates in a vacuum.
00:04:25.633
And this is the reason for systems
00:04:28.100
such as StarLink, which SpaceX is deploying.
00:04:32.900
And the idea of these systems is
00:04:34.700
that, rather than sending the Internet signals
00:04:38.000
down an optical fibre,
00:04:40.133
you send them 100, or a couple
00:04:42.300
of hundred miles, up to a satellite,
00:04:44.733
and they then go around between various
00:04:47.466
satellites in the constellation, in low Earth
00:04:50.033
orbit, and then down to a receiver
00:04:53.700
near the destination.
00:04:55.833
And because the signals propagate through a vacuum, rather than
00:04:58.833
through optical fibre, they benefit from the speed of light
00:05:02.800
in a vacuum being significantly faster, about
00:05:05.300
50% faster than the speed of light
00:05:07.966
in fibre, and this can reduce the latency.
00:05:11.166
And the estimates show that if you
00:05:14.533
have a large enough constellation of satellites,
00:05:17.300
and SpaceX is planning on deploying around
00:05:19.666
4000 satellites, I believe, and with careful
00:05:23.133
routing, you can get about a 40,
00:05:25.800
45, 50% reduction in latency.
00:05:28.566
Just because the signals are transmitted via
00:05:31.866
radio waves, and via inter-satellite laser links,
00:05:35.733
which are in a vacuum, rather than
00:05:39.700
being transmitted through a fibre optic cable.
00:05:42.166
Just because of the differences in the
00:05:44.100
speed of light in the two media.
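As a back-of-the-envelope sketch of that trade-off (not from the lecture; the path lengths, altitude, and detour factor here are all assumed, illustrative values), compare a long fibre route with a somewhat longer satellite path travelling at the free-space speed of light:

```python
# Hypothetical comparison of a long, indirect fibre route with a
# satellite path: the satellite path is modelled crudely as the
# great-circle distance, inflated slightly for inter-satellite hops,
# plus the up- and down-links to an assumed low-Earth orbital altitude.

FIBRE_SPEED_KM_S  = 200_000.0   # light in glass, roughly
VACUUM_SPEED_KM_S = 300_000.0   # radio / laser in vacuum, roughly

great_circle_km   = 9_600.0     # assumed endpoint separation
fibre_route_km    = 18_000.0    # assumed indirect cable route
orbit_altitude_km = 550.0       # assumed orbital altitude
satellite_path_km = great_circle_km * 1.1 + 2 * orbit_altitude_km

fibre_ms = fibre_route_km / FIBRE_SPEED_KM_S * 1000
sat_ms   = satellite_path_km / VACUUM_SPEED_KM_S * 1000

print(f"fibre route:    {fibre_ms:5.1f} ms one way")
print(f"satellite path: {sat_ms:5.1f} ms one way "
      f"({(1 - sat_ms / fibre_ms) * 100:.0f}% lower)")
```

The exact saving depends on how indirect the fibre route is, how much the satellite routing detours between satellites, and the processing time at each hop.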
00:05:47.100
And the link on the slide points
00:05:49.733
to some simulations of the StarLink network,
00:05:52.333
which try and demonstrate how this would
00:05:54.966
work, and how it can both achieve
00:05:57.366
network paths that closely follow the
00:06:01.266
great circle routes, and
00:06:03.366
reduce the latency through
00:06:07.566
the use of satellites.
00:06:13.433
So, what we see is that people
00:06:15.133
are clearly going to some quite extreme
00:06:17.100
lengths to reduce latency.
00:06:19.500
I mean, what we spoke about in
00:06:21.933
the previous part was the use of
00:06:24.366
ECN marking to reduce latency by reducing
00:06:26.766
the amount of queuing. And that's just
00:06:29.200
a configuration change, it’s a software change
00:06:31.466
to some routers. And that seems to
00:06:33.666
me like a reasonable approach to reducing latency.
00:06:36.900
But some people are clearly willing to
00:06:39.633
go to the effort of
00:06:41.833
launching thousands of satellites, or
00:06:44.666
perhaps the slightly less extreme case of
00:06:49.033
laying new optical fibres through the Arctic Ocean.
00:06:53.000
So why are people doing this? Why
00:06:54.933
do people care so much about reducing
00:06:57.100
latency, that they're willing to spend billions
00:06:59.900
of dollars launching thousands of satellites,
00:07:02.833
or running new undersea cables, to do this?
00:07:06.833
Well, you'll be surprised to hear that
00:07:09.233
this is not to improve your gaming
00:07:11.166
experience. And this is not to improve
00:07:13.500
the experience of your zoom calls.
00:07:16.033
Why are people doing this? High frequency share trading.
00:07:20.800
Share traders believe they can make a
00:07:23.600
lot of money, by getting a few milliseconds worth
00:07:27.900
of latency reduction compared to their competitors.
00:07:33.600
Whether that's a good use of a
00:07:35.833
few billion dollars I’ll let you decide.
00:07:38.800
But the end result may be,
00:07:41.433
hopefully, that we will get lower latency
00:07:43.866
for the rest of us as well.
00:07:48.733
And that concludes this lecture.
00:07:52.433
There are a bunch of reasons why
00:07:54.566
we have latency in the network.
00:07:56.600
Some of this is due to propagation
00:07:59.200
delays. Some of this, perhaps most of
00:08:01.166
it, in many cases, is due to
00:08:02.866
queuing at intermediate routers.
00:08:05.733
The propagation delays are driven by the speed of light.
00:08:09.200
And unless you can launch many satellites,
00:08:12.966
or lay more optical fibres, that's pretty
00:08:17.500
much a fixed constant, and there's not
00:08:19.833
much we can do about it.
00:08:22.966
Queuing delays, though, are things which we
00:08:25.833
can change. And a lot of the
00:08:28.066
queuing delays in the network are caused
00:08:30.000
because of TCP Reno and TCP Cubic,
00:08:34.400
which push for the queues to be full.
00:08:37.733
Hopefully, we will see improved TCP congestion
00:08:41.366
control algorithms. And TCP Vegas was one
00:08:44.600
attempt in this direction, which unfortunately proved
00:08:48.066
not to be deployable in practice.
00:08:50.833
TCP BBR was another attempt which
00:08:54.233
was problematic for other reasons, because of
00:08:57.433
its unfairness. But people are certainly working
00:09:00.066
on alternative algorithms in this space,
00:09:02.866
and hopefully we'll see things deployed before too long.
Discussion
Lecture 6 discussed TCP congestion control and its impact on latency.
It discussed the principles of congestion control (e.g., the sliding
window algorithm, AIMD, conservation of packets), and their realisation
in TCP Reno. It reviewed the choice of TCP initial window, slow start,
and the congestion avoidance phase, and the response of TCP to packet
loss as a congestion signal.
The lecture noted that TCP Reno cannot effectively make use of fast
and long distance paths (e.g., gigabit per second flows, running on
transatlantic links). It discussed the TCP Cubic algorithm, which
changes the behaviour of TCP in the congestion avoidance phase to
make more effective use of such paths.
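For reference, here is a minimal sketch of the CUBIC window growth function, roughly following RFC 8312 with its usual constants (illustrative only, not the lecture's own code):

```python
# Sketch of the CUBIC window growth curve (roughly per RFC 8312).
# After a loss the window is cut to BETA * w_max, then grows along a
# cubic curve that flattens near the previous maximum w_max before
# probing beyond it; this regains rate on fast, long paths far more
# quickly than Reno's one-segment-per-RTT increase.

C = 0.4       # CUBIC scaling constant
BETA = 0.7    # multiplicative decrease factor

def cubic_window(t: float, w_max: float) -> float:
    """Congestion window (in segments) t seconds after the last loss."""
    k = ((w_max * (1.0 - BETA)) / C) ** (1.0 / 3.0)   # time to return to w_max
    return C * (t - k) ** 3 + w_max

# Example: a flow whose window was 1000 segments when it last saw loss.
for t in (0, 2, 4, 6, 8, 10):
    print(f"t = {t:2d} s: cwnd ~ {cubic_window(t, 1000.0):7.1f} segments")
```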
And it noted that both TCP Reno and TCP Cubic will try to increase
their sending rate until packet loss occurs, and will use that loss
as a signal to slow down. This fills the in-network queues at routers
on the path, causing latency.
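As a reminder of why those queues fill, here is a highly simplified sketch of a Reno-style AIMD loop (illustrative only, not the lecture's code): the window keeps growing by one segment per round trip until loss is detected, and with drop-tail queues that loss only happens once the bottleneck queue is already full.

```python
# Toy Reno-style congestion avoidance: additive increase of one segment
# per RTT, multiplicative decrease (halving) when a loss indicates that
# the bottleneck queue has overflowed.

def congestion_avoidance(cwnd: float, loss_detected: bool) -> float:
    """Return the new congestion window, in segments, after one RTT."""
    if loss_detected:
        return max(cwnd / 2.0, 1.0)   # multiplicative decrease on loss
    return cwnd + 1.0                 # additive increase otherwise

# Loss only occurs once the window exceeds the path capacity plus the
# bottleneck queue, so the queue is full (and delay highest) just
# before each back-off -- hence the standing latency.
capacity_plus_queue = 40.0   # assumed bandwidth-delay product + queue, in segments
cwnd = 10.0
for rtt in range(30):
    loss = cwnd > capacity_plus_queue
    cwnd = congestion_avoidance(cwnd, loss)
    print(f"RTT {rtt:2d}: cwnd = {cwnd:5.1f} segments" + ("  <- loss, back off" if loss else ""))
```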
The lecture briefly discussed TCP Vegas, and the idea of using delay
changes as a congestion signal instead of packet loss, and it noted
that TCP Vegas is not deployable in parallel with TCP Reno or Cubic.
It highlighted ongoing research with TCP BBR to address some of the
limitations of TCP Vegas.
Finally, the lecture highlighted the possible use of Explicit Congestion
Notification as a way of signalling congestion to the endpoints, and of
causing TCP to reduce its sending rate, before the in-network queues
overflow. This potentially offers a way to reduce latency.
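A brief sketch of how an ECN-capable sender could react (again illustrative, hypothetical code rather than any particular implementation): an echoed Congestion Experienced mark triggers the same multiplicative decrease as a loss would, but while the queue is still short and without any packet having been dropped.

```python
# Illustrative ECN-aware window update: a CE mark echoed by the
# receiver is treated as a congestion signal, so the sender backs off
# before the queue overflows, instead of waiting for packet loss.

def on_ack(cwnd: float, ce_echoed: bool) -> float:
    """Update the congestion window (segments) when an ACK arrives."""
    if ce_echoed:                     # router marked the packet: back off early
        return max(cwnd / 2.0, 1.0)
    return cwnd + 1.0 / cwnd          # roughly one segment of growth per RTT
```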
Discussion will focus on the behaviour of TCP Reno congestion control,
and of understanding how this leads to increased latency. It will discuss
the applicability and ease of deployment of ways of reducing that latency.