This lecture discusses some of the factors that affect the latency of
a TCP connection. It considers TCP congestion control, the TCP Reno
and Cubic congestion control algorithms, and their behaviour and
performance in terms of throughput and latency. It then considers
alternative congestion control, such as the TCP Vegas and BBR
algorithms, and the use of explicit congestion notification (ECN),
as options to lower latency. Finally, it considers the impact of
sub-optimal Internet paths on latency, and the rationale for deploying
low-Earth orbit satellite constellations to reduce latency of Internet access.
This first part of the lecture outlines the principle of congestion
control. It discusses packet loss as a congestion signal, conservation
of packets in flight, and the additive increase, multiplicative
decrease requirements for stability.
In this lecture I’d like to move
on from talking about how to transfer
data reliably, and talk about mechanisms and
means by which transport protocols go about
lowering the latency of the communication.
One of the key limiting factors of
performance of network systems, as we've discussed
in some of the previous lectures, is latency.
Part of that is the latency for
establishing connections, and we've spoken about that
in detail already, where a lot of
the issue is the number of round
trip times needed to set up a connection.
And, especially when secure connections are in
use, if you're using TCP and TLS,
for example, as we discussed, there’s a
large number of round trips needed to
actually get to the point where you
can establish a connection, negotiate security parameters,
and start to exchange data.
And we've already spoken about how the
QUIC Transport Protocol
has been developed to try and improve
latency in terms of establishing a connection.
The other aspect of latency, and reducing
the latency of communications, is actually in
terms of data transfer.
How you deliver data across the network
in a way which doesn't lead to
excessive delays, and how you can gradually
find ways of reducing the latency,
and making the network better suited to
real time applications, such as telephony,
and video conferencing, and gaming, and high
frequency trading, and
Internet of Things, and control applications.
A large aspect of that is in
terms of how you go about building
congestion control, and a lot of the
focus in this lecture is going to
be on how TCP
congestion control works, and how other protocols
do congestion control, to deliver data in a
low latency manner.
But I’ll also talk a bit about
explicit congestion notification, and changes to the
way queuing happens in the network,
and about services such as SpaceX’s Starlink
which are changing the way the network
is built to reduce latency.
I want to start by talking about congestion control,
and TCP congestion control in particular.
And, what I want to do in
this part, is talk about some of
the principles of congestion
control. And talk about what is the
problem that's being solved, and how can
we go about adapting the rate at
which a TCP connection delivers data over the network,
to make best use of the network
capacity, and to do so in a
way which doesn't build up queues in
the network and induce too much latency.
So in this part I’ll talk about
congestion control principles. In the next part
I’ll move on to talk about loss-based
congestion control, and talk about TCP Reno
and TCP Cubic,
which are ways of making very effective
use of the overall network capacity,
and then move on to talk about
ways of lowering latency.
I’ll talk about latency reducing congestion control
algorithms, such as TCP Vegas or Google's
TCP BBR proposal. And then I’ll finish
up by talking a little bit about
Explicit Congestion Notification
in one of the later parts of the lecture.
TCP is a
complex and very highly optimised protocol,
especially when it comes to congestion control
and loss recovery mechanisms.
I'm going to attempt to give you
a flavour of the way congestion control
works in this lecture, but be aware
that this is a very simplified review
of some quite complex issues.
The document listed on the slide is
entitled “A roadmap for TCP Specification Documents”,
and it's the latest IETF standard that describes
how TCP works, and points to the
details of the different proposals.
This is a very long and complex
document. It’s about, if I remember right,
60 or 70 pages long.
And all it is, is a list
of references to other specifications, with one
paragraph about each one describing why that
specification is important.
And the complete specification for TCP is
several thousand pages of text. This is
a complex protocol with a lot of
features in it, and I’m necessarily giving
a simplified overview.
I’m going to talk about TCP.
I’m not going to talk much,
if at all, about QUIC in this lecture.
That's not because QUIC isn't interesting,
it's because QUIC essentially adopts the same
congestion control mechanisms as TCP.
The QUIC version one standard says to
use TCP Reno, use the same congestion
control algorithm as TCP Reno.
And, in practice, most of the QUIC
implementations use the Cubic or the BBR
congestion control algorithms,
which we'll talk about later on.
QUIC is basically adopting the same mechanisms
as TCP does, and for that reason
I’m not going to talk about
them too much separately.
So what is the goal of congestion
control? What are the principles of congestion control?
Well, the idea of congestion control is
to find the right transmission rate for a connection.
We're trying to find the fastest sending
rate which you can send at to
match the capacity of the network,
and to do so in a way
that doesn't build up queues, doesn't overload,
doesn't congest the network.
So we're looking to adapt the transmission
rate of a flow of TCP traffic
over the network, to match the available network capacity.
And as the network capacity changes,
perhaps because other flows of traffic start
up, or perhaps because you're on a
mobile device and you move into an
area with different radio coverage,
the speed at which the TCP is
delivering the data should adapt to match
the changes and available capacity.
The fundamental principles of congestion control,
as applied in TCP,
were first described by Van Jacobson,
who we see on the picture on
the top right of the slide,
in the paper “Congestion Avoidance and Control”.
And those principles are that TCP responds
to packet loss as a congestion signal.
It treats the loss of a packet,
because the Internet is a best effort
packet network, and it loses, it discards
packets, if it can't deliver them,
and TCP treats that discard, that loss
of a packet, as a congestion signal,
and as a signal that it's sending
too fast and should slow down.
It relies on the principle of conservation
of packets. It tries to keep the
number of packets, which are traversing the
network roughly constant,
assuming nothing changes in the network.
And it relies on the principles of
additive increase, multiplicative decrease.
If it has to increase its sending
rate, it does so relatively slowly,
an additive increase in the rate.
And if it has to reduce its
sending rate, it does so quickly, a multiplicative decrease.
And these are the fundamental principles that
Van Jacobson elucidated for TCP congestion control,
and for congestion control in general.
And it was Van Jacobson who did
the initial implementation of these into TCP
in the late 1980s, about 1987, ’88, or so.
Since then, the algorithms, the congestion control
algorithms, for TCP in general have been
maintained by a large number of people.
A lot of people have developed this.
Probably one of the leading people in
this space for the last 20 years
or so was Sally Floyd, who was
very much responsible for taking
the TCP standards, making them robust,
pushing them through the IETF to get
them standardised, and making sure they work
and get really high performance.
And she very much drove the development
to make these robust, and effective,
and high performance standards, and to make
TCP work as well as it does today.
And Sally sadly passed away a year
or so back, which is a tremendous
shame, but we're grateful for her legacy
in moving things forward.
So to go back to the principles.
The first principle of congestion control in
the Internet, and in TCP, is that
packet loss is an indication that the
network is congested.
Data flowing across the Internet flows from
the sender to the receiver through a
series of routers. The IP routers connect
together the different links that comprise the network.
And routers perform two functions:
they perform a routing function, and a forwarding function.
The purpose of the routing function is
to figure out how packets should get
to their destination. They receive a packet
from some network link, look at the
destination IP address, and decide which direction
to forward that packet. They’re responsible for
finding the right path through the network.
But they're also responsible for forwarding,
which is actually putting the packets into
the queue of outgoing traffic for the
link, and managing that queue of packets
to actually transmit the packets across the network.
And routers in the network have a
set of different links; the whole point
of a router is to connect different
links. And at each link, they have
a queue of packets, which are enqueued
to be delivered on that link.
And, perhaps obviously, if packets are arriving
faster than the link can deliver those
packets, then the queue gradually builds up.
More and more packets get enqueued in
the router waiting to be delivered.
And if packets are arriving slower than
they can be forwarded,
then the queue gradually empties as the
packets get transmitted.
Obviously the router has a limited amount
of memory, and at some point it's
going to run out of space to
enqueue packets. So, if packets are
arriving at the router faster
than they can be delivered down the
link, the queue will build up and
gradually fill, until it reaches its maximum
size. At that point, the router has
no space to keep the newly arrived
packets, and so it discards the packets.
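As a rough illustration of the queue behaviour just described, the sketch below simulates a single outgoing link with a fixed-size queue that discards arrivals once it is full. The function name `simulate_queue`, its parameters, and the tick-based timing are assumptions made purely for illustration; real routers are considerably more complex.

```python
from collections import deque


def simulate_queue(arrivals_per_tick, departures_per_tick, capacity, ticks):
    """Simulate a fixed-size FIFO queue on one outgoing link.

    Returns (final_queue_length, packets_dropped).
    All parameters are hypothetical, in packets or ticks.
    """
    queue = deque()
    dropped = 0
    next_packet = 0
    for _ in range(ticks):
        # Packets arrive; any that find the queue full are discarded.
        for _ in range(arrivals_per_tick):
            if len(queue) < capacity:
                queue.append(next_packet)
            else:
                dropped += 1
            next_packet += 1
        # The link transmits as many queued packets as it can this tick.
        for _ in range(min(departures_per_tick, len(queue))):
            queue.popleft()
    return len(queue), dropped


# Arrivals faster than the link can deliver: the queue fills, then drops.
print(simulate_queue(arrivals_per_tick=3, departures_per_tick=2, capacity=10, ticks=20))
# Arrivals slower than the link can deliver: the queue stays empty.
print(simulate_queue(arrivals_per_tick=2, departures_per_tick=3, capacity=10, ticks=20))
```

In the first call the queue fills after a few ticks and packets are then discarded on every tick; in the second nothing is ever dropped.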
And this is what TCP is using
as the congestion signal. It’s using the
fact that the queue of packets on
an outgoing link at a router has
filled up. When the queue fills up and a packet gets
lost, it uses that packet loss as
an indication that it's sending too fast.
It’s sending faster than the packets can
be delivered, and as a result the
queue has overflowed, a packet has been
lost, and so it needs to slow down.
And that's the fundamental congestion signal in
the network. Packet loss is interpreted as
a sign that devices are sending too
fast, and should go slower. And if
they slow down, the queues will gradually
empty, and packets will stop being lost.
So that's the first fundamental principle.
The second principle is that
we want to keep the number of
packets in the network roughly constant.
TCP, as we saw in the last
lecture, sends acknowledgments for packets. When a
packet is transmitted it has a sequence
number, and the response will come back
from the receiver acknowledging receipt of that packet.
The general approach for TCP, once the
connection has got going, is that every
time it gets an acknowledgement, it uses
that as a signal that a packet
has been received.
And if a packet has been received,
something has left the network. One of
the packets sent into the network has
reached the other side, and has been
removed from the network at the receiver.
That means there should be space to
put another packet into the network.
And it's an approach that’s called ACK
clocking. Every time a packet arrives at
the receiver, and you get an acknowledgement
back saying it was received, that indicates
you can put another packet in.
So the total number of packets in
transit across the network ends up being
roughly constant. One packet out, you put
another packet in.
And it has the advantage that if
you're clocking out new packets in receipt
of acknowledgments, if, for some reason,
the network gets congested, and it takes
longer for acknowledgments to come back,
because it's taking longer for them to
work their way across the network,
then that will automatically slow down the
rate at which you send. Because it
takes longer for the next acknowledgment to
come back, therefore it's longer before you
send your next packet.
So, as the network starts to get
busy, as the queue starts to build
up, but before the queue has overflowed,
it takes longer for the acknowledgments to
come back, because the packets are queued
up in the intermediate links, and that
gradually slows down the behaviour of TCP.
It reduces the rate at which you can send.
So it’s, to at least some extent,
self adjusting. The network gets busier,
the ACKs come back slower, therefore you
send a little bit slower.
And that's the second principle: conservation of
packets. One out, one in.
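One way to see the self-adjusting effect is to note that, if one new packet is released per acknowledgement, a sender with a fixed number of packets in flight sends roughly that many packets per round trip. The sketch below uses a hypothetical helper, `ack_clocked_rate`, and is a deliberate simplification rather than how any TCP stack computes its rate.

```python
def ack_clocked_rate(packets_in_flight, rtt_seconds):
    """Approximate sending rate, in packets per second, under ACK clocking.

    Assumes a constant number of packets in flight and one new packet
    released for each acknowledgement received.
    """
    return packets_in_flight / rtt_seconds


# The same ten packets in flight, with the round trip time growing
# as queues build up in the network (values chosen for illustration).
for rtt in (0.010, 0.020, 0.040):
    print(f"RTT {rtt * 1000:.0f} ms -> {ack_clocked_rate(10, rtt):.0f} packets/s")
```

As the round trip time grows, the acknowledgements come back more slowly and the rate falls without the sender having to do anything explicit.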
And the principle of conservation of packets
is great, provided the network is in
the steady state.
But you also need to be able
to adapt the rate at which you're sending.
The way TCP adapts is very much
focused on starting slowly and gradually increasing.
When it needs to increase its sending
rate, TCP increases linearly. It adds a
small amount to the sending rate each round trip time.
So it just gradually, slowly, increases the
sending rate. It gradually
pushes up the rate
until it spots a loss. Until it
loses a packet. Until it overflows a queue.
And then it responds to congestion by
rapidly decreasing its rate. If a congestion
event happens, if a packet is lost,
TCP halves its rate. It responds faster
than it increases, it slows down faster than it increases.
And this is the final principle,
what’s known as additive increase, multiplicative decrease.
The goal is to keep the network
stable. The goal is to not overload the network.
If you can, keep going at a
steady rate. Follow the ACK clocking approach.
Gradually, just slowly, increase the rate a
bit. Keep pushing, just in case there’s
more capacity than you think. So just
gradually keep probing to increase the rate.
If you overload the network, if you
cause congestion, if you overflow the queues,
cause a packet to be lost,
slow down rapidly. Halve your sending rate,
and gradually build up again.
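The additive increase, multiplicative decrease rule can be sketched in a few lines. This is a minimal illustration, assuming the rate is measured in packets per round trip, that a loss happens whenever the rate exceeds a fixed `capacity`, and that the increase step is one packet; the names `aimd_update` and `aimd_trace` are made up for this example.

```python
def aimd_update(rate, loss, increase=1.0, decrease=0.5):
    """Return the next sending rate after one round trip.

    On loss the rate is multiplied by `decrease` (halved by default);
    otherwise `increase` is added to it.
    """
    if loss:
        return rate * decrease
    return rate + increase


def aimd_trace(capacity, start_rate, rounds):
    """Record the rate used in each round, assuming (for illustration only)
    that a packet is lost whenever the rate exceeds `capacity`."""
    rate = start_rate
    trace = []
    for _ in range(rounds):
        trace.append(rate)
        loss = rate > capacity
        rate = aimd_update(rate, loss)
    return trace


print(aimd_trace(capacity=12, start_rate=10.0, rounds=6))
```

The trace climbs by one each round trip until it overshoots the capacity, halves, and then starts climbing again, which gives the familiar sawtooth pattern.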
The fact that you slow down faster
than you speed up, the fact that
you follow the one in, one out approach,
keeps the network stable. It makes sure
it doesn't overload the network, and it
means that if the network does overload,
it responds and recovers quickly. The goal
is to keep the traffic moving.
And TCP is very effective at doing this.
So those are the fundamental principles of
TCP congestion control. Packet loss as an
indication of congestion.
Conservation of packets, and ACK clocking.
One in, one out, where possible.
If you need to increase the sending
rate, increase slowly. If a problem happens,
decrease quickly. And that will keep the network stable.
In the next part I’ll talk about
TCP Reno, which is one of the
more popular approaches for doing this in practice.
The second part of the lecture discusses TCP Reno congestion control.
It outlines the principles of window based congestion control, and
describes how they are implemented in TCP. The choice of initial
window, and how the recommended initial window has changed over time,
is discussed, along with the slow start algorithm for finding the
path capacity and the congestion avoidance algorithm for adapting
the congestion window.
In the previous part, I spoke about
the principles of TCP congestion control in
general terms. I spoke about the idea
of packet loss as a congestion signal,
about the conservation of packets, and about
the idea of additive increase multiplicative decrease,
that is, increase slowly and decrease the sending rate
quickly, as a way of achieving stability.
In this part I want to talk
about TCP Reno, and some of the
details of how TCP congestion control works in practice.
I’ll talk about the basic TCP congestion
control algorithm, how the sliding window algorithm
works to adapt the sending rate,
and the slow start and congestion avoidance
phases of congestion control.
TCP is what's known as a window
based congestion control protocol.
That is, it maintains what's known as
a sliding window of data which is
available to be sent over the network.
And the sliding window determines what range
of sequence numbers can be sent by
TCP onto the network.
It uses the additive increase multiplicative decrease
approach to grow and shrink the window.
And that determines, at any point,
how much data a TCP sender can send
onto the network.
It augments these with algorithms known as
slow start and congestion avoidance. Slow start
being the approach TCP uses to get
a connection going in a safe way,
and congestion avoidance being the approach it
uses to maintain the sending rate once
the flow has got started.
The fundamental goal of TCP is that
if you have several TCP flows sharing
a link, sharing a bottleneck link in the network,
each of those flows should get an
approximately equal share of the bandwidth.
So, if you have four TCP flows
sharing a link, they should each get
approximately one quarter of the capacity of that link.
And TCP does this reasonably well.
It’s not perfect. It, to some extent,
biases against long distance flows,
and shorter flows tend to win out
a little over long distance flows.
But, in general, it works pretty well,
and does give flows a roughly
equal share of the bandwidth.
The basic algorithm it uses to do
this, the basic congestion control algorithm,
is an approach known as TCP Reno.
And this is the state of the
art in TCP as of about 1990.
TCP is an ACK based protocol.
You send a packet, and sometime later
an acknowledgement comes back telling you that
the packet arrived, and indicating the sequence
number of the next packet which is expected.
The simplest way you might think that
would work, is you send a packet.
You wait for the acknowledgment. You send
another packet. You wait for the acknowledgement. And so on.
The problem with that, is that it
tends to perform very poorly.
It takes a certain amount of time
to send a packet down a link.
That depends on the size of the
packet, and the link bandwidth.
The size of the packet is expressed
as some number of bits to be sent.
The link bandwidth is expressed in some
number of bits it can deliver each
second. And if you divide the
packet size by the bandwidth, that gives
you the number of seconds it takes to send each packet.
It takes a certain amount of time
for that packet to propagate down the
link to the receiver, and for the
acknowledgment to come back to you, depending on
the round trip time of the link.
And you can measure the round trip time of the link.
And you can divide one by the other.
You can take the time it takes to send a packet, and the
time it takes for the acknowledgment to
come back, and divide one by the
other, to get the link utilisation.
And, ideally, you want that fraction to be
close to one. You want to be
spending most of the time sending packets,
and not much time waiting for the
acknowledgments to come back before you can
send the next packet.
The problem is that's often not the case.
For example, if we assume we're trying
to send data, and we have a
gigabit link, which is connecting the machine
we're sending data from, and we’re trying
to go from Glasgow to London.
And this might be the case you would find if you had one
of the machines in the Boyd Orr
labs, which is connected to the University's
gigabit Ethernet, and the University has a
10 gigabit per second link to the
rest of the Internet, so the bottleneck is that Ethernet.
If you're talking to a machine in London,
let's make some assumptions on how long this will take.
You’re sending using Ethernet, and the biggest
packet an Ethernet can deliver is 1500
bytes. So 1500 bytes, multiplied by eight
bits per byte, gives you a number
of bits in the packet. And it’s
a gigabit Ethernet, so it's sending a
billion bits per second.
So 1500 bytes, times eight bits,
divided by a billion bits per second.
It will take 12 microseconds, 0.000012 of
a second, 12 microseconds to send a
packet down the link. And that’s just
the time it takes to physically serialise
1500 bytes down a gigabit per second link.
The round trip time to London, if you measure it, is about
a 100th of a second, about 10 milliseconds.
If you divide one by the other,
you find that the utilisation is 0.0012.
0.12% of the link is in use.
The time it takes to send a
packet is tiny compared to the time
it takes to get a response.
So if you're just sending one packet,
and waiting for a response, the link
is idle 99.9% of the time.
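The arithmetic in this example can be checked with a short calculation. The sketch below follows the lecture's simplification of dividing the time to send one packet by the round trip time; the helper names are assumptions introduced for illustration.

```python
def serialisation_time(packet_bytes, link_bps):
    """Seconds needed to put one packet onto a link of the given bit rate."""
    return packet_bytes * 8 / link_bps


def stop_and_wait_utilisation(packet_bytes, link_bps, rtt_seconds):
    """Fraction of time spent sending when one packet is sent per round trip."""
    return serialisation_time(packet_bytes, link_bps) / rtt_seconds


# Figures from the Glasgow to London example: 1500 byte packets,
# a gigabit per second link, and a 10 millisecond round trip time.
send_time = serialisation_time(1500, 1e9)
utilisation = stop_and_wait_utilisation(1500, 1e9, 0.010)
print(f"time to send one packet: {send_time * 1e6:.0f} microseconds")
print(f"link utilisation: {utilisation:.2%}")
```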
The idea of a sliding window protocol
is to not just send one packet
and wait for an acknowledgement.
It’s to send several packets,
and wait for the acknowledgments. And the
window is the number of packets that
can be outstanding before the acknowledgement comes back.
The idea is, you can start several
packets going, and eventually the acknowledgement comes
back, and that starts triggering the next
packets to be clocked out. This idea
is to improve the utilisation by sending
more than one packet before you get an acknowledgment.
And this is the fundamental approach to
sliding window protocols. The sender starts sending
data packets, and there's what's known as
a congestion window that specifies how
many packets it’s allowed to send
before it gets an acknowledgement.
And, in this example, the congestion window is six packets.
And the sender starts. It sends the
first data packet, and that gets sent
and starts its way traveling down the link.
And at some point later it sends
the next packet, and then the next packet, and so on.
After a certain amount of time that
first packet arrives at the receiver,
and the receiver generates the acknowledgment, which
comes back towards the sender.
And while this is happening, the sender
is sending more of the packets from its window.
And the receiver’s gradually receiving those and
sending the acknowledgments. And, at some point later,
the acknowledgement makes it back to the sender.
And in this case we've set the
window size to be six packets.
And it just so happens that the
acknowledgement for the first packet arrives back
at the sender, just as it has finished sending packet six.
And that triggers the window to slide along.
So instead of being allowed to send packets one through six,
we're now allowed to send packets two
through seven. Because one packet has arrived,
that's opened up the window to allow
us to send one more packet.
And the acknowledgement indicates that packet one
has arrived. So just as we'd run
out of packets to send, just as
we've sent our six packets which are
allowed by the window, the acknowledgement arrives,
slides the window along one,
tells us we can now send one more.
And the idea is that you size
the window such that you send just
enough packets that by the time the
acknowledgement comes back, you're ready to slide
the window along. You've sent everything that
was in your window.
And each acknowledgement releases the next packet
for transmission, if you get the window sized right.
And if there's a problem, if the acknowledgments
don't come back because something got lost,
then it stalls. You haven't sent too
many excess packets, you're not just keeping
sending without getting acknowledgments,
you're just sending enough
that the acknowledgments come back, just as
you run out of things to send.
And everything just stays sort-of balanced.
Every acknowledgement triggers the next packet to
be sent, and it rolls along.
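The effect of the window size on how busy the sender stays can be sketched with a small tick-based simulation. It assumes the sender can transmit one packet per tick, that no packets are lost, and that each acknowledgement arrives a fixed `rtt_ticks` after its packet was sent; the function name and parameters are hypothetical.

```python
def sliding_window_sends(window, rtt_ticks, total_ticks):
    """Return the ticks at which packets are sent.

    The sender transmits at most one packet per tick, and only while
    fewer than `window` packets are unacknowledged.
    """
    ack_due = []      # ticks at which outstanding packets are acknowledged
    send_times = []
    for tick in range(total_ticks):
        # Acknowledgements that have arrived by now slide the window along.
        ack_due = [t for t in ack_due if t > tick]
        if len(ack_due) < window:
            send_times.append(tick)
            ack_due.append(tick + rtt_ticks)
    return send_times


# Window matched to the round trip: the sender never has to stop.
print(sliding_window_sends(window=6, rtt_ticks=6, total_ticks=12))
# Window too small: the sender stalls waiting for acknowledgements.
print(sliding_window_sends(window=3, rtt_ticks=6, total_ticks=12))
# Window of one packet: this is stop-and-wait.
print(sliding_window_sends(window=1, rtt_ticks=6, total_ticks=12))
```

With a window of six and a round trip of six ticks, the first acknowledgement arrives just as the sixth packet has been sent, so transmission continues without a gap, as in the example on the slide.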
How big should the window be? Well,
it should be sized to match the
bandwidth times the delay on the path.
And you work it out in bytes.
It's the bandwidth of the path,
a gigabit in the previous example,
times the latency,
100th of a second, and you multiply
those together and that tells you how
many bytes can be in flight.
And you divide that by the packet
size, and that tells you how many packets you can send.
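Applied to the earlier Glasgow to London figures, the calculation looks like this. The helper names are assumptions for illustration, and the 1500 byte packet size is taken from the Ethernet example.

```python
def bandwidth_delay_product_bytes(bandwidth_bps, rtt_seconds):
    """Bytes that can be in flight on a path: bandwidth times delay."""
    return bandwidth_bps * rtt_seconds / 8


def ideal_window_packets(bandwidth_bps, rtt_seconds, packet_bytes=1500):
    """Whole packets that fit into the bandwidth-delay product."""
    return int(bandwidth_delay_product_bytes(bandwidth_bps, rtt_seconds) // packet_bytes)


# A gigabit per second path with a 10 millisecond round trip time.
print(bandwidth_delay_product_bytes(1e9, 0.010), "bytes in flight")
print(ideal_window_packets(1e9, 0.010), "packets of 1500 bytes")
```

For this path the window comes out at several hundred packets, far more than the single packet of stop-and-wait.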
The problem is, the sender doesn't know
the bandwidth of the path, and it
doesn't know that latency. It doesn't know
the round trip time.
It can measure the round trip time,
but not until after it started sending.
Once it’s sent a packet, it can
wait for an acknowledgement to come back
and get an estimate of the round
trip time. But it can't do that
at the point where it starts sending.
And it can't know what is the
bandwidth. It knows the bandwidth for the
link it's connected to, but it doesn't
know the bandwidth for the rest of
the links throughout the network.
It doesn't know how many other TCP
flows it’s sharing the traffic with,
so it doesn't know how much of
that capacity it's got available.
And this is the problem with
the sliding window algorithms. If you get
the window size right,
it allows you to do the ACK
clocking, it allows you to clock out
the packets at the right time,
just in time for the next packet to become available.
But, in order to pick the right
window size, you need to know the
bandwidth and the delay, and you don't
know either of those at the start of the connection.
TCP follows the sliding window approach.
TCP Reno is very much a sliding
window protocol, and it's optimised for not
knowing what the window sizes are.
And the challenge with TCP is to
pick what should be the initial window.
To pick how many packets you should
send, before you know anything about the
round trip time, or anything about bandwidth.
And how to find the path capacity,
how to figure out at what point
you've got the right size window.
And then how to adapt the window
to cope with changes in the capacity.
So there's two fundamental problems with TCP
Reno congestion control. Picking the initial window size
for the first set of packets you send.
And then, adapting that initial window size
to find the bottleneck capacity, and to
adapt to changes in that bottleneck capacity.
If you get the window size right,
you can make effective use of the
network capacity. If you get it wrong
you’ll either send too slowly, and end
up wasting capacity. Or you'll send too
quickly, and overload the network, and cause
packets to be lost because the queues fill.
So, how does TCP find the initial window?
Well, to start with, you have no
information. When you're making a TCP connection
to a host you haven't communicated with
before, you don't know the round trip
time to that host, you don’t know
how long it will take to get
a response, and you don't know the network capacity.
So you have no information to know
what an appropriately sized window should be.
The only safe thing you can do,
the only thing which is safe in
all circumstances, is to send one packet,
and see if it arrives, see if you get an ACK.
And if it works, send a little
bit faster next time.
And then gradually increase the rate at which you send.
The only safe thing to do
is to start at the lowest possible rate,
the equivalent of stop-and-wait, and then gradually
increase your rate from there, once you know that it works.
The problem is, of course, that's pessimistic,
in most cases.
Most links are not the slowest possible link.
Most links, you can send faster than that.
What TCP has traditionally done, and the
traditional approach in TCP Reno, is to declare
the initial window to be three packets.
So you can send three packets,
without getting any acknowledgments back.
And, by the time the third packet
has been sent, you should be just
about to get the acknowledgement back,
which will open it up for you to send the fourth.
And at that point, it starts ACK clocking.
And why is it three packets?
Because someone did some measurements,
and decided that was what was safe.
More recently, I guess, about 10 years
ago now, Nandita Dukkipati and her group
at Google did another set of measurements,
and showed that was actually pessimistic.
The networks had gotten a lot faster
in the time since TCP was first
standardised, and they came to the conclusion,
based on the measurements of browsers accessing
the Google site, that about 10 packets
was a good starting point.
And the idea here is that 10
packets, you can send 10 packets at
the start of a connection, and after
you’ve sent 10 packets you should have
got an acknowledgement back.
Again, it's a balance between safety and
performance. If you send too many packets
onto a network which can't cope with
them, those packets will get queued up
and, in the best case, it’ll just
add latency because they're all queued up
somewhere. And in the worst case they'll
overflow the queues, and cause packet loss,
and you'll have to re-transmit them.
So you don't want to send too
fast. Equally, you don't want to send
too slow, because that just wastes capacity.
And the measurements that Google came up with
at this point, which was around 10
years ago, was that about 10 packets
was a good starting point for most connections.
It was unlikely to cause congestion in
most cases, and was also unlikely to
waste too much bandwidth.
And I think what we'd expect to
see, is that over time the initial
window will gradually increase, as network connections
around the world gradually get faster.
And it's balancing making good use of
connections in well-connected
first-world parts of the world, where there’s good infrastructure,
against not overloading connections in parts of
the world where the infrastructure is less well developed.
The initial window lets you send something.
With a modern TCP, it lets you send 10 packets.
And you can send those 10 packets,
or whatever the initial window is,
without waiting for an acknowledgement to come back.
But it's probably not the right size;
it’s probably not the right window size.
If you're on a very fast connection,
in a well-connected part of the world,
you probably want a much bigger window than 10 packets.
And if you're on a poor quality
mobile connection, or in a part of
the world where the infrastructure is less
well developed, you probably want a smaller window.
So you need to somehow adapt
to match the network capacity.
And there's two parts to this.
What's called slow start, where you try
to quickly find the appropriate window:
starting from the initial window, you quickly
converge on what the right window is.
And congestion avoidance, where you adapt in
the long term to match changes in
capacity once the thing is running.
So how does slow start work?
Well, this is the phase at the beginning of the connection.
It's easiest to illustrate if you assume
that the initial window is one packet.
If the initial window is one packet,
you send one packet, and at some
point later an acknowledgement comes back.
And the way slow start works is
that each acknowledgment you get back
increases the window by one.
So if you send one packet,
and get one packet back, that increases
the window from one to two,
so you can send two packets the next time.
And you send those two packets,
and you get two acknowledgments back.
And each acknowledgment increases the window by
one, so it goes to three,
and then to four. So you can
send four packets the next time.
And then you get four acknowledgments back,
each of which increases the window,
so your window is now eight.
And, as we are all, I think,
painfully aware after the pandemic, this is exponential growth.
The window is doubling each time.
So it's called slow start because it
starts very slow, with one packet or
three packets or 10 packets, depending on
the version of TCP you have.
But each round trip time the window doubles.
It doubles its sending rate each time.
And this carries on until it loses
a packet. This carries on until it
fills the queues and overflows the capacity
of the network somewhere.
At which point it halves back to
its previous value, and drops out of
the slow start phase.
If we look at this graphically,
what we see on the graph at
the bottom of the slide, we have
time on the x axis, and the
congestion window, the size of the congestion
window, on the y axis.
And we're assuming an initial window of
one packet. We see that, on the
first round trip it sends the one
packet, gets the acknowledgement back. The second
round trip it sends two packets.
And then four, and then eight,
and then 16. And each time it
doubles its sending rate.
So you have this exponential growth phase,
starting at whatever the initial window is,
and doubling each time until it reaches
the network capacity.
And eventually it fills the network.
Eventually some queue, somewhere in the network,
is full. And it overflows and the packet gets lost.
At that point the connection halves its
rate, back to the value just before
it last increased. In this example,
we see that it got up to
a window of 16, and then
something got lost, and then it halved
back down to a window of eight.
At that point TCP enters what's known
as the congestion avoidance phase.
The goal of congestion avoidance is to
adapt to changes in capacity.
After the slow start phase, you know
you've got approximately the right size window
for the path. It's telling you roughly
how many packets you should be sending
each round trip time. The goal,
once you’re in congestion avoidance, is to adapt to changes.
Maybe the capacity of the path changes.
Maybe you're on a mobile device,
with a wireless connection, and the quality
of the wireless connection changes.
Maybe the amount of cross traffic changes.
Maybe additional people start sharing the link
with you, and you have less capacity
because you’re sharing with more TCP flows.
Or maybe some of the cross traffic
goes away, and the amount of capacity
you have available increases because there's less cross traffic.
And the congestion avoidance phase follows an
additive increase, multiplicative decrease,
approach to adapting
the congestion window when that happens.
So, in congestion avoidance,
if it successfully manages to send a
complete window of packets, and gets acknowledgments
back for each of those packets,
so it's sent out
eight packets, for example, and gets eight
acknowledgments back,
it knows the network can support that sending rate.
So it increases its window by one.
So the next time, it sends out nine packets
and expects to get nine acknowledgments back
over the next round trip cycle.
And if it successfully does that,
it increases the window again.
And it sends 10 packets, and expects
to get 10 acknowledgments back.
And we see that each round trip
it gradually increases the sending rate by
one. So it sends 8 packets,
then 9, then 10, then 11,
and 12, and keeps gradually, linearly,
increasing its rate.
Up until the point that something gets lost.
And if a packet gets lost?
You’ll be able to detect that because,
as we saw in the previous lecture,
you'll get a triple duplicate acknowledgement.
And that indicates that one of the
packets got lost, but the rest of
the data in the window was received.
And what you do at that point,
is you do a multiplicative decrease in
the window. You halve the window.
So, in this case, the sender was
sending with a window of
12 packets, and it successfully sent that.
And then it tried to send,
tried to increase its rate, realised it
didn't work, realised something got lost,
and so it halved its window back down to six.
And then it switches back, and goes back to
the gradual additive increase.
And it follows this sawtooth pattern.
Gradual linear increase, one packet more each
round trip time.
Until it sends too fast, causes a
packet to be lost because it overflows
a queue, halves its sending rate,
and then gradually starts increasing it again.
It follows this sawtooth pattern. Gradual increase,
quick back-off; gradual increase, quick back-off.
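The sawtooth can be sketched in a few lines. This is only an illustration of additive increase, multiplicative decrease, under the assumption that the path has a fixed `capacity` in packets per round trip and that any window above it causes exactly one loss. Both the function name and that loss model are made up for the example.

```python
def congestion_avoidance(start_window=8, capacity=12, rounds=16):
    """Return the congestion window used in each round trip.

    `capacity` is an assumed limit: a window larger than this loses a
    packet, which the sender detects via a triple duplicate ACK.
    """
    cwnd = start_window
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd > capacity:
            # Multiplicative decrease: halve the window on loss.
            cwnd = max(1, cwnd // 2)
        else:
            # Additive increase: one more packet per round trip.
            cwnd += 1
    return history


print(congestion_avoidance())
# [8, 9, 10, 11, 12, 13, 6, 7, 8, 9, 10, 11, 12, 13, 6, 7]
```

The window climbs from 8 to 12, the attempt to send 13 fails, and the window halves to 6 before climbing again, matching the example above.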
The other way TCP can detect the
loss is by what’s known as a
time out. It’s sending the packets,
and suddenly the acknowledgements stop coming back entirely.
And this means that either the receiver
has crashed, the receiving system has gone
away, or perhaps more likely the network has failed.
And the data it’s sending is either
not reaching the receiver, or the reverse path has failed,
and the acknowledgments are not coming back.
At that point, after nothing has come back for a while,
it assumes a timeout has happened,
and resets the window down to the initial window.
And in the example we see on
the slide, at time 14 we've got
a timeout, and it resets and the
window goes back to one packet.
At that point, it re-enters slow start.
It starts again from the beginning.
And whether your initial window is one
packet, or three packets, or ten packets,
it starts in the beginning, and it
re-enters slow start, and it tries again
for the connection.
And if this was a transient failure,
that will probably succeed. If it wasn’t,
it may end up in yet another
timeout, while it takes time for the
network to recover, or
for the system you're talking to,
to recover, and it will be a
while before it can successfully send a
packet. But, when it does, when the
network recovers, it starts sending again,
and resets the connection from the beginning.
How long should the timeout be?
Well, the standard says the larger of
one second, or the average round trip
time plus four times the statistical variance
in the round trip time.
And, if you're a statistician, you’ll recognise
that the RTT plus four times the
variance, if you're assuming a normal distribution of
round trip time samples, accounts for 99%
of the samples falling within range.
So it's finding the 99th percentile of
the expected time to get an acknowledgement back.
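As a rough sketch of that calculation, the following follows my understanding of the retransmission timer rules in RFC 6298. The smoothing gains of 1/8 and 1/4 and the one second floor come from that document, not from the lecture, and the sketch ignores clock granularity. Note that the "variance" term is really a smoothed estimate of how far samples deviate from the average. The class name is made up.

```python
class RtoEstimator:
    """Track a smoothed RTT and its variation, and derive a timeout."""

    ALPHA = 1 / 8   # gain for the smoothed round trip time
    BETA = 1 / 4    # gain for the round trip time variation
    MIN_RTO = 1.0   # seconds; the timeout is never shorter than this

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, sample):
        """Feed one RTT sample (seconds); return the timeout (seconds)."""
        if self.srtt is None:
            # First measurement initialises both estimates.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # Update the variation first, using the old smoothed RTT.
            deviation = abs(self.srtt - sample)
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * deviation
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        return max(self.MIN_RTO, self.srtt + 4 * self.rttvar)


estimator = RtoEstimator()
for rtt in (0.100, 0.120, 0.090, 0.110):
    print(estimator.update(rtt))
```

For typical round trip times of around 100 milliseconds the computed value is well under a second, so in this sketch the one second floor is what actually sets the timeout.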
Now, TCP follows this sawtooth behaviour,
with gradual additive increase in the sending
rate, and then a back-off, halving its
sending rate, and then a gradual increase again.
And we see this in the top
graph on the slide which is showing a
measured congestion window for a real TCP flow.
And, after the dynamics of the slow start
at the beginning, we see it follows this sawtooth pattern.
How does that affect the rest of the network?
Well, the packets are, at some point,
getting queued up at whatever the bottleneck link is.
And the second graph we see on
the left, going down, is the size of the queue.
And we see that as the sending
rate increases, the queue gradually builds up.
Initially the queue is empty, and as
it starts sending faster, the queue gradually gets fuller.
And at some point the queue gets full, and overflows.
And when the queue gets full,
when the queue overflows, when packets get
lost, TCP halves its sending rate.
And that causes the queue to rapidly
empty, because there are fewer packets coming in,
so the queue drains.
But what we see is that just
as the queue is getting to empty,
the rate is starting to increase again.
Just as the queue gets to the point
where it would have nothing to send,
the rate starts picking up, such that
the queue starts to gradually refill.
So the queues in the routers also
follow a sawtooth pattern. They gradually fill
up until they get to a full point.
And then the rate halves, the queue
empties rapidly because
there's much less traffic coming in,
and as it's emptying the rate at
which the sender is sending is gradually
increasing again, and the queue size oscillates.
And we see the same thing happens
with the round trip time, in the
third of the graphs, as the queue gradually
fills up, the round trip time goes
up, and up, and up, it's taking
longer for the packets because they're queued up somewhere.
And then the rate reduces, the queue
drops, the round trip time drops.
And it gradually, as the rate picks up afterwards
back into congestion avoidance, the queue gradually
fills, the round trip time gradually increases.
So, both window size, and the queue
size, and the round trip time,
all follow this characteristic sawtooth pattern.
What's interesting though, if we look at
the fourth graph down on the left,
is we're looking at the rate at
which packets are arriving at the receiver.
And we see that the rate at
which packets are arriving at the receiver
is pretty much constant.
What's happening is that the packets are
being queued up at the link,
and as the queue fills there's more
and more packets queued up
at the bottleneck link. And when TCP
backs-off, when it reduces its window,
that lets the queue drain. But the
queue never quite empties. We just see
very occasional drops where the queue gets
empty, but typically the queue always has
something in it.
It's emptying rapidly, it’s getting less and
less data in it, but the queue,
if the buffer is sized right,
if the window is chosen right, never quite empties.
So the TCP sender is following this
sawtooth pattern, with its sending window,
which is gradually filling up the queues.
And then the queues are gradually draining
when TCP backs-off and halves its rate,
but the queue never quite empties.
It always has some data to send,
so the receiver is always receiving data.
So, even though the sender's following the
sawtooth pattern, the receiver receives constant rate
data the whole time,
at approximately the bottleneck bandwidth.
And that's the genius of TCP.
It manages, by following this additive increase,
multiplicative decrease, approach, it manages to adapt
the rate such that the buffer never
quite empties, and the data continues to be delivered.
And for that to work, it needs
the router to have enough buffering capacity
in it. And the amount of buffering
the router needs, is the bandwidth times
the delay of the path. And too
little buffering in the router leads to
the queue overflowing, and it not quite
managing to sustain the rate. Too much,
you just get what’s known as buffer bloat.
It's safe, I mean in terms of
throughput, it keeps receiving the data.
But the queues get very big,
and they never get anywhere near empty,
so the amount of data queued up
increases, and you just get increased latency.
So that's TCP Reno. It's really effective
at keeping the bottleneck fully utilised.
But it trades latency for throughput.
It tries to fill the queue,
it's continually pushing, it’s continually queuing up data.
Making sure the queue is never empty,
so provided there’s enough buffering in the
network there are always packets being delivered.
And that's great, if your goal is
to maximise the rate at which information
is delivered. TCP is really good at
keeping the bottleneck link fully utilised.
It’s really, really good at delivering data
as fast as the network can support it.
But it trades that off for latency.
It's also really good at making sure
there are queues in the network,
and making sure that the network is
not operating at its lowest possible latency.
There's always some data queued up.
There are two other limitations,
other than increased latency.
First, is that TCP assumes that losses
are due to congestion.
And historically that's been true. Certainly in
wired links, packet loss is almost always
caused by a queue filling up,
overflowing, and a router not having space
to enqueue a packet.
In certain types of wireless links,
in 4G or in WiFi links,
that's not always the case, and you
do get packet loss due to corruption.
And TCP will treat this as a
signal to slow down. Which means that
TCP sometimes behaves sub-optimally on wireless links.
And there's a mechanism called Explicit Congestion
Notification, which we'll talk about in one
of the later parts of this lecture,
which tries to address that.
The other, is that the congestion avoidance
phase can take a long time to ramp up.
On very long distance links, very high capacity
links, it can take a long time
to get up to, after packet loss,
it can take a very long time
to get back up to an appropriate rate.
And there are some occasions with very
fast long distance links, where it performs
poorly, because of the way the congestion avoidance works.
And there's an algorithm known as TCP
Cubic, which I'll talk about in the
next part, which tries to address that.
And that's the basics of TCP.
The basic TCP congestion control algorithm is
a sliding window algorithm, where the window
indicates how many packets you’re allowed to
send before getting an acknowledgement.
The goal of the slow start and
the congestion avoidance phases, and the additive
increase, multiplicative decrease, is to adapt the
size of the window to match the network capacity.
It always tries to match the size
of the window exactly to the capacity,
so it's making the most use of the network resources.
In the next part, I’ll move on
and talk about an extension to the
TCP Reno algorithm, known as TCP Cubic,
which is intended to improve performance on
very fast and long distance networks.
And then, in the later parts,
we'll talk about extensions to reduce latency,
and to work on wireless links where
there are non-congestive losses.
The third part of the lecture talks about the TCP Cubic congestion
control algorithm, a widely used extension to TCP that improves its
performance on fast, long-distance, networks. The lecture discusses
the limitations of TCP Reno that led to the development of Cubic,
and outlines how Cubic congestion control improves performance but
retains fairness with Reno.
Slides for part 3
In the previous part, I spoke about TCP Reno.
TCP Reno is the default congestion control
algorithm for TCP, but it's actually not
particularly widely used in practice these days.
What most modern TCP versions use is,
instead, an algorithm known as TCP Cubic.
And the goal of TCP cubic is
to improve TCP performance on fast long distance networks.
So the problem with TCP Reno,
is that its performance can be comparatively
poor on networks with large bandwidth-delay products.
That is, networks where the product,
what you get when you multiply the
bandwidth of the network, in number of
bits per second, and the delay,
the round trip time of the network, is large.
Now, this is not a problem that
most people have, most of the time.
But, it's a problem that began to
become apparent in the early 2000s when
people working at organisations like CERN were
trying to transfer very large data files
across fast long distance
networks between CERN and the universities that
were analysing the data.
For example, CERN is based at Geneva,
in Switzerland, and some of the big
sites for analysing the data are based
at, for example, Fermilab just outside Chicago in the US.
And in order to get the data
from CERN to Fermilab, from Geneva to Chicago,
they put in place multi-gigabit transatlantic links.
And if you think about the congestion window needed to
make good use of a link like
that, you realise it actually becomes quite large.
If you assume the link is 10
gigabit per second, which was cutting edge
in the early 2000s, but it is
now relatively common for high-end links these days,
and assume 100 milliseconds round trip time,
which is possibly even slightly an under-estimate
for the path from Geneva to Chicago,
in order to make good use
of that, you need a congestion window
which equals the bandwidth times the delay.
And 10 gigabits per second, times 100
milliseconds, gives you a congestion window of
about 100,000 packets.
And, partly, it takes TCP a long
time, a comparatively long time, to slow
start up to a 100,000 packet window.
But that's not such a big issue,
because that only happens once at the
start of the connection. The issue,
though, is in congestion avoidance.
If one packet is lost on the
link, out of a window of 100,000,
that will cause TCP to back-off and
halve its window. And it then increases
sending rate again, by one packet every round trip time.
And backing off from a 100,000 packet window
to a 50,000 packet window, and then
increasing by one each time, means it
takes 50,000 round trip times to recover
back up to the full window.
50,000 round trip times, when the round
trip time is 100 milliseconds, is about 1.4 hours.
So it takes TCP about one-and-a-half hours
to recover from a single packet loss.
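The arithmetic here is easy to check. The sketch below assumes 1500 byte packets, which is my assumption and not something stated above. With that packet size the window comes out nearer 83,000 packets than 100,000, so the figures quoted are rounded up, but the conclusion is the same. The function names are made up for illustration.

```python
def window_packets(bandwidth_bps, rtt_s, packet_bytes=1500):
    """Bandwidth-delay product expressed as a window in packets.

    The 1500 byte packet size is an assumption for illustration.
    """
    return bandwidth_bps * rtt_s / (packet_bytes * 8)


def reno_recovery_seconds(window, rtt_s):
    """Time for Reno to climb back from window/2 to window.

    Reno adds one packet per round trip, so it needs window/2 round trips.
    """
    return (window / 2) * rtt_s


bandwidth = 10e9   # 10 gigabits per second
rtt = 0.1          # 100 milliseconds

print(window_packets(bandwidth, rtt))               # roughly 83,000 packets
print(reno_recovery_seconds(100_000, rtt) / 3600)   # roughly 1.4 hours
```

The second function makes the problem plain: the recovery time grows with both the window and the round trip time, so doubling either the link speed or the distance doubles how long a single loss costs you.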
And, with a window of 100,000 packets,
you're sending enough data, at 10 gigabits per second,
that the imperfections in the optical fibre,
and imperfections in the equipment that is
transmitting the packets, become significant.
And you're likely to just see occasional
random packet losses, just because of imperfections
in the transmission medium, even if there's
no congestion. And this was becoming a
limiting factor, this was becoming a bottleneck
in the transmission.
It was becoming not possible to build
a network that was reliable enough,
that it never lost any packets in
transferring several hundreds of billions of packets
to exchange the data between CERN and
the sites which were doing the analysis.
TCP cubic is one of a range
of algorithms which were developed to try
and address this problem. To try and
recover much faster than TCP Reno would,
in the case when you had very
large congestion windows, and small amounts of packet loss.
So the idea of TCP cubic,
is that it changes the way the
congestion control works in the congestion avoidance phase.
So, in congestion avoidance, TCP cubic will
increase the congestion window faster than TCP
Reno would, in cases where the window is large.
In cases where the window is relatively
small, in the types of networks where
Reno has good performance, TCP cubic behaves
in a very similar way.
But as the windows get bigger,
as it gets to a regime where
TCP Reno doesn't work effectively, TCP cubic
gets more aggressive in adapting its congestion
window, and increases the congestion window much
more quickly in response to loss.
However, as the rate of increase,
as the window approaches the value it
was before the loss, it slows its
rate of increase, so it starts increasing
rapidly, slows its rate of increase
as it approaches the previous value.
And if it then successfully manages to
send at that rate, if it successfully
moves above the previous sending rate,
then it gradually increases sending rate again.
It’s called TCP Cubic because it follows
a cubic equation to do this.
The shape of the equation, the shape
of the curve, we see on the
slide for TCP cubic is following a cubic graph.
The paper listed on the slide,
the paper shown on the slide,
from Injong Rhee and his collaborators,
is the paper which describes the algorithm in detail.
And it was eventually specified in IETF
RFC 8312 in 2018, although it's been
probably the most widely used TCP variant
for a number of years before that.
The details of how it works:
TCP cubic is a somewhat more complex
algorithm than Reno.
There are two parts to the behaviour.
If a packet is lost when a
TCP cubic sender is in the congestion avoidance phase,
it does a multiplicative decrease.
However, unlike TCP Reno, which does a
multiplicative decrease by multiplying by a factor
of 0.5, that is, it halves its
sending rate if a single packet is lost,
TCP cubic multiplies its rate by 0.7.
So, instead of dropping back down to
50% of its previous sending rate,
it drops down to 70% of the sending rate.
It backs-off less, it's more aggressive.
It’s more aggressive at using bandwidth.
It reduces its sending rate in response
to loss, but by a smaller fraction.
After it's backed-off, TCP cubic also changes
the way in which it increases its sending rate in future.
So we saw in the previous slide,
TCP Reno increases its congestion window by
one, for every round trip when it
successfully sends data.
So if the window backs off to
10, then it goes to 11 the
next round trip time, then 12,
and 13, and so on, with a
linear increase in the window.
TCP cubic, on the other hand,
sets the window as we see in
the equation on the slide. It sets
the window to be a constant,
C, times (T-K) cubed, plus Wmax.
Where the constant, C, is set to
0.4, which is a threshold which controls
how fair it is to TCP Reno,
and was determined experimentally.
T is the time since the packet
loss. K is the time it will
take to increase the window back up to
the maximum it was before the packet
loss, and Wmax is the maximum window
size it reached before the loss.
And this gives the cubic growth function,
which we saw on the previous slide,
where the window starts to increase quickly,
the growth slows as it approaches that previous value
it reached just before the loss,
and if it successfully passes through that
point, the rate of growth increases again.
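As a sketch of that growth function, the code below evaluates the window as C times (T-K) cubed, plus Wmax, with C set to 0.4 and the back-off factor set to 0.7 as described. The formula for K is not given above. Here it is derived by requiring the window at T = 0 to equal 0.7 times Wmax, which I believe matches RFC 8312, but treat it as an assumption. Time is in seconds and the window is in packets. The function names are made up.

```python
C = 0.4      # constant controlling how aggressive the growth is
BETA = 0.7   # multiplicative decrease factor applied on loss


def cubic_k(w_max):
    """Time, in seconds, to grow back to w_max after a loss.

    Derived by requiring cubic_window(0, w_max) == BETA * w_max.
    """
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t, w_max):
    """Congestion window, in packets, t seconds after the loss."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


w_max = 100
k = cubic_k(w_max)
for t in (0.0, k / 2, k, 1.5 * k, 2 * k):
    print(f"t={t:5.2f}s  window={cubic_window(t, w_max):7.2f}")
```

Running this shows the three phases: the window starts at 70% of Wmax and rises quickly, flattens out as T approaches K, and then accelerates again once it has passed Wmax.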
Now, that's the high-level version. And we
can already see it's more complex than
the TCP Reno equation. The algorithm on
the right of the slide, which is
intentionally presented in a way which is
completely unreadable here,
shows the full details. The point is
that there's a lot of complexity here.
The basic equation, the basic back-off to
0.7 times and then follow the cubic
equation, to increase rapidly, slow the rate
of increase, and then increase rapidly again
if it successfully gets past the previous bottleneck point,
is enough to illustrate the key principle.
The rest of the details are there
to make sure it's fair with TCP
Reno on links which are slower,
or where the round trip time is shorter.
And so, in the regime where TCP
Reno can successfully make use of the
link, TCP Cubic behaves the same way.
And, as you get into a regime
where Reno can't effectively make use of
the capacity, because it can't sustain a
large enough congestion window,
then cubic starts to behave differently,
and starts to switch to the cubic
equation. And that allows it to recover
from losses more quickly, and to more
effectively continue to make use of higher
bandwidths and higher latency paths.
TCP cubic is the default in most
modern operating systems. It’s the default in
Linux, it's the default in FreeBSD,
I believe it's the default in macOS.
Microsoft Windows has an algorithm called Compound
TCP which is a different algorithm,
but has a similar effect.
It’s much more complex than TCP Reno.
The core response, the back off to
70% and then follow the characteristic cubic
curve, is conceptually relatively straightforward, but once
you start looking at the details of
how it behaves, there gets to be a lot of complexity.
And most of that is in there
to make sure it's reasonably fair to
TCP, to TCP Reno, in the regime
where Reno typically works. But it improves
performance for networks with longer round trip
times and higher bandwidths.
Both TCP Cubic, and TCP Reno,
use packet loss as
a congestion signal. And they both eventually
fill the router buffers.
And TCP cubic does so more aggressively
than Reno. So, in both cases,
they're trading off latency for throughput.
They're trying to make sure the buffers are full.
They're trying to make sure
the buffers in the intermediate routers are full.
And they're both making sure that they
keep the congestion window large enough to
keep the buffers fully utilised, so packets
keep arriving at the receiver at all times.
And that's very good for achieving high
throughput, but it pushes the latency up.
So, again, they’re trading-off increased latency for
good performance, for good throughput.
And that's what I want to say
about Cubic. Again, the goal is to
use a different response function to improve
throughput on very fast, long distance, links,
multi-gigabit per second transatlantic links, being the example.
And the goal is to make good
use of throughput.
In the next part I’ll talk about
alternatives which, rather than focusing on throughput,
focus on keeping latency bounded whilst achieving good throughput.
The 4th part of the lecture discusses how both the Reno and Cubic
algorithms impact latency. It shows how their loss-based response
to congestion inevitably causes router queues to fill, increasing
path latency, and discusses how this is unavoidable with loss-based
congestion control. It introduces the idea of delay-based congestion
control and the TCP Vegas algorithm, and highlights its potential benefits
and deployment challenges. Finally, TCP BBR is briefly introduced as
an experimental extension that aims to achieve some of the benefits
of delay-based congestion control, in a deployable manner.
Slides for part 4
In the previous parts, I’ve spoken about
TCP Reno and TCP cubic. These are
the standard, loss based, congestion control algorithms
that most TCP implementations use to adapt
their sending rate. These are the standard
congestion control algorithms for TCP.
What I want to do in this
part is recap why these algorithms cause
additional latency in the network, and talk
about two alternatives which try to adapt
the sending rate of TCP without building
up queues, and without
overloading the network and causing too much latency.
So, as I mentioned, TCP Cubic and
TCP Reno both aim to fill up the network.
They use packet loss as a congestion signal.
So the way they work is they
gradually increase their sending rate, they’re in
either slow start or congestion avoidance phase,
and they’re always gradually increasing the sending
rates, gradually filling up the queues in
the network, until those queues overflow.
At that point a packet is lost.
TCP backs off its sending rate,
it backs-off its window, which allows the
queue to drain, but as the queue
is draining, both
Reno and Cubic are increasing their sending
rate, are increasing the sending window,
so as to gradually start filling up
the queue again.
As we saw, the queues in the
network oscillate, but they never quite empty.
And both Reno and Cubic, the goal
is to keep some packets queued up
in the network, make sure there's always
some data queued up, so they can
keep delivering data.
And, no matter how big a queue
you put in the network, no matter
how much memory you give the routers
in the network, TCP Reno and TCP
cubic will eventually cause it to overflow.
They will keep sending, they'll keep increasing
the sending rate, until whatever queue is
in the network is full, and it overflows.
And the more memory in the routers,
the more buffer in the routers,
the longer that queue will get and
the worse the latency will be.
But in all cases, in order to
achieve very high throughput, in order to
keep the network busy, keep the bottleneck
link busy, TCP Reno and TCP cubic
queue some data up.
And this adds latency.
It means that, whenever there’s TCP Reno,
whenever there’s TCP cubic flows, using the
network, the queues will have data queued up.
There’ll always be data queued up for
delivery. There's always packets waiting for delivery.
So it forces the network to work
in a regime where there's always some data queued up.
Now, this is a problem for real-time
applications. It’s a problem if you're running
a video conferencing tool, or a telephone
application, or a game, or a real
time control application, because you want low
latency for those applications.
So it would be desirable if we
could have an alternative to TCP
Reno or TCP cubic that can achieve
good throughput for TCP, without forcing the
queues to be full.
One attempt at doing this was a proposal called TCP Vegas.
And the insight from TCP Vegas is that
you can watch the rate of growth,
or increase, of the queue, and use
that to infer whether you're sending faster,
or slower, than the network can support.
The insight was, if you're sending,
if a TCP is sending, faster than
the maximum capacity a network can deliver
at, the queue will gradually fill up.
And as the queue gradually fills up,
the latency, the round trip time, will gradually increase.
TCP Cubic, and TCP Reno, wait until
the queue overflows, wait until there's no
more space to put new packets in,
and a packet is lost, and at
that point they slow down.
The insight for TCP Vegas was to
watch as the delay increases, and as
it sees the delay increasing, it slows
down before the queue overflows.
So it uses the gradual increase in
the round trip time, as an indication
that it should send slower.
And as the round-trip time reduces,
as the round-trip time starts to drop,
it treats that as an indication that
the queue is draining, which means it can send faster.
It wants a constant round trip time.
And, if the round trip time increases,
it reduces its rate; and if the
round-trip time decreases, it increases its rate.
So, it's trying to balance its rate
with the round trip time, and not
build or shrink the queues.
And because you can detect the queue
building up before it overflows, you can
take action before the queue is completely
full. And that means the queue is
running with lower occupancy, so you have
lower latency across the network.
It also means that because packets are
not being lost, you don't need to
re-transmit as many packets. So it improves
the throughput that way, because you're not
resending data that you've already sent and has gotten lost.
And that's the fundamental idea of TCP
Vegas. It doesn't change the slow start behaviour at all.
But, once you're into congestion avoidance,
it looks at the variation in round
trip time rather than looking at packet
loss, and uses that to drive the
variation in the speed at which it’s sending.
The details of how it works.
Well, first, it tries to estimate what
it calls the base round trip time.
So every time it sends a packet,
it measures how long it takes to
get a response. And it tries to
find the smallest possible response time.
The idea being that the smallest time
it gets a response, would be the
time when the queue is at its emptiest.
It may not get the actual,
completely empty, queue, but the smaller the
response time, it's trying to estimate the
time it takes when there's nothing else in the network.
And anything on top of that indicates
that there is data queued up somewhere in the network.
Then it calculates an expected sending rate.
It takes the window size, which indicates
how many packets it's supposed to send
in that round-trip time,
how many bytes of data it’s supposed
to send in that round-trip time,
and it divides it by the base
round trip time. So if you divide
number of bytes by time, you get
a bytes per second, and that gives
you the rate at which it should be sending data.
And if the network can
support sending at that rate, it should
be able to deliver that window of
packets within a complete round trip time.
And, if it can’t, it will take
longer than a round trip time to
deliver that window of packets, and the
queues will be gradually building up. Alternatively,
if it takes less than a round
trip time, this is an indication that
the queues are decreasing.
And it measures the actual rate at
which it sends the packets.
And it compares them.
And if the actual rate at which
it's sending packets is less than the
expected rate, if it's taking longer than
a round-trip time to deliver the complete
window worth of packets, this is a
sign that the packets can’t all be delivered.
And it, you know, it's trying to send too
much. It’s trying to send at too
fast a rate, and it should reduce
its rate and let the queues drop.
Equally, in the other case it should
increase its rate, and measuring the difference
between the actual and the expected rates,
it can measure whether the queue is growing or shrinking.
And TCP Vegas compares the expected rate
with the actual rate, the rate it
actually manages to send at, the rate at which it gets
the acknowledgments back.
And it adjusts the window.
And if the expected rate, minus the
actual rate, is less than some threshold,
that indicates that it should increase its
window. And if the expected rate,
minus the actual rate, is greater than
some other threshold, then it should decrease the window.
That is, if data is arriving at
the expected rate, or very close to
it, this is probably a sign that
the network can support a higher rate,
and you should try sending a little bit faster.
Alternatively, if data is arriving slower
than it's being sent,
this is a sign that you're sending too fast and you
should slow down.
And the two thresholds, R1 and R2,
determine how close you have to be
to the expected rate in order to speed up, and how far
away from it you have to be in order to slow down.
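The comparison just described can be sketched as a single window adjustment per round trip. This is a simplified illustration and not the real TCP Vegas code. The step size of one packet, the units of the thresholds (packets per second here), and the function name are all assumptions made for the example.

```python
def vegas_adjust(cwnd, base_rtt, rtt, r1, r2):
    """Return the new congestion window after one round trip.

    cwnd     -- current window, in packets
    base_rtt -- smallest round trip time seen so far, in seconds
    rtt      -- round trip time just measured, in seconds
    r1, r2   -- lower and upper thresholds on (expected - actual),
                in packets per second (assumed units)
    """
    expected = cwnd / base_rtt   # rate if nothing were queued
    actual = cwnd / rtt          # rate actually achieved
    diff = expected - actual
    if diff < r1:
        # Close to the expected rate: the path can take more.
        return cwnd + 1
    if diff > r2:
        # Well below the expected rate: a queue is building up.
        return max(1, cwnd - 1)
    return cwnd                  # in between: hold steady


print(vegas_adjust(10, base_rtt=0.1, rtt=0.1, r1=10, r2=30))    # 11
print(vegas_adjust(10, base_rtt=0.1, rtt=0.125, r1=10, r2=30))  # 10
print(vegas_adjust(10, base_rtt=0.1, rtt=0.2, r1=10, r2=30))    # 9
```

The three calls show the three cases: no queueing delay, so the window grows; a little queueing delay, so the window holds; and a lot of queueing delay, so the window shrinks, all without any packet being lost.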
And the result is that TCP Vegas
follows a much smoother transmission rate.
Unlike TCP Reno, which follows the characteristic
sawtooth pattern, or TCP cubic which follows the
cubic equation to change its rate,
both of which adapt quite abruptly whenever
there's a packet loss,
TCP Vegas makes a gradual change.
It gradually increases, or decreases, its sending
rate in line with the variations in
the queues. So, it’s a much smoother
algorithm, which doesn't continually build up and
empty the queues.
Because the queues are not continually building
up, not continually being filled, this keeps
the latency down
while still achieving reasonably good performance.
TCP Vegas is a good idea in principle.
This idea is known as delay-based congestion
control, and I think it's actually a
really good idea in principle. It reduces
the latency, because it doesn't fill the queues.
It reduces the packet loss, because it's
not pushing the queues
to overflow and causing packets to be
lost. So the only packet losses you
get are those caused by transmission problems.
And this reduces unnecessary retransmissions, where you have
to retransmit packets because you forced the
network into overload, and forced it to
lose the packets, and it reduces the latency.
The problem with TCP Vegas is that
it doesn't interwork with
TCP Reno or TCP Cubic.
If you have any TCP Reno or
Cubic flows on the network, they will
aggressively increase their sending rate and try
to fill the queues, and push
the queues into overload.
And this will increase the round-trip time,
reduce the rate at which Vegas can
send, and it will force TCP Vegas to slow down.
Because TCP Vegas sees the queues increasing,
because Cubic and Reno are intentionally trying
to fill those queues, and if the
queues increase, this causes Vegas to slow down.
That gradually means there's more space in
the queues, which Cubic and Reno will
gradually fill-up, which causes Vegas to slow
down, and they end up in a
spiral, where the TCP Vegas flows get
pushed down to zero, and the Reno
or Cubic flows use all of the capacity.
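To get a feel for the direction of this effect, here is a deliberately crude toy model of one loss-based flow and one delay-based flow sharing a bottleneck. Everything in it is an assumption made for illustration: the function name `simulate`, the fluid approximation of the queue, the link and buffer sizes, and the thresholds `alpha` and `beta`. It is not a faithful model of Reno or Vegas, and it only reacts once per round trip.

```python
def simulate(rounds=400, capacity=20, buffer=20, alpha=2, beta=4):
    """Toy model of a Reno-style and a Vegas-style flow sharing one link.

    capacity    -- packets the path holds per round trip without queueing
    buffer      -- packets the bottleneck queue holds before it overflows
    alpha, beta -- thresholds on the delay-based flow's own queued packets
    Returns a list of (reno_window, vegas_window) pairs, one per round trip.
    """
    reno, vegas = 1, 1
    history = []
    for _ in range(rounds):
        total = reno + vegas
        queued = max(0, total - capacity)      # packets sitting in the queue
        vegas_queued = vegas * queued / total  # delay-based flow's share of it

        # Loss-based flow: grow until the queue overflows, then halve.
        if total > capacity + buffer:
            reno = max(1, reno // 2)
        else:
            reno += 1

        # Delay-based flow: back off as soon as its queued share is too large.
        if vegas_queued < alpha:
            vegas += 1
        elif vegas_queued > beta:
            vegas = max(1, vegas - 1)

        history.append((reno, vegas))
    return history


history = simulate()
tail = history[-200:]
mean_reno = sum(r for r, _ in tail) / len(tail)
mean_vegas = sum(v for _, v in tail) / len(tail)
print(f"mean Reno-style window {mean_reno:.1f}, "
      f"mean Vegas-style window {mean_vegas:.1f}")
```

In this toy model the delay-based flow settles at a much smaller window than the loss-based one, because it backs off every time the other flow fills the queue. The model is too simple to show the flow being squeezed all the way to zero, but it points in the same direction.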
So if we only had TCP Vegas
in the network, I think it would
behave really nicely, and we'd get really
good, low latency, behaviour from the network.
Unfortunately we're in a world where Reno,
and Cubic, have been deployed everywhere.
And without a step change, without an
overnight switch where we turn off Cubic,
and we turn off Reno, and we
turn on Vegas everywhere, we can't deploy
TCP Vegas, because it always loses out to
Reno and Cubic.
So, it's a good idea in principle,
but in practice it can't be used
because of the deployment challenge.
As I say, it's a good idea
in principle, and the idea of using
delay as a congestion signal is a
good idea in principle, because we can
get something which achieves lower latency.
Is it possible to deploy a different
algorithm? Maybe the problem is not the principle,
maybe the problem is the algorithm in TCP Vegas?
Well, people are trying alternatives which are delay based.
And the most recent attempt at this
is an algorithm called TCP BBR,
Bottleneck Bandwidth and Round-trip time.
And again, this is a proposal that
came out of Google. And one of
the co-authors, if you look at the
paper on the right, is Van Jacobson,
who was the original designer of TCP
congestion control. So there's clearly some smart
people behind this.
The idea is that it tries to explicitly
measure the round-trip time as it sends
the packets. It tries to explicitly measure
the sending rate, in much the same way that
TCP Vegas does. And, based on those
measurements, and some probes where it varies
its rate to try and find if
it's got more capacity, or try and
sense if there is other traffic on the network,
it tries to directly set a congestion
window that matches the network capacity,
based on those measurements.
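As a rough sketch of that idea, the window can be estimated as a bandwidth-delay product from the measurements. The function name `bbr_style_window`, the sample format, and the `gain` parameter are hypothetical and chosen for this example. The real BBR algorithm uses time-windowed filters, gain cycles, and probing phases that are not modelled here.

```python
def bbr_style_window(samples, gain=1.0):
    """Estimate a congestion window, in packets, from path measurements.

    samples -- iterable of (delivery_rate, rtt) pairs, with the delivery
               rate in packets per second and the round-trip time in seconds
    gain    -- hypothetical multiplier on the estimate
               (1.0 means exactly one bandwidth-delay product)
    """
    samples = list(samples)
    bottleneck_bw = max(rate for rate, _ in samples)  # best rate the path delivered
    min_rtt = min(rtt for _, rtt in samples)          # RTT with the least queueing
    return gain * bottleneck_bw * min_rtt             # bandwidth-delay product
```

Taking the maximum delivery rate and the minimum round-trip time means the estimate describes the path itself, rather than whatever queue happens to have built up on it.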
And, because this came out of Google,
it got a lot of press,
and Google turned it on for a
lot of their traffic. I know they
were running it for YouTube for a
while, and a lot of people saw
this, and jumped on the bandwagon.
And, for a while, it was starting
to get a reasonable amount of deployment.
The problem is, it turns out not to work very well.
And Justine Sherry at Carnegie Mellon University,
and her PhD student Ranysha Ware,
did a really nice bit of work
that showed that it is incredibly unfair to
regular TCP traffic.
And, it's unfair in kind-of the opposite
way to Vegas. Whereas TCP Reno and
TCP Cubic would force TCP Vegas flows
down to nothing, TCP BBR is unfair
in the opposite way, and it demolishes
Reno and Cubic flows, and causes tremendous
amounts of packet loss for those flows.
So it's really much more aggressive than
the other flows in certain cases,
and this leads to really quite severe unfairness problems.
And the Vimeo link on the slide is a link to the talk at
the Internet Measurement Conference, where Ranysha talks
through that, and demonstrates really clearly that
TCP BBR version 1 is really quite problematic, and
not very safe to deploy on the current network.
And there's a variant called
BBR v2, which is under development,
and seems to be changing,
certainly on a monthly basis, which is
trying to solve these problems. And this
is very much an active research area,
where people are looking to find better alternatives.
So that's the principle of delay-based congestion control.
Traditional TCP, the Reno algorithm and the
Cubic algorithms, intentionally try to fill the
queues, they intentionally try to cause latency.
TCP Vegas is one well-known algorithm which
tries to solve this, and
doesn't work in practice, but in principle
is a good idea, it just has
some deployment challenges, given the installed base
of Reno and Cubic.
And there are new algorithms, like TCP
BBR, which don't currently work well,
but have potential to solve this problem.
And, hopefully, in the future, a future
variant of BBR will work effectively,
and we'll be able to transition to
a lower latency version of TCP.
The use of delay-based congestion control is one way of reducing
network latency. Another is to keep Reno and Cubic-style congestion
control, but to move away from using packet loss as an implicit
congestion signal, and instead provide an explicit congestion
notification from the network to the applications. This part of
the lecture introduces the ECN extension to TCP/IP that provides
such a feature, and discusses its operation and deployment.
Slides for part 5
In the previous parts of the lecture,
I’ve discussed TCP congestion control. I’ve discussed
how TCP tries to measure what the
network's doing and, based on those measurements,
adapt its sending rate to match the
available network capacity.
In this part, I want to talk
about an alternative technique, known as Explicit
Congestion Notification, which allows the network to
directly tell TCP when it's sending too
fast, and needs to reduce its transmission rate.
So, as we've discussed, TCP infers the
presence of congestion in the network through measurement.
If you're using TCP Reno or TCP
Cubic, like most TCP flows in the
network today, then the way it infers
that is because there's packet loss.
TCP Reno and TCP Cubic keep gradually
increasing their sending rates, trying to cause
the queues to overflow.
And they cause a queue overflow,
cause a packet to be lost,
and use that packet loss as the
signal that the network is busy,
that they've reached the network capacity,
and they should reduce the sending rate.
And this is problematic for two reasons.
The first is because it increases delay.
It's continually pushing the queues to be
full, which means the network’s operating with
full queues, with its maximum possible delay.
And the second is because it makes
it difficult to distinguish loss which is
caused because the queues overflowed, from loss
caused because of a transmission error on
a link, so called non-congestive loss,
which you might get due to interference on a wireless link.
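To put a number on the first of these problems, the delay added by a queue is just the time it takes the link to drain the packets waiting in it. The following is a minimal sketch of that arithmetic. The function name `queueing_delay_ms` and the example figures are assumptions chosen for illustration.

```python
def queueing_delay_ms(queued_packets, packet_bytes, link_bps):
    """Time, in milliseconds, for a link to drain the packets in its queue.

    queued_packets -- number of packets waiting ahead of a new arrival
    packet_bytes   -- size of each packet in bytes
    link_bps       -- link rate in bits per second
    """
    queued_bits = queued_packets * packet_bytes * 8
    return queued_bits / link_bps * 1000.0


# Hypothetical example: 100 full-size packets queued on a 10 Mbit/s link.
print(queueing_delay_ms(100, 1500, 10_000_000))
```

The delay grows in proportion to the queue occupancy, which is why an algorithm that keeps the queue full also keeps the latency at its maximum.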
The other approach people have discussed,
is the approach in TCP Vegas,
where you look at variation in queuing latency
and use that as an indication of congestion.
So, rather than pushing the queue until
it overflows, and detecting the overflow,
you watch to see as the queue
starts to get bigger, and use that
as an indication that you should reduce
your sending rate. Or, equally, you spot
the queue getting smaller, and use that
as an indication that you should maybe
increase your sending rate.
And this is conceptually a good idea,
as we discussed in the last part,
because it lets you run TCP with
lower latency. But it's difficult to deploy,
because it interacts poorly with TCP Cubic
and TCP Reno, both of which try
to fill the queues.
As a result, we're stuck with using
Reno and Cubic, and we're stuck with
full queues in the network. But we'd
like to avoid this, we'd like to
go for a lower latency way of
using TCP, and make the network work
without filling the queues.
So one way you might go about
doing this is, rather than have TCP
push the queues to overflow,
have the network tell TCP when
it's sending too fast.
Have something in the network tell the
TCP connections that they are congesting the
network, and they need to slow down.
And this thing is called Explicit Congestion Notification.
Explicit Congestion Notification, the ECN bits,
are present in the IP header.
The slide shows an IPv4 header with
the ECN bits indicated in red.
The same bits are also present in
IPv6, and they're located in the same
place in the packet in the IPv6 header.
The way these are used is as follows.
If the sender doesn't support ECN,
it sets these bits to zero when
it transmits the packet. And they stay
at zero, nothing touches them at that point.
However, if the sender does support ECN,
it sets these bits to have
the value 01, so it sets bit
15 of the header to be 1,
and it transmits the IP packets as
normal, except with this one bit set
to indicate that the sender understands ECN.
If congestion occurs in the network,
if some queue in the network is
beginning to get full, it’s not yet
at the point of overflow but it's
beginning to get full, such that some
router in the network thinks it's about
to start experiencing congestion,
then that router, that router in the
network, changes those bits in the IP
packets, of some of the packets going
past, and sets both of the ECN bits to one.
This is known as an ECN Congestion Experienced mark.
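The bit manipulation involved is small enough to sketch. The ECN field is the two low-order bits of the byte that carries the IPv4 type-of-service, or IPv6 traffic class, value. The codepoint names below follow RFC 3168, but the helper names `get_ecn` and `set_ecn` are assumptions for this example, and the code works on a single byte rather than a real packet.

```python
NOT_ECT = 0b00  # sender does not support ECN
ECT_1 = 0b01    # ECN-capable transport, codepoint ECT(1)
ECT_0 = 0b10    # ECN-capable transport, codepoint ECT(0)
CE = 0b11       # Congestion Experienced, set by a router


def get_ecn(tos_byte):
    """Return the two-bit ECN field from a ToS / traffic class byte."""
    return tos_byte & 0b11


def set_ecn(tos_byte, codepoint):
    """Return the byte with its ECN field replaced, other bits untouched."""
    return (tos_byte & 0b11111100) | (codepoint & 0b11)


# A sender marks a packet as ECN-capable, then a router marks it CE.
sent = set_ecn(0b10111000, ECT_1)
marked = set_ecn(sent, CE)
print(f"{sent:08b} -> {marked:08b}")
```

Either of the two ECN-capable codepoints tells the routers that the sender understands ECN. Only the two ECN bits change along the path, and the upper six bits of the byte are left alone.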
It's a signal. It's a signal from
the network to the endpoints, that the
network thinks it's getting busy, and the
endpoint should slow down.
And that's all it does. It monitors
the occupancy in the queues, and if
the queue occupancy is higher than some
threshold, it sets the ECN bits in
the packets going past, to indicate that
threshold has been reached and the network
is starting to get busy.
If the queue overflows,
if the endpoints keep sending faster and
the queue overflows, then it drops the
packets as normal. The only difference
is that there's some intermediate point where
the network is starting to get busy,
but the queue has not yet overflowed.
And at that point, the network marks
the packets to indicate that it's getting busy.
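The router's decision can be sketched as a simple rule on the queue occupancy. This is a simplification under stated assumptions. The function name `router_handle` and the single fixed threshold are hypothetical, and real routers typically mark only a fraction of the packets, using an active queue management algorithm that is not modelled here.

```python
NOT_ECT = 0b00  # packet from a sender that does not support ECN
CE = 0b11       # Congestion Experienced


def router_handle(queue_len, queue_limit, mark_threshold, ecn):
    """Decide what a router does with an arriving packet.

    queue_len      -- packets currently in the queue
    queue_limit    -- queue size at which arriving packets are dropped
    mark_threshold -- occupancy at which the router starts marking
    ecn            -- ECN field of the arriving packet
    Returns (action, ecn) where action is "forward" or "drop".
    """
    if queue_len >= queue_limit:
        return "drop", ecn        # queue has overflowed: drop as normal
    if queue_len >= mark_threshold and ecn != NOT_ECT:
        return "forward", CE      # getting busy: mark instead of dropping
    return "forward", ecn         # plenty of room: leave the packet alone
```

Packets from senders that do not support ECN pass through unmarked in this sketch, so those senders only find out about congestion when the queue overflows.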
A receiver might get a TCP packet,
a TCP segment, delivered within an IP
packet, where that IP packet has the
ECN Congestion Experienced mark set. Where the
network has changed those two bits in
the IP header to 11, to indicate
that it's experiencing congestion.
What it does at that
point, is it sets a bit in
the TCP header of the acknowledgement packet
it sends back to the sender.
That bit’s known as the ECN Echo
field, the ECE field. It sets this
bit in the TCP header equal to
one on the next packet it sends
back to the sender, after it received
the IP packet, containing the TCP segment,
where that IP packet was marked Congestion Experienced.
So the receiver doesn't really do anything
with the Congestion Experienced mark, other than
set the equivalent mark in the
packet it sends back to the sender.
So it's telling the sender, “I got
a Congestion Experienced mark in one of
the packets you sent”.
When that packet gets to the sender,
the sender sees this bit in the
TCP header, the ECN Echo bit set
to one, and it realises that the
data it was sending
caused a router on the path to
set the ECN Congestion Experienced mark,
which the receiver has then fed back to it.
And what it does at that point,
is it reduces its congestion window.
It acts as-if a packet had been
lost, in terms of how it changes its congestion window.
So if it's a TCP Reno sender,
it will halve its congestion window,
the same way it would if a packet was lost.
If it's a TCP Cubic sender,
it will back off its congestion window
to 70%, and then enter the weird
cubic equation for changing its congestion window.
After it does that, it sets another
bit in the header of the next
TCP segment it sends out. It sets
the CWR bit, the Congestion Window Reduced
bit, in the header to tell the
network and the receiver that it's done it.
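The receiver and sender behaviour just described can be sketched as two small functions. The names `receiver_echo` and `sender_on_ack` are hypothetical, the window is a plain number of packets, and the back-off factors are the ones quoted in this lecture. Real TCP stacks keep more state than this, so treat it as an outline of the logic only.

```python
CE = 0b11  # ECN Congestion Experienced codepoint in the IP header

# Back-off factors as described in the lecture.
BACKOFF = {"reno": 0.5, "cubic": 0.7}


def receiver_echo(ip_ecn):
    """Receiver side: should the next ACK carry the ECN Echo (ECE) bit?"""
    return ip_ecn == CE


def sender_on_ack(cwnd, ece, algorithm="reno"):
    """Sender side: react to an incoming ACK.

    Returns (new_cwnd, cwr), where cwr says whether the next outgoing
    segment should carry the Congestion Window Reduced bit.
    """
    if not ece:
        return cwnd, False                    # no congestion signalled
    new_cwnd = max(cwnd * BACKOFF[algorithm], 1.0)
    return new_cwnd, True                     # reduced: tell the receiver


ece = receiver_echo(CE)
print(sender_on_ack(100, ece, "reno"))
```

The window is reduced exactly as it would be after a loss, but nothing has been dropped, so there is nothing to retransmit.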
So the end result of this,
is that rather than a packet being lost
because the queue overflowed, and then the
acknowledgments coming back indicating, via the triple
duplicate ACK, that a packet had been
lost, and then TCP reducing its congestion
window and re-transmitting that lost packet.
What happens is,
the IP packets, TCP packets, in the
outbound direction get a Congestion Experienced mark
set, to indicate that the network is
starting to get full.
The ECN Echo bit is set on
the reply, and at that point the
sender reduces its window,
as-if the loss had occurred.
And then carries on sending with the
CWR bit set to one on that
next packet. So it has the same
effect, in terms of reducing the congestion window, as would
dropping a packet, but without dropping a
packet. So there's no actual packet loss
here, there’s just a mark to indicate
that the network was getting busy.
So it doesn't have to retransmit data,
and this happens before the queue is
full, so you get lower latency.
So ECN is a mechanism to allow
TCP to react to congestion before packet loss occurs.
It allows routers in the network to
signal congestion before the queue overflows.
It allows routers in the network to
say to TCP, “if you don't slow
down, this queue is going to overflow,
and I’m going to throw your packets away”.
It's independent of how TCP then responds,
whether it follows Reno or Cubic or
Vegas, that doesn't really matter, it's just
an indication that it needs to slow
down because the queues are starting to
build up, and will overflow soon if it doesn't.
And if TCP reacts to that,
reacts to the ECN Echo bit going
back, and the sender reduces its rate,
the queues will empty, the router will
stop marking the packets, and everything will
settle down at a slightly slower rate
without causing any packet loss.
And the system will adapt, and it
will achieve the same sort
of throughput, it will just react earlier,
so you have smaller queues and lower latency.
And this gives you the same throughput
as you would with TCP Reno or
TCP Cubic, but with low latency,
which means it's better for competing video
conferencing or gaming traffic.
And I’ve described the mechanism for TCP,
but there are similar ECN extensions for
QUIC and for RTP, which is the
video conferencing protocol, all designed to achieve
the same goal.
So ECN, I think, is unambiguously a
good thing. It’s a signal from the
network to the endpoints that the network
is starting to get congested, and the
endpoints should slow down.
And if the endpoints believe it,
if they back off,
they reduce their sending rate before the
network is overloaded, and we end up
in a world where we still
achieve good congestion control, good throughput,
but with lower latency.
And, if the endpoints don't believe it,
well, eventually, the routers, the queues,
overflow and they lose packets, and we’re
no worse-off than we are now.
In order to deploy ECN, though,
we need to make changes. We need
to change the endpoints, to change the
end systems, to support these bits in
the IP header, and to support,
to add support for this into TCP.
And we need to update the routers,
to actually mark the packets when they're
starting to get overloaded.
Updating the end points has pretty much
been done by now.
I think every TCP implementation,
implemented in the last 15-20 years or
so, supports ECN, and these days,
most of them have it turned on by default.
And I think we actually have Apple
to thank for this.
ECN, for a long time, was implemented
but turned off by default, because there’d
been problems with some old firewalls which
reacted badly to it, 20 or so years ago.
And, relatively recently, Apple decided that they
wanted these lower latency benefits, and they
thought ECN should be deployed. So they
started turning it on by default in the iPhone.
And they kind-of followed an interesting approach,
in that for iOS 9, a random
subset of 5% of iPhones would turn
on ECN for some of their connections.
And they measured what happened. And they
found out that in the overwhelming majority
of cases this worked fine, and occasionally
it would fail.
And they would call up the network
operators, whose networks were showing problems,
and they would say “your network doesn't
work with iPhones; and currently it's not
working well with 5% of iPhones but
we're going to increase that number,
and maybe you should fix it”.
And then, a year later, when iOS
10 came out, they did this for 50%
of connections made by iPhones. And then
a year later, for all of the connections.
And it's amazing what impact a
popular vendor calling up a network operator can
have on getting them to fix the equipment.
And, as a result,
ECN is now widely enabled by default
in the phones, and the network seems
to support it just fine.
Most of the routers also support ECN.
Although currently relatively few of them seem
to enable it by default. So most
of the endpoints are now
at the stage of sending ECN enabled
traffic, and are able to react to
the ECN marks, but most of the
networks are not currently setting the ECN marks.
This is, I think, starting to change.
Some of the recent DOCSIS standards, which are
the cable modem standards, are starting to
support ECN. We’re starting to see
cable modems, cable Internet connections, which enable
ECN by default.
And, we're starting to see interest from
3GPP, which is the mobile phone standards
body, to enable this in 5G and
6G networks, so I think it's coming,
but it's going to take time.
And, I think, as it comes,
as ECN gradually gets deployed, we’ll gradually
see a reduction in latency across the
networks. It’s not going to be dramatic.
It's not going to suddenly transform the
way the network behaves, but hopefully over
the next 5 or 10 years we’ll
gradually see the latency reducing as ECN
gets more widely deployed.
So that's what I want to say
about ECN. It’s a mechanism by which
the network can signal to the applications
that the network is starting to get
overloaded, and allow the applications to back
off more quickly, in a way which
reduces latency and reduces packet loss.
The final part of the lecture moves on from congestion control and
queueing, and discusses another factor that affects latency: the
network propagation delay. It outlines what the propagation delay is,
and ways in which it can be reduced, including more direct paths and
the use of low-Earth orbit satellite constellations.
Slides for part 6
In this final part of the lecture,
I want to move on from talking
about congestion control, and the impact of
queuing delays on latency, and talk instead
about the impact of propagation delays.
So, if you think about the latency
for traffic being delivered across the network,
there are two factors which impact that latency.
The first is the time packets spend
queued up at various routers within the network.
As we've seen in the previous parts
of this lecture, this is highly influenced
by the choice of TCP congestion control,
and whether Explicit Congestion Notification
is enabled or not.
The other factor, that we've not really
discussed to date, is the time it
takes the packets to actually propagate down
the links between the routers. This depends
on the speed at which the signal
propagates down the transmission medium.
If you're using an optical fibre to
transmit the packets, it depends on the
speed at which the light propagates through the fibre.
If you're using electrical signals in a
cable, it depends on the speed at
which the electrical field propagates down the cable.
And if you're using radio signals,
it depends on the speed of light,
the speed at which the radio signals
propagate through the air.
As you might expect, physically shorter links
have lower propagation delays.
A lot of the time it takes
a packet to get down a long
distance link is just the time it
takes the signal to physically transmit along
the link. If you make the link
shorter it takes less time.
And what is perhaps not so obvious,
though, is that you can actually get
significant latency benefits on certain paths,
because the existing network links follow quite
indirect routes.
For example, if you look at the
path the network links take, if you're
sending data from Europe to Japan.
Quite often, that data goes from Europe,
across the Atlantic to, for example,
New York or Boston, or somewhere like
that, across the US to
San Francisco, or Los Angeles, or Seattle,
or somewhere along those lines, and then
from there, in a cable across the
Pacific to Japan.
Or alternatively, it goes from Europe through
the Mediterranean, the Suez Canal and the
Middle East, and across India, and so
on, until it eventually reaches Japan the
other way around. But neither of these
is a particularly direct route.
And it turns out that there is
a much more direct, a much faster
route, to get from Europe to Japan,
which is to lay an optical fibre
through the Northwest Passage, across Northern Canada,
through the Arctic Ocean, and down through
the Bering Strait, and past Russia to
get directly to Japan. It's much closer
to the great circle route around the
globe, and it's much shorter than the
route that the networks currently take.
And, historically, this hasn't been possible because
of the ice in the Arctic.
But, with global warming, the Northwest Passage
is now ice-free for enough of the
year that people are starting to talk
about laying optical fibres along that route,
because they can get a noticeable latency
reduction, for certain amounts of traffic,
by just following the physically shorter route.
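To see what "the physically shorter route" means in numbers, the great circle distance between two points can be computed with the haversine formula. This is a small sketch under stated assumptions: the function name `great_circle_km`, the spherical Earth with a mean radius of 6371 km, and the approximate coordinates for London and Tokyo are all chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, treating the Earth as a sphere


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great circle distance in km between two (latitude, longitude) points.

    Angles are in degrees. Uses the haversine formula.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city coordinates, used here purely as an example.
LONDON = (51.5, -0.1)
TOKYO = (35.7, 139.7)
print(round(great_circle_km(*LONDON, *TOKYO)), "km")
```

Any cable route that is longer than this distance is paying extra propagation delay for the detour.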
Another factor which influences the propagation delay
is the speed of light in the transmission media.
Now, if you're sending data using radio links,
or using lasers in a vacuum,
then these propagate at the speed of light in a vacuum,
which is about 300 million metres per second.
The speed of light in optical fibre,
though, is slower. The speed at which
light propagates down a fibre,
the speed at which light propagates through
glass, is only about 200,000
kilometres per second, 200 million metres per
second. So it’s about two thirds of
the speed at which it propagates in a vacuum.
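The effect on propagation delay is simple arithmetic: delay is distance divided by speed. The sketch below uses the rounded speeds quoted above. The function name `propagation_delay_ms` and the 6000 km example distance are assumptions for illustration.

```python
C_VACUUM_KM_PER_S = 300_000  # approximate speed of light in a vacuum
C_FIBRE_KM_PER_S = 200_000   # approximate speed of light in optical fibre


def propagation_delay_ms(distance_km, speed_km_per_s):
    """One-way propagation delay in milliseconds."""
    return distance_km / speed_km_per_s * 1000.0


# Hypothetical 6000 km link, one way.
distance_km = 6000
print("vacuum:", propagation_delay_ms(distance_km, C_VACUUM_KM_PER_S), "ms")
print("fibre: ", propagation_delay_ms(distance_km, C_FIBRE_KM_PER_S), "ms")
```

Over the same distance, the fibre path takes one and a half times as long as the vacuum path, whatever the distance is.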
And this is the reason for systems
such as StarLink, which SpaceX is deploying.
And the idea of these systems is
that, rather than sending the Internet signals
down an optical fibre,
you send them 100, or a couple
of hundred miles, up to a satellite,
and they then go around between various
satellites in the constellation, in low earth
orbit, and then down to a receiver
near the destination.
And by propagating through vacuum, rather than
through optical fibre, the speed of light
in vacuum is significantly faster, it's about
50% faster than the speed of light
in fibre, and this can reduce the latency.
And the estimates show that if you
have a large enough constellation of satellites,
and SpaceX is planning on deploying around
4000 satellites, I believe, and with careful
routing, you can get about a 40,
45, 50% reduction in latency.
Just because the signals are transmitting via
radio waves, and via inter-satellite laser links,
which are in a vacuum, rather than
being transmitted through a fibre optic cable.
Just because of the differences in the
speed of light between the two mediums.
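A back-of-the-envelope comparison shows where figures of that size can come from. This is a very rough sketch. The satellite path is modelled as the ground distance plus one hop up and one hop down, the `stretch` factor for how indirect the fibre route is, the 320 km altitude (roughly a couple of hundred miles), and all of the function names are assumptions for the example, not properties of any real constellation.

```python
C_VACUUM_KM_PER_S = 300_000  # approximate speed of light in a vacuum
C_FIBRE_KM_PER_S = 200_000   # approximate speed of light in optical fibre


def fibre_path_ms(ground_km, stretch=1.0):
    """One-way delay over fibre; stretch > 1 models an indirect cable route."""
    return ground_km * stretch / C_FIBRE_KM_PER_S * 1000.0


def satellite_path_ms(ground_km, altitude_km):
    """One-way delay via satellites: up, across, and back down, in a vacuum."""
    return (ground_km + 2 * altitude_km) / C_VACUUM_KM_PER_S * 1000.0


def latency_reduction(ground_km, altitude_km, stretch=1.0):
    """Fraction by which the satellite path beats the fibre path."""
    fibre = fibre_path_ms(ground_km, stretch)
    satellite = satellite_path_ms(ground_km, altitude_km)
    return (fibre - satellite) / fibre


# Hypothetical 10,000 km path, satellites at about 320 km altitude.
print(f"direct fibre:   {latency_reduction(10_000, 320):.0%} lower")
print(f"indirect fibre: {latency_reduction(10_000, 320, stretch=1.3):.0%} lower")
```

In this sketch the speed difference alone gives a bit under a third, and the saving only approaches the larger estimates when the fibre route is also noticeably longer than the great circle path.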
And the link on the slide points
to some simulations of the StarLink network,
which try and demonstrate how this would
work, and how it can achieve
both network paths that closely follow the
great circle routes, and
how it can reduce the latency because
of the use of satellites.
So, what we see is that people
are clearly going to some quite extreme
lengths to reduce latency.
I mean, what we spoke about in
the previous part was the use of
ECN marking to reduce latency by reducing
the amount of queuing. And that's just
a configuration change, it’s a software change
to some routers. And that seems to
me like a reasonable approach to reducing latency.
But some people are clearly willing to
go to the effort of
launching thousands of satellites, or
perhaps the slightly less extreme case of
laying new optical fibres through the Arctic Ocean.
So why are people doing this? Why
do people care so much about reducing
latency, that they're willing to spend billions
of dollars launching thousands of satellites,
or running new undersea cables, to do this?
Well, you'll be surprised to hear that
this is not to improve your gaming
experience. And this is not to improve
the experience of your Zoom calls.
Why are people doing this? High frequency share trading.
Share traders believe they can make a
lot of money, by getting a few milliseconds worth
of latency reduction compared to their competitors.
Whether that's a good use of a
few billion dollars, I'll let you decide.
But the end result may be,
hopefully, that we will get lower latency
for the rest of us as well.
And that concludes this lecture.
There are a bunch of reasons why
we have latency in the network.
Some of this is due to propagation
delays. Some of this, perhaps most of
it, in many cases, is due to
queuing at intermediate routers.
The propagation delays are driven by the speed of light.
And unless you can launch many satellites,
or lay more optical fibres, that's pretty
much a fixed constant, and there's not
much we can do about it.
Queuing delays, though, are things which we
can change. And a lot of the
queuing delays in the network are caused
because of TCP Reno and TCP Cubic,
which push for the queues to be full.
Hopefully, we will see improved TCP congestion
control algorithms. And TCP Vegas was one
attempt in this direction, which unfortunately proved
not to be deployable in practice.
TCP BBR was another attempt which
was problematic for other reasons, because of
its unfairness. But people are certainly working
on alternative algorithms in this space,
and hopefully we'll see things deployed before too long.
Lecture 6 discussed TCP congestion control and its impact on latency.
It discussed the principles of congestion control (e.g., the sliding
window algorithm, AIMD, conservation of packets), and their realisation
in TCP Reno. It reviewed the choice of TCP initial window, slow start,
and the congestion avoidance phase, and the response of TCP to packet
loss as a congestion signal.
The lecture noted that TCP Reno cannot effectively make use of fast
and long distance paths (e.g., gigabit per second flows, running on
transatlantic links). It discussed the TCP Cubic algorithm, that
changes the behaviour of TCP in the congestion avoidance phase to
make more effective use of such paths.
And it noted that both TCP Reno and TCP Cubic will try to increase
their sending rate until packet loss occurs, and will use that loss
as a signal to slow down. This fills the in-network queues at routers
on the path, causing latency.
The lecture briefly discussed TCP Vegas, and the idea of using delay
changes as a congestion signal instead of packet loss, and it noted
that TCP Vegas is not deployable in parallel with TCP Reno or Cubic.
It highlighted ongoing research with TCP BBR, a new proposal that
aims to make a deployable congestion controller that is latency
sensitive, and some of the fairness problems with BBR v1.
Finally, the lecture highlighted the possible use of Explicit Congestion
Notification as a way of signalling congestion to the endpoints, and of
causing TCP to reduce its sending rate, before the in-network queues
overflow. This potentially offers a way to reduce latency.
Discussion will focus on the behaviour of TCP Reno congestion control,
to understand the basic dynamics of TCP, why these are so effective at
keeping the network occupied, and understanding how this leads to high
latency. We will then discuss the applicability and ease of deployment
of several alternatives (Cubic, Vegas, BBR, and ECN) and how they
change performance and latency.