Networked Systems H (2022-2023)
Lecture 5: Reliability and Data Transfer
Lecture 5 discusses reliable and unreliable data transfer in the
Internet. It explains the best-effort nature of packet delivery,
the end-to-end argument, and the timeliness-vs-reliability trade-off
inherent in the design of the Internet. It then discusses three
transport protocols in use in the Internet, UDP, TCP, and QUIC,
and how they provide different degrees of timeliness and reliability
and offer different services to applications.
Part 1: Packet Loss in the Internet
The first part of the lecture discusses packet loss in the Internet.
It talks about the causes of packet loss, the end-to-end argument,
and the timeliness-reliability trade-off.
Slides for part 1
00:00:00.633
In this lecture I want to move
00:00:02.333
on from the discussion of connection establishment,
00:00:04.800
and talk instead about reliability and effective
00:00:07.433
data transfer across the network.
00:00:10.000
There are four parts to this.
00:00:12.000
In this first part, I’ll talk briefly
00:00:14.066
about packet loss in the Internet,
00:00:15.866
and the trade-off between reliability and timeliness.
00:00:19.000
Then, I’ll move on to discuss unreliable
00:00:21.633
data using UDP, and talk about the
00:00:23.833
types of applications that benefit from this.
00:00:26.700
In part three, I’ll talk about reliable
00:00:29.000
data transfer with TCP. I’ll discuss the
00:00:31.966
TCP service model, how TCP ensures data
00:00:34.866
is delivered reliably, and some of the
00:00:37.066
limitations of TCP relating to head-of-line blocking.
00:00:41.000
Then, in the final part, I’ll conclude
00:00:43.100
by discussing how QUIC transfers data and
00:00:45.433
how this differs from TCP.
00:00:49.866
I want to start by discussing packet loss in the Internet.
00:00:52.833
What we mean when we say that the Internet
00:00:55.000
provides a best effort service.
00:00:57.066
The end-to-end argument.
00:00:58.733
And the timeliness vs reliability trade-off inherent
00:01:01.400
in the design of the Internet.
00:01:05.833
As we discussed back in lecture 1,
00:01:08.066
the Internet is a best effort packet delivery network.
00:01:11.933
This means that it’s unreliable by design.
00:01:15.000
IP packets can be lost, delayed,
00:01:17.533
reordered, or corrupted in transit. And this
00:01:20.900
is regarded as a feature, rather than a bug.
00:01:23.766
A network that can’t deliver
00:01:25.766
a packet is supposed to discard it.
00:01:29.000
There are many reasons why a packet
00:01:31.133
can get lost or discarded. It could
00:01:33.733
be due to a transmission error,
00:01:35.433
where electrical noise or wireless interference corrupts
00:01:37.833
the packet in transit, making the packet unreadable.
00:01:41.833
Or it could be because too much
00:01:43.300
traffic is arriving at some intermediate link
00:01:45.500
in the network, so an intermediate router
00:01:48.433
runs out of buffer space. If traffic
00:01:51.033
is arriving at a router from several
00:01:52.666
different incoming links, but all going to
00:01:55.200
the same destination, so it’s arriving faster
00:01:57.700
than it can be delivered, a queue
00:01:59.900
of packets will build up, waiting for transmission.
00:02:03.033
If this situation persists, the queue might
00:02:05.400
grow so much that a router runs
00:02:07.633
out of memory, and has no choice
00:02:09.400
but to discard the packets.
00:02:12.000
Or packets could be lost because of
00:02:13.966
a link failure. Or a router bug.
00:02:15.833
Or for other reasons.
00:02:18.000
How often this happens varies significantly.
00:02:22.000
The packet loss rate depends on the type of link.
00:02:26.100
Wireless links tend to be less reliable
00:02:28.433
than wired links, for example.
00:02:30.966
It’s reasonably likely that a packet sent over a wireless
00:02:34.533
link, such as WiFi or 4G,
00:02:36.466
will be corrupted in transit due to
00:02:38.433
noise, interference, or cross traffic.
00:02:41.066
This is very unlikely on an Ethernet
00:02:43.633
or optical fibre link.
00:02:46.000
The packet loss rate also depends on
00:02:48.933
the overall quality and robustness of the infrastructure.
00:02:52.366
Countries with well developed
00:02:53.800
and well maintained infrastructure
00:02:55.400
tend to have reliable Internet links;
00:02:58.366
countries with less robust or lower
00:03:00.500
capacity infrastructure tend to see more problems.
00:03:04.833
And the loss rate depends on the protocol.
00:03:07.966
Some protocols intentionally try to push
00:03:10.000
links to capacity, causing temporary overload as
00:03:12.500
they try to find the limit,
00:03:14.633
as they try to find the maximum
00:03:16.666
transmission rate they can achieve.
00:03:19.000
TCP and QUIC do this in many cases,
00:03:22.033
depending on the congestion control algorithm
00:03:24.366
used, as we’ll see in lecture 6.
00:03:28.000
Other applications, such as telephony or video
00:03:30.200
conferencing, tend to have an upper bound
00:03:32.400
on the amount of data they can send.
00:03:35.066
Whatever the reason, though,
00:03:36.933
some packet loss is inevitable.
00:03:40.000
The transport layer needs to recognise this.
00:03:42.533
It must detect packet loss. And,
00:03:44.866
if the application needs reliability, it must
00:03:47.133
retransmit or otherwise repair any lost data.
00:03:53.000
That the Internet provides best effort packet
00:03:55.266
delivery is a result of the end-to-end argument.
00:03:58.966
The end-to-end argument considers whether it’s better
00:04:02.133
to place functionality inside the network or
00:04:04.300
at the end points.
00:04:06.833
For example, rather than provide best effort
00:04:09.633
delivery, we could try to make the
00:04:11.766
network deliver packets reliably. We could design
00:04:15.466
some way to detect packet loss on
00:04:17.133
a particular link, and request that the
00:04:19.166
lost packets be retransmitted locally,
00:04:21.466
somewhere within the network.
00:04:23.666
And, indeed, some network links do this.
00:04:27.000
In WiFi networks, for example, the base
00:04:29.666
station acknowledges packets it receives from the
00:04:31.800
clients, and requests any corrupted packets are
00:04:34.966
re-sent, to correct the error.
00:04:38.000
The problem is, that unless this mechanism
00:04:40.333
is 100% perfect all the time,
00:04:43.033
then end systems will still need to
00:04:44.966
check if the data has been received
00:04:46.600
correctly, and will still need some way
00:04:48.600
of retransmitting packets in the case of problems.
00:04:52.000
And if they’ve got that, why bother
00:04:54.133
with the in-network retransmission and repair?
00:04:58.000
Oftentimes, if you add features into
00:05:00.233
the network routers, they end up duplicating
00:05:03.000
functionality that the network endpoints need to
00:05:05.500
provide anyway.
00:05:08.600
Maybe the performance benefit of adding features
00:05:11.833
to the network is so big that it’s worth while.
00:05:16.000
But often, the right thing to do
00:05:17.566
is to keep the network simple.
00:05:19.733
Omit anything that can be done by the endpoints.
00:05:22.633
And favour simplicity over the
00:05:24.533
absolute optimal performance.
00:05:28.300
The end-to-end argument is one of the
00:05:29.933
defining principles of the Internet. And I
00:05:32.900
think it’s still a good approach to
00:05:34.566
take, when possible. Keep the network simple, if you can.
00:05:39.000
The paper linked from the slide talks
00:05:40.866
about this subject in a lot more detail.
00:05:46.000
Irrespective of whether retransmission of lost packets
00:05:49.033
happen between the endpoints or within the
00:05:51.766
network, it takes time.
00:05:54.566
This leads to a fundamental trade-off in
00:05:56.400
the design of the network.
00:05:59.000
If a connection is to be reliable,
00:06:01.266
it cannot guarantee timeliness.
00:06:04.400
It’s not possible to build absolutely perfect
00:06:07.066
network links, that never discard or corrupt
00:06:09.433
packets. There’s always some risk that the
00:06:12.566
data is lost and needs to be
00:06:14.833
retransmitted. And retransmitting a packet will always
00:06:18.133
take time, and so disrupt the timeliness of the delivery.
00:06:22.400
And similarly, if a connection is to
00:06:24.600
be timely, it cannot guarantee reliability.
00:06:27.800
There’s a trade-off to be made.
00:06:31.100
Protocols like UDP are timely but don’t
00:06:33.966
attempt to be reliable. They send packets,
00:06:36.800
and if they get lost, they get lost.
00:06:40.533
TCP and QUIC, on the other hand,
00:06:42.566
aim to be reliable. They send the
00:06:45.733
packets, and if they get lost,
00:06:47.366
they retransmit them.
00:06:49.666
And if the retransmission gets lost? They
00:06:52.200
try again, until the data eventually arrives.
00:06:55.533
As we’ll see in part 3 of
00:06:57.533
this lecture, this causes head of line
00:06:59.266
blocking, making the protocol less timely.
00:07:03.000
And other protocols, such as the Real-time
00:07:05.466
Transport Protocol, RTP, that I’ll talk about
00:07:09.166
in lecture 7, or the partially reliable
00:07:11.566
version of the Stream Control Transport Protocol,
00:07:13.800
SCTP, aim for a middle ground.
00:07:17.466
They try to correct some, but not
00:07:19.100
all, of the transmission errors. They try
00:07:22.000
to achieve a balance, a middle-ground,
00:07:24.233
between timeliness and reliability.
00:07:29.266
The different protocols exist because different applications
00:07:32.400
make different trade-offs.
00:07:34.233
Some applications prefer timeliness,
00:07:36.533
some prefer reliability.
00:07:39.366
For applications like web browsing, email,
00:07:41.833
or messaging, you want to receive all
00:07:44.533
the data. If I’m loading a web
00:07:47.333
site, I’d like it to load quickly,
00:07:49.300
sure. But I prefer for it to
00:07:51.800
load slowly, and be uncorrupted, rather than
00:07:54.433
load quickly with some parts missing.
00:07:57.466
For a video conferencing tool, like Zoom,
00:08:00.100
though, the trade-off is different. If I’m
00:08:03.200
having a conversation with someone, it’s more
00:08:05.166
important that the latency is low,
00:08:07.066
than the picture quality is perfect.
00:08:10.000
The same may be true for gaming.
00:08:13.000
And this has implications for the way
00:08:15.166
we design the network.
00:08:17.000
It means that the IP layer needs
00:08:18.933
to be unreliable. It needs to be
00:08:21.066
a best effort network.
00:08:23.400
If the IP layer is unreliable,
00:08:25.700
protocols like TCP and QUIC can sit
00:08:28.100
on top and retransmit packets to make
00:08:30.200
it reliable. A transport protocol can make
00:08:33.533
an unreliable network into a reliable one.
00:08:37.366
But if the IP layer is reliable,
00:08:39.666
if the IP layer retransmits packets itself,
00:08:42.700
then the network, the applications, the transport
00:08:45.366
protocols, can’t undo that.
00:08:51.466
So this concludes the discussion of packet
00:08:53.533
loss and why the Internet opts to
00:08:55.433
provide an unreliable, best-effort service.
00:08:58.566
In the next part, I’ll talk about
00:09:00.233
UDP and how to make use of
00:09:02.100
an unreliable transport protocol.
Part 2: Unreliable Data Using UDP
The second part of the lecture discusses UDP. It outlines the UDP
service model, and reviews how to send and receive data using UDP
sockets, and the implications of unreliable delivery for applications
using UDP. It discusses how UDP is suitable for real-time applications
that prioritise low latency over reliability. And it discusses the use
of UDP as a substrate on which alternative transport protocols can be
implemented, avoiding some of the challenges of protocol ossification.
Slides for part 2
00:00:00.300
In this part, I’ll move on to
00:00:02.166
discuss how to send unreliable data using UDP.
00:00:05.400
I’ll talk about the UDP service model,
00:00:07.900
how to send and receive packets,
00:00:09.833
and how to layer protocols on top of UDP.
00:00:14.000
UDP provides an unreliable,
00:00:16.300
connectionless, datagram service.
00:00:18.600
It adds only two features on top
00:00:20.566
of the IP layer: port numbers and a checksum.
00:00:24.000
The checksum is used to detect whether
00:00:26.300
the packet has been corrupted in transit.
00:00:28.666
If so, the packet will be discarded
00:00:30.933
by the UDP code in the operating
00:00:33.066
system of the receiver, and won’t be
00:00:34.833
delivered to the application.
00:00:37.000
The port numbers determine what application receives
00:00:39.866
the UDP datagrams when they arrive at
00:00:42.066
the destination. They’re set by the bind()
00:00:44.500
call, once the socket has been created.
00:00:47.566
The Internet Assigned Numbers Authority, the IANA,
00:00:50.566
maintains a list of well-known UDP port
00:00:52.633
numbers which you should use for particular
00:00:55.200
applications. This is linked from the bottom of the slide.
00:00:59.400
UDP is very minimal. It doesn’t provide
00:01:02.166
reliability, or ordering, or congestion control.
00:01:05.533
It just delivers packets to an application,
00:01:08.400
that’s bound to a particular port.
00:01:11.000
Mostly, UDP is used as a substrate.
00:01:13.800
It’s a base on which higher-layer protocols are built.
00:01:17.666
QUIC is an example of this,
00:01:19.433
as we discussed in the last lecture.
00:01:21.600
Others are the Real-time Transport Protocol,
00:01:24.100
and the DNS protocol,
00:01:25.466
that we’ll talk about later in the course.
00:01:29.666
UDP is connectionless. It’s got no notion
00:01:32.633
of clients or servers, or of establishing
00:01:34.933
a connection before it can be used.
00:01:38.000
To use UDP, you first create a socket.
00:01:41.433
Then you call bind(),
00:01:42.766
to choose the local port on which that socket
00:01:44.833
listens for incoming datagrams.
00:01:46.900
Then you call recvfrom() if you want
00:01:49.400
to receive a datagram on that socket,
00:01:51.800
or sendto() if you want to send a datagram.
00:01:55.000
You don’t need to connect.
00:01:56.966
You don’t need to accept connections.
00:01:59.266
You just send and receive data.
00:02:01.866
And maybe that data is delivered.
00:02:05.000
When you’re finished, you close the socket.
00:02:08.433
Protocols that run on top of UDP,
00:02:10.700
such as QUIC, might add support for
00:02:13.133
connections, reliability, ordering,
00:02:15.333
congestion control, and so on,
00:02:17.166
but UDP itself supports none of this.
00:02:22.866
To send a UDP datagram, you use
00:02:25.000
the sendto() function.
00:02:27.033
This works similarly to the send() function
00:02:29.566
you used to send data over a
00:02:31.033
TCP connection in the labs, except that
00:02:33.933
it takes two additional parameters to indicate
00:02:36.566
the address to which the datagram should
00:02:39.066
be sent, and the size of that address.
00:02:41.800
When using TCP, you establish a connection
00:02:45.133
between a socket, bound to a local
00:02:46.933
address and port, and a server listening
00:02:49.600
on a particular port on some remote
00:02:51.400
IP address. And once the connection is
00:02:54.033
established, all the data goes over that
00:02:55.966
connection, to the same destination.
00:02:59.033
UDP is not like that.
00:03:01.533
Every time you call sendto(), you specify
00:03:04.333
the destination address. Every packet you send
00:03:07.633
from a UDP socket can go to
00:03:09.800
a different destination, if you want.
00:03:12.166
There’s no notion of connections.
00:03:15.400
Now, you can call connect() on a
00:03:17.600
UDP socket, if you like, but it doesn’t actually create
00:03:20.233
a connection. Rather, it just remembers the
00:03:23.666
address you give it, so you can
00:03:25.400
call send(), rather than sendto() in future,
00:03:28.400
to save having to specify the address each time.
00:03:33.000
To receive a UDP datagram, you call
00:03:36.133
the recvfrom() function, as shown on the slide.
00:03:39.800
This is like the recv() call you
00:03:42.000
use with TCP, but again it has
00:03:44.566
two additional parameters. These allow it to
00:03:47.433
record the address that the received datagram
00:03:49.700
came from, so you can use them
00:03:52.133
in the sendto() function to send a reply.
00:03:54.766
You can also call recv(), rather than
00:03:57.000
recvfrom(), like with TCP, and it works,
00:04:00.566
but it doesn’t give you the return
00:04:02.233
address, so it’s not very useful.
00:04:05.366
The important point with UDP is that
00:04:07.900
packets can be lost, delayed, or reordered
00:04:10.166
in transit, and UDP doesn’t attempt to
00:04:12.500
recover from this.
00:04:14.900
Just because you send a datagram,
00:04:17.000
doesn’t mean it will arrive. And if
00:04:19.500
datagrams do arrive, they won’t necessarily arrive
00:04:21.933
in the order sent.
00:04:27.900
Unlike TCP, where data written to a
00:04:30.700
connection in a single send() call might
00:04:32.933
end up being split across multiple read()
00:04:35.066
calls at the receiver, a single UDP
00:04:37.900
send generates exactly one datagram.
00:04:41.566
If it’s delivered at all, the data
00:04:43.800
sent by a single call to sendto()
00:04:45.866
will be delivered by a single call
00:04:47.566
to recvfrom(). UDP doesn’t split messages.
00:04:52.233
But UDP is otherwise unreliable.
00:04:54.966
Datagrams can be lost, delayed, reordered,
00:04:57.766
or duplicated in transit.
00:05:00.400
Data sent with sendto() might never arrive.
00:05:03.300
Or it might arrive more than once.
00:05:05.966
Or data sent in consecutive calls to
00:05:08.266
sendto() might arrive out of order,
00:05:10.633
with data sent later arriving first.
00:05:14.700
UDP doesn’t attempt to correct any of these things.
00:05:19.500
The protocol you build on top of
00:05:21.200
UDP might choose to do so.
00:05:23.900
For example, we saw that QUIC adds
00:05:26.000
packet sequence numbers and acknowledgement frames to
00:05:28.433
the data it sends within UDP packets.
00:05:31.366
This lets it put the data back
00:05:33.100
into the correct order, and retransmit any
00:05:34.966
missing packets.
00:05:36.800
But there’s no requirement that the protocol
00:05:38.566
running over UDP is reliable.
00:05:41.500
RTP, the Real-time Transport Protocol, that’s used
00:05:44.933
for video conferencing apps, puts sequence numbers
00:05:47.600
and timestamps inside the UDP datagrams it
00:05:50.666
sends, so it can know if any
00:05:53.033
data is missing, and it can conceal
00:05:55.000
loss or reconstruct the packet playout time,
00:05:58.466
but it generally doesn’t retransmit missing data.
00:06:03.000
UDP gives the application the choice of
00:06:05.566
building reliability, if it wants it.
00:06:07.866
But it doesn’t require that the applications
00:06:09.966
deliver data reliably.
00:06:14.000
Applications that use UDP need to organise
00:06:16.766
the data they send, so it’s useful
00:06:18.400
if some data is lost.
00:06:21.133
Different applications do this in different ways,
00:06:23.633
depending on their needs.
00:06:26.000
QUIC, for example, organises the data into
00:06:28.533
sub-streams within a connection,
00:06:30.300
and retransmits missing data.
00:06:33.266
Video conferencing applications
00:06:34.766
tend to do something different.
00:06:37.333
The way video compression works, is that
00:06:39.700
the codec sends occasional full frames of
00:06:41.700
video, known as I-frames, index frames,
00:06:44.700
every few seconds. And in between these
00:06:48.000
it sends only the differences from the
00:06:49.566
previous frame, known as P-frames, predicted frames.
00:06:53.866
In a video call, it’s common for
00:06:55.900
the background to stay the same,
00:06:57.433
while the person moves in the foreground,
00:06:59.766
so a lot of the frame is
00:07:01.166
the same each time. By only sending
00:07:03.866
the differences, video compression saves bandwidth.
00:07:07.733
But this affects how the application treats
00:07:09.533
the different datagrams.
00:07:12.000
If a UDP datagram containing a predicted
00:07:14.500
frame is lost, it’s not that important.
00:07:17.433
You’ll get a glitch in one frame of video.
00:07:20.500
But if a UDP datagram containing an
00:07:22.700
index frame, or part of an index
00:07:25.300
frame, is lost, then that matters a
00:07:27.533
lot more because the next few seconds
00:07:29.700
worth of video are predicted based on
00:07:31.600
that index frame. Losing an index frame
00:07:34.766
corrupts several seconds worth of video.
00:07:38.000
For this reason, many video conferencing apps
00:07:40.166
running over UDP try to determine if
00:07:42.866
missing packets contained an index frame or
00:07:45.233
not. And they try to retransmit index
00:07:48.000
frames, but not predicted frames.
00:07:51.000
The details of how they do this
00:07:52.833
aren’t really important, unless you’re building a
00:07:54.966
video conferencing app.
00:07:56.766
What’s important though, is that UDP gives
00:07:58.966
the application flexibility to be unreliable for
00:08:02.033
some of the datagrams it sends,
00:08:04.266
while trying to deliver other datagrams reliably.
00:08:08.000
You don’t have that flexibility with TCP.
00:08:12.333
UDP is harder to use, because it
00:08:14.766
provides very few services to help your
00:08:16.600
application, but it’s more flexible because you
00:08:19.333
can build exactly the services you need
00:08:21.200
on top of UDP.
00:08:25.100
Fundamentally, UDP doesn’t make any attempt to
00:08:28.233
provide sequencing, reliability,
00:08:30.500
timing recovery, or congestion control.
00:08:33.766
It just delivers datagrams on a best effort basis.
00:08:38.000
It lets you build any type of
00:08:39.933
transport protocol you want, running inside UDP packets.
00:08:44.000
Maybe that transport protocol has sequence numbers
00:08:46.766
and acknowledgements, and retransmits some or all
00:08:49.533
of the lost packets.
00:08:52.100
Maybe, instead, it uses error correcting codes,
00:08:55.066
to allow some of the packets to
00:08:57.233
be repaired without retransmission.
00:08:59.666
Maybe it includes timestamps, so the receiver
00:09:02.233
can carefully reconstruct the timing.
00:09:04.733
Maybe it contains other information.
00:09:07.166
The point is that UDP gives you
00:09:09.133
flexibility, but at the cost of having
00:09:11.500
to implement these features yourself. At the
00:09:14.066
cost of adding complexity.
00:09:18.000
There’s a lot to think about when
00:09:19.866
writing a UDP-based protocol or a UDP-
00:09:22.133
based application.
00:09:24.033
If you use a transport protocol,
00:09:26.200
like QUIC or like RTP, that runs
00:09:28.633
over UDP, then the designers of that
00:09:31.366
protocol have made these decisions, and will
00:09:33.833
have given you a library you can use.
00:09:36.500
If not, if you’re designing your own
00:09:38.566
protocol that runs over UDP, then the
00:09:41.733
IETF has written some guidelines, highlighting the
00:09:44.066
issues you need to think about,
00:09:45.600
in RFC 8085.
00:09:48.200
Please read this before you try and
00:09:49.866
write applications that use UDP. There are
00:09:52.666
a lot of non-obvious things that can catch you out.
00:09:57.500
So, that concludes our discussion of UDP.
00:10:00.233
In the next part, I’ll talk about
00:10:01.866
how TCP delivers data reliably.
Part 3: Reliable Data with TCP
The third part of the lecture discusses TCP. It outlines the TCP
service model and shows how to send and receive data using a TCP
connection. It explains how TCP ensures reliable and ordered data
transfer, using sequence numbers and acknowledgements. And it
explains TCP loss detection using timeouts and triple-duplicate
acknowledgements. The issue of head-of-line blocking in TCP
connections is discussed, as an example of the timeliness vs
reliability trade-off.
Slides for part 3
00:00:00.233
In this part I want to talk
00:00:02.000
about how reliable data is delivered using
00:00:03.966
TCP connections. I’ll talk about the TCP
00:00:06.866
service model, how TCP uses sequence numbers
00:00:10.400
and acknowledgments, and how packet loss detection
00:00:13.500
and recovery works in TCP.
00:00:17.033
Thinking about the TCP service model,
00:00:19.266
as we've seen in previous lectures,
00:00:21.600
TCP provides a reliable, ordered, byte stream
00:00:24.966
delivery service that runs over IP.
00:00:27.966
The applications write data into the TCP
00:00:30.733
socket, that buffers it up in the
00:00:32.833
sending system, and then delivers it in
00:00:35.533
a sequence of data segments over the IP layer.
00:00:38.866
When these data packets, these data segments,
00:00:41.766
are received, they are accumulated in a
00:00:44.533
receive buffer at the receiver. If anything
00:00:47.200
is lost, or arrives out of order,
00:00:49.066
it's re-transmitted, and eventually the data is
00:00:51.433
delivered to the application.
00:00:53.333
The data delivered to the application is
00:00:55.666
always delivered reliably, and in the order sent.
00:00:58.733
If something is lost, if something needs
00:01:01.600
to be re-transmitted, this stalls the delivery
00:01:04.400
of the later data, to make sure
00:01:06.766
that everything is always delivered in order.
00:01:10.966
TCP delivers, as we say, an ordered,
00:01:13.766
reliable, byte stream.
00:01:16.366
After the connection has been established,
00:01:18.466
after the SYN, SYN-ACK, ACK handshake,
00:01:20.866
the client and the server can send
00:01:22.633
and receive data.
00:01:24.700
The data can flow in either direction
00:01:27.166
within that TCP connection.
00:01:29.400
It’s usual that the data follows a
00:01:31.900
request response pattern. You open the connection.
00:01:35.100
The client sends a request to the
00:01:36.733
server. The server replies with a response.
00:01:39.400
The client makes another request. The server
00:01:41.566
replies with another response, and so on.
00:01:44.566
But TCP doesn't make any requirements on
00:01:46.900
this. There’s no requirement that the data
00:01:49.266
flows in a request response pattern,
00:01:51.300
and the client and the server can
00:01:53.666
send data in any order they feel like.
00:01:56.366
TCP does ensure that the data is
00:01:58.400
delivered reliably, and in the order it
00:02:00.600
was sent, though.
00:02:02.766
TCP sends acknowledgments for each data segment
00:02:05.600
as it's received. And if any data
00:02:07.733
is lost, it retransmits that lost data.
00:02:10.733
And if segments are delayed and arrive
00:02:13.033
out of order, or if a segment
00:02:15.166
has to be re-transmitted and arrives out
00:02:17.166
of order, then TCP will reconstruct the
00:02:19.300
order before giving the segments back to the application.
00:02:25.533
In order to send data over a
00:02:27.300
TCP connection you use the send() function.
00:02:30.766
This transmits a block of data over
00:02:33.500
the TCP connection. The parameters are the
00:02:37.133
file descriptor representing the socket – the
00:02:39.500
TCP socket, the data, the length of
00:02:42.066
the data, and a flag. And the
00:02:44.366
flag field is usually zero.
00:02:47.466
The send() function blocks until all the
00:02:49.600
data can be written.
00:02:51.400
And it might take a significant amount
00:02:53.900
of time to do this, depending on
00:02:55.866
the available capacity of the network.
00:02:59.433
It also might not be able to
00:03:00.800
send all the data.
00:03:02.833
If the connection is congested, and can't
00:03:05.233
accept any more data, then the send()
00:03:06.966
function will return to indicate that it
00:03:10.566
wasn't able to successfully send all the
00:03:12.766
data that was requested.
00:03:15.300
The return value from the send() function
00:03:17.300
is the amount of data it actually
00:03:18.766
managed to send on the connection.
00:03:20.266
And that can be less than the
00:03:21.866
amount it was asked to send.
00:03:23.566
In which case, you need to figure
00:03:25.533
out what data was not sent,
00:03:27.100
by looking at the return value,
00:03:29.800
and the amount you asked for,
00:03:31.233
and re-send just the missing part in another call.
00:03:34.833
Similarly, if an error occurs, if the
00:03:37.333
connection has failed for some reason,
00:03:39.500
the send() function will return -1,
00:03:41.166
and it will set the global variable
00:03:42.566
errno to indicate that.
00:03:46.800
On the receiving side you call the
00:03:49.100
recv() function to receive data on a
00:03:50.966
TCP connection.
00:03:53.200
The recv() function blocks until data is
00:03:55.833
available, or until the connection is closed.
00:04:01.333
It’s passed a buffer, buf, and the
00:04:04.666
size of the buffer, BUFLEN, and it
00:04:07.066
reads up to BUFLEN bytes of data.
00:04:09.600
And what it returns is the number
00:04:11.700
of bytes of data that were read.
00:04:14.066
Or, if the connection was closed,
00:04:16.100
it returns zero. Or, if an error
00:04:18.900
occurs, it returns -1, and again sets
00:04:21.700
global variable errno to indicate what happened.
00:04:26.933
When a recv() call finishes, you have
00:04:29.500
to check these three possibilities. You have
00:04:31.900
to check if the return value is
00:04:33.300
zero, to indicate that the connection is
00:04:35.466
closed and you've successfully received all the
00:04:38.133
data in that connection. At which point,
00:04:40.366
you should also close the connection.
00:04:42.900
You have to check if the return
00:04:44.300
value is minus one, in which case
00:04:46.166
an error has occurred, and that connection
00:04:48.566
has failed, and you need to somehow
00:04:50.566
handle that error.
00:04:53.266
And you need to check if it's some other value,
00:04:55.900
to indicate that you've received some data,
00:04:57.900
and then you need to process that data.
00:05:01.133
What's important is to remember that the
00:05:04.200
recv() call just gives you that data
00:05:07.033
in the buffer. If the return value
00:05:09.700
from recv() is 157, this indicates that
00:05:12.566
the buffer has 157 bytes of data in it.
00:05:16.366
What the recv() call doesn't ever do,
00:05:18.833
is add a terminating null to that buffer.
00:05:22.366
Now, if you're careful that doesn't matter,
00:05:26.133
because you know how much data is
00:05:28.300
in the buffer, and you can explicitly
00:05:30.400
process the data up to that length.
00:05:33.866
But, a common problem with TCP-based applications,
00:05:38.500
is that they treat the data as if it was a string.
00:05:43.366
They pass it to the printf() call
00:05:45.200
using %s as if it were a
00:05:47.200
string, or they pass it to function
00:05:49.666
like strstr() to search for a string
00:05:51.533
within it, or strcpy(), or something like that.
00:05:56.133
And the problem is the string functions
00:05:58.033
assume there’s a terminating null, and the
00:06:00.333
recv() call doesn't provide one.
00:06:03.766
If you're going to pass the data
00:06:05.866
that's returned from a recv() call to
00:06:08.600
one of the C string functions,
00:06:10.666
you need to explicitly add that null yourself.
00:06:13.866
You need to look at the buffer,
00:06:17.333
add the null at the end,
00:06:19.100
after the last byte which was successfully
00:06:21.533
received. If you don't do this, the
00:06:25.033
string functions will just run off the end of the buffer
00:06:27.300
and you'll have a buffer overflow vulnerability.
00:06:29.733
And this is a significant security risk.
00:06:31.733
It’s one of the biggest security problems
00:06:33.666
with network code using C. It’s misusing
00:06:36.900
these buffers, accidentally using one of the
00:06:39.166
string functions, and it just reads off
00:06:41.966
the end of the buffer, and who knows what it processes.
00:06:48.566
When you send data using TCP,
00:06:50.700
the send() call enqueues the data for transmission.
00:06:55.200
The operating system, the TCP code in
00:06:57.900
the operating system, splits the data you've
00:07:00.366
written using the various send() calls into
00:07:02.266
what’s known as segments, and puts each
00:07:04.333
of these into a TCP packet.
00:07:07.433
The TCP packets are sent in IP
00:07:09.533
packets. And TCP runs a congestion control
00:07:12.933
algorithm to decide when it can send those packets.
00:07:17.166
Each TCP segment is carried in
00:07:20.200
a TCP packet. The TCP packets have
00:07:22.933
a header, which has a sequence number.
00:07:25.933
When the connection setup handshake happens,
00:07:28.700
in the SYN and the SYN-ACK packets,
00:07:31.366
the connection agrees the initial sequence numbers;
00:07:34.300
agrees the starting value for the sequence numbers.
00:07:37.666
If you’re the client, for example;
00:07:39.600
the client picks a sequence number at
00:07:43.200
random, and sends this in its SYN packet.
00:07:46.433
And then when it starts sending data,
00:07:48.600
the next data packet has a sequence
00:07:50.700
number that is one higher than that
00:07:52.466
in the SYN packet.
00:07:55.033
And, as it continues to send data,
00:07:57.700
the sequence numbers increase by the number
00:08:00.300
of data bytes sent.
00:08:02.400
So, for example, if the initial sequence
00:08:04.533
number was 1001, just picked randomly,
00:08:07.133
and it sends 30 bytes of data
00:08:09.466
in the packet, then the next sequence
00:08:12.733
number will be 1031.
00:08:16.533
The sequence number spaces are separate for
00:08:18.800
each direction. The sequence numbers
00:08:21.066
the client uses increase based on the
00:08:23.333
initial sequence number the client sent in the SYN packet.
00:08:26.366
The sequence numbers the server uses
00:08:28.433
start based on the initial sequence number
00:08:30.600
the server sent in the SYN-ACK packet,
00:08:32.700
and increase based on the amount of
00:08:34.766
data the server is sending. The two
00:08:36.366
number spaces are unrelated.
00:08:41.600
What's important is that calls to send()
00:08:44.300
don't map directly onto TCP segments.
00:08:49.066
If the data which is given to
00:08:51.300
a send() call is too big to
00:08:52.900
fit into one TCP segment, then the
00:08:56.100
TCP code will split it across several
00:08:58.366
segments; it'll split it across several packets.
00:09:02.600
Similarly, if the data you send,
00:09:04.900
that data you give the send() call
00:09:06.666
is quite small, TCP might not send
00:09:09.066
it immediately.
00:09:11.066
It might buffer it up, combine it
00:09:13.166
with data sent as part of a
00:09:15.600
later send() call. And combine it,
00:09:18.266
and send it in a single larger
00:09:19.700
segment, a single larger TCP packet.
00:09:23.566
This is an idea known as Nagle’s
00:09:27.100
algorithm. It's there to improve efficiency by
00:09:30.200
only sending big packets, because there's a
00:09:32.633
certain amount of overhead for each packet.
00:09:35.733
Each packet that’s sent by TCP has
00:09:38.033
a TCP header. It’s got an IP
00:09:40.333
header. It's got the Ethernet or the
00:09:42.666
WiFi headers depending on the link layer.
00:09:45.033
And that adds a certain amount of
00:09:47.033
overhead. It’s about, I think, 40 bytes
00:09:48.966
per packet. So if you're only sending
00:09:51.066
a small amount of data, that's a
00:09:52.900
lot of overhead, a lot of wasted data.
00:09:55.533
So TCP, with the Nagle algorithm,
00:09:57.466
tries to combine these packets into larger
00:09:59.500
packets when it can. But, of course,
00:10:01.633
this adds some delay. It’s got to
00:10:03.800
wait for you to send more data;
00:10:05.400
wait to see if it can form a bigger packet.
00:10:09.133
If you really need low latency,
00:10:11.133
you can disable the Nagle algorithm.
00:10:13.100
There’s a socket option called TCP_NODELAY,
00:10:16.000
and we see the code on the
00:10:17.800
slide to show how to use that.
00:10:19.833
So you create the socket, you
00:10:23.300
establish the connection, and then you set
00:10:26.400
the TCP_NODELAY option and that turns this
00:10:28.700
off. And this means that every time
00:10:31.000
you send() on the socket, it immediately
00:10:32.900
gets sent as quickly as possible.
00:10:37.800
One implication of this behaviour, though,
00:10:40.233
where TCP can either split data written
00:10:43.800
in a single send() across multiple segments,
00:10:47.233
or where it can combine several send()
00:10:49.566
calls into a single segment, is that
00:10:52.400
the data returned by the recv() calls
00:10:54.900
doesn't always correspond to a single send().
00:10:58.400
When you call recv(), you might get
00:11:01.166
just part of a message. And you
00:11:03.266
need to call recv() again to get the rest of the message.
00:11:06.700
Or you may get several messages in one recv() call.
00:11:12.600
When you're using TCP, the recv() calls
00:11:14.933
return the data reliably, and they return
00:11:17.266
the data in the order that it was sent.
00:11:20.366
But what they don't do is frame
00:11:22.300
the data. What they don't do is
00:11:23.833
preserve the message boundaries.
00:11:27.233
For example, if we're using HTTP,
00:11:30.433
which we see, we see an example
00:11:32.666
of an HTTP message that might be sent,
00:11:34.866
an HTTP response that might be sent,
00:11:37.866
by a web server back to a browser.
00:11:41.566
If we're using HTTP, what we would
00:11:44.500
like is that the whole response is
00:11:46.800
received in one go. So if we're
00:11:50.066
implementing a web browser we just call
00:11:51.766
recv() on the TCP connection
00:11:53.633
and we get all of the headers,
00:11:55.866
and all of the body, in just
00:11:57.500
in just one call to recv() and
00:11:59.433
we can then parse it, and process it, and deal with it.
00:12:02.833
TCP doesn't guarantee this, though.
00:12:06.133
It can split the messages arbitrarily,
00:12:08.566
depending on how much data was in
00:12:11.166
the packets, what size packets the underlying
00:12:14.233
link layers can send, and on the
00:12:17.066
available capacity of the network depending on
00:12:19.566
the congestion control.
00:12:21.200
And it can split the packets at arbitrary points.
00:12:24.466
For example, if we look at the
00:12:26.800
slide, we see that the headers,
00:12:29.166
some of them are labeled in red,
00:12:30.533
some are in blue, some of the body is in blue,
00:12:33.233
and the rest of the body is
00:12:34.500
in green. And it could be that
00:12:36.400
the TCP connection splits the data up,
00:12:38.600
so that the first recv() call just
00:12:40.633
gets the part of the headers highlighted
00:12:42.466
in red,
00:12:43.500
ending halfway through the “ETag:” line.
00:12:46.466
And then you have to call recv()
00:12:48.333
again. And then you get the part
00:12:50.233
of the message highlighted in blue,
00:12:51.833
which contains the rest of the headers
00:12:53.600
and the first part of the body.
00:12:55.533
Then you have to call recv() again,
00:12:57.433
to get the rest of the message
00:12:59.033
that's highlighted in green on the slide.
00:13:01.300
And this makes it much harder to
00:13:03.166
parse; much harder for the programmer.
00:13:05.833
Because you have to look at the
00:13:07.866
data you've got, parse it, check to
00:13:09.900
see if you've got the whole message,
00:13:11.500
check if you've received the complete headers,
00:13:13.466
check to see if you've received the
00:13:15.033
complete body. And you have to handle
00:13:17.033
the fact that you might have partial messages.
00:13:20.633
And it's something which makes it a
00:13:22.200
little bit hard to debug, because if
00:13:24.466
you only send small messages,
00:13:25.833
if you're sending messages which are only
00:13:28.200
like 1000 bytes, or so, they’re probably
00:13:31.800
small enough to fit in a single
00:13:33.600
packet, and they always get delivered in one go.
00:13:36.333
It’s only when you start sending
00:13:38.400
larger messages, or sending lots of data
00:13:41.333
over a connection, so things get split up
00:13:43.800
due to congestion control, that you start
00:13:45.600
to see this behaviour where the messages
00:13:47.533
get split at arbitrary points.
00:13:54.133
So as we've seen, the TCP segments
00:13:58.200
contain sequence numbers, and the sequence numbers
00:14:00.333
count up with the number of bytes being sent.
00:14:03.600
Each TCP segment also has an acknowledgement number.
00:14:09.366
When a TCP segment is sent,
00:14:12.266
it acknowledges any segments that have previously
00:14:16.666
been received.
00:14:18.866
So if,
00:14:20.266
if a TCP endpoint has received some
00:14:24.733
data on a TCP connection,
00:14:27.266
when it sends its next packet,
00:14:29.400
the ACK bit will be set in
00:14:31.866
the TCP header, to indicate that the
00:14:33.866
acknowledgement number is valid, and the acknowledgement
00:14:36.500
number will have a value indicating the
00:14:39.100
next sequence number it is expecting.
00:14:42.166
That is, the next contiguous byte it's
00:14:44.533
expecting on the connection.
00:14:47.866
So, in the example, we have a
00:14:52.500
slightly unrealistic example in that the connection
00:14:54.733
is sending one byte at a time,
00:14:56.500
and the first packet is sent with sequence number five.
00:14:59.566
And then the next packet is sent
00:15:01.700
with sequence number six, and then seven,
00:15:03.833
and eight, and nine, and ten,
00:15:05.666
and so on. And this is what
00:15:07.800
might happen with an ssh connection,
00:15:09.600
where each key you type generates a
00:15:11.166
TCP segment, with just the one key press in it.
00:15:14.866
And when those packets are received at
00:15:17.866
host B, it sends a TCP segment
00:15:20.700
with the acknowledgement bit set, acknowledging what's
00:15:24.766
expected next.
00:15:26.233
So when it receives the TCP packet
00:15:29.800
with sequence number five, and one byte
00:15:31.833
of data in it, it sends an
00:15:33.900
acknowledgement saying it got it, and it's
00:15:36.133
expecting the packet with sequence number six next.
00:15:40.333
When it receives the packet with sequence
00:15:42.366
number six, and one byte of data
00:15:44.433
in it, it sends an acknowledgement saying
00:15:46.333
it's expecting seven. And so on.
00:15:51.033
TCP only ever acknowledges the next contiguous
00:15:55.766
sequence number expected.
00:15:58.233
And if a packet is lost,
00:16:00.500
subsequent packets generate duplicate acknowledgments.
00:16:05.300
So in this case, packet five was
00:16:08.733
sent. It got to the receiver,
00:16:10.766
and that sent the acknowledgement saying it
00:16:12.633
expected six. Six was sent, arrived at
00:16:15.100
the receiver, so the acknowledgement says it
00:16:17.133
expects seven.
00:16:18.800
Seven was sent, arrives at the receiver,
00:16:21.600
sends the acknowledgement saying it expects
00:16:23.333
eight. Eight was sent, and gets lost.
00:16:29.466
Nine was sent, and arrives at the receiver.
00:16:33.033
At this point, the receiver’s received the
00:16:36.066
packets with sequence numbers five, six,
00:16:38.000
and seven; eight is missing; and nine
00:16:40.366
has arrived. So the next contiguous sequence
00:16:43.400
number it's expecting is still eight.
00:16:46.233
So it sends an acknowledgement saying “I’m
00:16:48.633
expecting sequence number eight next”.
00:16:52.066
The packet sent, the next packet sent,
00:16:55.066
has sequence number 10. This arrives,
00:16:57.633
the acknowledgement goes back saying “I still
00:16:59.800
haven't got eight, I’m still expecting eight”,
00:17:02.400
and this carries on. TCP keeps sending
00:17:04.800
duplicate acknowledgments while there’s a gap in
00:17:06.900
the sequence number space.
00:17:11.533
In addition, we don't show it here,
00:17:14.000
but TCP can also send delayed acknowledgments,
00:17:16.333
where it only acknowledges every second packet.
00:17:18.466
In this case the acknowledgments might go,
00:17:20.666
six, eight. The packet with sequence number
00:17:23.966
five is sent, and it acknowledges six.
00:17:26.566
Packet with number six is sent,
00:17:28.366
and arrives, and packet number seven is
00:17:30.366
sent, and then it sends the acknowledgement
00:17:32.166
saying it's expecting eight. So it doesn't
00:17:34.366
have to send every acknowledgement, it can
00:17:36.300
send every other acknowledgement to reduce the overheads.
00:17:43.300
TCP uses the acknowledgments to detect packet
00:17:47.800
loss; to detect when segments are lost.
00:17:51.233
There’s two ways in which it does this.
00:17:54.466
The first is that if it sends
00:17:57.433
data, but for some reason the acknowledgments stop entirely.
00:18:01.500
This is a sign that either the receiver has failed,
00:18:04.966
and, you know, the packets are being
00:18:06.866
delivered to the receiver, but the application
00:18:08.733
has crashed, and there's nothing there to
00:18:11.000
receive the data, to reply.
00:18:13.700
Or it's an indication that the network
00:18:15.800
connection has failed, and the packets are
00:18:17.900
just not reaching the receiver.
00:18:19.500
So if TCP is sending data,
00:18:21.633
and it's not getting any acknowledgments back,
00:18:24.066
after a while it times out and
00:18:26.933
uses this as an indication that the
00:18:28.866
connection has failed.
00:18:32.300
Alternatively, it can be sending data,
00:18:35.666
and if some data is lost,
00:18:39.700
but the later segments arrive, then TCP
00:18:42.000
will start sending the duplicate acknowledgments.
00:18:45.166
Again, back to the example, we see
00:18:47.900
that packet eight is lost, packet nine
00:18:50.266
arrives, and the sequence number, the acknowledgement
00:18:53.366
number, comes back saying “I’m expecting sequence
00:18:55.266
number eight”.
00:18:56.966
And packet ten is sent and it
00:18:59.133
arrives, and it still says “I’m still
00:19:00.666
expecting packet with sequence number eight”,
00:19:03.200
and this just carries on.
00:19:05.700
And, eventually, TCP gets what's known as
00:19:08.333
a triple duplicate acknowledgement. It’s got the
00:19:11.833
original acknowledgement saying it's expecting packet eight,
00:19:14.933
and then three duplicates following that,
00:19:17.266
so four packets in total, all saying
00:19:19.433
“I’m still expecting packet eight”.
00:19:22.533
And what this indicates, is that data
00:19:24.900
is still arriving, but something's got lost.
00:19:28.266
It only generates acknowledgements when a new
00:19:30.800
packet arrives, so if we keep seeing
00:19:33.000
acknowledgments indicating the same thing, this indicates
00:19:35.933
that new packets are arriving, because that's what
00:19:38.200
triggers the acknowledgement to be sent,
00:19:40.866
but there's still a packet missing,
00:19:43.400
and it's telling us which one it's expecting.
00:19:46.866
At that point TCP assumes that the
00:19:49.400
packet has got lost, and retransmits that
00:19:51.566
segment. It retransmits the packet with sequence
00:19:54.833
number eight.
00:19:59.233
Why does it wait for a triple duplicate acknowledgement?
00:20:03.466
Why does it not just retransmit it
00:20:06.033
immediately, when it sees a duplicate?
00:20:08.566
Well, the example we see here illustrates that.
00:20:13.466
In this case, a packet with sequence
00:20:15.733
number five is sent, containing one byte
00:20:17.866
of data, and it arrives, and the
00:20:19.866
receiver acknowledges it, saying it's expecting six.
00:20:23.400
And six is sent, and it arrives,
00:20:26.266
and the receiver acknowledges it, indicating it’s
00:20:28.333
expecting seven.
00:20:30.066
And packet seven is sent, and it's
00:20:32.866
delayed. And packet eight is sent,
00:20:35.566
and eventually arrives at the receiver.
00:20:38.233
Now the receiver hasn't received packet seven
00:20:41.100
yet, so it sends an acknowledgement which
00:20:43.500
says “I’m still expecting seven”. So that's
00:20:46.066
a duplicate acknowledgement.
00:20:48.200
At that point packet seven, which was
00:20:50.466
delayed, finally does arrive.
00:20:53.866
Now packet seven has arrived, packet eight
00:20:56.466
had arrived previously, so what it's now
00:20:58.600
expecting is nine, so it sends an
00:21:00.833
acknowledgement for nine.
00:21:02.866
And we see that the acknowledgments go
00:21:05.266
six, seven, seven, nine, because that packet
00:21:08.033
seven was delayed a little bit.
00:21:11.900
And if TCP reacts to a single
00:21:14.300
duplicate acknowledgement as an indication that the
00:21:17.166
packet was lost, then you run the
00:21:20.233
risk that you're resending a packet on
00:21:23.033
the assumption that it was lost,
00:21:24.933
when it was merely delayed a little bit.
00:21:28.466
And there's a trade off you can make here.
00:21:31.733
Do you treat a single duplicate as
00:21:35.600
an indication of loss? Do you treat
00:21:38.066
two duplicates as an indication of loss?
00:21:40.366
Three? Four? Five? At what point do
00:21:42.900
you say “this as an indication of
00:21:44.300
loss”, rather than just “this is a
00:21:46.566
slightly delayed packet, and it might recover
00:21:49.133
itself in a minute”?
00:21:53.600
The reason that a triple duplicate is
00:21:55.933
used, is because someone did some measurements,
00:21:58.833
and decided that packets being delayed
00:22:01.800
enough to cause one or two duplicates,
00:22:04.500
because they arrived just a little bit
00:22:06.933
out of order, was relatively common.
00:22:09.133
But packets being delayed enough that they
00:22:11.800
cause three or more duplicates is rare.
00:22:14.500
So it's balancing off the speed of loss detection
00:22:17.766
vs. the likelihood that a merely delayed
00:22:20.466
packet is treated as if it were
00:22:22.600
lost, and retransmitted unnecessarily.
00:22:26.300
And, based on the statistics, the belief
00:22:29.500
by the designers of TCP was that
00:22:32.666
waiting for three duplicates was the right threshold.
00:22:36.233
And you could make a TCP version
00:22:38.900
that reduced this to two, or even
00:22:41.300
one duplicate, and it would respond to
00:22:43.666
loss faster, but would have the risk
00:22:45.666
that it's more likely to unnecessarily retransmit
00:22:47.966
something that's just delayed.
00:22:50.500
Or you could make it four,
00:22:52.433
five, six, even more duplicate acknowledgments,
00:22:55.700
which will be less likely to unnecessarily
00:22:57.900
retransmit data. But it’d be slower,
00:23:00.966
because it would be slower in responding
00:23:03.300
to loss, and slower in retransmitting actually lost packets.
00:23:12.766
The other behaviour of TCP, which is
00:23:16.033
worth noting, is head-of-line blocking.
00:23:19.566
Now, in this case we're sending something
00:23:21.866
more realistic. We're sending full size packets,
00:23:24.166
with 1500 bytes of data in each packet.
00:23:26.900
And 1500 is the maximum packet size
00:23:29.333
that you can send in an Ethernet
00:23:31.733
packet, or in a WiFi packet,
00:23:33.833
so this is a typical size that actually gets sent.
00:23:37.366
In this case, the first packet is
00:23:40.366
sent with sequence numbers in the range
00:23:42.966
zero through to 1499.
00:23:46.266
And this arrives at the receiver,
00:23:48.266
and the receiver sends an acknowledgement saying
00:23:50.500
it got it, and the next packet
00:23:52.300
it’s expecting has sequence number 1500.
00:23:55.666
So it sends an acknowledgement for 1500.
00:23:58.666
And if there’s a recv() call outstanding
00:24:01.033
on that socket, that recv() call will
00:24:03.400
return at that point, and return 1500
00:24:05.100
bytes of data. It returns the data
00:24:07.733
as it was received.
00:24:09.600
The next packet arrives at the receiver,
00:24:11.866
containing sequence numbers 1500 through to 2999,
00:24:16.800
and again the recv() call, if there
00:24:19.266
is one, will return, and return that
00:24:21.233
next 1500 bytes.
00:24:23.200
Similarly, when the packet containing the next
00:24:25.833
1500 comes in, the receiver will send
00:24:28.433
the ACK saying “I’m expecting 4500”,
00:24:30.533
and the recv() call will return.
00:24:33.733
The packet containing sequence numbers 4500 through
00:24:37.500
to 5999 is lost.
00:24:40.633
The packet containing 6000 through to 7499 arrives.
00:24:47.466
The acknowledgement goes back indicating that it’s
00:24:50.166
still expecting sequence number 4500, because that
00:24:53.166
packet got lost. And at that point,
00:24:56.233
some data has arrived, some new data
00:24:57.966
has arrived at the receiver.
00:24:59.600
But there's a gap. The packet
00:25:02.566
containing data with sequence numbers
00:25:05.266
4500 through to 5999 is still missing.
00:25:08.833
So if the receiver application has called
00:25:12.933
recv() on that socket, it won't return.
00:25:16.366
The data has arrived, it's buffered up
00:25:18.833
in the TCP layer in the operating
00:25:20.800
system, but TCP won't give it back
00:25:22.400
to the application.
00:25:24.933
And the packets can keep being sent,
00:25:27.200
and the receiver keeps sending the duplicate
00:25:29.700
acknowledgments, and eventually it’s sent the triple
00:25:32.266
duplicate acknowledgement, and the TCP sender notices
00:25:35.700
and retransmits the packet with sequence numbers
00:25:38.366
4500 through to 5999.
00:25:41.833
And eventually those arrive at the receiver.
00:25:45.900
At that point, the receiver has a
00:25:48.966
contiguous block of data available, with no
00:25:51.133
gaps in it, and it returns all
00:25:54.100
of the data from sequence number 4500
00:25:57.000
up to sequence number 12,000,
00:26:00.533
up to the application in one go.
00:26:03.333
And if the application has given a
00:26:05.600
big enough buffer, at that point the
00:26:07.366
recv() call will return 7500 bytes of
00:26:09.766
data. It’ll return all of that received
00:26:12.666
data in one big burst.
00:26:18.033
And then, as the data
00:26:20.700
gets retransmitted, as the data arrives,
00:26:23.066
the recv() call
00:26:25.233
will unblock, and data
00:26:27.066
will start flowing.
00:26:29.133
The point is the TCP receiver waits
00:26:31.700
for any missing data to be delivered.
00:26:34.366
If anything's missing, the triple duplicate ACK
00:26:37.900
happens, it eventually gets retransmitted, and the
00:26:40.933
receiver won't return anything to the application
00:26:43.533
until that retransmission has happened.
00:26:48.200
It’s called head of line blocking.
00:26:50.066
The data stops being delivered, until it
00:26:52.433
can be delivered in sequence to the
00:26:54.466
application. It’s all just buffered up in
00:26:56.633
the operating system, in the TCP code.
00:26:58.933
TCP always gives the data to the
00:27:01.100
application in a contiguous ordered sequence,
00:27:03.000
in the order it was sent.
00:27:04.933
And this is another reason why the
00:27:06.700
recv() calls don't always preserve the message boundaries.
00:27:09.600
Because it depends how much data was
00:27:11.700
queued up because of packet losses,
00:27:13.466
and so on, so that it can
00:27:15.266
always be delivered in order.
00:27:19.266
The head of line blocking increases the
00:27:21.900
total download time. We see on the
00:27:24.500
left, the case where one packet was
00:27:27.133
lost, and had to be re-transmitted.
00:27:29.500
And we see on the right,
00:27:31.033
the case where all the packets were
00:27:32.866
received on time. And we see an
00:27:34.666
increase in the download time because of
00:27:36.466
the packet loss.
00:27:40.733
It blocks the receiving, it delays things
00:27:43.700
a little bit, waiting for the retransmission.
00:27:46.533
And it increases the overall download time
00:27:50.500
a little bit.
00:27:52.366
It disrupts the timing of when the
00:27:54.966
packets are received during the download quite
00:27:57.333
significantly. We see 1500, 1500, 1500,
00:28:02.400
big gap, 7500,
00:28:04.666
1500, 1500,
00:28:07.666
in the case where the packets were
00:28:09.300
lost. Or, in the case where they
00:28:10.733
were all received, the data is coming
00:28:12.533
in quite smoothly. It's regularly spaced.
00:28:14.966
So it affects the timing, it affects
00:28:17.133
when the data is delivered to the
00:28:18.733
application, and it has a smaller effect
00:28:20.300
on the overall download times.
00:28:28.633
And if you're building real time applications,
00:28:32.000
this is a significant problem. We see
00:28:34.833
the case on the right, if everything
00:28:36.866
is delivered on time, then the data
00:28:39.566
is released to the application very quickly
00:28:41.800
and very predictably.
00:28:43.566
And you don't need
00:28:47.333
much buffering delay at the receiver.
00:28:49.600
Things can be just delivered, things are
00:28:51.600
just delivered to the application, repeatedly on
00:28:53.600
a regular schedule.
00:28:55.033
But the minute something gets lost,
00:28:57.233
it has to wait for the retransmission.
00:28:59.333
In this case it waits for one
00:29:00.966
round trip time, because the ACK has
00:29:02.866
to get back, and then the data has to be retransmitted.
00:29:05.200
Plus, it has to wait for four
00:29:07.100
times the gap between packets, to allow
00:29:09.500
for the four duplicates, the triple duplicate
00:29:12.066
ACK and the original ACK, so you
00:29:14.366
get one round trip time plus four
00:29:16.500
times the packet spacing.
00:29:18.066
So if you're using TCP to send,
00:29:20.266
for example, speech data, where it's sending
00:29:22.400
packets regularly every 20 milliseconds, you need
00:29:25.133
to buffer 80 milliseconds plus the round
00:29:27.666
trip time, to allow for these re-transmissions,
00:29:30.766
if you're using it for a real time application.
00:29:33.766
Because, it waits for the retransmissions, and because
00:29:38.433
of the head of line blocking.
00:29:41.133
And when you're using applications like Netflix
00:29:44.933
or the iPlayer, when you press play on the video
00:29:47.433
there’s a little pause where it says “buffering”.
00:29:49.700
This is what it's doing. It’s buffering
00:29:51.766
up enough data that it can wait
00:29:54.933
for the retransmissions to happen,
00:29:57.666
buffering up enough data in the TCP
00:29:59.866
connection that it can keep playing out
00:30:01.633
the video frames, in order, while still
00:30:04.633
allowing time for a retransmission to happen.
00:30:07.100
So it's buffering up the data waiting,
00:30:09.533
making sure there's enough data buffered up,
00:30:12.766
because of this head of line blocking
00:30:14.366
issue in TCP.
00:30:20.300
So that concludes the discussion of TCP.
00:30:23.700
It gives you an ordered, reliable, byte stream.
00:30:28.233
As a service model it's easy to
00:30:30.433
understand. It’s like reading from a file;
00:30:33.133
you read from the connection and the
00:30:35.733
bytes arrive reliably and in the order they were sent.
00:30:39.733
The timing, though, is unpredictable. How much
00:30:43.566
you get from the connection each time you read from it,
00:30:46.433
and whether the data arrives regularly,
00:30:48.800
or whether it arrives in big bursts
00:30:50.700
with large gaps between them, depends on
00:30:53.100
how much data is lost, and depends
00:30:55.233
on whether the TCP has to retransmit missing data.
00:30:59.066
And if you're just using this to
00:31:00.633
download files that doesn't matter. It means
00:31:03.700
that the progress bar is perhaps inaccurate,
00:31:05.866
but otherwise it doesn't make much difference.
00:31:08.466
But, if you're using it for real
00:31:10.066
time applications, like video streaming, like telephony,
00:31:14.066
this head of line blocking can quite
00:31:15.866
significantly affect the play out.
00:31:18.966
And a lot of that is the
00:31:20.500
reason why real time
00:31:23.366
applications use UDP. And for those that
00:31:26.233
don't use UDP,
00:31:27.700
applications like Netflix that use adaptive streaming
00:31:32.633
over HTTP, which we'll talk about in
00:31:35.166
lecture seven, that's why there’s this buffering
00:31:37.466
delay before they start playing.
00:31:40.966
And, of course, the lack of framing
00:31:42.700
complicates the application design, you have to
00:31:44.900
parse the data to make sure you've got all the data;
00:31:47.166
there's no message boundaries in there,
00:31:50.033
so you have to parse the data.
00:31:51.966
It doesn't tell you, the connection doesn't
00:31:53.700
tell you, when you've received all the data.
00:31:57.433
So that's it for TCP.
00:32:00.433
It delivers data reliably. It uses sequence
00:32:03.533
numbers and acknowledgments to indicate when the
00:32:06.133
data arrived.
00:32:07.633
It uses timeouts to indicate that a
00:32:09.733
connection has failed. And it uses this
00:32:12.433
idea of triple duplicate ACKs to indicate
00:32:14.866
that a packet has been lost,
00:32:16.300
and trigger a retransmission of any lost data.
00:32:19.833
What I’ll talk about in the next
00:32:21.333
part is QUIC and how it differs
00:32:23.266
from the way TCP handles reliability.
Part 4: Reliable Data Transfer with QUIC
The final part of the lecture discusses reliable data transfer using
QUIC. It outlines the QUIC service model, and how it differs from
that of TCP, and shows how QUIC achieves reliable data transfer.
It discusses how QUIC provides multiple streams within a single
connection, and considers how this affects head-of-line blocking
and latency. Approaches to making best use of multiple streams
are discussed.
Slides for part 4
00:00:00.100
In this final part I’d like to
00:00:02.533
talk about how reliable data transfer works
00:00:04.633
with QUIC, and how it's different to
00:00:07.100
reliable data transfer with TCP.
00:00:09.533
I’ll talk a little bit about the
00:00:11.733
QUIC service model, and how it handles
00:00:13.966
packet numbers and retransmission. I’ll talk about
00:00:16.166
the multi-streaming features of QUIC. And I’ll
00:00:19.133
talk about how it avoids head-of-line blocking.
00:00:23.333
The service model for TCP, as we
00:00:26.533
saw previously, is that it delivers a
00:00:29.100
single reliable, ordered, byte stream of data.
00:00:32.700
Applications write a stream of bytes in,
00:00:34.933
and that stream of bytes is delivered
00:00:37.033
to the receiver, eventually.
00:00:39.166
QUIC, by contrast, delivers several ordered reliable
00:00:42.200
byte streams within a single connection.
00:00:45.166
Applications can separate the data they're sending
00:00:47.933
into different streams, and each stream is
00:00:49.966
delivered reliably and in order.
00:00:52.066
QUIC doesn't preserve the ordering between the
00:00:54.666
streams within a connection, so if you
00:00:57.266
send in one stream, and then send
00:00:59.866
in a second stream, then the data
00:01:02.500
you sent second, in that second stream,
00:01:04.700
may arrive first, but it preserves the
00:01:06.666
ordering within a stream.
00:01:09.300
And you can treat each stream as
00:01:11.833
if it were running multiple TCP connections
00:01:15.366
in parallel, so it gives you the
00:01:17.100
same service model with several streams of
00:01:19.433
data, or you could perhaps treat each stream as a
00:01:22.900
sequence of messages to be sent,
00:01:25.600
with the streams indicating message boundaries.
00:01:30.366
QUIC delivers data in packets.
00:01:33.466
Each QUIC packet has a packet sequence
00:01:36.366
number, a packet number,
00:01:38.266
and the packet numbers
00:01:41.333
are split into two packet number spaces.
00:01:44.666
The packets sent during the initial QUIC
00:01:48.033
handshake start with packet sequence number zero,
00:01:50.900
and that packet sequence number increases by
00:01:53.033
one for each packet sent during the handshake.
00:01:56.066
Then, when the handshake’s complete, and it
00:01:58.800
switches to sending data, it resets the
00:02:01.666
packet sequence number to zero and starts again.
00:02:05.166
Within each of these packet number spaces,
00:02:07.666
the handshake space, and the data space,
00:02:10.833
the packet number sequence starts at zero,
00:02:13.400
and goes up by one for every packet sent.
00:02:16.566
That is, the sequence numbers in QUIC,
00:02:18.733
the packet numbers in QUIC, count the
00:02:20.966
number of packets of data being sent.
00:02:23.233
That's different to TCP. In TCP,
00:02:25.400
the sequence number in the header counts
00:02:27.966
the offset within the byte stream,
00:02:30.400
it counts how many bytes of data
00:02:32.166
have been sent. Whereas in QUIC,
00:02:34.300
the packet numbers count the number of packets.
00:02:38.033
Inside a QUIC packet is a sequence
00:02:40.833
of frames. Some of those frames may
00:02:43.100
be stream frames, and stream frames carry data.
00:02:46.600
Each stream frame has a stream ID,
00:02:50.066
so it knows which of the many sub-streams
00:02:52.200
it’s carrying data for, and it
00:02:53.766
also has the amount of data being carried,
00:02:57.033
and the offset of that data from the start of the stream.
00:02:59.866
So, essentially the stream frames contain offsets
00:03:03.833
which play the same role as TCP
00:03:05.400
sequence numbers, in that they count bytes
00:03:07.366
of data being sent in that stream.
00:03:09.500
And the packets have sequence numbers that
00:03:11.766
count the number of packets being sent.
00:03:14.533
And we can see this in the
00:03:16.533
diagram on the right, where we see
00:03:18.366
the packet numbers going up, zero,
00:03:20.466
one, two, three, four. And the stream
00:03:22.433
numbers, packet zero carries data from the
00:03:24.566
first stream, bytes zero through 1000.
00:03:27.733
Packet one carries data from the first
00:03:29.733
stream, bytes 1001 to 2000. And packet
00:03:32.700
two carries bytes 2001 to 2500
00:03:36.833
from the first stream, and zero to
00:03:38.866
500 from the second stream, and so on.
00:03:41.566
And we see that we can send
00:03:44.333
data on multiple streams in a single packet.
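The structure just described, packets numbered in sequence and stream frames each carrying a stream ID, an offset, and some data, can be sketched as a small data model in Python (the names are illustrative, not the QUIC wire format):

```python
from dataclasses import dataclass

@dataclass
class StreamFrame:
    """One STREAM frame: a chunk of a single stream's byte stream."""
    stream_id: int
    offset: int    # byte offset of this chunk within its stream
    data: bytes

@dataclass
class Packet:
    """A QUIC packet: a packet number plus the frames it carries."""
    packet_number: int
    frames: list

# Packet 2 from the example: data from two streams in one packet.
pkt2 = Packet(packet_number=2, frames=[
    StreamFrame(stream_id=1, offset=2001, data=b"x" * 500),
    StreamFrame(stream_id=2, offset=0,    data=b"y" * 500),
])
```

The packet number counts packets, while each frame's offset counts bytes within its own stream, which is why the two numbering schemes coexist in one packet.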
00:03:50.400
QUIC doesn't preserve message boundaries within the
00:03:53.200
streams. In the same way that,
00:03:56.000
within a TCP stream, if you write
00:03:59.300
data to the stream and the amount you write is too big
00:04:02.300
to fit into a packet, it may
00:04:04.666
be arbitrarily split between packets.
00:04:06.900
Or if the data you send in a TCP stream is too small,
00:04:09.566
and doesn't fill a whole packet,
00:04:11.500
it may be delayed waiting for more
00:04:13.433
data, to be able to fill up
00:04:15.033
the packet before it’s sent.
00:04:16.666
The same thing happens with QUIC.
00:04:18.633
If the amount of data you write to a stream is too big to
00:04:21.500
fit into a QUIC packet, then it
00:04:23.366
will be split across multiple packets.
00:04:26.166
Similarly, if the amount of data you
00:04:27.866
write to a stream is very small,
00:04:29.633
QUIC may buffer it up, delay it,
00:04:31.766
wait for more data, so it can
00:04:33.366
send it and fill a complete packet.
00:04:36.666
In addition, QUIC can take data from
00:04:39.466
more than one stream, and send it
00:04:41.300
in a single packet, if there’s space to do so.
00:04:44.566
And if there's more than one stream
00:04:46.833
with data that's available to send,
00:04:48.766
then the QUIC sender can make an
00:04:51.033
arbitrary decision about how it prioritises that data,
00:04:53.300
and how it delivers frames from each stream.
00:04:56.033
And usually it will split the data
00:04:59.200
from the streams, so each
00:05:01.700
packet has half its data from one stream,
00:05:05.200
and half from another stream. But it
00:05:07.400
may alternate them if it wants,
00:05:08.833
sending one packet with data from stream
00:05:10.966
1, one from stream 2, one from
00:05:12.600
stream 1, one from stream 2, and so on.
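As a sketch of one such policy, here is a hypothetical packer that splits each packet's payload budget evenly across the streams that have data pending (one of many valid choices; the specification leaves this entirely to the sender):

```python
def pack_packet(pending, max_payload):
    """Build one packet's frames by sharing the payload budget across
    all streams that have data queued. `pending` maps stream_id to a
    bytearray of unsent data. Illustrative policy only."""
    active = [sid for sid, buf in pending.items() if buf]
    if not active:
        return []
    share = max_payload // len(active)  # equal split of the budget
    frames = []
    for sid in active:
        chunk = bytes(pending[sid][:share])
        del pending[sid][:share]  # consume what we packed
        if chunk:
            frames.append((sid, chunk))
    return frames

pending = {1: bytearray(b"AAAA"), 2: bytearray(b"BB")}
print(pack_packet(pending, max_payload=4))  # → [(1, b'AA'), (2, b'BB')]
```

An alternating policy would instead take each packet's whole budget from a single stream, so that a lost packet only ever affects one stream.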
00:05:17.966
On the receiving side, the receiver sends,
00:05:20.566
the QUIC receiver sends acknowledgments for the
00:05:22.766
packets it receives.
00:05:24.166
So, unlike TCP which acknowledges the next
00:05:27.000
expected sequence number, a QUIC receiver just
00:05:29.566
sends an acknowledgement to say “I got this packet”.
00:05:33.500
So when packet zero arrives, it sends
00:05:35.866
an acknowledgement saying “I got packet zero”.
00:05:38.066
And when packet one arrives, it sends
00:05:39.900
an acknowledgement saying “I got packet one”, and so on.
00:05:43.566
The sender needs to remember what data
00:05:46.200
it puts in each packet, so it
00:05:47.800
knows when it gets an acknowledgement for packet two that,
00:05:51.033
in this case, it contained bytes 2001
00:05:54.800
to 2500 from stream one, and bytes
00:05:57.700
zero through 500 from stream two.
00:06:00.233
That information isn't in the acknowledgments.
00:06:02.766
What's in the acknowledgments is just the
00:06:04.500
packet numbers, so the sender needs to
00:06:06.466
keep track of how it puts the
00:06:08.466
data from the streams into the packets.
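A minimal sketch of that sender-side bookkeeping, using an in-memory map (all names are hypothetical):

```python
# The sender records which stream chunks went into each packet,
# because ACKs carry only packet numbers, not stream information.
sent_packets = {}  # packet_number -> list of (stream_id, offset, length)

def record_sent(packet_number, chunks):
    sent_packets[packet_number] = chunks

def on_ack(packet_number):
    # The ACK names only the packet; look up what data it carried.
    return sent_packets.pop(packet_number, [])

record_sent(2, [(1, 2001, 500), (2, 0, 500)])
assert on_ack(2) == [(1, 2001, 500), (2, 0, 500)]
```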
00:06:12.366
The acknowledgments in QUIC are also a
00:06:15.133
bit more sophisticated than they are in
00:06:17.900
TCP, in that it doesn't just have
00:06:20.666
an acknowledgement number field in the header.
00:06:23.533
Rather, it sends the acknowledgments as frames
00:06:26.566
in the packets coming back.
00:06:28.833
And this gives a lot more flexibility, because
00:06:32.533
it can have a fairly sophisticated frame
00:06:35.700
format, and it can change the frame
00:06:37.400
format to support different
00:06:41.266
ways of sending a header, if it needs to.
00:06:45.233
In the initial version of QUIC,
00:06:47.133
what's in the frame format, in the
00:06:49.666
ACK frames coming back from the receiver to the sender,
00:06:53.266
is a field indicating the largest acknowledgement,
00:06:56.633
which is essentially the same as the
00:06:59.433
TCP acknowledgment – it tells you what's
00:07:02.866
the highest sequence number received.
00:07:06.166
There's an ACK delay field, that tells
00:07:08.933
you how long between receiving that packet
00:07:11.633
the receiver waited before sending the acknowledgement.
00:07:15.000
So this is the delay in the
00:07:16.866
receiver. And by measuring the time it
00:07:20.100
takes for the acknowledgment to come back,
00:07:22.100
and removing this ACK delay field,
00:07:24.966
you can estimate the network round trip
00:07:27.366
time excluding the processing delays in the receiver.
00:07:31.466
There’s a list of ACK ranges.
00:07:35.300
And the ACK ranges are a way
00:07:37.100
of the receiver saying “I got a range of packets”.
00:07:40.366
So you can send an acknowledgement that
00:07:42.233
says, I got packets from five through seven
00:07:44.266
in a single go. And you can
00:07:46.800
split this up, with multiple ACK ranges.
00:07:48.833
So you could have an acknowledgement that
00:07:50.766
says “I got packet five; I got packets
00:07:53.466
seven through nine; and I got packets
00:07:55.433
11 through 15” and you can send
00:07:57.533
that all within a single acknowledgement block,
00:07:59.566
in an ACK frame, on the reverse path.
00:08:03.433
And this gives it more flexibility,
00:08:05.433
so it doesn't just have to acknowledge
00:08:07.833
the most recently received packet, which gives
00:08:11.200
the sender more information to make retransmissions.
00:08:14.466
This is a bit like the TCP
00:08:16.666
selective acknowledgement extension.
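The effect of ACK ranges can be illustrated by collapsing a set of received packet numbers into contiguous ranges, roughly the information an ACK frame reports (a sketch of the idea, not the wire encoding):

```python
def ack_ranges(received):
    """Collapse a set of received packet numbers into (first, last)
    ranges, as an ACK frame would report them."""
    ranges = []
    for n in sorted(received):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)  # extend the current range
        else:
            ranges.append((n, n))            # start a new range
    return ranges

# "I got packet 5; packets 7 through 9; packets 11 through 15":
print(ack_ranges({5, 7, 8, 9, 11, 12, 13, 14, 15}))
# → [(5, 5), (7, 9), (11, 15)]
```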
00:08:21.766
Like TCP, QUIC will retransmit lost data.
00:08:26.000
The difference is that TCP retransmits packets,
00:08:30.700
exactly as they would be originally sent,
00:08:33.400
so the retransmission looks just the same
00:08:35.466
as the original packet.
00:08:37.633
QUIC never retransmits packets.
00:08:40.500
Each packet in QUIC has a unique packet sequence number,
00:08:45.166
and each packet is only ever transmitted once.
00:08:48.366
What QUIC does instead is retransmit
00:08:51.000
the data which was in those packets
00:08:53.233
in a new packet.
00:08:55.533
So in this example, we see that
00:08:57.600
packet, on the slide, we see that
00:08:59.900
packet number two got lost, and it
00:09:01.633
contained bytes 2001 to 2500
00:09:06.033
from stream one, and bytes zero through 500 from stream two.
00:09:10.333
And, when it gets the acknowledgments indicating
00:09:12.933
that packet was lost, it resends that data.
00:09:16.233
And in this case it’s resending, in
00:09:18.733
packet six, the bytes
00:09:21.766
2001 to 2500 from stream one,
00:09:28.533
and it will eventually, at some point
00:09:30.533
later, retransmit the data from stream two.
00:09:36.700
As we say, each packet has a
00:09:38.466
unique packet sequence number. Since
00:09:41.700
each packet is acknowledged as it
00:09:43.666
arrives, and it’s not acknowledging the
00:09:46.666
next sequence number expected
00:09:49.400
in the same way TCP does,
00:09:51.833
you can’t do the triple duplicate ACK
00:09:53.700
in the same way, because you don't
00:09:55.933
get duplicate ACKs. Each ACK acknowledges the
00:09:58.266
next new packet.
00:09:59.666
Rather QUIC declares a packet to be
00:10:02.333
lost when it’s got ACKs for three
00:10:05.033
packets with higher packet numbers than the
00:10:07.500
one which it sent.
00:10:09.333
At that point, it can retransmit the
00:10:11.333
data that was in that packet.
00:10:13.366
And that’s QUIC’s equivalent to the triple
00:10:15.633
duplicate ACK; it’s ACKs for three later packet numbers
00:10:18.600
rather than three duplicate acknowledgments.
00:10:20.766
And also, just like TCP, if there's
00:10:22.666
a timeout, and it stops getting ACKs,
00:10:24.533
then it declares the packets to be lost.
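That rule can be sketched as follows, taking the description above literally: a packet counts as lost once three packets with higher numbers have been acknowledged (real implementations also use time-based thresholds):

```python
def lost_packets(sent, acked, packet_threshold=3):
    """Declare a packet lost once at least `packet_threshold` packets
    with higher packet numbers have been ACKed. Simplified sketch;
    timeouts also trigger loss, as described above."""
    return [p for p in sent
            if p not in acked
            and sum(1 for a in acked if a > p) >= packet_threshold]

# Packets 0-6 sent; packet 2 never acknowledged.
print(lost_packets(sent=range(7), acked={0, 1, 3, 4, 5, 6}))  # → [2]
```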
00:10:31.366
QUIC delivers multiple streams within a single
00:10:35.500
connection. And within each stream, the data
00:10:39.433
is delivered reliably, and in the order it was sent.
00:10:43.466
If a packet’s lost, then that clearly
00:10:46.100
causes data for the stream, or streams,
00:10:48.533
where the data was included in that packet to be lost.
00:10:52.600
Whether a packet loss affects one,
00:10:55.600
or more, streams really depends on how
00:10:57.400
the sender chooses to put the data
00:10:59.266
from different streams into the packets.
00:11:02.300
It’s possible that a QUIC packet can
00:11:04.700
contain data from several streams. We saw
00:11:08.333
in the examples, how the packets contain
00:11:10.700
data from both stream one and stream two simultaneously.
00:11:13.566
In that case, if a packet is
00:11:15.833
lost, it will affect both of the
00:11:18.500
streams, all of the streams if there’s
00:11:20.333
data from more than two streams in the packet.
00:11:23.333
Equally, a QUIC sender can choose to
00:11:27.133
alternate, and send one packet with data
00:11:29.933
from stream one, and then another packet
00:11:32.066
with data from stream two, and only
00:11:34.266
ever put data from a single stream in each packet.
00:11:37.400
The specification puts no requirements on how
00:11:40.433
the sender does this, and different senders
00:11:42.766
can choose to do it differently depending on
00:11:47.233
whether they're trying to make progress on
00:11:50.000
each stream simultaneously, or whether
00:11:54.000
they want to alternate, and make sure
00:11:57.200
that packet loss only ever affects a single stream.
00:12:01.266
Depending on how they do this,
00:12:03.300
the streams can suffer from head of
00:12:05.366
line blocking independently.
00:12:07.500
If data is lost on a particular
00:12:09.800
stream, then that stream can't deliver later
00:12:14.866
data to the application, until that
00:12:18.033
lost data has been retransmitted. But the
00:12:21.500
other streams, if they've got all the
00:12:23.533
data, can keep delivering to the application.
00:12:26.100
So streams suffer from head of line
00:12:28.300
blocking individually, but there's no head of
00:12:30.133
line blocking between streams.
00:12:32.600
This means that the data is delivered
00:12:35.466
reliably, and in order, on a stream,
00:12:37.866
but order’s not preserved between streams.
00:12:42.266
It’s quite possible that one stream can
00:12:45.033
be blocked, waiting for a retransmission of
00:12:47.000
some of the data in the packets,
00:12:48.800
while the other streams are continuing to
00:12:50.900
deliver data, having seen no loss
00:12:52.833
themselves.
00:12:54.700
Each stream is sent and received independently.
00:12:57.866
And this means if you're careful with how you split data
00:13:00.800
across streams, and if the implementation is
00:13:04.300
careful with how it puts data from
00:13:05.900
streams into different packets, it can limit
00:13:08.233
the duration of the head of line
00:13:09.600
blocking, and make the streams independent in
00:13:11.766
terms of head of line blocking and data delivery.
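A per-stream receive buffer makes this independence concrete: a gap blocks in-order delivery on that one stream while every other stream keeps delivering. A minimal sketch:

```python
class StreamReassembler:
    """Receive buffer for one stream: hands bytes to the application
    only in order, so a gap blocks this stream and no other."""
    def __init__(self):
        self.chunks = {}      # offset -> bytes waiting for delivery
        self.next_offset = 0  # next byte the application expects

    def on_frame(self, offset, data):
        """Store a received chunk; return whatever is now deliverable."""
        self.chunks[offset] = data
        out = b""
        while self.next_offset in self.chunks:  # contiguous run available
            piece = self.chunks.pop(self.next_offset)
            out += piece
            self.next_offset += len(piece)
        return out
```

One reassembler per stream means a retransmission pending on stream 1 never delays delivery on stream 2.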
00:13:18.566
QUIC delivers, as we've seen, several ordered,
00:13:21.900
reliable, byte streams of data in a single connection.
00:13:27.333
How you treat these different byte streams,
00:13:30.000
is, I think, still a matter of interpretation.
00:13:33.600
It's possible to treat a QUIC connection
00:13:36.266
as though it was several parallel TCP connections.
00:13:40.333
So, rather than opening multiple TCP connections
00:13:42.700
to a server, you open one QUIC
00:13:45.100
connection, and you send and receive several
00:13:47.500
streams of data within that.
00:13:49.300
And then you treat each stream of
00:13:51.266
data as-if it were a TCP stream,
00:13:54.466
and you parse and process the data
00:13:56.800
as if it were a TCP stream.
00:13:58.500
And you possibly send multiple requests,
00:14:00.366
and get multiple responses, over each stream.
00:14:04.066
Or, you can treat the streams more as a framing device.
00:14:07.766
You can choose to interpret
00:14:10.300
each stream
00:14:12.433
as sending a single object. And then,
00:14:15.466
when you send data
00:14:17.000
on that stream, once you finish sending
00:14:18.833
that object, you close the stream and
00:14:20.933
move on to use the next one.
00:14:23.266
And, on the receiving side, you just
00:14:25.366
read all the data until you see
00:14:27.500
the end of stream marker, and then
00:14:30.200
you process it knowing you’ve got a complete object.
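That pattern can be sketched as follows, assuming the receiver sees a stream's frames in order and that the final frame carries a FIN (end-of-stream) flag (names hypothetical):

```python
def receive_object(frames):
    """Collect one stream's data until end-of-stream, then return the
    complete object. `frames` is an in-order iterable of (data, fin)."""
    buf = b""
    for data, fin in frames:
        buf += data
        if fin:
            return buf  # stream closed: safe to process as one object
    return None  # stream still open; object incomplete

obj = receive_object([(b'{"id":', False), (b" 7}", True)])
assert obj == b'{"id": 7}'
```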
00:14:34.066
And I think that the best practices,
00:14:36.666
the way of thinking about a QUIC connection,
00:14:39.966
and the streams within a connection, is still evolving.
00:14:42.500
And it's not clear which of these
00:14:44.133
two approaches is necessarily the right
00:14:46.433
way to do it. And I think
00:14:48.033
it probably depends on the application what
00:14:49.766
makes the most sense.
00:14:53.966
So, to conclude for this lecture.
00:14:57.366
We spoke a little bit about best
00:14:59.566
effort packet delivery on the Internet,
00:15:01.300
and why the IP layer delivers data
00:15:04.933
unreliably, and why it's appropriate to have
00:15:09.200
a best effort network.
00:15:11.200
Then we spoke a bit about the different transports.
00:15:14.266
The UDP transport that provides an unreliable,
00:15:17.500
but timely, service on which you can
00:15:20.433
build more sophisticated user space application protocols.
00:15:25.166
We spoke about TCP, that provides a
00:15:27.966
reliable ordered stream delivery service. And we
00:15:30.800
spoke about QUIC, that provides a reliable
00:15:33.600
ordered delivery service with multiple streams of
00:15:36.400
data. And it’s clear there’s different services,
00:15:38.800
different transport protocols, for different needs.
00:15:41.733
What I want to move on to
00:15:43.566
next time, is starting to talk about
00:15:45.300
congestion control and how all these different
00:15:49.166
transport protocols manage the rate at which they send data.
Discussion
Lecture 5 discussed reliable data transfer over the Internet. It started
with a discussion of best effort packet delivery, and an explanation of
why it makes sense for the Internet to be designed to be an unreliable
network. Then, it moved on to discuss UDP and how to make applications
and new transport protocols that work on an unreliable network. There's
a trade-off between timeliness and reliability that's important here,
and the lecture gave some examples of this to illustrate why many
real-time applications use UDP.
The bulk of the lecture discussed TCP. It spoke about how TCP sends
acknowledgements for packets, how timeouts and triple-duplicate ACKs
indicate loss, and why a triple-duplicate ACK is chosen as the loss
signal. It also discussed head-of-line blocking, and how the in-order,
single stream, reliable service model of TCP leads to head-of-line
blocking and potential latency.
Finally, it discussed the differences between QUIC and TCP. QUIC
acknowledges packets rather than bytes within a stream, uses ACK
frames rather than an ACK header, and delivers multiple streams
of data, allowing it to avoid head-of-line blocking in many cases.
The focus of the discussion will be on how TCP ensures reliability,
to make sure the mechanism is understood, and on the differences
between the TCP and QUIC service models and how QUIC can improve
latency. We'll also discuss how UDP can form a substrate on which
new transports, suited to different needs, can easily be built
and deployed.