Networked Systems H (2022-2023)

Lecture 5: Reliability and Data Transfer

Lecture 5 discusses reliable and unreliable data transfer in the Internet. It explains the best-effort nature of packet delivery, the end-to-end argument, and the timeliness-vs-reliability trade-off inherent in the design of the Internet. It also discusses three transport protocols in use in the Internet, UDP, TCP, and QUIC, and how they provide different degrees of timeliness and reliability, and offer different services to applications.

Part 1: Packet Loss in the Internet

The first part of the lecture discusses packet loss in the Internet. It talks about the causes of packet loss, the end-to-end argument, and the timeliness-reliability trade-off.

Slides for part 1


00:00:00.633 In this lecture I want to move

00:00:02.333 on from the discussion of connection establishment,

00:00:04.800 and talk instead about reliability and effective

00:00:07.433 data transfer across the network.


00:00:10.000 There are four parts to this.


00:00:12.000 In this first part, I’ll talk briefly

00:00:14.066 about packet loss in the Internet,

00:00:15.866 and the trade-off between reliability and timeliness.


00:00:19.000 Then, I’ll move on to discuss unreliable

00:00:21.633 data using UDP, and talk about the

00:00:23.833 types of applications that benefit from this.


00:00:26.700 In part three, I’ll talk about reliable

00:00:29.000 data transfer with TCP. I’ll discuss the

00:00:31.966 TCP service model, how TCP ensures data

00:00:34.866 is delivered reliably, and some of the

00:00:37.066 limitations of TCP relating to head-of-line blocking.


00:00:41.000 Then, in the final part, I’ll conclude

00:00:43.100 by discussing how QUIC transfers data and

00:00:45.433 how this differs from TCP.


00:00:49.866 I want to start by discussing packet loss in the Internet.

00:00:52.833 What we mean when we say that the Internet

00:00:55.000 provides a best effort service.

00:00:57.066 The end-to-end argument.

00:00:58.733 And the timeliness vs reliability trade-off inherent

00:01:01.400 in the design of the Internet.


00:01:05.833 As we discussed back in lecture 1,

00:01:08.066 the Internet is a best effort packet delivery network.


00:01:11.933 This means that it’s unreliable by design.


00:01:15.000 IP packets can be lost, delayed,

00:01:17.533 reordered, or corrupted in transit. And this

00:01:20.900 is regarded as a feature, rather than a bug.


00:01:23.766 A network that can’t deliver

00:01:25.766 a packet is supposed to discard it.


00:01:29.000 There are many reasons why a packet

00:01:31.133 can get lost or discarded. It could

00:01:33.733 be due to a transmission error,

00:01:35.433 where electrical noise or wireless interference corrupts

00:01:37.833 the packet in transit, making the packet unreadable.


00:01:41.833 Or it could be because too much

00:01:43.300 traffic is arriving at some intermediate link

00:01:45.500 in the network, so an intermediate router

00:01:48.433 runs out of buffer space. If traffic

00:01:51.033 is arriving at a router from several

00:01:52.666 different incoming links, but all going to

00:01:55.200 the same destination, so it’s arriving faster

00:01:57.700 than it can be delivered, a queue

00:01:59.900 of packets will build up, waiting for transmission.


00:02:03.033 If this situation persists, the queue might

00:02:05.400 grow so much that a router runs

00:02:07.633 out of memory, and has no choice

00:02:09.400 but to discard the packets.
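
As a rough illustration of the queue overflow described above, the following sketch models a router output queue with a fixed amount of buffer space. The function name, the per-step arrival and departure counts, and the simple drop-tail policy are all assumptions made for the example; real routers are considerably more complex.

```c
/* Hypothetical model of a router output queue. In each time step,
 * `arrivals` packets arrive and the outgoing link can transmit up to
 * `departures` packets. The queue holds at most `capacity` packets;
 * anything that does not fit is discarded (drop-tail).
 * Returns the total number of packets discarded over `steps` steps. */
long simulate_drop_tail(int arrivals, int departures, int capacity, int steps)
{
    long queued = 0;
    long dropped = 0;

    for (int t = 0; t < steps; t++) {
        queued += arrivals;              /* traffic arrives from the incoming links */
        if (queued > capacity) {         /* out of buffer space: discard the excess */
            dropped += queued - capacity;
            queued = capacity;
        }
        /* The outgoing link drains as many packets as it can this step. */
        queued -= (queued < departures) ? queued : departures;
    }
    return dropped;
}
```

With arrivals persistently above the departure rate, the queue fills and packets start being discarded; with arrivals below the departure rate, nothing is lost.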


00:02:12.000 Or packets could be lost because of

00:02:13.966 a link failure. Or a router bug.

00:02:15.833 Or for other reasons.


00:02:18.000 How often this happens varies significantly.


00:02:22.000 The packet loss rate depends on the type of link.


00:02:26.100 Wireless links tend to be less reliable

00:02:28.433 than wired links, for example.


00:02:30.966 It’s reasonably likely that packets sent over a wireless

00:02:34.533 link, such as WiFi or 4G,

00:02:36.466 will be corrupted in transit due to

00:02:38.433 noise, interference, or cross traffic.


00:02:41.066 This is very unlikely on an Ethernet

00:02:43.633 or optical fibre link.


00:02:46.000 The packet loss rate also depends on

00:02:48.933 the overall quality and robustness of the infrastructure.


00:02:52.366 Countries with well developed

00:02:53.800 and well maintained infrastructure

00:02:55.400 tend to have reliable Internet links;

00:02:58.366 countries with less robust or lower

00:03:00.500 capacity infrastructure tend to see more problems.


00:03:04.833 And the loss rate depends on the protocol.

00:03:07.966 Some protocols intentionally try to push

00:03:10.000 links to capacity, causing temporary overload as

00:03:12.500 they try to find the limit,

00:03:14.633 as they try to find the maximum

00:03:16.666 transmission rate they can achieve.


00:03:19.000 TCP and QUIC do this in many cases,

00:03:22.033 depending on the congestion control algorithm

00:03:24.366 used, as we’ll see in lecture 6.


00:03:28.000 Other applications, such as telephony or video

00:03:30.200 conferencing, tend to have an upper bound

00:03:32.400 on the amount of data they can send.


00:03:35.066 Whatever the reason, though,

00:03:36.933 some packet loss is inevitable.


00:03:40.000 The transport layer needs to recognise this.

00:03:42.533 It must detect packet loss. And,

00:03:44.866 if the application needs reliability, it must

00:03:47.133 retransmit or otherwise repair any lost data.


00:03:53.000 That the Internet provides best effort packet

00:03:55.266 delivery is a result of the end-to-end argument.


00:03:58.966 The end-to-end argument considers whether it’s better

00:04:02.133 to place functionality inside the network or

00:04:04.300 at the end points.


00:04:06.833 For example, rather than provide best effort

00:04:09.633 delivery, we could try to make the

00:04:11.766 network deliver packets reliably. We could design

00:04:15.466 some way to detect packet loss on

00:04:17.133 a particular link, and request that the

00:04:19.166 lost packets be retransmitted locally,

00:04:21.466 somewhere within the network.


00:04:23.666 And, indeed, some network links do this.


00:04:27.000 In WiFi networks, for example, the base

00:04:29.666 station acknowledges packets it receives from the

00:04:31.800 clients, and requests any corrupted packets are

00:04:34.966 re-sent, to correct the error.


00:04:38.000 The problem is, that unless this mechanism

00:04:40.333 is 100% perfect all the time,

00:04:43.033 then end systems will still need to

00:04:44.966 check if the data has been received

00:04:46.600 correctly, and will still need some way

00:04:48.600 of retransmitting packets in the case of problems.


00:04:52.000 And if they’ve got that, why bother

00:04:54.133 with the in-network retransmission and repair?


00:04:58.000 Often times, if you add features into

00:05:00.233 the network routers, they end up duplicating

00:05:03.000 functionality that the network endpoints need to

00:05:05.500 provide anyway.


00:05:08.600 Maybe the performance benefit of adding features

00:05:11.833 to the network is so big that it’s worthwhile.


00:05:16.000 But often, the right thing to do

00:05:17.566 is to keep the network simple.


00:05:19.733 Omit anything that can be done by the endpoints.


00:05:22.633 And favour simplicity over the

00:05:24.533 absolute optimal performance.


00:05:28.300 The end-to-end argument is one of the

00:05:29.933 defining principles of the Internet. And I

00:05:32.900 think it’s still a good approach to

00:05:34.566 take, when possible. Keep the network simple, if you can.


00:05:39.000 The paper linked from the slide talks

00:05:40.866 about this subject in a lot more detail.


00:05:46.000 Irrespective of whether retransmission of lost packets

00:05:49.033 happens between the endpoints or within the

00:05:51.766 network, it takes time.


00:05:54.566 This leads to a fundamental trade-off in

00:05:56.400 the design of the network.


00:05:59.000 If a connection is to be reliable,

00:06:01.266 it cannot guarantee timeliness.


00:06:04.400 It’s not possible to build absolutely perfect

00:06:07.066 network links, that never discard or corrupt

00:06:09.433 packets. There’s always some risk that the

00:06:12.566 data is lost and needs to be

00:06:14.833 retransmitted. And retransmitting a packet will always

00:06:18.133 take time, and so disrupt the timeliness of the delivery.


00:06:22.400 And similarly, if a connection is to

00:06:24.600 be timely, it cannot guarantee reliability.


00:06:27.800 There’s a trade-off to be made.


00:06:31.100 Protocols like UDP are timely but don’t

00:06:33.966 attempt to be reliable. They send packets,

00:06:36.800 and if they get lost, they get lost.


00:06:40.533 TCP and QUIC, on the other hand,

00:06:42.566 aim to be reliable. They send the

00:06:45.733 packets, and if they get lost,

00:06:47.366 they retransmit them.


00:06:49.666 And if the retransmission gets lost? They

00:06:52.200 try again, until the data eventually arrives.


00:06:55.533 As we’ll see in part 3 of

00:06:57.533 this lecture, this causes head of line

00:06:59.266 blocking, making the protocol less timely.


00:07:03.000 And other protocols, such as the Real-time

00:07:05.466 Transport Protocol, RTP, that I’ll talk about

00:07:09.166 in lecture 7, or the partially reliable

00:07:11.566 version of the Stream Control Transmission Protocol,

00:07:13.800 SCTP, aim for a middle ground.


00:07:17.466 They try to correct some, but not

00:07:19.100 all, of the transmission errors. They try

00:07:22.000 to achieve a balance, a middle-ground,

00:07:24.233 between timeliness and reliability.


00:07:29.266 The different protocols exist because different applications

00:07:32.400 make different trade-offs.


00:07:34.233 Some applications prefer timeliness,

00:07:36.533 some prefer reliability.


00:07:39.366 For applications like web browsing, email,

00:07:41.833 or messaging, you want to receive all

00:07:44.533 the data. If I’m loading a web

00:07:47.333 site, I’d like it to load quickly,

00:07:49.300 sure. But I prefer for it to

00:07:51.800 load slowly, and be uncorrupted, rather than

00:07:54.433 load quickly with some parts missing.


00:07:57.466 For a video conferencing tool, like Zoom,

00:08:00.100 though, the trade-off is different. If I’m

00:08:03.200 having a conversation with someone, it’s more

00:08:05.166 important that the latency is low,

00:08:07.066 than the picture quality is perfect.


00:08:10.000 The same may be true for gaming.


00:08:13.000 And this has implications for the way

00:08:15.166 we design the network.


00:08:17.000 It means that the IP layer needs

00:08:18.933 to be unreliable. It needs to be

00:08:21.066 a best effort network.


00:08:23.400 If the IP layer is unreliable,

00:08:25.700 protocols like TCP and QUIC can sit

00:08:28.100 on top and retransmit packets to make

00:08:30.200 it reliable. A transport protocol can make

00:08:33.533 an unreliable network into a reliable one.


00:08:37.366 But if the IP layer is reliable,

00:08:39.666 if the IP layer retransmits packets itself,

00:08:42.700 then the network, the applications, the transport

00:08:45.366 protocols, can’t undo that.


00:08:51.466 So this concludes the discussion of packet

00:08:53.533 loss and why the Internet opts to

00:08:55.433 provide an unreliable, best-effort, service.


00:08:58.566 In the next part, I’ll talk about

00:09:00.233 UDP and how to make use of

00:09:02.100 an unreliable transport protocol.

Part 2: Unreliable Data Using UDP

The second part of the lecture discusses UDP. It outlines the UDP service model, and reviews how to send and receive data using UDP sockets, and the implications of unreliable delivery for applications using UDP. It discusses how UDP is suitable for real-time applications that prioritise low latency over reliability. And it discusses the use of UDP as a substrate on which alternative transport protocols can be implemented, avoiding some of the challenges of protocol ossification.

Slides for part 2


00:00:00.300 In this part, I’ll move on to

00:00:02.166 discuss how to send unreliable data using UDP.


00:00:05.400 I’ll talk about the UDP service model,

00:00:07.900 how to send and receive packets,

00:00:09.833 and how to layer protocols on top of UDP.


00:00:14.000 UDP provides an unreliable,

00:00:16.300 connectionless, datagram service.


00:00:18.600 It adds only two features on top

00:00:20.566 of the IP layer: port numbers and a checksum.


00:00:24.000 The checksum is used to detect whether

00:00:26.300 the packet has been corrupted in transit.


00:00:28.666 If so, the packet will be discarded

00:00:30.933 by the UDP code in the operating

00:00:33.066 system of the receiver, and won’t be

00:00:34.833 delivered to the application.
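
To make the checksum idea concrete, here is a minimal sketch of the 16-bit one's-complement Internet checksum, the algorithm described in RFC 1071 that UDP uses. The function name is hypothetical, and the sketch omits the pseudo-header (IP addresses, protocol, and length) that a real UDP implementation also includes in the sum.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's-complement checksum over `len` bytes of data.
 * Assumption: the input is datagram-sized (at most 64 KiB), so the
 * 32-bit accumulator cannot overflow before the carries are folded. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Add the data as a sequence of big-endian 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len == 1) {
        sum += (uint32_t)data[0] << 8;   /* pad a trailing odd byte with zero */
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

The sender places the result in the header. A receiver that runs the same calculation over the data plus the transmitted checksum gets zero if nothing was corrupted, and discards the packet otherwise.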


00:00:37.000 The port numbers determine what application receives

00:00:39.866 the UDP datagrams when they arrive at

00:00:42.066 the destination. They’re set by the bind()

00:00:44.500 call, once the socket has been created.


00:00:47.566 The Internet Assigned Numbers Authority, the IANA,

00:00:50.566 maintains a list of well-known UDP port

00:00:52.633 numbers which you should use for particular

00:00:55.200 applications. This is linked from the bottom of the slide.


00:00:59.400 UDP is very minimal. It doesn’t provide

00:01:02.166 reliability, or ordering, or congestion control.


00:01:05.533 It just delivers packets to an application,

00:01:08.400 that’s bound to a particular port.


00:01:11.000 Mostly, UDP is used as a substrate.


00:01:13.800 It’s a base on which higher-layer protocols are built.


00:01:17.666 QUIC is an example of this,

00:01:19.433 as we discussed in the last lecture.

00:01:21.600 Others are the Real-time Transport Protocol,

00:01:24.100 and the DNS protocol,

00:01:25.466 that we’ll talk about later in the course.


00:01:29.666 UDP is connectionless. It’s got no notion

00:01:32.633 of clients or servers, or of establishing

00:01:34.933 a connection before it can be used.


00:01:38.000 To use UDP, you first create a socket.


00:01:41.433 Then you call bind(),

00:01:42.766 to choose the local port on which that socket

00:01:44.833 listens for incoming datagrams.


00:01:46.900 Then you call recvfrom() if you want

00:01:49.400 to receive a datagram on that socket,

00:01:51.800 or sendto() if you want to send a datagram.


00:01:55.000 You don’t need to connect.

00:01:56.966 You don’t need to accept connections.

00:01:59.266 You just send and receive data.


00:02:01.866 And maybe that data is delivered.


00:02:05.000 When you’re finished, you close the socket.


00:02:08.433 Protocols that run on top of UDP,

00:02:10.700 such as QUIC, might add support for

00:02:13.133 connections, reliability, ordering,

00:02:15.333 congestion control, and so on,

00:02:17.166 but UDP itself supports none of this.
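
The sequence of calls described above (socket(), bind(), sendto(), recvfrom(), close()) can be sketched as follows. This is an illustration rather than code from the slides: the function name is hypothetical, the socket sends a datagram to itself over the loopback interface so the example needs no second host, and the two-second receive timeout is an assumption added so the sketch cannot block forever if the datagram is lost.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Sends one datagram from a UDP socket to itself on the loopback
 * interface. Returns the number of bytes received, or -1 on error. */
int udp_loopback_demo(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd == -1) return -1;

    /* Assumption: give up after two seconds rather than wait forever. */
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = htons(0);      /* port 0: let the OS pick a free port */

    socklen_t addrlen = sizeof(addr);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        getsockname(fd, (struct sockaddr *)&addr, &addrlen) == -1) {
        close(fd);
        return -1;
    }

    /* No connect() or accept(): just name the destination in sendto(). */
    const char msg[] = "hello";
    if (sendto(fd, msg, sizeof(msg), 0,
               (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        close(fd);
        return -1;
    }

    /* recvfrom() also reports who sent the datagram, for use in a reply. */
    char buf[1500];
    struct sockaddr_in from;
    socklen_t fromlen = sizeof(from);
    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                         (struct sockaddr *)&from, &fromlen);
    close(fd);
    return (int)n;
}
```

A real application would bind to a well-known port and send to a remote address, and would need to cope with the datagram never arriving.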


00:02:22.866 To send a UDP datagram, you use

00:02:25.000 the sendto() function.


00:02:27.033 This works similarly to the send() function

00:02:29.566 you used to send data over a

00:02:31.033 TCP connection in the labs, except that

00:02:33.933 it takes two additional parameters to indicate

00:02:36.566 the address to which the datagram should

00:02:39.066 be sent, and the size of that address.


00:02:41.800 When using TCP, you establish a connection

00:02:45.133 between a socket, bound to a local

00:02:46.933 address and port, and a server listening

00:02:49.600 on a particular port on some remote

00:02:51.400 IP address. And once the connection is

00:02:54.033 established, all the data goes over that

00:02:55.966 connection, to the same destination.


00:02:59.033 UDP is not like that.


00:03:01.533 Every time you call sendto(), you specify

00:03:04.333 the destination address. Every packet you send

00:03:07.633 from a UDP socket can go to

00:03:09.800 a different destination, if you want.


00:03:12.166 There’s no notion of connections.


00:03:15.400 Now, you can call connect() on a

00:03:17.600 UDP socket, if you like, but it doesn’t actually create

00:03:20.233 a connection. Rather, it just remembers the

00:03:23.666 address you give it, so you can

00:03:25.400 call send(), rather than sendto() in future,

00:03:28.400 to save having to specify the address each time.
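
A short sketch of calling connect() on a UDP socket, so that plain send() and recv() can be used afterwards. As before, the function name is hypothetical, and the socket is connected to its own loopback address purely to keep the example self-contained; the receive timeout is likewise an assumption.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Uses connect() to record a default destination on a UDP socket.
 * Returns the number of bytes received, or -1 on error. */
int udp_connected_demo(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd == -1) return -1;

    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = htons(0);      /* let the OS pick a free port */

    socklen_t addrlen = sizeof(addr);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        getsockname(fd, (struct sockaddr *)&addr, &addrlen) == -1 ||
        /* No handshake happens here: the kernel just remembers the address. */
        connect(fd, (struct sockaddr *)&addr, addrlen) == -1) {
        close(fd);
        return -1;
    }

    const char msg[] = "ping";
    char buf[64];
    ssize_t n = -1;
    if (send(fd, msg, sizeof(msg), 0) != -1) {   /* no address needed now */
        n = recv(fd, buf, sizeof(buf), 0);
    }
    close(fd);
    return (int)n;
}
```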


00:03:33.000 To receive a UDP datagram, you call

00:03:36.133 the recvfrom() function, as shown on the slide.


00:03:39.800 This is like the recv() call you

00:03:42.000 use with TCP, but again it has

00:03:44.566 two additional parameters. These allow it to

00:03:47.433 record the address that the received datagram

00:03:49.700 came from, so you can use them

00:03:52.133 in the sendto() function to send a reply.


00:03:54.766 You can also call recv(), rather than

00:03:57.000 recvfrom(), like with TCP, and it works,

00:04:00.566 but it doesn’t give you the return

00:04:02.233 address, so it’s not very useful.


00:04:05.366 The important point with UDP is that

00:04:07.900 packets can be lost, delayed, or reordered

00:04:10.166 in transit, and UDP doesn’t attempt to

00:04:12.500 recover from this.


00:04:14.900 Just because you send a datagram,

00:04:17.000 doesn’t mean it will arrive. And if

00:04:19.500 datagrams do arrive, they won’t necessarily arrive

00:04:21.933 in the order sent.


00:04:27.900 Unlike TCP, where data written to a

00:04:30.700 connection in a single send() call might

00:04:32.933 end up being split across multiple read()

00:04:35.066 calls at the receiver, a single UDP

00:04:37.900 send generates exactly one datagram.


00:04:41.566 If it’s delivered at all, the data

00:04:43.800 sent by a single call to sendto()

00:04:45.866 will be delivered by a single call

00:04:47.566 to recvfrom(). UDP doesn’t split messages.


00:04:52.233 But UDP is otherwise unreliable.


00:04:54.966 Datagrams can be lost, delayed, reordered,

00:04:57.766 or duplicated in transit.


00:05:00.400 Data sent with sendto() might never arrive.

00:05:03.300 Or it might arrive more than once.

00:05:05.966 Or data sent in consecutive calls to

00:05:08.266 sendto() might arrive out of order,

00:05:10.633 with data sent later arriving first.


00:05:14.700 UDP doesn’t attempt to correct any of these things.


00:05:19.500 The protocol you build on top of

00:05:21.200 UDP might choose to do so.


00:05:23.900 For example, we saw that QUIC adds

00:05:26.000 packet sequence numbers and acknowledgement frames to

00:05:28.433 the data it sends within UDP packets.


00:05:31.366 This lets it put the data back

00:05:33.100 into the correct order, and retransmit any

00:05:34.966 missing packets.
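
To give a feel for how sequence numbers help, here is a deliberately simplified sketch of a receiver that classifies each arriving packet. It is not QUIC's actual algorithm: the type and function names are hypothetical, and it assumes sequence numbers start at a known value, increase by one per packet, and never wrap around.

```c
#include <stdint.h>

enum seq_event {
    SEQ_IN_ORDER,          /* exactly the packet we were waiting for      */
    SEQ_GAP,               /* arrived early: something before it is missing */
    SEQ_OLD_OR_DUPLICATE   /* reordered late arrival, or a duplicate      */
};

struct seq_tracker {
    uint32_t next_expected;   /* lowest sequence number not yet passed */
};

/* Classifies a received sequence number and updates the tracker. */
enum seq_event seq_observe(struct seq_tracker *t, uint32_t seq)
{
    if (seq == t->next_expected) {
        t->next_expected = seq + 1;
        return SEQ_IN_ORDER;
    }
    if (seq > t->next_expected) {
        /* Packets next_expected .. seq-1 are missing, at least for now. */
        t->next_expected = seq + 1;
        return SEQ_GAP;
    }
    return SEQ_OLD_OR_DUPLICATE;
}
```

A reliable protocol would react to a gap by arranging for the missing data to be retransmitted, while an unreliable one might only record the loss.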


00:05:36.800 But there’s no requirement that the protocol

00:05:38.566 running over UDP is reliable.


00:05:41.500 RTP, the Real-time Transport Protocol, that’s used

00:05:44.933 for video conferencing apps, puts sequence numbers

00:05:47.600 and timestamps inside the UDP datagrams it

00:05:50.666 sends, so it can know if any

00:05:53.033 data is missing, and it can conceal

00:05:55.000 loss or reconstruct the packet playout time,

00:05:58.466 but it generally doesn’t retransmit missing data.


00:06:03.000 UDP gives the application the choice of

00:06:05.566 building reliability, if it wants it.


00:06:07.866 But it doesn’t require that the applications

00:06:09.966 deliver data reliably.


00:06:14.000 Applications that use UDP need to organise

00:06:16.766 the data they send, so it’s useful

00:06:18.400 if some data is lost.


00:06:21.133 Different applications do this in different ways,

00:06:23.633 depending on their needs.


00:06:26.000 QUIC, for example, organises the data into

00:06:28.533 sub-streams within a connection,

00:06:30.300 and retransmits missing data.


00:06:33.266 Video conferencing applications

00:06:34.766 tend to do something different.


00:06:37.333 The way video compression works, is that

00:06:39.700 the codec sends occasional full frames of

00:06:41.700 video, known as I-frames, index frames,

00:06:44.700 every few seconds. And in between these

00:06:48.000 it sends only the differences from the

00:06:49.566 previous frame, known as P-frames, predicted frames.


00:06:53.866 In a video call, it’s common for

00:06:55.900 the background to stay the same,

00:06:57.433 while the person moves in the foreground,

00:06:59.766 so a lot of the frame is

00:07:01.166 the same each time. By only sending

00:07:03.866 the differences, video compression saves bandwidth.


00:07:07.733 But this affects how the application treats

00:07:09.533 the different datagrams.


00:07:12.000 If a UDP datagram containing a predicted

00:07:14.500 frame is lost, it’s not that important.


00:07:17.433 You’ll get a glitch in one frame of video.


00:07:20.500 But if a UDP datagram containing an

00:07:22.700 index frame, or part of an index

00:07:25.300 frame, is lost, then that matters a

00:07:27.533 lot more because the next few seconds

00:07:29.700 worth of video are predicted based on

00:07:31.600 that index frame. Losing an index frame

00:07:34.766 corrupts several seconds worth of video.


00:07:38.000 For this reason, many video conferencing apps

00:07:40.166 running over UDP try to determine if

00:07:42.866 missing packets contained an index frame or

00:07:45.233 not. And they try to retransmit index

00:07:48.000 frames, but not predicted frames.


00:07:51.000 The details of how they do this

00:07:52.833 aren’t really important, unless you’re building a

00:07:54.966 video conferencing app.


00:07:56.766 What’s important though, is that UDP gives

00:07:58.966 the application flexibility to be unreliable for

00:08:02.033 some of the datagrams it sends,

00:08:04.266 while trying to deliver other datagrams reliably.


00:08:08.000 You don’t have that flexibility with TCP.


00:08:12.333 UDP is harder to use, because it

00:08:14.766 provides very few services to help your

00:08:16.600 application, but it’s more flexible because you

00:08:19.333 can build exactly the services you need

00:08:21.200 on top of UDP.


00:08:25.100 Fundamentally, UDP doesn’t make any attempt to

00:08:28.233 provide sequencing, reliability,

00:08:30.500 timing recovery, or congestion control.


00:08:33.766 It just delivers datagrams on a best effort basis.


00:08:38.000 It lets you build any type of

00:08:39.933 transport protocol you want, running inside UDP packets.


00:08:44.000 Maybe that transport protocol has sequence numbers

00:08:46.766 and acknowledgements, and retransmits some or all

00:08:49.533 of the lost packets.


00:08:52.100 Maybe, instead, it uses error correcting codes,

00:08:55.066 to allow some of the packets to

00:08:57.233 be repaired without retransmission.


00:08:59.666 Maybe it includes timestamps, so the receiver

00:09:02.233 can carefully reconstruct the timing.


00:09:04.733 Maybe it contains other information.


00:09:07.166 The point is that UDP gives you

00:09:09.133 flexibility, but at the cost of having

00:09:11.500 to implement these features yourself. At the

00:09:14.066 cost of adding complexity.


00:09:18.000 There’s a lot to think about when

00:09:19.866 writing a UDP-based protocol or a UDP-

00:09:22.133 based application.


00:09:24.033 If you use a transport protocol,

00:09:26.200 like QUIC or like RTP, that runs

00:09:28.633 over UDP, then the designers of that

00:09:31.366 protocol have made these decisions, and will

00:09:33.833 have given you a library you can use.


00:09:36.500 If not, if you’re designing your own

00:09:38.566 protocol that runs over UDP, then the

00:09:41.733 IETF has written some guidelines, highlighting the

00:09:44.066 issues you need to think about,

00:09:45.600 in RFC 8085.


00:09:48.200 Please read this before you try and

00:09:49.866 write applications that use UDP. There are

00:09:52.666 a lot of non-obvious things that can catch you out.


00:09:57.500 So, that concludes our discussion of UDP.


00:10:00.233 In the next part, I’ll talk about

00:10:01.866 how TCP delivers data reliably.

Part 3: Reliable Data with TCP

The third part of the lecture discusses TCP. It outlines the TCP service model and shows how to send and receive data using a TCP connection. It explains how TCP ensures reliable and ordered data transfer, using sequence numbers and acknowledgements. And it explains TCP loss detection using timeouts and triple-duplicate acknowledgements. The issue of head-of-line blocking in TCP connections is discussed, as an example of the timeliness vs reliability trade-off.

Slides for part 3


00:00:00.233 In this part I want to talk

00:00:02.000 about how reliable data is delivered using

00:00:03.966 TCP connections. I’ll talk about the TCP

00:00:06.866 service model, how TCP uses sequence numbers

00:00:10.400 and acknowledgments, and how packet loss detection

00:00:13.500 and recovery works in TCP.


00:00:17.033 Thinking about the TCP service model,

00:00:19.266 as we've seen in previous lectures,

00:00:21.600 TCP provides a reliable, ordered, byte stream

00:00:24.966 delivery service that runs over IP.


00:00:27.966 The applications write data into the TCP

00:00:30.733 socket, that buffers it up in the

00:00:32.833 sending system, and then delivers it as

00:00:35.533 a sequence of data segments over the IP layer.


00:00:38.866 When these data packets, these data segments,

00:00:41.766 are received, they are accumulated in a

00:00:44.533 receive buffer at the receiver. If anything

00:00:47.200 is lost, or arrives out of order,

00:00:49.066 it's re-transmitted, and eventually the data is

00:00:51.433 delivered to the application.


00:00:53.333 The data delivered to the application is

00:00:55.666 always delivered reliably, and in the order sent.


00:00:58.733 If something is lost, if something needs

00:01:01.600 to be re-transmitted, this stalls the delivery

00:01:04.400 of the later data, to make sure

00:01:06.766 that everything is always delivered in order.


00:01:10.966 TCP delivers, as we say, an ordered,

00:01:13.766 reliable, byte stream.


00:01:16.366 After the connection has been established,

00:01:18.466 after the SYN, SYN-ACK, ACK handshake,

00:01:20.866 the client and the server can send

00:01:22.633 and receive data.


00:01:24.700 The data can flow in either direction

00:01:27.166 within that TCP connection.


00:01:29.400 It’s usual that the data follows a

00:01:31.900 request response pattern. You open the connection.

00:01:35.100 The client sends a request to the

00:01:36.733 server. The server replies with a response.


00:01:39.400 The client makes another request. The server

00:01:41.566 replies with another response, and so on.


00:01:44.566 But TCP doesn't make any requirements on

00:01:46.900 this. There’s no requirement that the data

00:01:49.266 flows in a request response pattern,

00:01:51.300 and the client and the server can

00:01:53.666 send data in any order they feel like.


00:01:56.366 TCP does ensure that the data is

00:01:58.400 delivered reliably, and in the order it

00:02:00.600 was sent, though.


00:02:02.766 TCP sends acknowledgments for each data segment

00:02:05.600 as it's received. And if any data

00:02:07.733 is lost, it retransmits that lost data.


00:02:10.733 And if segments are delayed and arrive

00:02:13.033 out of order, or if a segment

00:02:15.166 has to be re-transmitted and arrives out

00:02:17.166 of order, then TCP will reconstruct the

00:02:19.300 order before giving the segments back to the application.


00:02:25.533 In order to send data over a

00:02:27.300 TCP connection you use the send() function.


00:02:30.766 This transmits a block of data over

00:02:33.500 the TCP connection. The parameters are the

00:02:37.133 file descriptor representing the socket – the

00:02:39.500 TCP socket, the data, the length of

00:02:42.066 the data, and a flag. And the

00:02:44.366 flag field is usually zero.


00:02:47.466 The send() function blocks until all the

00:02:49.600 data can be written.


00:02:51.400 And it might take a significant amount

00:02:53.900 of time to do this, depending on

00:02:55.866 the available capacity of the network.


00:02:59.433 It also might not be able to

00:03:00.800 send all the data.


00:03:02.833 If the connection is congested, and can't

00:03:05.233 accept any more data, then the send()

00:03:06.966 function will return to indicate that it

00:03:10.566 wasn't able to successfully send all the

00:03:12.766 data that was requested.


00:03:15.300 The return value from the send() function

00:03:17.300 is the amount of data it actually

00:03:18.766 managed to send on the connection.


00:03:20.266 And that can be less than the

00:03:21.866 amount it was asked to send.


00:03:23.566 In which case, you need to figure

00:03:25.533 out what data was not sent,

00:03:27.100 by looking at the return value,

00:03:29.800 and the amount you asked for,

00:03:31.233 and re-send just the missing part in another call.


00:03:34.833 Similarly, if an error occurs, if the

00:03:37.333 connection has failed for some reason,

00:03:39.500 the send() function will return -1,

00:03:41.166 and it will set the global variable

00:03:42.566 errno to indicate that.
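
The handling of partial sends described above is often wrapped in a small helper. The following is a minimal sketch, assuming a connected stream socket; the name send_all() is hypothetical, and retrying when the call is interrupted by a signal (EINTR) is an addition not mentioned in the lecture.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Keeps calling send() until all `len` bytes have been accepted.
 * Returns 0 on success, or -1 on error with errno set by send(). */
int send_all(int fd, const void *data, size_t len)
{
    const char *p = data;
    size_t sent = 0;

    while (sent < len) {
        /* Ask only for the part that has not yet been sent. */
        ssize_t n = send(fd, p + sent, len - sent, 0);
        if (n == -1) {
            if (errno == EINTR) continue;   /* interrupted: try again */
            return -1;                      /* the connection has failed */
        }
        sent += (size_t)n;                  /* n may be less than requested */
    }
    return 0;
}
```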


00:03:46.800 On the receiving side you call the

00:03:49.100 recv() function to receive data on a

00:03:50.966 TCP connection.


00:03:53.200 The recv() function blocks until data is

00:03:55.833 available, or until the connection is closed.


00:03:59.833 It’s passed a size,

00:04:01.333 It’s passed a buffer, buf, and the

00:04:04.666 size of the buffer, BUFLEN, and it

00:04:07.066 reads up to BUFLEN bytes of data.


00:04:09.600 And what it returns is the number

00:04:11.700 of bytes of data that were read.


00:04:14.066 Or, if the connection was closed,

00:04:16.100 it returns zero. Or, if an error

00:04:18.900 occurs, it returns -1, and again sets

00:04:21.700 global variable errno to indicate what happened.


00:04:26.933 When a recv() call finishes, you have

00:04:29.500 to check these three possibilities. You have

00:04:31.900 to check if the return value is

00:04:33.300 zero, to indicate that the connection is

00:04:35.466 closed and you've successfully received all the

00:04:38.133 data in that connection. At which point,

00:04:40.366 you should also close the connection.


00:04:42.900 You have to check if the return

00:04:44.300 value is minus one, in which case

00:04:46.166 an error has occurred, and that connection

00:04:48.566 has failed, and you need to somehow

00:04:50.566 handle that error.


00:04:53.266 And you need to check if it's some other value,

00:04:55.900 to indicate that you've received some data,

00:04:57.900 and then you need to process that data.


00:05:01.133 What's important is to remember that the

00:05:04.200 recv() call just gives you that data

00:05:07.033 in the buffer. If the return value

00:05:09.700 from recv() is 157, this indicates that

00:05:12.566 the buffer has 157 bytes of data in it.


00:05:16.366 What the recv() call doesn't ever do,

00:05:18.833 is add a terminating null to that buffer.


00:05:22.366 Now, if you're careful that doesn't matter,

00:05:26.133 because you know how much data is

00:05:28.300 in the buffer, and you can explicitly

00:05:30.400 process the data up to that length.


00:05:33.866 But, a common problem with TCP-based applications,

00:05:38.500 is that they treat the data as if it was a string.


00:05:43.366 They pass it to the printf() call

00:05:45.200 using %s as if it were a

00:05:47.200 string, or they pass it to a function

00:05:49.666 like strstr() to search for a string

00:05:51.533 within it, or strcpy(), or something like that.


00:05:56.133 And the problem is the string functions

00:05:58.033 assume there’s a terminating null, and the

00:06:00.333 recv() call doesn't provide one.


00:06:03.766 If you're going to pass the data

00:06:05.866 that's returned from a recv() call to

00:06:08.600 one of the C string functions,

00:06:10.666 you need to explicitly add that null yourself.


00:06:13.866 You need to look at the buffer,

00:06:17.333 add the null at the end,

00:06:19.100 after the last byte which was successfully

00:06:21.533 received. If you don't do this, the

00:06:25.033 string functions will just run off the end of the buffer

00:06:27.300 and you'll get a buffer overflow attack.
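
One way to add the terminating null is sketched below. The helper name `recv_cstring` is an assumption made for this example; the key point is that it asks recv() for at most one byte less than the buffer size, so there is always room for the terminator.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: receive into buf and null-terminate the result.
 * Reads at most cap - 1 bytes, leaving room for the '\0'.
 * Returns the recv() result: >0 bytes received, 0 closed, -1 error. */
ssize_t recv_cstring(int fd, char *buf, size_t cap)
{
    if (cap == 0) {
        return -1;                  /* no room even for the terminator */
    }

    ssize_t n = recv(fd, buf, cap - 1, 0);

    if (n > 0) {
        buf[n] = '\0';              /* terminate after the last received byte */
    } else {
        buf[0] = '\0';              /* closed or error: leave an empty string */
    }
    return n;
}
```

If the data only needs to be printed, an alternative that avoids the terminator altogether is to pass the length explicitly, as in `printf("%.*s", (int) n, buf)`. Note that neither approach helps if the received data itself contains null bytes; in that case the string functions will stop early.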


00:06:29.733 And this is a significant security risk.

00:06:31.733 It’s one of the biggest security problems

00:06:33.666 with network code using C. It’s misusing

00:06:36.900 these buffers, accidentally using one of the

00:06:39.166 string functions, and it just reads off

00:06:41.966 the end of the buffer, and who knows what it processes.


00:06:48.566 When you send data using TCP,

00:06:50.700 the send() call enqueues the data for transmission.


00:06:55.200 The operating system, the TCP code in

00:06:57.900 the operating system, splits the data you've

00:07:00.366 written using the various send() calls into

00:07:02.266 what’s known as segments, and puts each

00:07:04.333 of these into a TCP packet.


00:07:07.433 The TCP packets are sent in IP

00:07:09.533 packets. And TCP runs a congestion control

00:07:12.933 algorithm to decide when it can send those packets.


00:07:17.166 Each TCP segment, each segment is in

00:07:20.200 a TCP packet. The TCP packets have

00:07:22.933 a header, which has a sequence number.


00:07:25.933 When the connection setup handshake happens,

00:07:28.700 in the SYN and the SYN-ACK packets,

00:07:31.366 the connection agrees the initial sequence numbers;

00:07:34.300 agrees the starting value for the sequence numbers.


00:07:37.666 If you’re the client, for example;

00:07:39.600 the client picks a sequence number at

00:07:43.200 random, and sends this in its SYN packet.


00:07:46.433 And then when it starts sending data,

00:07:48.600 the next data packet has a sequence

00:07:50.700 number that is one higher than that

00:07:52.466 in the SYN packet.


00:07:55.033 And, as it continues to send data,

00:07:57.700 the sequence numbers increase by the number

00:08:00.300 of data bytes sent.


00:08:02.400 So, for example, if the initial sequence

00:08:04.533 number was 1001, just picked randomly,

00:08:07.133 and it sends 30 bytes of data

00:08:09.466 in the packet, then the next sequence

00:08:12.733 number will be 1031.
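
The arithmetic in this example can be written as a one-line function. The name `next_seq` is hypothetical. The sketch relies on the fact that TCP sequence numbers are 32 bits wide, so the addition wraps around modulo 2^32 when it passes the largest value.

```c
#include <stdint.h>

/* Sequence number of the byte that follows a segment starting at seq
 * and carrying payload_len bytes of data. Unsigned 32-bit arithmetic
 * wraps modulo 2^32, matching the TCP sequence number space. */
uint32_t next_seq(uint32_t seq, uint32_t payload_len)
{
    return seq + payload_len;
}
```

With the numbers from the example, `next_seq(1001, 30)` gives 1031.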


00:08:16.533 The sequence number spaces are separate for

00:08:18.800 each direction. The sequence numbers

00:08:21.066 the client uses increase based on the

00:08:23.333 initial sequence number the client sent in the SYN packet.


00:08:26.366 The sequence numbers the server uses,

00:08:28.433 start based on the initial sequence number

00:08:30.600 the server sent in the SYN-ACK packet,

00:08:32.700 and increase based on the amount of

00:08:34.766 data the server is sending. The two

00:08:36.366 number spaces are unrelated.


00:08:41.600 What's important is that calls to send()

00:08:44.300 don't map directly onto TCP segments.


00:08:49.066 If the data which is given to

00:08:51.300 a send() call is too big to

00:08:52.900 fit into one TCP segment, then the

00:08:56.100 TCP code will split it across several

00:08:58.366 segments; it'll split it across several packets.


00:09:02.600 Similarly, if the data you send,

00:09:04.900 that data you give the send() call

00:09:06.666 is quite small, TCP might not send

00:09:09.066 it immediately.


00:09:11.066 It might buffer it up, combine it

00:09:13.166 with data sent as part of a

00:09:15.600 later send() call. And combine it,

00:09:18.266 and send it in a single larger

00:09:19.700 segment, a single larger TCP packet.


00:09:23.566 This is an idea known as Nagle’s

00:09:27.100 algorithm. It's there to improve efficiency by

00:09:30.200 only sending big packets, because there's a

00:09:32.633 certain amount of overhead for each packet.


00:09:35.733 Each packet that’s sent by TCP has

00:09:38.033 a TCP header. It’s got an IP

00:09:40.333 header. It's got the Ethernet or the

00:09:42.666 WiFi headers depending on the link layer.


00:09:45.033 And that adds a certain amount of

00:09:47.033 overhead. It’s about, I think, 40 bytes

00:09:48.966 per packet. So if you're only sending

00:09:51.066 a small amount of data, that's a

00:09:52.900 lot of overhead, a lot of wasted data.


00:09:55.533 So TCP, with the Nagle algorithm,

00:09:57.466 tries to combine these packets into larger

00:09:59.500 packets when it can. But, of course,

00:10:01.633 this adds some delay. It’s got to

00:10:03.800 wait for you to send more data;

00:10:05.400 wait to see if it can form a bigger packet.


00:10:09.133 If you really need low latency,

00:10:11.133 you can disable the Nagle algorithm.


00:10:13.100 There’s a socket option called TCP_NODELAY,

00:10:16.000 and we see the code on the

00:10:17.800 slide to show how to use that.


00:10:19.833 So you create the socket, you

00:10:23.300 establish the connection, and then you call

00:10:26.400 the TCP_NODELAY option and that turns this

00:10:28.700 off. And this means that every time

00:10:31.000 you send() on the socket, it immediately

00:10:32.900 gets sent as quickly as possible.
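
The slide itself is not reproduced in this transcript, so the following is only a sketch of how the option is typically set with the standard sockets API. The wrapper name `disable_nagle` is an assumption made for the example; `setsockopt()`, `IPPROTO_TCP`, and `TCP_NODELAY` are the standard names.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical wrapper: disable the Nagle algorithm on a TCP socket.
 * Returns 0 on success, or -1 on error with errno set. */
int disable_nagle(int fd)
{
    int flag = 1;   /* non-zero turns TCP_NODELAY on, i.e. Nagle off */

    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}
```

The lecture describes setting the option after the connection is established. Setting it on the socket before calling connect() generally works too.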


00:10:37.800 One implication of this behaviour, though,

00:10:40.233 where TCP can either split data written

00:10:43.800 in a single send() across multiple segments,

00:10:47.233 or where it can combine several send()

00:10:49.566 calls into a single segment, is that

00:10:52.400 the data returned by the recv() calls

00:10:54.900 doesn't always correspond to a single send().


00:10:58.400 When you call recv(), you might get

00:11:01.166 just part of a message. And you

00:11:03.266 need to call recv() again to get the rest of the message.


00:11:06.700 Or you may get several messages in one recv() call.


00:11:12.600 When you're using TCP, the recv() calls

00:11:14.933 return the data reliably, and they return

00:11:17.266 the data in the order that it was sent.


00:11:20.366 But what they don't do is frame

00:11:22.300 the data. What they don't do is

00:11:23.833 preserve the message boundaries.
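
Because message boundaries are not preserved, applications usually loop on recv() until they have a whole message. The sketch below assumes the application already knows how many bytes to expect, for example because its protocol uses fixed-size messages or a length prefix. That is an assumption of this example rather than something TCP provides, and the name `recv_exact` is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: keep calling recv() until len bytes have arrived.
 * Returns len on success, a smaller count if the peer closed the
 * connection early, or -1 on error. */
ssize_t recv_exact(int fd, void *buf, size_t len)
{
    char  *p   = buf;
    size_t got = 0;

    while (got < len) {
        ssize_t n = recv(fd, p + got, len - got, 0);

        if (n == 0) {
            break;                  /* connection closed before len bytes */
        }
        if (n == -1) {
            if (errno == EINTR) {
                continue;           /* interrupted by a signal: retry */
            }
            return -1;
        }
        got += (size_t) n;          /* may be only part of the message */
    }
    return (ssize_t) got;
}
```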


00:11:27.233 For example, if we're using HTTP,

00:11:30.433 which we see, we see an example

00:11:32.666 of an HTTP message that might be sent,

00:11:34.866 an HTTP response that might be sent,

00:11:37.866 by a web server back to a browser.


00:11:41.566 If we're using HTTP, what we would

00:11:44.500 like is that the whole response is

00:11:46.800 received in one go. So if we're

00:11:50.066 implementing a web browser we just call

00:11:51.766 recv() on the TCP connection

00:11:53.633 and we get all of the headers,

00:11:55.866 and all of the body, in just

00:11:57.500 one call to recv() and

00:11:59.433 we can then parse it, and process it, and deal with it.


00:12:02.833 TCP doesn't guarantee this, though.


00:12:06.133 It can split the messages arbitrarily,

00:12:08.566 depending on how much data was in

00:12:11.166 the packets, what size packets the underlying

00:12:14.233 link layers can send, and on the

00:12:17.066 available capacity of the network depending on

00:12:19.566 the congestion control.


00:12:21.200 And it can split the packets at arbitrary points.


00:12:24.466 For example, if we look at the

00:12:26.800 slide, we see that the headers,

00:12:29.166 some of them are labeled in red,

00:12:30.533 some are in blue, some of the body is in blue,

00:12:33.233 and the rest of the body is

00:12:34.500 in green. And it could be that

00:12:36.400 the TCP connection splits the data up,

00:12:38.600 so that the first recv() call just

00:12:40.633 gets the part of the headers highlighted

00:12:42.466 in red,


00:12:43.500 ending halfway through the “ETag:” line.


00:12:46.466 And then you have to call recv()

00:12:48.333 again. And then you get the part

00:12:50.233 of the message highlighted in blue,

00:12:51.833 which contains the rest of the headers

00:12:53.600 and the first part of the body.


00:12:55.533 Then you have to call recv() again,

00:12:57.433 to get the rest of the message

00:12:59.033 that's highlighted in green on the slide.


00:13:01.300 And this makes it much harder to

00:13:03.166 parse; much harder for the programmer.


00:13:05.833 Because you have to look at the

00:13:07.866 data you've got, parse it, check to

00:13:09.900 see if you've got the whole message,

00:13:11.500 check if you've received the complete headers,

00:13:13.466 check to see if you've received the

00:13:15.033 complete body. And you have to handle

00:13:17.033 the fact that you might have partial messages.
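
As an illustration of the first of these checks, the sketch below tests whether the data accumulated so far contains the complete HTTP headers. It relies on the fact that HTTP/1.1 headers end with a blank line, that is, the byte sequence CR LF CR LF. The function name `find_header_end` is hypothetical, and the caller is assumed to append each recv() result to one buffer and call the function again after each append.

```c
#include <stddef.h>

/* Hypothetical helper: search the first len bytes of buf for the blank
 * line that ends the HTTP headers. Returns the offset of the first byte
 * of the body, or -1 if the headers are not yet complete. */
long find_header_end(const char *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i++) {
        if (buf[i]     == '\r' && buf[i + 1] == '\n' &&
            buf[i + 2] == '\r' && buf[i + 3] == '\n') {
            return (long) (i + 4);
        }
    }
    return -1;      /* need to call recv() again and retry */
}
```

The search works on a length rather than a terminating null, so it is safe to use directly on data returned by recv().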


00:13:20.633 And it's something which makes it a

00:13:22.200 little bit hard to debug, because if

00:13:24.466 you only send small messages,

00:13:25.833 if you're sending packets which are only

00:13:28.200 like 1000 bytes, or so, they’re probably

00:13:31.800 small enough to fit in a single

00:13:33.600 packet, and they always get delivered in one go.


00:13:36.333 It’s only when you start sending

00:13:38.400 larger packets, or sending lots of data

00:13:41.333 over a connection so things get split up

00:13:43.800 due to congestion control, that you start

00:13:45.600 to see this behaviour where the messages

00:13:47.533 get split at arbitrary points.


00:13:54.133 So as we've seen, the TCP segments

00:13:58.200 contain sequence numbers, and the sequence numbers

00:14:00.333 count up with the number of bytes being sent.


00:14:03.600 Each TCP segment also has an acknowledgement number.


00:14:09.366 When a TCP segment is sent,

00:14:12.266 it acknowledges any segments that have previously

00:14:16.666 been received.


00:14:18.866 So if,

00:14:20.266 if a TCP endpoint has received some

00:14:24.733 data on a TCP connection,

00:14:27.266 when it sends its next packet,

00:14:29.400 the ACK bit will be set in

00:14:31.866 the TCP header, to indicate that the

00:14:33.866 acknowledgement number is valid, and the acknowledgement

00:14:36.500 number will have a value indicating the

00:14:39.100 next sequence number it is expecting.


00:14:42.166 That is, the next contiguous byte it's

00:14:44.533 expecting on the connection.


00:14:47.866 So, in the example, we have a

00:14:52.500 slightly unrealistic example in that the connection

00:14:54.733 is sending one byte at a time,

00:14:56.500 and the first packet is sent with sequence number five.


00:14:59.566 And then the next packet is sent

00:15:01.700 with sequence number six, and then seven,

00:15:03.833 and eight, and nine, and ten,

00:15:05.666 and so on. And this is what

00:15:07.800 might happen with an ssh connection,

00:15:09.600 where each key you type generates a

00:15:11.166 TCP segment, with just the one key press in it.


00:15:14.866 And when those packets are received at

00:15:17.866 host B, it sends a TCP segment

00:15:20.700 with the acknowledgement bit set, acknowledging what's

00:15:24.766 expected next.


00:15:26.233 So when it receives the TCP packet

00:15:29.800 with sequence number five, and one byte

00:15:31.833 of data in it, it sends an

00:15:33.900 acknowledgement saying it got it, and it's

00:15:36.133 expecting the packet with sequence number six next.


00:15:40.333 When it receives the packet with sequence

00:15:42.366 number six, and one byte of data

00:15:44.433 in it, it sends an acknowledgement saying

00:15:46.333 it's expecting seven. And so on.


00:15:51.033 TCP only ever acknowledges the next contiguous

00:15:55.766 sequence number expected.


00:15:58.233 And if a packet is lost,

00:16:00.500 subsequent packets generate duplicate acknowledgments.


00:16:05.300 So in this case, packet five was

00:16:08.733 sent. It got to the receiver,

00:16:10.766 and that sent the acknowledgement saying it

00:16:12.633 expected six. Six was sent, arrived at

00:16:15.100 the receiver, so the acknowledgement says it

00:16:17.133 expects seven.


00:16:18.800 Seven was sent, arrives at the receiver,

00:16:21.600 sends the acknowledgement saying it expects

00:16:23.333 eight. Eight was sent, and gets lost.


00:16:29.466 Nine was sent, and arrives at the receiver.


00:16:33.033 At this point, the receiver’s received the

00:16:36.066 packets with sequence numbers five, six,

00:16:38.000 and seven; eight is missing; and nine

00:16:40.366 has arrived. So the next contiguous sequence

00:16:43.400 number it's expecting is still eight.


00:16:46.233 So it sends an acknowledgement saying “I’m

00:16:48.633 expecting sequence number eight next”.


00:16:52.066 The packet sent, the next packet sent,

00:16:55.066 has sequence number 10. This arrives,

00:16:57.633 the acknowledgement goes back saying “I still

00:16:59.800 haven't got eight, I’m still expecting eight”,

00:17:02.400 and this carries on. TCP keeps sending

00:17:04.800 duplicate acknowledgments while there’s a gap in

00:17:06.900 the sequence number space.
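
The receiver behaviour in this example can be modelled in a few lines. This is a toy model and not how a real TCP stack is written: it assumes one-byte segments and small sequence numbers, as in the example, and all names (`ack_receiver`, `on_segment`, `MAX_SEQ`) are hypothetical.

```c
#include <stdbool.h>

#define MAX_SEQ 64  /* toy limit on sequence numbers in this model */

/* Toy receiver state: which one-byte segments have arrived, and the
 * next contiguous sequence number expected. */
struct ack_receiver {
    unsigned next_expected;
    bool     received[MAX_SEQ];
};

/* Record the arrival of the segment with sequence number seq and
 * return the acknowledgement number to send back: always the next
 * contiguous sequence number expected. */
unsigned on_segment(struct ack_receiver *r, unsigned seq)
{
    if (seq < MAX_SEQ) {
        r->received[seq] = true;
    }
    /* Advance past every segment that is now contiguous. */
    while (r->next_expected < MAX_SEQ && r->received[r->next_expected]) {
        r->next_expected++;
    }
    return r->next_expected;
}
```

Starting with `next_expected` set to five and feeding in segments five, six, seven, nine, and ten produces the acknowledgements six, seven, eight, eight, eight. If segment eight then arrives, the acknowledgement jumps to eleven.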


00:17:11.533 In addition, we don't show it here,

00:17:14.000 but TCP can also send delayed acknowledgments,

00:17:16.333 where it only acknowledges every second packet.


00:17:18.466 In this case the acknowledgments might go,

00:17:20.666 six, eight. The packet with sequence number

00:17:23.966 five is sent, and it acknowledges six.


00:17:26.566 Packet with number six is sent,

00:17:28.366 and arrives, and packet number seven is

00:17:30.366 sent, and then it sends the acknowledgement

00:17:32.166 saying it's expecting eight. So it doesn't

00:17:34.366 have to send every acknowledgement, it can

00:17:36.300 send every other acknowledgement to reduce the overheads.


00:17:43.300 TCP uses the acknowledgments to detect packet

00:17:47.800 loss; to detect when segments are lost.


00:17:51.233 There’s two ways in which it does this.


00:17:54.466 The first is that if it sends

00:17:57.433 data, but for some reason the acknowledgments stop entirely.


00:18:01.500 This is a sign that either the receiver has failed,


00:18:04.966 And, you know, the packets are being

00:18:06.866 delivered to the receiver, but the application

00:18:08.733 has crashed, and there's nothing there to

00:18:11.000 receive the data, to reply.


00:18:13.700 Or it's an indication that the network

00:18:15.800 connection has failed, and the packets are

00:18:17.900 just not reaching the receiver.


00:18:19.500 So if TCP is sending data,

00:18:21.633 and it's not getting any acknowledgments back,

00:18:24.066 after a while it times out and

00:18:26.933 uses this as an indication that the

00:18:28.866 connection has failed.


00:18:32.300 Alternatively, it can be sending data,

00:18:35.666 and if some data is lost,

00:18:39.700 but the later segments arrive, then TCP

00:18:42.000 will start sending the duplicate acknowledgments.


00:18:45.166 Again, back to the example, we see

00:18:47.900 that packet eight is lost, packet nine

00:18:50.266 arrives, and the sequence number, the acknowledgement

00:18:53.366 number, comes back saying “I’m expecting sequence

00:18:55.266 number eight”.


00:18:56.966 And packet ten is sent and it

00:18:59.133 arrives, and it still says “I’m still

00:19:00.666 expecting packet with sequence number eight”,

00:19:03.200 and this just carries on.


00:19:05.700 And, eventually, TCP gets what's known as

00:19:08.333 a triple duplicate acknowledgement. It’s got the

00:19:11.833 original acknowledgement saying it's expecting packet eight,

00:19:14.933 and then three duplicates following that,

00:19:17.266 so four packets in total, all saying

00:19:19.433 “I’m still expecting packet eight”.


00:19:22.533 And what this indicates, is that data

00:19:24.900 is still arriving, but something's got lost.


00:19:28.266 It only generates acknowledgements when a new

00:19:30.800 packet arrives, so if we keep seeing

00:19:33.000 acknowledgments indicating the same thing, this indicates

00:19:35.933 that new packets are arriving, because that's what

00:19:38.200 triggers the acknowledgement to be sent,

00:19:40.866 but there's still a packet missing,

00:19:43.400 and it's telling us which one it's expecting.


00:19:46.866 At that point TCP assumes that the

00:19:49.400 packet has got lost, and retransmits that

00:19:51.566 segment. It retransmits the packet with sequence

00:19:54.833 number eight.
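
The sender side of this rule can be sketched as a small counter. This is a simplified model: a real TCP implementation applies further conditions before counting an acknowledgement as a duplicate, and the names `dupack_state`, `on_ack`, and `DUPACK_THRESHOLD` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DUPACK_THRESHOLD 3  /* the "triple" in triple duplicate ACK */

/* Sender-side state: the most recent acknowledgement number seen and
 * how many times it has been repeated since it first arrived. */
struct dupack_state {
    uint32_t last_ack;
    int      dup_count;
};

/* Process one incoming acknowledgement number. Returns true exactly
 * when the third duplicate arrives, meaning the segment starting at
 * ack should be retransmitted. */
bool on_ack(struct dupack_state *s, uint32_t ack)
{
    if (ack == s->last_ack) {
        s->dup_count++;
        return s->dup_count == DUPACK_THRESHOLD;
    }
    /* A new acknowledgement number: reset the duplicate counter. */
    s->last_ack  = ack;
    s->dup_count = 0;
    return false;
}
```

With the acknowledgements six, seven, eight, eight, eight, eight, the function returns true on the final one: the original acknowledgement for eight followed by three duplicates. The state should be initialised with the last acknowledgement number actually seen on the connection.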


00:19:59.233 Why does it wait for a triple duplicate acknowledgement?


00:20:03.466 Why does it not just retransmit it

00:20:06.033 immediately, when it sees a duplicate?


00:20:08.566 Well, the example we see here illustrates that.


00:20:13.466 In this case, a packet with sequence

00:20:15.733 number five is sent, containing one byte

00:20:17.866 of data, and it arrives, and the

00:20:19.866 receiver acknowledges it, saying it's expecting six.


00:20:23.400 And six is sent, and it arrives,

00:20:26.266 and the receiver acknowledges it, indicating it’s

00:20:28.333 expecting seven.


00:20:30.066 And packet seven is sent, and it's

00:20:32.866 delayed. And packet eight is sent,

00:20:35.566 and eventually arrives at the receiver.


00:20:38.233 Now the receiver hasn't received packet seven

00:20:41.100 yet, so it sends an acknowledgement which

00:20:43.500 says “I’m still expecting seven”. So that's

00:20:46.066 a duplicate acknowledgement.


00:20:48.200 At that point packet seven, which was

00:20:50.466 delayed, finally does arrive.


00:20:53.866 Now packet seven has arrived, packet eight

00:20:56.466 had arrived previously, so what it is now

00:20:58.600 expecting is nine, so it sends an

00:21:00.833 acknowledgement for nine.


00:21:02.866 And we see that the acknowledgments go

00:21:05.266 six, seven, seven, nine, because that packet

00:21:08.033 seven was delayed a little bit.


00:21:11.900 And if TCP reacts to a single

00:21:14.300 duplicate acknowledgement as an indication that the

00:21:17.166 packet was lost, then you run the

00:21:20.233 risk that you're resending a packet on

00:21:23.033 the assumption that it was lost,

00:21:24.933 when it was just merely delayed a little bit.


00:21:28.466 And there's a trade off you can make here.


00:21:31.733 Do you treat a single duplicate as

00:21:35.600 an indication of loss? Do you treat

00:21:38.066 two duplicates as an indication of loss?

00:21:40.366 Three? Four? Five? At what point do

00:21:42.900 you say “this is an indication of

00:21:44.300 loss”, rather than just “this is a

00:21:46.566 slightly delayed packet, and it might recover

00:21:49.133 itself in a minute”?


00:21:53.600 The reason that a triple duplicate is

00:21:55.933 used, is because someone did some measurements,

00:21:58.833 and decided that packets being delayed

00:22:01.800 enough to cause one or two duplicates,

00:22:04.500 because they arrived just a little bit

00:22:06.933 out of order, was relatively common.


00:22:09.133 But packets being delayed enough that they

00:22:11.800 cause three or more duplicates is rare.


00:22:14.500 So it's balancing-off speed of loss detection

00:22:17.766 vs. the likelihood that a merely delayed

00:22:20.466 packet is treated as if it were

00:22:22.600 lost, and retransmitted unnecessarily.


00:22:26.300 And, based on the statistics, the belief

00:22:29.500 by the designers of TCP was that

00:22:32.666 waiting for three duplicates was the right threshold.


00:22:36.233 And you could make a TCP version

00:22:38.900 that reduced this to two, or even

00:22:41.300 one duplicate, and it would respond to

00:22:43.666 loss faster, but would have the risk

00:22:45.666 that it's more likely to unnecessarily retransmit

00:22:47.966 something that's just delayed.


00:22:50.500 Or you could make it four,

00:22:52.433 five, six, even more duplicate acknowledgments,

00:22:55.700 which will be less likely to unnecessarily

00:22:57.900 retransmit data. But it’d be slower,

00:23:00.966 because it would be slower in responding

00:23:03.300 to loss, and slower in retransmitting actually lost packets.


00:23:12.766 The other behaviour of TCP, which is

00:23:16.033 worth noting, is head-of-line blocking.


00:23:19.566 Now, in this case we're sending something

00:23:21.866 more realistic. We're sending full size packets,

00:23:24.166 with 1500 bytes of data in each packet.


00:23:26.900 And 1500 is the maximum packet size

00:23:29.333 that you can send in an Ethernet

00:23:31.733 packet, or in a WiFi packet,

00:23:33.833 so this is a typical size that actually gets sent.


00:23:37.366 In this case, the first packet is

00:23:40.366 sent with sequence numbers in the range

00:23:42.966 zero through to 1499.


00:23:46.266 And this arrives at the receiver,

00:23:48.266 and the receiver sends an acknowledgement saying

00:23:50.500 it got it, and the next packet

00:23:52.300 it’s expecting has sequence number 1500.

00:23:55.666 So it sends an acknowledgement for 1500.


00:23:58.666 And if there’s a recv() call outstanding

00:24:01.033 on that socket, that recv() call will

00:24:03.400 return at that point, and return 1500

00:24:05.100 bytes of data. It returns the data

00:24:07.733 as it was received.


00:24:09.600 The next packet arrives at the receiver,

00:24:11.866 containing sequence numbers 1500 through to 2999,

00:24:16.800 and again the recv() call, if there

00:24:19.266 is one, will return, and return that

00:24:21.233 next 1500 bytes.


00:24:23.200 Similarly, when the packet containing the next

00:24:25.833 1500 comes in, the receiver will send

00:24:28.433 the ACK saying “I’m expecting 4500”,

00:24:30.533 and the recv() call will return.


00:24:33.733 The packet containing sequence numbers 4500 through

00:24:37.500 to 5999 is lost.


00:24:40.633 The packet containing 6000 through to 7499 arrives.


00:24:47.466 The acknowledgement goes back indicating that it’s

00:24:50.166 still expecting sequence number 4500, because that

00:24:53.166 packet got lost. And at that point,

00:24:56.233 some data has arrived, some new data

00:24:57.966 has arrived at the receiver.


00:24:59.600 But there's a gap. The packets,

00:25:02.566 the packet, containing data with sequence numbers

00:25:05.266 4500 through to 5999 is still missing.


00:25:08.833 So if the receiver application has called

00:25:12.933 recv() on that socket, it won't return.


00:25:16.366 The data has arrived, it's buffered up

00:25:18.833 in the TCP layer in the operating

00:25:20.800 system, but TCP won't give it back

00:25:22.400 to the application.


00:25:24.933 And the packets can keep being sent,

00:25:27.200 and the receiver keeps sending the duplicate

00:25:29.700 acknowledgments, and eventually it’s sent the triple

00:25:32.266 duplicate acknowledgement, and the TCP sender notices

00:25:35.700 and retransmits the packet with sequence numbers

00:25:38.366 4500 through to 5999.


00:25:41.833 And eventually those arrive at the receiver.


00:25:45.900 At that point, the receiver has a

00:25:48.966 contiguous block of data available, with no

00:25:51.133 gaps in it, and it returns all

00:25:54.100 of the data from sequence number 4500

00:25:57.000 up to sequence number 12,000,

00:26:00.533 up to the application in one go.


00:26:03.333 And if the application has given a

00:26:05.600 big enough buffer, at that point the

00:26:07.366 recv() call will return 7500 bytes of

00:26:09.766 data. It’ll return all of that received

00:26:12.666 data in one big burst.


00:26:18.033 And then, as the data, you know,

00:26:20.700 gets retransmitted, as the data arrives,

00:26:23.066 it will just keep, you know,

00:26:25.233 the recv() call will unblock and data

00:26:27.066 will start flowing.


00:26:29.133 The point is the TCP receiver waits

00:26:31.700 for any missing data to be delivered.


00:26:34.366 If anything's missing, the triple duplicate ACK

00:26:37.900 happens, it eventually gets retransmitted, and the

00:26:40.933 receiver won't return anything to the application

00:26:43.533 until that retransmission has happened.
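
The effect on the application can be shown with a toy model of the receiver's reassembly buffer. It assumes every segment carries exactly 1500 bytes and starts on a multiple of 1500, as in the example, and the names `reassembly`, `on_data`, `SEG_SIZE`, and `MAX_SEGS` are hypothetical. The function returns how many bytes become available to recv() when a segment arrives.

```c
#include <stdbool.h>
#include <stddef.h>

#define SEG_SIZE 1500   /* bytes of data per segment in this model */
#define MAX_SEGS 32     /* toy limit on the number of segments */

/* Toy reassembly buffer: which segments have arrived, and the index of
 * the next segment that can be released to the application. */
struct reassembly {
    size_t next_slot;
    bool   have[MAX_SEGS];
};

/* Record the arrival of the segment starting at sequence number seq.
 * Returns the number of bytes that become readable by the application:
 * zero while there is a gap before this segment in the sequence. */
size_t on_data(struct reassembly *r, size_t seq)
{
    size_t slot     = seq / SEG_SIZE;
    size_t released = 0;

    if (slot < MAX_SEGS) {
        r->have[slot] = true;
    }
    /* Release every segment that is now contiguous with delivered data. */
    while (r->next_slot < MAX_SEGS && r->have[r->next_slot]) {
        r->next_slot++;
        released += SEG_SIZE;
    }
    return released;
}
```

Feeding in the arrivals from the example (0, 1500, 3000, then 6000, 7500, 9000, 10500, and finally the retransmitted 4500) releases 1500, 1500, 1500, then nothing four times, and then 7500 bytes at once.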


00:26:48.200 It’s called head of line blocking.

00:26:50.066 The data stops being delivered, until it

00:26:52.433 can be delivered in sequence to the

00:26:54.466 application. It’s all just buffered up in

00:26:56.633 the operating system, in the TCP code.


00:26:58.933 TCP always gives the data to the

00:27:01.100 application in a contiguous ordered sequence,

00:27:03.000 in the order it was sent.


00:27:04.933 And this is another reason why the

00:27:06.700 recv() calls don't always preserve the message boundaries.


00:27:09.600 Because it depends how much data was

00:27:11.700 queued up because of packet losses,

00:27:13.466 and so on, so that it can

00:27:15.266 always be delivered in order.


00:27:19.266 The head of line blocking increases the

00:27:21.900 total download time. We see on the

00:27:24.500 left, the case where one packet was

00:27:27.133 lost, and had to be re-transmitted.


00:27:29.500 And we see on the right,

00:27:31.033 the case where all the packets were

00:27:32.866 received on time. And we see an

00:27:34.666 increase in the download time because of

00:27:36.466 the packet loss.


00:27:40.733 It blocks the receiving, it delays things

00:27:43.700 a little bit, waiting for the retransmission.


00:27:46.533 And it increases the overall download time

00:27:50.500 a little bit.


00:27:52.366 It disrupts the behaviour of when the

00:27:54.966 packets are received, during the download quite

00:27:57.333 significantly. We see 1500, 1500, 1500,

00:28:02.400 big gap, 7500,

00:28:04.666 1500, 1500,


00:28:07.666 in the case where the packets were

00:28:09.300 lost. Or, in the case where they

00:28:10.733 were all received, the data is coming

00:28:12.533 in quite smoothly. It's regularly spaced.


00:28:14.966 So it affects the timing, it affects

00:28:17.133 when the data is delivered to the

00:28:18.733 application, and it has a smaller effect

00:28:20.300 on the overall download times.


00:28:28.633 And if you're building real time applications,

00:28:32.000 this is a significant problem. We see

00:28:34.833 the case on the right, if everything

00:28:36.866 is delivered on time, then the data

00:28:39.566 is released to the application very quickly

00:28:41.800 and very predictably.


00:28:43.566 And you don't need

00:28:47.333 much buffering delay at the receiver.

00:28:49.600 Things can be just delivered, things are

00:28:51.600 just delivered to the application, repeatedly on

00:28:53.600 a regular schedule.


00:28:55.033 But the minute something gets lost,

00:28:57.233 it has to wait for the retransmission.


00:28:59.333 In this case it waits for one

00:29:00.966 round trip time, because the ACK has

00:29:02.866 to get back, and then the data has to be retransmitted.


00:29:05.200 Plus, it has to wait for four

00:29:07.100 times the gap between packets, to allow

00:29:09.500 for the four duplicates, the triple duplicate

00:29:12.066 ACK and the original ACK, so you

00:29:14.366 get one round trip time plus four

00:29:16.500 times the packet spacing.


00:29:18.066 So if you're using TCP to send,

00:29:20.266 for example, speech data, where it's sending

00:29:22.400 packets regularly every 20 milliseconds, you need

00:29:25.133 to buffer 80 milliseconds plus the round

00:29:27.666 trip time, to allow for these re-transmissions,

00:29:30.766 if you're using it for a real time application.
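
The estimate above is one round trip time plus four times the packet spacing. As a small worked sketch, assuming loss is detected by a triple duplicate ACK and that the first retransmission arrives successfully (the function name and the 50 ms round trip time used below are hypothetical):

```c
/* Approximate buffering delay, in milliseconds, needed to hide one
 * retransmission: four packet intervals for the original ACK plus
 * three duplicates, then one round trip for the retransmission. */
unsigned playout_buffer_ms(unsigned rtt_ms, unsigned packet_spacing_ms)
{
    return rtt_ms + 4u * packet_spacing_ms;
}
```

For speech packets sent every 20 milliseconds over a path with a 50 millisecond round trip time, `playout_buffer_ms(50, 20)` gives 130 milliseconds.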


00:29:33.766 Because, it waits for the retransmissions, and because

00:29:38.433 of the head of line blocking.


00:29:41.133 And when you're using applications like Netflix

00:29:44.933 or the iPlayer, when you press play on the video

00:29:47.433 there’s a little pause where it says “buffering”.


00:29:49.700 This is what it's doing. It’s buffering

00:29:51.766 up enough data that it can wait

00:29:54.933 for the retransmissions to happen,

00:29:57.666 buffering up enough data in the TCP

00:29:59.866 connection that it can keep playing out

00:30:01.633 the video frames, in order, while still

00:30:04.633 allowing time for a retransmission to happen.


00:30:07.100 So it's buffering up the data waiting,

00:30:09.533 making sure there's enough data buffered up,

00:30:12.766 because of this head of line blocking

00:30:14.366 issue in TCP.


00:30:20.300 So that concludes the discussion of TCP.


00:30:23.700 It gives you an ordered, reliable, byte stream.


00:30:28.233 As a service model it's easy to

00:30:30.433 understand. It’s like reading from a file;

00:30:33.133 you read from the connection and the

00:30:35.733 bytes arrive reliably and in the order they were sent.


00:30:39.733 The timing, though, is unpredictable. How much

00:30:43.566 you get from the connection each time you read from it,

00:30:46.433 and whether the data arrives regularly,

00:30:48.800 or whether it arrives in big bursts

00:30:50.700 with large gaps between them, depends on

00:30:53.100 how much data is lost, and depends

00:30:55.233 on whether the TCP has to retransmit missing data.


00:30:59.066 And if you're just using this to

00:31:00.633 download files that doesn't matter. It means

00:31:03.700 that the progress bar is perhaps inaccurate,

00:31:05.866 but otherwise it doesn't make much difference.


00:31:08.466 But, if you're using it for real

00:31:10.066 time applications, like video streaming, like telephony,

00:31:14.066 this head of line blocking can quite

00:31:15.866 significantly affect the play out.


00:31:18.966 And a lot of that is the

00:31:20.500 reason why applications use, why real time

00:31:23.366 applications use, UDP. And for those that

00:31:26.233 don't use UDP,

00:31:27.700 applications like Netflix that use adaptive streaming

00:31:32.633 over HTTP, which we'll talk about in

00:31:35.166 lecture seven, that's why there’s this buffering

00:31:37.466 delay before they start playing.


00:31:40.966 And, of course, the lack of framing

00:31:42.700 complicates the application design, you have to

00:31:44.900 parse the data to make sure you've got all the data;

00:31:47.166 there are no message boundaries in there,

00:31:50.033 so you have to parse the data.

00:31:51.966 It doesn't tell you, the connection doesn't

00:31:53.700 tell you, when you've received all the data.


00:31:57.433 So that's it for TCP.


00:32:00.433 It delivers data reliably. It uses sequence

00:32:03.533 numbers and acknowledgments to indicate when the

00:32:06.133 data arrived.


00:32:07.633 It uses timeouts to indicate that a

00:32:09.733 connection has failed. And it uses this

00:32:12.433 idea of triple duplicate ACKs to indicate

00:32:14.866 that a packet has been lost,

00:32:16.300 and trigger a retransmission of any lost data.


00:32:19.833 What I’ll talk about in the next

00:32:21.333 part is QUIC and how it differs

00:32:23.266 from the way TCP handles reliability.

Part 4: Reliable Data Transfer with QUIC

The final part of the lecture discusses reliable data transfer using QUIC. It outlines the QUIC service model, and how it differs from that of TCP, and shows how QUIC achieves reliable data transfer. It discusses how QUIC provides multiple streams within a single connection, and considers how this affects head-of-line blocking and latency. Approaches to making best use of multiple streams are discussed.

Slides for part 4


00:00:00.100 In this final part I’d like to

00:00:02.533 talk about how reliable data transfer works

00:00:04.633 with QUIC, and how it's different to

00:00:07.100 reliable data transfer with TCP.


00:00:09.533 I’ll talk a little bit about the

00:00:11.733 QUIC service model, and how it handles

00:00:13.966 packet numbers and retransmission. I’ll talk about

00:00:16.166 the multi-streaming features of QUIC. And I’ll

00:00:19.133 talk about how it avoids head-of-line blocking.


00:00:23.333 The service model for TCP, as we

00:00:26.533 saw previously, is that it delivers a

00:00:29.100 single reliable, ordered, byte stream of data.


00:00:32.700 Applications write a stream of bytes in,

00:00:34.933 and that stream of bytes is delivered

00:00:37.033 to the receiver, eventually.


00:00:39.166 QUIC, by contrast, delivers several ordered reliable

00:00:42.200 byte streams within a single connection.


00:00:45.166 Applications can separate the data they're sending

00:00:47.933 into different streams, and each stream is

00:00:49.966 delivered reliably and in order.


00:00:52.066 QUIC doesn't preserve the ordering between the

00:00:54.666 streams within a connection, so if you

00:00:57.266 send in one stream, and then send

00:00:59.866 in a second stream, then the data

00:01:02.500 you sent second, in that second stream,

00:01:04.700 may arrive first, but it preserves the

00:01:06.666 ordering within a stream.


00:01:09.300 And you can treat each stream as

00:01:11.833 if it were running multiple TCP connections

00:01:15.366 in parallel, so it gives you the

00:01:17.100 same service model with several streams of

00:01:19.433 data, or you could perhaps treat each stream as a

00:01:22.900 sequence of messages to be sent,

00:01:25.600 with the streams indicating message boundaries.


00:01:30.366 QUIC delivers data in packets.


00:01:33.466 Each QUIC packet has a packet sequence

00:01:36.366 number, a packet number,

00:01:38.266 and the packet numbers

00:01:41.333 are split into two packet number spaces.


00:01:44.666 The packets sent during the initial QUIC

00:01:48.033 handshake start with packet sequence number zero,

00:01:50.900 and that packet sequence number increases by

00:01:53.033 one for each packet sent during the handshake.


00:01:56.066 Then, when the handshake’s complete, and it

00:01:58.800 switches to sending data, it resets the

00:02:01.666 packet sequence number to zero and starts again.


00:02:05.166 Within each of these packet number spaces,

00:02:07.666 the handshake space, and the data space,

00:02:10.833 the packet number sequence starts at zero,

00:02:13.400 and goes up by one for every packet sent.


00:02:16.566 That is, the sequence numbers in QUIC,

00:02:18.733 the packet numbers in QUIC, count the

00:02:20.966 number of packets of data being sent.


00:02:23.233 That's different to TCP. In TCP,

00:02:25.400 the sequence number in the header counts

00:02:27.966 the offset within the byte stream,

00:02:30.400 it counts how many bytes of data

00:02:32.166 have been sent. Whereas in QUIC,

00:02:34.300 the packet numbers count the number of packets.
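
As a rough sketch of the difference, the following models the two packet number spaces as described in the lecture, alongside TCP-style byte counting. The class and function names are hypothetical, the QUIC specification itself separates the handshake packets further than this simplified two-space view, and a real TCP connection starts from a negotiated initial sequence number rather than zero.

```python
class PacketNumberSpace:
    """Hands out packet numbers 0, 1, 2, ... with one number per packet sent."""

    def __init__(self):
        self.next_number = 0

    def allocate(self) -> int:
        number = self.next_number
        self.next_number += 1
        return number


def tcp_sequence_numbers(segment_sizes, initial=0):
    """Return the sequence number of each segment: a byte offset, not a count."""
    numbers = []
    sequence = initial
    for size in segment_sizes:
        numbers.append(sequence)
        sequence += size
    return numbers


# Each space counts independently, so the data space starts again at zero.
handshake_space = PacketNumberSpace()
data_space = PacketNumberSpace()
handshake_numbers = [handshake_space.allocate() for _ in range(3)]
data_numbers = [data_space.allocate() for _ in range(4)]

# Three TCP segments of 1000, 1000 and 500 bytes advance by bytes sent.
tcp_numbers = tcp_sequence_numbers([1000, 1000, 500])
```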


00:02:38.033 Inside a QUIC packet is a sequence

00:02:40.833 of frames. Some of those frames may

00:02:43.100 be stream frames, and stream frames carry data.


00:02:46.600 Each stream frame has a stream ID,

00:02:50.066 so it knows which of the many sub-streams

00:02:52.200 it’s carrying data for, and it

00:02:53.766 also has the amount of data being carried,

00:02:57.033 and the offset of that data from the start of the stream.


00:02:59.866 So, essentially the stream contains sequence numbers

00:03:03.833 which play the same role as TCP

00:03:05.400 sequence numbers, in that they count bytes

00:03:07.366 of data being sent in that stream.


00:03:09.500 And the packets have sequence numbers that

00:03:11.766 count the number of packets being sent.


00:03:14.533 And we can see this in the

00:03:16.533 diagram on the right, where we see

00:03:18.366 the packet numbers going up, zero,

00:03:20.466 one, two, three, four. And the stream

00:03:22.433 numbers, packet zero carries data from the

00:03:24.566 first stream, bytes zero through 1000.


00:03:27.733 Packet one carries data from the first

00:03:29.733 stream, bytes 1001 to 2000. And packet

00:03:32.700 two carries bytes 2001 to 2500

00:03:36.833 from the first stream, and zero to

00:03:38.866 500 from the second stream, and so on.


00:03:41.566 And we see that we can send

00:03:44.333 data on multiple streams in a single packet.
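
The relationship between packets and stream frames might be modelled as below. This is an illustrative data structure, not the QUIC wire format: the field names are assumptions, and the byte ranges are simplified to an offset plus a length rather than mirroring the slide's numbers exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamFrame:
    stream_id: int  # which stream the data belongs to
    offset: int     # position of the data from the start of that stream
    length: int     # amount of data carried


@dataclass(frozen=True)
class Packet:
    packet_number: int  # counts packets, not bytes
    frames: tuple       # one packet may hold frames from several streams


def bytes_carried(packets):
    """Total the bytes carried for each stream across a list of packets."""
    totals = {}
    for packet in packets:
        for stream_frame in packet.frames:
            totals[stream_frame.stream_id] = (
                totals.get(stream_frame.stream_id, 0) + stream_frame.length
            )
    return totals


packets = [
    Packet(0, (StreamFrame(1, 0, 1000),)),
    Packet(1, (StreamFrame(1, 1000, 1000),)),
    # Packet 2 carries the end of stream 1 and the start of stream 2.
    Packet(2, (StreamFrame(1, 2000, 500), StreamFrame(2, 0, 500))),
]
totals = bytes_carried(packets)
```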


00:03:50.400 QUIC doesn't preserve message boundaries within the

00:03:53.200 streams. In the same way that,

00:03:56.000 within a TCP stream, if you write

00:03:59.300 data to the stream and the amount you write is too big

00:04:02.300 to fit into a packet, it may

00:04:04.666 be arbitrarily split between packets.


00:04:06.900 Or if the data you send in a TCP Stream is too small,

00:04:09.566 and doesn't fill a whole packet,

00:04:11.500 it may be delayed waiting for more

00:04:13.433 data, to be able to fill up

00:04:15.033 the packet before it’s sent.


00:04:16.666 The same thing happens with QUIC.


00:04:18.633 If the amount of data you write to a stream is too big to

00:04:21.500 fit into a QUIC packet, then it

00:04:23.366 will be split across multiple packets.


00:04:26.166 Similarly, if the amount of data you

00:04:27.866 write to a stream is very small,

00:04:29.633 QUIC may buffer it up, delay it,

00:04:31.766 wait for more data, so it can

00:04:33.366 send it and fill a complete packet.


00:04:36.666 In addition, QUIC can take data from

00:04:39.466 more than one stream, and send it

00:04:41.300 in a single packet, if there’s space to do so.


00:04:44.566 And if there's more than one stream

00:04:46.833 with data that's available to send,

00:04:48.766 then the QUIC sender can make an

00:04:51.033 arbitrary decision, how it prioritises that data,

00:04:53.300 and how it delivers frames from each stream.


00:04:56.033 And usually it will split those,

00:04:59.200 the data from the streams, so each

00:05:01.700 packet has data from, half the data from, one stream,

00:05:05.200 and half from another stream. But it

00:05:07.400 may alternate them if it wants,

00:05:08.833 sending one packet with data from stream

00:05:10.966 1, one from stream 2, one from

00:05:12.600 stream 1, one from stream 2, and so on.
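
The two strategies mentioned here, mixing streams within each packet or alternating between them, can be sketched as follows. The `schedule` function and its parameters are hypothetical, since the choice is left to the sender, and the sketch assumes the packet size is at least the number of active streams.

```python
def schedule(pending, packet_size, mix):
    """Plan packets as lists of (stream_id, byte_count) pairs.

    pending maps each stream ID to the number of bytes waiting to be sent.
    With mix=True every packet shares its space between the active streams;
    with mix=False each packet carries data from a single stream, in rotation.
    """
    remaining = dict(pending)
    packets = []
    turn = 0
    while any(remaining.values()):
        active = [s for s in sorted(remaining) if remaining[s] > 0]
        if mix:
            share = packet_size // len(active)
            packet = [(s, min(share, remaining[s])) for s in active]
        else:
            stream_id = active[turn % len(active)]
            packet = [(stream_id, min(packet_size, remaining[stream_id]))]
            turn += 1
        for stream_id, count in packet:
            remaining[stream_id] -= count
        packets.append(packet)
    return packets


mixed = schedule({1: 2000, 2: 2000}, packet_size=1000, mix=True)
alternating = schedule({1: 2000, 2: 2000}, packet_size=1000, mix=False)
```

With mixing, losing any one packet affects both streams; with alternation, a single loss affects only one of them.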


00:05:17.966 On the receiving side, the receiver sends,

00:05:20.566 the QUIC receiver sends acknowledgments for the

00:05:22.766 packets it receives.


00:05:24.166 So, unlike TCP which acknowledges the next

00:05:27.000 expected sequence number, a QUIC receiver just

00:05:29.566 sends an acknowledgement to say “I got this packet”.


00:05:33.500 So when packet zero arrives, it sends

00:05:35.866 an acknowledgement saying “I got packet zero”.

00:05:38.066 And when packet one arrives, it sends

00:05:39.900 an acknowledgement saying “I got packet one”, and so on.


00:05:43.566 The sender needs to remember what data

00:05:46.200 it puts in each packet, so it

00:05:47.800 knows when it gets an acknowledgement for packet two that,

00:05:51.033 in this case, it contained bytes 2001

00:05:54.800 to 2500 from stream one, and bytes

00:05:57.700 zero through 500 from stream two.


00:06:00.233 That information isn't in the acknowledgments.

00:06:02.766 What's in the acknowledgments is just the

00:06:04.500 packet numbers, so the sender needs to

00:06:06.466 keep track of how it puts the

00:06:08.466 data from the streams into the packets.
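
One way a sender might keep this bookkeeping is sketched below. The `SentPacketTracker` class is an assumption made for illustration: it records what each packet contained so that an acknowledgement, which names only a packet number, can be mapped back to stream data.

```python
class SentPacketTracker:
    """Remembers which stream byte ranges were placed in each packet."""

    def __init__(self):
        self.in_flight = {}     # packet number -> [(stream_id, offset, length)]
        self.acked_ranges = {}  # stream_id -> [(offset, length)]

    def on_packet_sent(self, packet_number, contents):
        self.in_flight[packet_number] = list(contents)

    def on_ack(self, packet_number):
        # The ACK carries only the packet number; the contents are looked up here.
        for stream_id, offset, length in self.in_flight.pop(packet_number, []):
            self.acked_ranges.setdefault(stream_id, []).append((offset, length))


tracker = SentPacketTracker()
tracker.on_packet_sent(2, [(1, 2000, 500), (2, 0, 500)])
tracker.on_ack(2)
```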


00:06:12.366 The acknowledgments in QUIC are also a

00:06:15.133 bit more sophisticated than they are in

00:06:17.900 TCP, in that it doesn't just have

00:06:20.666 an acknowledgement number field in the header.


00:06:23.533 Rather, it sends the acknowledgments as frames

00:06:26.566 in the packets coming back.


00:06:28.833 And this gives a lot more flexibility, because

00:06:32.533 it can have a fairly sophisticated frame

00:06:35.700 format, and it can change the frame

00:06:37.400 format to include different, to support different

00:06:41.266 ways of sending a header, if it needs to.


00:06:45.233 In the initial version of QUIC,

00:06:47.133 what's in the frame format, in the

00:06:49.666 ACK frames coming back from the receiver to the sender,

00:06:53.266 is a field indicating the largest acknowledgement,

00:06:56.633 which is essentially the same as the

00:06:59.433 TCP acknowledgment – it tells you what's

00:07:02.866 the highest sequence number received.


00:07:06.166 There's an ACK delay field, that tells

00:07:08.933 you how long the receiver waited, after receiving that packet,

00:07:11.633 before sending the acknowledgement.


00:07:15.000 So this is the delay in the

00:07:16.866 receiver. And by measuring the time it

00:07:20.100 takes for the acknowledgment to come back,

00:07:22.100 and removing this ACK delay field,

00:07:24.966 you can estimate the network round trip

00:07:27.366 time excluding the processing delays in the receiver.
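
The arithmetic is simple enough to show directly. This sketch assumes all times are in milliseconds and uses a hypothetical `estimate_rtt` function; real implementations apply further smoothing and sanity checks that are omitted here.

```python
def estimate_rtt(send_time, ack_receive_time, ack_delay):
    """Estimate the network round trip time for one acknowledged packet.

    The measured interval includes the time the receiver held the packet
    before acknowledging it, so the reported ACK delay is subtracted.
    """
    measured = ack_receive_time - send_time
    return measured - ack_delay


# Packet sent at t=1000 ms, ACK received at t=1062 ms, receiver reported
# holding the acknowledgement for 12 ms.
rtt = estimate_rtt(send_time=1000, ack_receive_time=1062, ack_delay=12)
```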


00:07:31.466 There’s a list of ACK ranges.


00:07:35.300 And the ACK ranges are a way

00:07:37.100 of the receiver saying “I got a range of packets”.


00:07:40.366 So you can send an acknowledgement that

00:07:42.233 says, I got packets from five through seven

00:07:44.266 in a single go. And you can

00:07:46.800 split this up, with multiple ACK ranges.

00:07:48.833 So you could have an acknowledgement that

00:07:50.766 says “I got packet five; I got packets

00:07:53.466 seven through nine; and I got packets

00:07:55.433 11 through 15” and you can send

00:07:57.533 that all within a single acknowledgement block,

00:07:59.566 in an ACK frame, within the reverse path stream.


00:08:03.433 And this gives it more flexibility,

00:08:05.433 so it doesn't just have to acknowledge

00:08:07.833 the most recently received packet, which gives

00:08:11.200 the sender more information to make retransmissions.


00:08:14.466 This is a bit like the TCP

00:08:16.666 selective acknowledgement extension.
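
The idea of ACK ranges can be sketched by collapsing the set of received packet numbers into contiguous runs. The function name is illustrative, and the result is a plain list of (first, last) pairs rather than the encoding used in a real ACK frame.

```python
def to_ack_ranges(received):
    """Collapse received packet numbers into inclusive (first, last) ranges."""
    ranges = []
    for packet_number in sorted(set(received)):
        if ranges and packet_number == ranges[-1][1] + 1:
            # Extends the current run of consecutive packet numbers.
            ranges[-1] = (ranges[-1][0], packet_number)
        else:
            ranges.append((packet_number, packet_number))
    return ranges


# The example from the lecture: packet 5, packets 7 to 9, and packets 11 to 15.
ack_ranges = to_ack_ranges([5, 7, 8, 9, 11, 12, 13, 14, 15])
largest_acknowledged = ack_ranges[-1][1]
```

The gaps between the ranges, here packets 6 and 10, tell the sender which packets have not yet been seen.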


00:08:21.766 Like TCP, QUIC will retransmit lost data.


00:08:26.000 The difference is that TCP retransmits packets,

00:08:30.700 exactly as they would be originally sent,

00:08:33.400 so the retransmission looks just the same

00:08:35.466 as the original packet.


00:08:37.633 QUIC never retransmits packets.

00:08:40.500 Each packet in QUIC has a unique packet sequence number,

00:08:45.166 and each packet is only ever transmitted once.


00:08:48.366 What QUIC rather does, is it retransmits

00:08:51.000 the data which was in those packets

00:08:53.233 in a new packet.


00:08:55.533 So in this example, we see that

00:08:57.600 packet, on the slide, we see that

00:08:59.900 packet number two got lost, and it

00:09:01.633 contained the data bytes 2001 to 2500

00:09:06.033 from stream one, and bytes zero through 500 from stream two.


00:09:10.333 And, when it gets the acknowledgments indicating

00:09:12.933 that packet was lost, it resends that data.


00:09:16.233 And in this case it's sending in

00:09:18.733 packet six, it’s resending the first bytes

00:09:21.766 of data from stream, it’s sending the

00:09:25.333 bytes 2001 to 2500 from stream one,

00:09:28.533 and it will eventually, at some point

00:09:30.533 later, retransmit the data from stream two.
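
A minimal sketch of this behaviour follows, assuming a hypothetical `Sender` class that resends each lost frame in its own new packet. Frames are written as (stream ID, offset, length) tuples, and the packet numbers follow the example on the slide.

```python
class Sender:
    """Sends frames in packets; packet numbers are never reused."""

    def __init__(self):
        self.next_packet_number = 0
        self.sent = {}  # packet number -> frames carried in that packet

    def send(self, frames):
        packet_number = self.next_packet_number
        self.next_packet_number += 1
        self.sent[packet_number] = list(frames)
        return packet_number

    def retransmit(self, lost_packet_number):
        """Resend the data from a lost packet, each frame in a new packet."""
        frames = self.sent.pop(lost_packet_number, [])
        return [self.send([lost_frame]) for lost_frame in frames]


sender = Sender()
for frames in ([(1, 0, 1000)], [(1, 1000, 1000)],
               [(1, 2000, 500), (2, 0, 500)],  # packet 2, which is lost
               [(2, 500, 1000)], [(2, 1500, 1000)], [(2, 2500, 1000)]):
    sender.send(frames)

new_packet_numbers = sender.retransmit(2)
```

The data from lost packet 2 reappears in packets 6 and 7; packet number 2 itself is never sent again.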


00:09:36.700 As we say, each packet has a

00:09:38.466 unique packet sequence number. Since we're not,

00:09:41.700 since each packet is acknowledged as it

00:09:43.666 arrives, and it's not acknowledging the highest,

00:09:46.666 not acknowledging the next sequence number expected

00:09:49.400 in the same way TCP does,

00:09:51.833 you can’t do the triple duplicate ACK

00:09:53.700 in the same way, because you don't

00:09:55.933 get duplicate ACKs. Each ACK acknowledges the

00:09:58.266 next new packet.


00:09:59.666 Rather QUIC declares a packet to be

00:10:02.333 lost when it's got ACKs for three

00:10:05.033 packets with higher packet numbers than the

00:10:07.500 one which it sent.


00:10:09.333 At that point, it can retransmit the

00:10:11.333 data that was in that packet.


00:10:13.366 And that’s QUIC’s equivalent to the triple

00:10:15.633 duplicate ACK; it's three following sequence numbers

00:10:18.600 rather than three duplicate sequence numbers.


00:10:20.766 And also, just like TCP, if there's

00:10:22.666 a timeout, and it stops getting ACKs,

00:10:24.533 then it declares the packets to be lost.
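
The packet-number-based rule can be sketched as below, following the description in the lecture: a packet is treated as lost once three packets with higher numbers have been acknowledged. The names are illustrative, and timeout-based detection is left out.

```python
PACKET_THRESHOLD = 3  # ACKs for later packets needed before declaring a loss


def detect_lost(sent, acked):
    """Return the unacknowledged packet numbers that should be declared lost."""
    lost = []
    for packet_number in sorted(sent):
        if packet_number in acked:
            continue
        later_acked = sum(1 for a in acked if a > packet_number)
        if later_acked >= PACKET_THRESHOLD:
            lost.append(packet_number)
    return lost


sent = set(range(7))  # packets 0 to 6 have been sent

# Only two later packets acknowledged: too early to declare packet 2 lost.
lost_early = detect_lost(sent, acked={0, 1, 3, 4})

# A third later packet is acknowledged: packet 2 is declared lost.
# Packet 6 is still outstanding, but nothing after it has been acknowledged.
lost_now = detect_lost(sent, acked={0, 1, 3, 4, 5})
```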


00:10:31.366 QUIC delivers multiple streams within a single

00:10:35.500 connection. And within each stream, the data

00:10:39.433 is delivered reliably, and in the order it was sent.


00:10:43.466 If a packet’s lost, then that clearly

00:10:46.100 causes data for the stream, streams,

00:10:48.533 where the data was included in that packet to be lost.


00:10:52.600 Whether a packet loss affects one,

00:10:55.600 or more, streams really depends on how

00:10:57.400 the sender chooses to put the data

00:10:59.266 from different streams into the packets.


00:11:02.300 It’s possible that a QUIC packet can

00:11:04.700 contain data from several streams. We saw

00:11:08.333 in the examples, how the packets contain

00:11:10.700 data from both stream one and stream two simultaneously.


00:11:13.566 In that case, if a packet is

00:11:15.833 lost, it will affect both of the

00:11:18.500 streams, all of the streams if there’s

00:11:20.333 data from more than two streams in the packet.


00:11:23.333 Equally, a QUIC sender can choose to

00:11:27.133 alternate, and send one packet with data

00:11:29.933 from stream one, and then another packet

00:11:32.066 with data from stream two, and only

00:11:34.266 ever put data from a single stream in each packet.


00:11:37.400 The specification puts no requirements on how

00:11:40.433 the sender does this, and different senders

00:11:42.766 can choose to do it differently depending

00:11:47.233 on whether they're trying to make progress on

00:11:50.000 each stream simultaneously, or whether they want to


00:11:54.000 alternate, and make sure

00:11:57.200 that packet loss only ever affects a single stream.


00:12:01.266 Depending on how they do this,

00:12:03.300 the streams can suffer from head of

00:12:05.366 line blocking independently.


00:12:07.500 If data is lost on a particular

00:12:09.800 stream, then that stream can't deliver later

00:12:14.866 data to the application, until that

00:12:18.033 lost data has been retransmitted. But the

00:12:21.500 other streams, if they've got all the

00:12:23.533 data, can keep delivering to the application.


00:12:26.100 So streams suffer from head of line

00:12:28.300 blocking individually, but there's no head of

00:12:30.133 line blocking between streams.


00:12:32.600 This means that the data is delivered

00:12:35.466 reliably, and in order, on a stream,

00:12:37.866 but order’s not preserved between streams.


00:12:42.266 It’s quite possible that one stream can

00:12:45.033 be blocked, waiting for a retransmission of

00:12:47.000 some of the data in the packets,

00:12:48.800 while the other streams are continuing to

00:12:50.900 deliver data and haven't seen any loss

00:12:52.833 on that stream.


00:12:54.700 Each stream is sent and received independently.


00:12:57.866 And this means if you're careful with how you split data

00:13:00.800 across streams, and if the implementation is

00:13:04.300 careful with how it puts data from

00:13:05.900 streams into different packets, it can limit

00:13:08.233 the duration of the head of line

00:13:09.600 blocking, and make the streams independent in

00:13:11.766 terms of head of line blocking and data delivery.
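
The independence of streams can be illustrated with a small receiver-side sketch. It assumes a hypothetical `StreamReceiver` per stream and frames that never overlap; each stream releases data to the application only when it is contiguous from that stream's own next offset.

```python
class StreamReceiver:
    """Reassembles one stream, delivering data strictly in order."""

    def __init__(self):
        self.pending = {}     # offset -> data waiting for earlier bytes
        self.next_offset = 0  # next byte the application is waiting for
        self.delivered = b""

    def on_frame(self, offset, data):
        self.pending[offset] = data
        # Deliver as much contiguous data as is now available.
        while self.next_offset in self.pending:
            chunk = self.pending.pop(self.next_offset)
            self.delivered += chunk
            self.next_offset += len(chunk)


streams = {1: StreamReceiver(), 2: StreamReceiver()}

# The packet carrying stream 1, offset 0 is lost; later data still arrives.
streams[1].on_frame(4, b"BBBB")
streams[2].on_frame(0, b"cccc")
streams[2].on_frame(4, b"dddd")
stream1_while_blocked = streams[1].delivered
stream2_while_blocked = streams[2].delivered

# The retransmission arrives and stream 1 catches up.
streams[1].on_frame(0, b"AAAA")
stream1_after_retransmit = streams[1].delivered
```

Stream 1 delivers nothing while it waits for the missing bytes, but stream 2 is unaffected because it has its own offsets and its own buffer.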


00:13:18.566 QUIC delivers, as we've seen, several ordered,

00:13:21.900 reliable, byte streams of data in a single connection.


00:13:27.333 How you treat these different byte streams,

00:13:30.000 is, I think, still a matter of interpretation.


00:13:33.600 It's possible to treat a QUIC connection

00:13:36.266 as though it was several parallel TCP connections.


00:13:40.333 So, rather than opening multiple TCP connections

00:13:42.700 to a server, you open one QUIC

00:13:45.100 connection, and you send and receive several

00:13:47.500 streams of data within that.


00:13:49.300 And then you treat each stream of

00:13:51.266 data as-if it were a TCP stream,

00:13:54.466 and you parse and process the data

00:13:56.800 as if it were a TCP stream.

00:13:58.500 And you possibly send multiple requests,

00:14:00.366 and get multiple responses, over each stream.


00:14:04.066 Or, you can treat the streams more as a framing device.


00:14:07.766 You can say that each stream,

00:14:10.300 you can choose to interpret each stream,

00:14:12.433 as sending a single object. And then,

00:14:15.466 when you send data from the stream,

00:14:17.000 on that stream, once you finish sending

00:14:18.833 that object, you close the stream and

00:14:20.933 move on to use the next one.


00:14:23.266 And, on the receiving side, you just

00:14:25.366 read all the data until you see

00:14:27.500 the end of stream marker, and then

00:14:30.200 you process it knowing you’ve got a complete object.
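
Using streams as a framing device might look like the following sketch. It assumes the data for each stream has already been put in order, and it models the end of stream marker as a boolean flag on each frame; the `collect_objects` function is hypothetical.

```python
def collect_objects(frames):
    """Gather one object per stream from (stream_id, data, end_of_stream) frames.

    Returns the completed objects and any streams that are still open.
    """
    partial = {}
    complete = {}
    for stream_id, data, end_of_stream in frames:
        partial.setdefault(stream_id, bytearray()).extend(data)
        if end_of_stream:
            # The whole object has arrived, so it is safe to process it.
            complete[stream_id] = bytes(partial.pop(stream_id))
    return complete, partial


frames = [
    (1, b'{"id": 1, ', False),
    (2, b"<html>", False),
    (1, b'"ok": true}', True),  # stream 1 is closed: its object is complete
    (2, b"<body>", False),      # stream 2 is still open
]
complete, partial = collect_objects(frames)
```

No length prefix or parsing is needed to find the object boundary, in contrast to a single TCP byte stream.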


00:14:34.066 And I think that the best practices,

00:14:36.666 the way of thinking about a QUIC connection,

00:14:39.966 and the streams within a connection, is still evolving.


00:14:42.500 And it's not clear which of these

00:14:44.133 two approaches is necessarily the right

00:14:46.433 way to do it. And I think

00:14:48.033 it probably depends on the application what

00:14:49.766 makes the most sense.


00:14:53.966 So, to conclude for this lecture.


00:14:57.366 We spoke a little bit about best

00:14:59.566 effort packet delivery on the Internet,

00:15:01.300 and why the IP layer delivers data

00:15:04.933 unreliably, and why it's appropriate to have

00:15:09.200 a best effort network.


00:15:11.200 Then we spoke a bit about the different transports.


00:15:14.266 The UDP transport that provides an unreliable,

00:15:17.500 but timely, service on which you can

00:15:20.433 build more sophisticated user space application protocols.


00:15:25.166 We spoke about TCP, that provides a

00:15:27.966 reliable ordered stream delivery service. And we

00:15:30.800 spoke about QUIC, that provides a reliable

00:15:33.600 ordered delivery service with multiple streams of

00:15:36.400 data. And it’s clear there’s different services,

00:15:38.800 different transport protocols, for different needs.


00:15:41.733 What I want to move on to

00:15:43.566 next time, is starting to talk about

00:15:45.300 congestion control and how all these different

00:15:49.166 transport protocols manage the rate at which they send data.


Lecture 5 discussed reliable data transfer over the Internet. It started with a discussion of best effort packet delivery, and an explanation of why it makes sense for the Internet to be designed to be an unreliable network. Then, it moved on to discuss UDP and how to make applications and new transport protocols that work on an unreliable network. There's a trade-off between timeliness and reliability that's important here, and the lecture gave some examples of this to illustrate why many real-time applications use UDP.

The bulk of the lecture discussed TCP. It spoke about how TCP sends acknowledgements for packets, how timeouts and triple-duplicate ACKs indicate loss, and why a triple-duplicate ACK is chosen as the loss signal. It also discussed head-of-line blocking, and how the in-order, single stream, reliable service model of TCP leads to head-of-line blocking and potential latency.

Finally, it discussed the differences between QUIC and TCP. QUIC acknowledges packets rather than bytes within a stream, uses ACK frames rather than an ACK header, and delivers multiple streams of data, allowing it to avoid head-of-line blocking in many cases.

The focus of the discussion will be on how TCP ensures reliability, to make sure the mechanism is understood, and on the differences between the TCP and QUIC service models and how QUIC can improve latency. We'll also discuss how UDP can form a substrate, to easily allow new transports, to suit different needs, to be built and deployed.