Networked Systems H (2021-2022)
Lecture 5: Reliability and Data Transfer
Lecture 5 discusses reliable and unreliable data transfer in the Internet. It explains the best-effort nature of packet delivery, the end-to-end argument, and the timeliness-vs-reliability trade-off inherent in the design of the Internet. And it discusses three transport protocols in use in the Internet, UDP, TCP, and QUIC, how they provide different degrees of timeliness and reliability, and how they offer different services to applications.
Part 1: Packet Loss in the Internet
The first part of the lecture discusses packet loss in the Internet. It talks about the causes of packet loss, the end-to-end argument, and the timeliness-reliability trade-off.
00:00:00.633 In this lecture I want to move
00:00:02.333 on from the discussion of connection establishment,
00:00:04.800 and talk instead about reliability and effective
00:00:07.433 data transfer across the network.
00:00:10.000 There are four parts to this.
00:00:12.000 In this first part, I’ll talk briefly
00:00:14.066 about packet loss in the Internet,
00:00:15.866 and the trade-off between reliability and timeliness.
00:00:19.000 Then, I’ll move on to discuss unreliable
00:00:21.633 data using UDP, and talk about the
00:00:23.833 types of applications that benefit from this.
00:00:26.700 In part three, I’ll talk about reliable
00:00:29.000 data transfer with TCP. I’ll discuss the
00:00:31.966 TCP service model, how TCP ensures data
00:00:34.866 is delivered reliably, and some of the
00:00:37.066 limitations of TCP relating to head-of-line blocking.
00:00:41.000 Then, in the final part, I’ll conclude
00:00:43.100 by discussing how QUIC transfers data and
00:00:45.433 how this differs from TCP.
00:00:49.866 I want to start by discussing packet loss in the Internet.
00:00:52.833 What we mean when we say that the Internet
00:00:55.000 provides a best effort service.
00:00:57.066 The end-to-end argument.
00:00:58.733 And the timeliness vs reliability trade-off inherent
00:01:01.400 in the design of the Internet.
00:01:05.833 As we discussed back in lecture 1,
00:01:08.066 the Internet is a best effort packet delivery network.
00:01:11.933 This means that it’s unreliable by design.
00:01:15.000 IP packets can be lost, delayed,
00:01:17.533 reordered, or corrupted in transit. And this
00:01:20.900 is regarded as a feature, rather than a bug.
00:01:23.766 A network that can’t deliver
00:01:25.766 a packet is supposed to discard it.
00:01:29.000 There are many reasons why a packet
00:01:31.133 can get lost or discarded. It could
00:01:33.733 be due to a transmission error,
00:01:35.433 where electrical noise or wireless interference corrupts
00:01:37.833 the packet in transit, making the packet unreadable.
00:01:41.833 Or it could be because too much
00:01:43.300 traffic is arriving at some intermediate link
00:01:45.500 in the network, so an intermediate router
00:01:48.433 runs out of buffer space. If traffic
00:01:51.033 is arriving at a router from several
00:01:52.666 different incoming links, but all going to
00:01:55.200 the same destination, so it’s arriving faster
00:01:57.700 than it can be delivered, a queue
00:01:59.900 of packets will build up, waiting for transmission.
00:02:03.033 If this situation persists, the queue might
00:02:05.400 grow so much that a router runs
00:02:07.633 out of memory, and has no choice
00:02:09.400 but to discard the packets.
00:02:12.000 Or packets could be lost because of
00:02:13.966 a link failure. Or a router bug.
00:02:15.833 Or for other reasons.
00:02:18.000 How often this happens varies significantly.
00:02:22.000 The packet loss rate depends on the type of link.
00:02:26.100 Wireless links tend to be less reliable
00:02:28.433 than wired links, for example.
00:02:30.966 It’s reasonably likely that a packet sent over a wireless
00:02:34.533 link, such as WiFi or 4G,
00:02:36.466 will be corrupted in transit due to
00:02:38.433 noise, interference, or cross traffic.
00:02:41.066 This is very unlikely on an Ethernet
00:02:43.633 or optical fibre link.
00:02:46.000 The packet loss rate also depends on
00:02:48.933 the overall quality and robustness of the infrastructure.
00:02:52.366 Countries with well developed
00:02:53.800 and well maintained infrastructure
00:02:55.400 tend to have reliable Internet links;
00:02:58.366 countries with less robust or lower
00:03:00.500 capacity infrastructure tend to see more problems.
00:03:04.833 And the loss rate depends on the protocol.
00:03:07.966 Some protocols intentionally try to push
00:03:10.000 links to capacity, causing temporary overload as
00:03:12.500 they try to find the limit,
00:03:14.633 as they try to find the maximum
00:03:16.666 transmission rate they can achieve.
00:03:19.000 TCP and QUIC do this in many cases,
00:03:22.033 depending on the congestion control algorithm
00:03:24.366 used, as we’ll see in lecture 6.
00:03:28.000 Other applications, such as telephony or video
00:03:30.200 conferencing, tend to have an upper bound
00:00:32.400 on the amount of data they can send.
00:03:35.066 Whatever the reason, though,
00:03:36.933 some packet loss is inevitable.
00:03:40.000 The transport layer needs to recognise this.
00:03:42.533 It must detect packet loss. And,
00:03:44.866 if the application needs reliability, it must
00:03:47.133 retransmit or otherwise repair any lost data.
00:03:53.000 That the Internet provides best effort packet
00:03:55.266 delivery is a result of the end-to-end argument.
00:03:58.966 The end-to-end argument considers whether it’s better
00:04:02.133 to place functionality inside the network or
00:04:04.300 at the end points.
00:04:06.833 For example, rather than provide best effort
00:04:09.633 delivery, we could try to make the
00:04:11.766 network deliver packets reliably. We could design
00:04:15.466 some way to detect packet loss on
00:04:17.133 a particular link, and request that the
00:04:19.166 lost packets be retransmitted locally,
00:04:21.466 somewhere within the network.
00:04:23.666 And, indeed, some network links do this.
00:04:27.000 In WiFi networks, for example, the base
00:04:29.666 station acknowledges packets it receives from the
00:04:31.800 clients, and requests any corrupted packets are
00:04:34.966 re-sent, to correct the error.
00:04:38.000 The problem is, that unless this mechanism
00:04:40.333 is 100% perfect all the time,
00:04:43.033 then end systems will still need to
00:04:44.966 check if the data has been received
00:04:46.600 correctly, and will still need some way
00:04:48.600 of retransmitting packets in the case of problems.
00:04:52.000 And if they’ve got that, why bother
00:04:54.133 with the in-network retransmission and repair?
00:04:58.000 Often times, if you add features into
00:05:00.233 the network routers, they end up duplicating
00:05:03.000 functionality that the network endpoints need to
00:05:05.500 provide anyway.
00:05:08.600 Maybe the performance benefit of adding features
00:05:11.833 to the network is so big that it’s worth while.
00:05:16.000 But often, the right thing to do
00:05:17.566 is to keep the network simple.
00:05:19.733 Omit anything that can be done by the endpoints.
00:05:22.633 And favour simplicity over the
00:05:24.533 absolute optimal performance.
00:05:28.300 The end-to-end argument is one of the
00:05:29.933 defining principles of the Internet. And I
00:05:32.900 think it’s still a good approach to
00:05:34.566 take, when possible. Keep the network simple, if you can.
00:05:39.000 The paper linked from the slide talks
00:05:40.866 about this subject in a lot more detail.
00:05:46.000 Irrespective of whether retransmission of lost packets
00:05:49.033 happen between the endpoints or within the
00:05:51.766 network, it takes time.
00:05:54.566 This leads to a fundamental trade-off in
00:05:56.400 the design of the network.
00:05:59.000 If a connection is to be reliable,
00:06:01.266 it cannot guarantee timeliness.
00:06:04.400 It’s not possible to build absolutely perfect
00:06:07.066 network links, that never discard or corrupt
00:06:09.433 packets. There’s always some risk that the
00:06:12.566 data is lost and needs to be
00:06:14.833 retransmitted. And retransmitting a packet will always
00:06:18.133 take time, and so disrupt the timeliness of the delivery.
00:06:22.400 And similarly, if a connection is to
00:06:24.600 be timely, it cannot guarantee reliability.
00:06:27.800 There’s a trade-off to be made.
00:06:31.100 Protocols like UDP are timely but don’t
00:06:33.966 attempt to be reliable. They send packets,
00:06:36.800 and if they get lost, they get lost.
00:06:40.533 TCP and QUIC, on the other hand,
00:06:42.566 aim to be reliable. They send the
00:06:45.733 packets, and if they get lost,
00:06:47.366 they retransmit them.
00:06:49.666 And if the retransmission gets lost? They
00:06:52.200 try again, until the data eventually arrives.
00:06:55.533 As we’ll see in part 3 of
00:06:57.533 this lecture, this causes head of line
00:06:59.266 blocking, making the protocol less timely.
00:07:03.000 And other protocols, such as the Real-time
00:07:05.466 Transport Protocol, RTP, that I’ll talk about
00:07:09.166 in lecture 7, or the partially reliable
00:07:11.566 version of the Stream Control Transmission Protocol,
00:07:13.800 SCTP, aim for a middle ground.
00:07:17.466 They try to correct some, but not
00:07:19.100 all, of the transmission errors. They try
00:07:22.000 to achieve a balance, a middle-ground,
00:07:24.233 between timeliness and reliability.
00:07:29.266 The different protocols exist because different applications
00:07:32.400 make different trade-offs.
00:07:34.233 Some applications prefer timeliness,
00:07:36.533 some prefer reliability.
00:07:39.366 For applications like web browsing, email,
00:07:41.833 or messaging, you want to receive all
00:07:44.533 the data. If I’m loading a web
00:07:47.333 site, I’d like it to load quickly,
00:07:49.300 sure. But I prefer for it to
00:07:51.800 load slowly, and be uncorrupted, rather than
00:07:54.433 load quickly with some parts missing.
00:07:57.466 For a video conferencing tool, like Zoom,
00:08:00.100 though, the trade-off is different. If I’m
00:08:03.200 having a conversation with someone, it’s more
00:08:05.166 important that the latency is low,
00:08:07.066 than the picture quality is perfect.
00:08:10.000 The same may be true for gaming.
00:08:13.000 And this has implications for the way
00:08:15.166 we design the network.
00:08:17.000 It means that the IP layer needs
00:08:18.933 to be unreliable. It needs to be
00:08:21.066 a best effort network.
00:08:23.400 If the IP layer is unreliable,
00:08:25.700 protocols like TCP and QUIC can sit
00:08:28.100 on top and retransmit packets to make
00:08:30.200 it reliable. A transport protocol can make
00:08:33.533 an unreliable network into a reliable one.
00:08:37.366 But if the IP layer is reliable,
00:08:39.666 if the IP layer retransmits packets itself,
00:08:42.700 then the network, the applications, the transport
00:08:45.366 protocols, can’t undo that.
00:08:51.466 So this concludes the discussion of packet
00:08:53.533 loss and why the Internet opts to
00:08:55.433 provide an unreliable, best-effort, service.
00:08:58.566 In the next part, I’ll talk about
00:09:00.233 UDP and how to make use of
00:09:02.100 an unreliable transport protocol.
Part 2: Unreliable Data Using UDP
The second part of the lecture discusses UDP. It outlines the UDP service model, reviews how to send and receive data using UDP sockets, and considers the implications of unreliable delivery for applications using UDP. It discusses how UDP is suitable for real-time applications that prioritise low latency over reliability. And it discusses the use of UDP as a substrate on which alternative transport protocols can be implemented, avoiding some of the challenges of protocol ossification.
00:00:00.300 In this part, I’ll move on to
00:00:02.166 discuss how to send unreliable data using UDP.
00:00:05.400 I’ll talk about the UDP service model,
00:00:07.900 how to send and receive packets,
00:00:09.833 and how to layer protocols on top of UDP.
00:00:14.000 UDP provides an unreliable,
00:00:16.300 connectionless, datagram service.
00:00:18.600 It adds only two features on top
00:00:20.566 of the IP layer: port numbers and a checksum.
00:00:24.000 The checksum is used to detect whether
00:00:26.300 the packet has been corrupted in transit.
00:00:28.666 If so, the packet will be discarded
00:00:30.933 by the UDP code in the operating
00:00:33.066 system of the receiver, and won’t be
00:00:34.833 delivered to the application.
00:00:37.000 The port numbers determine what application receives
00:00:39.866 the UDP datagrams when they arrive at
00:00:42.066 the destination. They’re set by the bind()
00:00:44.500 call, once the socket has been created.
00:00:47.566 The Internet Assigned Numbers Authority, the IANA,
00:00:50.566 maintains a list of well-known UDP port
00:00:52.633 numbers which you should use for particular
00:00:55.200 applications. This is linked from the bottom of the slide.
00:00:59.400 UDP is very minimal. It doesn’t provide
00:01:02.166 reliability, or ordering, or congestion control.
00:01:05.533 It just delivers packets to an application
00:01:08.400 that’s bound to a particular port.
00:01:11.000 Mostly, UDP is used as a substrate.
00:01:13.800 It’s a base on which higher-layer protocols are built.
00:01:17.666 QUIC is an example of this,
00:01:19.433 as we discussed in the last lecture.
00:01:21.600 Others are the Real-time Transport Protocol,
00:01:24.100 and the DNS protocol,
00:01:25.466 that we’ll talk about later in the course.
00:01:29.666 UDP is connectionless. It’s got no notion
00:01:32.633 of clients or servers, or of establishing
00:01:34.933 a connection before it can be used.
00:01:38.000 To use UDP, you first create a socket.
00:01:41.433 Then you call bind(),
00:01:42.766 to choose the local port on which that socket
00:01:44.833 listens for incoming datagrams.
00:01:46.900 Then you call recvfrom() if you want
00:01:49.400 to receive a datagram on that socket,
00:01:51.800 or sendto() if you want to send a datagram.
00:01:55.000 You don’t need to connect.
00:01:56.966 You don’t need to accept connections.
00:01:59.266 You just send and receive data.
00:02:01.866 And maybe that data is delivered.
00:02:05.000 When you’re finished, you close the socket.
00:02:08.433 Protocols that run on top of UDP,
00:02:10.700 such as QUIC, might add support for
00:02:13.133 connections, reliability, ordering,
00:02:15.333 congestion control, and so on,
00:02:17.166 but UDP itself supports none of this.
00:02:22.866 To send a UDP datagram, you use
00:02:25.000 the sendto() function.
00:02:27.033 This works similarly to the send() function
00:02:29.566 you used to send data over a
00:02:31.033 TCP connection in the labs, except that
00:02:33.933 it takes two additional parameters to indicate
00:02:36.566 the address to which the datagram should
00:02:39.066 be sent, and the size of that address.
00:02:41.800 When using TCP, you establish a connection
00:02:45.133 between a socket, bound to a local
00:02:46.933 address and port, and a server listening
00:02:49.600 on a particular port on some remote
00:02:51.400 IP address. And once the connection is
00:02:54.033 established, all the data goes over that
00:02:55.966 connection, to the same destination.
00:02:59.033 UDP is not like that.
00:03:01.533 Every time you call sendto(), you specify
00:03:04.333 the destination address. Every packet you send
00:03:07.633 from a UDP socket can go to
00:03:09.800 a different destination, if you want.
00:03:12.166 There’s no notion of connections.
00:03:15.400 Now, you can call connect() on a
00:03:17.600 UDP socket, if you like, but it doesn’t actually create
00:03:20.233 a connection. Rather, it just remembers the
00:03:23.666 address you give it, so you can
00:03:25.400 call send(), rather than sendto() in future,
00:03:28.400 to save having to specify the address each time.
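As a concrete sketch of this, the following loopback demo (the function name and the use of an ephemeral port are illustrative, not from the lecture; error checking is omitted for brevity) binds a receiving socket, then sends a single datagram to it, supplying the destination address as a parameter of the sendto() call itself:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Loopback sketch: one socket bound as a receiver, and a second
   socket that names the destination on the sendto() call itself.
   Returns the number of bytes received. */
ssize_t udp_sendto_demo(void) {
    int rx = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = 0;              /* let the OS pick a port */
    bind(rx, (struct sockaddr *)&addr, sizeof(addr));

    socklen_t alen = sizeof(addr);
    getsockname(rx, (struct sockaddr *)&addr, &alen);  /* learn the port */

    /* The sender never binds or connects: the destination address
       is passed on every sendto() call. */
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    const char msg[] = "hello";
    sendto(tx, msg, sizeof(msg), 0, (struct sockaddr *)&addr, sizeof(addr));

    char buf[64];
    ssize_t n = recvfrom(rx, buf, sizeof(buf), 0, NULL, NULL);

    close(tx);
    close(rx);
    return n;
}
```

Each sendto() here could just as easily name a different destination; nothing ties the sending socket to one peer.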
00:03:33.000 To receive a UDP datagram, you call
00:03:36.133 the recvfrom() function, as shown on the slide.
00:03:39.800 This is like the recv() call you
00:03:42.000 use with TCP, but again it has
00:03:44.566 two additional parameters. These allow it to
00:03:47.433 record the address that the received datagram
00:03:49.700 came from, so you can use them
00:03:52.133 in the sendto() function to send a reply.
00:03:54.766 You can also call recv(), rather than
00:03:57.000 recvfrom(), like with TCP, and it works,
00:04:00.566 but it doesn’t give you the return
00:04:02.233 address, so it’s not very useful.
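A sketch of that reply pattern, again over loopback with illustrative names and error checking omitted: the server records the sender's address through recvfrom()'s extra parameters, then passes that address straight back to sendto() to answer:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Echo sketch: recvfrom() fills in the peer's address, which is
   then used unchanged as the destination of the reply. Returns the
   number of bytes the client gets back. */
ssize_t udp_echo_demo(void) {
    int srv = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in saddr;
    memset(&saddr, 0, sizeof(saddr));
    saddr.sin_family      = AF_INET;
    saddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    saddr.sin_port        = 0;
    bind(srv, (struct sockaddr *)&saddr, sizeof(saddr));
    socklen_t slen = sizeof(saddr);
    getsockname(srv, (struct sockaddr *)&saddr, &slen);

    int cli = socket(AF_INET, SOCK_DGRAM, 0);
    sendto(cli, "ping", 4, 0, (struct sockaddr *)&saddr, sizeof(saddr));

    /* recvfrom() records where the datagram came from... */
    struct sockaddr_in peer;
    socklen_t plen = sizeof(peer);
    char buf[64];
    ssize_t n = recvfrom(srv, buf, sizeof(buf), 0,
                         (struct sockaddr *)&peer, &plen);

    /* ...so the reply can be sent back to exactly that address. */
    sendto(srv, buf, (size_t)n, 0, (struct sockaddr *)&peer, plen);

    ssize_t m = recv(cli, buf, sizeof(buf), 0);
    close(cli);
    close(srv);
    return m;
}
```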
00:04:05.366 The important point with UDP is that
00:04:07.900 packets can be lost, delayed, or reordered
00:04:10.166 in transit, and UDP doesn’t attempt to
00:04:12.500 recover from this.
00:04:14.900 Just because you send a datagram,
00:04:17.000 doesn’t mean it will arrive. And if
00:04:19.500 datagrams do arrive, they won’t necessarily arrive
00:04:21.933 in the order sent.
00:04:27.900 Unlike TCP, where data written to a
00:04:30.700 connection in a single send() call might
00:04:32.933 end up being split across multiple read()
00:04:35.066 calls at the receiver, a single UDP
00:04:37.900 send generates exactly one datagram.
00:04:41.566 If it’s delivered at all, the data
00:04:43.800 sent by a single call to sendto()
00:04:45.866 will be delivered by a single call
00:04:47.566 to recvfrom(). UDP doesn’t split messages.
00:04:52.233 But UDP is otherwise unreliable.
00:04:54.966 Datagrams can be lost, delayed, reordered,
00:04:57.766 or duplicated in transit.
00:05:00.400 Data sent with sendto() might never arrive.
00:05:03.300 Or it might arrive more than once.
00:05:05.966 Or data sent in consecutive calls to
00:05:08.266 sendto() might arrive out of order,
00:05:10.633 with data sent later arriving first.
00:05:14.700 UDP doesn’t attempt to correct any of these things.
00:05:19.500 The protocol you build on top of
00:05:21.200 UDP might choose to do so.
00:05:23.900 For example, we saw that QUIC adds
00:05:26.000 packet sequence numbers and acknowledgement frames to
00:05:28.433 the data it sends within UDP packets.
00:05:31.366 This lets it put the data back
00:05:33.100 into the correct order, and retransmit any
00:05:34.966 missing packets.
00:05:36.800 But there’s no requirement that the protocol
00:05:38.566 running over UDP is reliable.
00:05:41.500 RTP, the Real-time Transport Protocol, that’s used
00:05:44.933 for video conferencing apps, puts sequence numbers
00:05:47.600 and timestamps inside the UDP datagrams it
00:05:50.666 sends, so it can know if any
00:05:53.033 data is missing, and it can conceal
00:05:55.000 loss or reconstruct the packet playout time,
00:05:58.466 but it generally doesn’t retransmit missing data.
00:06:03.000 UDP gives the application the choice of
00:06:05.566 building reliability, if it wants it.
00:06:07.866 But it doesn’t require that the applications
00:06:09.966 deliver data reliably.
00:06:14.000 Applications that use UDP need to organise
00:06:16.766 the data they send, so it’s useful
00:06:18.400 if some data is lost.
00:06:21.133 Different applications do this in different ways,
00:06:23.633 depending on their needs.
00:06:26.000 QUIC, for example, organises the data into
00:06:28.533 sub-streams within a connection,
00:06:30.300 and retransmits missing data.
00:06:33.266 Video conferencing applications
00:06:34.766 tend to do something different.
00:06:37.333 The way video compression works, is that
00:06:39.700 the codec sends occasional full frames of
00:06:41.700 video, known as I-frames, index frames,
00:06:44.700 every few seconds. And in between these
00:06:48.000 it sends only the differences from the
00:06:49.566 previous frame, known as P-frames, predicted frames.
00:06:53.866 In a video call, it’s common for
00:06:55.900 the background to stay the same,
00:06:57.433 while the person moves in the foreground,
00:06:59.766 so a lot of the frame is
00:07:01.166 the same each time. By only sending
00:07:03.866 the differences, video compression saves bandwidth.
00:07:07.733 But this affects how the application treats
00:07:09.533 the different datagrams.
00:07:12.000 If a UDP datagram containing a predicted
00:07:14.500 frame is lost, it’s not that important.
00:07:17.433 You’ll get a glitch in one frame of video.
00:07:20.500 But if a UDP datagram containing an
00:07:22.700 index frame, or part of an index
00:07:25.300 frame, is lost, then that matters a
00:07:27.533 lot more because the next few seconds
00:07:29.700 worth of video are predicted based on
00:07:31.600 that index frame. Losing an index frame
00:07:34.766 corrupts several seconds worth of video.
00:07:38.000 For this reason, many video conferencing apps
00:07:40.166 running over UDP try to determine if
00:07:42.866 missing packets contained an index frame or
00:07:45.233 not. And they try to retransmit index
00:07:48.000 frames, but not predicted frames.
00:07:51.000 The details of how they do this
00:07:52.833 aren’t really important, unless you’re building a
00:07:54.966 video conferencing app.
00:07:56.766 What’s important though, is that UDP gives
00:07:58.966 the application flexibility to be unreliable for
00:08:02.033 some of the datagrams it sends,
00:08:04.266 while trying to deliver other datagrams reliably.
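As a sketch of that decision (the one-byte frame-type tag and all the names here are hypothetical, not something the lecture specifies; real applications use codec- or protocol-specific headers), a receiver might request retransmission only for index frames:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-byte frame-type tag that this sketch assumes the
   sender places at the start of each datagram. */
enum { FRAME_INDEX = 1, FRAME_PREDICTED = 2 };

/* Losing an index frame corrupts the several seconds of video that
   are predicted from it, so it is worth retransmitting; a lost
   predicted frame costs only a one-frame glitch, so it is not. */
bool should_retransmit(uint8_t frame_type) {
    return frame_type == FRAME_INDEX;
}
```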
00:08:08.000 You don’t have that flexibility with TCP.
00:08:12.333 UDP is harder to use, because it
00:08:14.766 provides very few services to help your
00:08:16.600 application, but it’s more flexible because you
00:08:19.333 can build exactly the services you need
00:08:21.200 on top of UDP.
00:08:25.100 Fundamentally, UDP doesn’t make any attempt to
00:08:28.233 provide sequencing, reliability,
00:08:30.500 timing recovery, or congestion control.
00:08:33.766 It just delivers datagrams on a best effort basis.
00:08:38.000 It lets you build any type of
00:08:39.933 transport protocol you want, running inside UDP packets.
00:08:44.000 Maybe that transport protocol has sequence numbers
00:08:46.766 and acknowledgements, and retransmits some or all
00:08:49.533 of the lost packets.
00:08:52.100 Maybe, instead, it uses error correcting codes,
00:08:55.066 to allow some of the packets to
00:08:57.233 be repaired without retransmission.
00:08:59.666 Maybe it includes timestamps, so the receiver
00:09:02.233 can carefully reconstruct the timing.
00:09:04.733 Maybe it contains other information.
00:09:07.166 The point is that UDP gives you
00:09:09.133 flexibility, but at the cost of having
00:09:11.500 to implement these features yourself. At the
00:09:14.066 cost of adding complexity.
00:09:18.000 There’s a lot to think about when
00:09:19.866 writing a UDP-based protocol or a UDP-
00:09:22.133 based application.
00:09:24.033 If you use a transport protocol,
00:09:26.200 like QUIC or like RTP, that runs
00:09:28.633 over UDP, then the designers of that
00:09:31.366 protocol have made these decisions, and will
00:09:33.833 have given you a library you can use.
00:09:36.500 If not, if you’re designing your own
00:09:38.566 protocol that runs over UDP, then the
00:09:41.733 IETF has written some guidelines, highlighting the
00:09:44.066 issues you need to think about,
00:09:45.600 in RFC 8085.
00:09:48.200 Please read this before you try and
00:09:49.866 write applications that use UDP. There are
00:09:52.666 a lot of non-obvious things that can catch you out.
00:09:57.500 So, that concludes our discussion of UDP.
00:10:00.233 In the next part, I’ll talk about
00:10:01.866 how TCP delivers data reliably.
Part 3: Reliable Data with TCP
The third part of the lecture discusses TCP. It outlines the TCP service model and shows how to send and receive data using a TCP connection. It explains how TCP ensures reliable and ordered data transfer, using sequence numbers and acknowledgements. And it explains TCP loss detection using timeouts and triple-duplicate acknowledgements. The issue of head-of-line blocking in TCP connections is discussed, as an example of the timeliness vs reliability trade-off.
00:00:00.233 In this part I want to talk
00:00:02.000 about how reliable data is delivered using
00:00:03.966 TCP connections. I’ll talk about the TCP
00:00:06.866 service model, how TCP uses sequence numbers
00:00:10.400 and acknowledgments, and how packet loss detection
00:00:13.500 and recovery works in TCP.
00:00:17.033 Thinking about the TCP service model,
00:00:19.266 as we've seen in previous lectures,
00:00:21.600 TCP provides a reliable, ordered, byte stream
00:00:24.966 delivery service that runs over IP.
00:00:27.966 The applications write data into the TCP
00:00:30.733 socket, that buffers it up in the
00:00:32.833 sending system, and then delivers it as
00:00:35.533 a sequence of data segments over the IP layer.
00:00:38.866 When these data packets, these data segments,
00:00:41.766 are received, they are accumulated in a
00:00:44.533 receive buffer at the receiver. If anything
00:00:47.200 is lost, or arrives out of order,
00:00:49.066 it's re-transmitted, and eventually the data is
00:00:51.433 delivered to the application.
00:00:53.333 The data delivered to the application is
00:00:55.666 always delivered reliably, and in the order sent.
00:00:58.733 If something is lost, if something needs
00:01:01.600 to be re-transmitted, this stalls the delivery
00:01:04.400 of the later data, to make sure
00:01:06.766 that everything is always delivered in order.
00:01:10.966 TCP delivers, as we say, an ordered,
00:01:13.766 reliable, byte stream.
00:01:16.366 After the connection has been established,
00:01:18.466 after the SYN, SYN-ACK, ACK handshake,
00:01:20.866 the client and the server can send
00:01:22.633 and receive data.
00:01:24.700 The data can flow in either direction
00:01:27.166 within that TCP connection.
00:01:29.400 It’s usual that the data follows a
00:01:31.900 request response pattern. You open the connection.
00:01:35.100 The client sends a request to the
00:01:36.733 server. The server replies with a response.
00:01:39.400 The client makes another request. The server
00:01:41.566 replies with another response, and so on.
00:01:44.566 But TCP doesn't make any requirements on
00:01:46.900 this. There’s no requirement that the data
00:01:49.266 flows in a request response pattern,
00:01:51.300 and the client and the server can
00:01:53.666 send data in any order they feel like.
00:01:56.366 TCP does ensure that the data is
00:01:58.400 delivered reliably, and in the order it
00:02:00.600 was sent, though.
00:02:02.766 TCP sends acknowledgments for each data segment
00:02:05.600 as it's received. And if any data
00:02:07.733 is lost, it retransmits that lost data.
00:02:10.733 And if segments are delayed and arrive
00:02:13.033 out of order, or if a segment
00:02:15.166 has to be re-transmitted and arrives out
00:02:17.166 of order, then TCP will reconstruct the
00:02:19.300 order before giving the segments back to the application.
00:02:25.533 In order to send data over a
00:02:27.300 TCP connection you use the send() function.
00:02:30.766 This transmits a block of data over
00:02:33.500 the TCP connection. The parameters are the
00:02:37.133 file descriptor representing the socket – the
00:02:39.500 TCP socket, the data, the length of
00:02:42.066 the data, and a flag. And the
00:02:44.366 flag field is usually zero.
00:02:47.466 The send() function blocks until all the
00:02:49.600 data can be written.
00:02:51.400 And it might take a significant amount
00:02:53.900 of time to do this, depending on
00:02:55.866 the available capacity of the network.
00:02:59.433 It also might not be able to
00:03:00.800 send all the data.
00:03:02.833 If the connection is congested, and can't
00:03:05.233 accept any more data, then the send()
00:03:06.966 function will return to indicate that it
00:03:10.566 wasn't able to successfully send all the
00:03:12.766 data that was requested.
00:03:15.300 The return value from the send() function
00:03:17.300 is the amount of data it actually
00:03:18.766 managed to send on the connection.
00:03:20.266 And that can be less than the
00:03:21.866 amount it was asked to send.
00:03:23.566 In which case, you need to figure
00:03:25.533 out what data was not sent,
00:03:27.100 by looking at the return value,
00:03:29.800 and the amount you asked for,
00:03:31.233 and re-send just the missing part in another call.
00:03:34.833 Similarly, if an error occurs, if the
00:03:37.333 connection has failed for some reason,
00:03:39.500 the send() function will return -1,
00:03:41.166 and it will set the global variable
00:03:42.566 errno to indicate that.
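That re-sending loop is a standard pattern; a minimal sketch (the helper name send_all is mine, not from the lecture) keeps calling send() on the unsent remainder until everything has been accepted or a real error occurs:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Keep calling send() until every byte has been accepted.
   Returns 0 on success, or -1 on error (errno set by send()). */
int send_all(int fd, const char *data, size_t len) {
    size_t done = 0;
    while (done < len) {
        ssize_t n = send(fd, data + done, len - done, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted before sending: retry */
            return -1;         /* genuine error: report to caller */
        }
        done += (size_t)n;     /* partial send: advance past sent bytes */
    }
    return 0;
}
```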
00:03:46.800 On the receiving side you call the
00:03:49.100 recv() function to receive data on a
00:03:50.966 TCP connection.
00:03:53.200 The recv() function blocks until data is
00:03:55.833 available, or until the connection is closed.
00:04:01.333 It’s passed a buffer, buf, and the
00:04:04.666 size of the buffer, BUFLEN, and it
00:04:07.066 reads up to BUFLEN bytes of data.
00:04:09.600 And what it returns is the number
00:04:11.700 of bytes of data that were read.
00:04:14.066 Or, if the connection was closed,
00:04:16.100 it returns zero. Or, if an error
00:04:18.900 occurs, it returns -1, and again sets
00:04:21.700 global variable errno to indicate what happened.
00:04:26.933 When a recv() call finishes, you have
00:04:29.500 to check these three possibilities. You have
00:04:31.900 to check if the return value is
00:04:33.300 zero, to indicate that the connection is
00:04:35.466 closed and you've successfully received all the
00:04:38.133 data in that connection. At which point,
00:04:40.366 you should also close the connection.
00:04:42.900 You have to check if the return
00:04:44.300 value is minus one, in which case
00:04:46.166 an error has occurred, and that connection
00:04:48.566 has failed, and you need to somehow
00:04:50.566 handle that error.
00:04:53.266 And you need to check if it's some other value,
00:04:55.900 to indicate that you've received some data,
00:04:57.900 and then you need to process that data.
00:05:01.133 What's important is to remember that the
00:05:04.200 recv() call just gives you that data
00:05:07.033 in the buffer. If the return value
00:05:09.700 from recv() is 157, this indicates that
00:05:12.566 the buffer has 157 bytes of data in it.
00:05:16.366 What the recv() call doesn't ever do,
00:05:18.833 is add a terminating null to that buffer.
00:05:22.366 Now, if you're careful that doesn't matter,
00:05:26.133 because you know how much data is
00:05:28.300 in the buffer, and you can explicitly
00:05:30.400 process the data up to that length.
00:05:33.866 But, a common problem with TCP-based applications,
00:05:38.500 is that they treat the data as if it was a string.
00:05:43.366 They pass it to the printf() call
00:05:45.200 using %s as if it were a
00:05:47.200 string, or they pass it to function
00:05:49.666 like strstr() to search for a string
00:05:51.533 within it, or strcpy(), or something like that.
00:05:56.133 And the problem is the string functions
00:05:58.033 assume there’s a terminating null, and the
00:06:00.333 recv() call doesn't provide one.
00:06:03.766 If you're going to pass the data
00:06:05.866 that's returned from a recv() call to
00:06:08.600 one of the C string functions,
00:06:10.666 you need to explicitly add that null yourself.
00:06:13.866 You need to look at the buffer,
00:06:17.333 add the null at the end,
00:06:19.100 after the last byte which was successfully
00:06:21.533 received. If you don't do this, the
00:06:25.033 string functions will just run off the end of the buffer
00:06:27.300 and you'll get a buffer overflow attack.
00:06:29.733 And this is a significant security risk.
00:06:31.733 It’s one of the biggest security problems
00:06:33.666 with network code using C. It’s misusing
00:06:36.900 these buffers, accidentally using one of the
00:06:39.166 string functions, and it just reads off
00:06:41.966 the end of the buffer, and who knows what it processes.
00:06:48.566 When you send data using TCP,
00:06:50.700 the send() call enqueues the data for transmission.
00:06:55.200 The operating system, the TCP code in
00:06:57.900 the operating system, splits the data you've
00:07:00.366 written using the various send() calls into
00:07:02.266 what’s known as segments, and puts each
00:07:04.333 of these into a TCP packet.
00:07:07.433 The TCP packets are sent in IP
00:07:09.533 packets. And TCP runs a congestion control
00:07:12.933 algorithm to decide when it can send those packets.
00:07:17.166 Each TCP segment, each segment is in
00:07:20.200 a TCP packet. The TCP packets have
00:07:22.933 a header, which has a sequence number.
00:07:25.933 When the connection setup handshake happens,
00:07:28.700 in the SYN and the SYN-ACK packets,
00:07:31.366 the connection agrees the initial sequence numbers;
00:07:34.300 agrees the starting value for the sequence numbers.
00:07:37.666 If you’re the client, for example;
00:07:39.600 the client picks a sequence number at
00:07:43.200 random, and sends this in its SYN packet.
00:07:46.433 And then when it starts sending data,
00:07:48.600 the next data packet has a sequence
00:07:50.700 number that is one higher than that
00:07:52.466 in the SYN packet.
00:07:55.033 And, as it continues to send data,
00:07:57.700 the sequence numbers increase by the number
00:08:00.300 of data bytes sent.
00:08:02.400 So, for example, if the initial sequence
00:08:04.533 number was 1001, just picked randomly,
00:08:07.133 and it sends 30 bytes of data
00:08:09.466 in the packet, then the next sequence
00:08:12.733 number will be 1031.
00:08:16.533 The sequence number spaces are separate
00:08:18.800 in each direction. The sequence numbers
00:08:21.066 the client uses increase based on the
00:08:23.333 initial sequence number the client sent in the SYN packet.
00:08:26.366 The sequence numbers the server uses,
00:08:28.433 start based on the initial sequence number
00:08:30.600 the server sent in the SYN-ACK packet,
00:08:32.700 and increase based on the amount of
00:08:34.766 data the server is sending. The two
00:08:36.366 number spaces are unrelated.
00:08:41.600 What's important is that calls to send()
00:08:44.300 don't map directly onto TCP segments.
00:08:49.066 If the data which is given to
00:08:51.300 a send() call is too big to
00:08:52.900 fit into one TCP segment, then the
00:08:56.100 TCP code will split it across several
00:08:58.366 segments; it'll split it across several packets.
00:09:02.600 Similarly, if the data you send,
00:09:04.900 that data you give the send() call
00:09:06.666 is quite small, TCP might not send
00:09:09.066 it immediately.
00:09:11.066 It might buffer it up, combine it
00:09:13.166 with data sent as part of a
00:09:15.600 later send() call. And combine it,
00:09:18.266 and send it in a single larger
00:09:19.700 segment, a single larger TCP packet.
00:09:23.566 This is an idea known as Nagle’s
00:09:27.100 algorithm. It's there to improve efficiency by
00:09:30.200 only sending big packets, because there's a
00:09:32.633 certain amount of overhead for each packet.
00:09:35.733 Each packet that’s sent by TCP has
00:09:38.033 a TCP header. It’s got an IP
00:09:40.333 header. It's got the Ethernet or the
00:09:42.666 WiFi headers depending on the link layer.
00:09:45.033 And that adds a certain amount of
00:09:47.033 overhead. It’s about, I think, 40 bytes
00:09:48.966 per packet. So if you're only sending
00:09:51.066 a small amount of data, that's a
00:09:52.900 lot of overhead, a lot of wasted data.
00:09:55.533 So TCP, with the Nagle algorithm,
00:09:57.466 tries to combine these packets into larger
00:09:59.500 packets when it can. But, of course,
00:10:01.633 this adds some delay. It’s got to
00:10:03.800 wait for you to send more data;
00:10:05.400 wait to see if it can form a bigger packet.
00:10:09.133 If you really need low latency,
00:10:11.133 you can disable the Nagle algorithm.
00:10:13.100 There’s a socket option called TCP_NODELAY,
00:10:16.000 and we see the code on the
00:10:17.800 slide to show how to use that.
00:10:19.833 So you create the socket, you
00:10:23.300 establish the connection, and then you call
00:10:26.400 the TCP_NODELAY option and that turns this
00:10:28.700 off. And this means that every time
00:10:31.000 you send() on the socket, it immediately
00:10:32.900 gets sent as quickly as possible.
00:10:37.800 One implication of this behaviour, though,
00:10:40.233 where TCP can either split data written
00:10:43.800 in a single send() across multiple segments,
00:10:47.233 or where it can combine several send()
00:10:49.566 calls into a single segment, is that
00:10:52.400 the data returned by the recv() calls
00:10:54.900 doesn't always correspond to a single send().
00:10:58.400 When you call recv(), you might get
00:11:01.166 just part of a message. And you
00:11:03.266 need to call recv() again to get the rest of the message.
00:11:06.700 Or you may get several messages in one recv() call.
00:11:12.600 When you're using TCP, the recv() calls
00:11:14.933 return the data reliably, and they return
00:11:17.266 the data in the order that it was sent.
00:11:20.366 But what they don't do is frame
00:11:22.300 the data. What they don't do is
00:11:23.833 preserve the message boundaries.
00:11:27.233 For example, if we're using HTTP,
00:11:30.433 which we see, we see an example
00:11:32.666 of an HTTP message that might be sent,
00:11:34.866 an HTTP response that might be sent,
00:11:37.866 by a web server back to a browser.
00:11:41.566 If we're using HTTP, what we would
00:11:44.500 like is that the whole response is
00:11:46.800 received in one go. So if we're
00:11:50.066 implementing a web browser we just call
00:11:51.766 recv() on the TCP connection
00:11:53.633 and we get all of the headers,
00:11:55.866 and all of the body, in just
00:11:57.500 in just one call to recv() and
00:11:59.433 we can then parse it, and process it, and deal with it.
00:12:02.833 TCP doesn't guarantee this, though.
00:12:06.133 It can split the messages arbitrarily,
00:12:08.566 depending on how much data was in
00:12:11.166 the packets, what size packets the underlying
00:12:14.233 link layers can send, and on the
00:12:17.066 available capacity of the network depending on
00:12:19.566 the congestion control.
00:12:21.200 And it can split the packets at arbitrary points.
00:12:24.466 For example, if we look at the
00:12:26.800 slide, we see that the headers,
00:12:29.166 some of them are labeled in red,
00:12:30.533 some are in blue, some of the body is in blue,
00:12:33.233 and the rest of the body is
00:12:34.500 in green. And it could be that
00:12:36.400 the TCP connection splits the data up,
00:12:38.600 so that the first recv() call just
00:12:40.633 gets the part of the headers highlighted
00:12:42.466 in red,
00:12:43.500 ending halfway through the “ETag:” line.
00:12:46.466 And then you have to call recv()
00:12:48.333 again. And then you get the part
00:12:50.233 of the message highlighted in blue,
00:12:51.833 which contains the rest of the headers
00:12:53.600 and the first part of the body.
00:12:55.533 Then you have to call recv() again,
00:12:57.433 to get the rest of the message
00:12:59.033 that's highlighted in green on the slide.
00:13:01.300 And this makes it much harder to
00:13:03.166 parse; much harder for the programmer.
00:13:05.833 Because you have to look at the
00:13:07.866 data you've got, parse it, check to
00:13:09.900 see if you've got the whole message,
00:13:11.500 check if you've received the complete headers,
00:13:13.466 check to see if you've received the
00:13:15.033 complete body. And you have to handle
00:13:17.033 the fact that you might have partial messages.
00:13:20.633 And it's something which makes it a
00:13:22.200 little bit hard to debug, because if
00:13:24.466 you only send small messages,
00:13:25.833 if you're sending packets which are only
00:13:28.200 like 1000 bytes, or so, they’re probably
00:13:31.800 small enough to fit in a single
00:13:33.600 packet, and they always get delivered in one go.
00:13:36.333 It’s only when you start sending
00:13:38.400 larger packets, or sending lots of data
00:13:41.333 over the connection so that things get split up
00:13:43.800 due to congestion control, that you start
00:13:45.600 to see this behaviour where the messages
00:13:47.533 get split at arbitrary points.
00:13:54.133 So as we've seen, the TCP segments
00:13:58.200 contain sequence numbers, and the sequence numbers
00:14:00.333 count up with the number of bytes being sent.
00:14:03.600 Each TCP segment also has an acknowledgement number.
00:14:09.366 When a TCP segment is sent,
00:14:12.266 it acknowledges any segments that have previously
00:14:16.666 been received.
00:14:18.866 So if,
00:14:20.266 if a TCP endpoint has received some
00:14:24.733 data on a TCP connection,
00:14:27.266 when it sends its next packet,
00:14:29.400 the ACK bit will be set in
00:14:31.866 the TCP header, to indicate that the
00:14:33.866 acknowledgement number is valid, and the acknowledgement
00:14:36.500 number will have a value indicating the
00:14:39.100 next sequence number it is expecting.
00:14:42.166 That is, the next contiguous byte it's
00:14:44.533 expecting on the connection.
00:14:47.866 So, in the example, we have a
00:14:52.500 slightly unrealistic example in that the connection
00:14:54.733 is sending one byte at a time,
00:14:56.500 and the first packet is sent with sequence number five.
00:14:59.566 And then the next packet is sent
00:15:01.700 with sequence number six, and then seven,
00:15:03.833 and eight, and nine, and ten,
00:15:05.666 and so on. And this is what
00:15:07.800 might happen with an ssh connection,
00:15:09.600 where each key you type generates a
00:15:11.166 TCP segment, with just the one key press in it.
00:15:14.866 And when those packets are received at
00:15:17.866 host B, it sends a TCP segment
00:15:20.700 with the acknowledgement bit set, acknowledging what's
00:15:24.766 expected next.
00:15:26.233 So when it receives the TCP packet
00:15:29.800 with sequence number five, and one byte
00:15:31.833 of data in it, it sends an
00:15:33.900 acknowledgement saying it got it, and it's
00:15:36.133 expecting the packet with sequence number six next.
00:15:40.333 When it receives the packet with sequence
00:15:42.366 number six, and one byte of data
00:15:44.433 in it, it sends an acknowledgement saying
00:15:46.333 it's expecting seven. And so on.
00:15:51.033 TCP only ever acknowledges the next contiguous
00:15:55.766 sequence number expected.
00:15:58.233 And if a packet is lost,
00:16:00.500 subsequent packets generate duplicate acknowledgments.
00:16:05.300 So in this case, packet five was
00:16:08.733 sent. It got to the receiver,
00:16:10.766 and that sent the acknowledgement saying it
00:16:12.633 expected six. Six was sent, arrived at
00:16:15.100 the receiver, so the acknowledgement says it
00:16:17.133 expects seven.
00:16:18.800 Seven was sent, arrives at the receiver,
00:16:21.600 sends the acknowledgement saying it expects
00:16:23.333 eight. Eight was sent, and gets lost.
00:16:29.466 Nine was sent, and arrives at the receiver.
00:16:33.033 At this point, the receiver’s received the
00:16:36.066 packets with sequence numbers five, six,
00:16:38.000 and seven; eight is missing; and nine
00:16:40.366 has arrived. So the next contiguous sequence
00:16:43.400 number it's expecting is still eight.
00:16:46.233 So it sends an acknowledgement saying “I’m
00:16:48.633 expecting sequence number eight next”.
00:16:52.066 The packet sent, the next packet sent,
00:16:55.066 has sequence number 10. This arrives,
00:16:57.633 the acknowledgement goes back saying “I still
00:16:59.800 haven't got eight, I’m still expecting eight”,
00:17:02.400 and this carries on. TCP keeps sending
00:17:04.800 duplicate acknowledgments while there’s a gap in
00:17:06.900 the sequence number space.
00:17:11.533 In addition, we don't show it here,
00:17:14.000 but TCP can also send delayed acknowledgments,
00:17:16.333 where it only acknowledges every second packet.
00:17:18.466 In this case the acknowledgments might go,
00:17:20.666 six, eight. The packet with sequence number
00:17:23.966 five is sent, and it acknowledges six.
00:17:26.566 Packet with number six is sent,
00:17:28.366 and arrives, and packet number seven is
00:17:30.366 sent, and then it sends the acknowledgement
00:17:32.166 saying it's expecting eight. So it doesn't
00:17:34.366 have to send every acknowledgement, it can
00:17:36.300 send every other acknowledgement to reduce the overheads.
00:17:43.300 TCP uses the acknowledgments to detect packet
00:17:47.800 loss; to detect when segments are lost.
00:17:51.233 There’s two ways in which it does this.
00:17:54.466 The first is when it sends
00:17:57.433 data, but for some reason the acknowledgments stop entirely.
00:18:01.500 This is a sign that either the receiver has failed,
00:18:04.966 and, you know, the packets are being
00:18:06.866 delivered to the receiver, but the application
00:18:08.733 has crashed, and there's nothing there to
00:18:11.000 receive the data, to reply.
00:18:13.700 Or it's an indication that the network
00:18:15.800 connection has failed, and the packets are
00:18:17.900 just not reaching the receiver.
00:18:19.500 So if TCP is sending data,
00:18:21.633 and it's not getting any acknowledgments back,
00:18:24.066 after a while it times out and
00:18:26.933 uses this as an indication that the
00:18:28.866 connection has failed.
00:18:32.300 Alternatively, it can be sending data,
00:18:35.666 and if some data is lost,
00:18:39.700 but the later segments arrive, then TCP
00:18:42.000 will start sending the duplicate acknowledgments.
00:18:45.166 Again, back to the example, we see
00:18:47.900 that packet eight is lost, packet nine
00:18:50.266 arrives, and the sequence number, the acknowledgement
00:18:53.366 number, comes back saying “I’m expecting sequence
00:18:55.266 number eight”.
00:18:56.966 And packet ten is sent and it
00:18:59.133 arrives, and it still says “I’m still
00:19:00.666 expecting packet with sequence number eight”,
00:19:03.200 and this just carries on.
00:19:05.700 And, eventually, TCP gets what's known as
00:19:08.333 a triple duplicate acknowledgement. It’s got the
00:19:11.833 original acknowledgement saying it's expecting packet eight,
00:19:14.933 and then three duplicates following that,
00:19:17.266 so four packets in total, all saying
00:19:19.433 “I’m still expecting packet eight”.
00:19:22.533 And what this indicates, is that data
00:19:24.900 is still arriving, but something's got lost.
00:19:28.266 It only generates acknowledgements when a new
00:19:30.800 packet arrives, so if we keep seeing
00:19:33.000 acknowledgments indicating the same thing, this indicates
00:19:35.933 that new packets are arriving, because that's what
00:19:38.200 triggers the acknowledgement to be sent,
00:19:40.866 but there's still a packet missing,
00:19:43.400 and it's telling us which one it's expecting.
00:19:46.866 At that point TCP assumes that the
00:19:49.400 packet has got lost, and retransmits that
00:19:51.566 segment. It retransmits the packet with sequence
00:19:54.833 number eight.
00:19:59.233 Why does it wait for a triple duplicate acknowledgement?
00:20:03.466 Why does it not just retransmit it
00:20:06.033 immediately, when it sees a duplicate?
00:20:08.566 Well, the example we see here illustrates that.
00:20:13.466 In this case, a packet with sequence
00:20:15.733 number five is sent, containing one byte
00:20:17.866 of data, and it arrives, and the
00:20:19.866 receiver acknowledges it, saying it's expecting six.
00:20:23.400 And six is sent, and it arrives,
00:20:26.266 and the receiver acknowledges it, indicating it’s
00:20:28.333 expecting seven.
00:20:30.066 And packet seven is sent, and it's
00:20:32.866 delayed. And packet eight is sent,
00:20:35.566 and eventually arrives at the receiver.
00:20:38.233 Now the receiver hasn't received packet seven
00:20:41.100 yet, so it sends an acknowledgement which
00:20:43.500 says “I’m still expecting seven”. So that's
00:20:46.066 a duplicate acknowledgement.
00:20:48.200 At that point packet seven, which was
00:20:50.466 delayed, finally does arrive.
00:20:53.866 Now packet seven has arrived, packet eight
00:20:56.466 had arrived previously, so what is now
00:20:58.600 expecting is nine, so it sends an
00:21:00.833 acknowledgement for nine.
00:21:02.866 And we see that the acknowledgments go
00:21:05.266 six, seven, seven, nine, because that packet
00:21:08.033 seven was delayed a little bit.
00:21:11.900 And if TCP reacts to a single
00:21:14.300 duplicate acknowledgement as an indication that the
00:21:17.166 packet was lost, then you run the
00:21:20.233 risk that you're resending a packet on
00:21:23.033 the assumption that it was lost,
00:21:24.933 when it was merely delayed a little bit.
00:21:28.466 And there's a trade off you can make here.
00:21:31.733 Do you treat, a single duplicate as
00:21:35.600 an indication of loss? Do you treat
00:21:38.066 two duplicates as an indication of loss?
00:21:40.366 Three? Four? Five? At what point do
00:21:42.900 you say “this as an indication of
00:21:44.300 loss”, rather than just “this is a
00:21:46.566 slightly delayed packet, and it might recover
00:21:49.133 itself in a minute”?
00:21:53.600 The reason that a triple duplicate is
00:21:55.933 used, is because someone did some measurements,
00:21:58.833 and decided that packets being delayed
00:22:01.800 enough to cause one or two duplicates,
00:22:04.500 because they arrived just a little bit
00:22:06.933 out of order, was relatively common.
00:22:09.133 But packets being delayed enough that they
00:22:11.800 cause three or more duplicates is rare.
00:22:14.500 So it's balancing-off speed of loss detection
00:22:17.766 vs. the likelihood that a merely delayed
00:22:20.466 packet is treated as if it were
00:22:22.600 lost, and retransmitted unnecessarily.
00:22:26.300 And, based on the statistics, the belief
00:22:29.500 by the designers of TCP was that
00:22:32.666 waiting for three duplicates was the right threshold.
00:22:36.233 And you could make a TCP version
00:22:38.900 that reduced this to two, or even
00:22:41.300 one duplicate, and it would respond to
00:22:43.666 loss faster, but would have the risk
00:22:45.666 that it's more likely to unnecessarily retransmit
00:22:47.966 something that's just delayed.
00:22:50.500 Or you could make it four,
00:22:52.433 five, six, even more duplicate acknowledgments,
00:22:55.700 which will be less likely to unnecessarily
00:22:57.900 retransmit data. But it’d be slower,
00:23:00.966 because it would be slower in responding
00:23:03.300 to loss, and slower in retransmitting actually lost packets.
00:23:12.766 The other behaviour of TCP, which is
00:23:16.033 worth noting, is head-of-line blocking.
00:23:19.566 Now, in this case we're sending something
00:23:21.866 more realistic. We're sending full size packets,
00:23:24.166 with 1500 bytes of data in each packet.
00:23:26.900 And 1500 is the maximum packet size
00:23:29.333 that you can send in an Ethernet
00:23:31.733 packet, or in a WiFi packet,
00:23:33.833 so this is a typical size that actually gets sent.
00:23:37.366 In this case, the first packet is
00:23:40.366 sent with sequence numbers in the range
00:23:42.966 zero through to 1499.
00:23:46.266 And this arrives at the receiver,
00:23:48.266 and the receiver sends an acknowledgement saying
00:23:50.500 it got it, and the next packet
00:23:52.300 it’s expecting has sequence number 1500.
00:23:55.666 So it sends an acknowledgement for 1500.
00:23:58.666 And if there’s a recv() call outstanding
00:24:01.033 on that socket, that recv() call will
00:24:03.400 return at that point, and return 1500
00:24:05.100 bytes of data. It returns the data
00:24:07.733 as it was received.
00:24:09.600 The next packet arrives at the receiver,
00:24:11.866 containing sequence numbers 1500 through to 2999,
00:24:16.800 and again the recv() call, if there
00:24:19.266 is one, will return, and return that
00:24:21.233 next 1500 bytes.
00:24:23.200 Similarly, when the packet containing the next
00:24:25.833 1500 comes in, the receiver will send
00:24:28.433 the ACK saying “I’m expecting 4500”,
00:24:30.533 and the recv() call will return.
00:24:33.733 The packet containing sequence numbers 4500 through
00:24:37.500 to 5999 is lost.
00:24:40.633 The packet containing 6000 through to 7499 arrives.
00:24:47.466 The acknowledgement goes back indicating that it’s
00:24:50.166 still expecting sequence number 4500, because that
00:24:53.166 packet got lost. And at that point,
00:24:56.233 some data has arrived, some new data
00:24:57.966 has arrived at the receiver.
00:24:59.600 But there's a gap. The packets,
00:25:02.566 the packet, containing data with sequence numbers
00:25:05.266 4500 through to 5999 is still missing.
00:25:08.833 So if the receiver application has called
00:25:12.933 recv() on that socket, it won't return.
00:25:16.366 The data has arrived, it's buffered up
00:25:18.833 in the TCP layer in the operating
00:25:20.800 system, but TCP won't give it back
00:25:22.400 to the application.
00:25:24.933 And the packets can keep being sent,
00:25:27.200 and the receiver keeps sending the duplicate
00:25:29.700 acknowledgments, and eventually it’s sent the triple
00:25:32.266 duplicate acknowledgement, and the TCP sender notices
00:25:35.700 and retransmits the packet with sequence numbers
00:25:38.366 4500 through to 5999.
00:25:41.833 And eventually those arrive at the receiver.
00:25:45.900 At that point, the receiver has a
00:25:48.966 contiguous block of data available, with no
00:25:51.133 gaps in it, and it returns all
00:25:54.100 of the data from sequence number 4500
00:25:57.000 up to sequence number 12,000,
00:26:00.533 up to the application in one go.
00:26:03.333 And if the application has given a
00:26:05.600 big enough buffer, at that point the
00:26:07.366 recv() call will return 7500 bytes of
00:26:09.766 data. It’ll return all of that received
00:26:12.666 data in one big burst.
00:26:18.033 And then, as the data
00:26:20.700 gets retransmitted, as the data arrives,
00:26:25.233 the recv() call will unblock and data
00:26:27.066 will start flowing.
00:26:29.133 The point is the TCP receiver waits
00:26:31.700 for any missing data to be delivered.
00:26:34.366 If anything's missing, the triple duplicate ACK
00:26:37.900 happens, it eventually gets retransmitted, and the
00:26:40.933 receiver won't return anything to the application
00:26:43.533 until that retransmission has happened.
00:26:48.200 It’s called head of line blocking.
00:26:50.066 The data stops being delivered, until it
00:26:52.433 can be delivered in sequence to the
00:26:54.466 application. It’s all just buffered up in
00:26:56.633 the operating system, in the TCP code.
00:26:58.933 TCP always gives the data to the
00:27:01.100 application in a contiguous ordered sequence,
00:27:03.000 in the order it was sent.
00:27:04.933 And this is another reason why the
00:27:06.700 recv() calls don't always preserve the message boundaries.
00:27:09.600 Because it depends how much data was
00:27:11.700 queued up because of packet losses,
00:27:13.466 and so on, so that it can
00:27:15.266 always be delivered in order.
00:27:19.266 The head of line blocking increases the
00:27:21.900 total download time. We see on the
00:27:24.500 left, the case where one packet was
00:27:27.133 lost, and had to be re-transmitted.
00:27:29.500 And we see on the right,
00:27:31.033 the case where all the packets were
00:27:32.866 received on time. And we see an
00:27:34.666 increase in the download time because of
00:27:36.466 the packet loss.
00:27:40.733 It blocks the receiving, it delays things
00:27:43.700 a little bit, waiting for the retransmission.
00:27:46.533 And it increases the overall download time
00:27:50.500 a little bit.
00:27:52.366 It disrupts the behaviour of when the
00:27:54.966 packets are received, during the download quite
00:27:57.333 significantly. We see 1500, 1500, 1500,
00:28:02.400 big gap, 7500,
00:28:04.666 1500, 1500,
00:28:07.666 in the case where the packets were
00:28:09.300 lost. Or, in the case where they
00:28:10.733 were all received, the data is coming
00:28:12.533 in quite smoothly. It's regularly spaced.
00:28:14.966 So it affects the timing, it affects
00:28:17.133 when the data is delivered to the
00:28:18.733 application, and it has a smaller effect
00:28:20.300 on the overall download times.
00:28:28.633 And if you're building real time applications,
00:28:32.000 this is a significant problem. We see
00:28:34.833 the case on the right, if everything
00:28:36.866 is delivered on time, then the data
00:28:39.566 is released to the application very quickly
00:28:41.800 and very predictably.
00:28:43.566 And you don't need
00:28:47.333 much buffering delay at the receiver.
00:28:49.600 Things can be just delivered, things are
00:28:51.600 just delivered to the application, repeatedly on
00:28:53.600 a regular schedule.
00:28:55.033 But the minute something gets lost,
00:28:57.233 it has to wait for the retransmission.
00:28:59.333 In this case it waits for one
00:29:00.966 round trip time, because the ACK has
00:29:02.866 to get back, and then the data has to be retransmitted.
00:29:05.200 Plus, it has to wait for four
00:29:07.100 times the gap between packets, to allow
00:29:09.500 for the four acknowledgments: the three duplicates
00:29:12.066 and the original ACK, so you
00:29:14.366 get one round trip time plus four
00:29:16.500 times the packet spacing.
00:29:18.066 So if you're using TCP to send,
00:29:20.266 for example, speech data, where it's sending
00:29:22.400 packets regularly every 20 milliseconds, you need
00:29:25.133 to buffer 80 milliseconds plus the round
00:29:27.666 trip time, to allow for these re-transmissions,
00:29:30.766 if you're using it for a real time application.
00:29:33.766 Because, it waits for the retransmissions, and because
00:29:38.433 of the head of line blocking.
00:29:41.133 And when you're using applications like Netflix
00:29:44.933 or the iPlayer, when you press play on the video
00:29:47.433 there’s a little pause where it says “buffering”.
00:29:49.700 This is what it's doing. It’s buffering
00:29:51.766 up enough data that it can wait
00:29:54.933 for the retransmissions to happen,
00:29:57.666 buffering up enough data in the TCP
00:29:59.866 connection that it can keep playing out
00:30:01.633 the video frames, in order, while still
00:30:04.633 allowing time for a retransmission to happen.
00:30:07.100 So it's buffering up the data waiting,
00:30:09.533 making sure there's enough data buffered up,
00:30:12.766 because of this head of line blocking
00:30:14.366 issue in TCP.
00:30:20.300 So that concludes the discussion of TCP.
00:30:23.700 It gives you an ordered, reliable, byte stream.
00:30:28.233 As a service model it's easy to
00:30:30.433 understand. It’s like reading from a file;
00:30:33.133 you read from the connection and the
00:30:35.733 bytes arrive reliably and in the order they were sent.
00:30:39.733 The timing, though, is unpredictable. How much
00:30:43.566 you get from the connection each time you read from it,
00:30:46.433 and whether the data arrives regularly,
00:30:48.800 or whether it arrives in big bursts
00:30:50.700 with large gaps between them, depends on
00:30:53.100 how much data is lost, and depends
00:30:55.233 on whether the TCP has to retransmit missing data.
00:30:59.066 And if you're just using this to
00:31:00.633 download files that doesn't matter. It means
00:31:03.700 that the progress bar is perhaps inaccurate,
00:31:05.866 but otherwise it doesn't make much difference.
00:31:08.466 But, if you're using it for real
00:31:10.066 time applications, like video streaming, like telephony,
00:31:14.066 this head of line blocking can quite
00:31:15.866 significantly affect the play out.
00:31:18.966 And a lot of that is the
00:31:20.500 reason why applications use, why real time
00:31:23.366 applications use, UDP. And for those that
00:31:26.233 don't use UDP,
00:31:27.700 applications like Netflix that use adaptive streaming
00:31:32.633 over HTTP, which we'll talk about in
00:31:35.166 lecture seven, that's why there’s this buffering
00:31:37.466 delay before they start playing.
00:31:40.966 And, of course, the lack of framing
00:31:42.700 complicates the application design, you have to
00:31:44.900 parse the data to make sure you've got all the data;
00:31:47.166 there's no message boundaries in there,
00:31:50.033 so you have to parse the data.
00:31:51.966 It doesn't tell you, the connection doesn't
00:31:53.700 tell you, when you've received all the data.
00:31:57.433 So that's it for TCP.
00:32:00.433 It delivers data reliably. It uses sequence
00:32:03.533 numbers and acknowledgments to indicate when the
00:32:06.133 data arrived.
00:32:07.633 It uses timeouts to indicate that a
00:32:09.733 connection has failed. And it uses this
00:32:12.433 idea of triple duplicate ACKs to indicate
00:32:14.866 that a packet has been lost,
00:32:16.300 and trigger a retransmission of any lost data.
00:32:19.833 What I’ll talk about in the next
00:32:21.333 part is QUIC and how it differs
00:32:23.266 from the way TCP handles reliability.
Part 4: Reliable Data Transfer with QUIC
The final part of the lecture discusses reliable data transfer using QUIC. It outlines the QUIC service model, and how it differs from that of TCP, and shows how QUIC achieves reliable data transfer. It discusses how QUIC provides multiple streams within a single connection, and considers how this affects head-of-line blocking and latency. Approaches to making best use of multiple streams are discussed.
00:00:00.100 In this final part I’d like to
00:00:02.533 talk about how reliable data transfer works
00:00:04.633 with QUIC, and how it's different to
00:00:07.100 reliable data transfer with TCP.
00:00:09.533 I’ll talk a little bit about the
00:00:11.733 QUIC service model, and how it handles
00:00:13.966 packet numbers and retransmission. I’ll talk about
00:00:16.166 the multi-streaming features of QUIC. And I’ll
00:00:19.133 talk about how it avoids head-of-line blocking.
00:00:23.333 The service model for TCP, as we
00:00:26.533 saw previously, is that it delivers a
00:00:29.100 single reliable, ordered, byte stream of data.
00:00:32.700 Applications write a stream of bytes in,
00:00:34.933 and that stream of bytes is delivered
00:00:37.033 to the receiver, eventually.
00:00:39.166 QUIC, by contrast, delivers several ordered reliable
00:00:42.200 byte streams within a single connection.
00:00:45.166 Applications can separate the data they're sending
00:00:47.933 into different streams, and each stream is
00:00:49.966 delivered reliably and in order.
00:00:52.066 QUIC doesn't preserve the ordering between the
00:00:54.666 streams within a connection, so if you
00:00:57.266 send in one stream, and then send
00:00:59.866 in a second stream, then the data
00:01:02.500 you sent second, in that second stream,
00:01:04.700 may arrive first, but it preserves the
00:01:06.666 ordering within a stream.
00:01:09.300 And you can treat the streams as
00:01:11.833 if they were multiple TCP connections
00:01:15.366 running in parallel, so it gives you the
00:01:17.100 same service model with several streams of
00:01:19.433 data, or you could perhaps treat each stream as a
00:01:22.900 sequence of messages to be sent,
00:01:25.600 with the streams indicating message boundaries.
00:01:30.366 QUIC delivers data in packets.
00:01:33.466 Each QUIC packet has a packet sequence
00:01:36.366 number, a packet number,
00:01:38.266 and the packet numbers
00:01:41.333 are split into two packet number spaces.
00:01:44.666 The packets sent during the initial QUIC
00:01:48.033 handshake start with packet sequence number zero,
00:01:50.900 and that packet sequence number increases by
00:01:53.033 one for each packet sent during the handshake.
00:01:56.066 Then, when the handshake’s complete, and it
00:01:58.800 switches to sending data, it resets the
00:02:01.666 packet sequence number to zero and starts again.
00:02:05.166 Within each of these packet number spaces,
00:02:07.666 the handshake space, and the data space,
00:02:10.833 the packet number sequence starts at zero,
00:02:13.400 and goes up by one for every packet sent.
00:02:16.566 That is, the sequence numbers in QUIC,
00:02:18.733 the packet numbers in QUIC, count the
00:02:20.966 number of packets of data being sent.
00:02:23.233 That's different to TCP. In TCP,
00:02:25.400 the sequence number in the header counts
00:02:27.966 the offset within the byte stream,
00:02:30.400 it counts how many bytes of data
00:02:32.166 have been sent. Whereas in QUIC,
00:02:34.300 the packet numbers count the number of packets.
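The contrast between TCP's byte-counting sequence numbers and QUIC's packet-counting numbers, with separate handshake and data number spaces, can be sketched as follows. This is an illustrative model only, not real protocol code; the class and field names are invented for the example.

```python
# Illustrative sketch: TCP sequence numbers are byte offsets within the
# stream, while QUIC packet numbers count packets, with separate number
# spaces for the handshake and for application data.

class TcpSender:
    def __init__(self):
        self.next_seq = 0           # byte offset within the stream

    def send(self, data: bytes) -> int:
        seq = self.next_seq         # header carries the byte offset
        self.next_seq += len(data)  # advances by the number of bytes sent
        return seq

class QuicSender:
    def __init__(self):
        # separate spaces: each starts at 0 and counts packets, not bytes
        self.next_pn = {"handshake": 0, "application": 0}

    def send(self, space: str) -> int:
        pn = self.next_pn[space]
        self.next_pn[space] += 1    # advances by one per packet, any size
        return pn

tcp = TcpSender()
assert tcp.send(b"x" * 1000) == 0
assert tcp.send(b"x" * 500) == 1000   # next segment starts at byte 1000

quic = QuicSender()
assert quic.send("handshake") == 0
assert quic.send("handshake") == 1
assert quic.send("application") == 0  # data space restarts at zero
```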
00:02:38.033 Inside a QUIC packet is a sequence
00:02:40.833 of frames. Some of those frames may
00:02:43.100 be stream frames, and stream frames carry data.
00:02:46.600 Each stream frame has a stream ID,
00:02:50.066 so it knows which of the many sub-streams
00:02:52.200 it’s carrying data for, and it
00:02:53.766 also has the amount of data being carried,
00:02:57.033 and the offset of that data from the start of the stream.
00:02:59.866 So, essentially the stream contains sequence numbers
00:03:03.833 which play the same role as TCP
00:03:05.400 sequence numbers, in that they count bytes
00:03:07.366 of data being sent in that stream.
00:03:09.500 And the packets have sequence numbers that
00:03:11.766 count the number of packets being sent.
00:03:14.533 And we can see this in the
00:03:16.533 diagram on the right, where we see
00:03:18.366 the packet numbers going up, zero,
00:03:20.466 one, two, three, four. And the stream
00:03:22.433 numbers, packet zero carries data from the
00:03:24.566 first stream, bytes zero through 1000.
00:03:27.733 Packet one carries data from the first
00:03:29.733 stream, bytes 1001 to 2000. And packet
00:03:32.700 two carries bytes 2001 to 2500
00:03:36.833 from the first stream, and zero to
00:03:38.866 500 from the second stream, and so on.
00:03:41.566 And we see that we can send
00:03:44.333 data on multiple streams in a single packet.
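The packet-and-frame layering in the diagram can be sketched as below. The structures are illustrative (not the QUIC wire format): a packet carries a packet number and a list of stream frames, and each frame records which stream it belongs to and the byte offset of its data within that stream.

```python
# Sketch of the example on the slide: packets 0-2, where packet 2 mixes
# bytes 2001..2500 of stream 1 with bytes 0..500 of stream 2.
# Field names are descriptive, not the real wire encoding.

from dataclasses import dataclass

@dataclass
class StreamFrame:
    stream_id: int
    offset: int      # byte offset within that stream
    data: bytes

@dataclass
class Packet:
    packet_number: int   # counts packets, independent of stream offsets
    frames: list

packets = [
    Packet(0, [StreamFrame(1, 0,    b"a" * 1001)]),
    Packet(1, [StreamFrame(1, 1001, b"a" * 1000)]),
    Packet(2, [StreamFrame(1, 2001, b"a" * 500),
               StreamFrame(2, 0,    b"b" * 501)]),
]
```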
00:03:50.400 QUIC doesn't preserve message boundaries within the
00:03:53.200 streams. In the same way that,
00:03:56.000 within a TCP stream, if you write
00:03:59.300 data to the stream and the amount you write is too big
00:04:02.300 to fit into a packet, it may
00:04:04.666 be arbitrarily split between packets.
00:04:06.900 Or if the data you send in a TCP Stream is too small,
00:04:09.566 and doesn't fill a whole packet,
00:04:11.500 it may be delayed waiting for more
00:04:13.433 data, to be able to fill up
00:04:15.033 the packet before it’s sent.
00:04:16.666 The same thing happens with QUIC.
00:04:18.633 If the amount of data you write to a stream is too big to
00:04:21.500 fit into a QUIC packet, then it
00:04:23.366 will be split across multiple packets.
00:04:26.166 Similarly, if the amount of data you
00:04:27.866 write to a stream is very small,
00:04:29.633 QUIC may buffer it up, delay it,
00:04:31.766 wait for more data, so it can
00:04:33.366 send it and fill a complete packet.
00:04:36.666 In addition, QUIC can take data from
00:04:39.466 more than one stream, and send it
00:04:41.300 in a single packet, if there’s space to do so.
00:04:44.566 And if there's more than one stream
00:04:46.833 with data that's available to send,
00:04:48.766 then the QUIC sender can make an
00:04:51.033 arbitrary decision, how it prioritises that data,
00:04:53.300 and how it delivers frames from each stream.
00:04:56.033 And usually it will split those,
00:04:59.200 the data from the streams, so each
00:05:01.700 packet has data from, half the data from, one stream,
00:05:05.200 and half from another stream. But it
00:05:07.400 may alternate them if it wants,
00:05:08.833 sending one packet with data from stream
00:05:10.966 1, one from stream 2, one from
00:05:12.600 stream 1, one from stream 2, and so on.
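The two packing policies just described, mixing frames from both streams into every packet versus alternating whole packets between streams, can be sketched like this. The functions are purely illustrative and assume each stream's data is already split into packet-sized chunks.

```python
# Two illustrative sender scheduling policies for frames from two streams.

def pack_mixed(chunks_a, chunks_b):
    """Each packet carries one frame from each stream (a loss hits both)."""
    return [list(pair) for pair in zip(chunks_a, chunks_b)]

def pack_alternating(chunks_a, chunks_b):
    """Alternate whole packets between streams (a loss hits one stream)."""
    packets = []
    for a, b in zip(chunks_a, chunks_b):
        packets.append([a])
        packets.append([b])
    return packets

a = [("stream1", i) for i in range(2)]
b = [("stream2", i) for i in range(2)]
assert pack_mixed(a, b) == [[("stream1", 0), ("stream2", 0)],
                            [("stream1", 1), ("stream2", 1)]]
assert pack_alternating(a, b) == [[("stream1", 0)], [("stream2", 0)],
                                  [("stream1", 1)], [("stream2", 1)]]
```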
00:05:17.966 On the receiving side, the receiver sends,
00:05:20.566 the QUIC receiver sends acknowledgments for the
00:05:22.766 packets it receives.
00:05:24.166 So, unlike TCP which acknowledges the next
00:05:27.000 expected sequence number, a QUIC receiver just
00:05:29.566 sends an acknowledgement to say “I got this packet”.
00:05:33.500 So when packet zero arrives, it sends
00:05:35.866 an acknowledgement saying “I got packet zero”.
00:05:38.066 And when packet one arrives, it sends
00:05:39.900 an acknowledgement saying “I got packet one”, and so on.
00:05:43.566 The sender needs to remember what data
00:05:46.200 it puts in each packet, so it
00:05:47.800 knows when it gets an acknowledgement for packet two that,
00:05:51.033 in this case, it contained bytes 2001
00:05:54.800 to 2500 from stream one, and bytes
00:05:57.700 zero through 500 from stream two.
00:06:00.233 That information isn't in the acknowledgments.
00:06:02.766 What's in the acknowledgments it's just the
00:06:04.500 packet numbers, so the sender needs to
00:06:06.466 keep track of how it puts the
00:06:08.466 data from the streams into the packets.
00:06:12.366 The acknowledgments in QUIC are also a
00:06:15.133 bit more sophisticated than they are in
00:06:17.900 TCP, in that it doesn't just have
00:06:20.666 an acknowledgement number field in the header.
00:06:23.533 Rather, it sends the acknowledgments as frames
00:06:26.566 in the packets coming back.
00:06:28.833 And this gives a lot more flexibility, because
00:06:32.533 it can have a fairly sophisticated frame
00:06:35.700 format, and it can change the frame
00:06:37.400 format to include different, to support different
00:06:41.266 ways of sending a header, if it needs to.
00:06:45.233 In the initial version of QUIC,
00:06:47.133 what's in the frame format, in the
00:06:49.666 ACK frames coming back from the receiver to the sender,
00:06:53.266 is a field indicating the largest acknowledgement,
00:06:56.633 which is essentially the same as the
00:06:59.433 TCP acknowledgment – it tells you what's
00:07:02.866 the highest sequence number received.
00:07:06.166 There's an ACK delay field, that tells
00:07:08.933 you how long the receiver waited, after
00:07:11.633 receiving that packet, before sending the acknowledgement.
00:07:15.000 So this is the delay in the
00:07:16.866 receiver. And by measuring the time it
00:07:20.100 takes for the acknowledgment to come back,
00:07:22.100 and removing this ACK delay field,
00:07:24.966 you can estimate the network round trip
00:07:27.366 time excluding the processing delays in the receiver.
00:07:31.466 There’s a list of ACK ranges.
00:07:35.300 And the ACK ranges are a way
00:07:37.100 of the receiver saying “I got a range of packets”.
00:07:40.366 So you can send an acknowledgement that
00:07:42.233 says, I got packets from five through seven
00:07:44.266 in a single go. And you can
00:07:46.800 split this up, with multiple ACK ranges.
00:07:48.833 So you could have an acknowledgement that
00:07:50.766 says “I got packet five; I got packets
00:07:53.466 seven through nine; and I got packets
00:07:55.433 11 through 15” and you can send
00:07:57.533 that all within a single acknowledgement block,
00:07:59.566 in an ACK frame, within the reverse path stream.
00:08:03.433 And this gives it more flexibility,
00:08:05.433 so it doesn't just have to acknowledge
00:08:07.833 the most recently received packet, which gives
00:08:11.200 the sender more information to make retransmissions.
00:08:14.466 This is a bit like the TCP
00:08:16.666 selective acknowledgement extension.
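The ACK frame contents described above can be modelled roughly as follows. The field names are descriptive, not the RFC 9000 wire encoding; the point is that a list of ranges acknowledges non-contiguous sets of packets in one frame, and the ACK delay field lets the sender subtract receiver-side delay from its round-trip time sample.

```python
# Illustrative model of a QUIC ACK frame and an RTT sample using it.

from dataclasses import dataclass

@dataclass
class AckFrame:
    largest_acked: int
    ack_delay_ms: float   # how long the receiver held the ACK
    ranges: list          # list of (low, high) inclusive packet ranges

    def acked_packets(self):
        """Expand the ranges into the full set of acknowledged packets."""
        return sorted(pn for lo, hi in self.ranges
                      for pn in range(lo, hi + 1))

# "I got packet 5; packets 7 through 9; packets 11 through 15",
# all in a single frame:
ack = AckFrame(largest_acked=15, ack_delay_ms=2.0,
               ranges=[(5, 5), (7, 9), (11, 15)])
assert ack.acked_packets() == [5, 7, 8, 9, 11, 12, 13, 14, 15]

# RTT sample: measured time minus the receiver's reported ACK delay
# estimates the network round-trip time without receiver processing delay.
measured_ms = 40.0        # time from sending packet 15 to this ACK arriving
rtt_estimate = measured_ms - ack.ack_delay_ms
assert rtt_estimate == 38.0
```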
00:08:21.766 Like TCP, QUIC will retransmit lost data.
00:08:26.000 The difference is that TCP retransmits packets,
00:08:30.700 exactly as they would be originally sent,
00:08:33.400 so the retransmission looks just the same
00:08:35.466 as the original packet.
00:08:37.633 QUIC never retransmits packets.
00:08:40.500 Each packet in QUIC has a unique packet sequence number,
00:08:45.166 and each packet is only ever transmitted once.
00:08:48.366 What QUIC rather does, is it retransmits
00:08:51.000 the data which was in those packets
00:08:53.233 in a new packet.
00:08:55.533 So in this example, we see that
00:08:57.600 packet, on the slide, we see that
00:08:59.900 packet number two got lost, and it
00:09:01.633 contained the data bytes 2001 to 2500
00:09:06.033 from stream one, and bytes zero through 500 from stream two.
00:09:10.333 And, when it gets the acknowledgments indicating
00:09:12.933 that packet was lost, it resends that data.
00:09:16.233 And in this case it's sending in
00:09:18.733 packet six, it’s resending the first bytes
00:09:21.766 of data from stream, it’s sending the
00:09:25.333 bytes 2001 to 2500 from stream one,
00:09:28.533 and it will eventually, at some point
00:09:30.533 later, retransmit the data from stream two.
00:09:36.700 As we say, each packet has a
00:09:38.466 unique packet sequence number. Since we're not,
00:09:41.700 since each packet is acknowledged as it
00:09:43.666 arrives, and it's not acknowledging the highest,
00:09:46.666 not acknowledging the next sequence number expected
00:09:49.400 in the same way TCP does,
00:09:51.833 you can’t do the triple duplicate ACK
00:09:53.700 in the same way, because you don't
00:09:55.933 get duplicate ACKs. Each ACK acknowledges the
00:09:58.266 next new packet.
00:09:59.666 Rather QUIC declares a packet to be
00:10:02.333 lost when it's got ACKS for three
00:10:05.033 packets with higher packet numbers than the
00:10:07.500 one which it sent.
00:10:09.333 At that point, it can retransmit the
00:10:11.333 data that was in that packet.
00:10:13.366 And that’s QUIC’s equivalent to the triple
00:10:15.633 duplicate ACK; it's three following sequence numbers
00:10:18.600 rather than three duplicate sequence numbers.
00:10:20.766 And also, just like TCP, if there's
00:10:22.666 a timeout, and it stops getting ACKs,
00:10:24.533 then it declares the packets to be lost.
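The loss-detection rule just described, together with retransmitting data rather than packets, can be sketched as below. This is a simplified model (real implementations, per RFC 9002, also use timers and time thresholds): a packet is declared lost once packets at least three numbers higher have been acknowledged, and its frames are resent under a fresh packet number.

```python
# Sketch: declare a packet lost after ACKs for 3 higher-numbered packets,
# then resend its frames in a brand-new packet. The sender must remember
# which frames went into each packet, since ACKs only name packet numbers.

PACKET_THRESHOLD = 3

class Sender:
    def __init__(self):
        self.next_pn = 0
        self.in_flight = {}          # packet number -> frames it carried

    def send(self, frames):
        pn = self.next_pn
        self.next_pn += 1            # packet numbers are never reused
        self.in_flight[pn] = frames
        return pn

    def on_ack(self, acked_pn):
        self.in_flight.pop(acked_pn, None)
        lost = [pn for pn in self.in_flight
                if acked_pn - pn >= PACKET_THRESHOLD]
        for pn in lost:
            frames = self.in_flight.pop(pn)
            self.send(frames)        # same data, new packet number
        return lost

s = Sender()
for i in range(6):                   # send packets 0..5
    s.send([("stream-data", i)])
for pn in (0, 1, 3, 4, 5):           # packet 2 is never acknowledged
    lost = s.on_ack(pn)
# The ACK of packet 5 (three above 2) declares packet 2 lost, and its
# data goes out again in new packet 6.
assert lost == [2]
assert s.in_flight[6] == [("stream-data", 2)]
```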
00:10:31.366 QUIC delivers multiple streams within a single
00:10:35.500 connection. And within each stream, the data
00:10:39.433 is delivered reliably, and in the order it was sent.
00:10:43.466 If a packet’s lost, then that clearly
00:10:46.100 causes data for the stream, streams,
00:10:48.533 where the data was included in that packet to be lost.
00:10:52.600 Whether a packet loss affects one,
00:10:55.600 or more, streams really depends on how
00:10:57.400 the sender chooses to put the data
00:10:59.266 from different streams into the packets.
00:11:02.300 It’s possible that a QUIC packet can
00:11:04.700 contain data from several streams. We saw
00:11:08.333 in the examples, how the packets contain
00:11:10.700 data from both stream one and stream two simultaneously.
00:11:13.566 In that case, if a packet is
00:11:15.833 lost, it will affect both of the
00:11:18.500 streams, all of the streams if there’s
00:11:20.333 data from more than two streams in the packet.
00:11:23.333 Equally, a QUIC sender can choose to
00:11:27.133 alternate, and send one packet with data
00:11:29.933 from stream one, and then another packet
00:11:32.066 with data from stream two, and only
00:11:34.266 ever put data from a single stream in each packet.
00:11:37.400 The specification puts no requirements on how
00:11:40.433 the sender does this, and different senders
00:11:42.766 can choose to do it differently depending
00:11:47.233 on whether they're trying to make progress on
00:11:50.000 each stream simultaneously, or whether they
00:11:54.000 want to alternate, and make sure
00:11:57.200 that packet loss only ever affects a single stream.
00:12:01.266 Depending on how they do this,
00:12:03.300 the streams can suffer from head of
00:12:05.366 line blocking independently.
00:12:07.500 If data is lost on a particular
00:12:09.800 stream, then that stream can't deliver later
00:12:14.866 data to the application, until that
00:12:18.033 lost data has been retransmitted. But the
00:12:21.500 other streams, if they've got all the
00:12:23.533 data, can keep delivering to the application.
00:12:26.100 So streams suffer from head of line
00:12:28.300 blocking individually, but there's no head of
00:12:30.133 line blocking between streams.
00:12:32.600 This means that the data is delivered
00:12:35.466 reliably, and in order, on a stream,
00:12:37.866 but order’s not preserved between streams.
00:12:42.266 It’s quite possible that one stream can
00:12:45.033 be blocked, waiting for a retransmission of
00:12:47.000 some of the data in the packets,
00:12:48.800 while the other streams are continuing to
00:12:50.900 deliver data and haven't seen any loss
00:12:52.833 on that stream.
00:12:54.700 Each stream is sent and received independently.
00:12:57.866 And this means if you're careful with how you split data
00:13:00.800 across streams, and if the implementation is
00:13:04.300 careful with how it puts data from
00:13:05.900 streams into different packets, it can limit
00:13:08.233 the duration of the head of line
00:13:09.600 blocking, and make the streams independent in
00:13:11.766 terms of head of line blocking and data delivery.
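Per-stream in-order delivery, with head-of-line blocking confined to each stream, can be sketched with a simple reassembly buffer per stream. The class is illustrative, not taken from any real QUIC implementation.

```python
# Sketch: each stream reassembles independently, so a gap (lost data)
# blocks only that stream while other streams keep delivering.

class StreamReassembler:
    def __init__(self):
        self.next_offset = 0
        self.pending = {}            # offset -> buffered out-of-order data

    def receive(self, offset, data):
        """Buffer this segment; return whatever is now deliverable in order."""
        self.pending[offset] = data
        delivered = b""
        while self.next_offset in self.pending:
            chunk = self.pending.pop(self.next_offset)
            delivered += chunk
            self.next_offset += len(chunk)
        return delivered

s1, s2 = StreamReassembler(), StreamReassembler()
assert s1.receive(0, b"abc") == b"abc"
assert s1.receive(6, b"ghi") == b""        # gap: stream 1 is blocked...
assert s2.receive(0, b"xyz") == b"xyz"     # ...but stream 2 still delivers
assert s1.receive(3, b"def") == b"defghi"  # retransmission unblocks stream 1
```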
00:13:18.566 QUIC delivers, as we've seen, several ordered,
00:13:21.900 reliable, byte streams of data in a single connection.
00:13:27.333 How you treat these different byte streams,
00:13:30.000 is, I think, still a matter of interpretation.
00:13:33.600 It's possible to treat a QUIC connection
00:13:36.266 as though it was several parallel TCP connections.
00:13:40.333 So, rather than opening multiple TCP connections
00:13:42.700 to a server, you open one QUIC
00:13:45.100 connection, and you send and receive several
00:13:47.500 streams of data within that.
00:13:49.300 And then you treat each stream of
00:13:51.266 data as-if it were a TCP stream,
00:13:54.466 and you parse and process the data
00:13:56.800 as if it were a TCP stream.
00:13:58.500 And you possibly send multiple requests,
00:14:00.366 and get multiple responses, over each stream.
00:14:04.066 Or, you can treat the streams more as a framing device.
00:14:07.766 You can say that each stream,
00:14:10.300 you can choose to interpret each stream,
00:14:12.433 as sending a single object. And then,
00:14:15.466 when you send data from the stream,
00:14:17.000 on that stream, once you finish sending
00:14:18.833 that object, you close the stream and
00:14:20.933 move on to use the next one.
00:14:23.266 And, on the receiving side, you just
00:14:25.366 read all the data until you see
00:14:27.500 the end of stream marker, and then
00:14:30.200 you process it knowing you’ve got a complete object.
00:14:34.066 And I think that the best practices,
00:14:36.666 the way of thinking about a QUIC connection,
00:14:39.966 and the streams within a connection, is still evolving.
00:14:42.500 And it's not clear which of these
00:14:44.133 two approaches is necessarily the right
00:14:46.433 way to do it. And I think
00:14:48.033 it probably depends on the application what
00:14:49.766 makes the most sense.
00:14:53.966 So, to conclude for this lecture.
00:14:57.366 We spoke a little bit about best
00:14:59.566 effort packet delivery on the Internet,
00:15:01.300 and why the IP layer delivers data
00:15:04.933 unreliably, and why it's appropriate to have
00:15:09.200 a best effort network.
00:15:11.200 Then we spoke a bit about the different transports.
00:15:14.266 The UDP transport that provides an unreliable,
00:15:17.500 but timely, service on which you can
00:15:20.433 build more sophisticated user space application protocols.
00:15:25.166 We spoke about TCP, that provides a
00:15:27.966 reliable ordered stream delivery service. And we
00:15:30.800 spoke about QUIC, that provides a reliable
00:15:33.600 ordered delivery service with multiple streams of
00:15:36.400 data. And it’s clear there’s different services,
00:15:38.800 different transport protocols, for different needs.
00:15:41.733 What I want to move on to
00:15:43.566 next time, is starting to talk about
00:15:45.300 congestion control and how all these different
00:15:49.166 transport protocols manage the rate at which they send data.
Lecture 5 discussed reliable data transfer over the Internet. It started with a discussion of best effort packet delivery, and an explanation of why it makes sense for the Internet to be designed as an unreliable network. Then, it moved on to discuss UDP and how to build applications and new transport protocols that work on an unreliable network. There's a trade-off between timeliness and reliability that's important here, and the lecture gave some examples of this to illustrate why many real-time applications use UDP.
The bulk of the lecture discussed TCP. It spoke about how TCP sends acknowledgement for packets, how timeouts and triple-duplicate ACKs indicate loss, and why a triple-duplicate ACK is chosen as the loss signal. It also discussed head-of-line blocking, and how the in-order, single stream, reliable service model of TCP leads to head-of-line blocking and potential latency.
Finally, it discussed the differences between QUIC and TCP. QUIC acknowledges packets rather than bytes within a stream, uses ACK frames rather than an ACK header, and delivers multiple streams of data, allowing it to avoid head-of-line blocking in many cases.
The focus of the discussion will be on how TCP ensures reliability, to make sure the mechanism is understood, and on the differences between the TCP and QUIC service models and how QUIC can improve latency. We'll also discuss how UDP can form a substrate on which new transports, suited to different needs, can easily be built and deployed.