csperkins.org

Networked Systems H (2021-2022)

Lecture 2: Connection Establishment in a Fragmented Network

Lecture 2 discusses connection establishment. It begins by reviewing the operation of TCP and showing how TCP connections are established, and what factors influence the performance of connection establishment. It considers the impact of TLS and IPv6 on connection establishment, and discusses the need for connection racing. And it reviews the idea of peer-to-peer connection, and the difficulties network address translation causes for peer-to-peer connection establishment. The lecture concludes with a brief explanation of how NAT binding discovery, and the ICE algorithm for peer-to-peer connection establishment, work.

Part 1: Connection Establishment in a Fragmented Network

The 1st part of this lecture reviews the operation of TCP. It outlines the TCP service model, segment format, and programming model. Then, it discusses how TCP connections are established and considers the impact of network latency on TCP connection establishment performance. It concludes with a review of how this connection establishment latency affects protocol design.

Slides for part 1

 

00:00:00.366 In this lecture, I’ll talk about the

00:00:02.533 problem of connection establishment in a fragmented

00:00:05.066 network, and discuss some issues that affect

00:00:07.433 the performance of TCP connection establishment and

00:00:10.166 data transfer.

 

00:00:12.400 There are five parts to this lecture.

 

00:00:15.133 In this first part, I’ll review the

00:00:17.333 TCP transport protocol and its programming model,

00:00:20.100 and talk about client-server connection establishment.

 

00:00:23.433 In the next part, I’ll discuss the

00:00:25.366 implications of Transport Layer Security, TLS,

00:00:28.466 and the use of IPv6 on connection establishment,

00:00:31.600 and show how these changes affect performance.

 

00:00:35.100 Following that, in part three,

00:00:37.500 I’ll talk about peer-to-peer connections and the impact of

00:00:40.200 network address translation.

 

00:00:42.266 Then, in the last two parts,

00:00:44.366 I’ll talk about some of the problems

00:00:46.566 caused by NATs, network address translation devices,

00:00:49.300 and outline how NAT traversal works.

 

00:00:53.133 To begin, I’ll briefly review the Transmission

00:00:57.300 Control Protocol, TCP. I’ll talk about the

00:00:59.166 purpose of TCP, then review the TCP

00:01:01.533 segment format, and the service model TCP

00:01:04.066 offers to applications.

 

00:01:06.433 Then, I’ll discuss how programs that use TCP

00:01:08.966 connections are written using the Berkeley sockets API.

 

00:01:13.166 TCP is currently the most widely used

00:01:15.466 transport protocol in the Internet.

 

00:01:18.133 A TCP connection provides a reliable,

00:01:20.900 ordered, byte stream delivery service

00:01:23.466 that runs on the best-effort IP network.

 

00:01:26.400 Once a TCP connection is established,

00:01:28.900 an application can write a sequence of bytes

00:01:31.166 into the socket, representing the connection,

00:01:33.600 and TCP will deliver those bytes to the receiver,

00:01:36.433 reliably, and in the correct order.

 

00:01:39.800 If any of the IP packets containing TCP

00:01:42.333 segments are lost, the TCP stack

00:01:44.933 in the operating system will notice this,

00:01:46.700 and automatically retransmit the missing segments.

 

00:01:49.766 Similarly, if any of the IP packets

00:01:52.633 are delayed and arrive out of order,

00:01:54.833 the TCP stack will put the data

00:01:56.700 back into the correct order before delivering

00:01:58.633 it to the application.

 

00:02:00.866 Finally, TCP will adapt the speed

00:02:03.166 at which it sends the data

00:02:04.766 to match the available network capacity.

 

00:02:07.700 If the network is busy,

00:02:09.566 TCP will slow down the rate at which it sends data,

00:02:12.300 to fairly share the capacity between transmissions.

 

00:02:16.733 Similarly, if the network becomes idle,

00:02:19.400 TCP connections will speed up to use the spare capacity.

 

00:02:23.233 This process is known as congestion control,

00:02:25.966 and TCP implements sophisticated

00:02:28.200 congestion control algorithms.

 

00:02:30.600 We’ll talk more about TCP congestion control in Lecture 6.

 

00:02:35.566 Applications using TCP are unaware of retransmissions,

00:02:39.266 reordering, and congestion control.

 

00:02:41.600 They just see a socket,

00:02:43.300 into which they write a stream of bytes.

 

00:02:45.833 Those bytes are then delivered reliably to the receiver.

 

00:02:49.233 Internally, those bytes are split into TCP segments.

00:02:52.966 Each segment has a header added to it,

00:02:56.000 to identify the data.

 

00:02:57.633 The segment is placed inside the data part of an IP packet.

 

00:03:01.700 That IP packet is, in turn, put inside the

00:03:04.133 data part of a link layer frame,

00:03:06.066 and sent across the network.

 

00:03:08.466 The IP layer just sees a sequence

00:03:10.833 of TCP segments that it must deliver,

00:03:12.833 and is unaware of their contents.

 

00:03:15.066 Equally, TCP gives data segments to the

00:03:18.800 IP layer to deliver, and is unaware

00:03:20.733 whether the underlying network is Ethernet,

00:03:22.566 WiFi, optical fibre, or something else.

 

00:03:27.766 The diagram on the slide shows the

00:03:30.066 format of a TCP segment header,

00:03:32.133 inside an IPv4 packet.

 

00:03:35.066 Data is sent over the link in the order shown,

00:03:38.066 left-to-right, top-to-bottom, in the

00:03:40.033 payload part of a link layer frame.

 

00:03:43.133 The IP header is sent first,

00:03:45.333 then the TCP segment header,

00:03:47.033 then the TCP payload data.

 

00:03:49.666 Looking at the TCP segment header,

00:03:52.333 highlighted in green, we see that it

00:03:54.800 comprises a number of fields.

 

00:03:58.000 A TCP segment starts with the source

00:04:00.166 and destination port numbers.

00:04:02.100 The source port number identifies the socket that sent the

00:04:05.500 segment, while the destination port number identifies

00:04:08.333 the socket to which it should be delivered.

 

00:04:11.166 When establishing a TCP connection, a TCP

00:04:14.166 server binds to a well-known port that

00:04:16.366 identifies the type of service it offers.

 

00:04:19.000 For example, web servers bind to port 80.

 

00:04:22.700 This gives a known destination port

00:04:24.733 to which clients can connect.

 

00:04:27.300 Clients specify the destination port to which they connect,

00:04:30.766 but usually leave their source port unspecified.

00:04:34.200 The operating system then chooses

00:04:36.333 an unused source port for that connection.

 

00:04:39.333 All TCP segments sent from the client to the server,

00:04:42.666 as part of a single connection,

00:04:44.400 have the same source and destination ports,

00:04:46.833 and the responses come back with those ports swapped.

 

00:04:50.166 Each new connection gets a new source port.

 

00:04:54.300 Following the port numbers in the TCP segment header,

00:04:57.033 come the sequence number and the acknowledgement number.

 

00:04:59.966 At the start of a TCP connection,

00:05:02.200 the sequence number is set to a random initial value.

 

00:05:05.533 As data is sent, the TCP sequence

00:05:08.033 number inserted by the sender increases,

00:05:10.233 counting the number of bytes of data being sent.

 

00:05:14.100 The acknowledgement number indicates the next byte

00:05:16.500 of data that’s expected by the receiver.

 

00:05:19.366 For example, if a TCP segment is

00:05:21.900 received with sequence number 4,000,

00:05:24.733 and that segment contains 100 bytes of data,

00:05:27.966 then the acknowledgement number will be 4,100.

 

00:05:31.266 This indicates that the next TCP segment

00:05:33.600 expected is that with sequence number 4,100.

 

00:05:39.000 If the acknowledgement number that comes back

00:05:41.200 is different from that expected,

00:05:42.900 it’s a sign that some of the packets have been lost.

 

00:05:45.766 TCP will then retransmit the

00:05:47.700 segments that were in the lost packets.

 

00:05:50.233 The data offset field indicates where

00:05:52.600 the payload data starts in the segment.

 

00:05:54.500 That is, it indicates the size of any TCP options

00:05:57.133 included in the packet.

 

00:05:59.766 The reserved bits are not used.

 

00:06:03.266 There are then six single-bit flags.

00:06:06.133 The URG bit indicates whether the urgent pointer is valid.

 

00:06:10.166 The ACK bit indicates whether the

00:06:11.966 acknowledgement number is valid.

 

00:06:13.700 The PSH bit indicates that this is the last

00:06:16.066 segment in a message, and should be pushed

00:06:18.300 up to the application as soon as possible.

 

00:06:20.966 The SYN, synchronise, bit is set on

00:06:23.966 the first packet sent on a connection.

 

00:06:25.900 The FIN bit indicates that the connection

00:06:28.500 should be cleanly closed.

 

00:06:30.166 And the RST bit indicates that the connection is being reset,

00:06:33.400 that is, aborted without a clean shutdown.

 

00:06:35.933 The receive window size allows the receiver

00:06:38.400 to indicate how much buffer space it

00:06:40.800 has available to receive new data.

 

00:06:42.633 This allows the receiver to tell the sender to slow down,

00:06:45.466 if it’s sending faster than the receiver

00:06:47.200 can process the data.

00:06:49.000 The checksum is used to detect corrupted

00:06:52.333 packets, that can then be retransmitted.

 

00:06:55.000 Finally, the Urgent Pointer allows TCP senders

00:06:58.100 to indicate that some data is to be processed

00:07:00.200 urgently by the receiver.

 

00:07:02.366 Unfortunately, experience has shown that the urgent

00:07:05.300 data mechanism in TCP is not usable in practice,

00:07:08.166 due to a combination of

00:07:10.333 an ambiguous specification and inconsistent implementation.

 

00:07:13.933 The fixed TCP segment header is

00:07:16.166 followed by TCP option headers, that allow TCP

00:07:19.100 to add new features and extensions,

00:07:21.266 and then the payload data.
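
As an illustrative sketch only (not from the lecture slides), the fixed part of the TCP segment header described above can be pictured as a C struct; a real implementation must also handle network byte order and the packed layout of the data offset, reserved, and flag bits:

    #include <stdint.h>

    /* Illustrative sketch of the fixed 20-byte TCP segment header.
       All multi-byte fields are carried in network byte order. */
    struct tcp_header {
        uint16_t src_port;       /* socket that sent the segment            */
        uint16_t dst_port;       /* socket the segment is delivered to      */
        uint32_t seq_num;        /* sequence number of first payload byte   */
        uint32_t ack_num;        /* next byte expected (valid if ACK set)   */
        uint16_t offset_flags;   /* 4-bit data offset, reserved bits, and
                                    the URG/ACK/PSH/RST/SYN/FIN flag bits   */
        uint16_t window;         /* receive window size (flow control)      */
        uint16_t checksum;       /* detects corrupted segments              */
        uint16_t urgent_ptr;     /* urgent pointer (rarely usable)          */
        /* ...followed by any TCP options, then the payload data            */
    };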

 

00:07:25.433 TCP retransmits any data

00:07:27.600 that was sent in packets that are lost,

00:07:29.600 and makes sure that data is delivered to the

00:07:31.833 application in the order in which it was originally sent.

 

00:07:34.933 It also adapts the speed at which it sends data to

00:07:37.333 match the available network capacity.

 

00:07:39.833 As a result, TCP provides a reliable, ordered,

00:07:43.366 byte stream delivery service.

 

00:07:45.900 A limitation of the TCP service model

00:07:48.600 is that message boundaries are not preserved.

 

00:07:51.266 That is, if an application writes,

00:07:53.533 for example, a block of 2,000 bytes

00:07:55.900 to a TCP connection, then TCP will

00:07:58.533 deliver those 2,000 bytes to the receiver,

00:08:00.633 reliably, and in the order they are sent.

 

00:08:03.966 However, what TCP does not do,

00:08:06.266 is guarantee that those bytes are delivered to the

00:08:08.733 receiving application as a single block of 2,000 bytes.

 

00:08:12.200 That might happen.

 

00:08:14.166 Equally, they could be delivered to the

00:08:16.333 application as two blocks of 1,000 bytes.

 

00:08:19.200 Or as a block of 1,500 bytes,

00:08:21.266 followed by a block of 500 bytes.

00:08:23.400 Or as 2,000 single bytes. Or as

00:08:25.666 any other combination, provided the data is

00:08:28.400 delivered reliably and in the order sent.

 

00:08:31.966 This complicates the design of applications that

00:08:34.366 use TCP, since they have to parse

00:08:36.900 the data received from a TCP socket

00:08:38.966 to check if they’ve got the complete message.
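
As an illustrative sketch in C of the kind of loop this requires, assuming for simplicity that the application already knows how many bytes make up the message (a real protocol would carry a length field or a delimiter):

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Keep calling recv() until exactly len bytes have arrived, since a
       single recv() may return only part of the message. Returns the number
       of bytes read, 0 if the connection closed early, or -1 on error. */
    ssize_t recv_exactly(int fd, char *buf, size_t len) {
        size_t received = 0;
        while (received < len) {
            ssize_t n = recv(fd, buf + received, len - received, 0);
            if (n <= 0) {
                return n;
            }
            received += (size_t)n;
        }
        return (ssize_t)received;
    }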

 

00:08:42.000 Despite this inconvenience,

00:08:43.666 TCP is the right choice for most applications.

 

00:08:46.900 If you need to deliver data reliably,

00:08:48.966 and as fast as possible, then use TCP.

 

00:08:54.400 TCP is a client-server protocol.

00:08:56.833 Servers listen for, and respond to, requests from clients.

 

00:09:01.300 The way you write code to use

00:09:03.466 a TCP connection depends on whether you’re

00:09:05.766 writing a client or a server.

 

00:09:09.000 On the server side, you begin by creating a socket.

 

00:09:13.100 The first argument to the socket() call

00:09:15.933 is the constant PF_INET, from the sys/socket.h header,

00:09:19.866 if you want the server to listen for connections on IPv4.

 

00:09:24.333 Alternatively, if you want the server to listen

00:09:26.800 for IPv6 connections, use PF_INET6.

 

00:09:31.400 The second argument will be the constant,

00:09:33.600 SOCK_STREAM, to indicate that a TCP server is wanted.

 

00:09:37.733 The third argument is unused and must be zero.

 

00:09:43.000 You then call the bind() function,

00:09:44.966 passing the file descriptor representing the newly

00:09:48.066 created socket as the first argument.

 

00:09:50.400 The other arguments to bind() specify the

00:09:52.666 port number on which the server

00:09:54.366 should listen for incoming connections.

 

00:09:56.600 The bind() function assigns the requested

00:09:59.433 port number to the socket.

 

00:10:02.000 You then call the listen() function.

00:10:04.966 This starts the server listening for incoming connections.

00:10:08.800 Then, you call accept(), to indicate that

00:10:11.766 your server is ready to accept a new connection.

 

00:10:14.700 The accept() function doesn’t return

00:10:16.766 until a client connects to the server.

 

00:10:18.800 This could potentially be a long wait.
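
A minimal sketch of this server-side sequence in C, assuming an IPv4 server and an arbitrary example port of 8080:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void) {
        /* Create a TCP socket for IPv4; use PF_INET6 instead for IPv6 */
        int listen_fd = socket(PF_INET, SOCK_STREAM, 0);
        if (listen_fd < 0) { perror("socket"); exit(1); }

        /* Bind the socket to the port the server will listen on */
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(8080);   /* example port number */
        if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind"); exit(1);
        }

        /* Start listening, then block in accept() until a client connects */
        if (listen(listen_fd, SOMAXCONN) < 0) { perror("listen"); exit(1); }
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0) { perror("accept"); exit(1); }

        /* conn_fd is a new descriptor for the accepted connection; data is
           exchanged on it with send() and recv(), as discussed below */
        close(conn_fd);
        close(listen_fd);
        return 0;
    }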

 

00:10:22.966 Meanwhile, the client application creates its own socket.

00:10:26.933 This is done using the socket() function,

00:10:29.666 in exactly the same way as the server.

 

00:10:33.000 The client then calls connect(),

00:10:34.833 passing the file descriptor for its newly created socket

00:10:37.966 as the first argument. The subsequent arguments

00:10:40.466 contain the IP address of the server,

00:10:42.733 and the port number on which the server is listening.

 

00:10:45.700 The connect() call makes TCP establish a

00:10:48.433 connection from the client to the server.

 

00:10:50.833 When it returns, either the connection has

00:10:53.200 been successfully established, or the server is unreachable.
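
A corresponding minimal client sketch in C, assuming, purely for illustration, that the server is at IPv4 address 192.0.2.1 and listening on port 8080:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void) {
        /* Create a TCP socket, exactly as the server does */
        int fd = socket(PF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); exit(1); }

        /* Fill in the server's address and port (example values only) */
        struct sockaddr_in server;
        memset(&server, 0, sizeof(server));
        server.sin_family = AF_INET;
        server.sin_port   = htons(8080);
        inet_pton(AF_INET, "192.0.2.1", &server.sin_addr);

        /* connect() triggers the TCP handshake; it returns once the
           connection is established, or fails if the server is unreachable */
        if (connect(fd, (struct sockaddr *)&server, sizeof(server)) < 0) {
            perror("connect"); exit(1);
        }

        /* send() and recv() would be used here to exchange data */
        close(fd);
        return 0;
    }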

 

00:10:57.133 When the connection request reaches the server,

00:10:59.866 the accept() call completes.

00:11:02.000 The return value is a new file descriptor,

00:11:04.366 representing the newly established connection.

 

00:11:07.433 The original file descriptor,

00:11:08.900 that was listening for incoming connections,

00:11:10.700 remains unchanged.

 

00:11:14.000 The client and server can now call send() and recv(),

00:11:17.033 to send and receive data over the connection.

 

00:11:20.466 They can send and receive as much,

00:11:22.266 or as little, data as they want, and TCP places

00:11:25.033 no restrictions on the order in which

00:11:26.933 client and server send.

 

00:11:28.933 Remember to use the file descriptor representing

00:11:31.566 the accepted connection, not the file descriptor

00:11:33.833 representing the listening socket, when writing the

00:11:36.033 server code.
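
For example, a purely illustrative fragment like the following echoes one chunk of data back to the client, operating on the descriptor returned by accept() rather than on the listening socket:

    #include <sys/types.h>
    #include <sys/socket.h>

    /* Read one chunk from the accepted connection and send it straight back.
       conn_fd is the descriptor returned by accept(), not the listening socket. */
    ssize_t echo_once(int conn_fd) {
        char buf[1500];
        ssize_t n = recv(conn_fd, buf, sizeof(buf), 0);
        if (n > 0) {
            return send(conn_fd, buf, (size_t)n, 0);
        }
        return n;
    }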

 

00:11:38.533 Finally, when they’ve finished, client and server

00:11:41.633 call the close() function, to cleanly shut

00:11:43.433 down the connection.

 

00:11:45.166 Once the client has closed the connection, it’s done.

 

00:11:48.466 A server can repeatedly accept new connections

00:11:50.666 from the listening socket.

 

00:11:56.366 How does connection establishment actually work?

00:11:58.933 And what are the factors that affect how

00:12:00.900 quickly connections can be established?

 

00:12:03.233 In the following, I’ll talk in detail

00:12:05.833 about how client-server connection establishment works for

00:12:08.700 TCP, and what factors limit performance.

 

00:12:12.533 TCP-based applications

00:12:14.700 usually work in a client-server manner.

 

00:12:16.933 The server listens for connections on a well-known port,

00:12:20.100 and the client connects to the server, sends a request,

00:12:23.133 and receives a response.

 

00:12:25.000 TCP can also, in principle, be used

00:12:27.666 in a peer-to-peer manner.

 

00:12:30.033 If two devices create TCP sockets,

00:12:32.733 bind to known ports, and simultaneously attempt

00:12:35.633 to connect() to each other, then TCP

00:12:37.633 will be able to create a connection,

00:12:39.466 provided there are no firewalls or NAT

00:12:41.400 devices blocking traffic.

 

00:12:43.233 This is known as simultaneous open.

 

00:12:45.866 TCP simultaneous open can work,

00:12:48.600 but isn’t especially useful, since it requires both peers

00:12:51.700 to try to connect at the same time.

 

00:12:54.033 It’s usually better to work in client-server mode,

00:12:56.933 since the server can wait for clients

00:12:58.700 and the client and server don’t need to synchronise

00:13:01.033 when to connect.

 

00:13:05.000 How does client-server TCP connection

00:13:07.333 establishment actually work?

 

00:13:10.133 First, the server creates a TCP socket,

00:13:13.366 binds it to a port, tells it

00:13:16.733 to listen for connections, and calls accept().

00:13:20.100 At that point, the server blocks,

00:13:22.966 waiting for a connection.

 

00:13:25.000 The client creates a TCP socket and calls connect().

 

00:13:28.800 This triggers the TCP connection setup handshake,

00:13:31.333 and causes the client to send a

00:13:34.066 TCP segment to the server. This segment

00:13:36.800 will have the SYN (“synchronise”) bit set

00:13:39.500 in its TCP segment header, to indicate

00:13:42.233 that it’s the first packet of the connection.

 

00:13:44.466 It will also include a randomly chosen sequence number.

 

00:13:47.500 This is the client’s initial sequence number.

 

00:13:50.700 The initial segment does not include any data.

 

00:13:55.200 When this initial segment arrives at the

00:13:57.533 server, the server will send a TCP segment in response.

 

00:14:01.200 This segment will also have the SYN

00:14:03.833 bit set, because it’s the first segment

00:14:06.400 sent by the server on the connection.

 

00:14:08.033 It will also include a randomly chosen

00:14:09.666 initial sequence number.

00:14:11.100 This is the server’s initial sequence number.

 

00:14:14.333 The segment will also have the ACK

00:14:15.966 bit set, because it’s acknowledging the initial

00:14:18.533 segment sent from the client.

 

00:14:20.933 TCP acknowledgements report the next

00:14:23.033 sequence number expected,

00:14:24.466 and by convention a SYN segment will

00:14:26.733 consume one sequence number.

 

00:14:28.700 Accordingly, the acknowledgement number

00:14:30.833 in the TCP segment header will

00:14:32.500 be the client’s initial sequence number plus one.

 

00:14:36.366 Since this segment has both the SYN

00:14:39.000 and ACK bits set, it’s known as a SYN-ACK packet.

 

00:14:42.533 When the SYN-ACK packet arrives at the

00:14:44.600 client, the client acknowledges its receipt back

00:14:47.200 to the server. The TCP segment it

00:14:49.833 generates to do this will have its

00:14:52.433 ACK bit set to one, to indicate

00:14:55.033 that its acknowledgement number is valid,

00:14:57.266 and the acknowledgement number will be the

00:14:59.866 server’s initial sequence number plus one.

00:15:02.100 The SYN bit is not set on

00:15:04.800 this packet, since it’s not the first

00:15:07.400 packet sent from the client to the

00:15:10.033 server.

00:15:11.133 The sequence number in the TCP segment

00:15:13.833 header will equal the client’s initial sequence

00:15:16.466 number plus one, since the SYN packet

00:15:19.066 consumes one sequence number. This packet also

00:15:21.666 doesn’t include any data, since it’s sent

00:15:24.266 before the connect() call completes.

00:15:26.133 Once this packet has been sent,

00:15:28.466 the client considers the connection established,

00:15:30.700 and the connect() function returns. The client

00:15:33.300 can now send or receive data on the connection.

 

00:15:36.200 Once this final ACK packet arrives at

00:15:38.433 the server, the three-way handshake is complete.

00:15:40.866 At this point the accept() function completes,

00:15:43.266 returning a file descriptor the server can

00:15:45.700 use to access the new connection.

00:15:47.766 The server now considers the connection to

00:15:50.200 be open.
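
For concreteness, with made-up initial sequence numbers, the three-way handshake looks like this (the SYN consumes one sequence number, so each acknowledgement is the peer's initial sequence number plus one):

    Client -> Server:  SYN,       seq = 1000
    Server -> Client:  SYN + ACK, seq = 5000, ack = 1001
    Client -> Server:  ACK,       seq = 1001, ack = 5001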

 

00:15:52.000 Once the three-way handshake has completed,

00:15:54.566 the client and server can send and

00:15:57.466 receive data over the connection.

00:15:59.500 The typical case is that the client

00:16:02.500 sends data to the server immediately after

00:16:05.366 it connects, and the server responds with

00:16:08.266 the requested data. There’s no requirement that

00:16:11.133 the client sends first, though, or that

00:16:14.033 client and server alternate in sending data.

 

00:16:17.000 The slide shows an example where the

00:16:19.200 client sends a request comprising a single

00:16:21.400 segment’s worth of data to the server.

00:16:23.600 The server then responds by sending a

00:16:25.800 larger response back, including the acknowledgement for

00:16:28.000 the request on the first segment of

00:16:30.200 the response. Finally, we see the client

00:16:32.400 acknowledge receipt of the segments that comprise

00:16:34.566 the response.

00:16:35.200 This is the typical pattern when a

00:16:37.500 browser fetches a web page from a

00:16:39.700 web server.

00:16:40.333 What’s interesting, is that if we look

00:16:42.633 at the time from when the client

00:16:44.833 calls connect(), until the time it receives

00:16:47.033 the last data segment from the

00:16:49.233 server, a significant part of that time

00:16:51.433 is taken up by the connection setup

00:16:53.633 handshake.

00:16:54.700 It takes a certain amount of time,

00:16:57.000 known as the round trip time,

00:16:58.866 to send a minimal sized request to

00:17:01.066 the server and get a minimal sized

00:17:03.266 response. Larger requests and responses add to

00:17:05.466 this, based on the time to send

00:17:07.666 the additional data down the link,

00:17:09.533 known as the serialisation time for the

00:17:11.733 data. But, if the amount of data

00:17:13.933 being requested from the server is small,

00:17:16.133 it’s often the round trip time that dominates.

 

00:17:19.666 For example, let’s assume a browser is

00:17:22.066 requesting a simple web page from a

00:17:25.033 web server using HTTP running over TCP.

00:17:28.000 That web page comprises a single HTML

00:17:30.966 file, a CSS style sheet, and an

00:17:33.933 image. The HTML and CSS files are

00:17:36.900 on one server, while the image is

00:17:39.866 located on a different server.

00:17:42.000 How long does it take to retrieve

00:17:45.066 that page?

 

00:17:46.000 Well, the client initially connects to the

00:17:48.366 server where the HTML file is located.

00:17:50.766 This takes one round-trip time for the

00:17:53.133 SYN and SYN-ACK packets to be exchanged,

00:17:55.533 and for the connect() call to complete.

 

00:17:58.000 As soon as the connect() completes,

00:18:00.066 the client sends the request for the

00:18:02.500 HTML. It takes another round-trip for the

00:18:04.900 request to reach the server and the

00:18:07.333 first part of the response to come

00:18:09.766 back, followed by the serialisation time for

00:18:12.166 the rest of the response.

 

00:18:14.000 When it’s received the HTML, the client

00:18:16.433 knows that it needs to retrieve the

00:18:18.866 CSS file and the image.

00:18:20.600 It reuses the existing connection to the

00:18:23.133 first server to retrieve the CSS file.

00:18:25.566 This takes an additional round trip,

00:18:27.666 plus the serialisation time of the CSS

00:18:30.100 data.

00:18:31.200 In parallel to this, it opens a

00:18:33.733 TCP connection to the second server,

00:18:35.800 sends a request for the image,

00:18:37.900 and downloads the response. This takes two

00:18:40.333 round trips, plus the time to send

00:18:42.766 the image data.

00:18:43.800 Whichever of these takes the longest,

00:18:46.000 plus the amount of time to make

00:18:48.433 the initial connection and fetch the HTML,

00:18:50.866 determines the total time to download the page.

 

00:18:53.233 The round trip time depends on the

00:18:55.333 distance from the client to the server.

00:18:57.700 The serialisation time depends on the available

00:19:00.133 capacity of the network. For example,

00:19:02.166 if the image is 1 megabyte,

00:19:04.166 8 megabits, in size, and the available

00:19:06.500 bandwidth of the link is 2 megabits

00:19:08.866 per second, then the image will take

00:19:11.200 4 seconds to download, in addition to

00:19:13.566 the round trip time.
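
As a rough rule of thumb (a simplification that ignores congestion control, discussed in Lecture 6), the time to fetch a single object over a new TCP connection is approximately

    time ≈ RTT (connection setup) + RTT (request and first response) + size / bandwidth

so the 1 megabyte (8 megabit) image on a 2 megabit per second link takes roughly two round trips plus 4 seconds.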

 

00:19:16.000 It can be seen that the total

00:19:18.666 download time depends on the round

00:19:21.233 trip time, the available bandwidth, and the

00:19:23.800 size of the data being downloaded.

00:19:26.033 What’s a typical round trip time?

00:19:28.333 This depends on the distance between the

00:19:31.000 client and server, and on the amount

00:19:33.566 of network congestion. The table on the

00:19:36.133 right gives some typical values, measured from

00:19:38.700 a laptop on my home network to various destinations.

 

00:19:42.000 There’s a lot of variation. In the

00:19:44.666 best case, it takes around 33ms to

00:19:47.266 get a response from a server in

00:19:49.833 the UK, around 100ms to get a

00:19:52.400 response from a server on the East

00:19:54.966 coast of the US, around 165ms from

00:19:57.533 a server in California, and around 300ms

00:20:00.100 from Australia. Worst case, when there’s other

00:20:02.666 traffic on the network, is considerably higher.

00:20:05.266 This means that a request sent to

00:20:07.933 a server in New York takes at

00:20:10.500 least 1/10th of a second, irrespective of

00:20:13.066 how much data is requested.

 

00:20:15.000 What about available bandwidth?

00:20:16.566 Well, ADSL typically gets around 25 megabits

00:20:19.433 per second.

00:20:20.233 VDSL, often known as fibre to the

00:20:23.100 kerb, where the connection runs over your

00:20:25.866 home phone line to a cabinet in

00:20:28.633 your street, then over fibre to the

00:20:31.400 exchange and beyond, typically gets around 50

00:20:34.166 megabits per second.

00:20:35.333 And fibre to the premises, where the

00:20:38.200 optical fibre runs direct into your home,

00:20:40.966 can transmit several hundred megabits per second.

00:20:43.733 4G wireless is highly variable, depending on

00:20:46.600 the options enabled by your provider and

00:20:49.366 the reception quality, but somewhere in the

00:20:52.133 15-30 megabits per second range is typical.

 

00:20:56.000 What does this mean in practice?

00:20:58.066 Let’s take the example of a single

00:21:00.500 web page we used before, comprising HTML,

00:21:02.800 a CSS style sheet, and a single

00:21:05.100 image, and plug in some typical numbers

00:21:07.400 for the file sizes, as shown on

00:21:09.733 the slide. Let’s also assume that the

00:21:12.033 round trip time is the same for

00:21:14.333 both servers, to make the numbers easier.

00:21:16.633 The table then plots the total time

00:21:19.033 it would take to download that simple

00:21:21.366 web page, given different values for bandwidth

00:21:23.666 and the round trip time to the

00:21:25.966 servers.

00:21:27.066 The slowest case is the bottom left

00:21:29.466 of the table, where it would take

00:21:31.766 45.1 seconds to download the page,

00:21:33.733 assuming a 1 megabit per second link

00:21:36.066 to a server with a 300ms round

00:21:38.366 trip time. This models a slow connection

00:21:40.666 to a server in Australia.

00:21:42.333 The fastest is the top right of

00:21:44.733 the table, where it takes 0.04 seconds

00:21:47.033 to download the page from a server

00:21:49.333 located 1ms away on a gigabit link.

 

00:21:51.466 What’s interesting is how the download time

00:21:53.833 varies as the link speed improves.

00:21:56.233 If we look at the top row,

00:21:59.166 with 1ms round trip time, we see

00:22:02.000 that if we increase the bandwidth by

00:22:04.833 a factor of ten, from 100Mbps to

00:22:07.666 1Gbps, the time taken to download the

00:22:10.466 page goes down by a factor of

00:22:13.300 ten, a 90% reduction. The link is

00:22:16.133 ten times faster, and the page downloads

00:22:18.966 ten times faster.

00:22:20.166 If we look instead at the bottom

00:22:23.100 row, with 300ms round trip time,

00:22:25.500 increasing the link speed from 100Mbps to

00:22:28.333 1Gbps gives only a 22% reduction in

00:22:31.166 download time.

00:22:31.966 Other links are somewhere in the middle.

 

00:22:35.000 Internet service providers like to advertise their

00:22:37.166 services based on the link speed.

00:22:39.033 They proudly announce that they can now

00:22:41.233 provide gigabit links, and that these are

00:22:43.400 now more than ten times faster than before!

 

00:22:46.066 And this is true.

00:22:47.600 But, in terms of actual download time,

00:22:50.466 unless you’re downloading very large files,

00:22:52.866 the round trip time is often the

00:22:55.666 limiting factor. The download time, for typical

00:22:58.433 pages, may only improve by a factor

00:23:01.233 of two if the link gets 10x

00:23:04.033 faster.

00:23:05.166 Is it still worth paying extra for

00:23:08.066 that faster Internet connection?

 

00:23:10.000 What does this mean for protocol design?

00:23:12.800 The example shows an HTTP/1.1 exchange.

00:23:15.200 Once the connection has been opened,

00:23:17.533 the client sends the data shown in

00:23:20.233 blue, prefixed with the letter “C:”,

00:23:22.533 to the server. The server then responds

00:23:25.233 with the data shown in red,

00:23:27.533 prefixed with the letter “S:”, comprising some

00:23:30.233 header information and the requested page.

00:23:32.566 Everything is completed in a single round

00:23:35.366 trip. Request. Then response.

 

00:23:38.000 Compare that with this example, showing the

00:23:40.633 Simple Mail Transfer Protocol, SMTP, used to

00:23:43.200 send email.

00:23:43.933 As with the previous slide, data sent

00:23:46.566 from client to server is shown in

00:23:49.133 blue and prefixed with the letter “C:”,

00:23:51.666 and that sent from the server to

00:23:54.200 the client is in red, prefixed with

00:23:56.766 the letter “S:”.

00:23:57.866 We see that the protocol is very

00:24:00.500 chatty.

00:24:01.600 Once the connection is established, after the

00:24:04.266 SYN, SYN-ACK, and ACK, the server sends

00:24:06.800 an initial greeting. Establishing the connection and

00:24:09.366 sending this initial greeting takes two round trips.

 

00:24:12.800 The client then sends HELO, and waits

00:24:14.466 for the go ahead from the server.

00:24:16.700 This takes one more round trip.

 

00:24:19.166 The client then sends the from address,

00:24:21.466 and waits for the server. One more

00:24:23.966 round trip.

00:24:24.666 The client then sends the recipients,

00:24:26.900 and waits for the server. One more

00:24:29.366 round trip.

00:24:30.066 The client then says it’d like to

00:24:32.666 send data now, and waits for the

00:24:35.133 server. One more round trip.

 

00:24:37.000 Then, finally the client gets to send

00:24:40.033 the data, and once it’s confirmed that

00:24:43.033 the data was received, sends QUIT,

00:24:45.633 waits, then closes the connection.

00:24:47.766 The whole exchange takes eight round trips.

 

00:24:51.000 Is this necessary or efficient?

00:24:52.866 No!

00:24:54.100 If the protocol were designed differently,

00:24:56.433 all the data could be sent at

00:24:59.066 once, as soon as the connection was

00:25:01.700 opened, and the server could respond with

00:25:04.300 an okay or an error. The eight

00:25:06.933 round trips could be reduced to two:

00:25:09.566 one to establish the connection, one to

00:25:12.166 send the message and get confirmation from

00:25:14.800 the server.

00:25:15.566 This is why email is slow to send.

 

00:25:19.066 TCP establishes connections using a three-way handshake.

00:25:22.166 SYN, SYN-ACK, ACK.

00:25:23.500 The time to establish a connection depends

00:25:26.766 on round trip time and the bandwidth.

 

00:25:30.000 Links are now fast enough that the

00:25:32.400 round trip time is generally the dominant

00:25:34.833 factor, even for relatively slow links.

 

00:25:37.000 The best way to improve application performance

00:25:39.366 is usually to reduce the number of

00:25:41.733 messages that need to be sent from

00:25:44.100 client to server. That is, to reduce

00:25:46.466 the number of round trips. Unless you’re

00:25:48.833 sending a lot of data, increasing the

00:25:51.200 bandwidth generally makes very little difference to

00:25:53.566 performance.

Part 2: Impact of TLS and IPv6 on Connection Establishment

The 2nd part of the lecture discusses the impact of TLS and IPv6 on TCP connection establishment. It shows how the use of TLS, to secure connections, increases the connection establishment latency. And it discusses the "happy eyeballs" technique for connection racing, to reduce connection establishment delays, in dual stack IPv4 and IPv6 networks.

Slides for part 2

00:00:00.000 In the previous part, I discussed TCP

00:00:03.366 connection establishment, and highlighted that the round-trip

00:00:05.700 time is often the limiting factor for

00:00:08.066 performance.

00:00:09.133 In the following, I want to discuss

00:00:11.600 the performance implications of adding transport layer

00:00:13.933 security to TCP connections, and how to

00:00:16.300 achieve good performance when the destination is

00:00:18.633 a dual-stack IPv4 and IPv6 host.

 

00:00:21.000 In the previous part, I showed how

00:00:23.566 the network round trip time can be

00:00:26.066 the limiting factor in performance. This is

00:00:28.533 because every TCP connection needs at least

00:00:31.033 two round trip times: one to establish

00:00:33.500 the connection, and one for the client

00:00:36.000 to send a request and receive a

00:00:38.466 response from the server.

00:00:39.900 I also showed how the protocol running

00:00:42.500 over TCP can make a significant difference

00:00:44.966 to performance, with the examples of HTTP,

00:00:47.466 which sends a request and receives a

00:00:49.933 response in a single round trip,

00:00:52.066 and SMTP, which makes multiple unnecessary round trips.

 

00:00:55.166 One of the important protocols that runs

00:00:58.133 over TCP is the transport layer security

00:01:01.266 protocol, TLS.

00:01:02.133 TLS provides security for a TCP connection.

00:01:05.366 That is, it allows the client and

00:01:08.500 server to agree encryption and authentication keys

00:01:11.633 to make sure that the data sent

00:01:14.766 over that TCP connection is confidential and

00:01:17.900 protected from modification in transit.

00:01:20.133 TLS is essential to Internet security.

 

00:01:23.000 When you retrieve a secure web page

00:01:25.700 using HTTPS, it first opens a TCP

00:01:28.366 connection to the server. Then, it runs

00:01:31.066 TLS to enable security for that connection.

00:01:33.766 Then, it asks to retrieve the web

00:01:36.433 page.

00:01:37.566 Depending on the version of TLS used,

00:01:40.366 this adds additional time to the connection.

00:01:43.033 With the latest version of TLS,

00:01:45.466 TLS v1.3, it takes one additional round

00:01:48.133 trip to agree the encryption and authentication

00:01:50.833 keys. That is, after the TCP connection

00:01:53.500 has been established, via the SYN -

00:01:56.200 SYN-ACK - ACK handshake, then the client

00:01:58.900 and server need an additional round trip

00:02:01.566 to enable TLS, before they can request data.

 

00:02:04.033 The TLS handshake is in three parts.

00:02:06.800 First, the client sends a TLS ClientHello

00:02:09.600 message to the server to propose security

00:02:12.400 parameters. Then, the server responds with a

00:02:15.200 TLS ServerHello, containing its keys and other

00:02:18.000 security parameters. Finally, assuming there’s a match,

00:02:20.800 the client responds with a TLS Finished

00:02:23.600 message to set the encryption parameters.

00:02:26.000 The client then immediately follows this by

00:02:28.800 sending the application data, such as an

00:02:31.600 HTTP GET request, without waiting for a

00:02:34.400 response.

00:02:35.566 This adds one additional round trip time, in most cases.
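
As an illustrative sketch of what this looks like in code, using the OpenSSL API over an already-connected TCP socket (the function shown is an assumption for illustration, and certificate verification is omitted for brevity even though it is essential in real code):

    #include <string.h>
    #include <openssl/ssl.h>

    /* Sketch: run the TLS handshake over an already-connected TCP socket,
       then send a request (e.g. an HTTP GET) and read the first part of the
       response. The TLS handshake adds at least one extra round trip on top
       of the TCP handshake. */
    int fetch_over_tls(int tcp_fd, const char *hostname, const char *request,
                       char *response, int response_len) {
        SSL_CTX *ctx = SSL_CTX_new(TLS_client_method());
        if (ctx == NULL) {
            return -1;
        }
        SSL *ssl = SSL_new(ctx);
        SSL_set_fd(ssl, tcp_fd);
        SSL_set_tlsext_host_name(ssl, hostname);     /* SNI */

        int n = -1;
        if (SSL_connect(ssl) == 1) {                 /* TLS handshake */
            SSL_write(ssl, request, (int)strlen(request));
            n = SSL_read(ssl, response, response_len);
            SSL_shutdown(ssl);
        }
        SSL_free(ssl);
        SSL_CTX_free(ctx);
        return n;                                    /* bytes read, or -1 */
    }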

 

00:02:39.100 Older versions of TLS take longer.

00:02:41.266 TLS v1.2, for example, takes at least

00:02:43.900 two round trips to negotiate a secure

00:02:46.533 connection.

 

00:02:48.000 What impact does the additional round trip

00:02:50.833 due to TLS have on performance?

00:02:53.266 Well, let’s look again at the simple

00:02:56.200 web page download examples from the previous

00:02:59.033 part.

00:03:00.200 When the round-trip time is negligible,

00:03:02.733 as on the top row with 1ms

00:03:05.566 round trip time, performance is unchanged.

00:03:08.000 As we go down the table,

00:03:10.533 though, performance gets worse. With 100ms round

00:03:13.366 trip time, both the overall performance and

00:03:16.233 the benefit of increasing the link speed

00:03:19.066 go down. The download time for the

00:03:21.900 page on a gigabit link is increased

00:03:24.733 by 45%, from 0.44 to 0.64 seconds,

00:03:27.566 compared to a connection without TLS.

00:03:30.000 And the benefit of going from a

00:03:32.833 100 megabit link to a gigabit link

00:03:35.666 is only 36%, rather than 45% without

00:03:38.533 TLS.

00:03:39.666 With 300ms round trip time the behaviour

00:03:42.600 is even worse. Total download time increases

00:03:45.466 by 48% compared to the non-TLS case,

00:03:48.300 and there’s only a 22% reduction in

00:03:51.133 download time when going from a 100

00:03:53.966 megabit link to a gigabit link.

00:03:55.966 This is not to say that TLS

00:03:58.566 is bad! Far from it – security

00:04:01.133 is essential.

00:04:01.866 Rather, it further highlights that the number

00:04:04.533 of round trips that a connection must

00:04:07.133 perform, between client and server, is often

00:04:09.700 the limiting factor in performance.

00:04:11.533 Applications that have good performance will try

00:04:14.200 to reduce the number of TCP connections

00:04:16.766 that they establish, since each connection takes

00:04:19.333 time to establish. They also try to

00:04:21.900 limit the number of request-response exchanges,

00:04:24.100 each taking a round trip, they make

00:04:26.700 on each connection.

00:04:27.800 TLS v1.3, standardised in 2018, was a

00:04:30.466 big win here, because it reduces the

00:04:33.033 number of round trips needed to enable

00:04:35.600 security from two, down to one.

00:04:37.800 When used with TCP, this gives the

00:04:40.366 best possible performance: one round trip to

00:04:42.933 establish the TCP connection, and one to

00:04:45.500 negotiate the security, before the data can

00:04:48.100 be sent.

00:04:48.833 We’ll talk more about TLS and how

00:04:51.500 to improve the performance of secure connections

00:04:54.066 in lectures 3 and 4.

 

00:04:57.000 The other factor affecting TCP connection performance

00:05:00.266 is the ongoing transition to IPv6.

00:05:03.033 This transition means that we currently have

00:05:06.400 two Internets: the IPv4 Internet and the

00:05:09.633 IPv6 Internet.

00:05:10.566 Some hosts can only connect using IPv4.

00:05:13.933 Some hosts can only connect using IPv6.

00:05:17.166 And some hosts have both types of address.

 

00:05:21.066 Similarly, some network links can only carry

00:05:24.166 IPv4 traffic, some only IPv6, and some

00:05:27.333 links can carry both types of traffic.

00:05:30.500 And some firewalls, or other middleboxes,

00:05:33.333 block IPv4, some block IPv6, and some

00:05:36.500 block both types of traffic.

00:05:38.733 Importantly, the IPv6 network is not a

00:05:42.033 subset of the IPv4 network. It’s a

00:05:45.200 separate Internet, that overlaps in places.

 

00:05:49.000 Given that some hosts will be reachable

00:05:52.000 over IPv4 but not IPv6, and vice

00:05:54.866 versa, how do you establish connections during

00:05:57.766 the transition?

00:05:58.566 Well, given a hostname, you perform a

00:06:01.566 DNS lookup to find the IP addresses

00:06:04.433 for that host using the getaddrinfo() call.

00:06:07.333 This returns a list of possible IP

00:06:10.200 addresses for the host, including both IPv4

00:06:13.100 and IPv6 addresses.

00:06:14.333 The simple approach is a loop,

00:06:16.900 trying each address in turn,

00:06:18.666 until one successfully connects.

 

00:06:20.733 This works, but can be very slow.
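
A sketch of that simple sequential approach in C, using getaddrinfo() (the hostname and port are whatever the application supplies); the Happy Eyeballs technique described next instead staggers these attempts and runs them in parallel:

    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netdb.h>

    /* Try each address returned by DNS in turn until one connect() succeeds.
       Simple, but slow if the first addresses in the list are unreachable. */
    int connect_to_host(const char *hostname, const char *port) {
        struct addrinfo hints, *results, *ai;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_UNSPEC;     /* both IPv4 and IPv6 addresses */
        hints.ai_socktype = SOCK_STREAM;   /* TCP */

        if (getaddrinfo(hostname, port, &hints, &results) != 0) {
            return -1;
        }

        int fd = -1;
        for (ai = results; ai != NULL; ai = ai->ai_next) {
            fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (fd < 0) continue;
            if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
                break;                     /* connected successfully */
            }
            close(fd);                     /* failed: try the next address */
            fd = -1;
        }
        freeaddrinfo(results);
        return fd;                         /* -1 if every address failed */
    }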

 

00:06:24.300 In the example on the slide,

00:06:26.066 Netflix has 16 possible IP addresses,

00:06:28.533 eight IPv6 and eight IPv4, and lists

00:06:31.433 the IPv6 addresses first in its DNS

00:06:34.300 response. If you have only IPv4 connectivity,

00:06:37.200 it may take a long time to

00:06:40.066 try, and fail, to connect to eight

00:06:42.966 different IPv6 addresses before you get to

00:06:45.833 an IPv4 address that works.

 

00:06:49.000 To get good performance, applications use a

00:06:52.233 technique known as “Happy Eyeballs”.

00:06:54.500 This involves making two separate DNS lookups,

00:06:57.733 in parallel, one asking for only IPv4

00:07:00.866 addresses and one for only IPv6 addresses.

00:07:04.033 Starting with whichever of these DNS lookups

00:07:07.266 completes first, the client makes a connection

00:07:10.433 to the first address returned by the

00:07:13.566 server. If that hasn’t succeeded within 100ms,

00:07:16.700 it starts another connection request to the

00:07:19.866 next possible address, alternating between IPv4 and

00:07:23.000 IPv6 addresses.

 

00:07:24.000 The different connection requests proceed in parallel,

00:07:27.366 until one eventually succeeds. That first successful

00:07:30.700 connection is used, whether over IPv4 or

00:07:34.066 IPv6, and the other connection requests are cancelled.

 

00:07:37.900 The happy eyeballs technique tries to balance

00:07:40.900 the time taken to connect vs.

00:07:43.366 the network overload of trying many possible

00:07:46.266 connections at once in parallel. It adds

00:07:49.166 complexity to the connection setup, to achieve

00:07:52.066 good performance.

 

00:07:54.000 The two factors affecting TCP performance are

00:07:56.900 bandwidth and latency. In many cases,

00:07:59.300 the latency, the round trip time,

00:08:01.700 dominates.

00:08:02.866 There are five ways in which applications

00:08:05.766 using TCP improve their performance.

00:08:07.766 The first is that a client should

00:08:10.666 use something like happy eyeballs, overlapping connection

00:08:13.466 requests if the server

00:08:16.266 has more than one address. This is

00:08:18.466 more complicated to implement than trying to

00:08:20.966 connect to each different address in turn,

00:08:22.800 but connects a lot faster.

 

00:08:25.566 The second way to improve TCP performance

00:08:28.566 is to reduce the number of TCP

00:08:31.133 connections made. Each connection takes time to

00:08:33.700 establish. If you can make a single

00:08:36.300 TCP connection and reuse it for multiple

00:08:38.866 requests, that’s faster than making a new

00:08:41.433 connection for each request.

 

00:08:43.000 Third, if you reduce the number of

00:08:45.600 request-response exchanges made over each connection,

00:08:47.833 you reduce the impact of the round

00:08:50.433 trip latency.

00:08:51.166 All these are possible for any application,

00:08:53.866 by using TCP connections effectively.

00:08:55.733 There are also two more radical changes

00:08:58.400 that can be made.

 

00:09:00.000 The first is to overlap the TCP

00:09:02.533 and TLS connection setup handshakes, by sending

00:09:05.066 the security parameters along with the initial

00:09:07.633 connection request, so that both the connection

00:09:10.166 setup and security parameters can be negotiated

00:09:12.700 in a single round trip. This isn’t

00:09:15.233 possible with TCP, but the QUIC transport

00:09:17.766 protocol, that we’ll discuss in lecture 4,

00:09:20.333 does allow this.

00:09:21.400 Finally, one can always improve performance by

00:09:24.066 reducing the round trip latency. This latency

00:09:26.600 depends on two things: the speed at

00:09:29.133 which the signal propagates down the link,

00:09:31.666 and the amount of other traffic.

00:09:33.833 Since signals travel down electrical cables and

00:09:36.500 optical fibres at the speed of light,

00:09:39.033 there’s little that can be done to

00:09:41.566 increase the propagation speed, although low earth

00:09:44.100 orbit satellites can help, as we’ll discuss in

00:09:46.633 lecture 6.

00:09:47.366 Reducing the amount of other traffic queued

00:09:50.000 up at intermediate links is a possibility

00:09:52.566 though, and this can be affected by

00:09:55.100 the choice of TCP congestion control algorithm.

00:09:57.633 We’ll talk about this in lectures 5

00:10:00.166 and 6.

 

00:10:02.000 To summarise, one of the limiting factors

00:10:04.900 with TCP performance is the round trip

00:10:07.666 latency.

00:10:08.833 The use of TLS is essential to

00:10:11.700 improve security, but comes at the expense

00:10:14.500 of an additional round trip that slows

00:10:17.266 down connection establishment. This is solved by

00:10:20.066 the upcoming QUIC transport protocol, that we’ll

00:10:22.833 discuss in lecture 4.

00:10:24.433 Similarly, the ongoing migration to IPv6 means

00:10:27.333 that servers often have both IPv4 and

00:10:30.100 IPv6 addresses, and it’s not clear which

00:10:32.900 of these are reachable. Clients must try

00:10:35.666 to establish multiple connections in parallel,

00:10:38.066 using the happy eyeballs technique, to get

00:10:40.866 good performance.

Part 3: Peer-to-peer Connections

The 3rd part starts to discuss peer-to-peer connections. It talks about how the use of Network Address Translation (NAT) affects addressing and connection establishment, and why it complicates creating peer-to-peer applications.

Slides for part 3

00:00:00.000 In this part, I’ll start to talk

00:00:02.733 about peer-to-peer connections, network address translation,

00:00:05.666 and how these affect Internet addressing and

00:00:07.700 connection establishment.

 

00:00:10.000 The Internet was designed as a peer-to-peer

00:00:12.366 network, and makes no distinction between clients

00:00:14.733 and servers at the IP layer.

00:00:16.733 In principle, it should be possible to

00:00:19.200 run a TCP server, or a UDP-

00:00:21.566 or TCP-based peer-to-peer application, on any host

00:00:23.933 on the network. As long as the

00:00:26.300 clients have some way of finding the

00:00:28.666 server’s IP address, and knowing what port

00:00:31.000 number it’s using, and as long as

00:00:33.366 any firewall pinholes are opened, then it

00:00:35.733 shouldn’t matter whether a server is located

00:00:38.100 in someone’s home or in a data

00:00:40.466 centre.

00:00:41.533 A server in a data centre is

00:00:44.000 likely to have better performance, of course,

00:00:46.366 because it’s probably got a faster connection

00:00:48.433 to the rest of the network.

 

00:00:49.933 It’s also likely to be more robust,

00:00:52.666 because the data centre will have redundant

00:00:55.333 power and network links, air conditioning,

00:00:57.600 and professional system administrators. But, at the

00:01:00.266 protocol level, there shouldn’t be a difference.

00:01:02.933 In practice, this is not the case.

00:01:05.700 It’s difficult to run a server on

00:01:08.366 a host connected to most residential broadband

00:01:11.033 connections, and it’s difficult to make peer-to-peer

00:01:13.700 connections work.

00:01:14.466 The reason for this is the widespread

00:01:17.233 use of network address translation – NAT.

 

00:01:21.000 What is network address translation?

00:01:22.966 NAT is the process by which several

00:01:25.833 devices can share a single public IP

00:01:28.600 address. It allows several hosts to form

00:01:31.366 a private internal network, with IP addresses

00:01:34.133 assigned from a special-use range. One device

00:01:36.900 – the network address translator, the NAT

00:01:39.666 – is connected to both the private

00:01:42.466 network and to the Internet, and can

00:01:45.233 forward packets between the two networks.

00:01:47.600 As it does so, it rewrites,

00:01:49.966 translates, the IP addresses, and the TCP

00:01:52.733 and UDP port numbers, so all the

00:01:55.500 packets appear to come from the NAT’s

00:01:58.266 IP address.

00:01:59.066 Essentially, it hides an entire private network

00:02:01.933 behind a single IP address.

 

00:02:04.000 This is useful because there aren’t enough

00:02:06.400 IPv4 addresses for every device that wants

00:02:08.766 to connect to the network, and because

00:02:11.166 it’s taking a long time to deploy IPv6.

 

00:02:14.033 NAT is a workaround, to let

00:02:17.166 you keep using IPv4 devices, with some

00:02:20.300 limitations, even though there aren’t enough IPv4

00:02:23.466 addresses.

 

00:02:25.000 How does NAT work? Well, let’s first

00:02:27.766 step back, and think about how a

00:02:30.533 single host connects to the network.

 

00:02:33.000 In the figure, a customer owns a

00:02:35.666 single host. That host connects to a

00:02:38.333 network run by an Internet service provider.

00:02:41.000 That ISP, in turn, connects to the

00:02:44.033 broader Internet.

 

00:02:45.000 The ISP owns a range of IP

00:02:47.333 addresses that it can assign to its customers.

 

00:02:50.066 In this example, it owns the IPv4

00:02:53.866 prefix 203.0.113.0/24. That is, the

00:02:57.733 IPv4 addresses where the first 24 bits,

00:03:01.600 known as the network part of the

00:03:05.466 address, match those of 203.0.113.0 are assigned

00:03:09.333 to the ISP.

00:03:11.000 These are the IPv4 addresses in the

00:03:17.933 range 203.0.113.0 to 203.0.113.255.

 

00:03:22.000 The address with the host part equal

00:03:24.533 to zero represents the network, and cannot

00:03:27.033 be assigned to a device. The ISP

00:03:29.566 assigns the first usable IP address in

00:03:32.100 the range, 203.0.113.1, to the internal network

00:03:34.600 interface of the router that connects it

00:03:37.133 to the rest of the network,

00:03:39.300 and assigns the rest of the addresses

00:03:41.833 to customer machines.

 

00:03:43.000 One particular customer is assigned IP address

00:03:46.100 203.0.113.7 for their device.

00:03:47.900 The external, Internet-facing, side of the router

00:03:51.100 that connects the ISP to the rest

00:03:54.233 of the network has an IP address

00:03:57.333 assigned by the network to which the

00:04:00.466 ISP connects. In this example, it gets

00:04:03.566 IP address 192.0.2.47.

 

00:04:06.000 The customer’s host connects to a server

00:04:09.800 on the Internet. The server happens to

00:04:13.633 have IP address 192.0.2.53.

00:04:15.800 The customer’s host sends packets that have

00:04:19.733 their destination IP address equal to that

00:04:23.533 of the server, 192.0.2.53, and source IP

00:04:27.366 address equal to that of the customer’s

00:04:31.166 host, 203.0.113.7.

00:04:32.266 Those packets travel through the network without

00:03:36.166 change, and when they arrive at the

00:04:40.000 server, they still have destination IP address

00:04:43.800 192.0.2.53 and source IP address 203.0.113.7.

00:04:47.066 When it sends a reply, the server

00:04:51.000 will set the destination IP address to

00:04:54.800 that of the customer’s device, 203.0.113.7,

00:04:58.066 and use its own address, 192.0.2.53,

00:05:01.366 as the source IP address.

00:05:04.066 No address translation takes place.

 

00:05:08.000 At some point later, the customer buys

00:05:11.033 another host. How does it connect to

00:05:14.033 the network?

 

00:05:15.000 What’s supposed to happen is as follows.

00:05:17.666 First, the customer buys an IP router,

00:05:20.433 or is given one by the ISP.

00:05:23.133 The router is used to create an

00:05:25.800 internal network for the customer, that connects

00:05:28.466 to the ISP’s network.

00:05:30.000 This could be an Ethernet, a WiFi

00:05:33.433 network, or whatever.

 

00:05:35.000 The external interface of that router,

00:05:38.266 that connects the customer to the ISP,

00:05:42.066 inherits the IP address that was previously

00:05:45.900 assigned to the customer’s single device,

00:05:49.133 in this case 203.0.113.7.

00:05:51.333 The ISP also assigns a new IP

00:05:55.233 address range to the customer. This will

00:05:59.033 be a subset of the IP address

00:06:02.866 range the ISP owns. In this example,

00:06:06.666 the customer is assigned the IP address

00:06:10.466 range 203.0.113.16/28. That is, IP addresses where

00:06:14.266 the first 28 bits match those of

00:06:18.100 203.0.113.16, namely the range 203.0.113.16 to 203.0.113.31.

 

00:06:22.000 The customer assigns the first usable address

00:06:25.933 in that range, 203.0.113.17, to the internal

00:06:29.866 network interface of the router, and assigns

00:06:33.800 other addresses to their two hosts.

00:06:37.166 In this example, the two hosts are

00:06:41.100 given addresses 203.0.113.18 and 203.0.113.19.

 

00:06:44.000 The end result is that the ISP

00:06:46.133 delegates some of the IP addresses they

00:06:48.266 own to their customer, and the customer

00:06:50.366 uses them in their network.

 

00:06:53.000 One of the customer’s hosts connects to

00:06:56.133 a server on the Internet.

00:06:58.366 As expected, to do so, that host

00:07:01.600 sends an IP packet with the source

00:07:04.733 IP address set to its IP address,

00:07:07.866 in this case 203.0.113.18, and the destination

00:07:11.000 address set to the IP address of

00:07:14.133 the server, 192.0.2.53.

00:07:15.466 That packet travels through the customer’s network

00:07:18.700 to its router, and is forwarded on

00:07:21.833 to the ISP’s network. It traverses the

00:07:24.966 ISP’s network to the router connecting the

00:07:28.100 ISP to the Internet, and is forwarded

00:07:31.233 on from there to the Internet.

00:07:33.933 Eventually, the packet arrives at the server.

00:07:37.066 When it arrives, it still has destination

00:07:40.200 address equal to that of the server,

00:07:43.333 192.0.2.53, and source address equal to that

00:07:46.466 of the host that sent it,

00:07:49.133 203.0.113.18.

00:07:50.333 When it sends a reply, the server

00:07:53.566 will set the destination IP address to

00:07:56.700 that of the customer’s device, 203.0.113.18 and

00:07:59.833 use its own address, 192.0.2.53, as the

00:08:02.966 source IP address.

00:08:04.300 No address translation takes place.

 

00:08:07.000 That’s what’s supposed to happen, but what

00:08:09.733 actually happens?

00:08:10.500 Well, most likely the ISP either doesn’t

00:08:13.300 have enough IPv4 addresses to delegate some

00:08:16.033 of them to their customer, or they

00:08:18.766 want to charge a lot extra to

00:08:21.466 do so.

00:08:22.266 Accordingly, the customer buys a network address

00:08:25.066 translator, and connects it to the ISP’s

00:08:27.800 network in place of their single original host.

 

00:08:31.100 The external interface of the NAT gets

00:08:34.533 the IP address assigned to the customer’s

00:08:38.066 original host, 203.0.113.7.

00:08:39.600 The customer sets up their internal network

00:08:43.233 as before, but instead of using IP

00:08:46.766 addresses assigned by their ISP, they use

00:08:50.300 one of the private IP address ranges.

00:08:53.833 In this example, they use

00:08:57.366 addresses in the range 192.168.0.0 to 192.168.255.255.

 

00:09:01.000 The internal interface of the NAT is

00:09:05.500 given IP address 192.168.0.1, and the two

00:09:10.033 hosts get addresses 192.168.0.2 and 192.168.0.3.

 

00:09:15.000 One of the customer’s hosts again connects

00:09:18.333 to a server on the Internet.

00:09:21.166 As expected, that host sends an IP

00:09:24.600 packet with the source IP address set

00:09:27.933 to its IP address, in this case

00:09:31.266 192.168.0.2, and the destination address set to

00:09:34.566 the IP address of the server, 192.0.2.53.

 

00:09:38.166 That packet travels through the customer’s network

00:09:40.666 to its NAT router. The NAT rewrites

00:09:43.333 the source address of the packet to

00:09:46.000 match the external address of the NAT,

00:09:48.700 in this case 203.0.113.7, and also rewrites

00:09:51.366 the TCP or UDP port number to

00:09:54.033 some new port number that’s unused on

00:09:56.700 the NAT, and forwards the packet on

00:09:59.366 to the ISP’s network.

 

00:10:01.000 Internally, the NAT keeps a record of

00:10:04.200 the changes it made, associated with the

00:10:07.366 port.

00:10:08.566 The packet traverses the ISP’s network to

00:10:11.866 the router connecting the ISP to the

00:10:15.066 Internet, and is forwarded on from there

00:10:18.266 to the Internet. Eventually, the packet arrives

00:10:21.433 at the server. When it arrives,

00:10:24.166 it still has destination address equal to

00:10:27.366 that of the server, 192.0.2.53, but source

00:10:30.566 address equal to that of the NAT, 203.0.113.7.

 

00:10:34.133 To the server, the packet appears to

00:10:37.066 have come from the NAT. When it

00:10:40.100 sends a reply, the server will set

00:10:43.166 the destination IP address to that of

00:10:46.233 the NAT, 203.0.113.7, and use its own

00:10:49.266 address, 192.0.2.53, as the source address.

 

00:10:52.000 The reply will traverse the network until

00:10:54.400 it reaches the NAT. The NAT looks

00:10:56.766 at the TCP or UDP port number

00:10:59.166 to which the packet is destined,

00:11:01.200 and uses this to retrieve its internal

00:11:03.566 record of the rewrites that were performed.

00:11:05.966 It then uses this to do the

00:11:08.333 inverse rewrite, changing the destination IP address

00:11:10.733 and port in the packet to those

00:11:13.100 of the host on the private network,

00:11:15.500 then forwards the packet onto the private

00:11:17.866 network for delivery.
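
To make the forward and reverse rewrites concrete, here is a hypothetical sketch of the per-flow state a NAT might keep; the structure and field names are illustrative, and not taken from any particular implementation.

```c
/* Hypothetical sketch of the per-flow translation state a NAT might keep.
 * An outgoing packet creates an entry; a reply arriving at external_port
 * is matched against the entry and rewritten back to the internal address. */
#include <stdint.h>
#include <time.h>

struct nat_binding {
    uint32_t internal_addr;   /* private source address, e.g. 192.168.0.2    */
    uint16_t internal_port;   /* source port chosen by the internal host     */
    uint32_t external_addr;   /* public address of the NAT, e.g. 203.0.113.7 */
    uint16_t external_port;   /* unused port chosen by the NAT               */
    uint32_t remote_addr;     /* the server, e.g. 192.0.2.53                 */
    uint16_t remote_port;
    uint8_t  protocol;        /* TCP or UDP                                  */
    time_t   last_used;       /* updated on each packet, for idle timeouts   */
};
```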

 

00:11:20.000 Essentially, the NAT hides a private network

00:11:25.533 behind a single public IP address.

00:11:30.266 The private network can use one of

00:11:35.900 three private IPv4 address ranges: 10.0.0.0/8,

00:11:40.633 172.16.0.0/12, and 192.168.0.0/16.

00:11:43.000 Machines in a private network can directly

00:11:45.600 talk to each other using these private

00:11:48.200 IP addresses, provided that the communication stays within

00:11:50.800 the private network.
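
A small sketch of how a host might test whether an IPv4 address lies in one of those three private ranges; the function name is made up for illustration.

```c
/* Sketch: check whether an IPv4 address is in one of the RFC 1918 private
 * ranges: 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16. */
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

static bool is_private_ipv4(const char *dotted) {
    struct in_addr a;
    if (inet_pton(AF_INET, dotted, &a) != 1) {
        return false;                  /* not a valid IPv4 address     */
    }
    uint32_t h = ntohl(a.s_addr);      /* host byte order, for masking */
    return ((h & 0xFF000000u) == 0x0A000000u)    /* 10.0.0.0/8         */
        || ((h & 0xFFF00000u) == 0xAC100000u)    /* 172.16.0.0/12      */
        || ((h & 0xFFFF0000u) == 0xC0A80000u);   /* 192.168.0.0/16     */
}
```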

 

00:11:52.000 When they communicate with the rest of

00:11:54.433 the network, the IP addresses are rewritten

00:11:56.833 so that, to the rest of the

00:11:59.266 network, the private network looks like a

00:12:01.700 single device, with one IP address matching

00:12:04.133 that of the external address of the

00:12:06.533 NAT. This gives the illusion that there

00:12:08.966 are more IPv4 addresses available, by allowing

00:12:11.400 the same private address ranges to be re-used

00:12:13.833 in different parts of the network.

 

00:12:16.000 Your home network, for example, almost certainly

00:12:18.833 uses addresses in the 192.168.0.0/16 private address

00:12:21.633 range, and is connected to the rest

00:12:24.466 of the network via a NAT router

00:12:27.300 provided by your ISP.

 

00:12:30.000 This concludes our review of how NAT

00:12:32.400 routers allow multiple devices to share a

00:12:34.766 single IP address. In the next part,

00:12:37.666 I’ll explain some of the problems NATs cause.

Part 4: Problems due to Network Address Translation

The 4th part of the lecture continues the discussion of the problems caused by NAT devices, and why they are used despite these problems. It talks about the use of NAT as a work-around for the lack of IPv4 address space, as a possible translation mechanism between IPv4 and IPv6, and to avoid renumbering. And it talks about the implications of NAT for TCP connections and UDP flows.

Slides for part 4

 

00:00:00.133 In the previous part we discussed what

00:00:03.233 is network address translation, and walked through

00:00:05.466 some examples showing how NAT routers allow

00:00:07.700 several hosts on a private network to

00:00:09.933 share a single IP address.

00:00:11.533 In the following, I want to talk

00:00:13.866 about some of the problems caused by

00:00:16.100 NATs, and to discuss some of the

00:00:18.333 reasons why NATs are used despite these

00:00:20.566 problems.

 

00:00:22.000 The first issue with NAT routers is

00:00:24.500 that they break certain classes of application,

00:00:26.800 and encourage centralisation.

 

00:00:28.400 NATs are designed to support client-server

00:00:30.133 applications, where the client is behind the NAT

00:00:33.133 and the server is a host on

00:00:35.633 the public Internet. Packets sent by a

00:00:38.133 host with a private IP address can

00:00:40.600 pass out through the NAT, and will

00:00:43.100 have their IP address and port translated

00:00:45.600 to use the public IP address of

00:00:47.433 the NAT before they’re forwarded to the

00:00:49.600 public Internet. The NAT will also retain

00:00:51.900 state, so that the reverse translation will

00:00:54.100 be applied to replies to those packets,

00:00:56.066 allowing them to pass back through the NAT.

 

00:00:59.066 This behaviour allows clients to connect to

00:01:01.933 servers, setting up NAT translation state in the process,

00:01:04.900 and to receive responses.

 

00:01:07.800 The reverse doesn’t work, though.

 

00:01:10.133 NAT routers rely on outgoing packets to

00:01:12.366 establish the mappings they need to translate

00:01:14.700 incoming packets. That is, when an incoming

00:01:17.066 TCP or UDP packet arrives at some

00:01:19.400 particular port on a NAT, the NAT

00:01:21.766 looks at its record of what it

00:01:24.100 previously sent from that port, and how

00:01:26.466 it was translated, to know what

00:01:28.800 reverse translation to make. If there’s been

00:01:31.166 no outgoing packet on that port,

00:01:33.166 the NAT won’t know how to translate

00:01:35.500 the incoming packet. It won’t know which

00:01:37.866 of the private IP addresses to use

00:01:40.200 as the destination address for the translated packet.

 

00:01:43.133 This complicates running a server behind a

00:01:45.800 NAT, since the NAT won’t know how

00:01:48.600 to translate incoming requests for the server.

00:01:51.433 It’s possible to manually configure the NAT

00:01:54.233 to forward packets appropriately, of course,

00:01:56.633 and protocols like UPnP can help with

00:01:59.433 this, but these approaches are complicated or

00:02:02.266 unreliable. It’s generally easier and more reliable

00:02:05.066 to pay a cloud computing provider to

00:02:07.866 host the server, which encourages centralisation onto

00:02:10.666 large hosting services.

00:02:11.866 NATs also make it hard to write

00:02:14.800 peer-to-peer applications. In part, this is because

00:02:17.600 NATs make incoming connections difficult. But it’s

00:02:20.400 also because hosts located behind a NAT

00:02:23.200 only know their private address, so can’t

00:02:26.033 give their peer a public address to

00:02:28.833 which it can connect. There are solutions

00:02:31.633 to this, that I’ll talk about in

00:02:34.433 the next part of this lecture,

00:02:36.866 but they’re complicated, slow, and wasteful.

00:02:39.266 Unless you really need the privacy and

00:02:42.066 latency benefits of a direct peer-to-peer connection,

00:02:44.866 it’s often easier to relay traffic via

00:02:47.666 a server hosted in a data centre

00:02:50.500 somewhere, with a public IP address,

00:02:52.900 again encouraging centralisation of services.

 

00:02:56.000 If NAT routers are so problematic,

00:02:58.400 why do people use them? There are

00:03:01.166 three reasons.

00:03:01.966 The first is to work around the

00:03:04.866 lack of IPv4 address space.

00:03:06.866 As shown in the figure on the

00:03:09.733 right, the Regional Internet Registries have run

00:03:12.533 out of IPv4 addresses. There are no

00:03:15.300 more IPv4 addresses available for ISPs and

00:03:18.100 companies that want to connect to the

00:03:20.900 Internet, and they can’t provide enough IPv4

00:03:23.666 addresses to fulfil demand.

00:03:25.266 The result is that IPv4 addresses are

00:03:28.166 scarce and expensive. ISPs either don’t have

00:03:30.933 enough addresses to meet their customers’ needs,

00:03:33.733 or the cost of those addresses is

00:03:36.500 prohibitive, and customers use a private network

00:03:39.300 with a NAT instead of using public

00:03:42.100 IPv4 addresses.

00:03:42.866 The transition to IPv6 will solve this

00:03:45.766 problem, since IPv6 makes addresses cheap and

00:03:48.566 plentiful. The smallest possible address allocation for

00:03:51.333 an IPv6 network is a factor of

00:03:54.133 four billion times larger than the entire

00:03:56.933 IPv4 Internet! Unfortunately, the transition to IPv6

00:03:59.700 has been slow.

 

00:04:02.000 This suggests the second reason why NAT

00:04:05.066 is used: to translate between IPv4 and

00:04:08.100 IPv6 addresses.

00:04:08.966 In this model, an ISP, or other

00:04:12.133 network operator, runs IPv6 internally in their

00:04:15.200 network, and does not support IPv4.

00:04:17.800 This gives the ISP a clean,

00:04:20.433 modern, and future-proof network.

00:04:22.166 The ISP also runs two sets of

00:04:25.333 NATs.

00:04:26.500 For customers that want to use IPv4

00:04:29.666 internally, the customer uses a private IPv4

00:04:32.733 network, and the NAT translates the IPv4

00:04:35.766 packets into IPv6 packets when they leave

00:04:38.833 the customer’s network. The principle is the

00:04:41.900 same as the NAT routers we discussed

00:04:44.933 in the last part of this lecture,

00:04:48.000 except that rather than rewriting packets with

00:04:51.066 private IP addresses to have public IPv4

00:04:54.100 addresses, the NAT rewrites the entire IPv4

00:04:57.166 header and replaces it with an IPv6 header.

 

00:04:59.133 When packets get to the edge of

00:05:01.900 the ISP’s network, where it connects to

00:05:05.700 the public Internet, they’re either forwarded as

00:05:08.566 native IPv6 if the destination is accessible

00:05:11.400 via IPv6, or translated to IPv4 by

00:05:14.266 another NAT.

00:05:15.100 The expectation in this approach to running

00:05:18.033 a network is that, over time,

00:05:20.500 the number of customers and destination networks

00:05:23.333 that need IPv4 will go down,

00:05:25.800 and more traffic will run IPv6 end-to-end.

00:05:28.633 NAT is used as a, hopefully temporary,

00:05:31.500 workaround.

 

00:05:33.000 The third reason to use NAT is

00:05:35.666 to avoid renumbering.

00:05:36.833 Networks that have a public IP address

00:05:39.600 range tend, over time, to end up

00:05:42.266 hard coding IP addresses from that range

00:05:44.966 into configuration files, applications, and settings.

00:05:47.266 This is a mistake. Applications should always

00:05:49.933 use DNS names, to allow the IP

00:05:52.600 addresses to change, but people do it

00:05:55.300 anyway.

00:05:56.433 The result is that it’s difficult to

00:05:59.200 change the IP addresses used by machines

00:06:01.866 on a network. The longer a host

00:06:04.566 has used a particular IP address,

00:06:06.866 the more likely it is that something,

00:06:09.533 somewhere, on the network has that address

00:06:12.200 hard-coded, and will fail if the host’s address changes.

 

00:06:15.033 If a network has an IP address

00:06:17.633 range delegated to it from its ISP,

00:06:20.300 what’s known as a provider allocated IP

00:06:22.933 address range, and wants to change ISP,

00:06:25.600 then it will need to change the

00:06:28.233 IP address range it uses to one

00:06:30.900 delegated from its new provider. Many organisations

00:06:33.533 have found this sufficiently difficult that it’s

00:06:36.200 easier to keep the old IP addresses

00:06:38.833 internally, and use a NAT to translate

00:06:41.500 addresses to the range assigned by the

00:06:44.133 new ISP.

 

00:06:45.000 A similar problem can occur if one

00:06:47.900 company buys another, and has to integrate

00:06:50.800 the IT systems of the new company

00:06:53.700 into its existing network.

00:06:55.366 IPv6 has better auto-configuration support than IPv4,

00:06:58.366 and tries to make renumbering easier,

00:07:00.866 but it’s not clear how well this

00:07:03.766 works. As a result, some network equipment

00:07:06.666 vendors have started selling NATs that translate

00:07:09.566 between two different IPv6 prefixes, to ease renumbering.

 

00:07:12.833 In both cases, a better approach is

00:07:15.833 that an organisation gets what’s known as

00:07:18.666 provider independent IP addresses, directly from one

00:07:21.500 of the Regional Internet Registries, so it

00:07:24.333 owns the IP addresses it uses.

00:07:26.766 In this case, the organisation pays its

00:07:29.600 ISP to route traffic to the addresses

00:07:32.433 it owns, and can move to a

00:07:35.266 new ISP without renumbering.

 

00:07:38.000 Given these reasons why NAT routers will

00:07:40.733 be used, despite their problems, what are

00:07:43.466 the implications of NAT routers for TCP

00:07:46.200 connections?

00:07:47.333 Well, as I’ve explained, outgoing connections create

00:07:50.166 state in the NAT, so replies can

00:07:52.900 be translated to reach the correct host

00:07:55.633 on the private network. The question is,

00:07:58.366 then, how does the NAT know what translation state to set up?

 

00:08:01.833 The way this works is that the

00:08:04.966 NAT router looks at TCP segments it’s

00:08:07.966 translating and forwarding, and watches for packets

00:08:10.933 representing a TCP connection establishment handshake.

00:08:13.466 If the NAT sees an outgoing SYN

00:08:16.466 packet, followed by an incoming SYN-ACK,

00:08:19.000 then an outgoing ACK, with matching sequence

00:08:21.966 and acknowledgment numbers, then it can infer

00:08:24.966 that this is the start of a

00:08:27.933 TCP connection, and set up the appropriate translation.

 

00:08:31.000 TCP connections have a similar exchange at

00:08:33.800 the end of the connection, with FIN,

00:08:36.633 FIN-ACK, and ACK packets. The NAT router

00:08:39.433 can watch for these exchanges, and infer

00:08:42.266 that the corresponding TCP connections have finished,

00:08:45.066 and that the translation state can be

00:08:47.900 removed.

00:08:49.033 Unfortunately, applications and hosts sometimes crash,

00:08:51.566 and connections disappear without sending the FIN,

00:08:54.366 FIN-ACK, and ACK packets. For this reason,

00:08:57.200 NAT routers also implement a timeout.

00:08:59.600 If a connection waits too long between

00:09:02.433 sending packets, the NAT will assume it’s

00:09:05.233 failed, and remove the translation state.
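
As a rough illustration of the per-flow state a NAT might track while watching the handshake, teardown, and idle timeout just described, here is a simplified model; it is not code from any real NAT.

```c
/* Simplified model of per-flow TCP tracking in a NAT. Transitions are
 * driven by the SYN / SYN-ACK / ACK handshake, the FIN exchange at the
 * end of the connection, and an idle timer as a fallback. */
enum tcp_track_state {
    TRACK_NONE,          /* nothing seen yet for this flow                 */
    TRACK_SYN_SENT,      /* outgoing SYN observed                          */
    TRACK_SYNACK_SEEN,   /* incoming SYN-ACK with matching numbers         */
    TRACK_ESTABLISHED,   /* outgoing ACK completed the handshake           */
    TRACK_CLOSING,       /* FIN seen in one direction                      */
    TRACK_CLOSED         /* FIN exchange complete, or idle timer expired:
                            the translation state can now be removed       */
};
```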

 

00:09:07.066 The recommendation from the IETF is that

00:09:09.433 NATs use a two-hour timeout,

00:09:11.533 but measurements have shown that many NATs

00:09:13.966 ignore this and use a shorter timer.

00:09:16.400 The result is that long-lived TCP connections,

00:09:18.966 that would otherwise go idle, need to

00:09:21.400 send something, even if just an empty

00:09:23.833 TCP segment, every few minutes, to prevent

00:09:26.266 NATs on the path from timing out

00:09:28.700 and dropping the connection.

00:09:30.100 If you’ve ever used ssh to login

00:09:32.666 to a remote system, gone to do

00:09:35.100 something else, then come back after a

00:09:37.533 couple of hours and wondered why the

00:09:39.966 ssh connection has failed, this may well

00:09:42.400 be due to NAT timeout.

00:09:44.166 The other issue, as I mentioned at

00:09:46.700 the start of this part, is that

00:09:49.133 the NAT won’t have state for incoming

00:09:51.566 connections, unless manually configured to do so.

00:09:54.033 This makes it difficult to run a

00:09:56.466 server or peer-to-peer application behind the NAT.
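
One common workaround for the idle timeout described above is to enable TCP keep-alives on long-lived connections, so the stack periodically sends an empty segment. The sketch below assumes a Linux system (TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are Linux-specific option names), and the intervals chosen are illustrative.

```c
/* Sketch: enable TCP keep-alives with a short idle interval, so that an
 * otherwise idle connection still sends traffic and keeps NAT state alive. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int enable_keepalive(int fd) {
    int on    = 1;
    int idle  = 120;   /* start probing after two minutes of idleness   */
    int intvl = 30;    /* then probe every 30 seconds                   */
    int count = 4;     /* declare the connection dead after 4 failures  */

    if (setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof on)    < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof idle)  < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &count, sizeof count) < 0) return -1;
    return 0;
}
```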

 

00:10:00.000 The implications of NAT for UDP flows

00:10:02.866 are similar to those for TCP,

00:10:05.366 except that the lack of connections with

00:10:08.233 UDP complicates things.

00:10:09.466 For TCP, a NAT can watch for

00:10:12.466 the connection establishment and teardown segments,

00:10:14.933 and know when the TCP connections start

00:10:17.800 and finish. TCP connections can fail without

00:10:20.700 sending the FIN, FIN-ACK, ACK exchange,

00:10:23.166 but this is rare, and NAT routers

00:10:26.033 generally rely on watching the TCP connection

00:10:28.933 setup and teardown messages to manage translation state.

00:10:32.966 UDP, on the other hand, has no

00:10:35.133 connection establishment,

00:10:36.900 since it has no concept of connections.

 

00:10:39.133 This is not a great problem when

00:10:41.400 it comes to establishing state in a

00:10:43.833 NAT. If the NAT sees any outgoing

00:10:46.233 UDP packet with a particular address and

00:10:48.633 port, it sets up the state in

00:10:51.066 the NAT to allow replies.

00:10:52.766 The problem comes with knowing when to

00:10:55.300 remove that translation state in the NAT.

00:10:57.700 Since UDP has no “end of connection”

00:11:00.100 message, the only way to do this

00:11:02.533 is with a timeout.

 

00:11:04.000 The most widely used UDP application,

00:11:06.200 historically, has been DNS. DNS clients tend

00:11:08.800 to contact a lot of different servers,

00:11:11.400 but exchange only a small amount of

00:11:13.966 data with each. As a result,

00:11:16.200 many NATs have very short timeouts -

00:11:18.766 on the order of tens of seconds

00:11:21.366 - for UDP translation state, to prevent

00:11:23.933 them accumulating state for too many UDP flows.

 

00:11:27.166 An unfortunate consequence of this, is that

00:11:30.166 applications that use UDP, such as video

00:11:33.300 conferencing and gaming, must send packets frequently,

00:11:36.466 in both directions, to make sure the

00:11:39.633 NAT bindings stay open. The IETF recommends

00:11:42.766 that such applications send and receive something

00:11:45.933 at least once every 15 seconds.

00:11:48.633 This can generate unnecessary traffic.
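
A minimal sketch of what such a UDP keep-alive might look like, assuming fd is a connected UDP socket; the payload and the blocking loop are purely illustrative, and a real application would schedule this alongside its normal traffic.

```c
/* Sketch: send a small UDP packet every 15 seconds so that NAT bindings
 * on the path are refreshed rather than timed out. */
#include <unistd.h>
#include <sys/socket.h>

static void udp_keepalive_loop(int fd) {
    const char probe[] = "keepalive";          /* application-defined payload */
    for (;;) {
        send(fd, probe, sizeof probe, 0);      /* refreshes the NAT binding   */
        sleep(15);                             /* at least once every 15 s    */
    }
}
```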

 

00:11:51.000 There is one benefit, though, that comes

00:11:53.733 from the lack of connection establishment signalling

00:11:56.500 in UDP. With TCP, the NAT can

00:11:59.233 see the SYN, SYN-ACK, ACK exchange,

00:12:01.600 and knows the exact addresses and ports

00:12:04.333 that the client and server are using.

00:12:07.066 This allows the NAT to create a

00:12:09.833 very specific binding, and reject traffic from

00:12:12.566 other addresses.

00:12:13.366 These very specific bindings are a security

00:12:16.200 benefit, but make peer-to-peer connections harder to

00:12:18.933 establish. UDP applications tend to be more

00:12:21.700 flexible in where they accept packets from,

00:12:24.433 so NATs generally establish bindings that allow

00:12:27.166 any UDP packets that arrive on the

00:12:29.933 correct port to be translated and forwarded

00:12:32.666 across the NAT. This makes peer-to-peer

00:12:35.400 connection establishment much easier for UDP,

00:12:38.166 as we’ll see in the next part.

 

00:12:42.000 NATs work around three real problems:

00:12:44.566 lack of IPv4 address space, IPv4 to

00:12:47.600 IPv6 transition, and renumbering. They work well

00:12:50.600 for client-server applications, where the client is

00:12:53.633 behind the NAT and the server is

00:12:56.633 on the public Internet, but make it

00:12:59.633 hard to run peer-to-peer applications and to

00:13:02.666 host servers on networks that use NATs.

 

00:13:04.733 This encourages centralisation

00:13:06.733 of the Internet infrastructure

00:13:08.200 onto cloud providers and, as we’ll see

00:13:11.700 in the next part, greatly complicates certain

00:13:13.566 classes of application.

Part 5: NAT Traversal and Peer-to-Peer Connection Establishment

The final part of the lecture discusses NAT traversal and peer-to-peer connection establishment. It outlines the binding discovery process, by which a client can establish that it's behind a NAT and find the external IP address of that NAT, and the ICE algorithm for candidate exchange and peer-to-peer connection establishment.

Slides for part 5

 

00:00:00.133 In the final part of this lecture,

00:00:03.400 I’d like to discuss the problem of

00:00:05.766 NAT traversal. That is, how applications can

00:00:08.166 work around the presence of NAT routers

00:00:10.533 to establish peer-to-peer connections.

 

00:00:13.000 As I described in the previous part,

00:00:15.633 NATs are designed to support outbound connections

00:00:18.266 from a client in the private network

00:00:20.900 to a server on the public Internet,

00:00:23.533 and this use case works well.

00:00:25.766 Other scenarios are less successful.

00:00:27.766 Incoming connections, to a server located in

00:00:30.500 the private network, will fail. This happens

00:00:33.133 because the NAT can’t know how to

00:00:35.733 translate the incoming packets. There are

00:00:38.366 workarounds for this, that involve manually configuring

00:00:41.000 the NAT to forward incoming connections to

00:00:43.633 the correct device, but this is difficult

00:00:46.266 to do correctly.

00:00:47.400 Similarly, peer-to-peer connections through a NAT will

00:00:50.133 also fail, unless the packets are sent

00:00:52.766 in a way that makes the NAT,

00:00:55.400 or the NATs if there are several

00:00:58.033 peers all located in private networks,

00:01:00.266 think that a client-server connection is being

00:01:02.900 opened, and that the response is coming

00:01:05.533 from the server. In the following,

00:01:07.800 I’ll talk about how this can be

00:01:10.533 arranged.

 

00:01:12.000 The figure shows an example where two

00:01:14.866 hosts, A and B, are trying to

00:01:17.733 establish a direct peer-to-peer connection.

00:01:20.600 For example, this could be two devices

00:01:23.466 in people’s homes that are trying to

00:01:26.366 setup a video call.

00:01:28.000 Each of these hosts is in a

00:01:30.966 private network, and is connected to the

00:01:33.833 public Internet via a NAT. It’s possible,

00:01:36.700 indeed likely, that if these are home

00:01:39.566 networks, then both of the private networks

00:01:42.433 will be using the IP address range

00:01:45.333 192.168.0.0/16, since that’s the default for most

00:01:48.200 home NAT routers. A consequence is that

00:01:51.066 Host A and Host B could both

00:01:53.933 be using the same private IP address,

00:01:56.800 for example both hosts could be using

00:01:59.666 IP address 192.168.0.2.

 

00:02:01.000 This isn’t a problem, since Host A

00:02:03.566 and Host B are on different private

00:02:06.100 networks, each hidden behind a different NAT.

00:02:08.666 The two NATs have different public IP

00:02:11.233 addresses on the external interfaces of the

00:02:13.800 routers, and what’s used internally is not

00:02:16.333 visible to the rest of the network.

 

00:02:19.000 How do these two hosts go about

00:02:21.433 establishing a connection?

00:02:22.500 Well, Host A can’t send a packet

00:02:25.033 directly to Host B, because Host B has

00:02:27.466 the same private IP address. If it tries,

00:02:29.900 the packet will come straight back to

00:02:32.366 itself!

00:02:33.466 Rather, in order to connect to Host

00:02:36.000 B, Host A will have to discover

00:02:38.433 the external address and port number that

00:02:40.866 NAT B is using for packets sent

00:02:43.333 by Host B. It can then send

00:02:45.766 its packets to NAT B, that will

00:02:48.200 translate and forward them to host B.

 

00:02:50.266 To do this, the two peers,

00:02:52.233 Host A and Host B, both make

00:02:54.866 connections to a referral server located somewhere

00:02:57.500 on the public Internet. This is shown

00:03:00.100 in the dashed red lines on the

00:03:02.733 slide. They ask that server where their

00:03:05.333 packets appear to be coming from.

00:03:07.600 This process is known as binding discovery,

00:03:10.200 and lets the hosts find out how

00:03:12.833 their NAT is translating packets. The result

00:03:15.466 is a candidate address for each host,

00:03:18.066 that it thinks is the external address

00:03:20.700 of the NAT that will translate incoming

00:03:23.300 packets and forward them to it.

00:03:25.566 The peers then exchange these candidate addresses

00:03:28.266 with each other, via the referral server.

 

00:03:31.000 Once they’ve received the candidate addresses from

00:03:33.466 their peer, the two hosts systematically send

00:03:35.933 probe packets, to check if any of

00:03:38.400 these candidates actually work to reach the

00:03:40.866 peer. That is, the hosts check if

00:03:43.366 the outgoing probe packets they send will

00:03:45.833 correctly setup translation state in the NAT,

00:03:48.300 so that incoming probes from the peer

00:03:50.766 will be translated and forwarded to them.

00:03:53.233 And they check that there are no

00:03:55.700 firewalls that are blocking the traffic.

00:03:57.833 If the probes are successfully received,

00:04:00.033 in both directions, then the two hosts

00:04:02.500 can switch to using the direct peer-to-peer

00:04:04.966 path, shown as the solid blue line

00:04:07.433 on the slide, and no longer need

00:04:09.933 the server.

00:04:10.633 If the probes fail, then a direct

00:04:13.200 peer-to-peer connection may not be possible,

00:04:15.300 and the hosts may have to relay

00:04:17.766 all traffic via the referral server.

 

00:04:21.000 The process of finding out what translations

00:04:23.633 a NAT is performing is known as

00:04:26.233 NAT binding discovery.

00:04:27.366 The Session Traversal Utilities for NAT,

00:04:29.700 STUN, is a commonly used protocol that

00:04:32.333 performs NAT binding discovery in the Internet.

00:04:34.933 When a host on a private network

00:04:37.666 sends a packet to a host on

00:04:40.266 the public network, the NAT at the

00:04:42.900 edge of the private network will translate

00:04:45.533 the source IP address and port number

00:04:48.133 in the packet. The host on the

00:04:50.766 private network doesn’t know what translation has

00:04:53.366 been done, but the server that receives

00:04:56.000 the packet can inspect its source address

00:04:58.633 and port, to find out where it

00:05:01.233 came from.

00:05:02.000 For example, when using a UDP socket,

00:05:04.700 an application can use the recvfrom() system

00:05:07.333 call to retrieve both the contents of

00:05:09.933 a UDP packet and its source address.

00:05:12.566 Similarly, for TCP connections, the accept() system

00:05:15.266 call returns the address of the client.
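
A hedged sketch of the server side of this, using recvfrom() to observe where a UDP packet appears to come from; the reply format here is invented for illustration, and the real STUN protocol defines its own message format.

```c
/* Sketch: a referral server observes the (possibly NAT-translated) source
 * address of a UDP packet and reports it back to the sender. */
#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

static void report_source(int fd) {
    char               buf[1500];
    struct sockaddr_in src;
    socklen_t          srclen = sizeof src;

    ssize_t n = recvfrom(fd, buf, sizeof buf, 0,
                         (struct sockaddr *) &src, &srclen);
    if (n < 0) {
        return;
    }

    /* src now holds the server reflexive address: the address the packet
     * appears to come from after any translation on the path. */
    char addr[INET_ADDRSTRLEN], reply[64];
    inet_ntop(AF_INET, &src.sin_addr, addr, sizeof addr);
    int len = snprintf(reply, sizeof reply, "%s:%u", addr, ntohs(src.sin_port));
    sendto(fd, reply, (size_t) len, 0, (struct sockaddr *) &src, srclen);
}
```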

 

00:05:18.000 The server then replies to the client,

00:05:20.333 telling it where the packet appeared to

00:05:22.633 come from. This is what’s known as

00:05:24.966 a server reflexive address. That is,

00:05:26.933 the address that a server thinks the

00:05:29.266 client has.

00:05:29.933 If there’s a NAT between the client

00:05:32.333 and the server, then the server reflexive

00:05:34.666 address will be different to the address

00:05:36.966 from which the client sent the packet.

00:05:39.300 If the client’s address and the server

00:05:41.600 reflexive address are the same, the client

00:05:43.933 knows there’s no NAT between it and

00:05:46.233 the server.
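
On the client side, one way to use that reply is to compare the reflexive address with the socket's own local address. The sketch below is illustrative, and assumes the socket is already bound or connected, since getsockname() on an unbound socket just returns the wildcard address.

```c
/* Sketch: if the server reflexive address differs from the address the
 * client's socket is actually using, something on the path is rewriting
 * the packets, i.e. the client is behind a NAT. */
#include <stdbool.h>
#include <sys/socket.h>
#include <netinet/in.h>

static bool behind_nat(int fd, const struct sockaddr_in *reflexive) {
    struct sockaddr_in local;
    socklen_t          len = sizeof local;

    if (getsockname(fd, (struct sockaddr *) &local, &len) < 0) {
        return false;   /* can't tell; real code would report "unknown" */
    }
    return local.sin_addr.s_addr != reflexive->sin_addr.s_addr
        || local.sin_port        != reflexive->sin_port;
}
```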

 

00:05:47.000 You might ask why a host that’s

00:05:49.300 in a private network doesn’t just ask

00:05:51.633 its NAT how it will translate the

00:05:53.933 packets. There are two reasons.

00:05:54.933 The first is that by the time

00:05:57.333 we realised that binding discovery was needed,

00:05:59.633 there were already tens of millions of

00:06:01.966 NATs deployed, with no way to upgrade

00:06:04.266 them to add a way to ask

00:06:06.566 how they’ll translate packets.

00:06:07.900 The second is that a host might

00:06:10.300 not know that it’s behind a NAT,

00:06:12.633 or might be behind more than one

00:06:14.933 NAT, and so won’t know what NAT

00:06:17.233 to ask for the binding.

 

00:06:20.000 When performing binding discovery, it’s important that

00:06:22.633 a host discovers every possible candidate address

00:06:25.266 on which it might be reachable.

00:06:27.533 For example, think about a phone that

00:06:30.266 has both 4G and WiFi interfaces.

00:06:32.533 Each of these interfaces can have an

00:06:35.166 IPv4 address and an IPv6 address,

00:06:37.400 representing the point of attachment to networks

00:06:40.033 it directly connects to. This could be

00:06:42.666 a total of four possible IP addresses

00:06:45.300 for the phone.

00:06:46.433 The phone may be behind IPv4 NAT

00:06:49.166 routers on each of those interfaces,

00:06:51.433 and so each interface might also have

00:06:54.066 a server reflexive address on which it

00:06:56.700 can be reached, that the host can

00:06:59.333 discover using STUN. This can give another

00:07:01.966 two addresses, bringing the total to six.

00:07:04.600 It’s unlikely, but the phone could also

00:07:07.333 be connected via one or more IPv6

00:07:09.966 NATs. This potentially gives two more server

00:07:12.600 reflexive addresses on which it can be

00:07:15.233 reached.

00:07:16.366 In case these server reflexive addresses don’t

00:07:19.100 work, the phone may also be able

00:07:21.733 to use the referral server to relay

00:07:24.366 for it, using a protocol called TURN,

00:07:27.000 acting as a proxy to deliver traffic

00:07:29.633 if a direct connection isn’t possible.

00:07:31.900 This proxy endpoint might be accessible via IPv4 and IPv6.

 

00:07:35.133 The phone might also have a VPN

00:07:37.600 connection, and be able to send and

00:07:40.233 receive traffic via the VPN, as well

00:07:42.833 as directly. That VPN endpoint could be

00:07:45.433 accessible over IPv4 or IPv6, and might

00:07:48.066 itself be behind a NAT, so it’s

00:07:50.666 necessary to check for server reflexive addresses

00:07:53.266 on the VPN interface.

00:07:54.766 Not all of these will exist for

00:07:57.500 every device, of course, but the point

00:08:00.100 is that a modern networked device is

00:08:02.700 often reachable in many different ways.

00:08:04.933 If it’s to successfully connect to another

00:08:07.566 device, in a peer-to-peer manner, it needs

00:08:10.166 to find as many of these candidate

00:08:12.766 addresses as possible.

 

00:08:15.000 Having run a binding discovery protocol to

00:08:17.533 find all its possible candidates, a host

00:08:20.066 sends the list of candidates to the

00:08:22.566 referral server, and the referral server sends

00:08:25.100 them on to its peer. Its peer

00:08:27.633 does the same, and the host receives

00:08:30.166 the peer’s candidates via the referral server.

00:08:32.666 At this point, the two hosts know

00:08:35.200 each other’s candidate addresses, and are ready

00:08:37.733 to check which of the addresses work.

00:08:40.266 Given that the two peers can communicate

00:08:42.866 via the referral server, you might ask

00:08:45.400 why the peers bother to establish a

00:08:47.933 peer-to-peer connection, and don’t instead just keep

00:08:50.466 communicating via the relay?

 

00:08:52.000 The primary reason is because a direct

00:08:54.700 peer-to-peer connection is usually lower latency than

00:08:57.366 a connection via a relay, and for

00:09:00.066 peer-to-peer applications like video calls, latency matters.

00:09:02.766 The second is that the relay server

00:09:05.533 can eavesdrop on connections that it’s relaying,

00:09:08.233 but not on direct peer-to-peer connections.

00:09:10.533 This is perhaps less of a concern

00:09:13.233 than you might think, since the traffic

00:09:15.933 can be encrypted so it can’t be

00:09:18.600 read by the server. Also, the server

00:09:21.300 knows that the call is happening anyway,

00:09:24.000 and sometimes knowledge that two people are

00:09:26.666 talking is almost as sensitive as knowing

00:09:29.366 what they’re talking about.

 

00:09:32.000 Once they’ve exchanged candidates, the two hosts

00:09:34.633 systematically send probe packets from every one

00:09:37.266 of their candidate addresses to every one

00:09:39.900 of the peer’s candidate addresses in turn,

00:09:42.533 to see if they can establish a

00:09:45.133 direct connection.

 

00:09:46.000 The idea is that a probe packet

00:09:48.533 sent, for example, from Host A to

00:09:51.100 a server reflexive address of Host B,

00:09:53.633 will open a binding in NAT A,

00:09:56.166 even if it fails to reach host

00:09:58.733 B. This open binding will allow a

00:10:01.266 later probe from Host B to the

00:10:03.800 server reflexive address of Host A to

00:10:06.366 reach Host A. This will, in turn,

00:10:08.900 trigger Host A to probe again in

00:10:11.466 response, and this time the probe from

00:10:14.000 Host A to the server reflexive address

00:10:16.533 of Host B will succeed because the

00:10:19.100 probe from Host B opened the necessary

00:10:21.633 binding on NAT B. The two hosts

00:10:24.166 then start sending traffic and keep-alive messages

00:10:26.733 on that path to keep the bindings

00:10:29.266 active, while the probing continues on all

00:10:31.800 the other candidates.

 

00:10:33.000 The probing can take a long time,

00:10:35.466 so candidate addresses are assigned a priority

00:10:37.966 based on how likely the host thinks

00:10:40.433 it is to be reachable on that

00:10:42.900 address, and on its expectation of how

00:10:45.400 well that address will perform. The checks

00:10:47.866 take place in priority order, to quickly

00:10:50.333 try to find a pair of candidates

00:10:52.833 that works.

00:10:53.533 If more than one pair of candidate

00:10:56.100 addresses succeeds, the hosts choose the

00:10:58.600 best path, for example the path with

00:11:01.066 the lowest latency, and drop the other connections.
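
For reference, RFC 8445 (discussed below) gives a simple formula for these candidate priorities, with recommended type preferences of 126 for host candidates, 110 for peer reflexive, 100 for server reflexive, and 0 for relayed candidates. The sketch shows the formula itself.

```c
/* Candidate priority formula from RFC 8445, section 5.1.2.1:
 *   priority = (2^24 x type preference) + (2^8 x local preference)
 *            + (256 - component ID)
 * Higher-priority candidate pairs are checked first. */
#include <stdint.h>

static uint32_t ice_priority(uint32_t type_pref,     /* 0..126   */
                             uint32_t local_pref,    /* 0..65535 */
                             uint32_t component_id)  /* 1..256   */
{
    return (type_pref << 24) + (local_pref << 8) + (256 - component_id);
}
```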

 

00:11:03.900 The Interactive Connectivity Establishment algorithm, ICE,

00:11:06.733 defined by the IETF in RFC 8445,

00:11:09.933 describes this probing process in detail.

00:11:12.666 When making a peer-to-peer phone or video

00:11:15.966 call, the ICE algorithm and the probing

00:11:19.133 usually happens while the phone is ringing,

00:11:22.333 so the connection is ready when the

00:11:25.533 call is answered.

 

00:11:28.000 What should be clear by now is

00:11:31.166 that NAT binding discovery, and the systematic

00:11:34.333 connection probing needed for NAT traversal,

00:11:37.033 are complex, slow, and generate a lot

00:11:40.200 of traffic. The RFCs that describe how

00:11:43.366 the process works are almost 200 pages

00:11:46.533 long, and are not easy to implement

00:11:49.700 correctly.

00:11:50.900 The result is reasonably effective for UDP

00:11:54.166 traffic.

00:11:55.366 The STUN protocol, and the ICE algorithm,

00:11:58.633 were developed to support voice-over-IP applications,

00:12:01.333 that run over UDP, and the result works well.

 

00:12:03.766 It’s less effective for peer-to-peer TCP connections.

00:12:07.233 NATs tend to be quite permissive for

00:12:10.433 UDP, translating any incoming UDP packet that

00:12:13.666 reaches the correct address and port,

00:12:16.400 but are often stricter for TCP connections,

00:12:19.633 and check for matching TCP sequence numbers,

00:12:22.833 etc. This makes peer-to-peer TCP connections less

00:12:26.066 likely to be successful.

 

00:12:29.000 In this lecture, I’ve outlined how client-server

00:12:32.166 connection establishment works, and how the use

00:12:35.333 of TLS and IPv6 can affect connection

00:12:38.500 establishment, and can require connection racing using

00:12:41.666 the “happy eyeballs” technique. I also showed

00:12:44.833 that connection establishment latency is often a

00:12:47.966 critical factor limiting the performance of TCP connections.

 

00:12:50.866 In the later parts, I outlined how

00:12:53.200 and why NAT routers are used,

00:12:56.433 their advantages and disadvantages, and how NAT

00:12:59.633 traversal techniques work to establish

00:13:01.666 peer-to-peer connections.

 

00:13:04.066 Establishing a connection used to be a

00:13:06.600 simple task. What I hope to have

00:13:09.200 shown you is that it’s no longer

00:13:11.800 simple, not in the client-server case,

00:13:14.033 and especially not when peer-to-peer connections are needed.

Discussion

Lecture 2 discussed connection establishment in a fragmented network. It started with a review of the TCP service model, and how it establishes a client-server connection. Then, it showed some of the factors that affect connection establishment performance.

One of the key factors affecting performance was latency, and the number of round trips between the client and server needed to establish the connection. With the aid of a simple example, I tried to show that latency is often the main limitation, rather than bandwidth. Think about whether the example looks reasonable to you. Did the results surprise you? Given this behaviour, would you be willing to pay your ISP for a higher-bandwidth Internet connection?

We then discussed dual-stack connection establishment for networks that have both IPv4 and IPv6 hosts. I highlighted that the IPv4 and IPv6 networks are separate, and showed that parallel connection establishment is needed. Consider how the complexity of parallel connection establishment compares to the sequential DNS look-up code shown in lab 1. How would you implement this parallel connection establishment and racing? Do you think the complexity is worth the effort to speed up connection establishment?

Finally, the lecture discussed network address translation and NAT traversal, allowing several hosts to share an IP address. It showed how NAT devices work, and discussed some of the problems they cause. Review when NAT works well and when it is problematic. Think about what types of application NATs break. Given this, why do people use NAT devices?

NAT traversal uses binding discovery and the ICE algorithm to establish peer-to-peer connections, using a referral server to exchange addresses, then probing to check whether candidate addresses work. Review this algorithm to determine whether the approach makes sense. How effectively do you think this approach to NAT traversal works? How easy do you think it would be to implement?