csperkins.org

Networked Systems H (2021-2022)

Lecture 9: Networks and Internet Routing

Lecture 9 discusses content distribution networks and Internet routing. It discusses what CDNs are and what role they play in the Internet, as a mechanism to spread load and reduce latency. The problem of inter-domain routing is then introduced, and the BGP routing protocol is reviewed as a mechanism for providing policy routing across the Internet. Some of the security limitations of BGP are highlighted, along with current approaches to try to address these. Finally, intra-domain routing, routing within a network, is briefly reviewed.

Part 1: Content Distribution Networks (CDNs)

The lecture begins by discussing content distribution networks (CDNs). It outlines what CDNs are, and the role they play in the network to help distribute load and reduce latency. Two approaches to locating CDN nodes, using DNS and anycast routing, are briefly introduced.

Slides for part 1

 

00:00:00.433 In this lecture I want to talk

00:00:01.866 about how routing works in the Internet.

 

00:00:05.866 I’ll start, in this part, by talking

00:00:07.700 briefly about content distribution networks, which we

00:00:10.733 discussed a couple of lectures ago.

 

00:00:13.133 And then, in the later parts,

00:00:14.866 I’ll talk about inter-domain routing, how network

00:00:18.966 operators cooperate to deliver data across the

00:00:22.033 wide-area network. I’ll talk briefly about routing

00:00:24.833 security. And I’ll talk about intra-domain routing,

00:00:28.366 how routing works within an operator’s network.

 

00:00:34.066 So, in this part, I’d like to

00:00:35.300 start by talking about content distribution networks.

 

00:00:38.500 I’ll talk about how CDNs help with

00:00:40.333 load balancing and latency reduction, and how

00:00:43.366 they're implemented using either the DNS or

00:00:46.066 using anycast routing.

 

00:00:51.866 So what is a content distribution network?

 

00:00:56.000 A content distribution network is a service

00:01:01.733 which provides scalable, load-balanced,

00:01:05.466 low-latency hosting for web content.

 

00:01:09.933 CDN operators, companies such as Akamai,

00:01:12.900 CloudFlare, and Fastly, are in the business

00:01:16.133 of hosting content for their customers.

 

00:01:20.533 Their customers give them web content,

00:01:23.033 and this may be images, it may

00:01:25.100 be software uploads, it may be video,

00:01:27.966 doesn't matter what it is, anything that

00:01:29.800 can sit on the web.

 

00:01:31.766 And the CDN hosts that content in

00:01:34.100 web caches that are spread around the world.

 

00:01:36.933 And some of these are located in

00:01:38.566 data centres, some of these are located

00:01:40.900 in edge networks operated by various ISPs.

 

00:01:45.833 And the idea is to reduce the

00:01:49.566 load on the main servers.

 

00:01:51.833 Rather than keeping a local copy of

00:01:54.166 the file, the customer gives it to

00:01:56.766 the CDN and links to the copy

00:01:58.166 hosted by the CDN. And this reduces

00:02:00.400 the load on the customer, and puts the load onto the CDN.

 

00:02:04.433 The idea is that the CDNs are

00:02:06.233 big enough, and have enough caches in

00:02:08.566 enough data centres and enough edge networks,

00:02:11.533 that this spreads the load throughout the world,

00:02:14.166 and prevents servers from being overloaded for high-traffic sites.

 

00:02:18.066 It reduces latency for the requests,

00:02:20.566 because the CDN will have a cache

00:02:23.366 near to the person making the request, and

00:02:27.666 can spread the load around the world. And it

00:02:30.933 reduces the chances of a successful denial

00:02:33.433 of service attack,

00:02:35.166 again just because of the sheer size

00:02:37.733 of the CDN, and the sheer number of caches it has.

 

00:02:42.233 And there are many commercial CDNs available,

00:02:45.700 I think the big three of those

00:02:47.566 are listed on the slide, but there

00:02:48.933 are certainly many others.

 

00:02:50.600 And many large organisations also run their

00:02:53.566 own CDNs. In particular, the so-called

00:02:57.500 hypergiants, big companies such as Google,

00:03:00.200 Facebook, Netflix, Apple, and the like,

00:03:04.133 all run their own large-scale content distribution networks.

 

00:03:12.033 The goal of CDNs is to distribute load.

 

00:03:16.900 They distribute load by caching content

00:03:20.700 all around the world, and by answering

00:03:22.700 most requests from a local cache.

 

00:03:27.433 And, in order to do that,

00:03:29.600 they need to have servers located everywhere.

 

00:03:32.766 They need to be very large,

00:03:34.700 and have very wide geographical distribution.

 

00:03:38.900 This means they need large-scale investment,

00:03:41.933 large-scale cooperation with network operators, with ISPs,

00:03:46.833 with Internet exchange points,

00:03:48.133 with data centres, and the like.

 

00:03:53.566 And they try to host caches as

00:03:57.333 near to the customers as they can.

 

00:04:00.733 And to give you an idea of

00:04:01.833 the scale of this, the picture at

00:04:03.900 the top, the top-right of the slide,

00:04:05.800 comes from Netflix, and shows the reach

00:04:09.033 of their caches, their CDN.

 

00:04:13.300 And we see that they’re located,

00:04:16.066 primarily, in North America and Europe,

00:04:19.433 and to a lesser extent in South

00:04:23.466 America, where the customers of Netflix are

00:04:26.433 located. But we do see that there are

00:04:30.133 massive numbers of servers in those regions,

00:04:33.833 and also servers in Australia, New Zealand, Japan,

00:04:39.300 Singapore, the Middle East, South Africa,

00:04:43.666 and so on, to try and get some more geographic spread.

 

00:04:49.033 And the statistics from Akamai, they boast

00:04:52.400 that they have more than 240,000 servers

00:04:55.566 in over 150 countries. So these are

00:04:58.233 very large scale, very widely distributed, server networks.

 

00:05:04.400 And they often get this benefit by

00:05:07.100 hosting servers within edge ISP’s networks.

 

00:05:12.166 So an ISP, such as Virgin Media

00:05:15.200 in the UK, would almost certainly be

00:05:17.166 hosting CDN caches for Netflix, Akamai,

00:05:22.233 and the various other big CDNs.

 

00:05:26.666 And there’s a mutual benefit for an

00:05:28.933 ISP to host such a cache;

00:05:30.900 there's a mutual benefit for the ISP

00:05:33.200 to work with a CDN.

 

00:05:35.500 And, clearly, from the CDN’s point of

00:05:37.500 view, it increases the reach and the

00:05:40.233 robustness of the CDN, if they can

00:05:42.000 put caches in as many networks as possible.

 

00:05:45.300 From the ISP’s point of view,

00:05:47.333 though, it reduces the load on their network.

 

00:05:50.366 The CDN can push one copy of

00:05:53.366 a file into the cache, and it

00:05:57.033 can then distribute it to the other

00:05:59.433 customers of that ISP. And it means that all of that

00:06:03.266 load is then served from within the

00:06:05.100 ISP’s network, without having to go over

00:06:07.133 the expensive wide-area links to the rest of the Internet.

 

00:06:10.400 And this avoids overloading the links from

00:06:12.933 the ISP to the outside world.

 

00:06:16.166 And the scale of some of the

00:06:19.733 popular services means that this is necessary.

 

00:06:23.866 Netflix, for example, talk about how they

00:06:27.133 distribute 10s of terabits of video per

00:06:29.400 second, and this clearly isn't possible from

00:06:32.300 a single data centre.

 

00:06:33.866 It has to be pushed out in

00:06:35.600 a hierarchy, with the central data centre

00:06:39.266 pushing data out to a CDN,

00:06:41.100 which pushes it out to edge caches,

00:06:42.600 which distribute to the customers. You can't

00:06:45.433 host all of this from a single

00:06:46.666 data centre, from a single site,

00:06:48.266 you have to spread the load around the world.

 

00:06:55.100 The other benefit of CDNs is that they reduce latency.

 

00:06:59.600 The goal is that the content is

00:07:02.333 not only spread geographically for load balancing

00:07:05.866 reasons, but it’s spread geographically so that

00:07:08.900 there's always a local copy near to

00:07:11.966 the person requesting the data.

 

00:07:15.100 That means when you request content from a popular site,

00:07:20.666 if you're based in Europe and you

00:07:23.800 request content from a popular site,

00:07:25.366 it doesn't have to go to the

00:07:28.133 US where the site is based,

00:07:31.000 but can be answered, but the request

00:07:32.733 can be answered, from a CDN cache located in Europe.

 

00:07:38.166 And this reduces the latency for your

00:07:41.000 requests, because it can be cached near to you.

 

00:07:45.433 But, of course, it requires a global

00:07:47.500 distribution of the proxy caches.

 

00:07:51.500 And, I think, one of the questions

00:07:53.700 here is how effectively CDNs are managing

00:07:56.233 to serve the entire world?

 

00:07:59.033 If we look at this picture,

00:08:01.266 from Netflix, we see that if you're

00:08:02.833 in Europe or North America, that there

00:08:06.100 are certainly many CDN caches, and there

00:08:09.433 will be one located very near to you.

 

00:08:12.466 And if you're located in certain parts

00:08:15.666 of South America, if you're located in

00:08:18.366 the populous bit of Western Australia,

00:08:21.900 if you're located in Singapore, or in

00:08:24.600 Japan, or somewhere like that,

00:08:27.433 there'll be a CDN near to you.

 

00:08:31.800 If you're in Africa, though,

00:08:35.133 you’re perhaps less well served. If you're

00:08:38.000 in large parts of Asia, you’re less well served.

 

00:08:41.600 And, increasingly, providing Internet access to developing

00:08:45.933 regions of the world needs more than

00:08:48.400 just providing connectivity.

 

00:08:50.633 If you want to provide high-quality Internet

00:08:53.133 access to parts of Africa, for example,

00:08:56.200 or parts of Asia which don't have

00:08:57.700 it yet, you don't just need to

00:08:59.333 provide bandwidth, you don't just need to

00:09:01.433 provide network links, you need to provide

00:09:03.966 data centres that can host CDN caches.

 

00:09:07.633 So it's increasing the investment needed to

00:09:12.333 get good performance.

 

00:09:19.400 And this works for cacheable static content.

00:09:24.466 CDNs have historically been focused on video,

00:09:27.900 and images, and software updates, and distributing

00:09:29.966 large files, and they work incredibly well for that.

 

00:09:34.066 But we're also starting to see people

00:09:35.766 talk about edge compute applications.

 

00:09:39.200 And applications where there is some sort

00:09:41.966 of computation going on near to the

00:09:44.366 customer. And this tends to be for

00:09:47.966 augmented reality games, and applications like that,

00:09:51.333 where you need low latency to the

00:09:53.966 compute server, the data centre.

 

00:09:56.266 And again, CDNs are starting to host

00:09:59.200 this sort of content, starting to allow

00:10:02.766 compute to be pushed into the edges.

00:10:04.933 And, again, this means that they don't

00:10:07.166 just need caching and data storage at the edges,

00:10:10.100 but they need large scale computing infrastructure.

 

00:10:12.900 Again, in developed parts of the world this

00:10:17.566 is eminently achievable. In developing, in less

00:10:21.966 well-developed parts of the world,

00:10:23.466 this infrastructure isn't yet there.

 

00:10:30.333 So, how do the CDNs work? How

00:10:33.333 do they find the nearest node in

00:10:36.966 order to deliver the content? I mean

00:10:40.566 actually delivering the content is easy,

00:10:42.533 a CDN node is just a web

00:10:44.633 server which has the files located on

00:10:48.266 it, and it just delivers them using

00:10:49.866 HTTP. The question is how you find

00:10:52.000 the right CDN node, that has the file you're looking for.

 

00:10:57.300 There’s two ways they do it.

 

00:10:59.733 Some of them use the DNS,

00:11:01.866 and some of them use a technique known as anycast routing.

 

00:11:06.900 For the CDNs that use the DNS,

00:11:10.466 the goal is that they locate the

00:11:12.500 nearest CDN node based on,

00:11:16.600 and give you an answer for where

00:11:19.533 that node is, by playing games with the DNS queries.

 

00:11:25.600 So in this case, when a customer

00:11:28.533 of the CDN gives a resource to

00:11:30.900 the CDN to be hosted, the CDN

00:11:33.566 gives that resource a unique domain name.

 

00:11:37.400 For example, if the site example.com is

00:11:41.033 trying to host an image of a

00:11:43.133 kitten on a CDN, the CDN would

00:11:45.633 give that a unique hostname. And in

00:11:49.933 this case, it’s picked a

00:11:53.933 hexadecimal name, 9BC1C…. etc.

 

00:12:00.466 But the point is that every image,

00:12:02.600 every resource, every file on the CDN,

00:12:04.700 has a unique host name.

 

00:12:07.700 Now, of course, they don't always refer

00:12:09.566 to real hosts. They’re all entries in

00:12:12.066 the DNS which point to a particular

00:12:16.733 server in the cache, but it gives the flexibility.

 

00:12:20.133 Because every file, every image, every piece

00:12:23.833 of content, on the CDN has a

00:12:25.766 different hostname, the CDN can return a

00:12:28.800 different IP address for each image,

00:12:31.900 each file, and it can point it

00:12:33.733 at an appropriate replica.

 

00:12:37.966 So, the way this works is that

00:12:39.733 the CDN returns different answers to the

00:12:43.400 DNS queries for the A or AAAA

00:12:45.833 records for the names, depending on where

00:12:49.166 they're being requested from, and what CDN

00:12:51.833 caches have that data.

 

00:12:54.966 So it looks at the IP address

00:12:57.166 of the resolver making the query.

 

00:12:59.700 And the CDN,

00:13:02.833 when it gets the name look-up for this

00:13:07.266 host, redirects it by returning a different

00:13:11.233 IP address that refers to a local cache.

 

00:13:14.800 And if I look up

00:13:17.100 this name from my home, I might

00:13:19.833 get a particular CDN cache located in the

00:13:23.900 ISP I have, and if you make

00:13:26.733 the same look-up from your home,

00:13:28.866 in a different ISP’s network, you’ll get

00:13:31.066 a different IP address back for that

00:13:32.733 name, pointing to a different cache that's

00:13:35.933 hosted by the CDN.

 

00:13:39.733 And this is based on the IP

00:13:42.066 address of the resolver, because all the

00:13:44.500 CDN sees is the requests coming from the local resolvers.

 

00:13:49.100 But the DNS resolver has an extension

00:13:51.966 called the DNS client subnet extension, so if

00:13:54.300 the client is not in the same

00:13:55.700 place as the resolver, the resolver can

00:13:58.333 tell the CDN the IP address where the client came from.

 

00:14:03.600 And the CDN has to look-up the

00:14:05.500 IP address where it sees the requests

00:14:07.033 coming from, that of the resolver or

00:14:09.533 that of the client with the client subnet extension,

00:14:12.366 and try and guess where in the world it is.

 

00:14:15.066 It needs to look-up the IP address,

00:14:16.966 and have a mapping of IP addresses to locations.

 

00:14:21.800 And this doesn't need to be particularly

00:14:23.700 accurate. The goal is to figure out

00:14:26.700 if you're in the UK, and direct

00:14:30.000 to the cache based in London,

00:14:31.733 rather than the cache based in New York, for example.

 

00:14:34.633 It doesn’t really care if it realises

00:14:37.533 that you're in Glasgow, or Manchester,

00:14:39.700 or wherever, the main thing is it

00:14:41.366 knows that you're in the UK,

00:14:42.566 so you should go to a UK-based cache.
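The DNS-based redirection described above can be sketched roughly as follows. This is a minimal illustration, not how any particular CDN implements it: the prefix-to-region table, the cache addresses, and the function name `choose_cache` are all hypothetical, and the addresses are taken from the documentation ranges rather than from real networks. It shows the coarse lookup from the address seen in the query (the resolver, or the client subnet if the resolver supplies one) to a nearby cache.

```python
import ipaddress

# Hypothetical, deliberately coarse mapping from address prefixes to regions.
PREFIX_TO_REGION = {
    ipaddress.ip_network("192.0.2.0/24"): "uk",
    ipaddress.ip_network("198.51.100.0/24"): "us-east",
}

# Hypothetical cache addresses for each region, plus a fallback.
REGION_TO_CACHE = {
    "uk": "203.0.113.10",
    "us-east": "203.0.113.20",
}
DEFAULT_CACHE = "203.0.113.99"


def choose_cache(resolver_ip, client_subnet=None):
    """Pick the cache address to return in the DNS answer.

    Uses the client subnet when the resolver passed one along,
    otherwise falls back to the resolver's own address.
    """
    if client_subnet is not None:
        lookup = ipaddress.ip_network(client_subnet).network_address
    else:
        lookup = ipaddress.ip_address(resolver_ip)
    for prefix, region in PREFIX_TO_REGION.items():
        if lookup in prefix:
            return REGION_TO_CACHE[region]
    return DEFAULT_CACHE
```

Calling `choose_cache("192.0.2.53")` picks the UK cache, while the same resolver forwarding a client subnet of `198.51.100.0/24` would be steered to the US cache instead. The mapping only needs country-level accuracy for this to be useful.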

 

00:14:46.433 And this gives the CDN very fine-grained

00:14:49.000 control. It can put the time-to-live on

00:14:52.333 its DNS responses down to be a

00:14:54.900 small number of seconds, so

00:14:58.766 every time a client looks up an

00:15:01.066 image, for every different image, every different

00:15:04.666 resource the CDN is hosting, it can

00:15:06.400 return a different answer. So it can

00:15:08.200 very rapidly load balance among its different caches,

00:15:11.933 amongst its different data centres. But it

00:15:14.700 puts a high load on the DNS,

00:15:16.500 it means there's lots of DNS queries

00:15:18.566 happening, and they can't be cached for very long.

 

00:15:25.566 The other approach CDNs use, is known as anycast routing.

 

00:15:33.166 And this doesn't play games with DNS,

00:15:37.500 it uses the DNS in a much more traditional way.

 

00:15:41.333 In that

00:15:43.333 the DNS names for the CDN are always

00:15:46.933 the same; they always just refer to

00:15:49.566 the CDN. And they always return the same answer.

 

00:15:53.466 And what it does is, each resource

00:15:56.200 the CDN is hosting, it gives it a different filename.

 

00:16:01.100 And the DNS name always maps to

00:16:03.033 the same IP address. Literally, it always

00:16:06.566 maps to the same IP address.

 

00:16:09.100 And in this example,

00:16:11.633 the CDN has three data centres,

00:16:14.900 all of which are using IP address 192.0.2.4.

 

00:16:22.266 And the CDN has many data centres

 

00:16:25.366 around the world, and they all use

00:16:27.233 the same IP address ranges. And they

00:16:29.733 advertise those IP address ranges into the

00:16:32.566 routing system, into the BGP routing system

00:16:35.566 we’ll talk about in the next part.

 

00:16:38.300 And the Internet routing then ensures that

00:16:40.533 the traffic goes to the closest data centre to the source.

 

00:16:45.066 By advertising the same IP address into

00:16:47.833 the routing from multiple places, the routing

00:16:50.533 system makes sure that the traffic goes

00:16:52.366 to the nearest data centre.

 

00:16:55.666 And it's an abuse of routing.

 

00:16:57.533 It's intentionally advertising the same IP address

00:17:00.900 from multiple places,

00:17:02.900 letting the routing take care of how the data gets there.
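The effect of anycast can be illustrated with a toy model. The sketch below assumes a small, made-up topology in which two data centres advertise the same address (such as the 192.0.2.4 used in the example), and it treats "closest" as fewest hops. Real routing chooses paths according to BGP policy rather than simple hop count, so this is only an approximation of the idea; all node names are hypothetical.

```python
from collections import deque


def nearest_site(links, sites):
    """Multi-source breadth-first search.

    Returns a dict mapping every reachable node to a tuple
    (site, hops) for the advertising site it reaches first.
    """
    best = {site: (site, 0) for site in sites}
    queue = deque(sites)
    while queue:
        node = queue.popleft()
        site, hops = best[node]
        for neighbour in links.get(node, ()):
            if neighbour not in best:
                best[neighbour] = (site, hops + 1)
                queue.append(neighbour)
    return best


# Hypothetical topology: both data centres advertise the same address.
LINKS = {
    "london-dc": ["isp-uk"],
    "newyork-dc": ["isp-us"],
    "isp-uk": ["london-dc", "transit", "home-glasgow"],
    "isp-us": ["newyork-dc", "transit", "home-boston"],
    "transit": ["isp-uk", "isp-us"],
    "home-glasgow": ["isp-uk"],
    "home-boston": ["isp-us"],
}

reach = nearest_site(LINKS, ["london-dc", "newyork-dc"])
```

Here the Glasgow home ends up at the London data centre and the Boston home at the New York one, even though both sent traffic to the same destination address.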

 

00:17:09.733 Which approach do CDNs use? Probably a mix of both.

 

00:17:16.233 Some of the large ones just use

00:17:18.733 the DNS-based approach, some of them use

00:17:21.633 a mix of both approaches, and both

00:17:23.733 approaches work, and they have different trade-offs.

 

00:17:30.133 So that's what I want to say

00:17:31.333 about CDNs. The goal of CDNs is

00:17:34.200 to provide load balancing and to reduce

00:17:36.400 latency, by allowing responses for web content

00:17:41.466 to be redirected from the original sites

00:17:45.733 to the content distribution network, which in

00:17:48.566 turn hosts that content at

00:17:51.200 numerous locations around the world which are,

00:17:54.066 hopefully, close to the end users.

 

00:17:57.766 And it can be implemented by playing

00:17:59.700 tricks with the DNS where, depending on

00:18:03.733 where you make the DNS lookup, you

00:18:06.066 get a different answer back directing you to a local cache,

00:18:09.233 or it can be implemented using anycast

00:18:11.300 routing, where the caches all have the

00:18:13.933 same IP address and the routing system

00:18:15.833 takes you to the nearest replica.

 

00:18:19.200 In the next part, I'll talk about

00:18:21.333 how the routing system, the BGP routing,

00:18:24.800 works in the Internet.

Part 2: Inter-domain Routing

The second part of the lecture introduces the inter-domain routing problem. It reviews the network-of-networks nature of the Internet, and the concept of Autonomous Systems (ASes), and introduces the AS graph and BGP as the basis for inter-domain routing. The differences between routing at the edges and in the core of the network are discussed, as is the role of the default-free zone and Internet Exchange Points. The operation of BGP as a path vector protocol, choosing the shortest policy-compliant path, is reviewed, including a discussion of routing policy, the Gao-Rexford rules, and the BGP decision process.

Slides for part 2

 

00:00:00.500 In this part I'd like to talk

00:00:01.833 about routing in the Internet, in particular

00:00:04.433 the idea of interdomain routing, routing between

00:00:08.033 networks, between autonomous systems.

 

00:00:10.633 I’ll talk about what is an autonomous system.

 

00:00:13.133 I’ll talk about the AS graph,

00:00:14.833 the graph of interconnections between networks that

00:00:17.500 form the Internet.

 

00:00:18.900 I’ll talk about how routing works at

00:00:20.666 the edges, and in the core of

00:00:22.433 the network. And I’ll talk about the

00:00:24.133 Border Gateway Protocol, BGP, which enables routing

00:00:27.666 in the Internet.

 

00:00:31.666 The Internet is a network of networks.

 

00:00:37.066 Fundamentally it's built as a set of

00:00:40.333 independently owned, independently operated, networks which

00:00:45.033 talk with each other, and which collaborate to deliver data.

 

00:00:50.166 Each of these networks is what's known

00:00:52.633 as an autonomous system. It operates independently.

 

00:00:55.633 And each network is a separate routing domain.

 

00:00:58.200 It makes its own decisions internally

00:01:00.266 how to route data around its own network.

 

00:01:05.200 The problem of interdomain routing is the

00:01:07.800 problem of finding the best path across

00:01:10.100 this set of networks. It’s the problem

00:01:12.966 of finding the best path from the

00:01:14.433 source network to the destination network,

00:01:17.233 treating the set of networks as a graph.

 

00:01:21.266 So, it's not finding the best hop-by-hop

00:01:23.933 path through the network, it’s finding the

00:01:26.100 best path between the set of networks

00:01:29.100 that comprise the Internet.

 

00:01:31.866 It treats each network in the Internet

00:01:34.100 as a node on the graph,

00:01:36.066 what’s known as the AS topology graph,

00:01:38.866 and it treats the connections between the

00:01:41.500 networks as edges in the graph.

 

00:01:43.866 And it's trying to find the best

00:01:45.800 set of networks to choose, to get

00:01:47.433 from the source to the destination across the AS graph.
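Treating networks as nodes and interconnections as edges can be sketched as a simple graph search. The example below is only an illustration under simplifying assumptions: the AS numbers are from the range reserved for documentation, the topology is invented, and "best" is taken to mean fewest ASes traversed, whereas real inter-domain routing also applies each operator's policy.

```python
from collections import deque

# Hypothetical AS topology graph: AS number -> directly connected ASes.
AS_GRAPH = {
    64496: [64497, 64498],
    64497: [64496, 64499],
    64498: [64496, 64499, 64500],
    64499: [64497, 64498, 64500],
    64500: [64498, 64499],
}


def shortest_as_path(graph, source, destination):
    """Breadth-first search for a path crossing the fewest ASes.

    Returns the list of AS numbers from source to destination,
    or None if the destination cannot be reached.
    """
    previous = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == destination:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in graph[current]:
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None
```

With this graph, `shortest_as_path(AS_GRAPH, 64496, 64500)` returns the sequence of networks to cross, not the individual router hops inside each network.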

 

00:01:54.300 As I said, the Internet is a

00:01:57.000 network of networks. Each of these networks,

00:01:59.800 each autonomous system, is independently owned and operated.

 

00:02:05.866 And the Internet routing system, the Border

00:02:09.000 Gateway Protocol, operates based on this idea

00:02:11.833 of autonomous systems, ASes.

 

00:02:15.033 And an AS is an Internet service provider,

00:02:20.200 or some other organisation that operates a

00:02:22.733 network, and that wants to participate in the routing.

 

00:02:26.866 The University of Glasgow is an autonomous

00:02:28.933 system in routing terms, for example.

 

00:02:32.866 As would the various residential

00:02:35.233 ISPs; Virgin Media or BT or TalkTalk

00:02:38.600 would be autonomous systems. But so

00:02:41.400 are large companies, Facebook, and Google,

00:02:44.000 and the like are also autonomous systems

00:02:46.166 in the routing sense.

 

00:02:49.333 Some of these organisations operate more than

00:02:51.766 one autonomous system,

00:02:54.000 perhaps because they've bought other companies which

00:02:57.700 were themselves autonomous systems, or perhaps just

00:03:00.433 split their network up for ease of administration.

 

00:03:05.233 Autonomous systems are identified by unique numbers,

00:03:09.366 known as AS numbers, and these

00:03:11.333 are allocated to them by the Regional Internet Registries.

 

00:03:15.233 The AS numbers don't really have any

00:03:17.566 meaning, except that they provide a unique

00:03:20.000 identifier for each autonomous system.

 

00:03:23.266 Essentially, they start at one, and they go up,

00:03:27.066 and each new organisation, each new network,

00:03:30.633 to join the Internet routing system gets

00:03:32.566 assigned the next autonomous system number.

 

00:03:36.033 As of March 2021 there are about

00:03:39.600 115,000 autonomous systems in the Internet,

00:03:43.033 about 115,000 autonomous system numbers have been

00:03:46.533 allocated, and about 71,000 of those are

00:03:50.133 advertised in BGP, which means about 71,000

00:03:53.700 of them are active in the Internet routing.

 

00:03:58.366 And the completely unreadable graph on the

00:04:00.700 right of the slide shows the growth

00:04:02.833 in the number of ASes advertised into

00:04:05.266 the routing system over time.

 

00:04:07.133 And there are some links on the

00:04:08.433 slide, if you want to find the

00:04:10.566 list of AS numbers, and the details

00:04:13.766 of the current AS number allocations.

 

00:04:22.000 When we talk about Internet routing,

00:04:24.566 we talk a lot about the AS topology graph.

 

00:04:28.166 And this is the set of interconnections

00:04:30.500 between the ASes. The set of interconnections

00:04:33.000 between the autonomous systems, between the networks,

00:04:35.600 that form the Internet.

 

00:04:38.466 And the AS topology graph is formed

00:04:40.800 by treating each node, each network,

00:04:45.300 each autonomous system as a node in the graph.

 

00:04:49.066 And the interconnections show the links between

00:04:51.933 the different networks, they show the different

00:04:53.866 ways in which traffic can pass between

00:04:55.933 these independently operated networks.

 

00:05:00.266 The picture we see on the slide

00:05:02.200 here is a visualisation of that graph,

00:05:05.433 produced by an organisation known as CAIDA,

00:05:08.733 the Cooperative Association for Internet Data Analysis,

00:05:12.366 which operates out of the University of

00:05:14.500 California, in San Diego.

 

00:05:17.500 And the way this works is that

00:05:19.166 each point on this graph is a

00:05:21.033 network, each point on the graph is an autonomous system.

 

00:05:25.233 And the position around the circle is

00:05:27.800 done based on geography, so it's based

00:05:30.533 on geographic location.

 

00:05:32.700 And the distance from the centre towards

00:05:37.433 the edge of the circle is based

00:05:39.566 on the number of connections that network,

00:05:42.533 that autonomous system, has to the rest of the network.

 

00:05:48.000 A network that has very few connections

00:05:50.233 to other networks will appear at the

00:05:52.100 edge, whereas a network that has very

00:05:54.166 many connections to other networks will appear

00:05:56.433 in the middle of the graph.

 

00:05:58.766 And, as I say, it's arranged geographically,

00:06:01.066 and it’s perhaps a little hard to

00:06:02.933 read. At about the eight o'clock position,

00:06:06.066 and going around anticlockwise, if we start

00:06:08.433 at the eight o'clock position we see Hawaii.

 

00:06:10.466 And, towards the bottom at about the

00:06:12.300 seven o'clock position, you've got San Diego,

00:06:15.233 and Los Angeles, and working the way

00:06:17.633 through the US,

00:06:20.333 round to New York and so on,

00:06:24.266 at about the four o'clock position.

 

00:06:26.566 The gap is the Atlantic Ocean,

00:06:29.500 and then from around the three o'clock

00:06:31.700 position, to about one o'clock, you see

00:06:34.733 we're working the way through Europe,

00:06:36.300 and the labels show the various European cities.

 

00:06:39.333 And it works its way around,

00:06:41.533 through Asia, and the Far East,

00:06:44.600 and back to Hawaii.

 

00:06:48.566 And we see, as you might expect,

00:06:51.466 the richness of the interconnections varies geographically,

00:06:55.966 based on where the people live,

00:06:57.666 and based on, to some extent,

00:06:59.766 how developed the countries are.

 

00:07:02.433 There's a lot of networks at the

00:07:04.766 edges, and there's a significant number,

00:07:07.933 a smaller but significant number,

00:07:10.700 a richly connected topology in the core.

 

00:07:14.233 And that’s what you'd expect. That the

00:07:17.033 very large Internet companies,

00:07:18.566 Facebook, and Google, and Apple,

00:07:21.766 and the content distribution networks, like Akamai

00:07:25.933 and so on, are all in the

00:07:27.000 middle, interconnecting to everyone. And then there's

00:07:29.666 lots of networks around the edges,

00:07:32.133 which just provide Internet access in particular regions.

 

00:07:38.133 And this is showing the potential ways

00:07:40.533 that the traffic can flow. It’s showing

00:07:42.900 the interconnections between the autonomous systems,

00:07:45.900 between networks. So, it's giving potential routes

00:07:49.400 along which traffic can flow through the network.

 

00:07:53.166 And this graph is for IPv4.

 

00:07:56.733 You can do the same thing for

00:07:58.266 IPv6, as we see on this slide,

00:08:00.733 and as you would expect, perhaps,

00:08:02.533 the IPv6 graph is somewhat sparser and

00:08:06.066 perhaps a bit easier to read,

00:08:07.833 because the IPv6 network is smaller.

 

00:08:12.733 It's developing in the same way, though.

 

00:08:16.133 If you look at the historic data

00:08:18.533 for the IPv4 graph, the IPv6 graph

00:08:21.600 is following the same trajectory as the

00:08:23.766 IPv4 Internet did, it’s just a few years behind.

 

00:08:30.800 And in this slide, this is data

00:08:34.066 from Google. It’s plotting the fraction of

00:08:36.600 connections going to Google that

00:08:39.400 use IPv6. We see that about a

00:08:42.600 third of the traffic to Google is

00:08:45.100 using IPv6, and that matches-up with the

00:08:49.233 graphs on the previous slide.

 

00:08:51.600 The IPv6 network is a lot less

00:08:53.733 well developed, it's a much sparser topology

00:08:57.066 compared with IPv4, and there's less traffic using it.

 

00:09:01.533 But I think that's what you'd expect.

00:09:03.700 IPv4 has had a 30-year head start on

00:09:07.400 deployment. Of course it's going to be

00:09:09.633 much more densely interconnected, of course there's

00:09:12.333 going to be much more IPv4 traffic

00:09:14.633 than IPv6 traffic. But IPv6 is developing,

00:09:17.633 it’s growing at a similar rate.

 

00:09:23.500 So how do we route traffic around

00:09:26.300 this graph? Given that mass of interconnections

00:09:29.500 we saw in the previous slides,

00:09:32.166 essentially a completely unreadable

00:09:34.333 mass of interconnections,

00:09:35.600 with so many networks, so many interconnections,

00:09:38.600 how do we route traffic around the network?

 

00:09:43.766 Well, at the edges of the network,

00:09:46.166 this is very straightforward.

 

00:09:48.133 Devices at the edge of the network

00:09:50.266 tend to have really simple routing tables.

 

00:09:54.633 If you look at machines in the

00:09:56.933 network in the Computing Science Department of

00:10:00.100 the University, for example,

00:10:02.500 all the machines in Computing Science have

00:10:05.333 IP addresses in the range

00:10:07.666 130.209.240.0/20.

 

00:10:12.266 They all have IPv4 addresses where the

00:10:15.600 first 20 bits of the address match

00:10:18.133 130.209.240.0,

00:10:22.033 and the last 12 bits identify the

00:10:24.800 particular machine on the Computing Science network.
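The split between the 20 network bits and the 12 host bits can be checked with Python's standard `ipaddress` module. This is just a worked illustration of the arithmetic for the prefix mentioned above; the variable names are arbitrary.

```python
import ipaddress

network = ipaddress.ip_network("130.209.240.0/20")
host = ipaddress.ip_address("130.209.240.48")

# 32 address bits in total, 20 of them fixed by the prefix, leaving 12.
host_bits = network.max_prefixlen - network.prefixlen

# Mask off the first 20 bits to get the network part of the address,
# and the last 12 bits to get the part identifying the machine.
network_part = int(host) & int(network.netmask)
host_part = int(host) & int(network.hostmask)

print(host_bits)                  # number of bits identifying the machine
print(network.num_addresses)      # 2 ** 12 addresses in the range
print(network.broadcast_address)  # last address in the range
```

Because 12 bits are left for the host, the range covers 4096 addresses, from 130.209.240.0 up to 130.209.255.255.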

 

00:10:29.033 And their routing table just says,

00:10:31.866 if the machine is on

00:10:34.966 the Computing Science network put it out

00:10:37.566 onto the local ethernet, and it will be delivered.

 

00:10:41.033 If it's got an IP address in the range

00:10:43.566 130.209.240.0/20

00:10:47.233 put it out on to the local

00:10:48.833 Ethernet, and it will be delivered to

00:10:51.133 the machine directly.

 

00:10:53.566 And then it has what's known as

00:10:56.033 a default entry, which says if it

00:10:57.466 has any other IP address, send it

00:11:00.600 to the machine with IP address 130.209.240.48.

 

00:11:06.700 And machine 130.209.240.48

00:11:10.366 is the router at the edge of

00:11:12.966 the Computing Science Department. It’s the router

00:11:16.300 which connects Computing Science to the rest

00:11:18.300 of campus, and from then on to the rest of the Internet.
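
The two-entry routing table described above can be sketched as follows. This is a simplified model, assuming the table is just a list of (prefix, action) pairs and that the most specific matching prefix wins; the names `ROUTES` and `lookup` are hypothetical, and a real host keeps this table in the operating system kernel.

```python
import ipaddress

# (prefix, action) pairs. 0.0.0.0/0 is the default entry: it matches every address.
ROUTES = [
    (ipaddress.ip_network("130.209.240.0/20"), "deliver directly on local Ethernet"),
    (ipaddress.ip_network("0.0.0.0/0"), "forward to 130.209.240.48"),
]


def lookup(dest: str) -> str:
    """Pick the action for dest using longest-prefix match."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, action) for net, action in ROUTES if addr in net]
    # The most specific (longest) matching prefix wins, so the default
    # entry is only used when nothing else matches.
    _, action = max(matches, key=lambda m: m[0].prefixlen)
    return action


print(lookup("130.209.247.10"))  # an address inside the /20
print(lookup("8.8.8.8"))         # an arbitrary address outside it
```

An address inside the /20 is delivered locally; anything else falls through to the default entry and is handed to the edge router.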

 

00:11:23.400 And routing at the edges is often

00:11:25.833 like this. The routing table specifies “this

00:11:29.200 is the local network” and says in

00:11:31.300 order to send to any machines on

00:11:33.833 this network, just put it out onto

00:11:35.433 the Ethernet, or on to the WiFi,

00:11:38.066 and they're all directly connected. And anything

00:11:41.433 else, send it over there. And “over

00:11:44.000 there” is the router that connects to

00:11:45.533 the rest of the network.

 

00:11:48.266 If you look at the routing tables

00:11:50.866 on machines in your home, you will

00:11:52.800 see something similar. And, most likely,

00:11:55.933 you have a private network, you're behind

00:11:58.233 a network address translator,

00:12:00.433 and the routing table will say the

00:12:02.900 network is 192.168.0.0/16, and that's on your

00:12:08.266 local WiFi, and anything else you send

00:12:11.066 to, probably, machine 192.168.0.1, which will be

00:12:16.333 the WiFi base station which will,

00:12:18.866 in turn, send it out to the rest of the Internet.

 

00:12:25.166 Routing at the edges is straightforward.

 

00:12:29.733 Routing, as you get nearer the core

00:12:32.466 of the network, gets more complex.

 

00:12:35.066 We saw at the edges, the networks

00:12:37.700 can just have a default route that

00:12:39.133 points up towards the core.

 

00:12:41.533 We see it at the bottom-right of

00:12:44.166 the figure on the slide here.

 

00:12:46.500 Where there’s some network at the edge,

00:12:48.566 which has a couple of its customers.

00:12:51.366 It has links to a couple of

00:12:53.166 customer networks, with the red arrows pointing inwards,

 

00:12:57.600 and

 

00:13:00.466 it knows what are the address ranges

00:13:03.600 assigned to those customers.

 

00:13:05.666 So it knows that if it's got

00:13:07.766 traffic to those address ranges, it can

00:13:10.266 route it down those links to those

00:13:12.033 customers. But it can have a default

00:13:14.333 route that says “for anything else,

00:13:16.600 anyone other than these two customers,

00:13:18.966 send it out towards the wider Internet”.

 

00:13:22.466 And, at the edges, this sort of

00:13:24.266 default based approach works quite well,

00:13:26.733 because there's only a small part of

00:13:28.900 the network which is known, and everything

00:13:30.766 else is “out there”.

 

00:13:34.400 As you get into the core, though,

00:13:37.766 the networks tend to need more-and-more information.

 

00:13:42.433 And, eventually, you end up in a

00:13:44.300 region of the network which is known

00:13:46.266 as the “default free zone”.

 

00:13:49.166 And the default free zone is that

00:13:51.166 part of the network which is so richly interconnected

00:13:54.366 that it stops being able to say

00:13:57.300 “send it over there to be delivered”,

00:14:00.033 because it's the part of the network

00:14:01.800 those people send it to.

 

00:14:04.333 It can't say send it towards the

00:14:06.866 middle of the Internet to be delivered,

00:14:08.600 because it is the middle of the Internet.

 

00:14:11.600 And this large core of autonomous systems

00:14:15.933 in the middle of the network,

00:14:17.200 has to keep track of essentially the

00:14:20.000 whole Internet topology, the whole AS graph.

 

00:14:23.466 So they need to store all the

00:14:25.600 paths, to all the autonomous systems in

00:14:27.533 the network, to figure out how they

00:14:29.200 can deliver data.

 

00:14:30.933 They need to keep a map of,

00:14:32.833 essentially, the entire Internet topology. And from

00:14:36.066 that, they can decide which way to

00:14:37.766 send the packets, which network to send

00:14:40.800 the packets to next, in order that they get delivered.

 

00:14:48.266 Over time, the topology, the AS graph,

00:14:51.800 is gradually getting more complex.

 

00:14:55.766 It started out being relatively simple,

00:14:58.300 like you see on the left of the slide here.

 

00:15:02.700 There were ISPs at the edges,

00:15:05.033 which provided connectivity to particular regions.

00:15:08.300 They connected to regional ISPs, which provided

00:15:11.966 wider-area connectivity. And there were a small

00:15:15.533 number of network operators that provided long-distance

00:15:18.466 international connectivity.

 

00:15:22.966 And, over time, we've gradually seen more-and-more

00:15:26.966 links being added, the links shown in

00:15:30.066 red on the right, for example.

 

00:15:32.333 We're getting a lot more interconnections at

00:15:34.633 the regional level,

00:15:37.166 a lot denser interconnections at the edges.

 

00:15:41.433 The network’s getting more-and-more connected. The ISPs,

00:15:45.466 the network operators,

00:15:46.666 the companies that form the network,

00:15:48.266 are gradually building more-and-more interconnections

00:15:50.400 between themselves.

 

00:15:52.733 And the traffic is less flowing up

00:15:54.800 towards the core, and then through this

00:15:56.766 small set of long-distance providers, and then back down,

00:15:59.766 and is increasingly going from the edges

00:16:02.700 up to some sort of regional transit

00:16:05.733 layer, or from the edges directly to

00:16:07.766 the destination network, without having to go

00:16:10.533 via these long-distance transit providers.

 

00:16:14.466 And we're seeing more

00:16:16.766 interconnection by large Internet companies, Google for

00:16:22.233 example, or the content distribution networks,

00:16:25.200 Akamai, Cloudflare, Fastly, and the like,

00:16:29.466 connecting at the regional level, connecting to

00:16:32.266 the edge ISPs directly, in order to

00:16:34.933 improve connectivity for their customers.

 

00:16:38.000 And we're seeing increasing numbers of what

00:16:40.466 are known as Internet Exchange Points.

 

00:16:42.733 Locations where network operators can come together

00:16:46.533 and interconnect themselves.

 

00:16:50.133 A prominent example of that, in this

00:16:53.266 country, is the London Internet Exchange,

00:16:55.966 where approximately 800-850 different networks

00:17:00.533 all come together in a particular building,

00:17:03.300 and just connect their networks together.

 

00:17:06.433 And the picture shows it, as you

00:17:09.800 see it's just a regular office building.

 

00:17:12.700 If you go into one of these

00:17:14.433 places, what you find is that the

00:17:16.233 core of it is just an enormous

00:17:17.566 Ethernet switch. And all of the networks

00:17:21.133 bring their equipment in, and they all

00:17:22.800 plug-in to, essentially, a massive Ethernet which

00:17:26.800 allows them to just exchange traffic.

 

00:17:30.866 And the LINX, the London Internet Exchange,

00:17:33.600 talks about how it has several terabits

00:17:37.700 per second of traffic flowing through it.

 

00:17:40.066 And this type of scale is pretty

00:17:41.566 commonplace. There’s tens, possibly hundreds, of these

00:17:46.166 in Europe, and many more of them

00:17:47.866 around the world. And they’re the points

00:17:50.866 at which this interconnection tends to happen.

 

00:17:59.933 The Internet, as we've said, is a

00:18:02.133 network of networks. The autonomous systems are

00:18:05.966 independently operated and, in many cases,

00:18:08.900 they are competitors.

 

00:18:11.500 If you think about the edges of

00:18:13.433 the network in the UK, for example,

00:18:15.966 you've got autonomous systems such as BT,

00:18:19.633 Virgin Media, TalkTalk, O2, and all

00:18:22.600 the others, all of which are competing

00:18:25.366 for business. They’re all competing to be

00:18:28.100 your Internet provider.

 

00:18:30.700 These autonomous systems have to cooperate to

00:18:34.033 deliver data between themselves, and deliver data

00:18:36.966 to the rest of the Internet,

00:18:39.133 but fundamentally they’re competitors.

 

00:18:43.033 They're competing for business, they're competing for

00:18:46.100 customers with each other.

 

00:18:48.000 And this is true at all of

00:18:49.533 the levels of the hierarchy. The autonomous

00:18:52.333 systems, the networks that comprise the Internet,

00:18:56.266 need to cooperate to make the Internet

00:18:59.000 work, but fundamentally they don't trust each

00:19:01.766 other. They’re competitors, they're operating in different

00:19:05.600 places, they have different goals, different values.

 

00:19:10.566 And, as a result, business and political

00:19:14.300 and economic relationships very much influence routing.

 

00:19:19.633 Internet routing, of course, is based on

00:19:22.966 what's the most efficient way to get

00:19:24.966 data to a particular destination, but it's

00:19:27.800 also based on policy.

 

00:19:31.566 And policy restrictions very much determine the

00:19:35.133 topology. They determine the interconnections between the

00:19:37.966 networks, and they determine which of those

00:19:40.100 interconnections are used.

 

00:19:43.966 And, in the coarsest sense, they determine

00:19:48.200 the interconnectivity, because they determine which networks

00:19:51.133 actually physically interconnect to each other.

 

00:19:54.933 Which of these networks actually have put

00:19:59.166 in place a physical link to allow

00:20:01.533 traffic to flow between themselves,

00:20:03.833 versus punting it up to some other

00:20:05.966 level of the hierarchy?

 

00:20:08.966 But also, once those links are in

00:20:11.266 place, who gets to use them? Which

00:20:13.700 traffic gets to flow over those links?

 

00:20:16.400 And not all of the traffic which

00:20:18.000 could flow over a particular link is

00:20:20.466 necessarily allowed to, depending on the policy

00:20:22.700 choices that have been made.

 

00:20:26.066 And these various policy choices might prioritise

00:20:29.733 traffic so that it goes over non-shortest

00:20:32.066 path routes, over not necessarily optimal routes.

 

00:20:37.300 Network operators might prioritise shortest path,

00:20:42.166 they might prioritise the lowest latency path

00:20:45.166 when they’re choosing a route.

 

00:20:47.433 But they might also prioritise the highest bandwidth path.

 

00:20:50.666 Or the cheapest path.

 

00:20:53.733 Or they might have restrictions which prioritise

00:20:57.333 paths which avoid certain networks, or avoid

00:21:00.233 certain parts of the world.

 

00:21:03.766 They might be trying to avoid traffic

00:21:06.166 going through certain regions, or through certain

00:21:08.133 network operators, for political reasons or for

00:21:11.966 economic reasons.

 

00:21:14.233 And these policy considerations very much influence

00:21:17.233 the way Internet routing works.

 

00:21:24.466 The routing in the Internet operates using

00:21:27.700 a system known as the Border Gateway Protocol.

 

00:21:31.600 There's two parts to the Border Gateway

00:21:33.733 Protocol, two parts to BGP.

 

00:21:36.266 External BGP and internal BGP.

 

00:21:40.733 External BGP provides the connectivity between autonomous

00:21:46.966 systems. It’s used by ASes to exchange

00:21:50.533 information with their neighbours, to tell them

00:21:53.033 which paths are available.

 

00:21:56.366 External BGP runs over TCP connections,

00:22:00.166 it runs over TCP connections between routers,

00:22:03.666 one in each autonomous system, so it

00:22:07.133 interconnects the autonomous systems.

 

00:22:10.300 And it allows those two autonomous systems

00:22:12.566 to exchange knowledge of the AS topology,

00:22:15.500 which they’ve filtered according to their policies.

 

00:22:19.000 External BGP is the way two autonomous

00:22:22.300 systems will talk to each other,

00:22:23.900 to exchange information about the structure of the network.

 

00:22:27.900 And from that they can compute

00:22:31.333 interdomain routes, they can compute the paths

00:22:35.600 that are available across the network.

 

00:22:39.566 Internal BGP is the part of BGP

00:22:43.300 that’s used within an autonomous system for

00:22:46.366 distributing that information to the other edge

00:22:48.500 routers, and for distributing that information to

00:22:50.966 the internal routers in that system.

 

00:22:54.066 Internal BGP allows an autonomous system to

00:22:58.400 coordinate routing information internally. It tells the

00:23:03.500 routers that comprise a network, how to

00:23:08.433 get to the edges, how to get

00:23:10.666 out to the rest of the world.

 

00:23:13.433 And external BGP is used for talking

00:23:16.133 between autonomous systems to coordinate their view

00:23:18.600 of what the rest of the world looks like.

 

00:23:21.733 We’ll talk about intradomain routing, routing within

00:23:25.333 a network, in one of the later

00:23:27.200 parts. But for the rest of this

00:23:28.866 part of the lecture, I want to talk about external BGP,

00:23:31.200 and how the routing between autonomous systems works.

 

00:23:38.366 At the external BGP level,

00:23:42.033 the autonomous systems, the routers at the

00:23:44.633 edges of the autonomous systems, advertise out

00:23:47.800 IP address ranges, and advertise the AS

00:23:51.100 paths in order to get to those IP address ranges.

 

00:23:56.333 And these combine to form what's known as a routing table.

 

00:24:00.466 Essentially, you have a list of IP

00:24:03.133 address ranges, what’s known as a list

00:24:05.000 of prefixes, and for each prefix,

00:24:08.133 you have the list of autonomous systems

00:24:11.366 you need to get through to get to that prefix.

 

00:24:16.333 And the table at the bottom,

00:24:18.033 is an example of a small part

00:24:20.433 of the Internet routing table.

 

00:24:22.633 And the whole thing is enormous.

00:24:24.266 The whole thing is

00:24:25.733 a few million lines of this.

00:24:27.800 And there's something like half-a-million prefixes being

00:24:31.933 advertised into the Internet, and each one

00:24:34.200 has multiple ways of getting to it,

00:24:35.800 so there are several million lines of this data.

 

00:24:39.600 What we see, highlighted in yellow,

00:24:41.866 is the entries for a particular prefix.

 

00:24:45.433 In this case, it's the IP addresses

00:24:48.133 which match 12.10.231.0/24,

00:24:53.066 where the first 24 bits match 12.10.231.0.

 

00:25:00.233 And,

00:25:02.666 in the middle column,

00:25:04.800 the next hop column, we see that

00:25:06.633 there are seven different ways of getting

00:25:11.166 to that, via seven different next hop routers.

 

00:25:15.533 And, for each of these, we see

00:25:17.100 an AS path which shows how to get there.

 

00:25:20.833 So, for example, if you look at

00:25:23.433 the first line highlighted in yellow,

00:25:25.300 we see we can get to the

00:25:26.333 prefix 12.10.231.0/24

00:25:30.700 via next hop 194.68.130.254

 

00:25:36.866 If we send a packet destined to

00:25:39.000 that prefix, to that next hop router,

00:25:42.333 it will go to the autonomous system

00:25:44.500 number 5459, which will send it to

00:25:48.133 5413, which will send it to 5696,

00:25:52.366 which will send it to 7369.

 

00:25:55.200 And 7369, because it’s at the end

00:25:58.200 of the AS path, is the one that owns the prefix.

 

00:26:02.733 And “i” just means this was gathered

00:26:06.166 by internal BGP from some other autonomous

00:26:09.333 system. It’s been passed through this router

00:26:12.466 from one of the other ASes in the network.

 

00:26:17.833 And we see the next line,

00:26:20.566 if you send a packet destined

00:26:22.966 for the same prefix, instead to the

00:26:27.233 router with IP address 158.43.133.48,

00:26:33.366 it will follow a longer path.

 

00:26:35.233 It will go via autonomous systems 1849,

00:26:38.300 702, 701, 6113, 5696, and eventually to

00:26:43.500 7369, the destination.

 

00:26:46.200 And so on.

 

00:26:48.433 And that line highlighted in red is

00:26:50.766 the preferred path. You send a packet

00:26:53.733 destined for prefix 12.10.231.0, and if you

00:26:58.033 send it to the next hop router

00:26:59.833 202.232.1.8, it will go via autonomous systems

00:27:05.133 2497, 5696, and then reach 7369 the destination.

 

00:27:14.833 And the entire routing table comprises this

00:27:18.333 set of information. It's a list of

00:27:20.166 prefixes and next hops,

00:27:22.500 which routers this autonomous system can send

00:27:25.300 the data to next, in order to

00:27:28.000 make its way towards that destination,

00:27:30.466 and the AS paths it will take,

00:27:32.633 the packets will take, if it sends them to that next hop.
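
As a rough sketch of this data structure, the table can be modelled as a mapping from each prefix to a list of (next hop, AS path) entries. Only the three entries read out above are included, although the slide shows seven for this prefix. The names `Route`, `TABLE`, and `shortest` are hypothetical, and picking the shortest AS path is an assumption made for illustration, since real routers apply policy first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    next_hop: str    # router in a neighbouring AS to send the packet to
    as_path: tuple   # autonomous systems the packet will then traverse

    @property
    def origin_as(self) -> int:
        # The last AS on the path is the one that owns the prefix.
        return self.as_path[-1]


# Three of the entries for one prefix, as read out in the lecture.
TABLE = {
    "12.10.231.0/24": [
        Route("194.68.130.254", (5459, 5413, 5696, 7369)),
        Route("158.43.133.48", (1849, 702, 701, 6113, 5696, 7369)),
        Route("202.232.1.8", (2497, 5696, 7369)),
    ],
}


def shortest(prefix: str) -> Route:
    """Return the entry with the fewest ASes on its path (ignores policy)."""
    return min(TABLE[prefix], key=lambda r: len(r.as_path))


best = shortest("12.10.231.0/24")
print(best.next_hop, best.as_path, best.origin_as)
```

For these three entries the shortest AS path happens to coincide with the preferred path highlighted in red.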

 

00:27:38.233 What are the next hop IP addresses?

 

00:27:40.800 They’re the IP addresses of the routers

00:27:43.233 this autonomous system peers with in its neighbours.

 

00:27:47.700 The particular autonomous system I’ve taken this

00:27:51.000 routing table from, connects to a router

00:27:55.266 with IP address 202.232.1.8, and that router

00:28:01.266 is in one of its neighbours,

00:28:02.500 it's in autonomous system 2497.

 

00:28:08.000 And it knows that if it sends

00:28:10.133 to that next hop, it will work

00:28:11.933 its way through autonomous systems 2497,

00:28:14.866 and 5696, and 7369 which owns the

00:28:18.600 destination IP address.

 

00:28:22.100 And that just repeats, for prefix,

00:28:23.866 after prefix, after prefix.

 

00:28:29.066 Now.

 

00:28:30.700 You can extract this information, and you

00:28:32.933 can plot it, and you can form a graph.

 

00:28:35.700 And the figure we see on the left

00:28:39.833 here, shows the view of the network

00:28:43.166 from the point where this routing table

00:28:45.366 was gathered, which is the autonomous system

00:28:47.466 highlighted in green, showing the interconnections we

00:28:50.133 found to all the others.

 

00:28:52.533 And all this is doing, is showing

00:28:54.333 each pair of adjacent autonomous systems on

00:28:56.966 the path are connected together.

 

00:29:00.300 So, if we look at the first

00:29:01.533 line, we see we can reach

00:29:05.533 the prefix 12.10.231.0 via autonomous systems 5459,

00:29:13.000 5413, 5696, and 7369.

 

00:29:18.566 And, we see from the node in

00:29:21.400 green, if we go up at about

00:29:23.200 the 10 o'clock position and around,

00:29:25.400 we follow the autonomous systems around,

00:29:27.500 we see this path through the network.

 

00:29:30.266 And if you look at each line

00:29:32.233 in turn, and look at the AS

00:29:33.633 paths, you'll see I’ve just connected

00:29:35.566 the adjacent ASes together. And it gives

00:29:37.766 you this map, this part of the Internet topology.
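
The graph construction just described, connecting each pair of adjacent ASes on every path, can be sketched like this. The AS paths are the three read out earlier; `LOCAL` is a hypothetical placeholder for the AS where the table was gathered, since its number is not given in the transcript.

```python
# Placeholder for the AS where the routing table was gathered (the green
# node on the slide). Its real AS number is not given, so this is a label.
LOCAL = "local"

# AS paths for prefix 12.10.231.0/24, as read out in the lecture.
AS_PATHS = [
    (5459, 5413, 5696, 7369),
    (1849, 702, 701, 6113, 5696, 7369),
    (2497, 5696, 7369),
]


def build_edges(paths, local=LOCAL):
    """Connect each pair of adjacent ASes on every path into a set of edges."""
    edges = set()
    for path in paths:
        hops = (local, *path)              # each path starts at the local AS
        for a, b in zip(hops, hops[1:]):   # consecutive pairs along the path
            edges.add((a, b))
    return edges


edges = build_edges(AS_PATHS)
print(len(edges))
for a, b in sorted(edges, key=str):
    print(a, "->", b)
```

Links shared between paths, such as 5696 to 7369, appear only once because the edges are stored in a set.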

 

00:29:43.333 And the arrows in red show

00:29:46.000 the preferred paths, which are highlighted

00:29:49.166 on the segment of the routing table.

 

00:29:52.000 You can see we’re starting to build

00:29:53.866 up the AS graph. We’re starting to

00:29:56.000 build up a map of the topology graph.

 

00:29:59.066 And, if you do this for the

00:30:00.500 entire graph, if you take the entire set of

00:30:03.900 entries in the routing table, you end

00:30:06.633 up with a graph like the CAIDA

00:30:08.833 graph I showed earlier.

 

00:30:16.266 So, we see that the routing works

00:30:18.566 by each autonomous system advertising some IP

00:30:22.033 address prefixes to its neighbours.

 

00:30:25.300 BGP works by each AS telling its neighbours

00:30:29.766 “I can reach these IP prefixes”,

00:30:33.033 “if you send traffic to me,

00:30:35.100 I will deliver it to these prefixes”.

 

00:30:38.833 And each AS chooses which of these

00:30:41.300 prefixes, which of these routes, to advertise

00:30:43.466 to its neighbours.

 

00:30:46.200 But it doesn't need to advertise everything

00:30:48.900 it knows. It doesn't need to advertise

00:30:51.466 out everything it receives.

 

00:30:54.166 Indeed it's common for BGP,

00:30:58.200 it's common for autonomous systems in BGP,

00:31:00.733 to drop some routes from their advertisement.

 

00:31:06.433 And, what address ranges, what AS paths

00:31:11.700 they advertise, really depends on the relationship

00:31:15.833 between the different autonomous systems.

 

00:31:20.166 And a common way this is done,

00:31:23.166 is using what’s known as the Gao-Rexford

00:31:25.333 rules. And this is a way of

00:31:27.933 categorising autonomous systems, and categorising how the

00:31:31.066 routing should work.

 

00:31:33.933 And for any autonomous system, any AS

00:31:36.833 in the Internet, it categorises the other

00:31:39.566 autonomous systems as either being

00:31:42.366 customers, peers, or providers of that AS.

 

00:31:47.700 So customers are easy. These are the

00:31:50.066 people to whom the network sells Internet service.

 

00:31:56.533 If the network we’re considering is JANET,

00:32:01.766 the Joint Academic NETwork that connects the

00:32:04.533 UK universities together, the customers are the

00:32:07.333 individual universities.

 

00:32:11.100 The peers are the other networks with

00:32:13.733 whom it exchanges traffic,

00:32:17.300 on a peer basis, without really charging.

 

00:32:22.466 The customers are people who pay you

00:32:25.633 for Internet access; the peers are the

00:32:28.500 people you agree to share traffic with at no cost.

 

00:32:32.333 And in the case of JANET,

00:32:34.500 the academic research network in the UK,

00:32:37.266 the peers might be the other academic

00:32:39.166 research networks around Europe, for example.

 

00:32:42.466 And the providers are the people who

00:32:44.900 you pay for Internet access, who this

00:32:47.100 AS pays for Internet access.

 

00:32:49.800 And this might be,

00:32:51.933 in the case of JANET, it would

00:32:53.933 be GÉANT, the pan-European

00:32:56.333 interconnect, or it might be a commercial

00:32:59.566 interconnect that connects it to the rest of the Internet.

 

00:33:04.166 And, the idea is that if you

00:33:06.100 get a route from one of your

00:33:07.500 customers, so if one of your customers

00:33:09.866 says “I have this IP address range”,

00:33:13.066 “I own these IP addresses”, you will

00:33:16.933 advertise that out to everybody.

 

00:33:20.266 One of your customers, one of the

00:33:23.033 people who is paying you for Internet

00:33:25.466 access, advertises that they own a particular

00:33:27.733 IP address range, you tell your other

00:33:30.733 customers, you tell your peers, and you

00:33:33.033 tell your provider.

 

00:33:36.033 And that makes sense. The customer is

00:33:38.633 paying you to provide Internet access,

00:33:41.200 paying you to deliver traffic for them,

00:33:43.933 but also paying you to deliver traffic

00:33:45.833 to them. So if they own a

00:33:47.700 particular IP address range, they want to

00:33:49.766 receive traffic destined for those addresses,

00:33:52.600 so you tell the rest of the Internet about it.

 

00:33:57.900 If you get a route from

00:34:01.100 one of your providers, or from one

00:34:02.666 of your peers, though, you only tell your customers.

 

00:34:10.733 This is a route you're paying to

00:34:13.200 use, rather than being paid to use,

00:34:15.933 and therefore you only tell the people,

00:34:19.466 you only tell the customers, who are

00:34:21.866 paying you to use it. And,

00:34:23.833 for a route from a provider,

00:34:25.200 this makes sense; you're explicitly paying for

00:34:27.166 access, so

00:34:28.766 you tell your customers. But you don’t

00:34:30.300 tell your peers, because you're paying for

00:34:32.166 this access. Why would you let them use it?

 

00:34:37.600 And, for routes received from your peers,

00:34:39.866 you tell your customers, because the peer

00:34:43.766 is willing to let you use this

00:34:45.866 route at no cost to your customers,

00:34:47.733 but you don't tell your provider,

00:34:49.366 you don't tell the rest of the Internet about it.
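
The export rules described above fit in a few lines. This is a minimal sketch of the rules as stated in the lecture; the function name `should_export` and the string labels are assumptions for illustration, and real operators express this in router configuration rather than code like this.

```python
CUSTOMER, PEER, PROVIDER = "customer", "peer", "provider"


def should_export(learned_from: str, advertise_to: str) -> bool:
    """Decide whether to advertise a route, following the Gao-Rexford rules.

    learned_from: relationship of the neighbour that gave us the route.
    advertise_to: relationship of the neighbour we might tell about it.
    """
    if learned_from == CUSTOMER:
        # Customer routes go to everybody: the customer pays to be reachable.
        return True
    # Routes from peers and providers are only passed on to paying customers.
    return advertise_to == CUSTOMER


for src in (CUSTOMER, PEER, PROVIDER):
    for dst in (CUSTOMER, PEER, PROVIDER):
        print(f"learned from {src:8} -> tell {dst:8}: {should_export(src, dst)}")
```

Of the nine combinations, five are allowed: customer routes go to all three kinds of neighbour, and peer and provider routes go to customers only.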

 

00:34:53.133 And the Gao-Rexford rules specify what routes are

00:34:56.333 advertised, so they specify potential ways traffic can flow.

 

00:35:01.700 This isn't saying “the traffic will go

00:35:04.600 this way”, it's saying there is a

00:35:07.133 potential route that traffic could follow,

00:35:09.400 if it wanted to get to this address.

 

00:35:15.000 And the result is what’s known as a valley-free

00:35:18.633 directed acyclic graph, a valley-free DAG.

 

00:35:22.233 And directed and acyclic means that

00:35:26.866 there's a direction: it shows you which

00:35:29.100 way to go, to get to a

00:35:30.800 particular range of IP addresses. It’s acyclic,

00:35:33.933 that means there are no loops.

 

00:35:35.733 And valley-free means it goes up,

00:35:38.733 and then along, and then down.

 

00:35:41.200 It never goes from a customer,

00:35:44.666 to its provider, then down to one

00:35:47.400 of its customers, and then back up

00:35:48.766 to another provider. It goes up,

00:35:50.800 then along, and then down.
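
The valley-free property can be checked mechanically. In this sketch a path is described by the type of each link it crosses: "up" (customer to provider), "peer" (along), or "down" (provider to customer). The function name and the link labels are hypothetical, and it assumes the usual formulation in which a path crosses at most one peer link.

```python
def is_valley_free(links) -> bool:
    """Check that a path goes up, then (optionally) along, then down.

    links: sequence of "up", "peer", or "down", one per inter-AS link.
    """
    phase = 0  # 0 = still climbing, 1 = crossed a peer link, 2 = descending
    for link in links:
        if link == "up":
            if phase != 0:      # climbing again after going along or down
                return False
        elif link == "peer":
            if phase != 0:      # a second peer link, or a peer link after down
                return False
            phase = 1
        elif link == "down":
            phase = 2
        else:
            raise ValueError(f"unknown link type: {link!r}")
    return True


print(is_valley_free(["up", "up", "peer", "down"]))  # up, along, down
print(is_valley_free(["up", "down", "up"]))          # contains a valley
```

The second path is rejected because it goes down to a customer and then back up to another provider, which is exactly the valley the rules are meant to prevent.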

 

00:35:54.233 And it's designed, essentially, to optimise for profit.

 

00:35:59.300 If someone is paying you for access,

00:36:02.533 you will advertise their routes, which allows

00:36:04.733 traffic to flow to them.

 

00:36:07.066 If you're paying for a route,

00:36:09.333 you only advertise it to people who

00:36:11.466 are paying you.

 

00:36:13.966 It’s designed to avoid advertising things which

00:36:19.200 you pay for, to people who are

00:36:20.933 not paying you for access.

 

00:36:26.833 All the autonomous systems exchange routing information

00:36:30.333 with their neighbours.

 

00:36:32.466 They exchange lists of IP prefixes,

00:36:36.633 and how they can be reached.

 

00:36:38.900 What path, what set of autonomous systems,

00:36:43.166 you have to go through to get to that prefix.

 

00:36:48.566 And they filter this based on the

00:36:49.966 policies. Maybe they apply the Gao-Rexford rules,

00:36:53.266 maybe they apply some other rules,

00:36:54.933 but they don't necessarily advertise all of

00:36:57.600 the prefixes, and all of the paths,

00:36:59.566 they know to all of their peers,

00:37:01.700 to all of their neighbours.

 

00:37:05.000 Each autonomous system has a partial view

00:37:08.133 of the AS-level topology. It knows what

00:37:11.766 its neighbours are willing to tell it.

 

00:37:15.733 And it takes that view of the

00:37:17.700 topology, and it applies a set of

00:37:19.966 rules that enforce its policy.

 

00:37:25.133 And maybe they filter out certain routes.

 

00:37:28.066 Maybe they don't tell their neighbours about

00:37:31.466 the existence of certain routes, because they

00:37:33.633 don't want them to use those routes for some reasons.

 

00:37:38.466 Maybe it filters out certain routes its neighbours tell it.

 

00:37:43.033 The neighbouring AS is willing to deliver

00:37:45.533 traffic in that direction, but it doesn't

00:37:47.500 want the traffic to flow that way,

00:37:49.166 so it filters out that prefix from its routing table.

 

00:37:54.466 Maybe it prioritises, or de-prioritises, certain other

00:37:58.633 routes. Maybe it tags particular routes for

00:38:02.366 special processing, if there's a particular business

00:38:04.933 reason to do so.

 

00:38:07.633 And it goes through, and it applies its policies.

 

00:38:13.066 The table shows the criteria people use,

00:38:18.866 and there’s a local preference, the length

00:38:22.533 of the AS path,

00:38:24.800 the type of origin; is this something

00:38:28.200 you know because it's one of your

00:38:29.700 directly connected customers, or is it something

00:38:31.933 you’ve learnt from one of the other networks?

 

00:38:35.866 There’s a multi-exit discriminator if there are

00:38:38.166 several ways of getting to a single destination.

 

00:38:42.733 And so on. There’s a bunch of policies.

 

00:38:49.300 The point is that,

00:38:53.866 just because you know the existence of

00:38:56.233 a route, doesn't mean you use it.

 

00:38:59.166 And you don't necessarily

00:39:01.400 pick the shortest routes, you pick the

00:39:03.633 shortest route that matches all your policies

00:39:05.933 after filtering the graph.
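
Filtering first and then choosing among what is left can be sketched as below. This is a heavily simplified model, not the full BGP decision process: it compares only the criteria named above (local preference, AS path length, origin type, multi-exit discriminator), it compares the MED across all candidates where real routers normally compare it only between routes from the same neighbouring AS, and the names `Candidate`, `best_route`, and `banned_ases`, along with the default values, are assumptions.

```python
from dataclasses import dataclass

# Lower rank is preferred when comparing origin types.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class Candidate:
    next_hop: str
    as_path: tuple
    local_pref: int = 100   # assumed default; higher is preferred
    origin: str = "igp"
    med: int = 0            # multi-exit discriminator; lower is preferred


def best_route(candidates, banned_ases=frozenset()):
    """Filter out routes that violate policy, then pick the best of the rest."""
    allowed = [c for c in candidates
               if not banned_ases.intersection(c.as_path)]
    if not allowed:
        return None  # a path exists, but policy does not permit using it
    return min(allowed, key=lambda c: (-c.local_pref, len(c.as_path),
                                       ORIGIN_RANK[c.origin], c.med))


routes = [
    Candidate("194.68.130.254", (5459, 5413, 5696, 7369)),
    Candidate("158.43.133.48", (1849, 702, 701, 6113, 5696, 7369)),
    Candidate("202.232.1.8", (2497, 5696, 7369)),
]

print(best_route(routes).next_hop)                      # shortest AS path
print(best_route(routes, banned_ases=frozenset({2497})).next_hop)
```

With no policy applied the shortest AS path wins; once AS 2497 is excluded by policy, the choice falls back to the next shortest path that remains.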

 

00:39:11.166 And, this means that the route that

00:39:14.066 data takes to get through the network,

00:39:17.566 may not necessarily be the shortest route

00:39:19.933 through the network.

 

00:39:21.366 It’s the shortest route that meets all policy constraints.

 

00:39:26.466 It means there may be cases where

00:39:29.633 data can't get to a particular destination,

00:39:34.533 even if there is a potential route

00:39:36.866 there, because the autonomous systems don't have

00:39:39.900 a policy which allows it to go in that direction.

 

00:39:44.500 There are cases where the network could

00:39:47.466 deliver data to a particular destination,

00:39:50.000 but won’t, because the policy choices made

00:39:52.933 by one, or more, of the ISPs

00:39:55.000 in some parts of the world,

00:39:56.266 won't allow traffic from those parts of

00:39:58.400 the world to reach that destination.

 

00:40:03.533 It's finding the shortest policy-compliant path.

 

00:40:13.900 BGP is

00:40:17.966 a very political protocol.

 

00:40:22.933 How the information is exchanged is straightforward.

 

00:40:27.033 The autonomous systems exchange lists of prefixes,

00:40:32.000 and the AS path in order to get to those prefixes.

 

00:40:36.300 How those paths are filtered and prioritised

00:40:40.300 is where it gets difficult.

 

00:40:46.566 In many cases the policy, and economic,

00:40:49.533 and political concerns outweigh the shortest path.

 

00:40:53.233 The routes are filtered, and they’re prioritised,

00:40:55.633 and they’re de-prioritised, based on policy choices,

00:40:59.466 based on how much it costs a

00:41:01.600 particular AS, and based on

00:41:05.433 political decisions as to which ASes,

00:41:09.066 which regions, which countries, to prefer.

 

00:41:14.100 And the autonomous systems are competitors,

00:41:17.633 they don't really trust each other.

 

00:41:22.700 And, as a result, it's hard to

00:41:25.200 say how BGP really works, because the

00:41:28.733 ASes won't tell anyone outside their own organisation.

 

00:41:36.866 We know what information is exchanged: we can put

00:41:40.400 a monitor at some point in the

00:41:41.633 network and see what information is reaching

00:41:44.600 that point of the network, we can see

00:41:46.800 what other ASes are willing to advertise

00:41:49.533 to a monitor at that point in the network.

 

00:41:54.300 We can get a friendly AS to

00:41:56.100 show us the BGP data they're receiving.

 

00:41:59.466 And there are projects, such as RIPE

00:42:02.466 RIS, or the RouteViews project from the

00:42:05.633 University of Oregon, which archive this data,

00:42:08.266 and store it, and make it available for people.

 

00:42:11.766 And we know the BGP decision process,

00:42:14.300 we know the algorithm the routers follow

00:42:17.233 to exchange the data. We saw that

00:42:20.333 in a previous slide, and it's deterministic

00:42:22.533 about how they pick a particular route.

 

00:42:26.900 But what we don't know is the

00:42:28.300 data which is going into that algorithm.

 

00:42:31.566 We know the set of routes that

00:42:33.000 are being advertised, but they are then

00:42:34.800 filtered, and prioritised, and de-prioritised, and munged,

00:42:38.566 before they go into the decision process in the routers.

 

00:42:41.866 And how each autonomous system does this,

00:42:44.266 is a trade secret of that AS,

00:42:46.266 and they won't tell the rest of

00:42:47.566 the network. And this makes it difficult

00:42:50.033 to evaluate how routing decisions are made in practice.

 

00:42:53.933 We can see the end result.

 

00:42:55.866 We can put a monitor in the

00:42:58.266 network somewhere and see the routing tables

00:43:00.800 that it gets. And, based on that,

00:43:02.966 we can infer how the data will

00:43:05.366 get to a particular destination.

 

00:43:07.733 But how those tables got filtered,

00:43:10.133 and what other routes exist which are

00:43:12.033 being de-prioritised and filtered out so we

00:43:14.766 can't see them, that we don't know.

00:43:17.033 We don't know the potential connections which

00:43:19.000 we're not allowed to use.

 

00:43:23.233 That's all I want to say about interdomain routing.

 

00:43:27.200 We’ve got a network of networks.

 

00:43:30.066 At the edges, the routing is easy.

 

00:43:35.100 Within an edge network, you point to

00:43:38.433 the default gateway, and

00:43:41.033 between networks at the edges, again,

00:43:44.033 you can use a default route,

00:43:46.333 you just forward towards the core.

 

00:43:48.733 In the core you have the default

00:43:50.700 free zone, everyone knows everything,

00:43:53.466 everyone has to know all of the paths.
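
The difference between an edge router with a default route and a router in the default free zone can be sketched with a longest-prefix-match lookup. This is a minimal sketch using Python's ipaddress module. The lookup helper, the table contents, and the next-hop names are hypothetical, and the prefixes are documentation ranges rather than real routes.

```python
import ipaddress


def lookup(table, destination):
    """Return the next hop for the longest matching prefix, or None
    if no entry in the table covers the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None
    _, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop


# An edge network: one local prefix, everything else goes to the default route.
edge_table = {
    ipaddress.ip_network("192.0.2.0/24"): "local",
    ipaddress.ip_network("0.0.0.0/0"): "upstream",
}

# A router in the default free zone: no default route, so it needs an
# explicit entry for every destination it can reach.
core_table = {
    ipaddress.ip_network("192.0.2.0/24"): "peer-1",
    ipaddress.ip_network("198.51.100.0/24"): "peer-2",
}
```

The edge router forwards anything it does not recognise towards the core. The core router has nothing to fall back on, so a destination with no entry in its table cannot be reached through it.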

 

00:43:56.333 And they use BGP to exchange this

00:43:58.266 data, and then they filter it,

00:43:59.900 and munge it, and process it,

00:44:01.233 to suit their policy needs, and it

00:44:03.166 becomes very opaque what happens.

 

00:44:05.966 Eventually, though, the packets get delivered,

00:44:08.333 we hope, and the Internet routing works.

 

00:44:12.066 In the next part, I'll talk about

00:44:13.866 routing security, and after that I'll talk

00:44:16.333 about intradomain routing,

00:44:18.300 how routing works within a network.

Part 3: Routing Security

Some of the security limitations of BGP routing, and the potential for accidental or malicious route hijacking, are discussed. The RPKI and MANRS are discussed as possible approaches to improving BGP routing security.

Slides for part 3

 

00:00:00.366 Having discussed interdomain routing in detail in

00:00:02.366 the previous part of the lecture,

00:00:04.666 I’d like to move on and talk briefly about routing security.

 

00:00:08.600 I’ll talk about what is Internet routing

00:00:10.766 security, and the problems of secure routing

00:00:13.633 in the Internet, and I’ll talk about

00:00:15.433 two approaches to addressing some of these

00:00:17.300 problems, the Resource Public Key Infrastructure,

00:00:20.333 RPKI, and the Mutually Agreed Norms for

00:00:23.266 Routing Security, MANRS.

 

00:00:28.600 So the issue with routing in the Internet

00:00:32.666 is being able to advertise prefixes,

00:00:37.800 address ranges, into BGP.

 

00:00:40.566 And, to be sure that only the

00:00:43.866 legitimate owner of that address range,

00:00:46.600 only the legitimate owner of a particular

00:00:48.333 prefix, can do that, such that the

00:00:51.033 traffic goes to the correct destinations.

 

00:00:57.300 And the problem with BGP, and the

00:01:00.266 problem with Internet routing security, is that

00:01:03.700 it doesn't provide this guarantee.

 

00:01:06.466 The problem with BGP is that any

00:01:08.933 autonomous system participating in BGP routing can

00:01:12.533 announce any address prefix.

 

00:01:15.066 And they can announce any address prefix

00:01:17.033 whether-or-not they own that prefix.

 

00:01:21.333 Once an autonomous system has the ability

00:01:26.900 to participate in BGP, once one of

00:01:29.833 the existing BGP speakers has agreed to

00:01:32.166 peer with it and accept routes from that AS,

00:01:36.133 the expectation is that it will announce

00:01:38.200 its own routes, announce the routes to

00:01:40.566 its own address space, and to those of its customers.

 

00:01:44.300 But, if an autonomous system chooses to

00:01:47.033 announce address space owned by someone else,

00:01:50.966 then there’s nothing to stop it from doing that.

 

00:01:55.266 And this can happen accidentally. Or it

00:01:57.833 can happen because of people maliciously trying

00:02:00.866 to redirect traffic, such that traffic to

00:02:04.266 a particular destination goes to a fake

00:02:07.133 site, or follows a

00:02:10.966 path through a site which can snoop on particular traffic.

 

00:02:16.633 And the result is that the traffic

00:02:18.100 gets misdirected. It’s what’s known as a

00:02:20.366 BGP hijacking attack.

 

00:02:24.733 And this happens frequently by accident,

00:02:28.366 and these accidental hijackings of prefixes are

00:02:32.233 a serious stability problem for the network.

 

00:02:35.333 But it can also happen due to malicious activities.

 

00:02:41.166 A well-known example of the type of

00:02:44.166 problem that can happen, is linked from

00:02:47.433 the slide, and this happened when an

00:02:50.300 Internet service provider in Pakistan

00:02:54.500 managed to announce the IP address range

00:02:58.033 for YouTube to the Internet.

 

00:03:01.100 And what was happening was that a

00:03:04.066 court in Pakistan ruled that

00:03:09.200 ISPs in that country were to block

00:03:12.666 access to YouTube,

00:03:15.100 because the content, some of the content,

00:03:18.233 on YouTube was ruled to infringe local

00:03:21.433 laws. And the ISPs in Pakistan were

00:03:24.500 told to block access to this content.

 

00:03:27.900 And the way this ISP tried to

00:03:30.066 do that, was by injecting a route

00:03:33.500 to the IP address ranges owned by,

00:03:36.166 and used by, YouTube,

00:03:38.633 to its part of the network.

 

00:03:42.966 And the idea was that all of

00:03:44.533 its customers, within the country, would see

00:03:49.533 this route advertisement, and their traffic would

00:03:52.833 be redirected to a page that says

00:03:55.066 “access to the site is blocked in this country”.

 

00:03:59.333 And, if they’d successfully sent that announcement

00:04:02.800 only into Pakistan, that would have worked

00:04:05.466 just fine. That’s a perfectly reasonable technical

00:04:09.133 method of blocking access to a particular

00:04:11.366 site, is that you inject the route that way.

 

00:04:15.766 The problem is that they misconfigured their

00:04:17.666 routers, and also announced it to the

00:04:19.366 rest of the Internet, as well as to

00:04:23.100 their customers within the country.

 

00:04:26.633 And, as a result of that,

00:04:28.000 all of the YouTube traffic in the

00:04:30.133 network was redirected to this site in

00:04:32.633 Pakistan, which stated that the traffic was blocked.

 

00:04:37.033 Now, as you can imagine, this was

00:04:39.366 noticed fairly quickly. The particular ISP that

00:04:43.200 was making the incorrect announcement was located,

00:04:46.300 and the announcement was filtered out

00:04:48.933 very near to that ISP, and so

00:04:52.433 the problem didn't last long.

 

00:04:54.433 But it does show that it's possible

00:04:56.433 to accidentally disrupt global routing operations,

00:05:01.200 in a really quite surprising, and widespread, way.

 

00:05:08.700 And this type of problem happens,

00:05:11.300 in perhaps less high-profile ways, on a

00:05:13.933 daily basis. And there are also malicious attacks, where

00:05:20.200 sites are redirected to a fake version

00:05:22.600 of a site, or traffic is redirected

00:05:25.000 so that it passes through a particular

00:05:26.833 network, where an attacker can snoop on that traffic.

 

00:05:31.700 And this is a serious problem.

00:05:33.333 We'd like to solve this problem,

00:05:35.133 we'd like to make sure that only

00:05:36.766 the legitimate owner of a prefix can

00:05:38.400 advertise routes to that prefix.

 

00:05:44.300 How is this done?

 

00:05:48.233 Well, the

00:05:50.033 current best approach to solving this is

00:05:53.033 a technique, known as the Resource Public

00:05:54.866 Key Infrastructure, RPKI.

 

00:05:59.100 And the RPKI is an attempt to secure Internet routing.

 

00:06:04.366 And what it does, is it allows

00:06:06.266 autonomous systems to make signed

00:06:08.666 route origin authorisations.

 

00:06:12.900 And these are messages which get sent in BGP

00:06:17.000 which provide a digital signature for a

00:06:20.166 particular prefix announcement.

 

00:06:23.300 So, along with the announcement that

00:06:26.033 an autonomous system owns a particular IP

00:06:30.233 address range, and can route traffic to

00:06:33.366 that address range, which goes into BGP

00:06:36.266 as normal, and you get the usual

00:06:38.300 AS paths like we saw in the previous part,

00:06:41.833 RPKI allows the autonomous systems to send

00:06:46.500 a digital signature.

 

00:06:48.633 And this also progresses through the BGP

00:06:52.266 system, and follows the same route through

00:06:54.533 BGP, and gets filtered and processed in

00:06:57.200 BGP in the same way that the

00:06:59.266 route advertisements do.

 

00:07:01.300 But it also includes a digital signature,

00:07:04.233 stating that the ISP owns this particular

00:07:07.800 address range, and signed by the next

00:07:10.033 level up in the hierarchy of the routing system.

 

00:07:15.400 So, at the top-level, the regional Internet

00:07:18.133 registries, RIPE, and ARIN, and so on,

00:07:21.233 which assign IP address ranges to ISPs,

00:07:25.300 provide a signed statement that they have

00:07:27.500 delegated a particular address range to a

00:07:30.533 particular autonomous system, a particular ISP.

 

00:07:33.500 And if that ISP delegates a subset

00:07:35.533 of that address range to one of

00:07:37.133 its customers, it can make a signed

00:07:38.833 announcement to do so, and that is, in turn, signed.

 

00:07:42.733 The signatures ripple up all the way to the root.

 

00:07:47.600 So you get this hierarchical delegation,

00:07:50.766 with digitally signed statements announcing the delegation

00:07:53.866 of the prefixes.

 

00:07:56.966 And this allows a router which receives

00:07:59.700 a prefix advertisement, and receives one of

00:08:02.233 these Route Origin Authorisation announcements, to validate

00:08:05.366 whether that prefix is authorised.

 

00:08:08.600 And the idea is that valid prefixes

00:08:10.833 will have one of these ROAs,

00:08:14.933 the Route Origin Authorisation digital signatures provided,

00:08:18.500 and the invalid prefixes, the hijacked prefixes, will not.

 

00:08:23.766 And when applying BGP policy, the other

00:08:27.100 networks that comprise the Internet can look,

00:08:29.633 and they can prefer prefixes which are

00:08:32.100 digitally signed over those which are not.

 

00:08:34.833 And that makes it harder to hijack a prefix.
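
The check a router makes against the Route Origin Authorisations can be sketched as follows. This is a minimal sketch that assumes the signatures on the ROAs have already been verified, so only the comparison between an announcement and the ROA contents is shown. It uses the three outcomes from the route origin validation procedure in RFC 6811 (valid, invalid, not found). The ROA class, the validate_origin helper, and the prefixes and AS numbers are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class ROA:
    prefix: str       # address range the authorisation covers
    max_length: int   # longest prefix length the origin may announce
    origin_as: int    # AS authorised to originate the prefix


def validate_origin(roas, announced_prefix, origin_as):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    announced = ipaddress.ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if announced.version != roa_net.version:
            continue  # never compare IPv4 with IPv6
        if not announced.subnet_of(roa_net):
            continue  # this ROA says nothing about the announcement
        covered = True
        if roa.origin_as == origin_as and announced.prefixlen <= roa.max_length:
            return "valid"
    # Covered by some ROA but matched by none: likely a hijack or a mistake.
    return "invalid" if covered else "not-found"


roas = [ROA("192.0.2.0/24", 24, 64500)]
```

A policy can then prefer valid routes and drop invalid ones. Announcements with no covering ROA come out as not found, which is still the common case while deployment is partial.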

 

00:08:39.766 And RPKI is starting to get traction.

00:08:42.866 It's a relatively new standard, it's maybe

00:08:47.966 10 years old now, and the measurements

00:08:52.133 in the paper we see linked on

00:08:54.633 the slide here, show that, as of

00:08:56.333 a couple of years ago, about 10-12%

00:08:58.933 of the IPv4 addresses

00:09:01.033 are covered by a prefix with a

00:09:03.400 valid signature, and this was growing rapidly.

 

00:09:07.466 And the links to the CloudFlare blog,

00:09:10.200 and to the isbgpsafeyet.com site,

00:09:13.600 present more up-to-date statistics, and it's continuing

00:09:20.233 to grow, and RPKI is starting to become widely used.

 

00:09:25.466 And it's starting to become possible to

00:09:27.466 validate the authenticity of the routing announcements.

 

00:09:35.933 The other approach to routing security is

00:09:39.300 a system known as MANRS.

 

00:09:42.133 And MANRS is a set of mutually

00:09:44.266 agreed norms for routing security.

 

00:09:48.066 It's a project which is sponsored by the Internet Society,

00:09:52.733 and is a collaboration between a set

00:09:55.633 of network operators to improve routing security.

 

00:10:00.000 And it's mostly there to share best practices.

 

00:10:03.866 It shares information on how to effectively

00:10:06.533 use RPKI; it shares configuration options;

00:10:10.500 it shares tips and approaches for correctly

00:10:14.133 configuring routers, for correctly configuring filtering,

00:10:19.533 for providing anti-spoofing measures; and for coordinating

00:10:23.466 responses to accidental or malicious

00:10:28.566 route hijacking when it's discovered.

 

00:10:35.500 And it's mostly there as a talking

00:10:37.700 shop, as a forum for the ISPs

00:10:39.966 to coordinate, to make sure that the

00:10:43.100 routing system is stable, to address problems

00:10:45.900 as they occur, and to share and

00:10:48.566 to develop best practices for security.

 

00:10:55.233 And that's essentially all I want to

00:10:56.733 say about routing security.

 

00:10:59.166 Historically, the Internet routing has not been

00:11:02.000 secure at all.

 

00:11:04.366 As RPKI, and as MANRS, start to

00:11:07.533 get rolled-out, we’re starting to see some

00:11:10.266 improvements here, we're starting to see people

00:11:12.466 taking this problem seriously, and trying to

00:11:15.200 bring in some security.

 

00:11:18.233 We're not there yet. The routing is

00:11:20.600 still not particularly secure. Route hijacking,

00:11:23.333 BGP hijacking, still happens on a daily

00:11:26.900 basis, but things are getting better.

Part 4: Intra-domain Routing

Moving on from the discussion of BGP and inter-domain routing, the final part of the lecture briefly reviews intra-domain routing and how it differs. The concepts of distance vector and link state routing are discussed, and the differences in scalability and convergence times are noted. The lecture concludes with a discussion of challenges in recovering from link failures in routing, including fast failover and equal cost multipath routing.

Slides for part 4

 

00:00:00.466 The previous parts of the lecture have

00:00:02.300 spoken about interdomain routing, routing between the

00:00:05.566 networks that form the Internet.

 

00:00:07.766 In this final part, I want to

00:00:09.666 talk very briefly about intradomain routing,

00:00:12.566 routing within a network, and just very

00:00:15.300 briefly recap the distance vector and link

00:00:17.666 state routing algorithms.

 

00:00:21.900 So, as we saw in the previous

00:00:24.200 parts of the lecture, BGP and interdomain

00:00:28.133 routing are about giving information on the

00:00:30.433 path to reach other networks.

 

00:00:32.766 They're about the way the set of

00:00:35.533 networks that comprise the Internet work together

00:00:40.333 to exchange information needed to route packets

00:00:44.800 across the network.

 

00:00:47.733 And BGP is very much a policy-focused

00:00:51.266 routing protocol. The challenges in interdomain routing

00:00:55.700 are primarily to do with enforcing routing policy.

 

00:01:01.700 They’re primarily to do with getting the

00:01:05.033 networks which comprise the Internet,

00:01:08.566 which are, fundamentally, competitors, to work together

00:01:13.466 enough that they can deliver data across

00:01:15.366 the network. It's about expressing the business

00:01:20.766 constraints, the economic constraints,

00:01:22.733 the political constraints,

00:01:24.033 the policy constraints, that affect the way

00:01:26.966 data is delivered.

 

00:01:30.100 The question of intradomain routing, routing within

00:01:33.633 a network, is quite different.

 

00:01:36.966 If you look at routing, how to

00:01:39.400 route traffic within an autonomous system, within a network,

00:01:43.733 you find that it's very much a single trust domain.

 

00:01:49.366 The entire network is operated by a

00:01:52.266 single operator, and that's the point of

00:01:54.233 intradomain routing, it's within a domain,

00:01:56.500 it’s within an autonomous system, it's within a network.

 

00:01:59.500 So there's a single trust domain,

00:02:01.300 and there's no real policy restrictions on

00:02:04.066 who can see the information about the

00:02:05.666 network, or on which links can be used.

 

00:02:09.900 When we're talking about BGP, and interdomain routing,

00:02:16.500 the different networks, the different parts of

00:02:19.600 the system, want to hide their internal

00:02:21.866 details. They want to hide the information

00:02:24.133 about what's going on inside their network,

00:02:25.866 from their competitors.

 

00:02:28.566 If we're considering intradomain routing, we’re routing

00:02:32.033 within a network owned and operated by

00:02:34.600 a single organisation, and the rest of

00:02:36.333 the organisation can see what's going on.

 

00:02:38.800 They can see the topology of the

00:02:40.500 network, they can understand the constraints it’s

00:02:42.733 operating under, because all of the parts

00:02:44.800 of the organisation are working together for one goal.

 

00:02:48.433 So there tend not to be policy

00:02:49.933 restrictions on who can see the topology,

00:02:52.933 or which devices can understand the constraints

00:02:56.333 on the network.

 

00:02:57.933 And there tend not to be policy

00:02:59.900 restrictions on which links can be used.

 

00:03:03.166 Certainly backup links, and so on,

00:03:05.833 exist, but there's no need to hide

00:03:07.600 those links; they’re visible to the entire system.

 

00:03:12.533 And, generally, the goal is to get

00:03:14.466 very efficient routing. We’re trying to find

00:03:17.533 the shortest path through the network.

 

00:03:20.166 Unlike interdomain routing, where the goal

00:03:23.900 is to find the shortest policy-compliant path,

00:03:26.733 the goal here is just to find

00:03:28.333 the most efficient use of the resources you have.

 

00:03:33.033 There’s two fundamental approaches that people use

00:03:36.400 for intradomain routing.

 

00:03:38.633 There’s an approach known as distance vector,

00:03:41.066 which tends to get instantiated in the

00:03:43.733 Routing Information Protocol, RIP, or there’s an

00:03:47.033 approach known as link state routing,

00:03:49.100 which has been instantiated in a protocol

00:03:52.166 called the Open Shortest Path First routing protocol, OSPF.

 

00:03:59.966 So, first off, I’ll just briefly talk

00:04:02.166 about distance vector routing.

 

00:04:04.466 The idea here is that the nodes

00:04:06.833 in the network, the routers that comprise

00:04:10.300 the network, maintain a routing table which contains

00:04:15.400 the distance they are from every other

00:04:18.033 node, and the next hop to get towards that node.

 

00:04:22.666 And we have an example on the

00:04:25.366 slide here, that shows a network with

00:04:28.100 seven nodes. And in this example,

00:04:31.366 they’re labeled with the letters A, B, C, D, etc.

 

00:04:35.433 And, in a real system, these would

00:04:37.633 have IP addresses to identify them,

00:04:40.266 but that just makes the slide complicated.

 

00:04:44.333 And we see an example of

00:04:48.066 the routing table, as shown at node A.

 

00:04:51.900 And we see that node A contains

00:04:53.700 a list of all of the other

00:04:55.033 nodes of the network, destinations B,

00:04:58.000 C, D, E, F, and G.

 

00:05:00.433 And, for each of those, it maintains

00:05:02.433 the distance, how far away it is

00:05:04.933 from that node, in number of hops.

 

00:05:06.833 So it's one hop away from node

00:05:08.666 B, it’s directly connected to B,

00:05:11.433 and it can reach it via node

00:05:13.566 B, it's directly connected. Similarly, it's one

00:05:16.500 hop away from C. It's two hops

00:05:19.066 away from D, and it knows the

00:05:20.766 next hop to get there is C, and so on.

 

00:05:25.133 And each node in the network periodically

00:05:27.533 exchanges a message with its neighbours,

00:05:29.900 where it tells its neighbours, “these are

00:05:33.400 who I think my other neighbours are,

00:05:35.433 and this is how far away I think I am from them,

00:05:39.900 and this is how far I am from the rest of

00:05:42.800 the network as well”. And this information

00:05:45.566 gradually spreads through the network.

 

00:05:48.066 And, in the first round of this

00:05:50.800 exchange, each node just finds out its

00:05:53.066 neighbours, then it finds out its neighbours’

00:05:55.433 neighbours, and then its neighbours’ neighbours’

00:05:57.700 neighbours, and so on.

 

00:06:01.500 And the protocol operates in rounds.

 

00:06:04.933 It continually exchanges this information with the

00:06:07.966 neighbours, and gradually fills in the map

00:06:10.466 of the network so it knows how

00:06:12.566 far away it is from every node

00:06:14.766 in the network, and what's the best

00:06:17.066 way of getting there.

 

00:06:19.633 And once it's done that, it just

00:06:21.300 forwards the packets on the shortest path

00:06:23.033 to the destination, based on the hop

00:06:25.066 count, based on the distance. And if

00:06:27.166 there's two ways of getting there with

00:06:28.733 the same hop count it can pick arbitrarily.
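
The exchange in rounds can be sketched in a few lines. This is a minimal sketch that assumes every node sends its whole table to every neighbour in each round and that every link costs one hop. The distance_vector helper is hypothetical. So is the seven-node topology: it is chosen to agree with the description of node A above, but it is not taken from the slide.

```python
INFINITY = 16  # RIP treats a distance of 16 hops as unreachable


def distance_vector(links, max_rounds=20):
    """Run synchronous distance vector rounds until nothing changes.
    Returns (tables, rounds), where tables[node][dest] = (hops, next_hop)."""
    nodes = sorted({n for link in links for n in link})
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Initially each node knows only about itself.
    tables = {n: {n: (0, n)} for n in nodes}

    for rounds in range(1, max_rounds + 1):
        # Every node advertises the table it held at the start of the round.
        snapshot = {n: dict(t) for n, t in tables.items()}
        changed = False
        for node in nodes:
            for nb in sorted(neighbours[node]):
                for dest, (dist, _) in snapshot[nb].items():
                    candidate = dist + 1
                    current = tables[node].get(dest, (INFINITY, None))[0]
                    if candidate < current:
                        tables[node][dest] = (candidate, nb)
                        changed = True
        if not changed:
            return tables, rounds
    return tables, max_rounds


# Hypothetical topology with seven nodes, A to G.
links = [("A", "B"), ("A", "C"), ("A", "E"), ("A", "F"),
         ("C", "D"), ("D", "G"), ("F", "G")]
tables, rounds = distance_vector(links)
```

After the first round each node knows its neighbours, after the second its neighbours' neighbours, and so on. The loop stops at the first round in which no table changes.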

 

00:06:34.533 Now, distance vector routing is

00:06:37.833 relatively straightforward, and it doesn't maintain too

00:06:41.633 much information at the nodes. All it

00:06:46.200 stores is a list of the other

00:06:47.866 nodes, and the distance, and next hop,

00:06:50.933 so the amount of state it needs

00:06:53.166 is linear with the size of the network.

 

00:06:55.800 The amount of entries in the routing

00:06:57.566 table grows linearly with the number of

00:06:59.800 nodes in the network, so it's relatively

00:07:02.300 resource efficient.

 

00:07:05.000 But it's slow to converge, because of

00:07:07.266 the way it operates in rounds,

00:07:09.366 and it has a problem where certain types

00:07:14.200 of failures can lead

00:07:16.833 to a behaviour where the distance gradually

00:07:19.700 counts up by one each iteration of

00:07:23.700 the algorithm, each iteration of the routing protocol.

 

00:07:27.233 And when a failure has happened,

00:07:29.233 it gradually counts up by one until

00:07:31.066 it gets to the representation of infinity

00:07:34.000 in the system, and takes multiple rounds

00:07:36.133 to converge and detect the failure.

 

00:07:39.366 And that behaviour can lead to very

00:07:41.766 slow convergence, and the system not being

00:07:44.900 able to recover from a link failure effectively.
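
The counting-up behaviour can be reproduced with three nodes in a line, A, B, C, where the link between B and C fails. This is a minimal sketch, assuming a plain distance vector protocol with no split horizon or other mitigation, and assuming that B hears A's stale advertisement before A learns of the failure. The count_to_infinity helper is hypothetical. The value 16 is the representation of infinity used by RIP.

```python
INFINITY = 16  # RIP's representation of "unreachable"


def count_to_infinity(infinity=INFINITY):
    """Return the (A's distance to C, B's distance to C) pair after each
    round, following the failure of the B-C link."""
    a_to_c = 2  # before the failure, A reaches C in two hops via B
    history = []
    while True:
        # B has lost its direct link, but A still advertises a route to C,
        # so B believes it can reach C through A.
        b_to_c = min(infinity, a_to_c + 1)
        # A's only route to C is via B, so it follows B's new distance.
        a_to_c = min(infinity, b_to_c + 1)
        history.append((a_to_c, b_to_c))
        if a_to_c >= infinity and b_to_c >= infinity:
            return history


history = count_to_infinity()
```

Each round the two nodes raise their distance a little further, and only when both reach 16 do they agree that C is unreachable. Until then, packets for C bounce between A and B.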

 

00:07:52.733 The alternative algorithm, which is widely used

00:07:56.566 in the network, is what's known as link state routing.

 

00:08:00.800 And the idea of link state routing

00:08:03.733 is that the nodes in the network

00:08:07.133 know, obviously, the links to their neighbours.

 

00:08:10.200 They know which other routers they're directly connected to.

 

00:08:13.633 And they know some metric about the

00:08:15.900 cost of using those links.

 

00:08:19.133 And that may just be the link

00:08:21.633 bandwidth, as a metric, or it may

00:08:24.433 be the delay, or it may be

00:08:26.233 a hard-coded metric chosen by the operator.

 

00:08:31.266 And when a node starts up,

00:08:35.366 or when a link changes, when something

00:08:37.600 changes in the network, the nodes can

00:08:39.666 flood this information throughout the network.

 

00:08:44.066 They can send to all of their

00:08:45.500 neighbours the list of directly connected nodes,

00:08:49.466 and the cost for using that link,

00:08:51.433 along with a sequence number for these messages.

 

00:08:56.533 And this gets flooded throughout the whole

00:08:59.000 network, so every node in the network

00:09:01.700 learns every other node in the network,

00:09:04.533 and what are each node’s neighbours.
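
The flooding step, and the role of the sequence number, can be sketched as follows. This is a minimal sketch that models the network as a dictionary and delivers messages through a queue rather than over real links. The flood helper, the topology, and the link costs are hypothetical.

```python
from collections import deque


def flood(topology, origin, seq, databases):
    """Flood origin's link state through the network.
    databases[node][origin] = (seq, links). Returns the messages sent."""
    links = dict(topology[origin])
    databases[origin][origin] = (seq, links)
    queue = deque((origin, nb) for nb in topology[origin])
    sent = 0
    while queue:
        sender, receiver = queue.popleft()
        sent += 1
        known = databases[receiver].get(origin)
        if known is not None and known[0] >= seq:
            continue  # already seen this sequence number: do not re-flood
        databases[receiver][origin] = (seq, links)
        for nb in topology[receiver]:
            if nb != sender:
                queue.append((receiver, nb))
    return sent


# Hypothetical topology: node -> {neighbour: link cost}.
topology = {
    "A": {"B": 1, "C": 1, "E": 1, "F": 1},
    "B": {"A": 1},
    "C": {"A": 1, "D": 1},
    "D": {"C": 1, "G": 1},
    "E": {"A": 1},
    "F": {"A": 1, "G": 1},
    "G": {"D": 1, "F": 1},
}
databases = {node: {} for node in topology}
first = flood(topology, "A", seq=1, databases=databases)
repeat = flood(topology, "A", seq=1, databases=databases)
```

The sequence number is what stops the flood from circulating forever. A node that has already stored this version of A's links drops the copy instead of passing it on, so repeating the same flood reaches A's direct neighbours and goes no further.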

 

00:09:08.500 So node A, in this example,

00:09:10.300 will flood out through the network that

00:09:13.133 it's node A, its neighbours are B,

00:09:15.466 C, E, and F, and it will

00:09:17.366 flood out the metrics, the speed of

00:09:19.333 the links for example. And this will go everywhere.

 

00:09:23.500 This will get flooded throughout this entire

00:09:25.866 network, so node B will know what

00:09:29.300 is node A and what are its

00:09:30.500 neighbours, and so will node C,

00:09:32.566 and D, and E, and F,

00:09:33.700 and G, and H. And every one of those

00:09:35.866 nodes knows that node A exists,

00:09:38.800 and which nodes it's directly connected to.

 

00:09:42.066 And this happens for every node.

 

00:09:43.866 Each node periodically floods this information out,

00:09:46.633 whenever anything changes.

 

00:09:50.166 And, over time, this means that the

00:09:52.000 entire network, all of the nodes in

00:09:53.900 the network, all the routers in the

00:09:55.666 network, get to learn all of the

00:09:58.233 other links in the network.

 

00:10:00.600 They get to know which nodes are directly connected.

 

00:10:04.333 At that point they can just draw a

00:10:06.800 complete map of the network. Every node

00:10:09.333 knows the complete network topology,

00:10:12.733 and at that point, it can run

00:10:14.400 Dijkstra’s algorithm, calculate the shortest path to

00:10:17.533 every other node in the network,

00:10:20.433 and use that to make the decisions

00:10:22.366 which way it forwards the packets.
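
Once a node holds the complete map, the forwarding decision follows from Dijkstra's algorithm. This is a minimal sketch using Python's heapq module. The shortest_paths helper, the topology, and the link costs are hypothetical, with one deliberately expensive link between C and D to show the metric overriding the hop count.

```python
import heapq


def shortest_paths(topology, source):
    """Dijkstra's algorithm over a complete map of the network.
    Returns {dest: (total cost, first hop from source)}."""
    best = {source: (0, source)}
    heap = [(0, source, source)]  # (cost so far, node, first hop)
    done = set()
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in done:
            continue  # stale entry: a cheaper path was already processed
        done.add(node)
        for nb, link_cost in topology[node].items():
            new_cost = cost + link_cost
            hop = nb if node == source else first_hop
            if nb not in best or new_cost < best[nb][0]:
                best[nb] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, nb, hop))
    return best


# Hypothetical map: node -> {neighbour: link cost}.
topology = {
    "A": {"B": 1, "C": 1, "E": 1, "F": 1},
    "B": {"A": 1},
    "C": {"A": 1, "D": 5},
    "D": {"C": 5, "G": 1},
    "E": {"A": 1},
    "F": {"A": 1, "G": 1},
    "G": {"D": 1, "F": 1},
}
routes = shortest_paths(topology, "A")
```

In this map, node A reaches D through F and G at a total cost of 3, even though the path through C has fewer hops, because the link from C to D has a high cost.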

 

00:10:26.633 Now, this works much better,

00:10:30.966 because every node knows the complete topology.

00:10:34.966 If something fails, they can recover quite

00:10:36.800 quickly, as soon as the message gets

00:10:38.900 to them, they don't have to wait

00:10:40.966 for the count-to-infinity cycle that the distance

00:10:43.200 vector routing has.

 

00:10:45.933 The disadvantage of it, though, is that

00:10:48.433 it needs more memory, and it needs

00:10:50.333 more compute cycles.

 

00:10:52.566 Not only does each node store the

00:10:55.666 distance to every other node, but it

00:10:57.233 stores a complete map of the network.

 

00:10:59.733 So the amount of state each router,

00:11:03.100 each node in the network, needs to

00:11:04.800 store is equal to the size of the network squared.

 

00:11:08.100 So it scales order n squared with

00:11:10.200 the size of the network, because each

00:11:12.133 node is storing the complete matrix of

00:11:14.233 all the nodes and their connections to every other node.

 

00:11:20.300 And calculating Dijkstra’s algorithm is more computationally

00:11:23.833 complex than just looking at the distances.

 

00:11:26.566 And so this algorithm, the link state

00:11:29.700 approach to routing, is more memory hungry,

00:11:32.833 and it's more computationally intensive,

00:11:34.666 than distance vector.

 

00:11:36.966 But it converges much faster.

 

00:11:40.366 It recovers much faster after errors,

00:11:42.900 after links fail.

 

00:11:47.933 So we see there’s two approaches.

 

00:11:50.666 You can use distance vector routing in

00:11:52.800 a network, which is very simple to

00:11:54.466 implement, has low resource overheads in routers,

00:11:58.300 but suffers from very slow convergence.

 

00:12:00.866 If a link in the network fails,

00:12:02.733 it takes a long time to recover,

00:12:04.733 and packets cannot be delivered, packets to

00:12:08.700 certain destinations will not be correctly

00:12:10.600 delivered during that time.

 

00:12:13.166 Or you can use the link state

00:12:14.900 approach to routing, which is more complex,

00:12:17.700 requires the routers to have more memory,

00:12:19.700 do more computations, but it's much faster to converge.

 

00:12:25.066 And, when the network was starting out,

00:12:27.500 distance vector routing was relatively popular because

00:12:31.333 memory was expensive, because machines were slow,

00:12:34.433 and because there were not particularly strict

00:12:37.166 performance bounds on the network.

 

00:12:40.733 These days, memory is cheap, machines are

00:12:44.233 fast, and so the link state approach

00:12:47.666 is generally preferred, because it converges faster,

00:12:51.333 because the network recovers from failures much faster.

 

00:12:58.366 So what are the challenges with intradomain routing?

 

00:13:04.066 Well, I think there’s two.

 

00:13:07.900 The main one is how does it

00:13:10.733 recover effectively from failures?

 

00:13:19.300 While network equipment is pretty robust,

00:13:24.166 and pretty reliable,

00:13:27.333 it turns out that construction workers are

00:13:29.700 actually surprisingly good at breaking network cables.

 

00:13:33.266 And it's surprisingly common that someone digging

00:13:35.966 up the road puts a JCB through

00:13:38.466 the cables and breaks the network.

 

00:13:42.366 And, similarly, for people operating long distance

00:13:45.933 networks, people operating the international links,

00:13:49.166 it turns out that trawlers are pretty

00:13:50.833 good at damaging undersea cables.

 

00:13:54.866 And so good network designs need to

00:13:56.866 have multiple paths from source to destination.

 

00:14:00.233 And they need to be able to

00:14:01.866 fail-over to a different path if a

00:14:03.700 link breaks, and they need to be

00:14:05.233 able to do that relatively quickly.

 

00:14:09.566 How quickly do they need to notice

00:14:11.400 this? How quickly do they need to

00:14:13.000 switch over to a backup path?

 

00:14:16.966 Well,

00:14:18.666 it depends what sort of guarantees you've

00:14:21.833 given your customers.

 

00:14:25.266 For certain types of networks, it may

00:14:28.466 be that a few minutes downtime is

00:14:30.400 acceptable. Maybe the customers of that operator

00:14:33.533 are okay if the link goes away for half-an-hour.

 

00:14:36.766 That seems less likely, though.

 

00:14:40.400 A few seconds failure? That's getting more acceptable.

 

00:14:46.366 It's noticeable, probably, but it's probably acceptable

00:14:49.533 if the link goes down for 10 seconds, for a lot of users.

 

00:14:54.000 But if the links are being used

00:14:56.066 to carry real-time traffic,

00:14:58.833 and if you want to have the

00:15:01.533 links, have the failures, recovered in a

00:15:04.033 way that doesn't disrupt that traffic,

00:15:06.300 maybe you're providing the network link for

00:15:09.666 the BBC, maybe you're providing a network

00:15:13.200 link for a service which is carrying

00:15:16.666 production quality video,

00:15:19.233 critical video, for example,

00:15:21.566 and if you want to recover such

00:15:25.566 that it doesn't affect that sort of

00:15:27.300 media, you need to be able to

00:15:28.700 recover within the duration of a single frame.

 

00:15:31.966 So you need to be able to

00:15:33.100 switch-over to a backup link within maybe

00:15:35.500 a 60th of a second. And have

00:15:37.866 that link, have that backup link,

00:15:40.200 have similar latency to the original so

00:15:42.366 it doesn't cause a

00:15:44.233 significant gap in the packets being received.
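
To put those recovery targets side by side, here is a small worked calculation. It is only a sketch: it assumes video at 60 frames per second (to match the "60th of a second" figure), and the name `frames_affected` is made up for this illustration.

```python
from fractions import Fraction

# Assumption: 60 frames per second, matching the "60th of a second" figure.
FRAME_RATE = 60


def frames_affected(outage_seconds, frame_rate=FRAME_RATE):
    """Number of frame intervals that elapse while the link is down."""
    return Fraction(outage_seconds) * frame_rate


# The three outage durations mentioned in the lecture.
outages = [
    ("half an hour", 30 * 60),
    ("10 seconds", 10),
    ("a 60th of a second", Fraction(1, FRAME_RATE)),
]

for label, seconds in outages:
    print(f"{label}: {float(frames_affected(seconds)):.0f} frame(s)")
```

A ten second outage spans hundreds of frames, which is why recovering without visible disruption needs fail-over within roughly one frame interval.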

 

00:15:49.600 And so, a lot of the challenge

00:15:51.100 is how quickly can you fail-over,

00:15:52.966 and how quickly do you need to

00:15:54.466 fail-over in the event of a link

00:15:56.333 failure, for your customers?

 

00:15:59.300 If you’re a network operator, what demands

00:16:02.666 are your customers placing on how quickly it recovers?

 

00:16:06.533 And different service level guarantees,

00:16:10.700 different service level agreements, obviously affect how

00:16:13.400 much you charge your customers. But also

00:16:15.333 they affect how you organise, and how

00:16:17.233 you design, the network, and what mechanisms

00:16:19.433 you put in place for detecting failures.

 

00:16:21.933 And how you tune the protocol to

00:16:23.800 handle failures, and to recover from failures.

 

00:16:28.133 And quite often, this involves techniques

00:16:30.833 to pre-calculate alternative paths, so the system has

00:16:35.900 several different routing tables pre-configured,

00:16:40.266 accounting for different link failures, and can

00:16:43.033 just detect the failure and switch over

00:16:44.933 instantly to a pre-computed alternative, and doesn't

00:16:47.300 have to wait for the information to propagate.
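
The idea of pre-computing alternatives can be sketched as follows. This is a minimal illustration, not how any particular router implements it: the four-node `TOPOLOGY`, the link costs, and the function names are all assumptions made for the example. It runs a shortest-path calculation once for the healthy network, and once more for each possible single link failure, so that the backup table is ready before any failure happens.

```python
import heapq

# Hypothetical topology: node -> {neighbour: link cost}. Links are symmetric.
TOPOLOGY = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1, "D": 5},
    "C": {"A": 4, "B": 1, "D": 1},
    "D": {"B": 5, "C": 1},
}


def next_hops(graph, source, failed=None):
    """Dijkstra from source, returning {destination: first hop}.

    `failed` is an optional frozenset({a, b}) naming a link to ignore.
    """
    dist = {source: 0}
    hops = {}
    done = set()
    queue = [(0, source, source)]  # (cost so far, node, first hop used)
    while queue:
        cost, node, first = heapq.heappop(queue)
        if node in done:
            continue
        done.add(node)
        if node != source:
            hops[node] = first
        for neighbour, weight in graph[node].items():
            if failed is not None and frozenset((node, neighbour)) == failed:
                continue  # pretend the failed link does not exist
            if neighbour in done:
                continue
            new_cost = cost + weight
            if new_cost < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_cost
                via = neighbour if node == source else first
                heapq.heappush(queue, (new_cost, neighbour, via))
    return hops


def precompute_tables(graph, source):
    """Build the primary table plus one backup table per single link failure."""
    links = {frozenset((a, b)) for a in graph for b in graph[a]}
    primary = next_hops(graph, source)
    backups = {link: next_hops(graph, source, failed=link) for link in links}
    return primary, backups


primary, backups = precompute_tables(TOPOLOGY, "A")
print("primary:", primary)
print("if A-B fails:", backups[frozenset(("A", "B"))])
```

When a failure is detected locally, the router in this sketch only has to look up `backups[failed_link]`; it does not need to wait for the rest of the network to reconverge.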

 

00:16:53.433 And the other issue is that of

00:16:55.100 load balancing. If you have multiple paths

00:16:59.133 through your network,

00:17:01.000 and you're trying to spread the amount

00:17:02.900 of traffic you have to make effective

00:17:04.466 use of those paths, of those different

00:17:06.666 paths through the network,

00:17:08.100 such that not all of the traffic

00:17:09.666 is concentrated on a single link,

00:17:11.633 but it's being spread across the network

00:17:14.333 to avoid congesting a particular link.

 

00:17:20.866 Then, quite often, the idea is what's

00:17:23.033 called equal-cost multipath. You arrange the network

00:17:26.233 so there's multiple parallel paths on

00:17:29.066 the hot links, on the links that

00:17:31.700 see most of the traffic, and you

00:17:34.000 arrange it so that it alternates the

00:17:35.466 traffic between those paths.

 

00:17:38.400 But you need to be at least somewhat careful,

00:17:41.733 because protocols like TCP, with the triple-duplicate

00:17:46.433 ACK, are at least slightly sensitive to reordering.

 

00:17:51.333 If you're sending packets down alternative routes

00:17:55.800 to a destination, and those routes have

00:17:58.066 different delays, and different amounts of traffic

00:18:00.466 on them, the packets can arrive out-of-order.

 

00:18:03.333 And this is a common source of

00:18:05.000 reordering in the network. And, as we

00:18:07.533 saw when we spoke about TCP,

00:18:09.666 and TCP recovery, it’s insensitive to a

00:18:12.933 small amount of reordering,

00:18:14.833 but if the paths, the different routes

00:18:16.866 through the network, have significantly different latency,

00:18:19.800 by spreading the load, by alternating packets

00:18:23.266 between different paths, you can introduce large

00:18:25.733 amounts of reordering.

 

00:18:27.333 Which TCP would then interpret as a

00:18:30.333 packet loss, and start retransmitting packets.
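
To see why the amount of reordering matters, here is a simplified model of the triple-duplicate ACK rule. It is a sketch under assumptions: segments are numbered 0, 1, 2, ... rather than by byte offset, the receiver sends one cumulative ACK per arriving segment, and the function names are invented for the illustration.

```python
def cumulative_acks(arrival_order):
    """Return the cumulative ACK (next expected segment) sent after each arrival."""
    received = set()
    next_expected = 0
    acks = []
    for segment in arrival_order:
        received.add(segment)
        # Advance past every segment that has now arrived in sequence.
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks


def triggers_fast_retransmit(arrival_order, dup_threshold=3):
    """True if the sender would see `dup_threshold` duplicate ACKs in a row."""
    last_ack = None
    duplicates = 0
    for ack in cumulative_acks(arrival_order):
        if ack == last_ack:
            duplicates += 1
            if duplicates >= dup_threshold:
                return True
        else:
            last_ack = ack
            duplicates = 0
    return False


# Segment 1 arrives one place late: a single duplicate ACK, which is tolerated.
print(triggers_fast_retransmit([0, 2, 1, 3, 4]))

# Segment 1 arrives three places late: three duplicate ACKs, so it is treated as lost.
print(triggers_fast_retransmit([0, 2, 3, 4, 1]))
```

In the second case nothing was actually lost, but the sender in this model would still retransmit segment 1.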

 

00:18:34.566 And different applications,

00:18:35.966 different protocols, have different

00:18:37.366 degrees of sensitivity to reordering. A lot

00:18:39.866 of the real-time applications don't care at

00:18:42.300 all, as long as the packets arrive

00:18:44.066 before their deadline.

 

00:18:46.033 But protocols like TCP and QUIC,

00:18:48.166 to at least some extent, do care

00:18:50.300 so you need to

00:18:51.866 arrange the network, so that if you

00:18:53.466 are balancing traffic between multiple routes it

00:18:56.200 doesn't accidentally cause large amounts of reordering.
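
One commonly used way to balance load without reordering is to choose the path per flow rather than per packet, by hashing the flow's addresses and ports. The lecture does not go into this, so treat the following as an illustrative sketch only: the path names, the example flows, and the use of SHA-256 are assumptions (real routers typically use cheaper hardware hash functions).

```python
import hashlib

# Hypothetical set of equal-cost paths to the same destination.
PATHS = ["path-1", "path-2"]


def pick_path(flow, paths=PATHS):
    """Map a flow 5-tuple to one path using a stable hash.

    Every packet of the same flow hashes to the same path, so packets
    within a flow are not reordered by the load balancing itself.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]


# (source address, destination address, protocol, source port, destination port)
flow_a = ("192.0.2.1", "198.51.100.7", "tcp", 40000, 443)
flow_b = ("192.0.2.1", "198.51.100.7", "tcp", 40001, 443)

print(pick_path(flow_a), pick_path(flow_a))  # same flow, same path both times
print(pick_path(flow_b))                     # a different flow may use either path
```

The trade-off in this design is that load is balanced across flows rather than across packets, so a single very large flow still lands entirely on one path.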

 

00:19:03.766 And that's all I want to say about routing.

 

00:19:07.966 We spoke a bit about content distribution

00:19:10.400 networks, the idea of locating servers in

00:19:14.800 multiple places in the network in order to

00:19:18.433 host content near to the people who

00:19:22.100 want that content, near to the users

00:19:24.566 of that content, and how that can

00:19:27.566 be achieved using

00:19:29.333 DNS-based tricks to redirect to a local

00:19:33.233 replica, and a little bit about the

00:19:35.233 idea of anycast routing, where the same

00:19:37.166 addresses are inserted from multiple places and

00:19:39.433 the routing system takes care of getting

00:19:41.166 data to the nearest replica.

 

00:19:45.600 I spoke through interdomain routing, we spoke

00:19:48.033 through the idea of the Border Gateway

00:19:49.966 Protocol, BGP, and how it can deliver

00:19:52.333 data, and the various policy constraints that

00:19:55.133 affect the way BGP works.

 

00:19:57.833 We spoke about routing security, or the

00:20:01.533 lack thereof, in the Internet, and we

00:20:03.933 finished up by talking a little about intradomain routing.

 

00:20:09.633 This is the final technical part of the lecture,

00:20:13.166 the final technical lecture in the course.

 

00:20:16.700 In the next lecture I’ll move on

00:20:18.533 and conclude the course, and talk about

00:20:21.833 some possible future directions, and some ways

00:20:24.533 in which the network is evolving.

Discussion

Lecture 9 discussed content distribution and routing. Part 1 considered content distribution networks (CDNs). It spoke about the need to locate proxy caches throughout the network in order to get low-latency access to content and to distribute load. And it discussed, briefly, how to implement CDNs using either DNS tricks or anycast routing.

Part 2 considered inter-domain routing. It spoke about autonomous systems (ASes) and the AS graph. It considered routing at the edge of the network, based on default routes; and in the core, the so-called default-free zone. And it highlighted the role of policy in inter-domain routing.

Inter-domain routing and routing policy are implemented using the Border Gateway Protocol. BGP exchanges prefixes and AS path information, to form a routing table. And the filtered table allows policy to be expressed. The Gao-Rexford rules were outlined, describing a common set of policies.

The lack of security in inter-domain routing was mentioned, and the lecture outlined two projects, RPKI and MANRS, that are trying to improve the security and robustness of the routing infrastructure.

Finally, the lecture discussed intra-domain routing, including distance vector and link state protocols, and some of the challenges in network operations.

Discussion will focus on the need for, and benefits of, CDNs; on inter-domain routing and the requirements for policy support and how this is expressed in BGP; and on intra-domain routing and the challenges of network operations.