Networked Systems H (2022-2023)
Lecture 9: Networks and Internet Routing
Lecture 9 discusses content distribution networks and Internet routing.
It discusses what CDNs are and what role they play in the Internet, as a
mechanism to spread load and reduce latency. The problem of inter-domain
routing is then introduced, and the BGP routing protocol is reviewed as
a mechanism for providing policy routing across the Internet. Some of
the security limitations of BGP are highlighted, along with current
approaches to try to address these. Finally, intra-domain routing,
routing within a network, is briefly reviewed.
Part 1: Content Distribution Networks (CDNs)
The lecture begins by discussing content distribution networks (CDNs).
It outlines what CDNs are, and the role they play in the network to
help distribute load and reduce latency. Two approaches to locating
CDN nodes, using DNS and anycast routing, are briefly introduced.
Slides for part 1
00:00:00.433
In this lecture I want to talk
00:00:01.866
about how routing works in the Internet.
00:00:05.866
I’ll start, in this part, by talking
00:00:07.700
briefly about content distribution networks, which we
00:00:10.733
discussed a couple of lectures ago.
00:00:13.133
And then, in the later parts,
00:00:14.866
I’ll talk about inter-domain routing, how network
00:00:18.966
operators cooperate to deliver data across the
00:00:22.033
wide-area network. I’ll talk briefly about routing
00:00:24.833
security. And I’ll talk about intra-domain routing,
00:00:28.366
how routing works within an operator’s network.
00:00:34.066
So, in this part, I’d like to
00:00:35.300
start by talking about content distribution networks.
00:00:38.500
I’ll talk about how CDNs help with
00:00:40.333
load balancing and latency reduction, and how
00:00:43.366
they're implemented using either the DNS or
00:00:46.066
using anycast routing.
00:00:51.866
So what is a content distribution network?
00:00:56.000
A content distribution network is a service
00:01:01.733
which provides scalable, load-balanced,
00:01:05.466
low-latency hosting for web content.
00:01:09.933
CDN operators, companies such as Akamai,
00:01:12.900
CloudFlare, and Fastly, are in the business
00:01:16.133
of hosting content for their customers.
00:01:20.533
Their customers give them web content,
00:01:23.033
and this may be images, it may
00:01:25.100
be software updates, it may be video,
00:01:27.966
doesn't matter what it is, anything that
00:01:29.800
can sit on the web.
00:01:31.766
And the CDN hosts that content in
00:01:34.100
web caches that are spread around the world.
00:01:36.933
And some of these are located in
00:01:38.566
data centres, some of these are located
00:01:40.900
in edge networks operated by various ISPs.
00:01:45.833
And the idea is to reduce the
00:01:49.566
load on the main servers.
00:01:51.833
Rather than keeping a local copy of
00:01:54.166
the file, the customer gives it to
00:01:56.766
the CDN and links to the copy
00:01:58.166
hosted by the CDN. And this reduces
00:02:00.400
the load on the customer, and puts the load onto the CDN.
00:02:04.433
The idea is that the CDNs are
00:02:06.233
big enough, and have enough caches in
00:02:08.566
enough data centres and enough edge networks,
00:02:11.533
that this spreads the load throughout the world,
00:02:14.166
and prevents servers from being overloaded for high traffic sites.
00:02:18.066
It reduces latency for the requests,
00:02:20.566
because the CDN will have a cache
00:02:23.366
near to the person making the request, and
00:02:27.666
can spread the load around the world. And it
00:02:30.933
reduces the chances of a successful denial
00:02:33.433
of service attack,
00:02:35.166
again just because of the sheer size
00:02:37.733
of the CDN, and the sheer number of caches it has.
00:02:42.233
And there are many commercial CDNs available,
00:02:45.700
I think the big three of those
00:02:47.566
are listed on the slide, but there
00:02:48.933
are certainly many others.
00:02:50.600
And many large organisations also run their
00:02:53.566
own CDNs. In particular, the so-called
00:02:57.500
hypergiants, big companies such as Google,
00:03:00.200
Facebook, Netflix, Apple, and the like,
00:03:04.133
all run their own large-scale content distribution networks.
00:03:12.033
The goal of CDNs is to distribute load.
00:03:16.900
They distribute load by caching content
00:03:20.700
all around the world, and by answering
00:03:22.700
most requests from a local cache.
00:03:27.433
And, in order to do that,
00:03:29.600
they need to have servers located everywhere.
00:03:32.766
They need to be very large,
00:03:34.700
and have very wide geographical distribution.
00:03:38.900
This means they need large-scale investment,
00:03:41.933
large-scale cooperation with network operators, with ISPs,
00:03:46.833
with Internet exchange points,
00:03:48.133
with data centres, and the like.
00:03:53.566
And they try to host caches as
00:03:57.333
near to the customers as they can.
00:04:00.733
And to give you an idea of
00:04:01.833
the scale of this, the picture at
00:04:03.900
the top, the top-right of the slide,
00:04:05.800
comes from Netflix, and shows the reach
00:04:09.033
of their caches, their CDN.
00:04:13.300
And we see that they’re located,
00:04:16.066
primarily, in North America and Europe,
00:04:19.433
and to a lesser extent in South
00:04:23.466
America, where the customers of Netflix are
00:04:26.433
located. But we do see that there are
00:04:30.133
massive numbers of servers in those regions,
00:04:33.833
and also servers in Australia, New Zealand, Japan,
00:04:39.300
Singapore, the Middle East, South Africa,
00:04:43.666
and so on, to try and get some more geographic spread.
00:04:49.033
And the statistics from Akamai, they boast
00:04:52.400
that they have more than 240,000 servers
00:04:55.566
in over 150 countries. So these are
00:04:58.233
very large scale, very widely distributed, server networks.
00:05:04.400
And they often get this benefit by
00:05:07.100
hosting servers within edge ISP’s networks.
00:05:12.166
So an ISP, such as Virgin Media
00:05:15.200
in the UK, would almost certainly be
00:05:17.166
hosting CDN caches for Netflix, Akamai,
00:05:22.233
and the various other big CDNs.
00:05:26.666
And there’s a mutual benefit for an
00:05:28.933
ISP to host such a cache;
00:05:30.900
there's a mutual benefit for the ISP
00:05:33.200
to work with a CDN.
00:05:35.500
And, clearly, from the CDN’s point of
00:05:37.500
view, it increases the reach and the
00:05:40.233
robustness of the CDN, if they can
00:05:42.000
put caches in as many networks as possible.
00:05:45.300
From the ISP’s point of view,
00:05:47.333
though, it reduces the load on their network.
00:05:50.366
The CDN can push one copy of
00:05:53.366
a file into the cache, and it
00:05:57.033
can then distribute it to the other
00:05:59.433
customers of that ISP. And it means that all of that
00:06:03.266
load is then served from within the
00:06:05.100
ISP’s network, without having to go over
00:06:07.133
the expensive wide-area links to the rest of the Internet.
00:06:10.400
And this avoids overloading the links from
00:06:12.933
the ISP to the outside world.
00:06:16.166
And the scale of some of the
00:06:19.733
popular services means that this is necessary.
00:06:23.866
Netflix, for example, talk about how they
00:06:27.133
distribute tens of terabits of video per
00:06:29.400
second, and this clearly isn't possible from
00:06:32.300
a single data centre.
00:06:33.866
It has to be pushed out in
00:06:35.600
a hierarchy, with the central data centre
00:06:39.266
pushing data out to a CDN,
00:06:41.100
which pushes it out to edge caches,
00:06:42.600
which distribute to the customers. You can't
00:06:45.433
host all of this from a single
00:06:46.666
data centre, from a single site,
00:06:48.266
you have to spread the load around the world.
00:06:55.100
The other benefit of CDNs is that they reduce latency.
00:06:59.600
The goal is that the content is
00:07:02.333
not only spread geographically for load balancing
00:07:05.866
reasons, but it’s spread geographically so that
00:07:08.900
there's always a local copy near to
00:07:11.966
the person requesting the data.
00:07:15.100
That means when you request content from a popular site,
00:07:20.666
if you're based in Europe and you
00:07:23.800
request content from a popular site,
00:07:25.366
it doesn't have to go to the
00:07:28.133
US where the site is based,
00:07:31.000
but instead the request
00:07:32.733
can be answered, from a CDN cache located in Europe.
00:07:38.166
And this reduces the latency for your
00:07:41.000
requests, because it can be cached near to you.
00:07:45.433
But, of course, it requires a global
00:07:47.500
distribution of the proxy caches.
00:07:51.500
And, I think, one of the questions
00:07:53.700
here is how effectively CDNs are managing
00:07:56.233
to serve the entire world?
00:07:59.033
If we look at this picture,
00:08:01.266
from Netflix, we see that if you're
00:08:02.833
in Europe or North America, that there
00:08:06.100
are certainly many CDN caches, and there
00:08:09.433
will be one located very near to you.
00:08:12.466
And if you're located in certain parts
00:08:15.666
of South America, if you're located in
00:08:18.366
the populous bit of Western Australia,
00:08:21.900
if you're located in Singapore, or in
00:08:24.600
Japan, or somewhere like that,
00:08:27.433
there'll be a CDN near to you.
00:08:31.800
If you're in Africa, though,
00:08:35.133
you’re perhaps less well served. If you're
00:08:38.000
in large parts of Asia, you’re less well served.
00:08:41.600
And, increasingly, providing Internet access to developing
00:08:45.933
regions of the world needs more than
00:08:48.400
just providing connectivity.
00:08:50.633
If you want to provide high-quality Internet
00:08:53.133
access to parts of Africa, for example,
00:08:56.200
or parts of Asia which don't have
00:08:57.700
it yet, you don't just need to
00:08:59.333
provide bandwidth, you don't just need to
00:09:01.433
provide network links, you need to provide
00:09:03.966
data centres that can host CDN caches.
00:09:07.633
So it's increasing the investment needed to
00:09:12.333
get good performance.
00:09:19.400
And this works for cacheable static content.
00:09:24.466
CDNs have historically been focused on video,
00:09:27.900
and images, and software updates, and distributing
00:09:29.966
large files, and they work incredibly well for that.
00:09:34.066
But we're also starting to see people
00:09:35.766
talk about edge compute applications.
00:09:39.200
And applications where there is some sort
00:09:41.966
of computation going on near to the
00:09:44.366
customer. And this tends to be for
00:09:47.966
augmented reality games, and applications like that,
00:09:51.333
where you need low latency to the
00:09:53.966
compute server, the data centre.
00:09:56.266
And again, CDNs are starting to host
00:09:59.200
this sort of content, starting to allow
00:10:02.766
compute to be pushed into the edges.
00:10:04.933
And, again, this means that they don't
00:10:07.166
just need caching and data storage at the edges,
00:10:10.100
but they need large scale computing infrastructure.
00:10:12.900
Again, in developed parts of the world this
00:10:17.566
is eminently achievable. In developing, in less
00:10:21.966
well-developed parts of the world,
00:10:23.466
this infrastructure isn't yet there.
00:10:30.333
So, how do the CDNs work? How
00:10:33.333
do they find the nearest node in
00:10:36.966
order to deliver the content? I mean
00:10:40.566
actually delivering the content is easy,
00:10:42.533
a CDN node is just a web
00:10:44.633
server which has the files located on
00:10:48.266
it, and it just delivers them using
00:10:49.866
HTTP. The question is how you find
00:10:52.000
the right CDN node, that has the file you're looking for.
00:10:57.300
There’s two ways they do it.
00:10:59.733
Some of them use the DNS,
00:11:01.866
and some of them use a technique known as anycast routing.
00:11:06.900
For the CDNs that use the DNS,
00:11:10.466
the goal is that they locate the
00:11:12.500
nearest CDN node based on,
00:11:16.600
and give you an answer for where
00:11:19.533
that node is, by playing games with the DNS queries.
00:11:25.600
So in this case, when a customer
00:11:28.533
of the CDN gives a resource to
00:11:30.900
the CDN to be hosted, the CDN
00:11:33.566
gives that resource a unique domain name.
00:11:37.400
For example, if the site example.com is
00:11:41.033
trying to host an image of a
00:11:43.133
kitten on a CDN, the CDN would
00:11:45.633
give that a unique hostname. And in
00:11:49.933
this case, it’s picked a
00:11:53.933
hexadecimal name, 9BC1C…. etc.
00:12:00.466
But the point is that every image,
00:12:02.600
every resource, every file on the CDN,
00:12:04.700
has a unique host name.
00:12:07.700
Now, of course, they don't always refer
00:12:09.566
to real hosts. They’re all entries in
00:12:12.066
the DNS which point to a particular
00:12:16.733
server in the cache, but it gives the flexibility.
00:12:20.133
Because every file, every image, every piece
00:12:23.833
of content, on the CDN has a
00:12:25.766
different hostname, the CDN can return a
00:12:28.800
different IP address for each image,
00:12:31.900
each file, and it can point it
00:12:33.733
at an appropriate replica.
00:12:37.966
So, the way this works is that
00:12:39.733
the CDN returns different answers to the
00:12:43.400
DNS queries for the A or AAAA
00:12:45.833
records for the names, depending on where
00:12:49.166
they're being requested from, and what CDN
00:12:51.833
caches have that data.
00:12:54.966
So it looks at the IP address
00:12:57.166
of the resolver making the query.
00:12:59.700
And the CDN,
00:13:02.833
when it gets the name look-up for this
00:13:07.266
host, redirects it by returning a different
00:13:11.233
IP address that refers to a local cache.
00:13:14.800
And if I look up
00:13:17.100
this name from my home, I might
00:13:19.833
get a particular CDN cache located in the
00:13:23.900
ISP I have, and if you make
00:13:26.733
the same look-up from your home,
00:13:28.866
in a different ISP’s network, you’ll get
00:13:31.066
a different IP address back for that
00:13:32.733
name, pointing to a different cache that's
00:13:35.933
hosted by the CDN.
00:13:39.733
And this is based on the IP
00:13:42.066
address of the resolver, because all the
00:13:44.500
CDN sees is the requests coming from the local resolvers.
00:13:49.100
But the DNS resolver has an extension
00:13:51.966
called the DNS client subnet extension, so if
00:13:54.300
the client is not in the same
00:13:55.700
place as the resolver, the resolver can
00:13:58.333
tell the CDN the IP address where the client came from.
00:14:03.600
And the CDN has to look up the
00:14:05.500
IP address where it sees the requests
00:14:07.033
coming from, that of the resolver or
00:14:09.533
that of the client with the client subnet extension,
00:14:12.366
and try and guess where in the world it is.
00:14:15.066
It needs to look up the IP address,
00:14:16.966
and have a mapping of IP addresses to locations.
00:14:21.800
And this doesn't need to be particularly
00:14:23.700
accurate. The goal is to figure out
00:14:26.700
if you're in the UK, and direct
00:14:30.000
to the cache based in London,
00:14:31.733
rather than the cache based in New York, for example.
00:14:34.633
It doesn’t really care if it realises
00:14:37.533
that you're in Glasgow, or Manchester,
00:14:39.700
or wherever, the main thing is it
00:14:41.366
knows that you're in the UK,
00:14:42.566
so you should go to a UK-based cache.
00:14:46.433
And this gives the CDN very fine-grained
00:14:49.000
control. It can put the time-to-live on
00:14:52.333
its DNS responses down to be a
00:14:54.900
small number of seconds, so
00:14:58.766
every time a client looks up an
00:15:01.066
image, for every different image, every different
00:15:04.666
resource the CDN is hosting, it can
00:15:06.400
return a different answer. So it can
00:15:08.200
very rapidly load balance among its different caches,
00:15:11.933
amongst its different data centres. But it
00:15:14.700
puts a high load on the DNS,
00:15:16.500
it means there's lots of DNS queries
00:15:18.566
happening, and they can't be cached for very long.
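The DNS-based redirection described above can be sketched roughly as follows. This is an illustration only, not how any particular CDN implements it: the region table, the cache addresses, the 20 second time-to-live, and the function name `answer_query` are all assumptions made for the example, and the addresses come from the ranges reserved for documentation.

```python
import ipaddress

# Hypothetical mapping from resolver/client address ranges to a coarse region.
# A real CDN would use a large geolocation database; these are documentation
# prefixes chosen purely for illustration.
REGION_BY_PREFIX = {
    ipaddress.ip_network("198.51.100.0/24"): "uk",
    ipaddress.ip_network("203.0.113.0/24"): "us",
}

# Hypothetical cache addresses for each region.
CACHE_BY_REGION = {
    "uk": "192.0.2.10",
    "us": "192.0.2.20",
}

DEFAULT_CACHE = "192.0.2.30"  # fallback when the location cannot be guessed
TTL_SECONDS = 20              # short TTL, so the answer can change quickly


def answer_query(resolver_ip, client_hint=None):
    """Return (cache_ip, ttl) for a name lookup.

    client_hint stands in for an address taken from the client subnet
    extension; when present it is used instead of the resolver's address.
    """
    source = ipaddress.ip_address(client_hint or resolver_ip)
    for prefix, region in REGION_BY_PREFIX.items():
        if source in prefix:
            return CACHE_BY_REGION[region], TTL_SECONDS
    return DEFAULT_CACHE, TTL_SECONDS


print(answer_query("198.51.100.7"))                              # ('192.0.2.10', 20)
print(answer_query("203.0.113.9", client_hint="198.51.100.1"))   # ('192.0.2.10', 20)
```

The second call shows why the client subnet information matters: the resolver appears to be in one region, but the client it is asking on behalf of is in another, and the answer follows the client.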
00:15:25.566
The other approach CDNs use, is known as anycast routing.
00:15:33.166
And this doesn't play games with DNS,
00:15:37.500
it uses the DNS in a much more traditional way.
00:15:41.333
In that
00:15:43.333
the DNS names for the CDN are always
00:15:46.933
the same; they always just refer to
00:15:49.566
the CDN. And they always return the same answer.
00:15:53.466
And what it does is, each resource
00:15:56.200
the CDN is hosting, it gives it a different filename.
00:16:01.100
And the DNS name always maps to
00:16:03.033
the same IP address. Literally, it always
00:16:06.566
maps to the same IP address.
00:16:09.100
And in this example,
00:16:11.633
the CDN has three data centres,
00:16:14.900
all of which are using IP address 192.0.2.4.
00:16:22.266
And the CDN has many data centres
00:16:25.366
around the world, and they all use
00:16:27.233
the same IP address ranges. And they
00:16:29.733
advertise those IP address ranges into the
00:16:32.566
routing system, into the BGP routing system
00:16:35.566
we’ll talk about in the next part.
00:16:38.300
And the Internet routing then ensures that
00:16:40.533
the traffic goes to the closest data centre to the source.
00:16:45.066
By advertising the same IP address into
00:16:47.833
the routing from multiple places, the routing
00:16:50.533
system makes sure that the traffic goes
00:16:52.366
to the nearest data centre.
00:16:55.666
And it's an abuse of routing.
00:16:57.533
It's intentionally advertising the same IP address
00:17:00.900
from multiple places,
00:17:02.900
letting the routing take care of how the data gets there.
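As a rough sketch of why anycast sends traffic to a nearby data centre, consider a router that hears the same prefix advertised from several places and simply prefers the shortest path. The site names, AS numbers (taken from the range reserved for documentation), and the `choose_route` helper are hypothetical, and real routing decisions also depend on policy, which is ignored here.

```python
# Hypothetical advertisements for one anycast prefix, as seen by two routers.
# Every site originates the prefix from the same AS number (64510); only the
# path taken to reach that origin differs from place to place.
def choose_route(advertisements):
    """Return the advertisement with the shortest AS path (first wins ties)."""
    return min(advertisements, key=lambda adv: len(adv["as_path"]))


seen_in_london = [
    {"site": "new-york", "as_path": [64500, 64501, 64510]},
    {"site": "london", "as_path": [64502, 64510]},
]
seen_in_new_york = [
    {"site": "new-york", "as_path": [64501, 64510]},
    {"site": "london", "as_path": [64501, 64500, 64502, 64510]},
]

print(choose_route(seen_in_london)["site"])    # london
print(choose_route(seen_in_new_york)["site"])  # new-york
```

Both routers are forwarding to the same IP address; they just end up at different data centres because each one picks the advertisement that is closest to it.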
00:17:09.733
Which approach do CDNs use? Probably a mix of both.
00:17:16.233
Some of the large ones just use
00:17:18.733
the DNS-based approach, some of them use
00:17:21.633
a mix of both approaches, and both
00:17:23.733
approaches work, and they have different trade-offs.
00:17:30.133
So that's what I want to say
00:17:31.333
about CDNs. The goal of CDNs is
00:17:34.200
to provide load balancing and to reduce
00:17:36.400
latency, by allowing responses for web content
00:17:41.466
to be redirected from the original sites
00:17:45.733
to the content distribution network, which in
00:17:48.566
turn hosts that content at
00:17:51.200
numerous locations around the world which are,
00:17:54.066
hopefully, close to the end users.
00:17:57.766
And it can be implemented by playing
00:17:59.700
tricks with the DNS where, depending on
00:18:03.733
where you make the DNS lookup you
00:18:06.066
get a different answer back directing you to a local cache,
00:18:09.233
or it can be implemented using anycast
00:18:11.300
routing, where the caches all have the
00:18:13.933
same IP address and the routing system
00:18:15.833
takes you to the nearest replica.
00:18:19.200
In the next part, I'll talk about
00:18:21.333
how the routing system, the BGP routing,
00:18:24.800
works in the Internet.
Part 2: Inter-domain Routing
The second part of the lecture introduces the inter-domain routing
problem. It reviews the network-of-networks nature of the Internet,
and the concept of Autonomous Systems (ASes), and introduces the AS
graph and BGP as the basis for inter-domain routing. The differences
between routing at the edges and in the core of the network are
discussed, as is the role of the default-free zone and Internet
Exchange Points. The operation of BGP as a path vector protocol,
choosing the shortest policy-compliant path, is reviewed, including a
discussion of routing policy, the Gao-Rexford rules, and the BGP
decision process.
Slides for part 2
00:00:00.500
In this part I'd like to talk
00:00:01.833
about routing in the Internet, in particular
00:00:04.433
the idea of interdomain routing, routing between
00:00:08.033
networks, between autonomous systems.
00:00:10.633
I’ll talk about what is an autonomous system.
00:00:13.133
I’ll talk about the AS graph,
00:00:14.833
the graph of interconnections between networks that
00:00:17.500
form the Internet.
00:00:18.900
I’ll talk about how routing works at
00:00:20.666
the edges, and in the core of
00:00:22.433
the network. And I’ll talk about the
00:00:24.133
Border Gateway Protocol, BGP, which enables routing
00:00:27.666
in the Internet.
00:00:31.666
The Internet is a network of networks.
00:00:37.066
Fundamentally it's built as a set of
00:00:40.333
independently owned, independently operated, networks which
00:00:45.033
talk with each other, and which collaborate to deliver data.
00:00:50.166
Each of these networks is what's known
00:00:52.633
as an autonomous system. It operates independently.
00:00:55.633
And each network is a separate routing domain.
00:00:58.200
It makes its own decisions internally
00:01:00.266
how to route data around its own network.
00:01:05.200
The problem of interdomain routing is the
00:01:07.800
problem of finding the best path across
00:01:10.100
this set of networks. It’s the problem
00:01:12.966
of finding the best path from the
00:01:14.433
source network to the destination network,
00:01:17.233
treating the set of networks as a graph.
00:01:21.266
So, it's not finding the best hop-by-hop
00:01:23.933
path through the network, it’s finding the
00:01:26.100
best path between the set of networks
00:01:29.100
that comprise the Internet.
00:01:31.866
It treats each network in the Internet
00:01:34.100
as a node on the graph,
00:01:36.066
what’s known as the AS topology graph,
00:01:38.866
and it treats the connections between the
00:01:41.500
networks as edges in the graph.
00:01:43.866
And it's trying to find the best
00:01:45.800
set of networks to choose, to get
00:01:47.433
from the source to the destination across the AS graph.
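To make the graph view concrete, here is a minimal sketch that treats each network as a node and each interconnection as an edge, and then finds a path with the fewest networks using a breadth-first search. The AS numbers are from the documentation range and the link list is invented for the example. "Best" here means fewest AS hops only, which is a simplifying assumption; it is not the BGP decision process.

```python
from collections import deque


def shortest_as_path(links, src, dst):
    """Return a list of ASes from src to dst with the fewest hops, or None."""
    # Build an undirected adjacency map from the list of AS-to-AS links.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    previous = {src: None}  # records how each AS was first reached
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical interconnections between five networks.
links = [
    (64496, 64497), (64497, 64498), (64498, 64499),
    (64496, 64500), (64500, 64499),
]
print(shortest_as_path(links, 64496, 64499))  # [64496, 64500, 64499]
```

The result is a sequence of networks, not a sequence of routers: how each network carries the traffic internally is left to that network.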
00:01:54.300
As I said, the Internet is a
00:01:57.000
network of networks. Each of these networks,
00:01:59.800
each autonomous system, is independently owned and operated.
00:02:05.866
And the Internet routing system, the Border
00:02:09.000
Gateway Protocol, operates based on this idea
00:02:11.833
of autonomous systems, ASes.
00:02:15.033
And an AS is an Internet service provider,
00:02:20.200
or some other organisation that operates a
00:02:22.733
network, and that wants to participate in the routing.
00:02:26.866
The University of Glasgow is an autonomous
00:02:28.933
system in routing terms, for example.
00:02:32.866
So too are the various residential
00:02:35.233
ISPs; Virgin Media or BT or TalkTalk
00:02:38.600
would be autonomous systems. But so
00:02:41.400
are large companies, Facebook, and Google,
00:02:44.000
and the like are also autonomous systems
00:02:46.166
in the routing sense.
00:02:49.333
Some of these organisations operate more than
00:02:51.766
one autonomous system,
00:02:54.000
perhaps because they've bought other companies which
00:02:57.700
were themselves autonomous systems, or perhaps just
00:03:00.433
split their network up for ease of administration.
00:03:05.233
Autonomous systems are identified by unique numbers,
00:03:09.366
known as AS numbers, and these
00:03:11.333
are allocated to them by the Regional Internet Registries.
00:03:15.233
The AS numbers don't really have any
00:03:17.566
meaning, except that they provide a unique
00:03:20.000
identifier for each autonomous system.
00:03:23.266
Essentially, they start at one, and they go up,
00:03:27.066
and each new organisation, each new network,
00:03:30.633
to join the Internet routing system gets
00:03:32.566
assigned the next autonomous system number.
00:03:36.033
As of March 2021 there are about
00:03:39.600
115,000 autonomous systems in the Internet,
00:03:43.033
about 115,000 autonomous system numbers have been
00:03:46.533
allocated, and about 71,000 of those are
00:03:50.133
advertised in BGP, which means about 71,000
00:03:53.700
of them are active in the Internet routing.
00:03:58.366
And the completely unreadable graph on the
00:04:00.700
right of the slide shows the growth
00:04:02.833
in the number of ASes advertised into
00:04:05.266
the routing system over time.
00:04:07.133
And there are some links on the
00:04:08.433
slide, if you want to find the
00:04:10.566
list of AS numbers, and the details
00:04:13.766
of the current AS number allocations.
00:04:22.000
When we talk about Internet routing,
00:04:24.566
we talk a lot about the AS topology graph.
00:04:28.166
And this is the set of interconnections
00:04:30.500
between the ASes. The set of interconnections
00:04:33.000
between the autonomous systems, between the networks,
00:04:35.600
that form the Internet.
00:04:38.466
And the AS topology graph is formed
00:04:40.800
by treating each node, each network,
00:04:45.300
each autonomous system as a node in the graph.
00:04:49.066
And the interconnections show the links between
00:04:51.933
the different networks, they show the different
00:04:53.866
ways in which traffic can pass between
00:04:55.933
these independently operated networks.
00:05:00.266
The picture we see on the slide
00:05:02.200
here is a visualisation of that graph,
00:05:05.433
produced by an organisation known as CAIDA,
00:05:08.733
the Cooperative Association for Internet Data Analysis,
00:05:12.366
which operates out of the University of
00:05:14.500
California, in San Diego.
00:05:17.500
And the way this works is that
00:05:19.166
each point on this graph is a
00:05:21.033
network, each point on the graph is an autonomous system.
00:05:25.233
And the position around the circle is
00:05:27.800
done based on geography, so it's based
00:05:30.533
on geographic location.
00:05:32.700
And the distance from the centre towards
00:05:37.433
the edge of the circle is based
00:05:39.566
on number of connections that network,
00:05:42.533
that autonomous system, has to the rest of the network.
00:05:48.000
A network that has very few connections
00:05:50.233
to other networks will appear at the
00:05:52.100
edge, whereas a network that has very
00:05:54.166
many connections to other networks will appear
00:05:56.433
in the middle of the graph.
00:05:58.766
And, as I say, it's arranged geographically,
00:06:01.066
and it’s perhaps a little hard to
00:06:02.933
read. At about the eight o'clock position,
00:06:06.066
and going around anticlockwise, if we start
00:06:08.433
at the eight o'clock position we see Hawaii.
00:06:10.466
And, towards the bottom at about the
00:06:12.300
seven o'clock position, you've got San Diego,
00:06:15.233
and Los Angeles, and working the way
00:06:17.633
through the US,
00:06:20.333
round to New York and so on,
00:06:24.266
at about the four o'clock position.
00:06:26.566
The gap is the Atlantic Ocean,
00:06:29.500
and then from around the three o'clock
00:06:31.700
position, to about one o'clock, you see
00:06:34.733
we're working the way through Europe,
00:06:36.300
and the labels show the various European cities.
00:06:39.333
And it works its way around,
00:06:41.533
through Asia, and the Far East,
00:06:44.600
and back to Hawaii.
00:06:48.566
And we see, as you might expect,
00:06:51.466
the richness of the interconnections varies geographically,
00:06:55.966
based on where the people live,
00:06:57.666
and based on, to some extent,
00:06:59.766
how developed the countries are.
00:07:02.433
There's a lot of networks at the
00:07:04.766
edges, and there's a significant number,
00:07:07.933
a smaller but significant number,
00:07:10.700
a richly connected topology in the core.
00:07:14.233
And that’s what you'd expect. That the
00:07:17.033
very large Internet companies,
00:07:18.566
Facebook, and Google, and Apple,
00:07:21.766
and the content distribution networks, like Akamai
00:07:25.933
and so on, are all in the
00:07:27.000
middle, interconnecting to everyone. And then there's
00:07:29.666
lots of networks around the edges,
00:07:32.133
which just provide Internet access in particular regions.
00:07:38.133
And this is showing the potential ways
00:07:40.533
that the traffic can flow. It’s showing
00:07:42.900
the interconnections between the autonomous systems,
00:07:45.900
between networks. So, it's giving potential routes
00:07:49.400
which traffic can flow through the network.
00:07:53.166
And this graph is for IPv4.
00:07:56.733
You can do the same thing for
00:07:58.266
IPv6, as we see on this slide,
00:08:00.733
and as you would expect, perhaps,
00:08:02.533
the IPv6 graph is somewhat sparser and
00:08:06.066
perhaps a bit easier to read,
00:08:07.833
because the IPv6 network is smaller.
00:08:12.733
It's developing in the same way, though.
00:08:16.133
If you look at the historic data
00:08:18.533
for the IPv4 graph, the IPv6 graph
00:08:21.600
is following the same trajectory as the
00:08:23.766
IPv4 Internet did, it’s just a few years behind.
00:08:30.800
And in this slide, this is data
00:08:34.066
from Google. It’s plotting the fraction of
00:08:36.600
connections going to Google that
00:08:39.400
use IPv6. We see that about a
00:08:42.600
third of the traffic to Google is
00:08:45.100
using IPv6, and that matches up with the
00:08:49.233
graphs on the previous slide.
00:08:51.600
The IPv6 network is a lot less
00:08:53.733
well developed, it's a much sparser topology
00:08:57.066
compared with IPv4, and there's less traffic using it.
00:09:01.533
But I think that's what you'd expect.
00:09:03.700
IPv4 has had 30 years head-start on
00:09:07.400
deployment. Of course it's going to be
00:09:09.633
much more densely interconnected, of course there's
00:09:12.333
going to be much more IPv4 traffic
00:09:14.633
than IPv6 traffic. But IPv6 is developing,
00:09:17.633
it’s growing at a similar rate.
00:09:23.500
So how do we route traffic around
00:09:26.300
this graph? Given that mass of interconnections
00:09:29.500
we saw in the previous slides,
00:09:32.166
essentially a completely unreadable
00:09:34.333
mass of interconnections,
00:09:35.600
with so many networks, so many interconnections,
00:09:38.600
how do we route traffic around the network?
00:09:43.766
Well, at the edges of the network,
00:09:46.166
this is very straightforward.
00:09:48.133
Devices at the edge of the network
00:09:50.266
tend to have really simple routing tables.
00:09:54.633
If you look at machines in the
00:09:56.933
network in the Computing Science Department of
00:10:00.100
the University, for example,
00:10:02.500
all the machines in Computing Science have
00:10:05.333
IP addresses in the range
00:10:07.666
130.209.240.0/20.
00:10:12.266
They all have IPv4 addresses where the
00:10:15.600
first 20 bits of the address match
00:10:18.133
130.209.240.0,
00:10:22.033
and the last 12 bits identify the
00:10:24.800
particular machine on the Computing Science network.
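The split into 20 network bits and 12 host bits can be checked with Python's standard `ipaddress` module. The two test addresses below are made up for illustration; only the prefix itself comes from the lecture.

```python
import ipaddress

net = ipaddress.ip_network("130.209.240.0/20")

# 32 address bits in total, 20 of them fixed by the prefix.
host_bits = net.max_prefixlen - net.prefixlen
print(host_bits)                # 12
print(net.num_addresses)        # 4096, that is 2**12
print(net.network_address)      # 130.209.240.0
print(net.broadcast_address)    # 130.209.255.255

# An address is on this network if its first 20 bits match the prefix.
inside = ipaddress.ip_address("130.209.247.112") in net
outside = ipaddress.ip_address("130.209.16.1") in net
print(inside, outside)          # True False
```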
00:10:29.033
And their routing table just says,
00:10:31.866
if the machine is on
00:10:34.966
the Computing Science network put it out
00:10:37.566
onto the local Ethernet, and it will be delivered.
00:10:41.033
If it's got an IP address in the range
00:10:43.566
130.209.240.0/20
00:10:47.233
put it out on to the local
00:10:48.833
Ethernet, and it will be delivered to
00:10:51.133
the machine directly.
00:10:53.566
And then it has what's known as
00:10:56.033
a default entry, which says if it
00:10:57.466
has any other IP address, send it
00:11:00.600
to the machine with IP address 130.209.240.48.
00:11:06.700
And machine 130.209.240.48
00:11:10.366
is the router at the edge of
00:11:12.966
the Computing Science Department. It’s the router
00:11:16.300
which connects Computing Science to the rest
00:11:18.300
of campus, and from then on to the rest of the Internet.
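A two-entry routing table of this kind can be sketched as a longest-prefix match over a local network entry and a default entry. The `ROUTES` list and `next_hop` function are names invented for this example, and the action strings are just descriptions; a real host keeps this table in the operating system.

```python
import ipaddress

# The routing table described above: one entry for the local network and a
# default entry (0.0.0.0/0 matches every IPv4 address) pointing at the router.
ROUTES = [
    (ipaddress.ip_network("130.209.240.0/20"), "deliver directly on local Ethernet"),
    (ipaddress.ip_network("0.0.0.0/0"), "forward to router 130.209.240.48"),
]


def next_hop(destination):
    """Return the action for the most specific route matching destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, action) for net, action in ROUTES if addr in net]
    # Longest prefix wins, so the /20 beats the /0 default when both match.
    _, action = max(matches, key=lambda match: match[0].prefixlen)
    return action


print(next_hop("130.209.241.5"))  # deliver directly on local Ethernet
print(next_hop("192.0.2.4"))      # forward to router 130.209.240.48
```

The default entry always matches, so the lookup never fails for an IPv4 destination; it only loses to a more specific entry when one exists.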
00:11:23.400
And routing at the edges is often
00:11:25.833
like this. The routing table specifies “this
00:11:29.200
is the local network” and says in
00:11:31.300
order to send to any machines on
00:11:33.833
this network, just put it out onto
00:11:35.433
the Ethernet, or on to the WiFi,
00:11:38.066
and they're all directly connected. And anything
00:11:41.433
else, send it over there. And “over
00:11:44.000
there” is the router that connects to
00:11:45.533
the rest of the network.
00:11:48.266
If you look at the routing tables
00:11:50.866
on machines in your home, you will
00:11:52.800
see something similar. And, most likely,
00:11:55.933
you have a private network, you're behind
00:11:58.233
a network address translator,
00:12:00.433
and the routing table will say the
00:12:02.900
network is 192.168.0.0/16, and that's on your
00:12:08.266
local WiFi, and anything else you send
00:12:11.066
to, probably, machine 192.168.0.1, which will be
00:12:16.333
the WiFi base station which will,
00:12:18.866
in turn, send it out to the rest of the Internet.
00:12:25.166
Routing at the edges is straightforward.
00:12:29.733
Routing, as you get nearer the core
00:12:32.466
of the network, gets more complex.
00:12:35.066
We saw at the edges, the networks
00:12:37.700
can just have a default route that
00:12:39.133
points up towards the core.
00:12:41.533
We see it at the bottom-right of
00:12:44.166
the figure on the slide here.
00:12:46.500
Where there’s some network at the edge,
00:12:48.566
which has a couple of customers.
00:12:51.366
It has links to a couple of
00:12:53.166
customer networks, with the red arrows pointing inwards,
00:12:57.600
and
00:13:00.466
it knows what are the address ranges
00:13:03.600
assigned to those customers.
00:13:05.666
So it knows that if it's got
00:13:07.766
traffic to those address ranges, it can
00:13:10.266
route it down those links to those
00:13:12.033
customers. But it can have a default
00:13:14.333
route that says “for anything else,
00:13:16.600
anyone other than these two customers,
00:13:18.966
send it out towards the wider Internet”.
00:13:22.466
And, at the edges, this sort of
00:13:24.266
default based approach works quite well,
00:13:26.733
because there's only a small part of
00:13:28.900
the network which is known, and everything
00:13:30.766
else is “out there”.
00:13:34.400
As you get into the core, though,
00:13:37.766
the networks tend to need more-and-more information.
00:13:42.433
And, eventually, you end up in a
00:13:44.300
region of the network which is known
00:13:46.266
as the “default free zone”.
00:13:49.166
And the default free zone is that
00:13:51.166
part of the network which is so richly interconnected
00:13:54.366
that it stops being able to say
00:13:57.300
“send it over there to be delivered”,
00:14:00.033
because it's the part of the network
00:14:01.800
those people send it to.
00:14:04.333
It can't say send it towards the
00:14:06.866
middle of the Internet to be delivered,
00:14:08.600
because it is the middle of the Internet.
00:14:11.600
And this large core of autonomous systems
00:14:15.933
in the middle of the network,
00:14:17.200
has to keep track of essentially the
00:14:20.000
whole Internet topology, the whole AS graph.
00:14:23.466
So they need to store all the
00:14:25.600
paths, to all the autonomous systems in
00:14:27.533
the network, to figure out how they
00:14:29.200
can deliver data.
00:14:30.933
They need to keep a map of,
00:14:32.833
essentially, the entire Internet topology. And from
00:14:36.066
that, they can decide which way to
00:14:37.766
send the packets, which network to send
00:14:40.800
the packets to next, in order that they get delivered.
00:14:48.266
Over time, the topology, the AS graph,
00:14:51.800
is gradually getting more complex.
00:14:55.766
It started out being relatively simple,
00:14:58.300
like you see on the left of the slide here.
00:15:02.700
There were ISPs at the edges,
00:15:05.033
which provided connectivity to particular regions.
00:15:08.300
They connected to regional ISPs, which provided
00:15:11.966
wider-area connectivity. And there were a small
00:15:15.533
number of network operators that provided long-distance
00:15:18.466
international connectivity.
00:15:22.966
And, over time, we've gradually seen more-and-more
00:15:26.966
links being added, the links shown in
00:15:30.066
red on the right, for example.
00:15:32.333
We're getting a lot more interconnections at
00:15:34.633
the regional level,
00:15:37.166
a lot denser interconnections at the edges.
00:15:41.433
The network’s getting more-and-more connected. The ISPs,
00:15:45.466
the network operators,
00:15:46.666
the companies that form the network,
00:15:48.266
are gradually building more-and-more interconnections
00:15:50.400
between themselves.
00:15:52.733
And traffic is flowing less often up
00:15:54.800
towards the core, and then through this
00:15:56.766
small set of long-distance providers, and then back down,
00:15:59.766
and is increasingly going from the edges
00:16:02.700
up to some sort of regional transit
00:16:05.733
layer, or from the edges directly to
00:16:07.766
the destination network, without having to go
00:16:10.533
via these long-distance transit providers.
00:16:14.466
And we're seeing more
00:16:16.766
interconnection by large Internet companies, Google for
00:16:22.233
example, or the content distribution networks,
00:16:25.200
Akamai, CloudFlare, Fastly, and the like,
00:16:29.466
connecting at the regional level, connecting to
00:16:32.266
the edge ISPs directly, in order to
00:16:34.933
improve connectivity for their customers.
00:16:38.000
And we're seeing increasing numbers of what
00:16:40.466
are known as Internet Exchange Points.
00:16:42.733
Locations where network operators can come together
00:16:46.533
and interconnect themselves.
00:16:50.133
A prominent example of that, in this
00:16:53.266
country, is the London Internet Exchange,
00:16:55.966
where approximately 800-850 different networks
00:17:00.533
all come together in a particular building,
00:17:03.300
and just connect their networks together.
00:17:06.433
And the picture shows it, as you
00:17:09.800
see it's just a regular office building.
00:17:12.700
If you go into one of these
00:17:14.433
places, what you find is that the
00:17:16.233
core of it is just an enormous
00:17:17.566
Ethernet switch. And all of the networks
00:17:21.133
bring their equipment in, and they all
00:17:22.800
plug-in to, essentially, a massive Ethernet which
00:17:26.800
allows them to just exchange traffic.
00:17:30.866
And the LINX, the London Internet Exchange,
00:17:33.600
talks about how it has several terabits
00:17:37.700
per second of traffic flowing through it.
00:17:40.066
And this type of scale is pretty
00:17:41.566
commonplace. There’s tens, possibly hundreds, of these
00:17:46.166
in Europe, and many more of them
00:17:47.866
around the world. And they’re the points
00:17:50.866
at which this interconnection tends to happen.
00:17:59.933
The Internet, as we've said, is a
00:18:02.133
network of networks. The autonomous systems are
00:18:05.966
independently operated and, in many cases,
00:18:08.900
they are competitors.
00:18:11.500
If you think about the edges of
00:18:13.433
the network in the UK, for example,
00:18:15.966
you've got autonomous systems such as BT,
00:18:19.633
Virgin Media, Talk Talk, O2, and all
00:18:22.600
the others, all of which are competing
00:18:25.366
for business. They’re all competing to be
00:18:28.100
your Internet provider.
00:18:30.700
These autonomous systems have to cooperate to
00:18:34.033
deliver data between themselves, and deliver data
00:18:36.966
to the rest of the Internet,
00:18:39.133
but fundamentally they’re competitors.
00:18:43.033
They're competing for business, they're competing for
00:18:46.100
customers with each other.
00:18:48.000
And this is true at all of
00:18:49.533
the levels of the hierarchy. The autonomous
00:18:52.333
systems, the networks that comprise the Internet,
00:18:56.266
need to cooperate to make the Internet
00:18:59.000
work, but fundamentally they don't trust each
00:19:01.766
other. They’re competitors, they're operating in different
00:19:05.600
places, they have different goals, different values.
00:19:10.566
And, as a result, business and political
00:19:14.300
and economic relationships very much influence routing.
00:19:19.633
Internet routing, of course, is based on
00:19:22.966
what's the most efficient way to get
00:19:24.966
data to a particular destination, but it's
00:19:27.800
also based on policy.
00:19:31.566
And policy restrictions very much determine the
00:19:35.133
topology. They determine the interconnections between the
00:19:37.966
networks, and they determine which of those
00:19:40.100
interconnections are used.
00:19:43.966
And, at the coarsest sense, they determine
00:19:48.200
the interconnectivity, because they determine which networks
00:19:51.133
actually physically interconnect to each other.
00:19:54.933
Which of these networks actually have put
00:19:59.166
in place a physical link to allow
00:20:01.533
traffic to flow between themselves,
00:20:03.833
versus punting it up to some other
00:20:05.966
level of the hierarchy?
00:20:08.966
But also, once those links are in
00:20:11.266
place, who gets to use them? Which
00:20:13.700
traffic gets to flow over those links?
00:20:16.400
And not all of the traffic which
00:20:18.000
could flow over a particular link is
00:20:20.466
necessarily allowed to, depending on the policy
00:20:22.700
choices that have been made.
00:20:26.066
And these various policy choices might prioritise
00:20:29.733
traffic so that it goes over non-shortest
00:20:32.066
path routes, over not necessarily optimal routes.
00:20:37.300
Network operators might prioritise shortest path,
00:20:42.166
they might prioritise the lowest latency path
00:20:45.166
when they’re choosing a route.
00:20:47.433
But they might also prioritise the highest bandwidth path.
00:20:50.666
Or the cheapest path.
00:20:53.733
Or they might have restrictions which prioritise
00:20:57.333
paths which avoid certain networks, or avoid
00:21:00.233
certain parts of the world.
00:21:03.766
They might be trying to avoid traffic
00:21:06.166
going through certain regions, or through certain
00:21:08.133
network operators, for political reasons or for
00:21:11.966
economic reasons.
00:21:14.233
And these policy considerations very much influence
00:21:17.233
the way Internet routing works.
00:21:24.466
The routing in the Internet operates using
00:21:27.700
a system known as the Border Gateway Protocol.
00:21:31.600
There's two parts to the Border Gateway
00:21:33.733
Protocol, two parts to BGP.
00:21:36.266
External BGP and internal BGP.
00:21:40.733
External BGP provides the connectivity between autonomous
00:21:46.966
systems. It’s used by ASes to exchange
00:21:50.533
information with their neighbours, to tell them
00:21:53.033
which paths are available.
00:21:56.366
External BGP runs over TCP connections,
00:22:00.166
it runs over TCP connections between routers,
00:22:03.666
one in each autonomous system, so it
00:22:07.133
interconnects the autonomous systems.
00:22:10.300
And it allows those two autonomous systems
00:22:12.566
to exchange knowledge of the AS topology,
00:22:15.500
which they’ve filtered according to their policies.
00:22:19.000
External BGP is the way two autonomous
00:22:22.300
systems talk to each other,
00:22:23.900
to exchange information about the structure of the network.
00:22:27.900
And from that they can compute
00:22:31.333
interdomain routes, they can compute the paths
00:22:35.600
that are available across the network.
00:22:39.566
Internal BGP is the part of BGP
00:22:43.300
that’s used within an autonomous system for
00:22:46.366
distributing that information to the other edge
00:22:48.500
routers, and for distributing that information to
00:22:50.966
the internal routers in that system.
00:22:54.066
Internal BGP allows an autonomous system to
00:22:58.400
coordinate routing information internally. It tells the
00:23:03.500
routers that comprise a network, how to
00:23:08.433
get to the edges, how to get
00:23:10.666
out to the rest of the world.
00:23:13.433
And external BGP is used for talking
00:23:16.133
between autonomous systems to coordinate their view
00:23:18.600
of what the rest of the world looks like.
00:23:21.733
We’ll talk about intradomain routing, routing within
00:23:25.333
a network, in one of the later
00:23:27.200
parts. But for the rest of this
00:23:28.866
part of the lecture, I want to talk about external BGP,
00:23:31.200
and how the routing between autonomous systems works.
00:23:38.366
At the external BGP level,
00:23:42.033
the autonomous systems, the routers at the
00:23:44.633
edges of the autonomous systems, advertise out
00:23:47.800
IP address ranges, and advertise the AS
00:23:51.100
paths in order to get to those IP address ranges.
00:23:56.333
And these combine to form what's known as a routing table.
00:24:00.466
Essentially, you have a list of IP
00:24:03.133
address ranges, what’s known as a list
00:24:05.000
of prefixes, and for each prefix,
00:24:08.133
you have the list of autonomous systems
00:24:11.366
you need to get through to get to that prefix.
00:24:16.333
And the table at the bottom,
00:24:18.033
is an example of a small part
00:24:20.433
of the Internet routing table.
00:24:22.633
And the whole thing is enormous.
00:24:24.266
The whole thing is
00:24:25.733
a few million lines of this.
00:24:27.800
And there's something like half-a-million prefixes being
00:24:31.933
advertised into the Internet, and each one
00:24:34.200
has multiple ways of getting to it,
00:24:35.800
so there are several million lines of this data.
00:24:39.600
What we see, highlighted in yellow,
00:24:41.866
is the entries for a particular prefix.
00:24:45.433
In this case, it's the IP addresses
00:24:48.133
which match 12.10.231.0/24,
00:24:53.066
where the first 24 bits match 12.10.231.0.
00:25:00.233
And,
00:25:02.666
in the middle column,
00:25:04.800
the next hop column, we see that
00:25:06.633
there are seven different ways of getting
00:25:11.166
to that, via seven different next hop routers.
00:25:15.533
And, for each of these, we see
00:25:17.100
an AS path which shows how to get there.
00:25:20.833
So, for example, if you look at
00:25:23.433
the first line highlighted in yellow,
00:25:25.300
we see we can get to the
00:25:26.333
prefix 12.10.231.0/24
00:25:30.700
via next hop 194.68.130.254
00:25:36.866
If we send a packet destined to
00:25:39.000
that prefix, to that next hop router,
00:25:42.333
it will go to the autonomous system
00:25:44.500
number 5459, which will send it to
00:25:48.133
5413, which will send it to 5696,
00:25:52.366
which will send it to 7369.
00:25:55.200
And 7369, because it’s at the end
00:25:58.200
of the AS path, is the one that owns the prefix.
00:26:02.733
And the “i” is the BGP origin code.
00:26:06.166
It means the prefix is interior to the originating
00:26:09.333
autonomous system: it was injected into BGP
00:26:12.466
from within that AS.
00:26:17.833
And we see the next line,
00:26:20.566
if you send a packet destined
00:26:22.966
for the same prefix, instead to the
00:26:27.233
router with IP address 158.43.133.48,
00:26:33.366
it will follow a longer path.
00:26:35.233
It will go via autonomous systems 1849,
00:26:38.300
702, 701, 6113, 5696, and eventually to
00:26:43.500
7369, the destination.
00:26:46.200
And so on.
00:26:48.433
And that line highlighted in red is
00:26:50.766
the preferred path. You send a packet
00:26:53.733
destined for prefix 12.10.231.0, and if you
00:26:58.033
send it to the next hop router
00:26:59.833
202.232.1.8, it will go via autonomous systems
00:27:05.133
2497, 5696, and then reach 7369 the destination.
00:27:14.833
And the entire routing table comprises this
00:27:18.333
set of information. It's a list of
00:27:20.166
prefixes and next hops,
00:27:22.500
which routers this autonomous system can send
00:27:25.300
the data to next, in order to
00:27:28.000
make its way towards that destination,
00:27:30.466
and the AS paths it will take,
00:27:32.633
the packets will take, if it sends them to that next hop.
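To make the structure of these entries concrete, here is a small sketch that holds the three routes read out above for prefix 12.10.231.0/24. The `Route` class, the `TABLE` list, and the `routes_for` helper are hypothetical names used only for illustration; real routers store this information in their own internal formats.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Route:
    prefix: str               # destination address range
    next_hop: str             # neighbouring router to hand the packet to
    as_path: Tuple[int, ...]  # autonomous systems the packet will traverse

    @property
    def origin_as(self) -> int:
        # The last AS on the path is the one that owns the prefix.
        return self.as_path[-1]

# The three entries for 12.10.231.0/24 described in the lecture.
TABLE = [
    Route("12.10.231.0/24", "194.68.130.254", (5459, 5413, 5696, 7369)),
    Route("12.10.231.0/24", "158.43.133.48", (1849, 702, 701, 6113, 5696, 7369)),
    Route("12.10.231.0/24", "202.232.1.8", (2497, 5696, 7369)),
]

def routes_for(prefix: str):
    """Return every known route to the given prefix."""
    return [route for route in TABLE if route.prefix == prefix]

candidates = routes_for("12.10.231.0/24")
for route in candidates:
    print(route.next_hop, route.as_path, "origin AS", route.origin_as)

# AS path length is only one of several selection criteria, but here the
# shortest path happens to be the one marked as preferred.
shortest = min(candidates, key=lambda route: len(route.as_path))
print("shortest AS path is via", shortest.next_hop)
```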
00:27:38.233
What are the next hop IP addresses?
00:27:40.800
They’re the IP addresses of the routers
00:27:43.233
this autonomous system peers with in its neighbours.
00:27:47.700
The particular autonomous system I’ve taken this
00:27:51.000
routing table from, connects to a router
00:27:55.266
with IP address 202.232.1.8, and that router
00:28:01.266
is in one of its neighbours,
00:28:02.500
it's in autonomous system 2497.
00:28:08.000
And it knows that if it sends
00:28:10.133
to that next hop, it will work
00:28:11.933
its way through autonomous systems 2497,
00:28:14.866
and 5696, and 7369 which owns the
00:28:18.600
destination IP address.
00:28:22.100
And this just repeats, for prefix,
00:28:23.866
after prefix, after prefix.
00:28:29.066
Now.
00:28:30.700
You can extract this information, and you
00:28:32.933
can plot it, and you can form a graph.
00:28:35.700
And the figure we see on the left
00:28:39.833
here, shows the view of the network
00:28:43.166
from the point where this routing table
00:28:45.366
was gathered, which is the autonomous system
00:28:47.466
highlighted in green, showing the interconnections we
00:28:50.133
found to all the others.
00:28:52.533
And all this is doing, is showing
00:28:54.333
each pair of adjacent autonomous systems on
00:28:56.966
the path are connected together.
00:29:00.300
So, if we look at the first
00:29:01.533
line, we see we can reach
00:29:05.533
the prefix 12.10.231.0 via autonomous systems 5459,
00:29:13.000
5413, 5696, and 7369.
00:29:18.566
And, we see from the node in
00:29:21.400
green, if we go up to about
00:29:23.200
the 10 o'clock position and around,
00:29:25.400
we follow the autonomous systems around,
00:29:27.500
we see this path through the network.
00:29:30.266
And if you look at each line
00:29:32.233
in turn, and look at the AS
00:29:33.633
paths, you'll see I’ve just connected
00:29:35.566
the adjacent ASes together. And it gives
00:29:37.766
you this map, this part of the Internet topology.
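The graph construction described here can be sketched in a few lines: take each AS path, put the local AS at the front, and link every adjacent pair. The lecture does not give the number of the AS where the table was gathered, so `LOCAL_AS = 64512` (a private-use AS number) is a placeholder assumption, and `as_graph_edges` is a hypothetical helper name.

```python
# Placeholder for the AS where the routing table was gathered (the green node).
# 64512 is a private-use AS number chosen purely for illustration.
LOCAL_AS = 64512

AS_PATHS = [
    (5459, 5413, 5696, 7369),
    (1849, 702, 701, 6113, 5696, 7369),
    (2497, 5696, 7369),
]

def as_graph_edges(as_paths, local_as):
    """Connect each pair of adjacent ASes on every path into an undirected edge set."""
    edges = set()
    for path in as_paths:
        hops = (local_as, *path)
        for left, right in zip(hops, hops[1:]):
            if left != right:  # ignore an AS repeated on the path (prepending)
                edges.add(frozenset((left, right)))
    return edges

edges = as_graph_edges(AS_PATHS, LOCAL_AS)
for edge in sorted(tuple(sorted(edge)) for edge in edges):
    print(edge)
```

Links shared by several paths, such as the one between 5696 and 7369, appear only once because the edges are held in a set.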
00:29:43.333
And the arrows in red show
00:29:46.000
the preferred paths, which are highlighted
00:29:49.166
on the segment of the routing table.
00:29:52.000
You can see we’re starting to build
00:29:53.866
up the AS graph. We’re starting to
00:29:56.000
build up a map of the topology graph.
00:29:59.066
And, if you do this for the
00:30:00.500
entire graph, if you take the entire set of
00:30:03.900
entries in the routing table, you end
00:30:06.633
up with a graph like the CAIDA
00:30:08.833
graph I showed earlier.
00:30:16.266
So, we see that the routing works
00:30:18.566
by each autonomous system advertising some IP
00:30:22.033
address prefixes to its neighbours.
00:30:25.300
BGP works by each AS telling its neighbours
00:30:29.766
“I can reach these IP prefixes”,
00:30:33.033
“if you send traffic to me,
00:30:35.100
I will deliver it to these prefixes”.
00:30:38.833
And each AS chooses which of these
00:30:41.300
prefixes, which of these routes, to advertise
00:30:43.466
to its neighbours.
00:30:46.200
But it doesn't need to advertise everything
00:30:48.900
it knows. It doesn't need to advertise
00:30:51.466
out everything it receives.
00:30:54.166
Indeed it's common for BGP,
00:30:58.200
it's common for autonomous systems in BGP,
00:31:00.733
to drop some routes from their advertisement.
00:31:06.433
And, what address ranges, what AS paths
00:31:11.700
they advertise, really depends on the relationship
00:31:15.833
between the different autonomous systems.
00:31:20.166
And a common way this is done,
00:31:23.166
is using what’s known as the Gao-Rexford
00:31:25.333
rules. And this is a way of
00:31:27.933
categorising autonomous systems, and categorising how the
00:31:31.066
routing should work.
00:31:33.933
And for any autonomous system, any AS
00:31:36.833
in the Internet, it categorises the other
00:31:39.566
autonomous systems as either being
00:31:42.366
customers, peers, or providers of that AS.
00:31:47.700
So customers are easy. These are the
00:31:50.066
people to whom the network sells Internet service.
00:31:56.533
If the network we’re considering is JANET,
00:32:01.766
the Joint Academic NETwork that connects the
00:32:04.533
UK universities together, the customers are the
00:32:07.333
individual universities.
00:32:11.100
The peers are the other networks with
00:32:13.733
whom it exchanges traffic,
00:32:17.300
on a peer basis, without really charging.
00:32:22.466
The customers are people who pay you
00:32:25.633
for Internet access; the peers are the
00:32:28.500
people you agree to share traffic with at no cost.
00:32:32.333
And in the case of JANET,
00:32:34.500
the academic research network in the UK,
00:32:37.266
the peers might be the other academic
00:32:39.166
research networks around Europe, for example.
00:32:42.466
And the providers are the people who
00:32:44.900
you pay for Internet access, who this
00:32:47.100
AS pays for Internet access.
00:32:49.800
And this might be,
00:32:51.933
in the case of JANET, it would
00:32:53.933
be GÉANT, the pan-European
00:32:56.333
interconnect, or it might be a commercial
00:32:59.566
interconnect that connects it to the rest of the Internet.
00:33:04.166
And, the idea is that if you
00:33:06.100
get a route from one of your
00:33:07.500
customers, so if one of your customers
00:33:09.866
says “I have this IP address range”,
00:33:13.066
“I own these IP addresses”, you will
00:33:16.933
advertise that out to everybody.
00:33:20.266
One of your customers, one of the
00:33:23.033
people who is paying you for Internet
00:33:25.466
access, advertises that they own a particular
00:33:27.733
IP address range, you tell your other
00:33:30.733
customers, you tell your peers, and you
00:33:33.033
tell your provider.
00:33:36.033
And that makes sense. The customer is
00:33:38.633
paying you to provide Internet access,
00:33:41.200
paying you to deliver traffic for them,
00:33:43.933
but also paying you to deliver traffic
00:33:45.833
to them. So if they own a
00:33:47.700
particular IP address range, they want to
00:33:49.766
receive traffic destined for those addresses,
00:33:52.600
so you tell the rest of the Internet about it.
00:33:57.900
If you get a route from
00:34:01.100
one of your providers, or from one
00:34:02.666
of your peers, though, you only tell your customers.
00:34:10.733
This is a route you're paying to
00:34:13.200
use, rather than being paid to use,
00:34:15.933
and therefore you only tell the people,
00:34:19.466
you only tell the customers, who are
00:34:21.866
paying you to use it. And,
00:34:23.833
for a route from a provider,
00:34:25.200
this makes sense; you're explicitly paying for
00:34:27.166
access, so
00:34:28.766
you tell your customers. But you don’t
00:34:30.300
tell your peers, because you're paying for
00:34:32.166
this access. Why would you let them use it?
00:34:37.600
And, for routes received from your peers,
00:34:39.866
you tell your customers, because the peer
00:34:43.766
is willing to let you use this
00:34:45.866
route at no cost to your customers,
00:34:47.733
but you don't tell your provider,
00:34:49.366
you don't tell the rest of the Internet about it.
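The export rules just described reduce to a very small decision function. This is a minimal sketch of the rules as stated in the lecture, not any router's actual configuration language; the `Rel` enum and the `should_export` function are hypothetical names.

```python
from enum import Enum

class Rel(Enum):
    """How a neighbouring AS relates to us."""
    CUSTOMER = "customer"  # they pay us
    PEER = "peer"          # we exchange traffic at no cost
    PROVIDER = "provider"  # we pay them

def should_export(learned_from: Rel, neighbour: Rel) -> bool:
    """Decide whether a route learned from one neighbour is advertised to another."""
    if learned_from is Rel.CUSTOMER:
        # Customer routes are advertised to everybody.
        return True
    # Routes from peers or providers are only advertised to customers.
    return neighbour is Rel.CUSTOMER

for learned_from in Rel:
    for neighbour in Rel:
        verdict = "advertise" if should_export(learned_from, neighbour) else "drop"
        print(f"learned from {learned_from.value:8} -> {neighbour.value:8}: {verdict}")
```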
00:34:53.133
And the Gao-Rexford rules specify what routes are
00:34:56.333
advertised, so they specify potential ways traffic can flow.
00:35:01.700
This isn't saying “the traffic will go
00:35:04.600
this way”, it's saying there is a
00:35:07.133
potential route that traffic could follow,
00:35:09.400
if it wanted to get to this address.
00:35:15.000
And the result is what’s known as a valley-free
00:35:18.633
directed acyclic graph, a valley-free DAG.
00:35:22.233
And directed and acyclic means that
00:35:26.866
there's a direction: it shows you which
00:35:29.100
way to go, to get to a
00:35:30.800
particular range of IP addresses. It’s acyclic,
00:35:33.933
that means there are no loops.
00:35:35.733
And valley-free means it goes up,
00:35:38.733
and then along, and then down.
00:35:41.200
It never goes from a customer,
00:35:44.666
to its provider, then down to one
00:35:47.400
of its customers, and then back up
00:35:48.766
to another provider. It goes up,
00:35:50.800
then along, and then down.
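The valley-free property can be checked mechanically. The sketch below assumes each link on a path has already been labelled "up" (customer to provider), "across" (peer to peer), or "down" (provider to customer), and assumes the usual formulation in which a path crosses at most one peer link; the labels and the `is_valley_free` function are illustrative choices rather than anything defined in the lecture.

```python
def is_valley_free(links):
    """Check that a path goes up, then (optionally) along once, then down."""
    phase = 0  # 0 = still climbing, 1 = has crossed a peer link, 2 = descending
    for link in links:
        if link == "up":
            if phase != 0:
                return False  # climbing again after going along or down: a valley
        elif link == "across":
            if phase != 0:
                return False  # a second peer link, or a peer link after descending
            phase = 1
        elif link == "down":
            phase = 2
        else:
            raise ValueError(f"unknown link type: {link!r}")
    return True

print(is_valley_free(["up", "up", "across", "down"]))  # up, along, down
print(is_valley_free(["up", "down", "up"]))            # dips into a customer, then climbs
```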
00:35:54.233
And it's designed, essentially, to optimise for profit.
00:35:59.300
If someone is paying you for access,
00:36:02.533
you will advertise their routes, which allows
00:36:04.733
traffic to flow to them.
00:36:07.066
If you're paying for a route,
00:36:09.333
you only advertise it to people who
00:36:11.466
are paying you.
00:36:13.966
It’s designed to avoid advertising things which
00:36:19.200
you pay for, to people who are
00:36:20.933
not paying you for access.
00:36:26.833
All the autonomous systems exchange routing information
00:36:30.333
with their neighbours.
00:36:32.466
They exchange lists of IP prefixes,
00:36:36.633
and how they can be reached.
00:36:38.900
What path, what set of autonomous systems,
00:36:43.166
you have to go through to get to that prefix.
00:36:48.566
And they filter this based on the
00:36:49.966
policies. Maybe they apply the Gao-Rexford rules,
00:36:53.266
maybe they apply some other rules,
00:36:54.933
but they don't necessarily advertise all of
00:36:57.600
the prefixes, and all of the paths,
00:36:59.566
they know to all of their peers,
00:37:01.700
to all of their neighbours.
00:37:05.000
Each autonomous system has a partial view
00:37:08.133
of the AS-level topology. It knows what
00:37:11.766
its neighbours are willing to tell it.
00:37:15.733
And it takes that view of the
00:37:17.700
topology, and it applies a set of
00:37:19.966
rules that enforce its policy.
00:37:25.133
And maybe they filter out certain routes.
00:37:28.066
Maybe they don't tell their neighbours about
00:37:31.466
the existence of certain routes, because they
00:37:33.633
don't want them to use those routes for some reasons.
00:37:38.466
Maybe it filters out certain routes its neighbours tell it.
00:37:43.033
The neighbouring AS is willing to deliver
00:37:45.533
traffic in that direction, but it doesn't
00:37:47.500
want the traffic to flow that way,
00:37:49.166
so it filters out that prefix from its routing table.
00:37:54.466
Maybe it prioritises, or de-prioritises, certain other
00:37:58.633
routes. Maybe it tags particular routes for
00:38:02.366
special processing, if there's a particular business
00:38:04.933
reason to do so.
00:38:07.633
And it goes through, and it applies its policies.
00:38:13.066
The table shows the criteria people use,
00:38:18.866
and there’s a local preference, the length
00:38:22.533
of the AS path,
00:38:24.800
the type of origin; is this something
00:38:28.200
you know because it's one of your
00:38:29.700
directly connected customers, or is it something
00:38:31.933
you’ve learnt from one of the other networks?
00:38:35.866
There’s a multi-exit discriminator if there are
00:38:38.166
several ways of getting to a single destination.
00:38:42.733
And so on. There’s a bunch of policies.
00:38:49.300
The point is that,
00:38:53.866
just because you know the existence of
00:38:56.233
a route, doesn't mean you use it.
00:38:59.166
And you don't necessarily
00:39:01.400
pick the shortest routes, you pick the
00:39:03.633
shortest route that matches all your policies
00:39:05.933
after filtering the graph.
00:39:11.166
And, this means that the route that
00:39:14.066
data takes to get through the network,
00:39:17.566
may not necessarily be the shortest route
00:39:19.933
through the network.
00:39:21.366
It’s the shortest route that meets all policy constraints.
00:39:26.466
It means there may be cases where
00:39:29.633
data can't get to a particular destination,
00:39:34.533
even if there is a potential route
00:39:36.866
there, because the autonomous systems don't have
00:39:39.900
a policy which allows it to go in that direction.
00:39:44.500
There are cases where the network could
00:39:47.466
deliver data to a particular destination,
00:39:50.000
but won’t, because the policy choices made
00:39:52.933
by one, or more, of the ISPs
00:39:55.000
in some parts of the world,
00:39:56.266
won't allow traffic from those parts of
00:39:58.400
the world to reach that destination.
00:40:03.533
It's finding the shortest policy-compliant path.
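A heavily simplified sketch of "shortest policy-compliant path" follows: filter the candidate routes by policy, then prefer the highest local preference, and only then the shortest AS path. The real decision process has more tie-breaking steps (origin type, multi-exit discriminator, and so on), and the `local_pref` values, the `avoid_ases` policy, and the `best_route` function are all assumptions made for illustration.

```python
# Candidate routes to 12.10.231.0/24 from the lecture. The local_pref values
# are hypothetical; the lecture does not say what the operator configured.
ROUTES = [
    {"next_hop": "194.68.130.254", "as_path": (5459, 5413, 5696, 7369), "local_pref": 100},
    {"next_hop": "158.43.133.48", "as_path": (1849, 702, 701, 6113, 5696, 7369), "local_pref": 100},
    {"next_hop": "202.232.1.8", "as_path": (2497, 5696, 7369), "local_pref": 100},
]

def best_route(candidates, avoid_ases=frozenset()):
    """Pick the preferred route after applying a simple avoidance policy."""
    # Step 1: policy filter. Drop any route that crosses an AS we want to avoid.
    allowed = [r for r in candidates if not avoid_ases & set(r["as_path"])]
    if not allowed:
        return None  # a route exists, but policy forbids all of them
    # Step 2: highest local preference wins; AS path length only breaks ties.
    return min(allowed, key=lambda r: (-r["local_pref"], len(r["as_path"])))

print(best_route(ROUTES)["next_hop"])                       # no policy: shortest path
print(best_route(ROUTES, avoid_ases={2497})["next_hop"])    # policy forces a longer path
print(best_route(ROUTES, avoid_ases={5696}))                # every path is filtered out

# Raising local preference on one neighbour overrides path length entirely.
tuned = [dict(r, local_pref=200) if r["next_hop"] == "158.43.133.48" else r for r in ROUTES]
print(best_route(tuned)["next_hop"])
```

The third call illustrates the point made above: the destination is reachable in principle, but no route survives the policy filter.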
00:40:13.900
BGP is
00:40:17.966
a very political protocol.
00:40:22.933
How the information is exchanged is straightforward.
00:40:27.033
The autonomous systems exchange lists of prefixes,
00:40:32.000
and the AS path in order to get to those prefixes.
00:40:36.300
How those paths are filtered and prioritised
00:40:40.300
is where it gets difficult.
00:40:46.566
In many cases the policy, and economic,
00:40:49.533
and political concerns outweigh the shortest path.
00:40:53.233
The routes are filtered, and they’re prioritised,
00:40:55.633
and they’re de-prioritised, based on policy choices,
00:40:59.466
based on how much it costs a
00:41:01.600
particular AS, and based on
00:41:05.433
political decisions as to which ASes,
00:41:09.066
which regions, which countries, to prefer.
00:41:14.100
And the autonomous systems are competitors,
00:41:17.633
they don't really trust each other.
00:41:22.700
And, as a result, it's hard to
00:41:25.200
say how BGP really works, because the
00:41:28.733
ASes won't tell anyone outside their own organisation.
00:41:36.866
We know what information is advertised: we can put
00:41:40.400
a monitor at some point in the
00:41:41.633
network and see what information is reaching
00:41:44.600
that point in the network, we can see
00:41:46.800
what other ASes are willing to advertise
00:41:49.533
to a monitor at that point in the network.
00:41:54.300
We can get a friendly AS to
00:41:56.100
show us the BGP data they're receiving.
00:41:59.466
And there are projects, such as RIPE
00:42:02.466
RIS, or the RouteViews project from the
00:42:05.633
University of Oregon, which archive this data,
00:42:08.266
and store it, and make it available for people.
00:42:11.766
And we know the BGP decision process,
00:42:14.300
we know the algorithm the routers follow
00:42:17.233
to exchange the data. We saw that
00:42:20.333
in a previous slide, and it's deterministic
00:42:22.533
about how they pick a particular route.
00:42:26.900
But what we don't know is the
00:42:28.300
data which is going into that algorithm.
00:42:31.566
We know the set of routes that
00:42:33.000
are being advertised, but they are then
00:42:34.800
filtered, and prioritised, and de-prioritised, and munged,
00:42:38.566
before they go into the decision process in the routers.
00:42:41.866
And how each autonomous system does this,
00:42:44.266
is a trade secret of that AS,
00:42:46.266
and they won't tell the rest of
00:42:47.566
the network. And this makes it difficult
00:42:50.033
to evaluate how routing decisions are made in practice.
00:42:53.933
We can see the end result.
00:42:55.866
We can put a monitor in the
00:42:58.266
network somewhere and see the routing tables
00:43:00.800
that it gets. And, based on that,
00:43:02.966
we can infer how the data will
00:43:05.366
get to a particular destination.
00:43:07.733
But how those tables got filtered,
00:43:10.133
and what other routes exist which are
00:43:12.033
being de-prioritised and filtered out so we
00:43:14.766
can't see them, that we don't know.
00:43:17.033
We don't know the potential connections which
00:43:19.000
we're not allowed to use.
00:43:23.233
That's all I want to say about interdomain routing.
00:43:27.200
We’ve got a network of networks.
00:43:30.066
At the edges, the routing is easy.
00:43:35.100
Within an edge network, you point to
00:43:38.433
the default gateway, and
00:43:41.033
between networks at the edges, again,
00:43:44.033
you can use a default route,
00:43:46.333
you just forward towards the core.
00:43:48.733
In the core you have the default
00:43:50.700
free zone, everyone knows everything,
00:43:53.466
everyone has to know all of the paths.
00:43:56.333
And they use BGP to exchange this
00:43:58.266
data, and then they filter it,
00:43:59.900
and munge it, and process it,
00:44:01.233
to suit their policy needs, and it
00:44:03.166
becomes very opaque what happens.
00:44:05.966
Eventually, though, the packets get delivered,
00:44:08.333
we hope, and the Internet routing works.
00:44:12.066
In the next part, I'll talk about
00:44:13.866
routing security, and after that I'll talk
00:44:16.333
about intradomain routing,
00:44:18.300
how routing works within a network.
Part 3: Routing Security
Some of the security limitations of BGP routing, and the potential for
accidental or malicious route hijacking, are discussed. The RPKI and
MANRS are discussed as possible approaches to improving BGP routing
security.
Slides for part 3
00:00:00.366
Having discussed interdomain routing in detail in
00:00:02.366
the previous part of the lecture,
00:00:04.666
I’d like to move on and talk briefly about routing security.
00:00:08.600
I’ll talk about what is Internet routing
00:00:10.766
security, and the problems of secure routing
00:00:13.633
in the Internet, and I’ll talk about
00:00:15.433
two approaches to addressing some of these
00:00:17.300
problems, the Resource Public Key Infrastructure,
00:00:20.333
RPKI, and the Mutually Agreed Norms for
00:00:23.266
Routing Security, MANRS.
00:00:28.600
So the issue with routing in the Internet
00:00:32.666
is being able to advertise prefixes,
00:00:37.800
address ranges, into BGP.
00:00:40.566
And, to be sure that only the
00:00:43.866
legitimate owner of that address range,
00:00:46.600
only the legitimate owner of a particular
00:00:48.333
prefix, can do that, such that the
00:00:51.033
traffic goes to the correct destinations.
00:00:57.300
And the problem with BGP, and the
00:01:00.266
problem with Internet routing security, is that
00:01:03.700
it doesn't provide this guarantee.
00:01:06.466
The problem with BGP is that any
00:01:08.933
autonomous system participating in BGP routing can
00:01:12.533
announce any address prefix.
00:01:15.066
And they can announce any address prefix
00:01:17.033
whether or not they own that prefix.
00:01:21.333
Once an autonomous system has the ability
00:01:26.900
to participate in BGP, once one of
00:01:29.833
the existing BGP speakers has agreed to
00:01:32.166
peer with it and accept routes from that AS,
00:01:36.133
the expectation is that it will announce
00:01:38.200
its own routes, announce the routes to
00:01:40.566
its own address space, and to those of its customers.
00:01:44.300
But, if an autonomous system chooses to
00:01:47.033
announce address space owned by someone else,
00:01:50.966
then there’s nothing to stop it from doing that.
00:01:55.266
And this can happen accidentally. Or it
00:01:57.833
can happen because of people maliciously trying
00:02:00.866
to redirect traffic, such that traffic to
00:02:04.266
a particular destination goes to a fake
00:02:07.133
site, or follows a
00:02:10.966
path through a site which can snoop on particular traffic.
00:02:16.633
And the result is that the traffic
00:02:18.100
gets misdirected. It’s what’s known as a
00:02:20.366
BGP hijacking attack.
00:02:24.733
And this happens frequently by accident,
00:02:28.366
and these accidental hijackings of prefixes are
00:02:32.233
a serious stability problem for the network.
00:02:35.333
But it can also happen due to malicious activities.
00:02:41.166
A well-known example of the type of
00:02:44.166
problem that can happen, is linked from
00:02:47.433
the slide, and this happened when an
00:02:50.300
Internet service provider in Pakistan
00:02:54.500
managed to announce the IP address range
00:02:58.033
for YouTube to the Internet.
00:03:01.100
And what was happening was that a
00:03:04.066
court in Pakistan ruled that
00:03:09.200
ISPs in that country were to block
00:03:12.666
access to YouTube,
00:03:15.100
because the content, some of the content,
00:03:18.233
on YouTube was ruled to infringe local
00:03:21.433
laws. And the ISPs in Pakistan were
00:03:24.500
told to block access to this content.
00:03:27.900
And the way this ISP tried to
00:03:30.066
do that, was by injecting a route
00:03:33.500
to the IP address ranges owned by,
00:03:36.166
and used by, YouTube,
00:03:38.633
to its part of the network.
00:03:42.966
And the idea was that all of
00:03:44.533
its customers, within the country, would see
00:03:49.533
this route advertisement, and their traffic would
00:03:52.833
be redirected to a page that says
00:03:55.066
“access to the site is blocked in this country”.
00:03:59.333
And, if they’d successfully sent that announcement
00:04:02.800
only into Pakistan, that would have worked
00:04:05.466
just fine. That’s a perfectly reasonable technical
00:04:09.133
method of blocking access to a particular
00:04:11.366
site, is that you inject the route that way.
00:04:15.766
The problem is that they misconfigured their
00:04:17.666
routers, and also announced it to the
00:04:19.366
rest of the Internet, as well as to
00:04:23.100
their customers within the country.
00:04:26.633
And, as a result of that,
00:04:28.000
all of the YouTube traffic in the
00:04:30.133
network was redirected to this site in
00:04:32.633
Pakistan, which stated that the traffic was blocked.
00:04:37.033
Now, as you can imagine, this was
00:04:39.366
noticed fairly quickly. The particular ISP that
00:04:43.200
was making the incorrect announcement was located,
00:04:46.300
and the announcement was filtered out
00:04:48.933
very near to that ISP, and so
00:04:52.433
the problem didn't last long.
00:04:54.433
But it does show that it's possible
00:04:56.433
to accidentally disrupt global routing operations,
00:05:01.200
in a really quite surprising, and widespread, way.
00:05:08.700
And this type of problem happens,
00:05:11.300
in perhaps less high-profile ways, on a
00:05:13.933
daily basis. And there are also malicious attacks, where
00:05:20.200
sites are redirected to a fake version
00:05:22.600
of a site, or traffic is redirected
00:05:25.000
so that it passes through a particular
00:05:26.833
network, where an attacker can snoop on that traffic.
00:05:31.700
And this is a serious problem.
00:05:33.333
We'd like to solve this problem,
00:05:35.133
we'd like to make sure that only
00:05:36.766
the legitimate owner of a prefix can
00:05:38.400
advertise routes to that prefix.
00:05:44.300
How is this done?
00:05:48.233
Well, the
00:05:50.033
current best approach to solving this is
00:05:53.033
a technique, known as the Resource Public
00:05:54.866
Key Infrastructure, RPKI.
00:05:59.100
And the RPKI is an attempt to secure Internet routing.
00:06:04.366
And what it does, is it allows
00:06:06.266
autonomous systems to make signed
00:06:08.666
route origin authorisations.
00:06:12.900
And these are messages which get sent in BGP
00:06:17.000
which provide a digital signature for a
00:06:20.166
particular prefix announcement.
00:06:23.300
So, along with the announcement that
00:06:26.033
an autonomous system owns a particular IP
00:06:30.233
address range, and can route traffic to
00:06:33.366
that address range, which goes into BGP
00:06:36.266
as normal, and you get the usual
00:06:38.300
AS paths like we saw in the previous part,
00:06:41.833
RPKI allows the autonomous systems to send
00:06:46.500
a digital signature.
00:06:48.633
And this also progresses through the BGP
00:06:52.266
system, and follows the same route through
00:06:54.533
BGP, and gets filtered and processed in
00:06:57.200
BGP in the same way that the
00:06:59.266
route advertisements do.
00:07:01.300
But it also includes a digital signature,
00:07:04.233
stating that the ISP owns this particular
00:07:07.800
address range, and signed by the next
00:07:10.033
level up in the hierarchy of the routing system.
00:07:15.400
So, at the top-level, the regional Internet
00:07:18.133
registries, RIPE, and ARIN, and so on,
00:07:21.233
which assign IP address ranges to ISPs,
00:07:25.300
provide a signed statement that they have
00:07:27.500
delegated a particular address range to a
00:07:30.533
particular autonomous system, a particular ISP.
00:07:33.500
And if that ISP delegates a subset
00:07:35.533
of that address range to one of
00:07:37.133
its customers, it can make a signed
00:07:38.833
announcement to do so, and that is, in turn, signed.
00:07:42.733
The signatures ripple up all the way to the root.
00:07:47.600
So you get this hierarchical delegation,
00:07:50.766
with digitally signed statements announcing the delegation
00:07:53.866
of the prefixes.
00:07:56.966
And this allows a router which receives
00:07:59.700
a prefix advertisement, and receives one of
00:08:02.233
these Route Origin Authorisation announcements, to validate
00:08:05.366
whether that prefix is authorised.
00:08:08.600
And the idea is that valid prefixes
00:08:10.833
will have one of these ROAs,
00:08:14.933
the Route Origin Authorisation digital signatures provided,
00:08:18.500
and the invalid prefixes, the hijacked prefixes, will not.
00:08:23.766
And when applying BGP policy, the other
00:08:27.100
networks that comprise the Internet can look,
00:08:29.633
and they can prefer prefixes which are
00:08:32.100
digitally signed over those which are not.
00:08:34.833
And that makes it harder to hijack a prefix.
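The validation step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular router implements it: it assumes the ROAs have already been obtained and their signatures checked, and it represents them as a local table. The `Roa` type, the `ROAS` table, and the `validate` function are hypothetical names, and the prefixes and AS numbers are reserved documentation values rather than real allocations.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    """One already-verified route origin authorisation (hypothetical record type)."""
    prefix: ipaddress.IPv4Network  # address range the authorisation covers
    max_length: int                # longest prefix length the origin may announce
    origin_as: int                 # AS number authorised to originate the prefix


# Hypothetical ROA table, using documentation prefixes and AS numbers.
ROAS = [
    Roa(ipaddress.ip_network("192.0.2.0/24"), 24, 64496),
    Roa(ipaddress.ip_network("198.51.100.0/24"), 24, 64497),
]


def validate(prefix: str, origin_as: int, roas=ROAS) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    net = ipaddress.ip_network(prefix)
    # Only ROAs whose prefix covers the announced prefix are relevant.
    covering = [roa for roa in roas if net.subnet_of(roa.prefix)]
    if not covering:
        return "not-found"  # nobody has published an authorisation for this space
    for roa in covering:
        if roa.origin_as == origin_as and net.prefixlen <= roa.max_length:
            return "valid"
    return "invalid"  # covered by a ROA, but wrong origin or too specific


# The authorised origin is accepted; a different origin for the same prefix is not.
legitimate = validate("192.0.2.0/24", 64496)
hijack = validate("192.0.2.0/24", 64511)
```

A network applying policy could then prefer routes classified as valid, and treat routes classified as invalid as likely hijacks.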
00:08:39.766
And RPKI is starting to get traction.
00:08:42.866
It's a relatively new standard, it's maybe
00:08:47.966
10 years old now, and the measurements
00:08:52.133
in the paper we see linked on
00:08:54.633
the slide here, show that, as of
00:08:56.333
a couple of years ago, about 10-12%
00:08:58.933
of the IPv4 addresses
00:09:01.033
are covered by a prefix with a
00:09:03.400
valid signature, and this was growing rapidly.
00:09:07.466
And the links to the CloudFlare blog,
00:09:10.200
and to the isbgpsafeyet.com site,
00:09:13.600
present more up-to-date statistics, and it's continuing
00:09:20.233
to grow, and RPKI is starting to become widely used.
00:09:25.466
And it's starting to become possible to
00:09:27.466
validate the authenticity of the routing announcements.
00:09:35.933
The other approach to routing security is
00:09:39.300
a system known as MANRS.
00:09:42.133
And MANRS is a set of mutually
00:09:44.266
agreed norms for routing security.
00:09:48.066
It's a project which is sponsored by the Internet Society,
00:09:52.733
and is a collaboration between a set
00:09:55.633
of network operators to improve routing security.
00:10:00.000
And it's mostly there to share best practices.
00:10:03.866
It shares information on how to effectively
00:10:06.533
use RPKI; it shares configuration options;
00:10:10.500
it shares tips and approaches for correctly
00:10:14.133
configuring routers, for correctly configuring filtering,
00:10:19.533
for providing anti-spoofing measures; and for coordinating
00:10:23.466
responses to accidental or malicious
00:10:28.566
route hijacking when it's discovered.
00:10:35.500
And it's mostly there as a talking
00:10:37.700
shop, as a forum for the ISPs
00:10:39.966
to coordinate, to make sure that the
00:10:43.100
routing system is stable, to address problems
00:10:45.900
as they occur, and to share and
00:10:48.566
to develop best practices for security.
00:10:55.233
And that's essentially all I want to
00:10:56.733
say about routing security.
00:10:59.166
Historically, the Internet routing has not been
00:11:02.000
secure at all.
00:11:04.366
As RPKI, and as MANRS, start to
00:11:07.533
get rolled-out, we’re starting to see some
00:11:10.266
improvements here, we're starting to see people
00:11:12.466
taking this problem seriously, and trying to
00:11:15.200
bring in some security.
00:11:18.233
We're not there yet. The routing is
00:11:20.600
still not particularly secure. Route hijacking,
00:11:23.333
BGP hijacking, still happens on a daily
00:11:26.900
basis, but things are getting better.
Part 4: Intra-domain Routing
Moving on from the discussion of BGP and inter-domain routing, the
final part of the lecture briefly reviews intra-domain routing and
how it differs. The concepts of distance vector and link state
routing are discussed, and the differences in scalability and
convergence times are noted. The lecture concludes with a discussion
of challenges in recovering from link failures in routing, including
fast failover and equal cost multipath routing.
Slides for part 4
00:00:00.466
The previous parts of the lecture have
00:00:02.300
spoken about interdomain routing, routing between the
00:00:05.566
networks that form the Internet.
00:00:07.766
In this final part, I want to
00:00:09.666
talk very briefly about intradomain routing,
00:00:12.566
routing within a network, and just very
00:00:15.300
briefly recap the distance vector and link
00:00:17.666
state routing algorithms.
00:00:21.900
So, as we saw in the previous
00:00:24.200
parts of the lecture, BGP and interdomain
00:00:28.133
routing are about giving information on the
00:00:30.433
path to reach other networks.
00:00:32.766
They're about the way the set of
00:00:35.533
networks that comprise the Internet work together
00:00:40.333
to exchange information needed to route packets
00:00:44.800
across the network.
00:00:47.733
And BGP is very much a policy-focused
00:00:51.266
routing protocol. The challenges in interdomain routing
00:00:55.700
are primarily to do with enforcing routing policy.
00:01:01.700
They’re primarily to do with getting the
00:01:05.033
networks which comprise the Internet,
00:01:08.566
which are, fundamentally, competitors, to work together
00:01:13.466
enough that they can deliver data across
00:01:15.366
the network. It's about expressing the business
00:01:20.766
constraints, the economic constraints,
00:01:22.733
the political constraints,
00:01:24.033
the policy constraints, that affect the way
00:01:26.966
data is delivered.
00:01:30.100
The question of intradomain routing, routing within
00:01:33.633
a network, is quite different.
00:01:36.966
If you look at routing, how to
00:01:39.400
route traffic within an autonomous system, within a network,
00:01:43.733
you find that it's very much a single trust domain.
00:01:49.366
The entire network is operated by a
00:01:52.266
single operator, and that's the point of
00:01:54.233
intradomain routing, it's within a domain,
00:01:56.500
it’s within an autonomous system, it's within a network.
00:01:59.500
So there's a single trust domain,
00:02:01.300
and there's no real policy restrictions on
00:02:04.066
who can see the information about the
00:02:05.666
network, or on which links can be used.
00:02:09.900
When we're talking about BGP, and interdomain routing,
00:02:16.500
the different networks, the different parts of
00:02:19.600
the system, want to hide their internal
00:02:21.866
details. They want to hide the information
00:02:24.133
about what's going on inside their network,
00:02:25.866
from their competitors.
00:02:28.566
If we're considering intradomain routing, we’re routing
00:02:32.033
within a network owned and operated by
00:02:34.600
a single organisation, and the rest of
00:02:36.333
the organisation can see what's going on.
00:02:38.800
They can see the topology of the
00:02:40.500
network, they can understand the constraints it’s
00:02:42.733
operating under, because all of the parts
00:02:44.800
of the organisation are working together for one goal.
00:02:48.433
So there tend not to be policy
00:02:49.933
restrictions on who can see the topology,
00:02:52.933
or which devices can understand the constraints
00:02:56.333
on the network.
00:02:57.933
And there tend not to be policy
00:02:59.900
restrictions on which links can be used.
00:03:03.166
Certainly backup links, and so on,
00:03:05.833
exist, but there's no need to hide
00:03:07.600
those links; they’re visible to the entire system.
00:03:12.533
And, generally, the goal is to get
00:03:14.466
very efficient routing. We’re trying to find
00:03:17.533
the shortest path through the network.
00:03:20.166
Unlike interdomain routing, where the goal
00:03:23.900
is to find the shortest policy-compliant path,
00:03:26.733
the goal here is just to find
00:03:28.333
the most efficient use of the resources you have.
00:03:33.033
There’s two fundamental approaches that people use
00:03:36.400
for intradomain routing.
00:03:38.633
There’s an approach known as distance vector,
00:03:41.066
which tends to get instantiated in the
00:03:43.733
Routing Information Protocol, RIP, or there’s an
00:03:47.033
approach known as link state routing,
00:03:49.100
which has been instantiated in a protocol
00:03:52.166
called the Open Shortest Path First routing protocol, OSPF.
00:03:59.966
So, first off, I’ll just briefly talk
00:04:02.166
about distance vector routing.
00:04:04.466
The idea here is that the nodes
00:04:06.833
in the network, the routers that comprise
00:04:10.300
the network, maintain a routing table which contains
00:04:15.400
the distance they are from every other
00:04:18.033
node, and the next hop to get towards that node.
00:04:22.666
And we have an example on the
00:04:25.366
slide here, that shows a network with
00:04:28.100
seven nodes. And in this example,
00:04:31.366
they’re labeled with the letters A, B, C, D, etc.
00:04:35.433
And, in a real system, these would
00:04:37.633
have IP addresses to identify them,
00:04:40.266
but that just makes the slide complicated.
00:04:44.333
And we see an example of
00:04:48.066
the routing table, as shown at node A.
00:04:51.900
And we see that node A contains
00:04:53.700
a list of all of the other
00:04:55.033
nodes of the network, destinations B,
00:04:58.000
C, D, E, F, and G.
00:05:00.433
And, for each of those, it maintains
00:05:02.433
the distance, how far away it is
00:05:04.933
from that node, in number of hops.
00:05:06.833
So it's one hop away from node
00:05:08.666
B, it’s directly connected to B,
00:05:11.433
and it can reach it via node
00:05:13.566
B, it's directly connected. Similarly, it's one
00:05:16.500
hop away from C. It's two hops
00:05:19.066
away from D, and it knows the
00:05:20.766
next hop to get there is C, and so on.
00:05:25.133
And each node in the network periodically
00:05:27.533
exchanges a message with its neighbours,
00:05:29.900
where it tells its neighbours, “these are
00:05:33.400
who I think my other neighbours are,
00:05:35.433
and this is how far away I think I am from them,
00:05:39.900
and this is how far away I am from the rest of
00:05:42.800
the network as well”. And this information
00:05:45.566
gradually spreads through the network.
00:05:48.066
And, in the first round of this
00:05:50.800
exchange, each node just finds out its
00:05:53.066
neighbours, then it finds out its neighbours’
00:05:55.433
neighbours, and then its neighbours’ neighbours’
00:05:57.700
neighbours, and so on.
00:06:01.500
And the protocol operates in rounds.
00:06:04.933
It continually exchanges this information with the
00:06:07.966
neighbours, and gradually fills in the map
00:06:10.466
of the network so it knows how
00:06:12.566
far away it is from every node
00:06:14.766
in the network, and what's the best
00:06:17.066
way of getting there.
00:06:19.633
And once it's done that, it just
00:06:21.300
forwards the packets on the shortest path
00:06:23.033
to the destination, based on the hop
00:06:25.066
count, based on the distance. And if
00:06:27.166
there's two ways of getting there with
00:06:28.733
the same hop count it can pick arbitrarily.
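The round-based exchange described above can be sketched as follows. This is a simplified model rather than an implementation of RIP: all nodes exchange their tables in lock-step, the metric is hop count, and the seven-node topology in `LINKS` is an assumption made up for illustration (chosen so that A reaches D in two hops via C, as in the example on the slide). The value 16 for infinity is the one RIP uses.

```python
INFINITY = 16  # RIP's representation of "unreachable"


def distance_vector(links):
    """Run synchronous distance vector rounds until no table changes.

    links: list of undirected (node, node) pairs.
    Returns (tables, rounds), where tables[node][dest] = (hops, next_hop).
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Initially each node only knows about itself.
    tables = {node: {node: (0, node)} for node in neighbours}
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        # Snapshot what every node advertises at the start of this round.
        adverts = {node: dict(table) for node, table in tables.items()}
        for node, nbrs in neighbours.items():
            for nbr in sorted(nbrs):
                for dest, (dist, _) in adverts[nbr].items():
                    candidate = min(dist + 1, INFINITY)
                    current = tables[node].get(dest, (INFINITY, None))[0]
                    if candidate < current:
                        # Reaching dest through this neighbour is shorter.
                        tables[node][dest] = (candidate, nbr)
                        changed = True
    return tables, rounds


# Hypothetical seven-node topology.
LINKS = [("A", "B"), ("A", "C"), ("B", "E"), ("C", "D"),
         ("D", "F"), ("E", "F"), ("F", "G")]

tables, rounds = distance_vector(LINKS)
for dest, (hops, next_hop) in sorted(tables["A"].items()):
    print(f"A -> {dest}: {hops} hops via {next_hop}")
```

In the first round each node learns only its direct neighbours, and each further round extends its knowledge by one hop. Where two routes have the same hop count, as for F in this topology, the sketch keeps whichever it learned first.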
00:06:34.533
Now, distance vector routing is
00:06:37.833
relatively straightforward, and it doesn't maintain too
00:06:41.633
much information at the nodes. All it
00:06:46.200
stores is a list of the other
00:06:47.866
nodes, and the distance, and next hop,
00:06:50.933
so the amount of state it needs
00:06:53.166
is linear with the size of the network.
00:06:55.800
The amount of entries in the routing
00:06:57.566
table grows linearly with the number of
00:06:59.800
nodes in the network, so it's relatively
00:07:02.300
resource efficient.
00:07:05.000
But it's slow to converge, because of
00:07:07.266
the way it operates in rounds,
00:07:09.366
and it has a problem where certain types
00:07:14.200
of failures can lead
00:07:16.833
to a behaviour where the distance gradually
00:07:19.700
counts up by one each iteration of
00:07:23.700
the algorithm, each iteration of the routing protocol.
00:07:27.233
And when a failure has happened,
00:07:29.233
it gradually counts up by one until
00:07:31.066
it gets to the representation of infinity
00:07:34.000
in the system, and takes multiple rounds
00:07:36.133
to converge and detect the failure.
00:07:39.366
And that behaviour can lead to very
00:07:41.766
slow convergence, and the system not being
00:07:44.900
able to recover from a link failure effectively.
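The counting-up behaviour can be reproduced with a very small model. The sketch below assumes a hypothetical three-node line, A - B - C, in which the link between B and C has just failed, and it tracks only the distance each remaining node believes it is from C. The nodes take turns advertising, no mitigation is modelled, and infinity is taken to be 16, as in RIP.

```python
INFINITY = 16  # RIP's representation of "unreachable"


def count_to_infinity(infinity=INFINITY):
    """Return the sequence of believed distances to C after the B-C link fails."""
    # A still believes C is two hops away (via B); B has just lost its route.
    dist = {"A": 2, "B": infinity}
    history = [dict(dist)]
    listener, speaker = "B", "A"
    while min(dist.values()) < infinity:
        # The listener trusts the speaker's advertisement and adds one hop,
        # unaware that the speaker's route passes back through the listener.
        dist[listener] = min(dist[speaker] + 1, infinity)
        history.append(dict(dist))
        listener, speaker = speaker, listener
    return history


history = count_to_infinity()
for step, dist in enumerate(history):
    print(step, dist)
```

Each advertisement raises the believed distance by one hop, so in this model it takes 15 exchanges before both nodes agree that C is unreachable.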
00:07:52.733
The alternative algorithm, which is widely used
00:07:56.566
in the network, is what's known as link state routing.
00:08:00.800
And the idea of link state routing
00:08:03.733
is that the nodes in the network
00:08:07.133
know, obviously, the links to their neighbours.
00:08:10.200
They know which other routers they are directly connected to.
00:08:13.633
And they know some metric about the
00:08:15.900
cost of using those links.
00:08:19.133
And that may just be the link
00:08:21.633
bandwidth, as a metric, or it may
00:08:24.433
be the delay, or it may be
00:08:26.233
a hard-coded metric chosen by the operator.
00:08:31.266
And when a node starts up,
00:08:35.366
or when a link changes, when something
00:08:37.600
changes in the network, the nodes can
00:08:39.666
flood this information throughout the network.
00:08:44.066
They can send to all of their
00:08:45.500
neighbours the list of directly connected nodes,
00:08:49.466
and the cost for using that link,
00:08:51.433
along with a sequence number for these messages.
00:08:56.533
And this gets flooded throughout the whole
00:08:59.000
network, so every node in the network
00:09:01.700
learns every other node in the network,
00:09:04.533
and what are each node’s neighbours.
00:09:08.500
So node A, in this example,
00:09:10.300
will flood out through the network that
00:09:13.133
it's node A, its neighbours are B,
00:09:15.466
C, E, and F, and it will
00:09:17.366
flood out the metrics, the speed of
00:09:19.333
the links for example. And this will go everywhere.
00:09:23.500
This will get flooded throughout this entire
00:09:25.866
network, so node B will know what
00:09:29.300
is node A and what are its
00:09:30.500
neighbours, and so will node C,
00:09:32.566
and D, and E, and F,
00:09:33.700
and G, and H. And every one of those
00:09:35.866
nodes knows that node A exists,
00:09:38.800
and which nodes it's directly connected to.
00:09:42.066
And this happens for every node.
00:09:43.866
Each node periodically floods this information out,
00:09:46.633
whenever anything changes.
00:09:50.166
And, over time, this means that the
00:09:52.000
entire network, all of the nodes in
00:09:53.900
the network, all the routers in the
00:09:55.666
network, get to learn all of the
00:09:58.233
other links in the network.
00:10:00.600
They get to know which nodes are directly connected.
00:10:04.333
At that point they can just draw a
00:10:06.800
complete map of the network. Every node
00:10:09.333
knows the complete network topology,
00:10:12.733
and at that point, it can run
00:10:14.400
Dijkstra’s algorithm, calculate the shortest path to
00:10:17.533
every other node in the network,
00:10:20.433
and use that to make the decisions
00:10:22.366
which way it forwards the packets.
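Once flooding has completed, the shortest path calculation at each node can be sketched as below. The link-state database `LSDB` and its link costs are hypothetical, made up for illustration (node A's neighbours are B, C, E, and F, as in the example), and `shortest_paths` is an assumed name rather than part of any routing implementation.

```python
import heapq


def shortest_paths(lsdb, source):
    """Dijkstra's algorithm over a link-state database.

    lsdb: {node: {neighbour: link cost}}, as learned from flooding.
    Returns {destination: (total cost, next hop from source)}.
    """
    best = {}
    # Queue entries are (cost so far, node, first hop on the path).
    queue = [(0, source, source)]
    while queue:
        cost, node, first_hop = heapq.heappop(queue)
        if node in best:
            continue  # already settled by a cheaper path
        best[node] = (cost, first_hop)
        for nbr, link_cost in lsdb[node].items():
            if nbr not in best:
                hop = nbr if node == source else first_hop
                heapq.heappush(queue, (cost + link_cost, nbr, hop))
    del best[source]
    return best


# Hypothetical link-state database with symmetric link costs.
LSDB = {
    "A": {"B": 1, "C": 4, "E": 2, "F": 7},
    "B": {"A": 1, "D": 5},
    "C": {"A": 4, "D": 1},
    "D": {"B": 5, "C": 1, "G": 3},
    "E": {"A": 2, "F": 2},
    "F": {"A": 7, "E": 2, "G": 1},
    "G": {"D": 3, "F": 1},
}

routes = shortest_paths(LSDB, "A")
for dest, (cost, next_hop) in sorted(routes.items()):
    print(f"A -> {dest}: cost {cost} via {next_hop}")
```

Note that A reaches F via E at cost 4 rather than over the direct link at cost 7: with a complete map, the node can pick the cheapest path rather than the one with fewest hops.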
00:10:26.633
Now, this works much better,
00:10:30.966
because every node knows the complete topology.
00:10:34.966
If something fails, they can recover quite
00:10:36.800
quickly, as soon as the message gets
00:10:38.900
to them, they don't have to wait
00:10:40.966
for the count-to-infinity cycle that the distance
00:10:43.200
vector routing has.
00:10:45.933
The disadvantage of it, though, is that
00:10:48.433
it needs more memory, and it needs
00:10:50.333
more compute cycles.
00:10:52.566
Not only does each node store the
00:10:55.666
distance to every other node, but it
00:10:57.233
stores a complete map of the network.
00:10:59.733
So the amount of state each router,
00:11:03.100
each node in the network, needs to
00:11:04.800
store is equal to the size of the network squared.
00:11:08.100
So it scales order n squared with
00:11:10.200
the size of the network, because each
00:11:12.133
node is storing the complete matrix of
00:11:14.233
all the nodes and their connections to every other node.
00:11:20.300
And calculating Dijkstra’s algorithm is more computationally
00:11:23.833
complex than just looking at the distances.
00:11:26.566
And so this algorithm, the link state
00:11:29.700
approach to routing, is more memory hungry,
00:11:32.833
and it's more computationally intensive,
00:11:34.666
than distance vector.
00:11:36.966
But it converges much faster.
00:11:40.366
It recovers much faster after errors,
00:11:42.900
after links fail.
00:11:47.933
So we see there’s two approaches.
00:11:50.666
You can use distance vector routing in
00:11:52.800
a network, which is very simple to
00:11:54.466
implement, has low resource overheads in routers,
00:11:58.300
but suffers from very slow convergence.
00:12:00.866
If a link in the network fails,
00:12:02.733
it takes a long time to recover,
00:12:04.733
and packets cannot be delivered, packets to
00:12:08.700
certain destinations will not be correctly
00:12:10.600
delivered during that time.
00:12:13.166
Or you can use the link state
00:12:14.900
approach to routing, which is more complex,
00:12:17.700
requires the routers to have more memory,
00:12:19.700
do more computations, but it's much faster to converge.
00:12:25.066
And, when the network was starting out,
00:12:27.500
distance vector routing was relatively popular because
00:12:31.333
memory was expensive, because machines were slow,
00:12:34.433
and because there were not particularly strict
00:12:37.166
performance bounds on the network.
00:12:40.733
These days, memory is cheap, machines are
00:12:44.233
fast, and so the link state approach
00:12:47.666
is generally preferred, because it converges faster,
00:12:51.333
because the network recovers from failures much faster.
00:12:58.366
So what are the challenges with intradomain routing?
00:13:04.066
Well, I think there’s two.
00:13:07.900
The main one is how does it
00:13:10.733
recover effectively from failures?
00:13:19.300
While network equipment is pretty robust,
00:13:24.166
and pretty reliable,
00:13:27.333
it turns out that construction workers are
00:13:29.700
actually surprisingly good at breaking network cables.
00:13:33.266
And it's surprisingly common that someone digging
00:13:35.966
up the road puts a JCB through
00:13:38.466
the cables and breaks the network.
00:13:42.366
And, similarly, for people operating long distance
00:13:45.933
networks, people operating the international links,
00:13:49.166
it turns out that trawlers are pretty
00:13:50.833
good at damaging undersea cables.
00:13:54.866
And so good network designs need to
00:13:56.866
have multiple paths from source to destination.
00:14:00.233
And they need to be able to
00:14:01.866
fail-over to a different path if a
00:14:03.700
link breaks, and they need to be
00:14:05.233
able to do that relatively quickly.
00:14:09.566
How quickly do they need to notice
00:14:11.400
this? How quickly do they need to
00:14:13.000
switch over to a backup path?
00:14:16.966
Well,
00:14:18.666
it depends on what sort of guarantees you've
00:14:21.833
given your customers.
00:14:25.266
For certain types of networks, it may
00:14:28.466
be that a few minutes downtime is
00:14:30.400
acceptable. Maybe the customers of that operator
00:14:33.533
are okay if the link goes away for half-an-hour.
00:14:36.766
That seems less likely, though.
00:14:40.400
A few seconds failure? That's getting more acceptable.
00:14:46.366
It's noticeable, probably, but it's probably acceptable
00:14:49.533
if the link goes down for 10 seconds, for a lot of users.
00:14:54.000
But if the links are being used
00:14:56.066
to carry real-time traffic,
00:14:58.833
and if you want to have the
00:15:01.533
links, have the failures, recovered in a
00:15:04.033
way that doesn't disrupt that traffic,
00:15:06.300
maybe you're providing the network link for
00:15:09.666
the BBC, maybe you're providing a network
00:15:13.200
link for a service which is carrying
00:15:16.666
production quality video,
00:15:19.233
critical video, for example,
00:15:21.566
and if you want to recover such
00:15:25.566
that it doesn't affect that sort of
00:15:27.300
media, you need to be able to
00:15:28.700
recover within the duration of a single frame.
00:15:31.966
So you need to be able to
00:15:33.100
switch-over to a backup link within maybe
00:15:35.500
a 60th of a second. And have
00:15:37.866
that link, have that backup link,
00:15:40.200
have similar latency to the original so
00:15:42.366
it doesn't cause a
00:15:44.233
significant gap in the packets being received.
00:15:49.600
And so, a lot of the challenge
00:15:51.100
is how quickly can you fail-over,
00:15:52.966
and how quickly do you need to
00:15:54.466
fail-over in the event of a link
00:15:56.333
failure, for your customers?
00:15:59.300
If you’re a network operator, what demands
00:16:02.666
are your customers placing on how quickly it recovers?
00:16:06.533
And different service level guarantees,
00:16:10.700
different service level agreements, obviously affect how
00:16:13.400
much you charge your customers. But also
00:16:15.333
they affect how you organise, and how
00:16:17.233
you design, the network, and what mechanisms
00:16:19.433
you put in place for detecting failures.
00:16:21.933
And how you tune the protocol to
00:16:23.800
handle failures, and to recover from failures.
00:16:28.133
And quite often, this involves techniques
00:16:30.833
to pre-calculate alternative paths, so the system has
00:16:35.900
several different routing tables pre-configured,
00:16:40.266
accounting for different link failures, and can
00:16:43.033
just detect the failure and switch over
00:16:44.933
instantly to a pre-computed alternative, and doesn't
00:16:47.300
have to wait for the information to propagate.
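The idea of pre-computing alternatives can be sketched as follows. This is an illustration of the general approach, not a description of any specific fast-reroute mechanism: the four-node topology, the use of hop count as the metric, and the names `next_hops` and `precompute` are all assumptions.

```python
from collections import deque


def next_hops(links, source):
    """Breadth-first search giving {destination: first hop} from source."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    table = {}
    seen = {source}
    queue = deque([(source, None)])
    while queue:
        node, first = queue.popleft()
        for nbr in neighbours.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                hop = nbr if node == source else first
                table[nbr] = hop
                queue.append((nbr, hop))
    return table


def precompute(links, source):
    """Build the primary table, plus one backup table per possible link failure."""
    primary = next_hops(links, source)
    backups = {
        failed: next_hops([link for link in links if link != failed], source)
        for failed in links
    }
    return primary, backups


# Hypothetical square topology with two paths from A to D.
LINKS = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
primary, backups = precompute(LINKS, "A")

# When the failure of link A-B is detected, switching is a table lookup,
# with no need to wait for new routing information to propagate.
active = backups[("A", "B")]
```

The cost of this approach is memory: one extra routing table per failure case that the operator wants to cover.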
00:16:53.433
And the other issue is that of
00:16:55.100
load balancing. If you have multiple paths
00:16:59.133
through your network,
00:17:01.000
and you're trying to spread the amount
00:17:02.900
of traffic you have to make effective
00:17:04.466
use of those paths, of those different
00:17:06.666
paths through the network,
00:17:08.100
such that not all of the traffic
00:17:09.666
is concentrated on a single link,
00:17:11.633
but it's being spread across the network
00:17:14.333
to avoid congesting a particular link.
00:17:20.866
Then, quite often, the idea is what's
00:17:23.033
called equal-cost multipath. You arrange the network
00:17:26.233
so there's multiple parallel paths on
00:17:29.066
the hot links, on the links that
00:17:31.700
see most of the traffic, and you
00:17:34.000
arrange it so that it alternates the
00:17:35.466
traffic between those paths.
00:17:38.400
But you need to be at least somewhat careful,
00:17:41.733
because protocols like TCP, with the triple-duplicate
00:17:46.433
ACK, are at least slightly sensitive to reordering.
00:17:51.333
If you're sending packets down alternative routes
00:17:55.800
to a destination, and those routes have
00:17:58.066
different delays, and different amounts of traffic
00:18:00.466
on them, the packets can arrive out-of-order.
00:18:03.333
And this is a common source of
00:18:05.000
reordering in the network. And, as we
00:18:07.533
saw when we spoke about TCP,
00:18:09.666
and TCP recovery, it’s insensitive to a
00:18:12.933
small amount of reordering,
00:18:14.833
but if the paths, the different routes
00:18:16.866
through the network, have significantly different latency,
00:18:19.800
by spreading the load, by alternating packets
00:18:23.266
between different paths, you can introduce large
00:18:25.733
amounts of reordering.
00:18:27.333
Which TCP would then interpret as a
00:18:30.333
packet loss, and start retransmitting packets.
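
A rough sketch of this effect, in Python. It assumes a sender that emits one packet per time unit and alternates strictly between paths, and a simplified receiver that sends one cumulative ACK per arriving packet. The delays and the function names `arrival_order` and `max_duplicate_acks` are made up for illustration.

```python
def arrival_order(num_packets, delays, spacing=1.0):
    """Sequence numbers in arrival order when packet i is sent at i * spacing
    and packets alternate round-robin over paths with the given one-way delays."""
    arrivals = [(i * spacing + delays[i % len(delays)], i)
                for i in range(num_packets)]
    return [seq for _, seq in sorted(arrivals)]

def max_duplicate_acks(order):
    """Longest run of duplicate cumulative ACKs a receiver would generate."""
    received, expected, dups, worst = set(), 0, 0, 0
    for seq in order:
        received.add(seq)
        if seq == expected:
            # The hole is filled: the cumulative ACK advances, the run resets.
            while expected in received:
                expected += 1
            dups = 0
        else:
            # Out-of-order arrival: the receiver repeats its previous ACK.
            dups += 1
            worst = max(worst, dups)
    return worst

similar = max_duplicate_acks(arrival_order(12, [1.0, 1.5]))    # similar delays
different = max_duplicate_acks(arrival_order(12, [1.0, 8.5]))  # very different delays

# Three duplicate ACKs are enough to trigger a fast retransmit in TCP.
spurious_retransmit = different >= 3
```

With similar path delays the packets stay in order and no duplicate ACKs are generated. With very different delays the run of duplicate ACKs reaches the triple-duplicate threshold, even though nothing was lost.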
00:18:34.566
And different applications,
00:18:35.966
different protocols, have different
00:18:37.366
degrees of sensitivity to reordering. A lot
00:18:39.866
of the real-time applications don't care at
00:18:42.300
all, as long as the packets arrive
00:18:44.066
before their deadline.
00:18:46.033
But protocols like TCP and QUIC,
00:18:48.166
to at least some extent, do care,
00:18:50.300
so you need to
00:18:51.866
arrange the network, so that if you
00:18:53.466
are balancing traffic between multiple routes it
00:18:56.200
doesn't accidentally cause large amounts of reordering.
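
One common way to get this property, not named in the lecture itself, is to choose the path per flow rather than per packet, by hashing the flow's addresses and ports. The following is a minimal Python sketch under that assumption. The path names, the example addresses, and the function `pick_path` are hypothetical, and real routers use their own hash functions rather than SHA-256.

```python
import hashlib
from collections import Counter

def pick_path(flow, paths):
    """Map a flow (src addr, src port, dst addr, dst port, protocol) to a path.

    Every packet of a given flow hashes to the same path, so packets within
    a flow are not reordered by the load balancing.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return paths[int.from_bytes(digest[:4], "big") % len(paths)]

paths = ["path-0", "path-1"]

# All packets of one flow take the same path...
flow = ("192.0.2.1", 51000, "198.51.100.7", 443, "tcp")
chosen = {pick_path(flow, paths) for _ in range(100)}

# ...while different flows (here, different source ports) are spread out.
spread = Counter(
    pick_path(("192.0.2.1", port, "198.51.100.7", 443, "tcp"), paths)
    for port in range(50000, 50064)
)
```

The trade-off is that load is balanced only across flows: a single very large flow still uses one path.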
00:19:03.766
And that's all I want to say about routing.
00:19:07.966
We spoke a bit about content distribution
00:19:10.400
networks, the idea of locating servers in
00:19:14.800
multiple places in the network in order to
00:19:18.433
host content near to the people who
00:19:22.100
want that content, near to the users
00:19:24.566
of that content, and how that can
00:19:27.566
be achieved using
00:19:29.333
DNS-based tricks to redirect to a local
00:19:33.233
replica, and a little bit about the
00:19:35.233
idea of anycast routing, where the same
00:19:37.166
addresses are advertised from multiple places and
00:19:39.433
the routing system takes care of getting
00:19:41.166
data to the nearest replica.
00:19:45.600
I spoke about interdomain routing, we spoke
00:19:48.033
through the idea of the Border Gateway
00:19:49.966
Protocol, BGP, and how it can deliver
00:19:52.333
data, and the various policy constraints that
00:19:55.133
affect the way BGP works.
00:19:57.833
We spoke about routing security, or the
00:20:01.533
lack thereof, in the Internet, and we
00:20:03.933
finished up by talking a little about intradomain routing.
00:20:09.633
This is the final technical part of the lecture,
00:20:13.166
the final technical lecture in the course.
00:20:16.700
In the next lecture I’ll move on
00:20:18.533
and conclude the course, and talk about
00:20:21.833
some possible future directions, and some ways
00:20:24.533
in which the network is evolving.
Discussion
Lecture 9 discussed content distribution and routing.
Part 1 considered content distribution networks (CDNs). It spoke about
the need to locate proxy caches throughout the network in order to get
low-latency access to content and to distribute load. And it discussed,
briefly, how to implement CDNs using either DNS tricks or anycast
routing.
Part 2 considered inter-domain routing. It spoke about autonomous
systems (ASes) and the AS graph. It considered routing at the edge
of the network, based on default routes; and in the core, the so-called
default-free zone. And it highlighted the role of policy in inter-domain
routing.
Inter-domain routing and routing policy are implemented using the Border
Gateway Protocol. BGP exchanges prefixes and AS path information, to
form a routing table. And the filtered table allows policy to be expressed.
The Gao-Rexford rules were outlined, describing a common set of policies.
The lack of security in inter-domain routing was mentioned, and the
lecture outlined two projects, RPKI and MANRS, that are trying to improve
security and robustness of the routing infrastructure.
Finally, the lecture discussed intra-domain routing, including distance
vector and link state protocols, and some of the challenges in network
operations.
Discussion will focus on the need for, and benefits of, CDNs; on
inter-domain routing and the requirements for policy support and
how this is expressed in BGP; and on intra-domain routing and the
challenges of network operations.