Networked Systems H (2022-2023)
Lecture 7: Real-time and Interactive Applications
Lecture 7 discusses real-time and interactive applications. It talks
about the requirements and constraints for running real-time traffic
on the Internet, and discusses how interactive video conferencing and
streaming video applications are implemented.
Part 1: Real-time Media Over The Internet
This first part of the lecture discusses real-time media running over
the Internet. It outlines what is real-time traffic, and what are the
requirements and constraints when running real-time applications over
the Internet. It discusses the implications of non-elastic traffic, the
effects of packet loss, and the differences between quality of service
and quality of experience.
Slides for part 1
00:00:00.300
In this lecture I want to move
00:00:01.733
on from talking about congestion control,
00:00:03.766
and talk instead about real-time and interactive
00:00:06.400
applications.
00:00:08.900
In this first part, I’ll start by
00:00:11.066
talking about real-time applications in the Internet.
00:00:13.300
I’ll talk about what is real-time traffic,
00:00:15.733
some of the requirements and constraints of
00:00:17.700
that traffic, and how we go about
00:00:20.366
ensuring a good quality of experience,
00:00:22.300
a good quality of service, for these applications.
00:00:25.766
In the later parts, I'll talk about
00:00:27.466
interactive applications, I’ll talk about the conferencing
00:00:30.633
architecture, how we go about building a
00:00:33.533
signalling system to
00:00:35.833
locate the person you wish to have
00:00:37.900
a call with, how we describe conferencing
00:00:40.233
sessions, and how we go about transmitting
00:00:42.600
real-time multimedia traffic over the network.
00:00:44.933
And then, in the final part,
00:00:46.366
I'll move on and talk about streaming
00:00:48.266
applications, and talk about the HTTP adaptive
00:00:51.100
streaming protocols that are used for video
00:00:53.966
on demand applications, such as the iPlayer or Netflix.
00:00:59.700
To start with, though, I want to
00:01:01.533
talk about real-time media over the Internet.
00:01:03.366
I’ll say a little bit about what
00:01:05.166
is real-time traffic,
00:01:06.300
what are the requirements and constraints in
00:01:08.566
order to successfully run real-time traffic,
00:01:10.733
real-time applications over the Internet, and some
00:01:13.800
of the issues around quality of service
00:01:15.633
and user experience, and how to make
00:01:17.233
sure we get a good experience for
00:01:20.066
users of these applications.
00:01:25.433
So, there's actually a long history of
00:01:28.466
running real-time traffic over the Internet.
00:01:32.000
And this includes applications like telephony and
00:01:35.366
voice over IP. It includes Internet radio
00:01:39.100
and streaming audio applications. It includes video
00:01:42.466
conferencing applications such as Zoom.
00:01:46.200
It includes streaming TV, streaming video applications,
00:01:50.500
such as the iPlayer and Netflix.
00:01:53.233
But it also includes gaming,
00:01:55.466
and sensor network applications,
00:01:57.800
and various industrial control systems.
00:02:01.166
And these experiments go back a surprisingly long way.
00:02:05.000
The earliest RFC on the subject of
00:02:07.866
real-time media on the Internet is RFC741,
00:02:11.566
which dates back to the early 1970s
00:02:14.333
and described the Network Voice Protocol.
00:02:16.933
And this was an attempt at running
00:02:19.933
packet voice over the ARPANET, the precursor
00:02:22.700
to the Internet.
00:02:24.466
And there’s been a continual thread of
00:02:26.500
standards developments and experimentation and research in
00:02:29.133
this area.
00:02:31.366
The current set of standards, which we
00:02:33.933
use for telephony applications, for video conferencing
00:02:37.866
applications, dates back to the mid 1990s.
00:02:42.333
It led to a set of protocols,
00:02:44.100
such as SIP, the Session Initiation Protocol,
00:02:47.533
the Session Description Protocol, the Real-time Transport
00:02:50.866
Protocol, and so on.
00:02:52.900
And then there was another burst of
00:02:55.633
developments, in perhaps the mid-2000s or so,
00:02:58.400
with HTTP adaptive streaming, and that led
00:03:01.133
to standards such as the MPEG DASH
00:03:02.766
standards, and applications like Netflix and the iPlayer.
00:03:07.733
I think what's important, though, is to
00:03:10.466
realise that this is not new for
00:03:12.333
the network. We've seen everyone in the
00:03:17.300
world switch to using video conferencing,
00:03:19.000
and everyone in the world
00:03:20.766
switch to using Webex, and Teams,
00:03:23.100
and Zoom, and the like. But these
00:03:25.700
applications actually existed for many years,
00:03:28.333
and these applications have developed, and the
00:03:31.266
network has developed along with these applications,
00:03:34.333
and there's a long history of support
00:03:37.000
for real-time media in the Internet.
00:03:40.133
And you, occasionally, hear people saying that
00:03:42.233
the Internet was not designed for real-time
00:03:44.600
media, and we need to re-architect the
00:03:46.933
Internet to support real-time applications,
00:03:49.500
and to support future multimedia applications.
00:03:53.433
I think that's being somewhat disingenuous with history.
00:03:57.233
The Internet has developed and grown-up with
00:03:59.800
multimedia applications, right from the beginning.
00:04:02.633
And while they've perhaps not been as
00:04:05.133
popular as some of the non-real-time
00:04:08.100
applications, there's been a continual strand of
00:04:10.100
development, and people have been using these
00:04:12.000
applications and architecting the network to support
00:04:14.533
this type of traffic, for many, many years now.
00:04:21.533
So what is real-time traffic? What do
00:04:24.200
we mean by real-time traffic, real-time applications?
00:04:27.200
Well, the defining characteristic is that the
00:04:29.600
traffic has deadlines. The system fails if
00:04:32.233
the data is not delivered by a certain time.
00:04:35.933
And, depending on the type of application,
00:04:38.200
depending on the type of real-time traffic,
00:04:40.300
those can be what's known as hard
00:04:41.766
deadlines or soft deadlines.
00:04:44.533
Now, an example of a hard deadline
00:04:46.666
might be a control system, such as
00:04:49.166
a railway signalling system, where the data
00:04:52.433
that's controlling the signals has to arrive
00:04:55.433
at the signal before the train does,
00:04:57.733
in order to change the signal appropriately.
00:05:01.333
Real-time multimedia applications, on the other hand,
00:05:05.600
are very much in the realm
00:05:07.033
of soft real-time applications,
00:05:08.900
where you have to deliver the data
00:05:10.666
by a certain deadline in order to
00:05:12.233
get smooth playback of the media.
00:05:14.366
In order to get a glitch-free playback
00:05:17.733
of the audio, in order to get smooth video playback.
00:05:23.066
And these applications tend to have to
00:05:25.966
deliver data, perhaps every 50th of a
00:05:28.100
second for audio, maybe every 30 times
00:05:31.700
a second, 60 times a second, to get smooth video.
00:05:36.733
And it's important to realise that no
00:05:38.966
system is ever 100% reliable at meeting its deadlines.
00:05:43.300
It's impossible to engineer a system that never
00:05:46.066
misses a deadline. So we always think about
00:05:49.033
how we can arrange these systems,
00:05:51.266
such that some appropriate proportion of the deadlines are met.
00:05:56.133
And what that proportion is, depends on
00:05:58.333
what system we're building.
00:06:01.166
If it's a railway signalling system,
00:06:03.166
we want the probability that the network
00:06:05.766
fails to deliver the message to be
00:06:08.166
low enough that it's more likely that
00:06:10.466
the train will fail, or the actual
00:06:12.466
physical signal will fail, than the probability
00:06:15.133
of the network failing to deliver the message in time.
00:06:19.600
If it's a video conferencing application,
00:06:21.733
or video streaming application, the risks are
00:06:25.200
obviously a lot lower, and so you
00:06:27.133
can accept a higher probability of failure.
00:06:29.633
Although again, it depends on what the
00:06:31.533
application’s being used for. A video conferencing
00:06:35.700
system being used
00:06:37.500
for a group of friends, just chatting,
00:06:40.833
obviously has different reliability constraints, different
00:06:44.833
degrees of strictness of its deadlines, than one
00:06:48.166
being used for remote control of a
00:06:50.566
drone, or one being used for remote surgery, for example.
00:06:57.033
And the different systems can have different
00:06:59.900
types of deadline.
00:07:01.866
It may be that various types of
00:07:04.000
data have to be delivered before a certain time.
00:07:07.233
You have to deliver the control information
00:07:11.033
to the railway signal before the train
00:07:12.766
gets there. So you've got an absolute deadline on the data.
00:07:17.566
Or it may be that the data has
00:07:19.833
to be delivered periodically, relative to the
00:07:22.633
previous deadline. The video frames have to
00:07:25.300
be delivered every 30th of a second,
00:07:27.933
or every 60th of a second.
00:07:30.300
And different applications have different constraints.
00:07:33.000
Different bounds on the latency, on the
00:07:36.133
absolute deadline. But also on the relative
00:07:38.066
deadline, on the predictability of the timing.
00:07:42.466
It’s important to remember that we're not
00:07:44.933
necessarily talking about high performance for these applications.
00:07:49.033
If we're building a phone system that
00:07:51.766
runs over the Internet, for example,
00:07:53.633
the amount of data we're sending is
00:07:55.566
probably only a few kilobits per second.
00:07:58.300
But it requires predictable timing.
00:08:01.133
The packets containing the speech data have
00:08:04.033
to be delivered with
00:08:06.466
at least approximately predictable, approximately equal,
00:08:10.066
spacing, in order that we can correct
00:08:12.800
the timing and play out the speech smoothly.
00:08:17.333
And yes, some types of applications are
00:08:19.633
quite high bandwidth. If we're trying to deliver
00:08:23.200
studio quality movies, or if we're trying
00:08:25.666
to deliver holographic conferencing, then we need
00:08:28.066
tens, or possibly hundreds, of megabits.
00:08:30.500
But they're not necessarily high performance.
00:08:32.933
The key thing is predictability.
00:08:38.166
So what are the requirements for these applications?
00:08:42.100
Well, to a large extent, it depends
00:08:44.400
on whether you're building a streaming application
00:08:46.400
or an interactive application.
00:08:49.800
For video-on-demand applications, like Netflix or YouTube
00:08:53.766
or the iPlayer, for example, there's not
00:08:56.400
really any absolute deadline, in most cases.
00:08:59.866
If you're watching a movie, it's okay
00:09:02.400
if it takes 5, 10, 20 seconds
00:09:04.966
to start playing, after you click the play button,
00:09:08.233
provided the playback is smooth once it has started.
00:09:13.033
And maybe if it's a short thing,
00:09:14.433
maybe it's a YouTube video that's only
00:09:16.200
a couple of minutes, then you want it to start quicker.
00:09:19.100
But again, it doesn't have to start
00:09:21.600
within milliseconds of you pressing the play
00:09:23.500
button. A second or two of latency
00:09:25.933
is acceptable, provided the playback is smooth
00:09:29.500
once it starts.
00:09:31.866
Now, obviously live applications, the deadlines may
00:09:34.633
be different. Clearly if you're watching
00:09:37.900
a live sporting event on YouTube or
00:09:42.200
the iPlayer, for example, you don't
00:09:44.200
want it to be too far behind
00:09:45.566
the same event being watched on
00:09:47.133
broadcast TV. But, for these applications,
00:09:50.333
typically it's the relative deadlines, and smooth
00:09:53.266
playback once the application has started,
00:09:55.500
rather than the absolute deadline that matters.
00:09:59.666
The amount of bits per second it
00:10:02.233
needs depends to a large extent on the quality.
00:10:06.166
And, obviously, higher quality is better,
00:10:07.866
a higher bit rate is better.
00:10:09.966
But, to some extent, there's a limit
00:10:11.766
on this. And it's a limit depending
00:10:14.033
on the camera, on the resolution of
00:10:16.033
the camera, and the frame rate of
00:10:17.700
the camera, and the size of the display, and so on.
00:10:21.833
And you don't necessarily need many tens
00:10:25.433
or hundreds of megabits. You can get
00:10:28.266
very good quality video on single digit
00:10:31.333
numbers of megabits per second. And even
00:10:35.000
production quality, studio quality, is only hundreds
00:10:38.933
of megabits per second. So there’s an
00:10:40.866
upper bound on the rate at
00:10:43.100
which these applications
00:10:45.500
can typically send: the point where you hit the
00:10:47.900
limits of the capture device, or hit
00:10:49.833
the limits of the display device.
00:10:52.500
And, quite often, for a lot of these applications,
00:10:54.800
predictability matters more than absolute quality.
00:10:59.100
It's often less annoying to have
00:11:01.500
a movie, which is a consistent quality,
00:11:04.333
than a movie which is occasionally very
00:11:06.566
good quality, but keeps dropping down to
00:11:09.266
a lower resolution. So predictability is often
00:11:11.900
what's critical.
00:11:14.866
And, for a given bit rate,
00:11:16.566
you're also trading off between frame rate
00:11:18.366
and quality. Do you want smooth motion,
00:11:21.333
or do you want very fine detail?
00:11:24.633
And, if you want both smooth motion
00:11:28.533
and fine detail, you have to increase
00:11:30.633
the rate. But you can trade-off between
00:11:32.266
them, at a given bit rate, by choosing a different quality level.
00:11:38.166
For interactive applications, the requirements are a
00:11:40.533
bit different. They depend very much on
00:11:42.566
human perception, and the requirements to be
00:11:45.266
able to have a smooth conversation.
00:11:48.933
For phone calls, for video conferencing applications,
00:11:53.733
people have been doing studies of this
00:11:56.333
sort of thing for quite a while.
00:11:58.266
The typical bounds you hear expressed are
00:12:01.266
one-way mouth-to-ear delay, so the delay from
00:12:05.833
me talking, to it going
00:12:08.866
through the air to the microphone,
00:12:10.933
being captured, compressed, transmitted over the network,
00:12:13.933
decompressed, played-out, back from the speakers to
00:12:16.900
your ear, should be no more than about 150
00:12:20.066
milliseconds. And, if it gets more than
00:12:22.533
that, it starts getting a bit awkward
00:12:24.466
for the conversations. People start talking over
00:12:26.833
each other, and it gets to be
00:12:28.366
a bit difficult for a conversation.
00:12:30.733
And the ITU-T Recommendation G.114 talks about
00:12:34.733
this, and about the constraints there, in a lot of detail.
00:12:40.000
And, in terms of lip sync,
00:12:43.266
people start noticing if the audio is
00:12:46.400
more than about 15 milliseconds ahead,
00:12:49.033
or more than about 45 milliseconds behind
00:12:51.266
the video. And it seems that people
00:12:53.566
notice more often if the audio is
00:12:55.200
ahead of the video, than if it's behind the video.
00:12:58.166
So this gives quite strict bounds for
00:13:01.600
overall latency across the network, and for
00:13:04.500
the variation in latency between audio and video streams.
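As a concrete illustration of those lip-sync numbers, here is a minimal sketch (not from the lecture; the names and the sign convention are illustrative) of the check a receiver might apply to the presentation times of matching audio and video samples:

```python
AUDIO_AHEAD_LIMIT_MS = 15    # viewers notice audio more than ~15 ms ahead of video
AUDIO_BEHIND_LIMIT_MS = 45   # ...or more than ~45 ms behind it

def lip_sync_ok(audio_play_ms, video_play_ms):
    """True if the audio/video skew is within the perceptual bounds above."""
    skew = video_play_ms - audio_play_ms    # positive => audio plays ahead of video
    return -AUDIO_BEHIND_LIMIT_MS <= skew <= AUDIO_AHEAD_LIMIT_MS
```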
00:13:09.166
And, obviously, this depends what you're doing.
00:13:11.866
If you're having an interactive conversation,
00:13:14.366
the bounds are tighter than if it's
00:13:16.533
more of a lecture style, where it's
00:13:18.266
mostly unidirectional, with more structured pauses and
00:13:22.866
more structured questioning. That type of application
00:13:25.400
can tolerate higher latency.
00:13:28.100
Equally, if you're trying to do,
00:13:30.766
for example, a distributed music performance,
00:13:33.633
then you need much lower,
00:13:36.000
much lower latency.
00:13:38.300
And, if you think about something like
00:13:40.300
an orchestra, and you measure the size
00:13:42.300
of the orchestra, and you think about
00:13:44.300
the speed of sound, you get about
00:13:46.300
15 milliseconds for the sound to go
00:13:48.300
from one side of the orchestra to another.
00:13:51.033
So, that sort of level of latency
00:13:54.833
is clearly acceptable, but once it gets
00:13:57.000
more than 20 or 30 milliseconds,
00:13:59.300
it gets very difficult for people to
00:14:01.400
play in a synchronised way.
00:14:04.266
And if you've seen, if you’ve ever
00:14:08.533
tried to play music over a Zoom
00:14:11.000
call, you'll realise it just doesn't work,
00:14:13.000
because the latency is too high for that,
00:14:16.200
if you're trying to
00:14:18.566
play music collaboratively on a video conference.
00:14:25.500
So that gives you some bounds for latency.
00:14:29.566
What we saw in some of the
00:14:32.133
previous lectures, is that the network is
00:14:33.900
very much a best effort network,
00:14:35.700
and it doesn't guarantee the timing.
00:14:37.566
The amount of latency for data to
00:14:41.333
traverse the network very much depends on
00:14:44.700
the propagation delay of the path,
00:14:48.100
and the amount of queuing, and on
00:14:49.933
the path taken, and it's not predictable at all.
00:14:53.733
If we look at the figure on
00:14:56.233
the left, here, it's showing the variation
00:14:58.800
in round trip time for a particular
00:15:00.500
path. And we see that most of
00:15:02.133
it is bundled up, and there’s a
00:15:04.366
fairly consistent bound, but there are occasional
00:15:06.866
spikes where the packets take a much longer time to arrive.
00:15:11.866
And in some networks these effects can
00:15:14.833
be quite significant, they can take quite
00:15:16.800
a long time for data to arrive.
00:15:20.166
The consequence of all this, is that
00:15:22.300
real-time applications need to be loss tolerant.
00:15:25.166
If you're building an application to be
00:15:27.466
reliable, it has to retransmit data,
00:15:29.733
and that may or may not arrive
00:15:31.766
in time. So you want to build
00:15:33.466
it to be unreliable, and not to
00:15:35.633
necessarily retransmit the data.
00:15:37.800
You also want it to be able
00:15:39.466
to cope with the fact that some
00:15:40.766
packets may be delayed, and be able
00:15:42.800
to proceed even if those packets arrive too late.
00:15:45.600
So it needs to be able to
00:15:47.266
compensate for, to tolerate, loss, whether that's
00:15:50.100
just data which is never going to
00:15:52.166
arrive, or data that's just going to arrive late.
00:15:55.700
And, obviously, there's a bound on how
00:15:57.866
much loss you can conceal, how much
00:16:01.100
loss you can tolerate before the quality goes down.
00:16:04.366
And, the challenge in building these applications is to,
00:16:09.500
partially, engineer the network such that it
00:16:12.266
doesn't lose many packets, such that the loss
00:16:14.433
rate, the timing variation, is low enough
00:16:16.633
that the application is going to work.
00:16:18.733
But, also, it’s in building the application
00:16:21.900
to be tolerant to the loss,
00:16:23.266
in being able to conceal the effects of lost packets.
00:16:31.166
The real-time nature of the traffic also
00:16:33.933
affects the way congestion control works,
00:16:36.566
it affects the way data is delivered across the network.
00:16:40.800
As we saw in some of the
00:16:42.600
previous lectures, when we were talking about
00:16:44.533
TCP congestion control,
00:16:46.600
congestion control adapts the speed of transmission
00:16:49.466
to match the available capacity over the network.
00:16:53.033
If the network has more capacity, it sends faster.
00:16:56.233
If the network gets overloaded, it sends slower.
00:16:59.800
And the transfers are elastic.
00:17:02.666
If you're downloading a web page, if you're downloading
00:17:06.133
a large file, faster is better,
00:17:08.000
but it doesn't really matter what rate
00:17:10.433
the congestion control will pick.
00:17:12.433
You want it to come down as fast as
00:17:14.233
it can, and the application can adapt.
00:17:18.233
Real-time traffic is much less elastic.
00:17:22.000
It’s got a minimum rate, there’s a
00:17:24.266
certain quality level, a certain bit rate,
00:17:26.700
below which the media is just unintelligible.
00:17:29.600
If you're transmitting speech, you need a
00:17:31.933
certain number of kilobits per second.
00:17:33.933
Otherwise, what comes out is just not intelligible speech.
00:17:37.500
If you're sending video, you need a
00:17:39.733
certain bit rate, otherwise you can't get
00:17:41.733
full motion video over it; the quality
00:17:44.200
is just too low, the frame rate
00:17:46.066
is just too low, and it's no longer video.
00:17:49.366
Similarly, though, these applications have a maximum rate.
00:17:54.166
If you're sending speech data, if you're
00:17:56.400
sending music, it depends on the capture
00:17:58.733
rate, the sampling rate.
00:18:01.100
And, even for the highest quality
00:18:03.833
audio, you're probably not looking at more
00:18:06.066
than a megabit, a couple of megabits,
00:18:08.866
for CD quality, surround sound, media.
00:18:12.566
And again, for video, it depends on
00:18:14.533
the type of camera, the frame rates,
00:18:17.333
the resolution, and so on. Again,
00:18:20.833
a small number of megabits, tens of
00:18:23.800
megabits, in the most extreme cases hundreds
00:18:26.066
of megabits, and you get an upper bound on the sending rate.
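To illustrate this inelasticity, the sketch below shows how a sender might clamp the rate offered by congestion control to the band in which the media is actually usable. This is not from the lecture, and the rate limits are made-up example values for a video call:

```python
MIN_USABLE_RATE = 200_000       # bits/s: below this the video is unintelligible
MAX_USEFUL_RATE = 8_000_000     # bits/s: above this the camera/codec gains nothing

def choose_sending_rate(congestion_controlled_rate):
    """Clamp the rate allowed by congestion control to the codec's usable band.
    Returns None if even the minimum usable rate cannot be sustained."""
    if congestion_controlled_rate < MIN_USABLE_RATE:
        return None     # better to stop, or drop to audio only, than send garbage
    return min(congestion_controlled_rate, MAX_USEFUL_RATE)
```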
00:18:31.833
So, real-time applications can't use
00:18:34.833
infinite amounts of traffic.
00:18:37.533
Unlike TCP, they're constrained by the rate
00:18:41.433
at which the media is captured.
00:18:43.500
But also, they can't go arbitrarily slowly.
00:18:46.466
This affects the way we have to
00:18:48.133
send that data, because we have less
00:18:49.766
flexibility in the rate at which these
00:18:51.300
applications can send.
00:18:56.866
And we need to think to what extent
00:19:00.066
it's possible, or desirable, to reserve capacity
00:19:03.666
for these applications.
00:19:08.200
There are certainly ways one can engineer
00:19:10.900
a network, such that it guarantees that
00:19:13.500
a certain amount of data is available.
00:19:16.533
Such that it guarantees that, for example,
00:19:19.000
a five megabit per second
00:19:22.333
channel is available to deliver video.
00:19:26.900
And, if the application is very critical,
00:19:29.100
maybe that makes sense.
00:19:31.000
If you're doing remote surgery, you probably
00:19:34.633
do want to guarantee the capacity is
00:19:36.733
there for the video.
00:19:38.766
But, for a lot of applications,
00:19:40.466
it's not clear it’s needed.
00:19:42.933
So we have protocols, such as the
00:19:45.966
Resource Reservation Protocol, RSVP,
00:19:49.700
such as the Multi-Protocol Label Switching protocol
00:19:53.666
for orchestrating link-layer networks, such as the
00:19:58.233
idea of network slicing in 5G networks,
00:20:01.133
that let us set up resource reservations.
00:20:05.866
But there are downsides.
00:20:08.966
This adds complexity. It adds signalling.
00:20:12.933
You need to somehow signal to the
00:20:14.966
network that you need to set up
00:20:16.933
this reservation, tell it what resources the
00:20:19.100
traffic requires.
00:20:20.700
And, somehow, demonstrate to the network that
00:20:22.933
the sender is allowed to use those
00:20:24.733
resources, and is allowed to reserve that
00:20:26.700
capacity, and can pay for it.
00:20:29.433
So you need authentication, authorisation, and accounting
00:20:32.733
mechanisms, to make sure that the people
00:20:34.933
reserving those resources are actually allowed to
00:20:37.100
do so, and have paid for them.
00:20:41.066
And in the end, if the network
00:20:43.633
has capacity, this doesn't actually help you.
00:20:46.366
If the operators designed the network so
00:20:48.766
it has enough capacity for all the
00:20:50.133
traffic it's delivering, the reservation doesn't help.
00:20:55.400
The reservations only help when the network
00:20:57.966
doesn't have the capacity.
00:21:00.066
They’re a way of allowing the operator,
00:21:02.366
who hasn't invested in sufficient network resources,
00:21:04.866
to discriminate in favour of the customers
00:21:07.300
who are willing to pay extra.
00:21:09.800
To discriminate so that those customers who
00:21:12.033
are willing to pay can get good
00:21:13.533
quality, whereas those who don't pay extra,
00:21:16.833
just get a system which doesn't work well.
00:21:21.000
So, it’s not clear that resource reservations
00:21:23.533
necessarily add benefit.
00:21:26.933
There are certainly applications where they do.
00:21:29.366
But, for many applications, the cost of
00:21:32.066
reserving the resources to get guaranteed quality,
00:21:35.666
the cost of building the accounting system,
00:21:38.100
the complexity of building the resource reservation
00:21:40.300
system, it's often easier, and cheaper,
00:21:43.000
just to buy more capacity, such that
00:21:45.166
everything works and there's no need for reservations.
00:21:48.700
And this is one of those areas
00:21:50.666
where the Internet, perhaps, does things differently
00:21:52.900
to a lot of other
00:21:54.266
networks. Where the Internet is very much
00:21:56.633
best efforts and unreserved capacity.
00:21:59.300
And it's an area of tension,
00:22:01.166
because a lot of the network operators
00:22:03.000
would like to be able to sell
00:22:05.166
resource reservations, would like to be able
00:22:08.300
to charge you extra to guarantee that
00:22:09.933
your Zoom calls will work.
00:22:13.500
It’s a different model. It's not clear,
00:22:16.366
to me, whether we want a network
00:22:19.000
that provides those guarantees,
00:22:22.633
but requires charging, and authentication,
00:22:25.833
and authorisation,
00:22:26.966
and knowing who's sending what traffic,
00:22:28.733
so you can tell if they've paid
00:22:30.633
for the appropriate quality.
00:22:32.500
Or, whether it's better just for everyone
00:22:34.700
to be sending, and we just architect
00:22:36.900
the networks so that it's good enough
00:22:39.066
for most things, and accept occasional quality lapses.
00:22:47.200
And, ultimately, it comes down to what's
00:22:49.266
known as quality of experience.
00:22:51.900
Does the application actually meet the users'
00:22:54.266
needs? Does it allow them to communicate
00:22:56.733
effectively? Does it provide compelling entertainment? Does
00:22:59.366
it provide good enough video quality?
00:23:03.400
It’s very much not a one dimensional metric.
00:23:10.233
When you ask the user
00:23:12.733
“Does it sound good?”, you get a different
00:23:18.100
view on the quality of the music,
00:23:20.833
or the quality of the speech,
00:23:23.100
than if you ask “can you understand it?”
00:23:26.633
The question you ask matters. It depends
00:23:30.300
what aspect of user experience are you
00:23:32.300
evaluating. And it depends on the task
00:23:35.666
people are doing. The quality people need
00:23:38.133
for remote surgery is different to the
00:23:40.433
quality people need for a remote lecture, for example.
00:23:45.866
And some aspects of this user experience
00:23:48.133
you can estimate from looking at technical
00:23:50.166
metrics such as packet loss and latency.
00:23:53.633
And the ITU has something called the
00:23:56.133
E-model, which gives a really good estimate of subjective
00:23:59.133
speech quality, based on looking
00:24:01.533
at the latency, and the timing variation,
00:24:03.700
and the packet loss of speech data.
00:24:06.166
But, especially when you start talking about
00:24:08.233
video, and especially when you start talking about
00:24:12.166
particular applications, it's often very subjective,
00:24:15.066
and very task dependent. And you need
00:24:17.800
to actually build the system, try it
00:24:19.366
out, and ask people “So how well did it work?”
00:24:21.866
“Does it sound good?” “Can you understand
00:24:23.966
it?” “Did you like it?” You need
00:24:26.266
to do user trials to understand the
00:24:28.933
quality of the experience of the users.
00:24:34.966
So that concludes the first part.
00:24:37.100
I’ve spoken a bit about what is
00:24:39.200
real-time traffic, some of the requirements and
00:24:41.233
constraints to be able to run real-time
00:24:43.166
applications over the network, and some of
00:24:45.733
the issues around quality of service
00:24:47.433
and the user experience.
00:24:49.400
In the next part, we’ll move on
00:24:50.900
to start talking about how you build
00:24:52.766
interactive applications running over the Internet.
Part 2: Interactive Applications (data plane)
The second part discusses interactive applications. It briefly reviews
the history of real-time applications running over the Internet, and
the requirements on timing, data transfer rate, and reliability to be
able to successfully run audio/visual conferencing applications over
the network. It outlines the structure of multimedia conferencing
applications, and the protocol stack used to support such applications.
RTP media transport, media timing recovery, application-level framing,
and forward error correction are discussed, outlining how multimedia
applications are implemented.
Slides for part 2
00:00:00.133
In this part I'd like to talk
00:00:01.533
about interactive conferencing applications.
00:00:04.033
I’ll talk a little bit about what is the structure
00:00:06.266
of video conferencing systems,
00:00:07.933
some of the protocols for multimedia conferencing,
00:00:10.400
for video conferencing, and talk a bit
00:00:12.666
about how we do multimedia transport over the Internet.
00:00:17.466
So what do we mean by interactive conferencing applications?
00:00:21.366
Well I'm talking about applications such as
00:00:24.400
telephony, such as voice over IP,
00:00:27.033
and such as video conferencing.
00:00:29.633
These are applications like the university's telephone
00:00:32.366
system, like Skype, like Zoom or Webex
00:00:36.733
or Microsoft Teams, that we're all spending
00:00:39.000
far too much time on these days.
00:00:42.266
And this is an area which has
00:00:44.433
actually been developing in the Internet community
00:00:46.800
for a surprisingly long amount of time.
00:00:50.033
As we discussed in the first part
00:00:51.900
of the lecture, the early standards,
00:00:54.500
the early work here, date back to
00:00:57.633
the early 1970s.
00:00:59.800
And the first Internet RFC on this
00:01:02.000
subject, the Network Voice Protocol, was actually
00:01:04.866
published in 1976. The standards we use
00:01:09.866
today for video conferencing applications, for telephony,
00:01:13.733
for voice over IP, date from the
00:01:16.100
early- and mid-1990s initially.
00:01:20.266
There were a set of applications,
00:01:22.600
such as CU-SeeMe, which you see at
00:01:25.233
the bottom right of the slide here,
00:01:27.966
a set of applications called the Mbone
00:01:30.700
conferencing tools, and the picture on the
00:01:33.533
top right of the slide is an
00:01:36.200
application I was involved in developing in
00:01:38.900
the late 1990s in this space,
00:01:41.300
which prototyped a lot of these standard
00:01:43.566
protocols. They led to the development of
00:01:47.000
a set of standards, such as the
00:01:48.866
Session Description Protocol, SDP, the Session Initiation
00:01:51.866
Protocol, SIP, and the Real-time Transport Protocol,
00:01:56.233
RTP, which formed the basis of these
00:01:58.700
modern video conferencing applications.
00:02:02.900
These got pretty widely adopted. The ITU
00:02:07.066
adopted them as the basis for it
00:02:08.933
H.323 series of recommendations
00:02:11.333
for video conferencing systems.
00:02:13.466
A lot of commercial telephony products are
00:02:16.633
built using them. And the Third Generation
00:02:20.266
Partnership Project, 3GPP, adopted them as the
00:02:23.133
basis for the current set of mobile
00:02:25.066
telephone standards. So, if you make a
00:02:28.500
phone call, a mobile phone call,
00:02:31.000
you’re using the descendants of these standards.
00:02:35.333
And also, more recently, the WebRTC browser-based
00:02:39.666
conferencing system again incorporated these protocols into
00:02:43.666
the browser, building on SDP, and RTP,
00:02:47.366
and the same set of conferencing standards
00:02:49.833
which were prototyped in the tools you
00:02:52.300
see on the right of the slide.
00:02:58.533
Again, as we discussed in the previous
00:03:01.066
part of the lecture, if you're building interactive
00:03:03.500
conferencing applications,
00:03:05.166
you've got fairly tight bounds on latency.
00:03:10.366
The one-way delay, from mouth to ear,
00:03:13.900
if you want a sensible interactive conversation,
00:03:17.400
has to be no more than somewhere
00:03:19.766
around 150 milliseconds.
00:03:22.400
And if you're building a video conference,
00:03:24.166
you want reasonably tight lip sync between
00:03:26.500
the audio and video,
00:03:28.200
with the audio no more than around
00:03:30.966
15 milliseconds ahead of the video,
00:03:33.800
and no more than about 45 milliseconds behind.
00:03:37.633
Now, the good thing is that these
00:03:40.600
applications tend to degrade relatively gracefully.
00:03:43.966
The bounds, 150 milliseconds end-to-end latency;
00:03:49.233
the 15 milliseconds ahead, 45 milliseconds behind,
00:03:53.333
for lip sync, are not strict bounds.
00:03:56.333
Shorter is better, but
00:03:59.833
if the latency, if the offset,
00:04:01.500
exceeds those values, it gradually starts to
00:04:04.966
become less-and-less usable, people start talking over
00:04:08.400
each other, people start noticing the
00:04:10.966
the lack of lip-sync, but nothing
00:04:13.100
fails catastrophically. But that's the sort of
00:04:16.366
values we're looking at: end-to-end delay in
00:04:19.600
the 100 to 150 millisecond range, and audio-video
00:04:23.800
synchronised to within a few tens of milliseconds.
00:04:28.233
The data rates we’re sending depend,
00:04:30.833
very much, on what type of media
00:04:32.600
you're sending, and what codec, what compression
00:04:35.100
scheme you use.
00:04:38.133
For sending speech, the speech compression typically
00:04:42.100
takes portions of speech data that are
00:04:45.666
around 20 milliseconds in duration, about 1/50th
00:04:48.800
of a second in duration, and every
00:04:51.100
20 milliseconds, every 1/50th second, it grabs the
00:04:54.566
next chunk of audio that's been received,
00:04:57.466
compresses it, and transmits it across the network.
00:05:00.800
And this is decoded at the receiver,
00:05:03.433
decompressed, and played out on the same sort of timeframe.
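The following sketch shows the shape of that send loop. It is illustrative rather than taken from the lecture: the capture, encode, and send callables are hypothetical stand-ins for the sound card API, the speech codec, and the RTP/UDP packetisation code.

```python
import time

FRAME_DURATION = 0.020   # 20 ms of audio per packet, i.e. 50 packets per second

def audio_send_loop(capture_frame, encode, send_packet, num_frames=250):
    """Capture, compress, and transmit one frame of audio every 20 ms."""
    next_deadline = time.monotonic()
    for _ in range(num_frames):                  # ~5 seconds of audio
        samples = capture_frame()                # grab the next 20 ms of audio
        send_packet(encode(samples))             # compress, packetise, transmit
        next_deadline += FRAME_DURATION
        time.sleep(max(0.0, next_deadline - time.monotonic()))
```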
00:05:08.333
The data rates depends on the quality
00:05:11.300
level you want. It's possible to send
00:05:14.066
speech with something on the order of
00:05:17.033
10-15 kilobits per second of speech data,
00:05:19.700
although it's typically sent at a
00:05:22.633
somewhat higher quality, maybe a couple of
00:05:24.933
hundred kilobits, to get high quality speech
00:05:29.933
that sounds pleasant, but it can go
00:05:34.700
to very low bit rates if necessary.
00:05:39.966
And a lot of these applications vary
00:05:42.333
the quality a little, based on what's
00:05:44.466
going on. They encode higher quality when
00:05:47.333
it's clear that the person is talking,
00:05:49.400
and they send packets less often,
00:05:51.566
and encoded with lower bit rates,
00:05:53.300
when it's clear there's background noise.
00:05:55.600
If you're sending good quality music,
00:05:57.666
you need more bits per second than if you're sending speech.
00:06:02.000
For video, the frame rates, the resolution,
00:06:06.500
very much depend on the camera,
00:06:08.600
on the amount of processor time you
00:06:10.533
have available to do the compression,
00:06:12.533
whether you've got hardware accelerated video compression
00:06:15.166
or not. And on the video compression
00:06:18.533
algorithm, the video codec you're using.
00:06:22.166
Frame rates somewhere in the order of
00:06:24.833
25 to 60 frames per second are common.
00:06:28.533
Video resolution varies from postage stamp sized,
00:06:32.700
up to full screen, HD, or 4k video.
00:06:37.266
You can get good quality video with
00:06:40.133
codecs like H.264, at around the two
00:06:43.466
to four megabits per second range.
00:06:46.066
Obviously, if you're going up to
00:06:48.500
full-motion, 4k, movie encoding, you'll need higher
00:06:52.500
rates than that. But, even then,
00:06:54.766
you’re probably not looking at more than
00:06:56.433
four, eight, ten megabits per second.
00:07:03.166
So, what you see is that these
00:07:04.466
applications have reasonably demanding latency bounds,
00:07:08.100
and reasonably high, but not excessively high,
00:07:11.066
bit-rate bounds. Two to four megabits,
00:07:14.100
even eight megabits, is generally achievable on
00:07:17.233
most residential, home network, connections.
00:07:22.366
And 150 milliseconds end-to-end latency
00:07:25.700
is generally achievable without too much difficulty
00:07:31.566
as long as you're not trying to
00:07:33.900
go transatlantic or transpacific.
00:07:39.566
In terms of reliability requirements,
00:07:42.633
speech data is actually surprisingly loss tolerant.
00:07:46.233
It's relatively straightforward to build systems
00:07:49.733
which can conceal 10-20% random packet loss,
00:07:53.333
without any noticeable reduction in speech quality.
00:07:56.933
And, with the addition of forward error
00:07:59.166
correction, with error correcting codes, it’s quite
00:08:01.600
possible to build systems that work with
00:08:05.100
maybe 50% of the packets being lost.
00:08:08.000
Bursts of packet loss are harder to
00:08:11.233
conceal, and tend to result in audible
00:08:14.266
glitches in the speech playback, but they're
00:08:17.633
relatively uncommon in the network.
00:08:20.033
Video packet loss is somewhat harder to conceal.
00:08:23.933
With streaming video applications, if you're sending
00:08:26.866
a movie, for example, you can rely
00:08:29.400
on the occasional scene changes to
00:08:31.300
reset the decoder state, and to recover
00:08:33.566
from the effects of any loss.
00:08:35.733
With video conferencing, there aren’t typically scene
00:08:38.633
changes, so you have to do a rolling repair,
00:08:41.566
a rolling retransmission, or some form of
00:08:44.300
forward error correction to repair the losses.
00:08:46.500
So video tends to be more sensitive
00:08:49.033
to packet loss than the audio.
00:08:50.866
Equally, though, people are less sensitive to
00:08:53.266
disruptions in video quality than they are
00:08:55.300
to disruptions in the audio quality.
00:08:59.666
So how is one of these interactive
00:09:01.400
conferencing applications structured?
00:09:04.366
What does the media transmission path look like?
00:09:08.000
Well, you start with some sort of
00:09:09.800
capture device. Maybe that's a microphone,
00:09:12.600
or maybe it's a camera, depending whether
00:09:15.000
it's an audio or a video application.
00:09:17.533
The media data is captured from that
00:09:19.466
device, and goes into some sort of
00:09:21.033
input buffer, frame at a time.
00:09:23.266
If it's video, it's each video frame
00:09:25.300
at a time. If it's audio,
00:09:27.200
it's frames of, typically, 20 milliseconds worth
00:09:30.066
of speech or music data at a time.
00:09:33.533
Each frame is taken from that input
00:09:35.866
buffer, and passed to the codec.
00:09:38.766
The codec compresses the frames of media,
00:09:41.233
one by one. And, if they’re too
00:09:43.333
large to fit into an individual packet,
00:09:45.500
it fragments them into multiple packets.
00:09:49.200
Each of those fragments of a media
00:09:52.166
frame is transmitted by putting it inside
00:09:55.466
an RTP packet, a Real-time Transport Protocol
00:09:58.533
packet, which is put inside a UDP
00:10:00.966
packet, and sent on to the network.
00:10:04.066
The RTP packet header adds a sequence
00:10:07.400
number, so the packets can be put
00:10:09.400
back into the right order.
00:10:10.900
It adds timing information, so the receiver
00:10:13.700
can reconstruct the timing accurately. And it
00:10:16.233
adds some source identification, so it knows
00:10:18.900
who's sending the media, and some payload
00:10:21.233
identification information, so it knows which compression
00:10:24.033
algorithm, which codec, was used to encode the media.
00:10:27.766
So the media is captured, compressed,
00:10:30.566
fragmented, packetised, and transmitted over the network.
00:10:37.166
On the receiving side, the UDP packets
00:10:40.700
containing the RTP data arrive.
00:10:45.366
And the receiving application extracts the RTP
00:10:48.933
data from the UDP packets, and looks
00:10:51.700
at the source identification information in there.
00:10:54.333
And then it separates the packets out
00:10:56.266
according to who sent them.
00:10:58.366
For each sender,
00:11:01.266
the data goes through a channel coder,
00:11:03.733
which repairs any loss, using a forward
00:11:07.200
error correction scheme
00:11:09.466
if one was used. And we'll talk
00:11:12.066
about that later, but that's where additional
00:11:14.033
packets are sent along with the media,
00:11:15.966
to allow some sort of repair without needing retransmission.
00:11:18.800
Then it goes into what's called a play-out buffer.
00:11:22.066
The play-out buffer is enough buffering to
00:11:24.733
allow the timing, and the variation in
00:11:26.666
timing, to be reconstructed,
00:11:30.733
such that the packets are put back
00:11:33.766
into the right order, and such that
00:11:36.366
they're delivered to the codec, to the decoder,
00:11:40.100
at the right time, and with
00:11:42.866
the correct timing behaviour.
00:11:44.966
The decoder then decompresses the media,
00:11:49.633
conceals any remaining packet loss, corrects any
00:11:54.200
clock skew, corrects any timing problems,
00:11:57.200
mixes it together if there's more than
00:11:59.533
one person talking, and renders it out
00:12:01.300
to the user. It plays the speech
00:12:03.933
or the music out, or it puts
00:12:06.266
the video frames onto the screen.
00:12:12.633
So that's conceptually how these applications work.
00:12:15.766
What does the set of protocol standards
00:12:18.333
which are used to transport multimedia over
00:12:20.666
the Internet, look like?
00:12:23.533
Well, there’s a fairly complex protocol stack.
00:12:27.200
At its core, we have the Internet
00:12:30.066
protocols, IPv4 and IPv6, and UDP and
00:12:32.833
TCP layered above them.
00:12:37.000
Layering above the UDP traffic, is the
00:12:40.566
media transport traffic and the associated data.
00:12:46.066
And what you have there is the
00:12:48.233
UDP packets, which deliver the data;
00:12:51.200
a datagram TLS layer, which negotiates the
00:12:54.633
encryption parameters;
00:12:56.400
and, above that, sit the secure RTP
00:13:00.100
packets, with the audio and video data
00:13:02.400
in them, for transmitting the speech and
00:13:04.966
the pictures. And you have a protocol,
00:13:08.100
known as SCTP,
00:13:10.866
layered on top of DTLS, to provide
00:13:13.266
a peer-to-peer data channel.
00:13:17.900
In addition to the media transport,
00:13:20.133
with RTP and SCTP sitting above DTLS,
00:13:23.666
you also have NAT traversal and path
00:13:25.900
discovery mechanisms. We spoke about these a
00:13:28.733
few lectures ago, with protocols like STUN
00:13:31.533
and TURN and ICE to help set
00:13:35.100
up peer-to-peer connections, to help discover NAT bindings.
00:13:39.966
You have what’s known as a session
00:13:42.233
description protocol, to describe the call being set up.
00:13:46.066
And this identifies the person who's trying
00:13:49.300
to establish the multimedia call, who's trying
00:13:51.800
to establish the video conference.
00:13:53.966
It identifies the person they want to
00:13:56.133
talk to. It describes which audio and
00:13:58.966
video compression algorithms they want to use,
00:14:01.233
which error correction mechanisms they want to
00:14:03.133
use, and so on.
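As an illustration of what such a description looks like, here is a minimal SDP body offering Opus audio and H.264 video. It is not from the lecture: the names, addresses, ports, and dynamic payload type numbers are made-up example values.

```python
EXAMPLE_SDP = """\
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.1
s=Example video call
c=IN IP4 192.0.2.1
t=0 0
m=audio 49170 RTP/AVP 111
a=rtpmap:111 opus/48000/2
m=video 51372 RTP/AVP 96
a=rtpmap:96 H264/90000
"""
```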
00:14:06.433
And this is used, along with one
00:14:08.800
or more of a set of signalling
00:14:10.900
protocols, depending how the call is being set up.
00:14:14.566
It may be an announcement of a
00:14:17.233
broadcast session, using a protocol called the
00:14:19.800
Session Announcement Protocol, for example.
00:14:22.500
It might be a telephone call,
00:14:25.333
using the Session Initiation Protocol, SIP,
00:14:28.600
which is how the University's phone system
00:14:30.866
works, for example.
00:14:33.366
It might be a streaming video session,
00:14:35.966
using a protocol called RTSP. Or it
00:14:39.300
might be a web based video conferencing
00:14:42.700
application, such as Zoom call, or a
00:14:46.633
Webex call, or a Microsoft Teams call,
00:14:49.433
where the negotiation runs over HTTP using a
00:14:52.700
protocol called JSEP,
00:14:54.133
the Javascript Session Establishment Protocol.
00:15:00.766
So let's talk a little bit about the media transport.
00:15:03.966
How do we actually get the audio
00:15:05.600
and video data from the sender to
00:15:07.633
the receiver, once we've captured and compressed
00:15:10.433
data, and got it ready to transmit?
00:15:14.500
Well it's sent within a protocol called
00:15:16.766
the Real-time Transport Protocol, RTP.
00:15:20.566
RTP comprises two parts. There's a
00:15:24.633
data transfer protocol, and there's a control protocol.
00:15:30.166
The data transfer protocol is usually called
00:15:33.433
just RTP, the RTP data protocol,
00:15:35.966
and it carries the media data.
00:15:39.333
It’s structured in the form of a
00:15:40.633
set of payload formats. The payload formats
00:15:43.400
describe how you take the output of
00:15:45.233
each particular video compression algorithm, each particular
00:15:48.200
audio compression algorithm, and map it onto
00:15:50.900
a set of packets to be transmitted.
00:15:54.566
And it describes how
00:15:57.800
to split up a frame of video,
00:16:00.333
how to split up a sequence of
00:16:02.466
audio packets, such that each RTP packet,
00:16:06.800
each UDP packet, which arrives can be
00:16:09.766
independently decoded, even if some of the
00:16:12.333
packets have been lost. It makes sure
00:16:14.433
there's no dependencies between packets, a concept
00:16:17.200
known as application level framing.
00:16:20.133
And this runs over a datagram TLS
00:16:22.833
layer, which negotiates the encryption keys and
00:16:26.400
the security parameters to allow us to
00:16:28.733
encrypt those RTP packets.
00:16:31.400
The control protocol runs in parallel,
00:16:33.700
and provides things like Caller-ID,
00:16:36.466
reception quality statistics,
00:16:39.533
retransmission requests, and so on, in case data gets lost.
00:16:45.500
And there are various extensions that go
00:16:47.466
along with this, that provide things like
00:16:50.466
detailed user experience and reception quality reporting,
00:16:54.000
that provide codec control and feedback mechanisms to
00:16:57.866
detect and correct packet loss, and that
00:17:00.700
provide congestion control and perform circuit breaker
00:17:04.033
functions to stop the transmission if the
00:17:06.200
quality is too bad.
00:17:11.566
The RTP packets are sent inside UDP packets.
00:17:15.766
The diagram we see here shows the
00:17:17.933
format of the RTP packets. This is
00:17:20.000
the format of the media data,
00:17:22.033
which sits within the payload section of UDP packets.
00:17:26.566
And we see that it's actually a
00:17:28.066
reasonably sophisticated protocol. If we look at
00:17:31.566
the format of the packet, we see
00:17:33.233
there’s a sequence number and a timestamp to allow the
00:17:36.866
receiver to reconstruct the ordering, and reconstruct
00:17:39.533
the timing. There’s a source identifier to
00:17:42.733
identify who sent the packet, if you
00:17:44.933
have a multi-party video conference.
00:17:47.266
And there's some payload format identifiers,
00:17:49.500
that describe whether it contains audio or
00:17:51.766
video, what compression algorithm is used, and so on.
00:17:57.933
And there’s space for extension headers,
00:18:00.533
and space for padding, and space
00:18:02.600
for payload data where the actual audio or video data goes.
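For illustration, the fixed part of that header can be unpacked with a few lines of code. This is a sketch based on the standard RTP header layout (RFC 3550), not something shown in the lecture:

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Extract the fixed 12-byte RTP header fields from a raw packet."""
    flags, marker_pt, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version":      flags >> 6,           # should be 2
        "padding":      bool(flags & 0x20),
        "extension":    bool(flags & 0x10),
        "csrc_count":   flags & 0x0F,
        "marker":       bool(marker_pt & 0x80),
        "payload_type": marker_pt & 0x7F,     # identifies the codec in use
        "sequence":     seq,                  # for reordering and loss detection
        "timestamp":    timestamp,            # for timing reconstruction
        "ssrc":         ssrc,                 # identifies the media source
    }
```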
00:18:09.266
And these packets, these RTP packets,
00:18:11.700
are sent within UDP packets. And the
00:18:15.833
sender will typically send these with pretty
00:18:18.566
regular timing. If it’s audio, it generates
00:18:21.866
50 packets per second;
00:18:24.100
if it's video, it might be 25
00:18:26.400
or 30 or 60 frames per second,
00:18:28.600
but the timing tends to be quite predictable.
00:18:32.566
As the data traverses the network,
00:18:34.400
though, the timing is often disrupted by
00:18:37.200
the other types of traffic, the cross-traffic
00:18:39.700
within the network. If we look at
00:18:42.233
the bottom of the slide, we see
00:18:43.666
the packets arriving at the receiver,
00:18:46.500
and we see that the timing is no longer predictable.
00:18:50.900
Because of the other traffic in the
00:18:54.100
network, because it's a best effort network,
00:18:56.733
because it's a shared network,
00:18:58.433
the media data is sharing the network
00:19:01.466
with TCP traffic, with all the other
00:19:03.466
flows on the network, and so the
00:19:05.400
packets don't necessarily arrived with predictable timing.
00:19:11.400
One of the things the receiver has
00:19:13.966
to do, is try to reconstruct the timing.
00:19:18.000
And what we see on this slide,
00:19:19.933
at the top, we see the timing
00:19:21.966
of the data as it was transmitted.
00:19:24.333
And the example is showing audio data,
00:19:27.300
and it’s labelling talk-spurts, and a talk-spurt
00:19:29.733
will be a sentence, or a fragment
00:19:31.833
of a sentence, with a pause between it.
00:19:34.933
We see that the packets comprising the
00:19:37.100
speech data are transmitted with regular spacing.
00:19:40.566
And they pass across the network,
00:19:42.266
and at some point later they arrive at the receiver.
00:19:46.033
There's obviously some delay, it’s labeled as
00:19:48.600
network transit delay on the slide,
00:19:50.733
which is the time it takes the
00:19:52.133
packets to traverse the network.
00:19:54.800
And there will be a minimum amount
00:19:56.500
of time it takes, just based on
00:19:57.833
the propagation delay, how long it takes
00:20:00.033
the signals to work their way down
00:20:03.066
the network from the sender to the
00:20:04.700
receiver. And, on top of that,
00:20:06.633
there'll be varying amounts of queuing
00:20:08.200
delay, depending on how busy the network is.
00:20:11.466
And the result of that, is that
00:20:13.100
the timing is no longer regular.
00:20:14.933
Packets which were sent with regular spacing,
00:20:17.366
arrive bunched together with occasional gaps between
00:20:20.300
them. And, occasionally, they may arrive out-of-order,
00:20:24.066
or occasionally the packets may get lost entirely.
00:20:28.133
And what the receiver does, is to
00:20:30.766
add what’s labeled as “playout buffering delay”
00:20:33.300
on this slide, to compensate for this
00:20:35.833
timing variation. To compensate for what's known
00:20:38.366
as jitter, the variation in the time
00:20:40.700
it takes the packets to transit across the network.
00:20:44.266
By adding a bit of buffering delay,
00:20:46.466
the receiver can allow itself time to
00:20:49.900
put all the packets back into the right order,
00:20:52.833
and to regularise the spacing. It just
00:20:55.633
adds enough delay to allow it to
00:20:57.600
compensate for this variation. So, by adding
00:21:00.400
a little extra delay at the receiver,
00:21:03.066
the receiver can correct for the variations in timing.
00:21:07.200
And, if packets are lost, it obviously
00:21:09.766
has to try and conceal that loss,
00:21:11.700
or it can try to do a
00:21:13.433
retransmission if it thinks the retransmission will
00:21:15.500
arrive in time.
00:21:17.133
Or, if packets arrive, and we see
00:21:19.066
the very last packet here, if the
00:21:20.400
packets arrive too late, if they're delayed
00:21:22.833
too much, then they may arrive too
00:21:24.600
late to be played out. In which
00:21:26.566
case they’re just discarded, and the gap
00:21:29.200
has to be concealed as-if the packet were lost.
00:21:36.066
And, essentially, you can see, that if
00:21:38.300
the packets are played-out immediately they arrive,
00:21:40.366
this variation in timing would lead to
00:21:42.166
gaps, because the packets are not arriving
00:21:45.266
with consistent spacing.
00:21:47.400
If you delay the play-out by more
00:21:49.600
than the typical variation between the inter-arrival
00:21:52.000
time of the packets,
00:21:53.466
you can add enough buffering that once
00:21:56.466
you actually start playing out the packets,
00:21:59.000
when you start playing out the data,
00:22:00.466
you can allow smooth playback. You trade
00:22:02.933
off a little bit of extra latency for very smooth,
00:22:06.566
consistent, playback.
00:22:09.533
And that delay between the packets arriving,
00:22:12.266
and the media starting to play back,
00:22:16.633
that buffering delay,
00:22:18.433
partly allows you to reconstruct the timing,
00:22:22.033
and it partly gives time to decompress
00:22:24.233
the audio, decompress the video, run a
00:22:27.833
loss concealment algorithm, and potentially retransmit any
00:22:31.900
lost packets, depending on the network round-trip time.
00:22:37.333
And then you can schedule the packets
00:22:39.000
to be played out, and you can
00:22:40.200
play the data out smoothly.
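One common way to size that play-out delay is to keep a running estimate of the jitter, in the style of the RFC 3550 inter-arrival jitter calculation. The sketch below is illustrative, not from the lecture; arrival times and RTP timestamps are assumed to be in the same units:

```python
class JitterEstimator:
    """Running inter-arrival jitter estimate in the style of RFC 3550."""

    def __init__(self):
        self.jitter = 0.0
        self.prev_transit = None

    def update(self, arrival_ts, rtp_ts):
        """Update the estimate with one packet's arrival time and RTP timestamp."""
        transit = arrival_ts - rtp_ts             # transit time plus a fixed offset
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)  # change in transit time
            self.jitter += (d - self.jitter) / 16.0
        self.prev_transit = transit
        return self.jitter

# A receiver might then delay play-out by the minimum observed transit time plus
# a few multiples of this jitter estimate, trading a little extra latency for
# smooth, gap-free playback.
```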
00:22:46.900
What's critical, though, is that loss is
00:22:50.366
very much possible. The receiver has to
00:22:52.666
make the best of the packets which do arrive.
00:22:58.500
And a lot of effort, when building
00:23:01.600
video conferencing applications, goes into defining how
00:23:05.633
the compressed audio-visual data is formatted into
00:23:08.200
the packets.
00:23:10.566
And the goal is that each packet
00:23:12.300
should be independently usable.
00:23:14.733
It's easy to take the output of
00:23:18.533
a video compression scheme, a video codec,
00:23:20.633
and just arbitrarily put the data into packets.
00:23:24.066
But, if you do that, the different
00:23:28.966
packets end up depending on each other.
00:23:30.733
You can't decode a particular packet if
00:23:33.300
an earlier one was lost, because it
00:23:35.000
depends on some of the data that was in the earlier packet.
00:23:38.566
So a lot of the skill in
00:23:40.833
building a video conferencing application goes into
00:23:43.433
what's known as the payload format.
00:23:45.166
It goes into the structure of how
00:23:47.033
you format the output of the video compression,
00:23:50.333
and how you format the output of
00:23:52.400
the audio compression, so that for each
00:23:54.100
packet that arrives, it doesn't depend on
00:23:56.133
any data that was in a previous
00:23:58.533
packet, to the extent possible, so that
00:24:01.466
every packet that arrives can be decoded completely.
00:24:05.566
And there are obviously limits to this.
00:24:08.133
Most video compression schemes work by sending
00:24:11.566
a full image, and then encoding differences
00:24:14.200
to that, and that obviously means that
00:24:16.533
you depend on that previous full image,
00:24:19.533
what's known as the index frame.
00:24:21.866
And a lot of these systems build
00:24:24.700
in retransmission schemes if the index frame
00:24:27.866
gets lost, but apart from that the
00:24:29.966
packets for the predicted frames,
00:24:32.800
that are transmitted after that,
00:24:34.133
should all be independently decodable.
00:24:37.833
The paper shown on the right of
00:24:39.633
the slide here, “Architectural Considerations for a
00:24:42.566
New Generation of Protocols”, by David Clark
00:24:45.900
and David Tennenhouse,
00:24:47.366
talks about this approach, and talks about
00:24:49.500
this philosophy of how to encode the
00:24:51.600
data such that the packets are independently
00:24:53.800
decodable, and how to structure these types
00:24:55.766
of applications, and it's very much worth a read.
00:25:02.933
Obviously the packets can get lost,
00:25:05.133
and the way networked applications typically deal
00:25:08.766
with lost packets is by asking for a retransmission.
00:25:12.333
And you can clearly do this with
00:25:14.266
a video conferencing application.
00:25:16.800
The problem is that retransmission takes time.
00:25:19.333
It takes a round-trip time for the
00:25:22.500
retransmission requests to get back from the
00:25:24.433
receiver to the sender, and for the
00:25:26.233
sender to transmit the data.
00:25:28.733
But for video conferencing applications, for interactive
00:25:31.200
applications, you've got quite a strict delay bound.
00:25:34.266
The delay bound is somewhere on the
00:25:36.233
order of 100-150 milliseconds, mouth to ear delay.
00:25:40.266
And that comprises the time it takes
00:25:43.100
to capture a frame of audio,
00:25:45.233
and audio frames are typically 20 milliseconds,
00:25:48.400
so you've got a 20 millisecond frame
00:25:50.533
of audio being captured.
00:25:52.100
And then it takes some time to
00:25:53.700
compress that frame. And then it has
00:25:55.766
to be sent across the networks,
00:25:57.033
so you've got the time to transit the network.
00:25:59.233
And then the time to decompress the
00:26:00.966
frame, and the time to play that
00:26:03.333
frame of audio out. And that typically
00:26:05.766
ends up being four framing durations,
00:26:08.700
plus the network time.
00:26:10.333
So you have 20 milliseconds of frame
00:26:13.266
data being captured. And while that's being
00:26:15.766
captured, the previous frame is being compressed,
00:26:19.100
and transmitted. And, on the receiver side,
00:26:21.233
you have one frame being
00:26:23.200
decoded, errors being concealed, and timing being
00:26:27.933
reconstructed. And then another frame being played
00:26:30.500
out. So you've got 4 frames,
00:26:32.733
80 milliseconds, plus the network time.
00:26:35.033
It doesn't leave much time to do a retransmission.
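Putting rough numbers on that budget (the 20 millisecond frames and 150 millisecond bound are the lecture's figures; the arithmetic is only indicative):

frame_ms   = 20                # one frame of audio
pipeline   = 4 * frame_ms      # capture, encode/send, decode/conceal, play-out: ~80 ms
budget_ms  = 150               # rough upper bound on mouth-to-ear delay
network_ms = budget_ms - pipeline
print("time left for the network:", network_ms, "ms")
# Roughly 70 ms remains for the one-way network path, so adding a full extra
# round trip for a retransmission only fits on quite short paths.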
00:26:38.666
So retransmissions tend not to be particularly
00:26:41.700
useful in video conferencing applications, unless they're
00:26:45.500
on quite short duration network paths,
00:26:48.033
because they arrive too late to be played-out.
00:26:51.866
So what these applications tend to do,
00:26:54.066
is use forward error correction.
00:26:57.733
And the basic idea of forward error
00:26:59.600
correction is that you send additional error
00:27:01.833
correcting packets, along with the original data.
00:27:05.700
So, in the example on the slide,
00:27:07.900
we're sending four packets of original speech
00:27:10.800
data, original media data. And for each
00:27:13.666
of those four packets, you then send
00:27:15.433
a fifth packet, which is the forward
00:27:16.900
error correction packet.
00:27:19.466
So the group of four packets gets
00:27:21.800
turned into five packets for transmission.
00:27:25.366
And, in this example, the third of those packets gets lost.
00:27:30.166
And at the receiver, you take the
00:27:33.266
four of those five packets which did arrive,
00:27:37.300
and you use the error correcting data
00:27:40.366
to recover that loss without retransmitting the packet.
00:27:45.433
And there are lots of different ways
00:27:47.233
in which these error correcting codes can work.
00:27:50.833
In the simplest case, the forward error
00:27:53.100
correction packet is just the result of
00:27:54.833
running the exclusive-or, the XOR operation,
00:27:57.833
on the previous packets. So the forward
00:28:00.466
error correction packets on the slides could
00:28:02.633
be, for example, the XOR of packets
00:28:04.800
1, 2, 3, and 4.
00:28:07.366
In this case, on the receiver,
00:28:09.800
when it notices that packet 3 has
00:28:11.433
been lost, if it calculates the XOR
00:28:13.800
of the received packets, so if you
00:28:15.833
XOR packets 1, 2, and 4,
00:28:17.833
and the FEC packet together, what will
00:28:20.300
come out will be the original, missing, packet.
00:28:25.966
And that's obviously a simple approach.
00:28:28.466
There are a lot of much more
00:28:30.666
sophisticated forward error correction schemes, which trade
00:28:33.633
off different amounts of complexity for different overheads.
00:28:36.900
But the idea is that you send
00:28:38.566
occasional packets, which are error correcting packets,
00:28:42.033
and that allows you to recover from
00:28:44.600
some types of loss without retransmitting the
00:28:47.700
packets, so you can recover losses more quickly.
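The simple XOR scheme described above can be sketched directly; this assumes all packets in the group are the same length (a real scheme would pad or record lengths):

def xor_packets(packets):
    # XOR the packets together, byte by byte.
    fec = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            fec[i] ^= b
    return bytes(fec)

# Sender: four media packets plus one FEC packet = p1 ^ p2 ^ p3 ^ p4.
media = [bytes([n] * 8) for n in (1, 2, 3, 4)]
fec = xor_packets(media)

# Receiver: packet 3 was lost; XOR of everything that did arrive recovers it.
received = [media[0], media[1], media[3], fec]
assert xor_packets(received) == media[2]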
00:28:55.500
And that's the summary of how we
00:28:58.266
transmit media over the Internet.
00:29:01.833
That data is captured, compressed, framed into
00:29:05.700
RTP packets, each of which includes a sequence number
00:29:09.766
and timing recovery information.
00:29:11.966
And then, when they arrive at the receiver,
00:29:13.866
it’s decompressed, it’s buffered, the timing
00:29:16.600
is reconstructed, and the buffering is chosen
00:29:18.833
to allow the receiver to reconstruct the timing,
00:29:21.900
and then the media is played-out to the user.
00:29:26.033
And that comprises the media transport parts.
00:29:29.566
As we saw, there's also signalling protocols
00:29:33.866
and NAT traversal protocols. What I'll talk
00:29:36.833
about in the next part is,
00:29:38.333
briefly, how the signalling protocols work to
00:29:40.500
set up multimedia conferencing calls.
Part 3: Interactive Applications (Control Plane)
This part moves on from discussing real-time data transfer to discuss
the control plane supporting interactive conferencing applications. It
discusses the WebRTC data channel, and the supporting signalling
protocols, including the SDP offer/answer exchange, SIP, and WebRTC
signalling via JSEP.
Slides for part 3
00:00:00.000
In the previous part of the lecture,
00:00:01.733
I introduced interactive conferencing applications. I spoke
00:00:04.900
a bit about the architecture of those
00:00:06.766
applications, about the latency requirements, and the
00:00:10.100
structure of those applications, and began to
00:00:12.666
introduce the standard set of conferencing protocols.
00:00:16.066
I spoke in detail about the Real-time
00:00:18.500
Transport Protocol, and the way media data is transferred.
00:00:22.766
In this part of the lecture,
00:00:24.100
I want to talk briefly about two other aspects of
00:00:27.800
interactive video conferencing applications,
00:00:30.300
the data channel, and the signalling protocols.
00:00:35.166
In addition to sending audio visual media,
00:00:39.933
most video conferencing applications also provide some
00:00:43.700
sort of peer-to-peer data channel.
00:00:47.200
This is part of the WebRTC standards,
00:00:50.733
and it's also part of most of the other systems as well.
00:00:57.133
The goal is to provide
00:00:59.566
for applications like peer-to-peer file transfer as
00:01:03.033
part of the video conferencing tool,
00:01:05.133
to support a chat session along with
00:01:08.300
the audio and video, and to support
00:01:10.166
features like reaction emoji, the ability to
00:01:13.466
raise your hand, request that the speaker
00:01:16.633
talks faster or slower, and so on.
00:01:20.900
The way this is implemented in WebRTC,
00:01:23.700
is using a protocol called SCTP running
00:01:27.533
inside a secure UDP tunnel.
00:01:30.566
I’m not going to talk much about SCTP.
00:01:33.566
SCTP is the Stream Control Transmission Protocol,
00:01:37.233
and it was a previous attempt at replacing TCP.
00:01:41.700
The original version of SCTP ran directly
00:01:45.233
over IP, and was pitched as a
00:01:48.666
direct replacement for TCP, running as a
00:01:51.833
peer of TCP or UDP, directly on the IP layer.
00:01:56.800
And it turned out this was too
00:01:58.333
difficult to deploy, so it didn't get
00:02:02.233
tremendous amounts of take-up. But, at the
00:02:04.933
point when the WebRTC standards were being
00:02:07.466
developed, it was
00:02:09.966
available, and specified, and it was deemed
00:02:12.966
relatively straightforward to move it to run
00:02:15.766
on top of UDP, to run on
00:02:17.733
top of Datagram TLS, to provide security,
00:02:20.800
as a deployable way of providing a
00:02:24.866
reliable peer-to-peer data channel.
00:02:29.166
And it would perhaps have been possible
00:02:31.100
to use TCP to do this,
00:02:34.066
but the belief at the time was
00:02:36.500
that NAT traversal for TCP wasn't very
00:02:40.533
reliable, and that something running over UDP
00:02:43.600
would work better for NAT traversal.
00:02:46.300
And I think that was the right decision.
00:02:50.033
And SCTP, the WebRTC data channel using
00:02:53.800
SCTP over DTLS over UDP,
00:02:57.800
provides a transparent data channel. It provides
00:03:01.600
the ability to deliver framed messages,
00:03:04.866
it supports delivering multiple sub-streams of data
00:03:07.900
over a single connection, and it supports
00:03:10.300
congestion control, retransmissions, reliability and so on.
00:03:16.366
And it makes it straightforward to build
00:03:18.066
peer-to-peer applications using WebRTC.
00:03:21.600
And it gains all the deployment advantages that
00:03:24.000
we gained with QUIC, by running over UDP.
00:03:28.333
You might ask why WebRTC uses
00:03:34.866
SCTP to build its data channel, rather than using QUIC?
00:03:41.033
And, fundamentally, that's because WebRTC predates the
00:03:43.666
development of QUIC.
00:03:46.966
It seems likely, now that the QUIC
00:03:49.300
standard is finished, that future versions of
00:03:51.600
WebRTC will migrate, and switch to using
00:03:54.433
QUIC, and gradually phase out the SCTP-based data channel.
00:03:59.466
And QUIC learned, I think, from this
00:04:02.066
experience, and is more flexible and more
00:04:04.466
highly optimised than the SCTP, DTLS, UDP stack.
00:04:13.666
In addition to the media transport and
00:04:16.166
data, you need some form of signalling,
00:04:18.933
and some sort of session description,
00:04:20.766
to specify how to set up a video conferencing call.
00:04:29.300
Video conferencing calls run peer-to-peer. The goal
00:04:32.666
of a system like Zoom, or Skype,
00:04:35.900
or any of these systems, is to
00:04:37.700
set up peer-to-peer data, where possible,
00:04:40.700
so that they can achieve the lowest possible latency.
00:04:45.066
They need some sort of signalling protocol
00:04:47.300
to do that. They need some sort
00:04:49.033
of protocol to convey the details of
00:04:51.866
what transport connections are to be set
00:04:54.466
up, to exchange the set of candidate
00:04:56.700
IP addresses on which they can be
00:04:58.500
reached, to set up the peer-to-peer connection.
00:05:01.666
They need to specify the media formats
00:05:04.333
they want to use. Is it just
00:05:06.300
audio? Or is it audio and video?
00:05:08.366
And which compression algorithms are to be
00:05:10.366
used? And they want to specify the
00:05:12.166
timing of the session, and the security
00:05:15.200
parameters, and all the other parameters.
00:05:20.266
A standardised way of doing that is
00:05:23.733
using a protocol called the Session Description Protocol.
00:05:27.533
The example on the right of the
00:05:29.133
slide is an example of an SDP,
00:05:32.300
a Session Description Protocol, description of a
00:05:35.300
simple multimedia conference.
00:05:40.366
The format of SDP is unpleasant.
00:05:42.533
It’s essentially a set of key-value pairs,
00:05:46.000
where the keys are all single letters,
00:05:48.533
and the values are more complex,
00:05:51.633
one key-value pair per line, with the
00:05:54.800
key and the value separated by an equals sign.
00:05:58.300
And, as we see in the example,
00:06:00.766
it starts with a version number,
00:06:02.666
v=0. There’s an originator line, and it
00:06:06.300
was originated by Jane Doe, who had
00:06:09.100
IP address 10.47.16.5.
00:06:12.866
It's a seminar about session description protocol.
00:06:15.933
It's got the email address of Jane
00:06:18.366
Doe, who set up the call,
00:06:20.566
it's got their IP address, the times
00:06:22.966
that session is active,
00:06:25.400
it's receive only, it’s broadcast so that
00:06:29.666
the listener just receives the data,
00:06:33.166
it’s sending using audio and video media,
00:06:36.366
and it specifies the ports and some
00:06:38.366
details of the video compression scheme, and so on.
00:06:43.133
The details of the format aren't particularly
00:06:44.933
important. It’s clear that it's sending
00:06:49.333
what the session is about, the IP
00:06:53.133
addresses, the times, the details of the
00:06:55.766
audio compression, the details of the video
00:06:57.800
compression, the port numbers to use,
00:06:59.500
and so on. And how this is
00:07:01.066
encoded isn't really important.
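The description being walked through here looks like the well-known example from the SDP specification (RFC 4566); something along these lines, shown for reference rather than copied from the slide:

v=0
o=jdoe 2890844526 2890842807 IN IP4 10.47.16.5
s=SDP Seminar
i=A Seminar on the session description protocol
u=http://www.example.com/seminars/sdp.pdf
e=j.doe@example.com (Jane Doe)
c=IN IP4 224.2.17.12/127
t=2873397496 2873404696
a=recvonly
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 99
a=rtpmap:99 h263-1998/90000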
00:07:07.133
In order to set up an interactive
00:07:08.666
call, you need some sort of a negotiation.
00:07:11.166
You need some sort of offer to
00:07:12.500
communicate, which says this is the set
00:07:15.566
of video compression schemes, this is the
00:07:17.800
set of audio compression schemes, that the sender supports.
00:07:21.033
This is who is trying to call
00:07:23.133
you. This is the IP address that
00:07:24.866
they're calling you from. These are the
00:07:27.866
public keys used to negotiate the
00:07:29.800
security parameters. And so on.
00:07:33.033
And that comprises an offer.
00:07:35.900
And the offer gets sent via a
00:07:38.833
signalling channel, via some out of band
00:07:42.166
signalling server, to the
00:07:45.500
responder, to the person you're trying to call.
00:07:49.633
The responder generates an answer, which looks
00:07:52.666
at that set of codecs the offer
00:07:55.600
specified, and picks the subset it also understands.
00:07:59.200
It provides the IP addresses it can
00:08:01.733
be reached at, it provides its public
00:08:04.933
keys, confirms its willingness to communicate,
00:08:07.333
and so on. And the answer flows
00:08:09.300
back to the original sender, the initiator of the call.
00:08:15.500
And this allows the offering party and
00:08:17.466
the answering party, the initiator and responder,
00:08:20.200
to exchange the details they need to establish the call.
00:08:24.066
The offer contains all the IP address
00:08:27.266
candidates that can be used with the
00:08:29.433
ICE algorithm to probe the NAT bindings.
00:08:31.933
The answer coming back contains the candidates
00:08:34.700
for the receiver, that allows them to
00:08:37.266
do the STUN exchange, the STUN packets,
00:08:40.000
to run the ICE algorithms that actually
00:08:41.600
sets up the peer-to-peer connection.
00:08:43.566
And it's also got the details of
00:08:45.266
the compression algorithms, the video codec,
00:08:47.466
the audio formats, the security parameters, and so on.
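A toy sketch of the offer/answer logic, ignoring SDP's actual syntax and using invented field names, might look like this:

def make_answer(offer, my_codecs, my_candidates, my_public_key):
    # Keep only the codecs both sides understand.
    common = [c for c in offer["codecs"] if c in my_codecs]
    if not common:
        raise ValueError("no codec in common, so the call cannot be set up")
    return {
        "codecs": common,              # subset of the offered codecs
        "candidates": my_candidates,   # addresses for the ICE connectivity checks
        "public_key": my_public_key,   # used to agree the security parameters
    }

offer = {"codecs": ["opus", "G.711", "H.264", "VP8"],
         "candidates": ["192.0.2.1:50000"], "public_key": "initiator-key"}
answer = make_answer(offer, my_codecs=["opus", "VP8"],
                     my_candidates=["198.51.100.7:40000"],
                     my_public_key="responder-key")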
00:08:53.000
Unfortunately SDP, which we have ended up
00:08:56.000
using as the negotiation format, really wasn't
00:08:58.600
designed to do this. It was originally
00:09:01.633
designed as a one way announcement format
00:09:05.166
to describe video on demand sessions,
00:09:08.033
rather than as a format for negotiating
00:09:10.200
parameters. So the syntax is pretty unpleasant,
00:09:13.533
and the semantics are pretty unpleasant,
00:09:16.266
and it's somewhat complex to use in practice.
00:09:20.266
And this complexity wasn't really visible when
00:09:24.500
we started developing these systems,
00:09:27.633
these tools, but unfortunately it turned out
00:09:31.100
that SDP wasn't a great format here,
00:09:33.166
but it's now too entrenched
00:09:35.633
for alternatives to take off. So we’re
00:09:37.766
left with this quite unpleasant, not particularly
00:09:39.966
well-designed format. But, we use it,
00:09:42.633
and we negotiate the parameters.
00:09:47.666
Exactly how this is used depends on
00:09:49.833
the system you're using. There are two widely used models.
00:09:55.066
One is a system known as the Session Initiation Protocol.
00:09:59.300
And the Session Initiation Protocol, SIP,
00:10:02.866
is very widely used for telephony,
00:10:05.533
and it's widely used for stand-alone video
00:10:09.100
conferencing systems.
00:10:11.466
If you make a phone call using
00:10:13.166
a mobile phone, this is how the
00:10:16.933
phone locates the person you wish to
00:10:19.200
call, and sets up the call:
00:10:21.233
it’s using SIP, for example.
00:10:25.033
And SIP relies on a set of
00:10:26.900
conferencing servers, one representing the person making
00:10:30.766
the call, and one representing the person being called.
00:10:34.733
And the two devices, typically mobile phones
00:10:37.500
or telephones these days, have a direct
00:10:39.933
connection to those servers, which they maintain
00:10:41.933
at all times.
00:10:44.233
On the sending side, when you try
00:10:47.366
to make a call, the message goes
00:10:49.466
out to the server. At that point,
00:10:52.233
there's a set of
00:10:54.166
STUN packets exchanged, and a set of
00:10:56.133
signalling messages exchanged, that allow the initiator
00:10:59.233
to find its public NAT bindings.
00:11:02.466
And then the message goes out to
00:11:04.100
the server, and that locates the server
00:11:07.000
for the person being called, and passes
00:11:09.200
the message back over the connection to
00:11:13.600
their server, and it eventually reaches the responder.
00:11:17.366
And that gives the responder the candidate
00:11:20.400
addresses, and all the connection details,
00:11:23.133
and the codec parameters, and so on,
00:11:24.900
needed for it to decide whether it
00:11:26.833
wishes to accept the call, and to
00:11:29.166
start setting up the NAT bindings.
00:11:32.300
And it responds, and the message goes
00:11:34.133
back through the multiple servers to the
00:11:36.233
initiator, and that completes the offer answer exchange.
00:11:39.533
At that point, they can start running
00:11:41.866
the ICE algorithm, discovering the NAT bindings.
00:11:44.700
And they've already agreed the parameters at
00:11:46.766
this point, which codecs they're using,
00:11:49.000
what public keys they're using,
00:11:50.300
and so on. And that lets them
00:11:52.066
set up a peer-to-peer connection
00:11:54.666
using the ICE algorithm, and using STUN,
00:11:57.933
to set up a peer-to-peer connection over
00:11:59.866
which the media data can flow.
00:12:04.066
And it's an indirect connection set up.
00:12:06.566
The signalling flows from the initiator, to their
00:12:08.766
server, to the responder’s server, to the
00:12:11.100
responder, and then back via the server path.
00:12:15.466
And that indirect signalling setup allows the
00:12:18.333
direct peer-to-peer connection to be created.
00:12:25.433
In more modern systems, systems using the
00:12:28.966
WebRTC browser-based approach,
00:12:33.566
the trapezoid that we have in the
00:12:36.833
SIP world, with the servers representing each
00:12:40.733
of the two parties, tends to get
00:12:42.033
collapsed into a single server representing the
00:12:44.533
conferencing service.
00:12:47.233
And the server, in this case,
00:12:48.866
is something such as the Zoom servers,
00:12:51.633
or the Webex servers, or the Microsoft Teams servers.
00:12:56.033
And, it's essentially following the same pattern.
00:12:58.900
It’s just that there's now a
00:13:00.366
single conferencing server that initiates the call,
00:13:03.566
rather than being cross-provider, with a server
00:13:07.633
for each party.
00:13:10.466
And this is how web-based conferencing systems such as
00:13:14.766
Zoom, and Webex, and Teams, and the like, work.
00:13:19.866
You get your Javascript application, your web-based
00:13:24.000
application, sitting on top. This talks to
00:13:27.000
the WebRTC API in the browsers,
00:13:30.233
and that provides access to the session
00:13:32.666
descriptions which you can exchange with the
00:13:36.466
server over HTTP GET and POST requests
00:13:40.000
to figure out the details of
00:13:43.033
how the communication should be set up.
00:13:45.733
And, once you've done that, you can
00:13:47.333
fire off the data channel, and the
00:13:49.100
media transport, and establish the peer-to-peer connections.
00:13:54.000
So the initial signalling is exchanged via
00:13:56.133
HTTP to the web server, that controls
00:13:58.433
the call. The offer-answer exchange in SDP
00:14:02.566
is exchanged with the server, and that
00:14:04.733
exchanges it with the responder, and then,
00:14:06.833
when all the parties agree to communicate,
00:14:10.566
the server sends back the session description
00:14:14.500
containing the details which the browsers need
00:14:18.133
to set up the call. And they
00:14:19.533
then establish a peer-to-peer connection.
00:14:22.366
And the goal is to integrate the
00:14:24.166
video conferencing features into the browsers,
00:14:27.033
and to allow the server to control the call setup.
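From the client's point of view, that signalling exchange might look something like the following sketch; the server URL, path, and JSON message shape are all assumptions for illustration, not part of any standard:

import json
import urllib.request

def send_offer(server_url, sdp_offer):
    # POST the SDP offer to the (hypothetical) conferencing server and
    # read back the SDP answer once the other party has accepted the call.
    body = json.dumps({"type": "offer", "sdp": sdp_offer}).encode()
    request = urllib.request.Request(server_url + "/call", data=body,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# answer = send_offer("https://conference.example.com", local_offer_sdp)
# The browser then applies the answer, and ICE sets up the peer-to-peer
# path over which the media and the data channel flow.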
00:14:30.366
And, as we've seen over the course
00:14:33.933
of, I guess, the last year or
00:14:36.200
so, it actually works reasonably well.
00:14:39.666
These video conferencing applications work
00:14:42.000
reasonably well in practice.
00:14:47.600
So what's happening with interactive applications?
00:14:50.100
Where are things going?
00:14:53.166
I think there’s two ways these types
00:14:56.233
of applications are evolving.
00:14:59.100
One is supporting better quality, and supporting
00:15:04.166
new types of media. Obviously, over time,
00:15:08.166
the audio and the video quality,
00:15:09.966
and the frame rate, and the resolution,
00:15:11.533
has gradually been increasing, and I expect
00:15:13.733
that will continue for a while.
00:15:16.866
There's also people talking about running various
00:15:20.433
types of augmented reality, virtual reality,
00:15:23.933
holographic 3D conferencing, and
00:15:27.133
tactile conferencing where you transmit a sense
00:15:30.533
of touch over the network. And some
00:15:33.733
of these have perhaps stricter requirements on
00:15:36.600
latency, and stricter requirements on quality but,
00:15:39.933
as far as I can tell,
00:15:40.966
they all fit within the basic framework we've described.
00:15:44.266
They can all be transmitted
00:15:46.366
over UDP using either RTP,
00:15:50.900
or the data channel, or something
00:15:52.500
very like it. And they all fit
00:15:54.066
within the same basic framework, of adding
00:15:57.033
a little bit of buffering to reconstruct
00:15:58.800
the timing, and graceful degradation for the media transport.
00:16:05.633
Currently, we have a mix of RTP
00:16:09.133
for the audio and video data,
00:16:11.433
and the SCTP-based data channel.
00:16:15.000
It's pretty clear, I think, that the
00:16:16.700
data channel is going to transition to
00:16:18.366
using QUIC relatively soon.
00:16:21.766
And there's a fair amount of active
00:16:23.833
research, and standardisation, and discussion, about whether
00:16:26.566
it makes sense to also move the
00:16:28.633
audio and video data to run over QUIC.
00:16:32.400
And people are building unreliable datagram extensions
00:16:35.666
to QUIC to support this, so I
00:16:37.933
think it's reasonably likely that we’ll end
00:16:39.833
up running both the audio and the
00:16:41.633
video and the data channel over
00:16:43.633
peer-to-peer QUIC connections, although the details of
00:16:46.933
how that will work are still being discussed.
00:16:53.700
And that's what I would say about
00:16:55.066
interactive applications. In the next part I
00:16:58.066
will move on to talk about video on
00:17:01.633
demand, and streaming applications.
Part 4: Streaming Video
The final part of the lecture discusses streaming video. It talks about
HTTP Adaptive Streaming and MPEG DASH, content delivery networks, and
some reasons why streaming media is delivered over HTTP. The operation
of HTTP adaptive streaming protocols is discussed, and their strengths
and limitations are highlighted.
Slides for part 4
00:00:00.466
In this last part of the lecture,
00:00:02.266
I want to talk about streaming video
00:00:04.033
and HTTP adaptive streaming.
00:00:08.100
So how do streaming video applications,
00:00:10.666
such as Netflix, the iPlayer, and YouTube, actually work?
00:00:15.100
Well, what you might expect them to
00:00:17.000
do, is use RTP, the same as
00:00:18.933
the video conferencing applications, to stream the
00:00:21.833
video over the network in a low-latency
00:00:24.700
and loss-tolerant way.
00:00:26.566
And, indeed, this is how streaming video,
00:00:28.833
streaming audio, applications used to work.
00:00:31.400
Back in the late 1990s, the most
00:00:33.966
popular application in this space was RealAudio,
00:00:36.800
and later RealPlayer when it incorporated video support.
00:00:40.433
This did exactly as you would expect.
00:00:43.066
It streamed the audio and the video
00:00:45.566
over RTP, and had a separate control
00:00:48.266
protocol, the Real Time Streaming Protocol,
00:00:50.800
to control the playback.
00:00:53.033
These days, though, most applications actually deliver
00:00:56.300
the video over HTTPS instead.
00:00:59.833
And as a result, they have significantly
00:01:01.933
worse performance. They have significantly higher latency,
00:01:05.466
and significantly higher startup latency.
00:01:09.366
The reason they do this, though,
00:01:11.400
is that by streaming video over HTTPS,
00:01:14.500
they can integrate better with
00:01:15.933
content distribution networks.
00:01:20.366
So what is a content distribution network?
00:01:23.433
A content distribution network is a service
00:01:26.066
that provides a global set of web
00:01:28.200
caches, and proxies, that you can use
00:01:30.600
to distribute your application, that you can
00:01:34.466
use to distribute the web data,
00:01:36.366
the web content, that comprises your application
00:01:39.133
or your website.
00:01:42.166
They're run by companies such as Akamai,
00:01:44.533
and CloudFlare, and Fastly. And these companies
00:01:47.600
run massive global sets of web proxies,
00:01:50.766
web caches. And they take over the
00:01:53.766
delivery of particular sets of content from
00:01:57.666
websites. As a website operator, you give
00:02:01.300
the files, the images, the videos,
00:02:03.500
that you wish to be hosted on
00:02:05.533
the CDN, to the CDN operator.
00:02:08.600
And they ensure that they’re cached throughout
00:02:10.900
the network, at locations close to where
00:02:12.866
your customers are.
00:02:14.733
And each of those files, or images,
00:02:16.700
or videos, is given a unique URL.
00:02:19.600
And the CDN manages the DNS resolution
00:02:23.033
for that URL, so that when you
00:02:25.200
look up the name, it returns you
00:02:27.233
an IP address that corresponds to a
00:02:29.100
proxy, or a cache, which is located
00:02:31.200
physically near to you.
00:02:33.666
And that server has the data on
00:02:35.533
it such that the response comes quickly,
00:02:38.166
and such that the load is balanced
00:02:39.666
around these servers, around the world.
00:02:43.966
And these CDNs, these content distribution networks,
00:02:46.833
are extremely effective at delivering and caching
00:02:49.700
HTTP content.
00:02:51.933
They support some extremely high volume applications:
00:02:56.366
game delivery services such as Steam,
00:03:00.800
applications like the Apple software update,
00:03:04.000
or the Windows software update, and massively
00:03:07.566
popular websites.
00:03:09.900
And they have global deployments, and they
00:03:12.100
have agreement with the overwhelming majority of
00:03:14.466
ISPs to host these caches, these proxy
00:03:17.700
servers, at the edge of the network.
00:03:19.866
So, no matter where you are in
00:03:22.000
the network, you're very near to a
00:03:23.933
content distribution node.
00:03:28.633
A limitation of CDNs, though, is that
00:03:30.833
they only work with HTTP-based content.
00:03:33.933
They’re for delivering web content. And the
00:03:36.566
entire infrastructure is based around delivering web
00:03:39.700
content over HTTP, or more typically these days
00:03:43.033
HTTPS. They don't support RTP based streaming.
00:03:49.600
The way streaming video is delivered,
00:03:52.900
these days, is to make use of
00:03:55.166
content distribution networks. It's delivered using HTTPS
00:03:59.200
from a CDN node.
00:04:03.100
The contents of a video, in a
00:04:05.666
system such as Netflix, is encoded in
00:04:08.300
multiple chunks, where each chunk comprises,
00:04:12.466
typically, around 10 seconds worth of the video data.
00:04:17.233
Each of the chunks is designed to
00:04:19.366
be independently decodable, and each is made
00:04:22.333
available in many different versions, at many
00:04:24.366
different quality rates, many different bandwidths.
00:04:29.000
A manifest file provides an index for
00:04:31.666
what chunks are available. It's an index,
00:04:35.300
which says, for the first 10 seconds of the movie
00:04:38.433
there are these six different versions available,
00:04:40.933
and this is the size of each
00:04:42.533
one, and the quality level for each
00:04:44.266
one, and this is a URL where
00:04:46.000
it can be retrieved from.
00:04:47.833
And the same for the next 10 seconds,
00:04:49.933
and the next 10 seconds, and so on.
00:04:53.233
And the way the video streaming works,
00:04:55.700
is that the client fetches the manifest,
00:04:59.666
looks at the set of chunks,
00:05:04.266
and starts downloading the chunks in turn.
00:05:07.333
And it uses standard HTTPS downloads to
00:05:10.166
download each of the chunks. But,
00:05:12.100
as it's doing so, it monitors how
00:05:13.933
quickly it’s successfully downloading. And, based on
00:05:17.000
that, it chooses what encoding rate to fetch next.
00:05:21.033
So, it starts out by fetching a
00:05:23.100
relatively low rate chunk, and measures how
00:05:26.300
quickly it downloads. Maybe it's fetching a
00:05:29.000
chunk that's encoded at 500 kilobits per second,
00:05:31.966
and it measures how fast it actually
00:05:33.733
downloads. And it sees if it's actually
00:05:36.133
managing to download the 500 kilobits per
00:05:38.733
second video faster, or slower, than 500 kilobits.
00:05:42.966
If it's downloading slower than real-time,
00:05:47.033
it will pick a lower quality,
00:05:48.733
a smaller chunk, for the next time.
00:05:50.900
And, if it's downloading faster than real-time,
00:05:53.266
then it will try and pick a
00:05:54.933
higher quality, a higher rate, chunk the
00:05:57.366
next time. So it can adapt the
00:05:59.533
rate at which it downloads the video
00:06:01.233
by picking a different quality setting for
00:06:03.100
each of the chunks, each of the pieces of the video.
00:06:06.966
And as it downloads the chunks,
00:06:08.966
it plays each one out, in turn,
00:06:10.766
while it's downloading the next chunk.
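A stripped-down sketch of that download-measure-adapt loop, with an invented manifest layout and no buffering policy, looks roughly like this:

import time
import urllib.request

def fetch(url):
    # Download one chunk and measure the achieved throughput in bits per second.
    start = time.time()
    with urllib.request.urlopen(url) as response:
        data = response.read()
    return data, 8 * len(data) / (time.time() - start)

def choose_rate(available_rates, measured_bps, safety=0.8):
    # Pick the highest encoding rate comfortably below the measured throughput.
    usable = [r for r in available_rates if r <= safety * measured_bps]
    return max(usable) if usable else min(available_rates)

def play(manifest, start_rate):
    # manifest: one dict per chunk, mapping encoding rate (bps) to its URL (invented shape).
    rate = start_rate
    for chunk in manifest:
        data, measured = fetch(chunk[rate])
        # decode and buffer the chunk here, while the next one downloads
        rate = choose_rate(sorted(chunk.keys()), measured)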
00:06:15.600
And each of the chunks of video
00:06:17.666
is typically five or ten seconds,
00:06:19.933
or thereabouts, worth of video
00:06:21.766
content. And each one is compressed multiple
00:06:25.600
different times, and it's available at multiple
00:06:27.600
different rates, and it’s available at multiple
00:06:30.933
different sizes, for example.
00:06:33.733
And the chart on the slide,
00:06:35.300
on the right, gives an example of
00:06:37.466
how Netflix recommend videos are encoded,
00:06:40.600
starting at a rate of 235 kilobits
00:06:45.633
per second, for a 320x240 very low
00:06:49.833
resolution video, and moving up to 5800
00:06:53.300
kilobits per second, 5.8 megabits per second,
00:06:56.266
for a full HD quality video.
00:06:59.000
You can see that each 10 second
00:07:01.300
piece of content is available at 10
00:07:03.233
different quality levels, 10 different sizes.
00:07:07.000
And the receiver fetches the manifest to
00:07:09.666
start off with, which gives it the
00:07:11.200
index of all of the different chunks,
00:07:12.833
and all of the different sizes,
00:07:14.866
and which URL each one is available at.
00:07:17.700
And, as it fetches the chunk,
00:07:21.633
it tries to retrieve the URL for
00:07:26.133
that chunk, which involves a DNS request,
00:07:29.066
which involves the CDN redirecting it to
00:07:33.000
a local cache. And from that local
00:07:34.966
cache, as it downloads that chunk of
00:07:36.833
video, it measures the download rate.
00:07:39.766
If the download rate is slower than
00:07:42.466
the encoding rate, it switches to a
00:07:44.733
lower rate for the next chunk.
00:07:46.066
If the download rate is faster than the encoding rate,
00:07:49.200
it can consider switching up to a
00:07:50.666
higher quality, a higher rate, for the
00:07:52.233
next chunk. It chooses the encoding rate
00:07:55.866
to fetch based on the TCP download rate.
00:08:00.766
And we see what's happening is that
00:08:03.766
we've got two levels of adaptation going on.
00:08:07.500
On one level, we've got the dynamic
00:08:11.233
adaptive streaming, the DASH clients, fetching the
00:08:15.200
content over HTTP.
00:08:17.133
They’re fetching ten seconds worth of video
00:08:19.166
at a time, measuring the total time
00:08:21.033
it takes to download that ten seconds
00:08:22.800
worth of video. And they’re dividing the
00:08:25.333
number of bytes in each chunk by
00:08:27.800
the time taken, and that gives them
00:08:29.566
an average download rate for the chunk.
00:08:34.000
They're also doing this, though, over a
00:08:37.033
TCP connection. And, as we saw in
00:08:39.966
some of the previous lectures, TCP adapts
00:08:41.966
its congestion window every round-trip time.
00:08:44.833
And it's following a Reno or a
00:08:47.266
Cubic algorithm, and it's following the AIMD
00:08:50.533
approach. And, as you see at the
00:08:52.600
top of the slide, the sending rate’s
00:08:54.633
bouncing around following the sawtooth pattern,
00:08:57.266
and following the slow start and the
00:08:58.866
congestion avoidance phases of TCP.
00:09:01.400
So we've got quite a lot of
00:09:03.400
variation, on very short time scales,
00:09:06.233
as TCP does its thing. And then
00:09:08.800
that averages out, to give an overall
00:09:10.533
download rate for the chunk.
00:09:14.566
And, depending on the overall download rate
00:09:16.933
that TCP manages to get, averaged over
00:09:20.366
the ten seconds worth of video for
00:09:22.833
the chunk, that selects the size of the next
00:09:25.666
chunk to download. The idea is that
00:09:28.033
each chunk can be downloaded, at least
00:09:31.900
at real-time speed, and ideally a bit
00:09:34.166
faster than real-time, so the download gets
00:09:37.300
ahead of itself.
00:09:41.366
And, when you start watching a movie
00:09:45.233
on Netflix, or watching a program on
00:09:47.033
the iPlayer, for example, you often see
00:09:48.900
it starts out relatively poor quality,
00:09:51.833
for the first few seconds, and then
00:09:53.466
the quality jumps up after 10 or 20 seconds or so.
00:09:57.666
And what's happening here, is that the
00:09:59.400
receiver’s picking a conservative download rate for
00:10:01.800
the initial chunk, it’s picking one of
00:10:05.200
the relatively low quality, relatively small,
00:10:09.966
chunks, and downloading that first, and measuring
00:10:13.000
how long it takes. And, typically,
00:10:15.366
that's a conservative choice, and it realises
00:10:17.700
that the chunks are actually
00:10:21.666
downloading faster, so it switches up the
00:10:23.566
quality level fairly quickly. And, after the
00:10:25.766
first 10, 20, seconds, after a couple
00:10:28.300
of chunks have gone, the quality level has picked up.
00:10:35.500
A consequence of all of this,
00:10:37.633
is that it takes quite a long
00:10:41.033
time for streaming video to get started.
00:10:44.900
It’s quite common that when you start
00:10:47.266
playing a movie on Netflix, or a
00:10:50.133
program on the iPlayer, that it takes
00:10:51.633
a few seconds before it gets going.
00:10:54.800
And the reason for this, is some
00:10:57.600
combination of the chunk duration, and the
00:11:00.966
playout buffering, and the encoding delays if
00:11:03.566
the video’s being encoded live.
00:11:06.800
Fetching chunks, which are typically 10 seconds
00:11:10.300
long, you need to have one chunk
00:11:13.933
being played out at any one time.
00:11:17.000
You need to have 10 seconds worth
00:11:18.766
of video buffered up in the receiver,
00:11:20.566
so you can be playing that chunk
00:11:22.033
out while you're fetching the next one.
00:11:24.900
So you've got one chunk being played
00:11:26.633
out, and one being fetched, so you’ve
00:11:28.333
immediately got two chunks worth of buffering.
00:11:30.300
So that's 20 seconds worth of buffering.
00:11:32.600
Plus the time it takes to fetch
00:11:34.466
over the network, plus, if it's being
00:11:37.133
encoded live, the time it takes to
00:11:38.766
encode the chunk which will be at
00:11:40.300
least another chunk duration, since it needs to
00:11:42.166
pull in the entire chunk before it
00:11:43.833
can encode it, so you've got at
00:11:45.166
least another chunk, so that'd be another 10 seconds.
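Putting indicative numbers on that start-up delay, using the 10 second chunks from the lecture and an assumed one second network fetch time:

chunk_s  = 10
playing  = chunk_s   # one chunk buffered and playing out
fetching = chunk_s   # the next chunk being downloaded
encoding = chunk_s   # only applies if the content is being encoded live
network  = 1         # assumed time to fetch a chunk over the network

print("video on demand:", playing + fetching + network, "seconds")             # about 21 s
print("live content:   ", playing + fetching + encoding + network, "seconds")  # about 31 s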
00:11:49.400
So you get a significant amount of
00:11:51.300
latency because of the ten second chunk
00:11:54.400
duration. You also need enough chunks of
00:11:58.333
video buffered up, such that
00:12:02.166
if the TCP download rate changes,
00:12:06.666
and it turns out that the available
00:12:08.466
capacity changes, so a chunk downloads much
00:12:10.566
slower than you would expect, that you
00:12:12.333
don't want to run out of video to play.
00:12:14.900
You want enough video buffered up,
00:12:17.666
that if something takes a long time,
00:12:20.033
you have time to drop down to
00:12:22.000
a lower rate for the next chunk,
00:12:24.000
and keep the video coming, even at
00:12:26.533
a reduced level, without it stalling.
00:12:28.100
Without you running out of video to play out.
00:12:32.033
So you’ve got to download a complete
00:12:33.733
chunk before you start playing out.
00:12:36.100
So you download and decompress a particular
00:12:38.300
chunk, and while you're doing that you're
00:12:40.700
playing the previous chunk, and everything stacks
00:12:44.633
up, the latency stacks up.
00:12:50.600
In addition to the fact that you're
00:12:52.400
just buffering up the different chunks of
00:12:55.000
video, and you need to have a
00:12:57.100
complete chunk being played while the next
00:12:59.600
one is downloading, you also get sources
00:13:02.000
of latency because of the network,
00:13:03.800
because of the way the data is transmitted over the network.
00:13:07.733
As we saw when we spoke about
00:13:09.766
TCP, the usual way TCP retransmits lost
00:13:13.900
packets, is following a triple duplicate ACK.
00:13:18.766
What we see on the slide here,
00:13:21.800
is that the data, on the sending
00:13:24.800
side, we have the user space,
00:13:26.700
where the blocks of data, the chunks
00:13:29.066
of video, are being written into a TCP connection.
00:13:32.766
And these get buffered up in the
00:13:34.633
kernel, in the operating system kernel on
00:13:36.866
the sender side, and transmitted over the network.
00:13:40.200
At some point later they arrive in
00:13:42.066
the operating system kernel on the receiver
00:13:44.400
side, and that generates the acknowledgments as
00:13:47.533
those chunks, as the TCP packets,
00:13:50.333
the chunks of video, are received.
00:13:53.433
And, if a packet gets lost,
00:13:55.533
it starts generating duplicate acknowledgments. And,
00:13:58.266
eventually, after the triple duplicate acknowledgement,
00:14:00.866
the packet will be retransmitted.
00:14:04.600
And we see that this takes time.
00:14:07.700
And if this is video, and the
00:14:09.666
packets are being sent at a constant
00:14:11.333
rate, we see that it takes time
00:14:13.833
to send four packets, the lost packet
00:14:16.633
plus the three following that generate the duplicate ACKs,
00:14:20.033
before the sender notices that a packet
00:14:25.966
loss has happened. Plus, it takes one
00:14:29.000
round trip time for the acknowledgements to
00:14:31.300
get back to the sender, and for
00:14:32.900
it to retransmit the packet.
00:14:35.333
So, if a packet
00:14:38.333
has been lost, it takes four times
00:14:40.033
the packet transmission time, plus one round-trip
00:14:42.733
time, before the packet gets
00:14:44.433
retransmitted, and arrives back at the receiver.
00:14:47.766
And that adds some latency. It’s got
00:14:50.833
to add at least four packet times,
00:14:53.733
plus one round-trip time, of extra latency to
00:14:56.233
cope with a single retransmission.
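To put an indicative figure on that, assuming 1500 byte packets, a 5 Mb/s video, and a 50 ms round-trip time:

packet_bytes = 1500
video_bps    = 5_000_000   # assumed video rate
rtt_s        = 0.050       # assumed round-trip time

packet_time = 8 * packet_bytes / video_bps   # time to send one packet
extra = 4 * packet_time + rtt_s              # lost packet + 3 dup-ACK packets + 1 RTT
print("extra buffering to ride out one retransmission:", round(extra * 1000, 1), "ms")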
00:14:59.133
And, if the network's unreliable, such that
00:15:01.366
more than one packet is likely to
00:15:02.933
be lost, you need to add in
00:15:04.300
more buffering time, add in additional latency,
00:15:07.000
to allow the packets to arrive,
00:15:09.133
such that they can be given to
00:15:10.933
the receiver without disrupting the timing.
00:15:14.133
So you need to add some latency
00:15:16.433
to compensate for the retransmissions that TCP
00:15:18.933
might be causing, so that you can
00:15:21.633
keep receiving data smoothly while accounting for
00:15:24.866
the retransmission times.
00:15:31.166
In addition, there’s some latency due to
00:15:33.800
the size of the chunks of video.
00:15:36.633
Each chunk has to be independently decodable,
00:15:39.466
because you're changing the
00:15:43.500
compression, potentially changing the compression level,
00:15:45.866
at each chunk. So each one can't
00:15:48.466
be based on the previous one.
00:15:49.633
They all have to start from scratch
00:15:52.700
at the beginning of each chunk,
00:15:54.333
because you don't know what version came before.
00:15:58.766
And, if you look at how video
00:16:00.666
compression works, it's all based on predicting.
00:16:03.033
You send initial frames, what are called
00:16:04.966
I-frames, index frames,
00:16:06.800
which give you a complete frame of
00:16:08.700
video. And then they predict the next
00:16:11.333
few frames based on
00:16:13.600
that. So, at the start of a
00:16:15.733
scene, you’ll send an index frame,
00:16:18.800
and then, for the rest of the
00:16:20.133
scene, each of the successive frames will
00:16:22.633
just include the difference from the previous
00:16:25.000
frame, from the previous index frame.
00:16:30.266
And how often you send index frames
00:16:35.833
affects the encoding rate, because the
00:16:37.900
index frames are big.
00:16:39.833
They're sending a complete frame of video,
00:16:41.900
whereas the predicted frames, in between,
00:16:43.933
are much smaller. The index frames are
00:16:46.033
often, maybe, 20 times the size of
00:16:47.900
the predicted frames.
00:16:49.933
And, depending on how you encode the chunks:
00:16:52.733
because each
00:16:55.166
chunk of video has to start with
00:16:57.266
an index frame, it has to start
00:16:58.866
with a complete frame, the shorter each
00:17:01.200
chunk is, the fewer P-frames that can
00:17:04.166
be sent before the start of the
00:17:05.900
next chunk and the next index frame.
00:17:09.233
So you have this trade-off. You can
00:17:12.300
make the chunks of video small,
00:17:14.866
and that reduces the latency in the
00:17:17.200
system, but it means you have more
00:17:19.200
frequent index frames. And the more frequent index frames
00:17:24.300
need more data, because the index frames
00:17:27.800
are large compared to the predicted frames,
00:17:30.000
so the encoding efficiency goes down,
00:17:32.633
and the overheads go up.
00:17:35.633
And this tends to enforce a lower
00:17:38.533
bound of around two seconds on the chunk duration, before the
00:17:41.066
overhead of sending the frequent index frames
00:17:44.066
gets to be excessive.
00:17:46.733
So chunk sizes tend to be more
00:17:49.866
than that, tend to be 5,
00:17:51.000
10, seconds, just to keep the overheads down,
00:17:53.166
to keep the compression efficiency, the video
00:17:55.266
compression efficiency, reasonable.
00:17:57.966
And that's the main source of latency in these applications.
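A back-of-the-envelope version of that trade-off, assuming 30 frames per second and index frames that cost around 20 times as much as predicted frames:

fps = 30
i_to_p = 20   # assumed size of an index frame, measured in predicted-frame units

def relative_cost(chunk_seconds):
    frames = fps * chunk_seconds
    # one index frame plus (frames - 1) predicted frames, versus all predicted frames
    return (i_to_p + (frames - 1)) / frames

for secs in (1, 2, 5, 10):
    print(secs, "second chunks:", round(relative_cost(secs), 2), "x the all-predicted-frame size")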
00:18:06.800
So, this clearly works. Applications like Netflix,
00:18:12.266
like the iPlayer, clearly work.
00:18:15.366
But they have relatively high latency.
00:18:18.100
Because you're fetching video chunk-by-chunk, and each
00:18:22.366
chunk is five or ten seconds worth of video,
00:18:25.233
you have a five or ten second wait
00:18:30.033
when you start the video playing,
00:18:31.966
before it actually starts playing. And it's
00:18:36.000
difficult to reduce that latency, because of
00:18:39.500
the compression efficiency, because of the overheads.
00:18:44.266
And it would be desirable, though, to reduce that latency.
00:18:50.433
It would be desirable for people who
00:18:53.466
watch sport, because the latency for the
00:18:56.766
streaming applications is higher than it is
00:18:59.133
for broadcast TV,
00:19:00.666
so, if you're watching live sports,
00:19:02.300
you tend to see the action 5,
00:19:05.666
or 10, or 20, seconds behind broadcast
00:19:08.466
TV, and that can be problematic.
00:19:14.400
It’s also a problem for people trying
00:19:17.466
to offer interactive applications, and augmented reality,
00:19:20.300
where they'd like the latency to be
00:19:22.800
low enough that you can interact with
00:19:25.066
the content, and maybe dynamically change the
00:19:27.633
view point, or interact with parts of the video.
00:19:31.766
So people are looking to build lower-latency
00:19:34.933
streaming video.
00:19:37.766
I think there's two ways in which this is likely to happen.
00:19:43.366
The first is that we might go back to using RTP.
00:19:47.700
We might go back to using something
00:19:51.566
like WebRTC to control the setup,
00:19:54.466
and build streaming video using essentially the
00:19:57.200
same platform we use for interactive video conferencing,
00:20:00.800
but sending in one direction only.
00:20:04.800
And this is possible today.
00:20:07.733
The browsers support
00:20:09.700
WebRTC, and there's nothing that says you
00:20:13.533
have to transmit as well as receiving
00:20:15.566
in a WebRTC session. So you could
00:20:17.766
build an application that uses WebRTC to stream
00:20:20.033
video to the browser.
00:20:22.666
It would have much lower latency than
00:20:25.133
the DASH-based, dynamic adaptive streaming over HTTP
00:20:28.900
based, approach that people use today.
00:20:31.833
But it's not clear that it would
00:20:33.500
play well with the content distribution networks.
00:20:35.733
It’s not clear that the CDNs would support RTP streaming.
00:20:39.600
But if they did, if the CDNs
00:20:42.200
could be persuaded to support RTP,
00:20:44.266
this would be a good way of getting lower latency.
00:20:48.700
I think what's perhaps more likely,
00:20:50.700
though, is that we will start to
00:20:53.266
see the CDNs switching to support QUIC,
00:20:55.966
because it gives better performance
00:20:58.600
for web traffic in general,
00:21:01.000
and then people start to switch to
00:21:03.633
delivering the streaming video over QUIC.
00:21:07.400
And, because QUIC is a user space
00:21:10.333
stack, it's easier to deploy interesting transport
00:21:15.900
protocol innovations. Because they're done by just
00:21:18.000
deploying a new application, you don't have
00:21:19.800
to change the operating system kernel.
00:16:21.566
If you want to change how TCP
00:21:24.666
works, you have to change the operating system.
00:21:26.600
Whereas if you want to change the way QUIC works,
00:21:28.733
you just have to change the application or the library
00:21:30.933
that's providing QUIC.
00:21:32.500
So I think it's likely that we
00:21:33.833
will see CDNs switching to use HTTP/3,
00:21:38.200
and HTTP over QUIC,
00:21:40.133
and I think it's likely that they'll
00:21:41.900
also switch to delivering video over QUIC.
00:21:43.866
And I think that gives much more
00:21:45.466
flexibility to change the way QUIC works,
00:21:47.733
to optimise it to support low-latency video.
00:21:51.600
And we’re already, I think, starting to
00:21:54.033
see that happening. YouTube is already delivering
00:21:57.500
video over QUIC.
00:21:59.000
There are people talking about datagram extensions
00:22:01.966
to QUIC in the IETF to get
00:22:04.033
low latency, so I think we’re likely
00:22:07.133
to see the video switching to be
00:22:08.866
delivered by the CDNs using QUIC,
00:22:11.333
but with some QUIC extension to provide lower latency.
00:22:18.600
So that's all I want to say
00:22:20.466
about real-time and interactive applications.
00:22:25.366
The real-time applications have latency bounds.
00:22:29.266
They may be strict latency bounds,
00:22:31.833
150 milliseconds for an interactive application or
00:22:36.566
a video conference, or they may be
00:22:38.733
quite relaxed latency bounds, 10s of seconds
00:22:41.766
for streaming video currently.
00:22:44.733
The interactive applications run over WebRTC,
00:22:48.300
which is the Real-time Transport Protocol,
00:22:51.266
RTP, for the media transport, with a
00:22:54.566
web-based signalling protocol put on top of
00:22:56.700
it. Or they use older standards,
00:22:59.633
such as SIP,
00:23:01.433
the way mobile phones or
00:23:03.633
the telephone network works, these days,
00:23:06.200
to set up the RTP flows.
00:23:09.533
Streaming applications, because they want to fit
00:23:12.266
with the content distribution network infrastructure,
00:23:15.866
because the amount of video traffic is
00:23:18.333
so great that they need the
00:23:20.933
scaling advantages that come with content distribution networks,
00:23:23.900
use an approach known as DASH,
00:23:26.100
Dynamic Adaptive Streaming over HTTP,
00:23:29.066
and deliver the video over HTTP as
00:23:31.466
a series of chunks, with a manifest,
00:23:33.600
and they let the browser choose which
00:23:36.133
chunk sizes to fetch, and use that
00:23:39.033
as a coarse-grained method of adaptation.
00:23:41.866
And this is very scalable, and it
00:23:45.100
makes very good use of the CDN
00:23:47.666
infrastructure to scale out, but it's relatively
00:23:50.933
high latency,
00:23:52.966
and relatively high overhead. And I think
00:23:56.666
the interesting challenge, in the future,
00:23:58.366
is to combine these two approaches,
00:24:00.566
to try and get the scaling benefits
00:24:02.566
of content distribution networks,
00:24:04.500
and the low-latency benefits of protocols like
00:24:07.233
RTP, and to try and bring this
00:24:09.333
into the video streaming world.
Discussion
Lecture 7 discussed real-time and interactive applications. It
reviewed the definition of real-time traffic, and the differing
deadlines and latency requirements for streaming and interactive
applications, the differences in elasticity of traffic demand in
real-time and non-real-time applications, quality of service, and
quality of experience.
Considering interactive conferencing applications, the lecture reviewed
the structure of such applications and briefly described the standard
Internet multimedia conferencing protocol stack. It outlined the
features RTP provides for secure delivery of real-time media, and
highlighted the importance of timing recovery, application level framing,
loss concealment, and forward error correction. It briefly mentioned the
WebRTC peer-to-peer data channel. And it discussed the need for
signalling protocols to set up interactive calls, and briefly outlined how
SIP and WebRTC use SDP to negotiate calls.
Considering streaming applications, the lecture highlighted the role of
content distribution networks to explain why media is delivered over
HTTP. It explained chunked media delivery and the Dynamic Adaptive
Streaming over HTTP (DASH) standard for streaming video, showing how this
adapts the sending rate and how it relates to TCP congestion control. The
lecture also mentioned some sources of latency for DASH-style systems.
Discussion will focus on the essential differences between real-time and
non-real-time applications, timing recovery, and media transport in both
interactive and streaming applications.