csperkins.org

Networked Systems H (2022-2023)

Lecture 7: Real-time and Interactive Applications

Lecture 7 discusses real-time and interactive applications. It talks about the requirements and constraints for running real-time traffic on the Internet, and discusses how interactive video conferencing and streaming video applications are implemented.

Part 1: Real-time Media Over The Internet

This first part of the lecture discusses real-time media running over the Internet. It outlines what real-time traffic is, and what the requirements and constraints are when running real-time applications over the Internet. It discusses the implications of non-elastic traffic, the effects of packet loss, and the differences between quality of service and quality of experience.

 

00:00:00.300 In this lecture I want to move

00:00:01.733 on from talking about congestion control,

00:00:03.766 and talk instead about real-time and interactive

00:00:06.400 applications.

 

00:00:08.900 In this first part, I’ll start by

00:00:11.066 talking about real-time applications in the Internet.

 

00:00:13.300 I’ll talk about what is real-time traffic,

00:00:15.733 some of the requirements and constraints of

00:00:17.700 that traffic, and how we go about

00:00:20.366 ensuring a good quality of experience,

00:00:22.300 a good quality of service, for these applications.

 

00:00:25.766 In the later parts, I'll talk about

00:00:27.466 interactive applications, I’ll talk about the conferencing

00:00:30.633 architecture, how we go about building a

00:00:33.533 signalling system to

00:00:35.833 locate the person you wish to have

00:00:37.900 a call with, how we describe conferencing

00:00:40.233 sessions, and how we go about transmitting

00:00:42.600 real-time multimedia traffic over the network.

 

00:00:44.933 And then, in the final part,

00:00:46.366 I'll move on and talk about streaming

00:00:48.266 applications, and talk about the HTTP adaptive

00:00:51.100 streaming protocols that are used for video

00:00:53.966 on demand applications, such as the iPlayer or Netflix.

 

00:00:59.700 To start with, though, I want to

00:01:01.533 talk about real-time media over the Internet.

00:01:03.366 I’ll say a little bit about what

00:01:05.166 is real-time traffic,

00:01:06.300 what are the requirements and constraints in

00:01:08.566 order to successfully run real-time traffic,

00:01:10.733 real-time applications over the Internet, and some

00:01:13.800 of the issues around quality of service

00:01:15.633 and user experience, and how to make

00:01:17.233 sure we get a good experience for

00:01:20.066 users of these applications.

 

00:01:25.433 So, there's actually a long history of

00:01:28.466 running real-time traffic over the Internet.

 

00:01:32.000 And this includes applications like telephony and

00:01:35.366 voice over IP. It includes Internet radio

00:01:39.100 and streaming audio applications. It includes video

00:01:42.466 conferencing applications such as Zoom.

 

00:01:46.200 It includes streaming TV, streaming video applications,

00:01:50.500 such as the iPlayer and Netflix.

00:01:53.233 But it also includes gaming,

00:01:55.466 and sensor network applications,

00:01:57.800 and various industrial control systems.

 

00:02:01.166 And these experiments go back a surprisingly long way.

 

00:02:05.000 The earliest RFC on the subject of

00:02:07.866 real-time media on the Internet is RFC741,

00:02:11.566 which dates back to the early 1970s

00:02:14.333 and described the Network Voice Protocol.

 

00:02:16.933 And this was an attempt at running

00:02:19.933 packet voice over the ARPANET, the precursor

00:02:22.700 to the Internet.

 

00:02:24.466 And there’s been a continual thread of

00:02:26.500 standards developments and experimentation and research in

00:02:29.133 this area.

 

00:02:31.366 The current set of standards, which we

00:02:33.933 use for telephony applications, for video conferencing

00:02:37.866 applications, dates back to the mid 1990s.

 

00:02:42.333 It led to a set of protocols,

00:02:44.100 such as SIP, the Session Initiation Protocol,

00:02:47.533 the Session Description Protocol, the Real-time Transport

00:02:50.866 Protocol, and so on.

 

00:02:52.900 And then there was another burst of

00:02:55.633 developments, in perhaps the mid-2000s or so,

00:02:58.400 with HTTP adaptive streaming, and that led

00:03:01.133 to standards such as the MPEG DASH

00:03:02.766 standards, and applications like Netflix and the iPlayer.

 

00:03:07.733 I think what's important, though, is to

00:03:10.466 realise that this is not new for

00:03:12.333 the network. We've seen everyone in the

 

00:03:17.300 world switch to using video conferencing,

00:03:19.000 and everyone in the world

00:03:20.766 switch to using Webex, and Teams,

00:03:23.100 and Zoom, and the like. But these

00:03:25.700 applications actually existed for many years,

00:03:28.333 and these applications have developed, and the

00:03:31.266 network has developed along with these applications,

00:03:34.333 and there's a long history of support

00:03:37.000 for real-time media in the Internet.

 

00:03:40.133 And you, occasionally, hear people saying that

00:03:42.233 the Internet was not designed for real-time

00:03:44.600 media, and we need to re-architect the

00:03:46.933 Internet to support real-time applications,

00:03:49.500 and to support future multimedia applications.

 

00:03:53.433 I think that's being somewhat disingenuous with history.

 

00:03:57.233 The Internet has developed and grown up with

00:03:59.800 multimedia applications, right from the beginning.

 

00:04:02.633 And while they've perhaps not been as

00:04:05.133 popular as some of the non-real-time

00:04:08.100 applications, there's been a continual strand of

00:04:10.100 development, and people have been using these

00:04:12.000 applications and architecting the network to support

00:04:14.533 this type of traffic, for many, many years now.

 

00:04:21.533 So what is real-time traffic? What do

00:04:24.200 we mean by real-time traffic, real-time applications?

 

00:04:27.200 Well, the defining characteristic is that the

00:04:29.600 traffic has deadlines. The system fails if

00:04:32.233 the data is not delivered by a certain time.

 

00:04:35.933 And, depending on the type of application,

00:04:38.200 depending on the type of real-time traffic,

00:04:40.300 those can be what's known as hard

00:04:41.766 deadlines or soft deadlines.

 

00:04:44.533 Now, an example of a hard deadline

00:04:46.666 might be a control system, such as

00:04:49.166 a railway signalling system, where the data

00:04:52.433 that's controlling the signals has to arrive

00:04:55.433 at the signal before the train does,

00:04:57.733 in order to change the signal appropriately.

 

00:05:01.333 Real-time multimedia applications, on the other hand,

00:05:05.600 are very much in the in the realm

00:05:07.033 of soft real-time applications,

00:05:08.900 where you have to deliver the data

00:05:10.666 by a certain deadline in order to

00:05:12.233 get smooth playback of the media.

00:05:14.366 In order to get glitch-free playback

00:05:17.733 of the audio, in order to get smooth video playback.

 

00:05:23.066 And these applications tend to have to

00:05:25.966 deliver data, perhaps every 50th of a

00:05:28.100 second for audio, and maybe 30 times

00:05:31.700 a second, or 60 times a second, to get smooth video.

 

00:05:36.733 And it's important to realise that no

00:05:38.966 system is ever 100% reliable at meeting its deadlines.

 

00:05:43.300 It's impossible to engineer a system that never

00:05:46.066 misses a deadline. So we always think about

00:05:49.033 how we can arrange these systems,

00:05:51.266 such that some appropriate proportion of the deadlines are met.

 

00:05:56.133 And what that proportion is, depends on

00:05:58.333 what system we're building.

 

00:06:01.166 If it's a railway signalling system,

00:06:03.166 we want the probability that the network

00:06:05.766 fails to deliver the message to be

00:06:08.166 low enough that it's more likely that

00:06:10.466 the train will fail, or the actual

00:06:12.466 physical signal will fail, than the probability

00:06:15.133 of the network failing to deliver the message in time.

 

00:06:19.600 If it's a video conferencing application,

00:06:21.733 or video streaming application, the risks are

00:06:25.200 obviously a lot lower, and so you

00:06:27.133 can accept a higher probability of failure.

 

00:06:29.633 Although again, it depends on what the

00:06:31.533 application’s being used for. A video conferencing

00:06:35.700 system being used

00:06:37.500 for a group of friends, just chatting,

00:06:40.833 obviously has different reliability constraints, different

00:06:44.833 degrees of strictness of its deadlines, than one

00:06:48.166 being used for remote control of a

00:06:50.566 drone, or one being used for remote surgery, for example.

 

00:06:57.033 And the different systems can have different

00:06:59.900 types of deadline.

 

00:07:01.866 It may be that various types of

00:07:04.000 data have to be delivered before a certain time.

 

00:07:07.233 You have to deliver the control information

00:07:11.033 to the railway signal before the train

00:07:12.766 gets there. So you've got an absolute deadline on the data.

 

00:07:17.566 Or it may be that the data has

00:07:19.833 to be delivered periodically, relative to the

00:07:22.633 previous deadline. The video frames have to

00:07:25.300 be delivered every 30th of a second,

00:07:27.933 or every 60th of a second.

 

00:07:30.300 And different applications have different constraints.

00:07:33.000 Different bounds on the latency, on the

00:07:36.133 absolute deadline. But also on the relative

00:07:38.066 deadline, on the predictability of the timing.

 

00:07:42.466 It’s important to remember that we're not

00:07:44.933 necessarily talking high performance for these applications.

 

00:07:49.033 If we're building a phone system that

00:07:51.766 runs over the Internet, for example,

00:07:53.633 the amount of data we're sending is

00:07:55.566 probably only a few kilobits per second.

 

00:07:58.300 But it requires predictable timing.

 

00:08:01.133 The packets containing the speech data have

00:08:04.033 to be delivered with

00:08:06.466 at least approximately predictable, approximately equal,

00:08:10.066 spacing, in order that we can correct

00:08:12.800 the timing and play out the speech smoothly.

 

00:08:17.333 And yes, some types of applications are

00:08:19.633 quite high bandwidth. If we're trying to deliver

00:08:23.200 studio quality movies, or if we're trying

00:08:25.666 to deliver holographic conferencing, then we need

00:08:28.066 tens, or possibly hundreds, of megabits.

 

00:08:30.500 But they're not necessarily high performance.

00:08:32.933 The key thing is predictability.

 

00:08:38.166 So what are the requirements for these applications?

 

00:08:42.100 Well, to a large extent, it depends

00:08:44.400 on whether you're building a streaming application

00:08:46.400 or an interactive application.

 

00:08:49.800 For video-on-demand applications, like Netflix or YouTube

00:08:53.766 or the iPlayer, for example, there's not

00:08:56.400 really any absolute deadline, in most cases.

 

00:08:59.866 If you're watching a movie, it's okay

00:09:02.400 if it takes 5, 10, 20 seconds

00:09:04.966 to start playing, after you click the play button,

00:09:08.233 provided the playback is smooth once it has started.

 

00:09:13.033 And maybe if it's a short thing,

00:09:14.433 maybe it's a YouTube video that's only

00:09:16.200 a couple of minutes, then you want it to start quicker.

 

00:09:19.100 But again, it doesn't have to start

00:09:21.600 within milliseconds of you pressing the play

00:09:23.500 button. A second or two of latency

00:09:25.933 is acceptable, provided the playback is smooth

00:09:29.500 once it starts.

 

00:09:31.866 Now, obviously live applications, the deadlines may

00:09:34.633 be different. Clearly if you're watching

00:09:37.900 a live sporting event on YouTube or

00:09:42.200 the iPlayer, for example, you don't

00:09:44.200 want it to be too far behind

00:09:45.566 the same event being watched on

00:09:47.133 broadcast TV. But, for these applications,

00:09:50.333 typically it's the relative deadlines, and smooth

00:09:53.266 playback once the application has started,

00:09:55.500 rather than the absolute deadline that matters.

 

00:09:59.666 The amount of bits per second it

00:10:02.233 needs depends to a large extent on the quality.

 

00:10:06.166 And, obviously, higher quality is better,

00:10:07.866 a higher bit rate is better.

 

00:10:09.966 But, to some extent, there's a limit

00:10:11.766 on this. And it's a limit depending

00:10:14.033 on the camera, on the resolution of

00:10:16.033 the camera, and the frame rate of

00:10:17.700 the camera, and the size of the display, and so on.

 

00:10:21.833 And you don't necessarily need many tens

00:10:25.433 or hundreds of megabits. You can get

00:10:28.266 very good quality video on single digit

00:10:31.333 numbers of megabits per second. And even

00:10:35.000 production quality, studio quality, is only hundreds

00:10:38.933 of megabits per second. So there’s an

00:10:40.866 upper bound on the rate at

00:10:43.100 which these applications

00:10:45.500 can typically send, where you hit the

00:10:47.900 limits of the capture device, or the

00:10:49.833 limits of the display device.

 

00:10:52.500 And, quite often, for a lot of these applications,

00:10:54.800 predictability matters more than absolute quality.

 

00:10:59.100 It's often less annoying to have

00:11:01.500 a movie, which is a consistent quality,

00:11:04.333 than a movie which is occasionally very

00:11:06.566 good quality, but keeps dropping down to

00:11:09.266 a lower resolution. So predictability is often

00:11:11.900 what's critical.

 

00:11:14.866 And, for a given bit rate,

00:11:16.566 you're also trading off between frame rate

00:11:18.366 and quality. Do you want smooth motion,

00:11:21.333 or do you want very fine detail?

 

00:11:24.633 And, if you want both smooth motion

00:11:28.533 and fine detail, you have to increase

00:11:30.633 the rate. But you can trade-off between

00:11:32.266 them, at a given bit rate, for a different quality level.

 

00:11:38.166 For interactive applications, the requirements are a

00:11:40.533 bit different. They depend very much on

00:11:42.566 human perception, and the requirements to be

00:11:45.266 able to have a smooth conversation.

 

00:11:48.933 For phone calls, for video conferencing applications,

00:11:53.733 people have been doing studies of this

00:11:56.333 sort of thing for quite a while.

 

00:11:58.266 The typical bounds you hear expressed are

00:12:01.266 one-way mouth-to-ear delay, so the delay from

00:12:05.833 me talking, to it going

00:12:08.866 through the air to the microphone,

00:12:10.933 being captured, compressed, transmitted over the network,

00:12:13.933 decompressed, played-out, back from the speakers to

00:12:16.900 your ear, should be no more than about 150

00:12:20.066 milliseconds. And, if it gets more than

00:12:22.533 that, it starts getting a bit awkward

00:12:24.466 for the conversations. People start talking over

00:12:26.833 each other, and it gets to be

00:12:28.366 a bit difficult for a conversation.

 

00:12:30.733 And the ITU-T Recommendation G.114 talks about

00:12:34.733 this, and about the constraints there, in a lot of detail.

 

00:12:40.000 And, in terms of lip sync,

00:12:43.266 people start noticing if the audio is

00:12:46.400 more than about 15 milliseconds ahead,

00:12:49.033 or more than about 45 milliseconds behind

00:12:51.266 the video. And it seems that people

00:12:53.566 notice more often if the audio is

00:12:55.200 ahead of the video, than if it's behind the video.

 

00:12:58.166 So this gives quite strict bounds for

00:13:01.600 overall latency across the network, and for

00:13:04.500 the variation in latency between audio and video streams.

 

00:13:09.166 And, obviously, this depends what you're doing.

 

00:13:11.866 If you're having an interactive conversation,

00:13:14.366 the bounds are tighter than if it's

00:13:16.533 more of a lecture style, where it's

00:13:18.266 mostly unidirectional, with more structured pauses and

00:13:22.866 more structured questioning. That type of application

00:13:25.400 can tolerate higher latency.

 

00:13:28.100 Equally, if you're trying to do,

00:13:30.766 for example, a distributed music performance,

00:13:33.633 then you need much lower,

00:13:36.000 much lower latency.

 

00:13:38.300 And, if you think about something like

00:13:40.300 an orchestra, and you measure the size

00:13:42.300 of the orchestra, and you think about

00:13:44.300 the speed of sound, you get about

00:13:46.300 15 milliseconds for the sound to go

00:13:48.300 from one side of the orchestra to another.
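
A rough sanity check on these numbers (a sketch assuming the speed of sound in air is about 343 m/s; the distances below are illustrative, not from the lecture):

```python
SPEED_OF_SOUND = 343.0            # metres per second, in air, approximately

def acoustic_delay_ms(distance_m):
    # Time for sound to travel between two musicians distance_m apart.
    return distance_m / SPEED_OF_SOUND * 1000

for d in (5, 10, 17):             # example separations in metres
    print(d, "m ->", round(acoustic_delay_ms(d), 1), "ms")
```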

 

00:13:51.033 So, that sort of level of latency

00:13:54.833 is clearly acceptable, but once it gets

00:13:57.000 more than 20 or 30 milliseconds,

00:13:59.300 it gets very difficult for people to

00:14:01.400 play in a synchronised way.

 

00:14:04.266 And if you've seen, if you’ve ever

00:14:08.533 tried to play music over a Zoom

00:14:11.000 call, you'll realise it just doesn't work,

00:14:13.000 because the latency is too high for that,

 

00:14:16.200 if you're trying to

00:14:18.566 play music collaboratively on a video conference.

 

00:14:25.500 So that gives you some bounds for latency.

 

00:14:29.566 What we saw in some of the

00:14:32.133 previous lectures, is that the network is

00:14:33.900 very much a best effort network,

00:14:35.700 and it doesn't guarantee the timing.

 

00:14:37.566 The amount of latency for data to

00:14:41.333 traverse the network very much depends on

00:14:44.700 the propagation delay of the path,

00:14:48.100 and the amount of queuing, and on

00:14:49.933 the path taken, and it's not predictable at all.

 

00:14:53.733 If we look at the figure on

00:14:56.233 the left, here, it's showing the variation

00:14:58.800 in round trip time for a particular

00:15:00.500 path. And we see that most of

00:15:02.133 it is bundled up, and there’s a

00:15:04.366 fairly consistent bound, but there are occasional

00:15:06.866 spikes where the packets take a much longer time to arrive.

 

00:15:11.866 And in some networks these effects can

00:15:14.833 be quite significant, they can take quite

00:15:16.800 a long time for data to arrive.

 

00:15:20.166 The consequence of all this, is that

00:15:22.300 real-time applications need to be loss tolerant.

 

00:15:25.166 If you're building an application to be

00:15:27.466 reliable, it has to retransmit data,

00:15:29.733 and that may or may not arrive

00:15:31.766 in time. So you want to build

00:15:33.466 it to be unreliable, and not to

00:15:35.633 necessarily retransmit the data.

 

00:15:37.800 You also want it to be able

00:15:39.466 to cope with the fact that some

00:15:40.766 packets may be delayed, and be able

00:15:42.800 to proceed even if those packets arrive too late.

 

00:15:45.600 So it needs to be able to

00:15:47.266 compensate for, to tolerate, loss, whether that's

00:15:50.100 just data which is never going to

00:15:52.166 arrive, or data that's just going to arrive late.

 

00:15:55.700 And, obviously, there's a bound on how

00:15:57.866 much loss you can conceal, how much

00:16:01.100 loss you can tolerate before the quality goes down.

 

00:16:04.366 And, the challenge in building these applications is to,

00:16:09.500 partially, engineer the network such that it

00:16:12.266 doesn't lose many packets, such that the loss

00:16:14.433 rate, the timing variation, is low enough

00:16:16.633 that the application is going to work.

 

00:16:18.733 But, also, it’s in building the application

00:16:21.900 to be tolerant to the loss,

00:16:23.266 in being able to conceal the effects of lost packets.

 

00:16:31.166 The real-time nature of the traffic also

00:16:33.933 affects the way congestion control works,

00:16:36.566 it affects the way data is delivered across the network.

 

00:16:40.800 As we saw in some of the

00:16:42.600 previous lectures, when we were talking about

00:16:44.533 TCP congestion control,

00:16:46.600 congestion control adapts the speed of transmission

00:16:49.466 to match the available capacity over the network.

 

00:16:53.033 If the network has more capacity, it sends faster.

00:16:56.233 If the network gets overloaded, it sends slower.

 

00:16:59.800 And the transfers are elastic.

 

00:17:02.666 If you're downloading a web page, if you're downloading

00:17:06.133 a large file, faster is better,

00:17:08.000 but it doesn't really matter what rate

00:17:10.433 the congestion control will pick.

00:17:12.433 You want it to come down as fast as

00:17:14.233 it can, and the application can adapt.

 

00:17:18.233 Real-time traffic is much less elastic.

 

00:17:22.000 It’s got a minimum rate, there’s a

00:17:24.266 certain quality level, a certain bit rate,

00:17:26.700 below which the media is just unintelligible.

 

00:17:29.600 If you're transmitting speech, you need a

00:17:31.933 certain number of kilobits per second.

00:17:33.933 Otherwise, what comes out is just not intelligible speech.

 

00:17:37.500 If you're sending video, you need a

00:17:39.733 certain bit rate, otherwise you can't get

00:17:41.733 full motion video over it; the quality

00:17:44.200 is just too low, the frame rate

00:17:46.066 is just too low, and it's no longer video.

 

00:17:49.366 Similarly, though, these applications have a maximum rate.

 

00:17:54.166 If you're sending speech data, if you're

00:17:56.400 sending music, it depends on the capture

00:17:58.733 rate, the sampling rate.

 

00:18:01.100 And, even for the highest quality

00:18:03.833 audio, you're probably not looking at more

00:18:06.066 than a megabit, a couple of megabits,

00:18:08.866 for CD quality, surround sound, media.

 

00:18:12.566 And again, for video, it depends on

00:18:14.533 the type of camera, the frame rates,

00:18:17.333 the resolution, and so on. Again,

00:18:20.833 a small number of megabits, tens of

00:18:23.800 megabits, in the most extreme cases hundreds

00:18:26.066 of megabits, and you get an upper bound on the sending rate.

 

00:18:31.833 So, real-time applications can't use

00:18:34.833 infinite amounts of traffic.

 

00:18:37.533 Unlike TCP, they're constrained by the rate

00:18:41.433 at which the media is captured.

00:18:43.500 But also, they can't go arbitrarily slowly.

 

00:18:46.466 This affects the way we have to

00:18:48.133 send that data, because we have less

00:18:49.766 flexibility in the rate at which these

00:18:51.300 applications can send.

 

00:18:56.866 And we need to think to what extent

00:19:00.066 it's possible, or desirable, to reserve capacity

00:19:03.666 for these applications.

 

00:19:08.200 There are certainly ways one can engineer

00:19:10.900 a network, such that it guarantees that

00:19:13.500 a certain amount of data is available.

00:19:16.533 Such that it guarantees that, for example,

00:19:19.000 a five megabit per second

00:19:22.333 channel is available to deliver video.

 

00:19:26.900 And, if the application is very critical,

00:19:29.100 maybe that makes sense.

 

00:19:31.000 If you're doing remote surgery, you probably

00:19:34.633 do want to guarantee the capacity is

00:19:36.733 there for the video.

 

00:19:38.766 But, for a lot of applications,

00:19:40.466 it's not clear it’s needed.

 

00:19:42.933 So while we have protocols, such as the

00:19:45.966 Resource Reservation Protocol, RSVP,

00:19:49.700 such as the Multi-Protocol Label Switching protocol

00:19:53.666 for orchestrating link-layer networks, such as the

00:19:58.233 idea of network slicing in 5G networks,

00:20:01.133 that let us set up resource reservations.

 

00:20:05.866 But.

 

00:20:08.966 This adds complexity. It adds signalling.

 

00:20:12.933 You need to somehow signal to the

00:20:14.966 network that you need to set up

00:20:16.933 this reservation, tell it what resources the

00:20:19.100 traffic requires.

 

00:20:20.700 And, somehow, demonstrate to the network that

00:20:22.933 the sender is allowed to use those

00:20:24.733 resources, and is allowed to reserve that

00:20:26.700 capacity, and can pay for it.

 

00:20:29.433 So you need authentication, authorisation, and accounting

00:20:32.733 mechanisms, to make sure that the people

00:20:34.933 reserving those resources are actually allowed to

00:20:37.100 do so, and have paid for them.

 

00:20:41.066 And in the end, if the network

00:20:43.633 has capacity, this doesn't actually help you.

 

00:20:46.366 If the operators designed the network so

00:20:48.766 it has enough capacity for all the

00:20:50.133 traffic it's delivering, the reservation doesn't help.

 

00:20:55.400 The reservations only help when the network

00:20:57.966 doesn't have the capacity.

 

00:21:00.066 They’re a way of allowing the operator,

00:21:02.366 who hasn't invested in sufficient network resources,

00:21:04.866 to discriminate in favour of the customers

00:21:07.300 who are willing to pay extra.

 

00:21:09.800 To discriminate so that those customers who

00:21:12.033 are willing to pay can get good

00:21:13.533 quality, whereas those who don't pay extra,

00:21:16.833 just get a system which doesn't work well.

 

00:21:21.000 So, it’s not clear that resource reservations

00:21:23.533 necessarily add benefit.

 

00:21:26.933 There are certainly applications where they do.

00:21:29.366 But, for many applications, the cost of

00:21:32.066 reserving the resources to get guaranteed quality,

00:21:35.666 the cost of building the accounting system,

00:21:38.100 the complexity of building the resource reservation

00:21:40.300 system, it's often easier, and cheaper,

00:21:43.000 just to buy more capacity, such that

00:21:45.166 everything works and there's no need for reservations.

 

00:21:48.700 And this is one of those areas

00:21:50.666 where the Internet, perhaps, does things differently

00:21:52.900 to a lot of other

00:21:54.266 networks. Where the Internet is very much

00:21:56.633 best effort, with unreserved capacity.

 

00:21:59.300 And it's an area of tension,

00:22:01.166 because a lot of the network operators

00:22:03.000 would like to be able to sell

00:22:05.166 resource reservations, would like to be able

00:22:08.300 to charge you extra to guarantee that

00:22:09.933 your Zoom calls will work.

 

00:22:13.500 It’s a different model. It's not clear,

00:22:16.366 to me, whether we want a network

00:22:19.000 that provides those guarantees,

00:22:22.633 but requires charging, and authentication,

00:22:25.833 and authorisation,

00:22:26.966 and knowing who's sending what traffic,

00:22:28.733 so you can tell if they've paid

00:22:30.633 for the appropriate quality.

 

00:22:32.500 Or, whether it's better just for everyone

00:22:34.700 to be sending, and we just architect

00:22:36.900 the networks so that it's good enough

00:22:39.066 for most things, and accept occasional quality lapses.

 

00:22:47.200 And, ultimately, it comes down to what's

00:22:49.266 known as quality of experience.

 

00:22:51.900 Does the application actually meet the user's

00:22:54.266 needs? Does it allow them to communicate

00:22:56.733 effectively? Does it provide compelling entertainment? Does

00:22:59.366 it provide good enough video quality?

 

00:23:03.400 It’s very much not a one dimensional metric.

 

00:23:10.233 When you ask the user

00:23:12.733 “Does it sound good?”, you get a different

00:23:18.100 view on the quality of the music,

00:23:20.833 or the quality of the speech,

00:23:23.100 than if you ask “can you understand it?”

 

00:23:26.633 The question you ask matters. It depends

00:23:30.300 what aspect of user experience are you

00:23:32.300 evaluating. And it depends on the task

00:23:35.666 people are doing. The quality people need

00:23:38.133 for remote surgery is different to the

00:23:40.433 quality people need for a remote lecture, for example.

 

00:23:45.866 And some aspects of this user experience

00:23:48.133 you can estimate from looking at technical

00:23:50.166 metrics such as packet loss and latency.

 

00:23:53.633 And the ITU has something called the

00:23:56.133 E-model, which gives a really good estimate

00:23:59.133 of subjective speech quality, based on looking

00:24:01.533 at the latency, and the timing variation,

00:24:03.700 and the packet loss of speech data.

 

00:24:06.166 But, especially when you start talking about

00:24:08.233 video, and especially when you start talking about

00:24:12.166 particular applications, it's often very subjective,

00:24:15.066 and very task dependent. And you need

00:24:17.800 to actually build the system, try it

00:24:19.366 out, and ask people “So how well did it work?”

00:24:21.866 “Does it sound good?” “Can you understand

00:24:23.966 it?” “Did you like it?” You need

00:24:26.266 to do user trials to understand the

00:24:28.933 quality of the experience of the users.

 

00:24:34.966 So that concludes the first part.

 

00:24:37.100 I’ve spoken a bit about what is

00:24:39.200 real-time traffic, some of the requirements and

00:24:41.233 constraints to be able to run real-time

00:24:43.166 applications over the network, and some of

00:24:45.733 the issues around quality of service

00:24:47.433 and the user experience.

 

00:24:49.400 In the next part, we’ll move on

00:24:50.900 to start talking about how you build

00:24:52.766 interactive applications running over the Internet.

Part 2: Interactive Applications (data plane)

The second part discusses interactive applications. It briefly reviews the history of real-time applications running over the Internet, and the requirements on timing, data transfer rate, and reliability to be able to successfully run audio/visual conferencing applications over the network. It outlines the structure of multimedia conferencing applications, and the protocol stack used to support such applications. RTP media transport, media timing recovery, application-level framing, and forward error correction are discussed, outlining how multimedia applications are implemented.

 

00:00:00.133 In this part I'd like to talk

00:00:01.533 about interactive conferencing applications.

 

00:00:04.033 I’ll talk a little bit about what is the structure

00:00:06.266 of video conferencing systems,

00:00:07.933 some of the protocols for multimedia conferencing,

00:00:10.400 for video conferencing, and talk a bit

00:00:12.666 about how we do multimedia transport over the Internet.

 

00:00:17.466 So what do we mean by interactive conferencing applications?

 

00:00:21.366 Well I'm talking about applications such as

00:00:24.400 telephony, such as voice over IP,

00:00:27.033 and such as video conferencing.

 

00:00:29.633 These are applications like the university's telephone

00:00:32.366 system, like Skype, like Zoom or Webex

00:00:36.733 or Microsoft Teams, that we're all spending

00:00:39.000 far too much time on these days.

 

00:00:42.266 And this is an area which has

00:00:44.433 actually been developing in the Internet community

00:00:46.800 for a surprisingly long amount of time.

 

00:00:50.033 As we discussed in the first part

00:00:51.900 of the lecture, the early standards,

00:00:54.500 the early work here, date back to

00:00:57.633 the early 1970s.

 

00:00:59.800 And the first Internet RFC on this

00:01:02.000 subject, the Network Voice Protocol, was actually

00:01:04.866 published in 1976. The standards we use

00:01:09.866 today for video conferencing applications, for telephony,

00:01:13.733 for voice over IP, date from the

00:01:16.100 early- and mid-1990s initially.

 

00:01:20.266 There were a set of applications,

00:01:22.600 such as CU-SeeMe, which you see at

00:01:25.233 the bottom right of the slide here,

00:01:27.966 a set of applications called the Mbone

00:01:30.700 conferencing tools, and the picture on the

00:01:33.533 top right of the slide is an

00:01:36.200 application I was involved in developing in

00:01:38.900 the late 1990s in this space,

00:01:41.300 which prototyped a lot of these standard

00:01:43.566 protocols. They led to the development of

00:01:47.000 a set of standards, such as the

00:01:48.866 Session Description Protocol, SDP, the Session Initiation

00:01:51.866 Protocol, SIP, and the Real-time Transport Protocol,

00:01:56.233 RTP, which formed the basis of these

00:01:58.700 modern video conferencing applications.

 

00:02:02.900 These got pretty widely adopted. The ITU

00:02:07.066 adopted them as the basis for its

00:02:08.933 H.323 series of recommendations

00:02:11.333 for video conferencing systems.

00:02:13.466 A lot of commercial telephony products are

00:02:16.633 built using them. And the Third Generation

00:02:20.266 Partnership Project, 3GPP, adopted them as the

00:02:23.133 basis for the current set of mobile

00:02:25.066 telephone standards. So, if you make a

00:02:28.500 phone call, a mobile phone call,

00:02:31.000 you’re using the descendants of these standards.

 

00:02:35.333 And also, more recently, the WebRTC browser-based

00:02:39.666 conferencing system again incorporated these protocols into

00:02:43.666 the browser, building on SDP, and RTP,

00:02:47.366 and the same set of conferencing standards

00:02:49.833 which were prototyped in the tools you

00:02:52.300 see on the right of the slide.

 

00:02:58.533 Again, as we discussed in the previous

00:03:01.066 part of the lecture, if you're building interactive

00:03:03.500 conferencing applications,

00:03:05.166 you've got fairly tight bounds on latency.

 

00:03:10.366 The one-way delay, from mouth to ear,

00:03:13.900 if you want a sensible interactive conversation,

00:03:17.400 has to be no more than somewhere

00:03:19.766 around 150 milliseconds.

 

00:03:22.400 And if you're building a video conference,

00:03:24.166 you want reasonably tight lip sync between

00:03:26.500 the audio and video,

00:03:28.200 with the audio no more than around

00:03:30.966 15 milliseconds ahead of the video,

00:03:33.800 and no more than about 45 milliseconds behind.

 

00:03:37.633 Now, the good thing is that these

00:03:40.600 applications tend to degrade relatively gracefully.

 

00:03:43.966 The bounds, 150 milliseconds end-to-end latency;

00:03:49.233 the 15 milliseconds ahead, 45 milliseconds behind,

00:03:53.333 for lip sync, are not strict bounds.

 

00:03:56.333 Shorter is better, but

00:03:59.833 if the latency, if the offset,

00:04:01.500 exceeds those values, it gradually

00:04:04.966 becomes less and less usable: people start talking over

00:04:08.400 each other, people start noticing

00:04:10.966 the lack of lip-sync, but nothing

00:04:13.100 fails catastrophically. But that's the sort of

00:04:16.366 values we're looking at: end-to-end delay in

00:04:19.600 the 100 to 150 millisecond range, and audio-video

00:04:23.800 synchronised to within a few tens of milliseconds.

 

00:04:28.233 The data rates we’re sending depend,

00:04:30.833 very much, on what type of media

00:04:32.600 you're sending, and what codec, what compression

00:04:35.100 scheme you use.

 

00:04:38.133 For sending speech, the speech compression typically

00:04:42.100 takes portions of speech data that are

00:04:45.666 around 20 milliseconds in duration, about 1/50th

00:04:48.800 of a second in duration, and every

00:04:51.100 20 milliseconds, every 1/50th second, it grabs the

00:04:54.566 next chunk of audio that's been received,

00:04:57.466 compresses it, and transmits it across the network.

 

00:05:00.800 And this is decoded at the receiver,

00:05:03.433 decompressed, and played out on the same sort of timeframe.

 

00:05:08.333 The data rates depends on the quality

00:05:11.300 level you want. It's possible to send

00:05:14.066 speech with something on the order of

00:05:17.033 10-15 kilobits per second of speech data,

00:05:19.700 although it's typically sent at a

00:05:22.633 somewhat higher quality, maybe a couple of

00:05:24.933 hundred kilobits, to get high quality speech

00:05:29.933 that sounds pleasant, but it can go

00:05:34.700 to very low bit rates if necessary.
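
To put rough numbers on the framing described above (a sketch; the 20 ms frame duration comes from the lecture, the example bit rates are illustrative):

```python
FRAME_MS = 20                           # one speech frame, as described
packets_per_second = 1000 // FRAME_MS   # -> 50 packets every second

# Payload carried in each 20 ms packet at a few example bit rates.
for kbps in (15, 64, 200):
    payload_bytes = (kbps * 1000 // 8) // packets_per_second
    print(f"{kbps} kbit/s -> {payload_bytes} bytes per packet")
```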

 

00:05:39.966 And a lot of these applications vary

00:05:42.333 the quality a little, based on what's

00:05:44.466 going on. They encode higher quality when

00:05:47.333 it's clear that the person is talking,

00:05:49.400 and they send packets less often,

00:05:51.566 and encoded with lower bit rates,

00:05:53.300 when it's clear there's background noise.

 

00:05:55.600 If you're sending good quality music,

00:05:57.666 you need more bits per second than if you're sending speech.

 

00:06:02.000 For video, the frame rates, the resolution,

00:06:06.500 very much depend on the camera,

00:06:08.600 on the amount of processor time you

00:06:10.533 have available to do the compression,

00:06:12.533 whether you've got hardware accelerated video compression

00:06:15.166 or not. And on the video compression

00:06:18.533 algorithm, the video codec you're using.

 

00:06:22.166 Frame rates somewhere in the order of

00:06:24.833 25 to 60 frames per second are common.

 

00:06:28.533 Video resolution varies from postage stamp sized,

00:06:32.700 up to full screen, HD, or 4k video.

 

00:06:37.266 You can get good quality video with

00:06:40.133 codecs like H.264, at around the two

00:06:43.466 to four megabits per second range.

 

00:06:46.066 Obviously, if you're going up to

00:06:48.500 full-motion, 4k, movie encoding, you'll need higher

00:06:52.500 rates than that. But, even then,

00:06:54.766 you’re probably not looking at more than

00:06:56.433 four, eight, ten megabits per second.

 

00:07:03.166 So, what you see is that these

00:07:04.466 applications have reasonably demanding latency bounds,

00:07:08.100 and reasonably high, but not excessively high,

00:07:11.066 bit-rate bounds. Two to four megabits,

00:07:14.100 even eight megabits, is generally achievable on

00:07:17.233 most residential, home network, connections.

00:07:22.366 And 150 milliseconds end-to-end latency

00:07:25.700 is generally achievable without too much difficulty

00:07:31.566 as long as you're not trying to

00:07:33.900 go transatlantic or transpacific.

 

00:07:39.566 In terms of reliability requirements,

00:07:42.633 speech data is actually surprisingly loss tolerant.

 

00:07:46.233 It's relatively straightforward to build systems

00:07:49.733 which can conceal 10-20% random packet loss,

00:07:53.333 without any noticeable reduction in speech quality.

 

00:07:56.933 And, with the addition of forward error

00:07:59.166 correction, with error correcting codes, it’s quite

00:08:01.600 possible to build systems that work with

00:08:05.100 maybe 50% of the packets being lost.
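
As a minimal illustration of the idea behind forward error correction (a sketch using a single XOR parity packet per group, in the spirit of simple parity FEC; real systems use more sophisticated codes):

```python
def xor_parity(packets):
    # Build one parity packet over a group of equal-length data packets.
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover_single_loss(received, parity):
    # If exactly one packet in the group is missing (marked None), XORing
    # the parity with the surviving packets reconstructs it.
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        return received                 # nothing lost, or too much lost
    rebuilt = bytearray(parity)
    for p in received:
        if p is not None:
            for i, b in enumerate(p):
                rebuilt[i] ^= b
    received[missing[0]] = bytes(rebuilt)
    return received
```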

 

00:08:08.000 Bursts of packet loss are harder to

00:08:11.233 conceal, and tend to result in audible

00:08:14.266 glitches in the speech playback, but they're

00:08:17.633 relatively uncommon in the network.

 

00:08:20.033 Video packet loss is somewhat harder to conceal.

 

00:08:23.933 With streaming video applications, if you're sending

00:08:26.866 a movie, for example, you can rely

00:08:29.400 on the occasional scene changes to

00:08:31.300 reset the decoder state, and to recover

00:08:33.566 from the effects of any loss.

 

00:08:35.733 With video conferencing, there aren’t typically scene

00:08:38.633 changes, so you have to do a rolling repair,

00:08:41.566 a rolling retransmission, or some form of

00:08:44.300 forward error correction to repair the losses.

00:08:46.500 So video tends to be more sensitive

00:08:49.033 to packet loss than the audio.

 

00:08:50.866 Equally, though, people are less sensitive to

00:08:53.266 disruptions in video quality than they are

00:08:55.300 to disruptions in the audio quality.

 

00:08:59.666 So how is one of these interactive

00:09:01.400 conferencing applications structured?

 

00:09:04.366 What does the media transmission path look like?

 

00:09:08.000 Well, you start with some sort of

00:09:09.800 capture device. Maybe that's a microphone,

00:09:12.600 or maybe it's a camera, depending whether

00:09:15.000 it's an audio or a video application.

 

00:09:17.533 The media data is captured from that

00:09:19.466 device, and goes into some sort of

00:09:21.033 input buffer, frame at a time.

00:09:23.266 If it's video, it's each video frame

00:09:25.300 at a time. If it's audio,

00:09:27.200 it's frames of, typically, 20 milliseconds worth

00:09:30.066 of speech or music data at a time.

 

00:09:33.533 Each frame is taken from that input

00:09:35.866 buffer, and passed to the codec.

 

00:09:38.766 The codec compresses the frames of media,

00:09:41.233 one by one. And, if they’re too

00:09:43.333 large to fit into an individual packet,

00:09:45.500 it fragments them into multiple packets.

 

00:09:49.200 Each of those fragments of a media

00:09:52.166 frame is transmitted by putting it inside

00:09:55.466 an RTP packet, a Real-time Transport Protocol

00:09:58.533 packet, which is put inside a UDP

00:10:00.966 packet, and sent on to the network.

 

00:10:04.066 The RTP packet header adds a sequence

00:10:07.400 number, so the packets can be put

00:10:09.400 back into the right order.

 

00:10:10.900 It adds timing information, so the receiver

00:10:13.700 can reconstruct the timing accurately. And it

00:10:16.233 adds some source identification, so it knows

00:10:18.900 who's sending the media, and some payload

00:10:21.233 identification information, so it knows which compression

00:10:24.033 algorithm, which codec, was used to encode the media.

 

00:10:27.766 So the media is captured, compressed,

00:10:30.566 fragmented, packetised, and transmitted over the network.
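
A high-level sketch of that sending path (the capture, codec, packetise, and rtp_header helpers are hypothetical placeholders standing in for the stages just described, not a real API):

```python
import socket

def send_media(capture, codec, packetise, rtp_header, dest,
               frames_per_second=50, clock_rate=8000):
    # Capture one frame at a time, compress it, split it into fragments,
    # and send each fragment as an RTP packet inside a UDP packet.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = 0
    timestamp = 0
    for frame in capture():                      # e.g. 20 ms of audio
        compressed = codec.encode(frame)
        for fragment in packetise(compressed):
            sock.sendto(rtp_header(seq, timestamp) + fragment, dest)
            seq += 1
        timestamp += clock_rate // frames_per_second
```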

 

00:10:37.166 On the receiving side, the UDP packets

00:10:40.700 containing the RTP data arrive.

 

00:10:45.366 And the receiving application extracts the RTP

00:10:48.933 data from the UDP packets, and looks

00:10:51.700 at the source identification information in there.

 

00:10:54.333 And then it separates the packets out

00:10:56.266 according to who sent them.

 

00:10:58.366 For each sender,

00:11:01.266 the data goes through a channel coder,

00:11:03.733 which repairs any loss, using a forward

00:11:07.200 error correction scheme

00:11:09.466 if one was used. And we'll talk

00:11:12.066 about that later, but that's where additional

00:11:14.033 packets are sent along with the media,

00:11:15.966 to allow some sort of repair without needing retransmission.

 

00:11:18.800 Then it goes into what's called a play-out buffer.

 

00:11:22.066 The play-out buffer is enough buffering to

00:11:24.733 allow the timing, and the variation in

00:11:26.666 timing, to be reconstructed,

00:11:30.733 such that the packets are put back

00:11:33.766 into the right order, and such that

00:11:36.366 they're delivered to the codec, to the decoder,

00:11:40.100 at the right time, and with

00:11:42.866 the correct timing behaviour.

 

00:11:44.966 The decoder then decompresses the media,

00:11:49.633 conceals any remaining packet loss, corrects any

00:11:54.200 clock skew, corrects any timing problems,

00:11:57.200 mixes it together if there's more than

00:11:59.533 one person talking, and renders it out

00:12:01.300 to the user. It plays the speech

00:12:03.933 or the music out, or it puts

00:12:06.266 the video frames onto the screen.

 

00:12:12.633 So that's conceptually how these applications work.

 

00:12:15.766 What does the set of protocol standards

00:12:18.333 which are used to transport multimedia over

00:12:20.666 the Internet, look like?

 

00:12:23.533 Well, there’s a fairly complex protocol stack.

 

00:12:27.200 At its core, we have the Internet

00:12:30.066 protocols, IPv4 and IPv6, and UDP and

00:12:32.833 TCP layered above them.

 

00:12:37.000 Layered above the UDP traffic is the

00:12:40.566 media transport traffic and the associated data.

 

00:12:46.066 And what you have there is the

00:12:48.233 UDP packets, which deliver the data;

00:12:51.200 a datagram TLS layer, which negotiates the

00:12:54.633 encryption parameters;

00:12:56.400 and, above that, sit the secure RTP

 

00:13:00.100 packets, with the audio and video data

00:13:02.400 in them, for transmitting the speech and

00:13:04.966 the pictures. And you have a protocol,

00:13:08.100 known as SCTP,

00:13:10.866 layered on top of DTLS, to provide

00:13:13.266 a peer-to-peer data channel.

 

00:13:17.900 In addition to the media transport,

00:13:20.133 with RTP and SCTP sitting above DTLS,

00:13:23.666 you also have NAT traversal and path

00:13:25.900 discovery mechanisms. We spoke about these a

00:13:28.733 few lectures ago, with protocols like STUN

00:13:31.533 and TURN and ICE to help set

00:13:35.100 up peer-to-peer connections, to help discover NAT bindings.

 

00:13:39.966 You have what’s known as a session

00:13:42.233 description protocol, to describe the call being set up.

 

00:13:46.066 And this identifies the person who's trying

00:13:49.300 to establish the multimedia call, who's trying

00:13:51.800 to establish the video conference.

 

00:13:53.966 It identifies the person they want to

00:13:56.133 talk to. It describes which audio and

00:13:58.966 video compression algorithms they want to use,

00:14:01.233 which error correction mechanisms they want to

00:14:03.133 use, and so on.
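
As a purely illustrative example (the addresses, ports, and codecs below are made up), a session description for an audio and video call might look something like this:

```
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=Example call
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 51372 RTP/AVP 96
a=rtpmap:96 H264/90000
```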

 

00:14:06.433 And this is used, along with one

00:14:08.800 or more of a set of signalling

00:14:10.900 protocols, depending how the call is being set up.

 

00:14:14.566 It may be an announcement of a

00:14:17.233 broadcast session, using a protocol called the

00:14:19.800 Session Announcement Protocol, for example.

00:14:22.500 It might be a telephone call,

00:14:25.333 using the Session Initiation Protocol, SIP,

00:14:28.600 which is how the University's phone system

00:14:30.866 works, for example.

00:14:33.366 It might be a streaming video session,

00:14:35.966 using a protocol called RTSP. Or it

00:14:39.300 might be a web based video conferencing

00:14:42.700 application, such as a Zoom call, or a

00:14:46.633 Webex call, or a Microsoft Teams call,

00:14:49.433 where the negotiation runs over HTTP using a

00:14:52.700 protocol called JSEP,

00:14:54.133 the Javascript Session Establishment Protocol.

 

00:15:00.766 So let's talk a little bit about the media transport.

 

00:15:03.966 How do we actually get the audio

00:15:05.600 and video data from the sender to

00:15:07.633 the receiver, once we've captured and compressed

00:15:10.433 the data, and got it ready to transmit?

 

00:15:14.500 Well it's sent within a protocol called

00:15:16.766 the Real-time Transport Protocol, RTP.

 

00:15:20.566 RTP comprises two parts. There's a

00:15:24.633 data transfer protocol, and there's a control protocol.

 

00:15:30.166 The data transfer protocol is usually called

00:15:33.433 just RTP, the RTP data protocol,

00:15:35.966 and it carries the media data.

 

00:15:39.333 It’s structured in the form of a

00:15:40.633 set of payload formats. The payload formats

00:15:43.400 describe how you take the output of

00:15:45.233 each particular video compression algorithm, each particular

00:15:48.200 audio compression algorithm, and map it onto

00:15:50.900 a set of packets to be transmitted.

 

00:15:54.566 And it describes how

00:15:57.800 to split up a frame of video,

00:16:00.333 how to split up a sequence of

00:16:02.466 audio packets, such that each RTP packet,

00:16:06.800 each UDP packet, which arrives can be

00:16:09.766 independently decoded, even if some of the

00:16:12.333 packets have been lost. It makes sure

00:16:14.433 there's no dependencies between packets, a concept

00:16:17.200 known as application level framing.

 

00:16:20.133 And this runs over a datagram TLS

00:16:22.833 layer, which negotiates the encryption keys and

00:16:26.400 the security parameters to allow us to

00:16:28.733 encrypt those RTP packets.

 

00:16:31.400 The control protocol runs in parallel,

00:16:33.700 and provides things like Caller-ID,

00:16:36.466 reception quality statistics,

00:16:39.533 retransmission requests, and so on, in case data gets lost.

 

00:16:45.500 And there are various extensions that go

00:16:47.466 along with this, that provide things like

00:16:50.466 detailed user experience and reception quality reporting,

00:16:54.000 that provide codec control and feedback mechanisms to

00:16:57.866 detect and correct packet loss, and that

00:17:00.700 provide congestion control and perform circuit breaker

00:17:04.033 functions to stop the transmission if the

00:17:06.200 quality is too bad.

 

00:17:11.566 The RTP packets are sent inside UDP packets.

 

00:17:15.766 The diagram we see here shows the

00:17:17.933 format of the RTP packets. This is

00:17:20.000 the format of the media data,

00:17:22.033 which sits within the payload section of UDP packets.

 

00:17:26.566 And we see that it's actually a

00:17:28.066 reasonably sophisticated protocol. If we look at

00:17:31.566 the format of the packet, we see

00:17:33.233 there’s a sequence number and a timestamp to allow the

00:17:36.866 receiver to reconstruct the ordering, and reconstruct

00:17:39.533 the timing. There’s a source identifier to

00:17:42.733 identify who sent the packet, if you

00:17:44.933 have a multi-party video conference.

 

00:17:47.266 And there's some payload format identifiers,

00:17:49.500 that describe whether it contains audio or

00:17:51.766 video, what compression algorithm is used, and so on.

 

00:17:57.933 And there’s space for extension headers,

00:18:00.533 and space for padding, and the space

00:18:02.600 for payload data where the actual audio or video data goes.
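
For illustration, the 12-byte fixed part of the RTP header shown in the diagram can be built roughly as follows (a sketch following the RFC 3550 field layout; extension headers, padding, and CSRC lists are omitted):

```python
import struct

def rtp_header(seq, timestamp, ssrc, payload_type, marker=False):
    # Byte 0: version=2, padding=0, extension=0, CSRC count=0.
    byte0 = 2 << 6
    # Byte 1: marker bit plus the 7-bit payload type identifier.
    byte1 = (0x80 if marker else 0x00) | (payload_type & 0x7F)
    return struct.pack("!BBHII",
                       byte0, byte1,
                       seq & 0xFFFF,               # 16-bit sequence number
                       timestamp & 0xFFFFFFFF,     # 32-bit media timestamp
                       ssrc & 0xFFFFFFFF)          # 32-bit source identifier
```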

 

00:18:09.266 And these packets, these RTP packets,

00:18:11.700 are sent within UDP packets. And the

00:18:15.833 sender will typically send these with pretty

00:18:18.566 regular timing. If it’s audio, it generates

00:18:21.866 50 packets per second;

00:18:24.100 if it's video, it might be 25

00:18:26.400 or 30 or 60 frames per second,

00:18:28.600 but the timing tends to be quite predictable.

 

00:18:32.566 As the data traverses the network,

00:18:34.400 though, the timing is often disrupted by

00:18:37.200 the other types of traffic, the cross-traffic

00:18:39.700 within the network. If we look at

00:18:42.233 the bottom of the slide, we see

00:18:43.666 the packets arriving at the receiver,

00:18:46.500 and we see that the timing is no longer predictable.

 

00:18:50.900 Because of the other traffic in the

 

00:18:54.100 network, because it's a best effort network,

00:18:56.733 because it's a shared network,

00:18:58.433 the media data is sharing the network

00:19:01.466 with TCP traffic, with all the other

00:19:03.466 flows on the network, and so the

00:19:05.400 packets don't necessarily arrived with predictable timing.

 

00:19:11.400 One of the things the receiver has

00:19:13.966 to do, is try to reconstruct the timing.

 

00:19:18.000 And what we see on this slide,

00:19:19.933 at the top, we see the timing

00:19:21.966 of the data as it was transmitted.

 

00:19:24.333 And the example is showing audio data,

00:19:27.300 and it’s labelling talk-spurts, and a talk-spurt

00:19:29.733 will be a sentence, or a fragment

00:19:31.833 of a sentence, with a pause after it.

 

00:19:34.933 We see that the packets comprising the

00:19:37.100 speech data are transmitted with regular spacing.

 

00:19:40.566 And they pass across the network,

00:19:42.266 and at some point later they arrive at the receiver.

 

00:19:46.033 There's obviously some delay, it’s labeled as

00:19:48.600 network transit delay on the slide,

00:19:50.733 which is the time it takes the

00:19:52.133 packets to traverse the network.

 

00:19:54.800 And there will be a minimum amount

00:19:56.500 of time it takes, just based on

00:19:57.833 the propagation delay, how long it takes

00:20:00.033 the signals to work their way down

00:20:03.066 the network from the sender to the

00:20:04.700 receiver. And, on top of that,

00:20:06.633 there'll be varying amounts of queuing

00:20:08.200 delay, depending on how busy the network is.

 

00:20:11.466 And the result of that, is that

00:20:13.100 the timing is no longer regular.

00:20:14.933 Packets which were sent with regular spacing,

00:20:17.366 arrive bunched together with occasional gaps between

00:20:20.300 them. And, occasionally, they may arrive out-of-order,

00:20:24.066 or occasionally the packets may get lost entirely

 

00:20:28.133 And what the receiver does, is to

00:20:30.766 add what’s labeled as “playout buffering delay”

00:20:33.300 on this slide, to compensate for this

00:20:35.833 timing variation. To compensate for what's known

00:20:38.366 as jitter, the variation in the time

00:20:40.700 it takes the packets to transit across the network.

 

00:20:44.266 By adding a bit of buffering delay,

00:20:46.466 the receiver can allow itself time to

00:20:49.900 put all the packets back into the right order,

00:20:52.833 and to regularise the spacing. It just

 

00:20:55.633 adds enough delay to allow it to

00:20:57.600 compensate for this variation. So, by adding

00:21:00.400 a little extra delay at the receiver,

00:21:03.066 the receiver can correct for the variations in timing.

 

00:21:07.200 And, if packets are lost, it obviously

00:21:09.766 has to try and conceal that loss,

00:21:11.700 or it can try to do a

00:21:13.433 retransmission if it thinks the retransmission will

00:21:15.500 arrive in time.

 

00:21:17.133 Or, if packets arrive, and we see

00:21:19.066 the very last packet here, if the

00:21:20.400 packets arrive too late, if they're delayed

00:21:22.833 too much, then they may arrive too

00:21:24.600 late to be played out. In which

00:21:26.566 case they’re just discarded, and the gap

00:21:29.200 has to be concealed as if the packet were lost.

 

00:21:36.066 And, essentially, you can see, that if

00:21:38.300 the packets are played-out immediately they arrive,

00:21:40.366 this variation in timing would lead to

00:21:42.166 gaps, because the packets are not arriving

00:21:45.266 with consistent spacing.

 

00:21:47.400 If you delay the play-out by more

00:21:49.600 than the typical variation in the inter-arrival

00:21:52.000 time of the packets,

00:21:53.466 you can add enough buffering that once

00:21:56.466 you actually start playing out the packets,

00:21:59.000 when you start playing out the data,

00:22:00.466 you can allow smooth playback. You trade

00:22:02.933 off a little bit of extra latency for very smooth,

00:22:06.566 consistent, playback.

 

00:22:09.533 And that delay between the packets arriving,

00:22:12.266 and the media starting to play back,

00:22:16.633 that buffering delay,

00:22:18.433 partly allows you to reconstruct the timing,

00:22:22.033 and it partly gives time to decompress

00:22:24.233 the audio, decompress the video, run a

00:22:27.833 loss concealment algorithm, and potentially retransmit any

00:22:31.900 lost packets, depending on the network round-trip time.

 

00:22:37.333 And then you can schedule the packets

00:22:39.000 to be played out, and you can

00:22:40.200 play the data out smoothly.
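
A minimal sketch of that play-out calculation (the fixed jitter margin, clock rate, and class name are assumptions for illustration; real receivers adapt the buffering delay to the jitter they measure):

```python
class PlayoutBuffer:
    def __init__(self, jitter_margin=0.060, clock_rate=8000):
        self.margin = jitter_margin    # seconds of play-out buffering delay
        self.clock_rate = clock_rate   # RTP timestamp units per second
        self.base = None               # (arrival time, RTP timestamp) of first packet

    def playout_time(self, arrival_time, rtp_timestamp):
        # Map the packet's RTP timestamp onto the local clock, then add a
        # fixed buffering delay to absorb the variation in transit time.
        if self.base is None:
            self.base = (arrival_time, rtp_timestamp)
        base_arrival, base_ts = self.base
        media_offset = (rtp_timestamp - base_ts) / self.clock_rate
        return base_arrival + media_offset + self.margin

    def schedule(self, arrival_time, rtp_timestamp, now):
        t = self.playout_time(arrival_time, rtp_timestamp)
        # Packets that would play in the past arrived too late: discard
        # them and let the decoder conceal the gap, as described above.
        return t if t >= now else None
```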

 

00:22:46.900 What's critical, though, is that loss is

00:22:50.366 very much possible. The receiver has to

00:22:52.666 make the best of the packets which do arrive.

 

00:22:58.500 And a lot of effort, when building

00:23:01.600 video conferencing applications, goes into defining how

00:23:05.633 the compressed audio-visual data is formatted into

00:23:08.200 the packets.

 

00:23:10.566 And the goal is that each packet

00:23:12.300 should be independently usable.

 

00:23:14.733 It's easy to take the output of

00:23:18.533 a video compression scheme, a video codec,

00:23:20.633 and just arbitrarily put the data into packets.

 

00:23:24.066 But, if you do that, the different

00:23:28.966 packets end up depending on each other.

00:23:30.733 You can't decode a particular packet if

00:23:33.300 an earlier one was lost, because it

00:23:35.000 depends on some of the data that was in the earlier packet.

 

00:23:38.566 So a lot of the skill in

00:23:40.833 building a video conferencing application goes into

00:23:43.433 what's known as the payload format.

00:23:45.166 It goes into the structure of how

00:23:47.033 you format the output of the video compression,

00:23:50.333 and how you format the output of

00:23:52.400 the audio compression, so that for each

00:23:54.100 packet that arrives, it doesn't depend on

00:23:56.133 any data that was in a previous

00:23:58.533 packet, to the extent possible, so that

00:24:01.466 every packet that arrives can be decoded completely.

 

00:24:05.566 And there are obviously limits to this.

00:24:08.133 Most video compression schemes work by sending

00:24:11.566 a full image, and then encoding differences

00:24:14.200 to that, and that obviously means that

00:24:16.533 you depend on that previous full image,

00:24:19.533 what's known as the index frame.

00:24:21.866 And a lot of these systems build

00:24:24.700 in retransmission schemes if the index frame

00:24:27.866 gets lost, but apart from that the

00:24:29.966 packets for the predicted frames,

00:24:32.800 that are transmitted after that,

00:24:34.133 should all be independently decodable.

 

00:24:37.833 The paper shown on the right of

00:24:39.633 the slide here, “Architectural Considerations for a

00:24:42.566 New Generation of Protocols”, by David Clark

00:24:45.900 and David Tennenhouse,

00:24:47.366 talks about this approach, and talks about

00:24:49.500 this philosophy of how to encode the

00:24:51.600 data such that the packets are independently

00:24:53.800 decodable, and how to structure these types

00:24:55.766 of applications, and it's very much worth a read.

 

00:25:02.933 Obviously the packets can get lost,

00:25:05.133 and the way network applications typically deal

00:25:08.766 with lost packets is by asking for a retransmission.

 

00:25:12.333 And you can clearly do this with

00:25:14.266 a video conferencing application.

 

00:25:16.800 The problem is that retransmission takes time.

 

00:25:19.333 It takes a round-trip time for the

00:25:22.500 retransmission requests to get back from the

00:25:24.433 receiver to the sender, and for the

00:25:26.233 sender to transmit the data.

 

00:25:28.733 But for video conferencing applications, for interactive

00:25:31.200 applications, you've got quite a strict delay bound.

 

00:25:34.266 The delay bound is somewhere on the

00:25:36.233 order of 100-150 milliseconds, mouth to ear delay.

 

00:25:40.266 And that comprises the time it takes

00:25:43.100 to capture a frame of audio,

00:25:45.233 and audio frames are typically 20 milliseconds,

00:25:48.400 so you've got a 20 millisecond frame

00:25:50.533 of audio being captured.

 

00:25:52.100 And then it takes some time to

00:25:53.700 compress that frame. And then it has

00:25:55.766 to be sent across the networks,

00:25:57.033 so you’ve got the time to transit the network.

00:25:59.233 And then the time to decompress the

00:26:00.966 frame, and the time to play that

00:26:03.333 frame of audio out. And that typically

00:26:05.766 ends up being four framing durations,

00:26:08.700 plus the network time.

 

00:26:10.333 So you have 20 milliseconds of frame

00:26:13.266 data being captured. And while that's being

00:26:15.766 captured, the previous frame is being compressed,

00:26:19.100 and transmitted. And, on the receiver side,

00:26:21.233 you have one frame being

00:26:23.200 decoded, errors being concealed, and timing being

00:26:27.933 reconstructed. And then another frame being played

00:26:30.500 out. So you've got 4 frames,

00:26:32.733 80 milliseconds, plus the network time.
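
As a rough worked example of that budget, using the 20 millisecond frame size and the 100-150 millisecond mouth-to-ear target mentioned above (a sketch in Python, not a precise model):

    frame_ms = 20
    pipeline_ms = 4 * frame_ms        # capture, compress/send, decode/conceal, playout
    budget_ms = 150                   # upper end of the mouth-to-ear target
    network_allowance_ms = budget_ms - pipeline_ms
    print(network_allowance_ms)       # roughly 70 ms left for one-way network delay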

 

00:26:35.033 It doesn't leave much time to do a retransmission.

 

00:26:38.666 So retransmissions tend not to be particularly

00:26:41.700 useful in video conferencing applications, unless they're

00:26:45.500 on quite short duration network paths,

00:26:48.033 because they arrive too late to be played-out.

 

00:26:51.866 So what these applications tend to do,

00:26:54.066 is use forward error correction.

 

00:26:57.733 And the basic idea of forward error

00:26:59.600 correction is that you send additional error

00:27:01.833 correcting packets, along with the original data.

 

00:27:05.700 So, in the example on the slide,

00:27:07.900 we're sending four packets of original speech

00:27:10.800 data, original media data. And for each

00:27:13.666 of those four packets, you then send

00:27:15.433 a fifth packet, which is the forward

00:27:16.900 error correction packet.

 

00:27:19.466 So the group of four packets gets

00:27:21.800 turned into five packets for transmission.

 

00:27:25.366 And, in this example, the third of those packets gets lost.

 

00:27:30.166 And at the receiver, you take the

00:27:33.266 four of those five packets which did arrive,

00:27:37.300 and you use the error correcting data

00:27:40.366 to recover that loss without retransmitting the packet.

 

00:27:45.433 And there are lots of different ways

00:27:47.233 in which these error correcting codes can work.

 

00:27:50.833 In the simplest case, the forward error

00:27:53.100 correction packet is just the result of

00:27:54.833 running the exclusive-or, the XOR operation,

00:27:57.833 on the previous packets. So the forward

00:28:00.466 error correction packets on the slides could

00:28:02.633 be, for example, the XOR of packets

00:28:04.800 1, 2, 3, and 4.

 

00:28:07.366 In this case, on the receiver,

00:28:09.800 when it notices that packet 3 has

00:28:11.433 been lost, if it calculates the XOR

00:28:13.800 of the received packets, so if you

00:28:15.833 XOR packets 1, 2, and 4,

00:28:17.833 and the FEC packet together, what will

00:28:20.300 come out will be the original, missing packet.
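
That XOR recovery is easy to sketch. A minimal illustrative example in Python, assuming for simplicity that all packets in the group are the same length:

    def xor_packets(packets):
        """XOR a list of equal-length byte strings together."""
        out = bytearray(len(packets[0]))
        for pkt in packets:
            for i, b in enumerate(pkt):
                out[i] ^= b
        return bytes(out)

    group = [b'pkt1', b'pkt2', b'pkt3', b'pkt4']   # the four original media packets
    fec = xor_packets(group)                        # the fifth, error-correcting packet

    # Suppose the third packet is lost in transit; XOR the survivors with the
    # FEC packet and the missing packet drops out:
    received = [group[0], group[1], group[3], fec]
    recovered = xor_packets(received)
    assert recovered == group[2]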

 

00:28:25.966 And that's obviously a simple approach.

00:28:28.466 There are a lot of much more

00:28:30.666 sophisticated forward error correction schemes, which trade

00:28:33.633 off different amounts of complexity for different overheads.

 

00:28:36.900 But the idea is that you send

00:28:38.566 occasional packets, which are error correcting packets,

00:28:42.033 and that allows you to recover from

00:28:44.600 some types of loss without retransmitting the

00:28:47.700 packets, so you can recover losses more quickly.

 

00:28:55.500 And that's the summary of how we

00:28:58.266 transmit media over the Internet.

 

00:29:01.833 That data is captured, compressed, framed into

00:29:05.700 RTP packets, each of which includes a sequence number

00:29:09.766 and timing recovery information.

 

00:29:11.966 And then, when they arrive at the receiver,

00:29:13.866 the data is decompressed and buffered, the timing

00:29:16.600 is reconstructed, and the buffering is chosen

00:29:18.833 to allow the receiver to reconstruct the timing,

00:29:21.900 and then the media is played-out to the user.

 

00:29:26.033 And that comprises the media transport parts.

 

00:29:29.566 As we saw, there's also signalling protocols

00:29:33.866 and NAT traversal protocols. What I'll talk

00:29:36.833 about in the next part is,

00:29:38.333 briefly, how the signalling protocols work to

00:29:40.500 set up multimedia conferencing calls.

Part 3: Interactive Applications (Control Plane)

This part moves on from discussing real-time data transfer to discuss the control plane supporting interactive conferencing applications. It discusses the WebRTC data channel, and the supporting signalling protocols, including the SDP offer/answer exchange, SIP, and WebRTC signalling via JSEP.

00:00:00.000 In the previous part of the lecture,

00:00:01.733 I introduced interactive conferencing applications. I spoke

00:00:04.900 a bit about the architecture of those

00:00:06.766 applications, about the latency requirements, and the

00:00:10.100 structure of those applications, and began to

00:00:12.666 introduce the standard set of conferencing protocols.

 

00:00:16.066 I spoke in detail about the Real-time

00:00:18.500 Transport Protocol, and the way media data is transferred.

 

00:00:22.766 In this part of the lecture,

00:00:24.100 I want to talk briefly about two other aspects of

00:00:27.800 interactive video conferencing applications,

00:00:30.300 the data channel, and the signalling protocols.

 

00:00:35.166 In addition to sending audio visual media,

00:00:39.933 most video conferencing applications also provide some

00:00:43.700 sort of peer-to-peer data channel.

 

00:00:47.200 This is part of the WebRTC standards,

00:00:50.733 and it's also part of most of the other systems as well.

 

00:00:57.133 The goal is to provide

00:00:59.566 for applications like peer-to-peer file transfer as

00:01:03.033 part of the video conferencing tool,

00:01:05.133 to support a chat session along with

00:01:08.300 the audio and video, and to support

00:01:10.166 features like reaction emoji, the ability to

00:01:13.466 raise your hand, requests that the speaker

00:01:16.633 talk faster or slower, and so on.

 

00:01:20.900 The way this is implemented in WebRTC,

00:01:23.700 is using a protocol called SCTP running

00:01:27.533 inside a secure UDP tunnel.

 

00:01:30.566 I’m not going to talk much about SCTP.

 

00:01:33.566 SCTP is the Stream Control Transmission Protocol,

00:01:37.233 and it was a previous attempt at replacing TCP.

 

00:01:41.700 The original version of SCTP ran directly

00:01:45.233 over IP, and was pitched as a

00:01:48.666 direct replacement for TCP, running as a

00:01:51.833 peer to TCP and UDP, directly on the IP layer.

 

00:01:56.800 And it turned out this was too

00:01:58.333 difficult to deploy, so it didn't get

00:02:02.233 tremendous amounts of take-up. But, at the

00:02:04.933 point when the WebRTC standards were being

00:02:07.466 developed, it was

00:02:09.966 available, and specified, and it was deemed

00:02:12.966 relatively straightforward to move it to run

00:02:15.766 on top of UDP, to run on

00:02:17.733 top of Datagram TLS, to provide security,

00:02:20.800 as a deployable way of providing a

00:02:24.866 reliable peer-to-peer data channel.

 

00:02:29.166 And it would perhaps have been possible

00:02:31.100 to use TCP to do this,

00:02:34.066 but the belief at the time was

00:02:36.500 that NAT traversal for TCP wasn't very

00:02:40.533 reliable, and that something running over UDP

00:02:43.600 would work better for NAT traversal.

 

00:02:46.300 And I think that was the right decision.

 

00:02:50.033 And SCTP, the WebRTC data channel using

00:02:53.800 SCTP over DTLS over UDP,

00:02:57.800 provides a transparent data channel. It provides

00:03:01.600 the ability to deliver framed messages,

00:03:04.866 it supports delivering multiple sub-streams of data

00:03:07.900 over a single connection, and it supports

00:03:10.300 congestion control, retransmissions, reliability and so on.

 

00:03:16.366 And it makes it straightforward to build

00:03:18.066 peer-to-peer applications using WebRTC.

 

00:03:21.600 And it gains all the deployment advantages that

00:03:24.000 we saw with QUIC, by running over UDP.

 

00:03:28.333 You might ask why WebRTC uses

00:03:34.866 SCTP to build its data channel, rather than using QUIC?

 

00:03:41.033 And, fundamentally, that's because WebRTC predates the

00:03:43.666 development of QUIC.

 

00:03:46.966 It seems likely, now that the QUIC

00:03:49.300 standard is finished, that future versions of

00:03:51.600 WebRTC will migrate, and switch to using

00:03:54.433 QUIC, and gradually phase out the SCTP-based data channel.

 

00:03:59.466 And QUIC learned, I think, from this

00:04:02.066 experience, and is more flexible and more

00:04:04.466 highly optimised than the SCTP, DTLS, UDP stack.

 

00:04:13.666 In addition to the media transport and

00:04:16.166 data, you need some form of signalling,

00:04:18.933 and some sort of session description,

00:04:20.766 to specify how to set up a video conferencing call.

 

00:04:29.300 Video conferencing calls run peer-to-peer. The goal

00:04:32.666 of a system like Zoom, or Skype,

00:04:35.900 or any of these systems, is to

00:04:37.700 set up peer-to-peer data, where possible,

00:04:40.700 so that they can achieve the lowest possible latency.

 

00:04:45.066 They need some sort of signalling protocol

00:04:47.300 to do that. They need some sort

00:04:49.033 of protocol to convey the details of

00:04:51.866 what transport connections are to be set

00:04:54.466 up, to exchange the set of candidate

00:04:56.700 IP addresses on which they can be

00:04:58.500 reached, to set up the peer-to-peer connection.

 

00:05:01.666 They need to specify the media formats

00:05:04.333 they want to use. Is it just

00:05:06.300 audio? Or is it audio and video?

00:05:08.366 And which compression algorithms are to be

00:05:10.366 used? And they want to specify the

00:05:12.166 timing of the session, and the security

00:05:15.200 parameters, and all the other parameters.

 

00:05:20.266 A standardised way of doing that is

00:05:23.733 using a protocol called the Session Description Protocol.

 

00:05:27.533 The example on the right of the

00:05:29.133 slide is an example of an SDP,

00:05:32.300 a Session Description Protocol, description of a

00:05:35.300 simple multimedia conference.

 

00:05:40.366 The format of SDP is unpleasant.

00:05:42.533 It’s essentially a set of key-value pairs,

00:05:46.000 where the keys are all single letters,

00:05:48.533 and the values are more complex,

00:05:51.633 one key-value pair per line, with the

00:05:54.800 key and the value separated by an equals sign.

 

00:05:58.300 And, as we see in the example,

00:06:00.766 it starts with a version number,

00:06:02.666 v=0. There’s an originator line, and it

00:06:06.300 was originated by Jane Doe, who had

00:06:09.100 IP address 10.47.16.5.

 

00:06:12.866 It's a seminar about session description protocol.

00:06:15.933 It's got the email address of Jane

00:06:18.366 Doe, who set up the call,

00:06:20.566 it's got their IP address, the times

00:06:22.966 that session is active,

00:06:25.400 it's receive only, it’s broadcast so that

00:06:29.666 the listener just receives the data,

00:06:33.166 it’s sending using audio and video media,

00:06:36.366 and it specifies the ports and some

00:06:38.366 details of the video compression scheme, and so on.
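
The description being walked through here appears to be essentially the well-known example from the SDP specification (RFC 4566); it looks something along these lines (the exact numeric values are indicative):

    v=0
    o=jdoe 2890844526 2890842807 IN IP4 10.47.16.5
    s=SDP Seminar
    i=A Seminar on the session description protocol
    u=http://www.example.com/seminars/sdp.pdf
    e=j.doe@example.com (Jane Doe)
    c=IN IP4 224.2.17.12/127
    t=2873397496 2873404696
    a=recvonly
    m=audio 49170 RTP/AVP 0
    m=video 51372 RTP/AVP 99
    a=rtpmap:99 h263-1998/90000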

 

00:06:43.133 The details of the format aren't particularly

00:06:44.933 important. It’s clear that it's conveying

00:06:49.333 what the session is about, the IP

00:06:53.133 addresses, the times, the details of the

00:06:55.766 audio compression, the details of the video

00:06:57.800 compression, the port numbers to use,

00:06:59.500 and so on. And how this is

00:07:01.066 encoded isn't really important.

 

00:07:07.133 In order to set up an interactive

00:07:08.666 call, you need some sort of a negotiation.

 

00:07:11.166 You need some sort of offer to

00:07:12.500 communicate, which says this is the set

00:07:15.566 of video compression schemes, this is the

00:07:17.800 set of audio compression schemes, that the sender supports.

 

00:07:21.033 This is who is trying to call

00:07:23.133 you. This is the IP address that

00:07:24.866 they're calling you from. This is the

00:07:27.866 public key used to negotiate the

00:07:29.800 security parameters. And so on.

 

00:07:33.033 And that comprises an offer.

 

00:07:35.900 And the offer gets sent via a

00:07:38.833 signalling channel, via some out of band

00:07:42.166 signalling server, to the

00:07:45.500 responder, to the person you're trying to call.

 

00:07:49.633 The responder generates an answer, which looks

00:07:52.666 at that set of codecs the offer

00:07:55.600 specified, and picks the subset it also understands.

 

00:07:59.200 It provides the IP addresses it can

00:08:01.733 be reached at, it provides its public

00:08:04.933 keys, confirms its willingness to communicate,

00:08:07.333 and so on. And the answer flows

00:08:09.300 back to the original sender, the initiator of the call.

 

00:08:15.500 And this allows the offering party and

00:08:17.466 the answering party, the initiator and responder,

00:08:20.200 to exchange the details they need to establish the call.

 

00:08:24.066 The offer contains all the IP address

00:08:27.266 candidates that can be used with the

00:08:29.433 ICE algorithm to probe the NAT bindings.

 

00:08:31.933 The answer coming back contains the candidates

00:08:34.700 for the receiver, that allows them to

00:08:37.266 do the STUN exchange, the STUN packets,

00:08:40.000 to run the ICE algorithm that actually

00:08:41.600 sets up the peer-to-peer connection.

 

00:08:43.566 And it's also got the details of

00:08:45.266 the compression algorithms, the video codec,

00:08:47.466 the audio formats, the security parameters, and so on.

 

00:08:53.000 Unfortunately SDP, which we have ended up

00:08:56.000 using as the negotiation format, really wasn't

00:08:58.600 designed to do this. It was originally

00:09:01.633 designed as a one way announcement format

00:09:05.166 to describe video on demand sessions,

00:09:08.033 rather than as a format for negotiating

00:09:10.200 parameters. So the syntax is pretty unpleasant,

00:09:13.533 and the semantics are pretty unpleasant,

00:09:16.266 and it's somewhat complex to use in practice.

 

00:09:20.266 And this complexity wasn't really visible when

00:09:24.500 we started developing these systems,

00:09:27.633 these tools, but unfortunately it turned out

00:09:31.100 that SDP wasn't a great format here,

00:09:33.166 but it's now too entrenched

00:09:35.633 for alternatives to take off. So we’re

00:09:37.766 left with this quite unpleasant, not particularly

00:09:39.966 well-designed format. But, we use it,

00:09:42.633 and we negotiate the parameters.

 

00:09:47.666 Exactly how this is used depends on

00:09:49.833 the system you're using. There are two widely used models.

 

00:09:55.066 One is a system known as the Session Initiation Protocol.

 

00:09:59.300 And the Session Initiation Protocol, SIP,

00:10:02.866 is very widely used for telephony,

00:10:05.533 and it's widely used for stand-alone video

00:10:09.100 conferencing systems.

 

00:10:11.466 If you make a phone call using

00:10:13.166 a mobile phone, this is how the

00:10:16.933 phone locates the person you wish to

00:10:19.200 call, and sets up the call:

00:10:21.233 it uses SIP, for example.

 

00:10:25.033 And SIP relies on a set of

00:10:26.900 conferencing servers, one representing the person making

00:10:30.766 the call, and one representing the person being called.

 

00:10:34.733 And the two devices, typically mobile phones

00:10:37.500 or telephones these days, have a direct

00:10:39.933 connection to those servers, which they maintain

00:10:41.933 at all times.

 

00:10:44.233 On the sending side, when you try

00:10:47.366 to make a call, the message goes

00:10:49.466 out to the server. At that point,

00:10:52.233 there's a set of

00:10:54.166 STUN packets exchanged, and a set of

00:10:56.133 signalling messages exchanged, that allow the initiator

00:10:59.233 to find its public NAT bindings.

 

00:11:02.466 And then the message goes out to

00:11:04.100 the server, and that locates the server

00:11:07.000 for the person being called, and passes

00:11:09.200 the message back over the connection to

00:11:13.600 their server, and it eventually reaches the responder.

 

00:11:17.366 And that gives the responder the candidate

00:11:20.400 addresses, and all the connection details,

00:11:23.133 and the codec parameters, and so on,

00:11:24.900 needed for it to decide whether it

00:11:26.833 wishes to accept the call, and to

00:11:29.166 start setting up the NAT bindings.

 

00:11:32.300 And it responds, and the message goes

00:11:34.133 back through the multiple servers to the

00:11:36.233 initiator, and that completes the offer answer exchange.

 

00:11:39.533 At that point, they can start running

00:11:41.866 the ICE algorithm, discovering the NAT bindings.

 

00:11:44.700 And they've already agreed the parameters at

00:11:46.766 this point, which codecs they're using,

00:11:49.000 what public keys they're using,

00:11:50.300 and so on. And that lets them

00:11:52.066 set up a peer-to-peer connection

00:11:54.666 using the ICE algorithm, and using STUN,

00:11:57.933 to set up a peer-to-peer connection over

00:11:59.866 which the media data can flow.

 

00:12:04.066 And it's an indirect connection setup.

00:12:06.566 The data flows from initiator, to their

00:12:08.766 server, to the responder’s server, to the

00:12:11.100 responder, and then back via the server path.

 

00:12:15.466 And that indirect signalling setup allows the

00:12:18.333 direct peer-to-peer connection to be created.

 

00:12:25.433 In more modern systems, systems using the

00:12:28.966 WebRTC browser-based approach,

00:12:33.566 the trapezoid that we have in the

00:12:36.833 SIP world, with the servers representing each

00:12:40.733 of the two parties, tends to get

00:12:42.033 collapsed into a single server representing the

00:12:44.533 conferencing service.

 

00:12:47.233 And the server, in this case,

00:12:48.866 is something such as the Zoom servers,

00:12:51.633 or the Webex servers, or the Microsoft Teams servers.

 

00:12:56.033 And, it's essentially following the same pattern.

00:12:58.900 It’s just that there's now a

00:13:00.366 single conferencing server through which the call is initiated,

00:13:03.566 rather than a cross-provider arrangement, with a server

00:13:07.633 for each party.

 

00:13:10.466 And this is how web-based conferencing systems such as

00:13:14.766 Zoom, and Webex, and Teams, and the like, work.

 

00:13:19.866 You get your Javascript application, your web-based

00:13:24.000 application, sitting on top. This talks to

00:13:27.000 the WebRTC API in the browsers,

00:13:30.233 and that provides access to the session

00:13:32.666 descriptions which you can exchange with the

00:13:36.466 server over HTTP GET and POST requests

00:13:40.000 to figure out the details of

00:13:43.033 how the communication should be set up.

 

00:13:45.733 And, once you've done that, you can

00:13:47.333 fire off the data channel, and the

00:13:49.100 media transport, and establish the peer-to-peer connections.

 

00:13:54.000 So the initial signalling is exchanged via

00:13:56.133 HTTP to the web server that controls

00:13:58.433 the call. The offer-answer exchange in SDP

00:14:02.566 is exchanged with the server, and that

00:14:04.733 exchanges it with the responder, and then,

00:14:06.833 when all the parties agree to communicate,

00:14:10.566 the server sends back the session description

00:14:14.500 containing the details which the browsers need

00:14:18.133 to set up the call. And they

00:14:19.533 then established a peer-to-peer connection.

 

00:14:22.366 And the goal is to integrate the

00:14:24.166 video conferencing features into the browsers,

00:14:27.033 and to allow the server to control the call setup.

 

00:14:30.366 And, as we've seen over the course

00:14:33.933 of, I guess, the last year or

00:14:36.200 so, it actually works reasonably well.

00:14:39.666 These video conferencing applications work

00:14:42.000 reasonably well in practice.

 

00:14:47.600 So what's happening with interactive applications?

00:14:50.100 Where are things going?

 

00:14:53.166 I think there’s two ways these types

00:14:56.233 of applications are evolving.

 

00:14:59.100 One is supporting better quality, and supporting

00:15:04.166 new types of media. Obviously, over time,

00:15:08.166 the audio and the video quality,

00:15:09.966 and the frame rate, and the resolution,

00:15:11.533 has gradually been increasing, and I expect

00:15:13.733 that will continue for a while.

 

00:15:16.866 There's also people talking about running various

00:15:20.433 types of augmented reality, virtual reality,

00:15:23.933 holographic 3D conferencing, and

00:15:27.133 tactile conferencing where you transmit a sense

00:15:30.533 of touch over the network. And some

00:15:33.733 of these have perhaps stricter requirements on

00:15:36.600 latency, and stricter requirements on quality but,

00:15:39.933 as far as I can tell,

00:15:40.966 they all fit within the basic framework we've described.

 

00:15:44.266 They can all be transmitted

00:15:46.366 over UDP using either RTP,

00:15:50.900 or the data channel, or something

00:15:52.500 very like it. And they all fit

00:15:54.066 within the same basic framework, of add

00:15:57.033 a little bit of buffering to reconstruct

00:15:58.800 the timing, and graceful degradation for the media transport.

 

00:16:05.633 Currently, we have a mix of RTP

00:16:09.133 for the audio and video data,

00:16:11.433 and the SCTP-based data channel.

 

00:16:15.000 It's pretty clear, I think, that the

00:16:16.700 data channel is going to transition to

00:16:18.366 using QUIC relatively soon.

 

00:16:21.766 And there's a fair amount of active

00:16:23.833 research, and standardisation, and discussion, about whether

00:16:26.566 it makes sense to also move the

00:16:28.633 audio and video data to run over QUIC.

 

00:16:32.400 And people are building unreliable datagram extensions

00:16:35.666 to QUIC to support this, so I

00:16:37.933 think it's reasonably likely that we’ll end

00:16:39.833 up running both the audio and the

00:16:41.633 video and the data channel over

00:16:43.633 peer-to-peer QUIC connections, although the details of

00:16:46.933 how that will work are still being discussed.

 

00:16:53.700 And that's what I would say about

00:16:55.066 interactive applications. In the next part I

00:16:58.066 will move on to talk about video on

00:17:01.633 demand, and streaming applications.

Part 4: Streaming Video

The final part of the lecture discusses streaming video. It talks about HTTP Adaptive Streaming and MPEG DASH, content delivery networks, and some reasons why streaming media is delivered over HTTP. The operation of HTTP adaptive streaming protocols is discussed, and their strengths and limitations are highlighted.

 

00:00:00.466 In this last part of the lecture,

00:00:02.266 I want to talk about streaming video

00:00:04.033 and HTTP adaptive streaming.

 

00:00:08.100 So how do streaming video applications,

00:00:10.666 such as Netflix, the iPlayer, and YouTube, actually work?

 

00:00:15.100 Well, what you might expect them to

00:00:17.000 do, is use RTP, the same as

00:00:18.933 the video conferencing applications, to stream the

00:00:21.833 video over the network in a low-latency

00:00:24.700 and loss-tolerant way.

 

00:00:26.566 And, indeed, this is how streaming video,

00:00:28.833 streaming audio, applications used to work.

 

00:00:31.400 Back in the late 1990s, the most

00:00:33.966 popular application in this space was RealAudio,

00:00:36.800 and later RealPlayer when it incorporated video support.

 

00:00:40.433 This did exactly as you would expect.

00:00:43.066 It streamed the audio and the video

00:00:45.566 over RTP, and had a separate control

00:00:48.266 protocol, the Real Time Streaming Protocol,

00:00:50.800 to control the playback.

 

00:00:53.033 These days, though, most applications actually deliver

00:00:56.300 the video over HTTPS instead.

 

00:00:59.833 And as a result, they have significantly

00:01:01.933 worse performance. They have significantly higher latency,

00:01:05.466 and significantly higher startup latency.

 

00:01:09.366 The reason they do this, though,

00:01:11.400 is that by streaming video over HTTPS,

00:01:14.500 they can integrate better with

00:01:15.933 content distribution networks.

 

00:01:20.366 So what is a content distribution network?

 

00:01:23.433 A content distribution network is a service

00:01:26.066 that provides a global set of web

00:01:28.200 caches, and proxies, that you can use

00:01:30.600 to distribute your application, that you can

00:01:34.466 use to distribute the web data,

00:01:36.366 the web content, that comprises your application

00:01:39.133 or your website.

 

00:01:42.166 They're run by companies such as Akamai,

00:01:44.533 and CloudFlare, and Fastly. And these companies

00:01:47.600 run massive global sets of web proxies,

00:01:50.766 web caches. And they take over the

00:01:53.766 delivery of particular sets of content from

00:01:57.666 websites. As a website operator, you give

00:02:01.300 the files, the images, the videos,

00:02:03.500 that you wish to be hosted on

00:02:05.533 the CDN, to the CDN operator.

00:02:08.600 And they ensure that they’re cached throughout

00:02:10.900 the network, at locations close to where

00:02:12.866 your customers are.

 

00:02:14.733 And each of those files, or images,

00:02:16.700 or videos, is given a unique URL.

 

00:02:19.600 And the CDN manages the DNS resolution

00:02:23.033 for that URL, so that when you

00:02:25.200 look up the name, it returns you

00:02:27.233 an IP address that corresponds to a

00:02:29.100 proxy, or a cache, which is located

00:02:31.200 physically near to you.

 

00:02:33.666 And that server has the data on

00:02:35.533 it such that the response comes quickly,

00:02:38.166 and such that the load is balanced

00:02:39.666 around these servers, around the world.

 

00:02:43.966 And these CDNs, these content distribution networks,

00:02:46.833 are extremely effective at delivering and caching

00:02:49.700 HTTP content.

 

00:02:51.933 They support some extremely high volume applications:

00:02:56.366 game delivery services such as Steam,

00:03:00.800 applications like the Apple software update,

00:03:04.000 or the Windows software update, and massively

00:03:07.566 popular websites.

 

00:03:09.900 And they have global deployments, and they

00:03:12.100 have agreement with the overwhelming majority of

00:03:14.466 ISPs to host these caches, these proxy

00:03:17.700 servers, at the edge of the network.

00:03:19.866 So, no matter where you are in

00:03:22.000 the network, you're very near to a

00:03:23.933 content distribution node.

 

00:03:28.633 A limitation of CDNs, though, is that

00:03:30.833 they only work with HTTP-based content.

 

00:03:33.933 They’re for delivering web content. And the

00:03:36.566 entire infrastructure is based around delivering web

00:03:39.700 content over HTTP, or more typically these days

00:03:43.033 HTTPS. They don't support RTP based streaming.

 

00:03:49.600 The way streaming video is delivered,

00:03:52.900 these days, is to make use of

00:03:55.166 content distribution networks. It's delivered using HTTPS

00:03:59.200 from a CDN node.

 

00:04:03.100 The contents of a video, in a

00:04:05.666 system such as Netflix, is encoded in

00:04:08.300 multiple chunks, where each chunk comprises,

00:04:12.466 typically, around 10 seconds worth of the video data.

 

00:04:17.233 Each of the chunks is designed to

00:04:19.366 be independently decodable, and each is made

00:04:22.333 available in many different versions, at many

00:04:24.366 different quality rates, many different bandwidths.

 

00:04:29.000 A manifest file provides an index for

00:04:31.666 what chunks are available. It's an index,

00:04:35.300 which says, for the first 10 seconds of the movie

00:04:38.433 there are these six different versions available,

00:04:40.933 and this is the size of each

00:04:42.533 one, and the quality level for each

00:04:44.266 one, and this is a URL where

00:04:46.000 it can be retrieved from.

 

00:04:47.833 And the same for the next 10 seconds,

00:04:49.933 and the next 10 seconds, and so on.
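
A real manifest is typically an XML document (an MPD, in the MPEG-DASH standard), but the essential content it conveys can be sketched as a simple structure. The values below are invented for illustration:

    # Illustrative sketch of what a DASH-style manifest conveys: for each chunk
    # of the movie, a list of alternative encodings with their rate and URL.
    manifest = {
        "chunk_duration_seconds": 10,
        "chunks": [
            {   # the first 10 seconds of the movie
                "representations": [
                    {"rate_kbps": 235,  "resolution": "320x240",
                     "url": "https://cdn.example.com/movie/235k/segment1.mp4"},
                    {"rate_kbps": 1750, "resolution": "1280x720",
                     "url": "https://cdn.example.com/movie/1750k/segment1.mp4"},
                    {"rate_kbps": 5800, "resolution": "1920x1080",
                     "url": "https://cdn.example.com/movie/5800k/segment1.mp4"},
                ],
            },
            # ...one entry like this for each successive 10 seconds of the movie
        ],
    }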

 

00:04:53.233 And the way the video streaming works,

00:04:55.700 is that the client fetches the manifest,

00:04:59.666 looks at the set of chunks,

00:05:04.266 and starts downloading the chunks in turn.

 

00:05:07.333 And it uses standard HTTPS downloads to

00:05:10.166 download each of the chunks. But,

00:05:12.100 as it's doing so, it monitors how

00:15:13.933 quickly it’s successfully downloading. And, based on

00:05:17.000 that, it chooses what encoding rate to fetch next.

 

00:05:21.033 So, it starts out by fetching a

00:05:23.100 relatively low rate chunk, and measures how

00:05:26.300 quickly it downloads. Maybe it's fetching a

00:05:29.000 chunk that's encoded at 500 kilobits per second,

00:05:31.966 and it measures how fast it actually

00:05:33.733 downloads. And it sees if it's actually

00:05:36.133 managing to download the 500 kilobits per

00:05:38.733 second video faster, or slower, than 500 kilobits.

 

00:05:42.966 If it's downloading slower than real-time,

00:05:47.033 it will pick a lower quality,

00:05:48.733 a smaller chunk, for the next time.

00:05:50.900 And, if it's downloading faster than real-time,

00:05:53.266 then it will try and pick a

00:05:54.933 higher quality, a higher rate, chunk the

00:05:57.366 next time. So it can adapt the

00:05:59.533 rate at which it downloads the video

00:06:01.233 by picking a different quality setting for

00:06:03.100 each of the chunks, each of the pieces of the video.
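
That adaptation loop can be sketched as follows (illustrative only: the ladder of encoding rates and the simple pick-the-highest-sustainable-rate rule are assumptions, not the behaviour of any particular player):

    # After each chunk downloads, measure the achieved throughput and choose
    # the highest encoding rate the connection appears able to sustain.
    RATES_KBPS = [235, 375, 560, 750, 1050, 1750, 2350, 3000, 4300, 5800]

    def next_rate(chunk_bytes, download_seconds):
        throughput_kbps = (chunk_bytes * 8 / 1000) / download_seconds
        candidates = [r for r in RATES_KBPS if r <= throughput_kbps]
        return candidates[-1] if candidates else RATES_KBPS[0]

    # Example: a 10 second chunk encoded at 500 kbps (625,000 bytes) downloads
    # in 4 seconds, i.e. about 1250 kbps of throughput, so the client steps up.
    print(next_rate(625_000, 4.0))   # picks the 1050 kbps representation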

 

00:06:06.966 And as it downloads the chunks,

00:06:08.966 it plays each one out, in turn,

00:06:10.766 while it's downloading the next chunk.

 

00:06:15.600 And each of the chunks of video

00:06:17.666 is typically five or ten seconds,

00:06:19.933 or thereabouts, worth of video

00:06:21.766 content. And each one is compressed multiple

00:06:25.600 different times, and it's available at multiple

00:06:27.600 different rates, and it’s available at multiple

00:06:30.933 different sizes, for example.

 

00:06:33.733 And the chart on the slide,

00:06:35.300 on the right, gives an example of

00:06:37.466 how Netflix recommend videos are encoded,

00:06:40.600 starting at a rate of 235 kilobits

00:06:45.633 per second, for a 320x240 very low

00:06:49.833 resolution video, and moving up to 5800

00:06:53.300 kilobits per second, 5.8 megabits per second,

00:06:56.266 for a full HD quality video.

 

00:06:59.000 You can see that each 10 second

00:07:01.300 piece of content is available at 10

00:07:03.233 different quality levels, 10 different sizes.

 

00:07:07.000 And the receiver fetches the manifest to

00:07:09.666 start off with, which gives it the

00:07:11.200 index of all of the different chunks,

00:07:12.833 and all of the different sizes,

00:07:14.866 and which URL each one is available at.

 

00:07:17.700 And, as it fetches the chunk,

00:07:21.633 it tries to retrieve the URL for

00:07:26.133 that chunk, which involves a DNS request,

00:07:29.066 which involves the CDN redirecting it to

00:07:33.000 a local cache. And for that local

00:07:34.966 cache, as it downloads that chunk of

00:07:36.833 video, it measures the download rate.

 

00:07:39.766 If the download rate is slower than

00:07:42.466 the encoding rate, it switches to a

00:07:44.733 lower rate for the next chunk.

00:07:46.066 If the download rate is faster than the encoding rate,

00:07:49.200 it can consider switching up to a

00:07:50.666 higher quality, a higher rate, for the

00:07:52.233 next chunk. It chooses the encoding rate

00:07:55.866 to fetch based on the TCP download rate.

 

00:08:00.766 And we see what's happening is that

00:08:03.766 we've got two levels of adaptation going on.

 

00:08:07.500 On one level, we've got the dynamic

00:08:11.233 adaptive streaming, the DASH clients, fetching the

00:08:15.200 content over HTTP.

 

00:08:17.133 They’re fetching ten seconds worth of video

00:08:19.166 at a time, measuring the total time

00:08:21.033 it takes to download that ten seconds

00:08:22.800 worth of video. And they’re dividing the

00:08:25.333 number of bytes in each chunk by the

00:08:27.800 time taken, and that gives them

00:08:29.566 an average download rate for the chunk.

 

00:08:34.000 They're also doing this, though, over a

00:08:37.033 TCP connection. And, as we saw in

00:08:39.966 some of the previous lectures, TCP adapts

00:08:41.966 its congestion window every round-trip time.

 

00:08:44.833 And it's following a Reno or a

00:08:47.266 Cubic algorithm, and it's following the AIMD

00:08:50.533 approach. And, as you see at the

00:08:52.600 top of the slide, the sending rate’s

00:08:54.633 bouncing around following the sawtooth pattern,

00:08:57.266 and following the slow start and the

00:08:58.866 congestion avoidance phases of TCP.

 

00:09:01.400 So we've got quite a lot of

00:09:03.400 variation, on very short time scales,

00:09:06.233 as TCP does its thing. And then

00:09:08.800 that averages out, to give an overall

00:09:10.533 download rate for the chunk.

 

00:09:14.566 And, depending on the overall download rate

00:09:16.933 that TCP manages to get, averaged over

00:09:20.366 the ten seconds worth of video for

00:09:22.833 the chunk, it selects the size of the next

00:09:25.666 chunk to download. The idea is that

00:09:28.033 each chunk can be downloaded, at least

00:09:31.900 at real-time speed, and ideally a bit

00:09:34.166 faster than real-time, so the download gets

00:09:37.300 ahead of itself.

 

00:09:41.366 And, when you start watching a movie

00:09:45.233 on Netflix, or watching a program on

00:09:47.033 the iPlayer, for example, you often see

00:09:48.900 it starts out relatively poor quality,

00:09:51.833 for the first few seconds, and then

00:09:53.466 the quality jumps up after 10 or 20 seconds or so.

 

00:09:57.666 And what's happening here, is that the

00:09:59.400 receiver’s picking a conservative download rate for

00:10:01.800 the initial chunk, it’s picking one of

00:10:05.200 the relatively low quality, relatively small,

00:10:09.966 chunks, and downloading that first, and measuring

00:10:13.000 how long it takes. And, typically,

00:10:15.366 that's a conservative choice, and it realises

00:10:17.700 it can actually download,

00:10:19.033 it realises that the chunks are actually

00:10:21.666 downloading faster, so it switches up the

00:10:23.566 quality level fairly quickly. And, after the

00:10:25.766 first 10, 20, seconds, after a couple

00:10:28.300 of chunks have gone, the quality level has picked up.

 

00:10:35.500 A consequence of all of this,

00:10:37.633 is that it takes quite a long

00:10:41.033 time for streaming video to get started.

 

00:10:44.900 It’s quite common that when you start

00:10:47.266 playing a movie on Netflix, or a

00:10:50.133 program on the iPlayer, that it takes

00:10:51.633 a few seconds before it gets going.

 

00:10:54.800 And the reason for this, is some

00:10:57.600 combination of the chunk duration, and the

00:11:00.966 playout buffering, and the encoding delays if

00:11:03.566 the video’s being encoded live.

 

00:11:06.800 Fetching chunks, which are typically 10 seconds

00:11:10.300 long, you need to have one chunk

00:11:13.933 being played out at any one time.

00:11:17.000 You need to have 10 seconds worth

00:11:18.766 of video buffered up in the receiver,

00:11:20.566 so you can be playing that chunk

00:11:22.033 out while you're fetching the next one.

 

00:11:24.900 So you've got one chunk being played

00:11:26.633 out, and one being fetched, so you’ve

00:11:28.333 immediately got two chunks worth of buffering.

00:11:30.300 So that's 20 seconds worth of buffering.

 

00:11:32.600 Plus the time it takes to fetch

00:11:34.466 over the network, plus, if it's being

00:11:37.133 encoded live, the time it takes to

00:11:38.766 encode the chunk. The encoder needs to

00:11:40.300 pull in the entire chunk before it

00:11:42.166 can encode it, so you've got at

00:11:43.833 least another chunk of delay,

00:11:45.166 so that'd be another 10 seconds.

 

00:11:49.400 So you get a significant amount of

00:11:51.300 latency because of the ten second chunk

00:11:54.400 duration. You also need enough chunks of

00:11:58.333 video buffered up, such that

00:12:02.166 if the TCP download rate changes,

00:12:06.666 and it turns out that the available

00:12:08.466 capacity changes, so a chunk downloads much

00:12:10.566 slower than you would expect, that you

00:12:12.333 don't want to run out of video to play.

 

00:12:14.900 You want enough video buffered up,

00:12:17.666 that if something takes a long time,

00:12:20.033 you have time to drop down to

00:12:22.000 a lower rate for the next chunk,

00:12:24.000 and keep the video coming, even at

00:12:26.533 a reduced level, without it stalling.

00:12:28.100 Without you running out of video to play out.

 

00:12:32.033 So you’ve got to download a complete

00:12:33.733 chunk before you start playing out.

00:12:36.100 So you download and decompress a particular

00:12:38.300 chunk, and while you're doing that you're

00:12:40.700 playing the previous chunk, and everything stacks

00:12:44.633 up, the latency stacks up.

 

00:12:50.600 In addition to the fact that you're

00:12:52.400 just buffering up the different chunks of

00:12:55.000 video, and you need to have a

00:12:57.100 complete chunk being played while the next

00:12:59.600 one is downloading, you get the sources

00:13:02.000 of latency because of the network,

00:13:03.800 because of the way the data is transmitted over the network.

 

00:13:07.733 As we saw when we spoke about

00:13:09.766 TCP, the usual way TCP retransmits lost

00:13:13.900 packets, is following a triple duplicate ACK.

 

00:13:18.766 What we see on the slide here,

00:13:21.800 is that the data, on the sending

00:13:24.800 side, we have the user space,

00:13:26.700 where the blocks of data, the chunks

00:13:29.066 of video, are being written into a TCP connection.

 

00:13:32.766 And these get buffered up in the

00:13:34.633 kernel, in the operating system kernel on

00:13:36.866 the sender side, and transmitted over the network.

 

00:13:40.200 At some point later they arrive in

00:13:42.066 the operating system kernel on the receiver

00:13:44.400 side, and that generates the acknowledgments as

00:13:47.533 those chunks, as the TCP packets,

00:13:50.333 the chunks of video, are received.

 

00:13:53.433 And, if a packet gets lost,

00:13:55.533 it starts generating duplicate acknowledgments. And,

00:13:58.266 eventually, after the triple duplicate acknowledgement,

00:14:00.866 the packet will be retransmitted.

 

00:14:04.600 And we see that this takes time.

 

00:14:07.700 And if this is video, and the

00:14:09.666 packets are being sent at a constant

00:14:11.333 rate, we see that it takes time

00:14:13.833 to send four packets, the lost packet

00:14:16.633 plus the three following that generate the duplicate ACKs,

00:14:20.033 before the sender notices that a packet

00:14:25.966 loss has happened. Plus, it takes one

00:14:29.000 round trip time for the acknowledgements to

00:14:31.300 get back to the sender, and for

00:14:32.900 it to retransmit the packet.

 

00:14:35.333 So, if a packet

00:14:38.333 has been lost, it takes four times

00:14:40.033 the packet transmission time, plus one round-trip

00:14:42.733 time, before the packet gets

00:14:44.433 retransmitted, and arrives back at the receiver.

 

00:14:47.766 And that adds some latency. It’s got

00:14:50.833 to add at least four packet transmission times,

00:14:53.733 plus one round-trip time, of extra latency to

00:14:56.233 cope with a single retransmission.
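
As an illustrative calculation (the sending rate, packet size, and round-trip time below are assumptions chosen for the example, not figures from the lecture):

    # Extra buffering needed to ride out one triple-duplicate-ACK retransmission:
    # four packet transmission times plus one round-trip time.
    packet_bytes = 1250
    sending_rate_bps = 1_000_000      # assume video is sent at about 1 Mb/s
    rtt_ms = 50                       # assumed round-trip time

    packet_time_ms = packet_bytes * 8 / sending_rate_bps * 1000   # 10 ms per packet
    extra_latency_ms = 4 * packet_time_ms + rtt_ms                # 90 ms in this example
    print(extra_latency_ms)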

 

00:14:59.133 And, if the network's unreliable, such that

00:15:01.366 more than one packet is likely to

00:15:02.933 be lost, you need to add in

00:15:04.300 more buffering time, add in additional latency,

00:15:07.000 to allow the packets to arrive,

00:15:09.133 such that they can be given to

00:15:10.933 the receiver without disrupting the timing.

 

00:15:14.133 So you need to add some latency

00:15:16.433 to compensate for the retransmissions that TCP

00:15:18.933 might be causing, so that you can

00:15:21.633 keep receiving data smoothly while accounting for

00:15:24.866 the retransmission times.

 

00:15:31.166 In addition, there’s some latency due to

00:15:33.800 the size of the chunks of video.

 

00:15:36.633 Each chunk has to be independently decodable,

00:15:39.466 because you're changing the

00:15:43.500 compression, potentially changing the compression level,

00:15:45.866 at each chunk. So each one can't

00:15:48.466 be based on the previous one.

00:15:49.633 They all have to start from scratch

00:15:52.700 at the beginning of each chunk,

00:15:54.333 because you don't know what version came before.

 

00:15:58.766 And, if you look at how video

00:16:00.666 compression works, it's all based on predicting.

 

00:16:03.033 You send initial frames, what are called

00:16:04.966 I-frames, index frames,

00:16:06.800 which give you a complete frame of

00:16:08.700 video. And then the next few

00:16:11.333 frames are predicted based on

00:16:13.600 that. So, at the start of a

00:16:15.733 scene, you’ll send an index frame,

00:16:18.800 and then, for the rest of the

00:16:20.133 scene, each of the successive frames will

00:16:22.633 just include the difference from the previous

00:16:25.000 frame, from the previous index frame.

 

00:16:30.266 And how often you send index frames

00:16:35.833 affects the encoding rate, because the

00:16:37.900 index frames are big.

 

00:16:39.833 They're sending a complete frame of video,

00:16:41.900 whereas the predicted frames, in between,

00:16:43.933 are much smaller. The index frames are

00:16:46.033 often, maybe, 20 times the size of

00:16:47.900 the predicted frames.

 

00:16:49.933 And it depends on how you encode the chunks:

00:16:52.733 because each

00:16:55.166 chunk of video has to start with

00:16:57.266 an index frame, it has to start

00:16:58.866 with a complete frame, so the shorter each

00:17:01.200 chunk is, the fewer P-frames can

00:17:04.166 be sent before the start of the

00:17:05.900 next chunk and the next index frame.

 

00:17:09.233 So you have this trade-off. You can

00:17:12.300 make the chunks of video small,

00:17:14.866 and that reduces the latency in the

00:17:17.200 system, but it means you have more

00:17:19.200 frequent index frames. And the more frequent index frames

00:17:24.300 need more data, because the index frames

00:17:27.800 are large compared to the predicted frames,

00:17:30.000 so the encoding efficiency goes down,

00:17:32.633 and the overheads go up.

 

00:17:35.633 And this tends to enforce a lower

00:17:38.533 bound of around two seconds before the

00:17:41.066 overhead of sending the frequent index frames

00:17:44.066 gets to be too much, it gets to be excessive.
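
As a rough illustration of that trade-off, using the 20-to-1 index-to-predicted frame size ratio mentioned above and an assumed frame rate of 25 frames per second:

    # Relative cost of a chunk compared to an encoding with no extra index
    # frames, assuming one index frame per chunk and that an index frame is
    # 20 times the size of a predicted frame.
    def relative_cost(chunk_seconds, fps=25, index_ratio=20):
        frames = chunk_seconds * fps
        return (index_ratio + (frames - 1)) / frames

    for secs in (1, 2, 5, 10):
        print(secs, round(relative_cost(secs), 2))
    # Roughly 1.76x for 1 second chunks, falling to about 1.08x for 10 second
    # chunks, which is why longer chunks are more efficient to encode.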

 

00:17:46.733 So chunk sizes tend to be more

00:17:49.866 than that, tend to be 5,

00:17:51.000 10, seconds, just to keep the overheads down,

00:17:53.166 to keep the compression efficiency, the video

00:17:55.266 compression efficiency, reasonable.

 

00:17:57.966 And that's the main source of latency in these applications.

 

00:18:06.800 So, this clearly works. Applications like Netflix,

00:18:12.266 like the iPlayer, clearly work.

 

00:18:15.366 But they have relatively high latency.

 

00:18:18.100 Because you're fetching video chunk-by-chunk, and each

00:18:22.366 chunk is five or ten seconds worth of video,

00:18:25.233 you have a five or ten second wait

00:18:30.033 when you start the video playing,

00:18:31.966 before it actually starts playing. And it's

00:18:36.000 difficult to reduce that latency, because of

00:18:39.500 the compression efficiency, because of the overheads.

 

00:18:44.266 And it would be desirable, though, to reduce that latency.

 

00:18:50.433 It would be desirable for people who

00:18:53.466 watch sport, because the latency for the

00:18:56.766 streaming applications is higher than it is

00:18:59.133 for broadcast TV,

00:19:00.666 so, if you're watching live sports,

00:19:02.300 you tend to see the action 5,

00:19:05.666 or 10, or 20, seconds behind broadcast

00:19:08.466 TV, and that can be problematic.

 

00:19:14.400 It’s also a problem for people trying

00:19:17.466 to offer interactive applications, and augmented reality,

00:19:20.300 where they'd like the latency to be

00:19:22.800 low enough that you can interact with

00:19:25.066 the content, and maybe dynamically change the

00:19:27.633 view point, or interact with parts of the video.

 

00:19:31.766 So people are looking to build lower-latency

00:19:34.933 streaming video.

 

00:19:37.766 I think there's two ways in which this is likely to happen.

 

00:19:43.366 The first is that we might go back to using RTP.

 

00:19:47.700 We might go back to using something

00:19:51.566 like WebRTC to control the setup,

00:19:54.466 and build streaming video using essentially the

00:19:57.200 same platform we use for interactive video conferencing,

00:20:00.800 but sending in one direction only.

 

00:20:04.800 And this is possible today.

00:20:07.733 The browsers support

00:20:09.700 WebRTC, and there's nothing that says you

00:20:13.533 have to transmit as well as receiving

00:20:15.566 in a WebRTC session. So you could

00:20:17.766 build an application that uses WebRTC to stream

00:20:20.033 video to the browser.

 

00:20:22.666 It would have much lower latency than

00:20:25.133 the DASH-based, dynamic adaptive streaming over HTTP

00:20:28.900 based, approach that people use today.

 

00:20:31.833 But it's not clear that it would

00:20:33.500 play well with the content distribution networks.

00:20:35.733 It’s not clear that the CDNs would support RTP streaming.

 

00:20:39.600 But if they did, if the CDNs

00:20:42.200 could be persuaded to support RTP,

00:20:44.266 this would be a good way of getting lower latency.

 

00:20:48.700 I think what's perhaps more likely,

00:20:50.700 though, is that we will start to

00:20:53.266 see the CDNs switching to support QUIC,

00:20:55.966 because it gives better performance

00:20:58.600 for web traffic in general,

00:21:01.000 and then people start to switch to

00:21:03.633 delivering the streaming video over QUIC.

 

00:21:07.400 And, because QUIC is a user space

00:21:10.333 stack, it's easier to deploy interesting transport

00:21:15.900 protocol innovations. Because they're done by just

00:21:18.000 deploying a new application, you don't have

00:21:19.800 to change the operating system kernel.

00:21:21.566 On the other hand,

00:21:23.066 if you want to change how TCP

00:21:24.666 works, you have to change the operating system.

00:21:26.600 Whereas if you want to change the way QUIC works,

00:21:28.733 you just have to change the application or the library

00:21:30.933 that's providing QUIC.

 

00:21:32.500 So I think it's likely that we

00:21:33.833 will see CDNs switching to use HTTP/3,

00:21:38.200 and HTTP over QUIC,

00:21:40.133 and I think it's likely that they'll

00:21:41.900 also switch to delivering video over QUIC.

 

00:21:43.866 And I think that gives much more

00:21:45.466 flexibility to change the way QUIC works,

00:21:47.733 to optimise it to support low-latency video.

 

00:21:51.600 And we’re already, I think, starting to

00:21:54.033 see that happening. YouTube is already delivering

00:21:57.500 video over QUIC.

 

00:21:59.000 There are people talking about datagram extensions

00:22:01.966 to QUIC in the IETF to get

00:22:04.033 low latency, so I think we’re likely

00:22:07.133 to see the video switching to be

00:22:08.866 delivered by the CDNs using QUIC,

00:22:11.333 but with some QUIC extension to provide lower latency.

 

00:22:18.600 So that's all I want to say

00:22:20.466 about real-time and interactive applications.

 

00:22:25.366 The real-time applications have latency bounds.

 

00:22:29.266 They may be strict latency bounds,

00:22:31.833 150 milliseconds for an interactive application or

00:22:36.566 a video conference, or they may be

00:22:38.733 quite relaxed latency bounds, 10s of seconds

00:22:41.766 for streaming video currently.

 

00:22:44.733 The interactive applications run over WebRTC,

00:22:48.300 which uses the Real-time Transport Protocol,

00:22:51.266 RTP, for the media transport, with a

00:22:54.566 web-based signalling protocol put on top of

00:22:56.700 it. Or they use older standards,

00:22:59.633 such as SIP,

00:23:01.433 which is the way mobile phones and

00:23:03.633 the telephone network work, these days,

00:23:06.200 to set up the RTP flows.

 

00:23:09.533 Streaming applications, because they want to fit

00:23:12.266 with the content distribution network infrastructure,

00:23:15.866 because the amount of video traffic is

00:23:18.333 so great that they need the

00:23:20.933 scaling advantages that come with content distribution networks,

00:23:23.900 use an approach known as DASH,

00:23:26.100 Dynamic Adaptive Streaming over HTTP,

00:23:29.066 and deliver the video over HTTP as

00:23:31.466 a series of chunks, with a manifest,

00:23:33.600 and they let the browser choose which

00:23:36.133 chunk sizes to fetch, and use that

00:23:39.033 as a coarse-grained method of adaptation.

 

00:23:41.866 And this is very scalable, and it

00:23:45.100 makes very good use of the CDN

00:23:47.666 infrastructure to scale out, but it's relatively

00:23:50.933 high latency,

00:23:52.966 and relatively high overhead. And I think

00:23:56.666 the interesting challenge, in the future,

00:23:58.366 is to combine these two approaches,

00:24:00.566 to try and get the scaling benefits

00:24:02.566 of content distribution networks,

00:24:04.500 and the low-latency benefits of protocols like

00:24:07.233 RTP, and to try and bring this

00:24:09.333 into the video streaming world.

Discussion

Lecture 7 discussed real-time and interactive applications. It reviewed the definition of real-time traffic, and the differing deadlines and latency requirements for streaming and interactive applications, the differences in elasticity of traffic demand in real-time and non-real-time applications, quality of service, and quality of experience.

Considering interactive conferencing applications, the lecture reviewed the structure of such applications and briefly described the standard Internet multimedia conferencing protocol stack. It outlined the features RTP provides for secure delivery of real-time media, and highlighted the importance of timing recovery, application level framing, loss concealment, and forward error correction. It briefly mentioned the WebRTC peer-to-peer data channel. And it discussed the need for signalling protocols to set up interactive calls, and briefly outlined how SIP and WebRTC use SDP to negotiate calls.

Considering streaming applications, the lecture highlighted the role of content distribution networks to explain why media is delivered over HTTP. It explained chunked media delivery and the Dynamic Adaptive Streaming over HTTP (DASH) standard for streaming video, showing how this adapts the sending rate and how it relates to TCP congestion control. The lecture also mentioned some sources of latency for DASH-style systems.

Discussion will focus on the essential differences between real-time and non-real-time applications, timing recovery, and media transport in both interactive and streaming applications.