RTP - Audio and Video for the Internet

Published by Addison-Wesley, 2003 (ISBN 0-672-32249-8)

This book describes the protocols, standards, and architecture of systems that deliver real-time voice, music, and video over IP networks, such as the Internet. Relevant applications include voice-over-IP, telephony, teleconferencing, streaming video, and web-casting. The focus of the book is media transport: how to reliably deliver audio and video across an IP network, how to ensure high quality in the face of network problems, and how to ensure that the system is secure. The book adopts a standards-based approach, based around the Real-time Transport Protocol, RTP, and its associated profiles and payload formats. It describes the RTP framework, how to build a system that uses that framework, and extensions to RTP for security and reliability.


The book is logically divided into four sections: The first section introduces the problem space, provides background, and outlines the properties of the Internet that affect audio/video transport. These are the chapters in the first section:

  • Chapter 1, Introduction gives a brief introduction to the Real-time Transport Protocol, outlines the relationship between RTP and other standards, and describes the scope of the book.
  • Chapter 2, Audio/Video Communication over Packet Networks describes the unique environment provided by IP networks, and how this affects packet audio/video applications.

The second part, consisting of the next five chapters, discusses the basics of the Real-time Transport Protocol. This is information you need to design and build a tool for voice-over-IP, streaming music or video, and so forth. Following are the related chapters:

  • Chapter 3, The Real-Time Transport Protocol introduces RTP and related standards, describes how they fit together, and outlines the design philosophy underpinning the protocol.
  • Chapter 4, RTP Data Transfer Protocol is a detailed description of the transport protocol, used to convey audio/visual data over IP networks.
  • Chapter 5, RTP Control Protocol describes the control protocol, which provides reception quality feedback, membership control, and synchronization.
  • Chapter 6, Media Capture, Playout and Timing explains how a receiver can reconstruct the audio/visual data and play it out to the user with correct timing.
  • Chapter 7, Lip Synchronization addresses a related problem: how to synchronize media streams, for example, to get lip synchronization.

The third part of the book discusses robustness: how to make your application reliable in the face of network problems, and how to compensate for loss and congestion in the network. You can build a system without using these techniques, but it will sound a lot better, and the pictures will be smoother and less susceptible to corruption, if you apply them. These chapters make up the third part of the book:

  • Chapter 8, Error Concealment addresses the issue of concealing errors caused by incomplete reception, describing several techniques a receiver can use to hide network problems.
  • Chapter 9, Error Correction describes techniques that can be used to repair damaged media streams, in which the sender and receiver cooperate in repairing the media stream.
  • Chapter 10, Congestion Control discusses the way the Internet responds to congestion, and how this affects the design of audio/video applications.

The final section describes a number of techniques that have more specialized uses. Many implementations do not use these features, but they can give a significant performance increase in some cases. These are the chapters:

  • Chapter 11, Header Compression outlines a technique that can significantly increase the efficiency of RTP on low-speed network links, such as dial-up modems or cellular radio links.
  • Chapter 12, Multiplexing and Tunnelling presents alternatives to header compression that work by combining several media streams into one. Again, the intent is to improve efficiency on low-speed links.
  • Chapter 13, Security Considerations describes how encryption and authentication technology can be used to protect RTP sessions; it also describes common security and privacy issues.


This book describes audio/video transport over IP networks in considerable detail. It assumes basic familiarity with IP network programming and the operation of network protocols, and builds on this knowledge to describe the features unique to audio/video transport. An extensive list of references is included, pointing readers to additional information on specific topics and to background reading material.

Several audiences will find this book useful:

  • Engineers: The primary audience is those building voice-over-IP applications, teleconferencing systems, and streaming media and web-casting applications. This book is a guide to the design and implementation of the media transport engine part of such systems. It should be read in conjunction with the relevant technical standards, and it builds from those standards to show how a system is designed and engineered. This book does not discuss signalling (for example, SIP, RTSP, or H.323) since that is a separate subject worthy of a book in its own right. Instead it talks in detail about media transport, and how to achieve good quality audio and smooth motion video over IP networks.
  • Students: The book can be read as an accompaniment to a course in network protocol design or telecommunications, at either graduate or advanced undergraduate level. Familiarity with IP networks, and layered protocol architectures, is assumed. The unique aspects of protocols for real-time audio/video transport are highlighted, as are the differences from a typical layered system model. The cross-disciplinary nature of the subject is noted, in particular the relation between the psychology of human perception and the demands of robust media delivery.
  • Researchers: Academics and industrial researchers can use this book as a source of information about the standards and algorithms comprising the current state of the art in real-time audio/video transport over IP networks. Pointers to the literature are included in the References section and will be useful starting points for those seeking further depth and areas where more research is needed.
  • Network Administrators: Understanding the technical protocols underpinning the common streaming audio/video applications illuminates for network administrators how those applications can affect the behaviour of the network, and how the network can be engineered to better suit those applications. This book includes extensive discussion of the network behaviour commonly seen, and covers how applications can adapt to it, the needs of congestion control, and the security implications of real-time audio/video traffic.

This book can be used as a reference, in conjunction with the technical standards, as a study guide, or as part of an advanced course on network protocol design or communication technology.


If you believe you have found a mistake in the book, please contact me.

Chapter 2

  • Figure 2.12: The arrow labelled "delay" should cover the gap between the onset of the delay spike and the first delayed packet, rather than the onset of the delay spike and the time the transit delay returns to normal.
  • Figure 2.13: The vertical axis should be labelled "Average loss probability" rather than "Loss probability"

Chapter 4

  • Page 75: On the second to last line, wrap-around-count should be wrap_around_count.
  • Page 76: Line 1 of the sample program is missing a semicolon, and should read:
    uint16_t udelta = seq - max_seq;
  • Page 76: Line 4 of the sample program is missing a semicolon, and should read:
  • Page 86: The header extension length field counts 32-bit words excluding the initial 32 bits, not octets.
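
The two arithmetic points in the Chapter 4 errata can be illustrated with a minimal sketch in C. This is not the book's listing: the function names seq_delta and rtp_ext_octets are hypothetical, and the code only assumes that RTP sequence numbers are 16 bits wide and that the extension length field is 16 bits wide.

```c
#include <stdint.h>

/* Forward distance from max_seq to seq, modulo 2^16.
 * In C the uint16_t operands are promoted to int before the subtraction,
 * so the result is truncated back to 16 bits to get wrap-around behaviour. */
static uint16_t seq_delta(uint16_t seq, uint16_t max_seq)
{
    return (uint16_t)(seq - max_seq);
}

/* Total size in octets of an RTP header extension, given the value of
 * its length field. The field counts 32-bit words and excludes the
 * initial 32 bits of the extension, so those 4 octets are added back. */
static uint32_t rtp_ext_octets(uint16_t length_field)
{
    return 4u + 4u * (uint32_t)length_field;
}
```

With these definitions, a sequence number of 2 that follows 65534 gives a delta of 4 rather than a large negative value, and a length field of 3 describes an extension of 16 octets in total.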

Chapter 5

  • Figure 5.1: RTCP packets are a multiple of 32 bits in length. Padding is used to increase the length of the packet to another multiple of 32 bits, longer than the natural length of the packet. Accordingly, the figure is incorrect, since it implies that the padding is needed to pad to a 32-bit boundary.
  • Figure 5.3: The brace labelling "First Report Block" should include the fields from the Reportee SSRC down to the DLSR. It should not include the Reporter SSRC, V, P, RC, PT and Length fields.
  • Figure 5.12: the length of the SDES packet should be 10, since the padding is included in the length.
  • Page 122: Listing 5.1 is missing an opening brace in the function definition. The listing should begin:
    validate_rtcp(rtcp_t *packet, int length)
    {
       rtcp_t  *end = (rtcp_t *) (((char *) packet) + length);
  • Page 123: The code fragment to check that all RTCP packets are compound packets is missing an opening parenthesis, and should read:
    if (((packet->length + 1) * 4) == length) {
  • Page 130: The code fragment is incorrectly wrapped; "75% of RTCP bandwidth" should all be on one line.
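
The length arithmetic behind the page 123 correction can be sketched as follows. This is an illustrative example rather than the book's validation code: the helper names rtcp_packet_octets and rtcp_count_packets are hypothetical, and the sketch assumes the standard RTCP common header layout, with the 16-bit length field in octets 2 and 3 in network byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in octets of one RTCP packet. The length field holds the packet
 * length in 32-bit words minus one, hence (length + 1) * 4. */
static size_t rtcp_packet_octets(uint16_t length_field)
{
    return ((size_t)length_field + 1) * 4;
}

/* Walk the RTCP packets in a received datagram and return how many it
 * contains, or -1 if the length fields do not fit the datagram exactly.
 * A result of 1 means the first packet fills the whole datagram, which
 * is the case the page 123 check detects (not a compound packet). */
static int rtcp_count_packets(const uint8_t *buf, size_t len)
{
    size_t off = 0;
    int count = 0;

    while (off < len) {
        if (len - off < 4) {
            return -1;                  /* truncated common header */
        }
        uint16_t length_field = (uint16_t)((buf[off + 2] << 8) | buf[off + 3]);
        size_t octets = rtcp_packet_octets(length_field);
        if (octets > len - off) {
            return -1;                  /* packet overruns the datagram */
        }
        off += octets;
        count++;
    }
    return count;
}
```

Because every packet size is (length + 1) * 4, each packet in the datagram is always a multiple of 32 bits long, which matches the note on Figure 5.1.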

Chapter 6

  • Page 154: In the NTSC video example, the timestamp should increase by exactly 3003 per frame (not per packet).
  • Page 178: The calculation of d_n is reversed compared to the formula on page 176, and should read:
    uint32_t   d_n = curr_time - p->ts
    Since there is an unknown constant random offset applied to the RTP timestamp, this change makes no difference in practice.
  • Page 190: Comments in the code sample are incorrectly formatted. Also the calculation of delta_var is missing a minus sign, and should read:
    delta_var = (abs(transit - last_transit) + abs(transit - last_last_transit))/8;
  • Page 196: The first line of the code sample should read:
    if ((curr->pt == COMFORT_NOISE_PT) || is_comfort_noise(curr)) {
  • Page 205: just before the figure, "directory memory access" should be "direct memory access"
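
The figure of 3003 in the page 154 correction can be checked with a short calculation. The sketch below assumes a 90 kHz RTP media clock and an NTSC frame rate of 30000/1001 frames per second; neither value is stated in the erratum itself, and the function name ticks_per_frame is hypothetical.

```c
#include <stdint.h>

/* RTP timestamp increment per video frame for a frame rate expressed as
 * the fraction fps_num / fps_den. Returns 0 when the frame rate is
 * invalid or the increment is not a whole number of clock ticks. */
static uint32_t ticks_per_frame(uint32_t clock_hz,
                                uint32_t fps_num, uint32_t fps_den)
{
    uint64_t ticks = (uint64_t)clock_hz * fps_den;

    if (fps_num == 0 || ticks % fps_num != 0) {
        return 0;
    }
    return (uint32_t)(ticks / fps_num);
}
```

Under those assumptions, ticks_per_frame(90000, 30000, 1001) evaluates to 90000 * 1001 / 30000 = 3003. All packets that carry part of the same frame share one timestamp, so the increment applies per frame and not per packet.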

Chapter 7

  • Page 217: The equation should be: Ts = Tssr + (M - Msr) / R
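
As a hedged illustration of the corrected equation, the sketch below reads the symbols as follows (these readings are assumptions, since the erratum does not define them): Ts is the reference-clock time of a media sample, Tssr is the reference-clock time carried in the most recent sender report, M is the RTP timestamp of the sample, Msr is the RTP timestamp in that sender report, and R is the media clock rate in Hz. The function name is hypothetical.

```c
#include <stdint.h>

/* Ts = Tssr + (M - Msr) / R, with times in seconds.
 * The timestamp difference is taken modulo 2^32 and then interpreted as
 * a signed value, so it survives RTP timestamp wrap-around and also
 * handles samples slightly older than the sender report. The cast to
 * int32_t assumes a two's complement platform. */
static double rtp_to_reference_time(uint32_t m, uint32_t m_sr,
                                    double t_ssr, double rate_hz)
{
    int32_t delta = (int32_t)(m - m_sr);
    return t_ssr + (double)delta / rate_hz;
}
```

For example, with an 8000 Hz clock, a sender report mapping RTP timestamp 90000 to time 10.0 seconds places a sample with timestamp 98000 at 11.0 seconds.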

Chapter 8

  • Page 231: The third line of the sample code is missing the closing ] and should read:
    sample missing_frame[samples_per_frame])
  • Page 236: In the listing the calculation of fade_per_sample is split across two lines, and incorrectly indented. Also, the code in the body of the second "for" loop is missing semicolons, and should read:
    for (j=0; j<samples_per_frame; j++) {
       missing_frame[j] *= scale_factor;
       scale_factor -= fade_per_sample;
    }
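
The corrected loop can be placed in context with a minimal, self-contained sketch of concealment by repetition with fading. This is not the book's listing: the sample type, the function name fade_repeat, and the choice to copy the previous frame before scaling are assumptions made for illustration. Only the loop body mirrors the erratum.

```c
#include <stddef.h>

typedef short sample;   /* assumed 16-bit linear PCM sample type */

/* Fill missing_frame with a copy of previous_frame whose amplitude is
 * reduced by fade_per_sample after each sample, starting from
 * scale_factor. The scale is clamped at zero so a long fade cannot
 * invert the signal. */
static void fade_repeat(const sample *previous_frame, sample *missing_frame,
                        size_t samples_per_frame,
                        float scale_factor, float fade_per_sample)
{
    for (size_t j = 0; j < samples_per_frame; j++) {
        missing_frame[j] = previous_frame[j];
        missing_frame[j] = (sample)(missing_frame[j] * scale_factor);
        scale_factor -= fade_per_sample;
        if (scale_factor < 0.0f) {
            scale_factor = 0.0f;
        }
    }
}
```

Starting with a scale factor of 1.0 and a fade of 0.25 per sample, a frame of four samples with value 1000 is concealed as 1000, 750, 500, 250.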

Chapter 9

  • Figures 9.3 and 9.9: The legend should read "CC = contributing source count" instead of "CC = list of contributing sources"
  • Page 261: just before the SDP, "for example, 122" should read "in this example, 122"

Chapter 10

  • Figure 10.1: The horizontal axis should be labelled "Packet Sending Rate"

Chapter 11

  • Figure 11.3: The field containing the M'S'T'I' headers should contain the CC header, not a Link Sequence header


  • Reference 43 has now been published as RFC 3545.
  • Reference 49 has now been published as RFC 3551.
  • Reference 50 has now been published as RFC 3550.
  • Reference 51 has now been published as RFC 3555.
  • Reference 54 has now been published as RFC 3569.

There are no significant technical changes in the RFC versions of these references, compared to the draft versions used in the preparation of the book.

Thanks are due to Akimichi Ogawa, Badri Natarajan, Edd Inglis, Jason Van Eaton, David Tod, Jungkhun Byun, Yeong-Chuan Lim and Jeffrey Jo for reporting a number of errata.