Multiplexing and RTP Sessions

7 August 2011 / rtcweb

The following is extracted from version -02 of our RTP Requirements for RTC-Web draft, and describes some of the issues relating to RTP session multiplexing. This was discussed at IETF 81 in Quebec City. The linked set of slides on Multiplexing RTP Sessions was prepared for the RTC-Web working group session at that meeting, but was not presented to the working group: discussion in a break-out meeting covered the key points.

Expected Topologies

As RTC-Web is focused on peer-to-peer connections established from clients in web browsers, the following topologies, discussed further in RTP Topologies [RFC5117], are the ones primarily considered. The topologies are depicted and briefly explained here for the reader's convenience.

The point to point topology (Figure 1) is going to be very common in single user to single user applications.

Figure 1: Point to Point

For small multiparty sessions it is practical enough to create RTP sessions by letting every participant send individual unicast RTP/UDP flows to each of the other participants (Figure 2). This is called multi-unicast and is unfortunately not discussed in the RTP Topologies RFC. This topology has the benefit of not requiring central nodes. The downside is that it increases the bandwidth used at each sender, which must send one copy of its media streams to each of the other participants in the session. Thus this is limited to scenarios with few end-points unless the media is very low bandwidth.

Figure 2: Multi-Unicast

It should be noted that, if this topology is to be supported by the RTC-Web framework, it must be possible to connect one RTP session to multiple individually established peer-to-peer flows.

An RTP mixer (Figure 3) is a centralised point that selects or mixes content in a conference to optimise the RTP session so that each end-point only needs to connect to one entity, the mixer. The mixer also reduces the bit-rate needs, as the media sent from the mixer to the end-point can be optimised in different ways. These optimisations include methods like only choosing media from the currently most active speaker, or mixing together audio so that only one audio stream is required instead of 3 in the depicted scenario. The downside of the mixer is that someone is required to provide the actual mixer.

Figure 3: RTP Mixer with Only Unicast Paths

If one wants a less complex central node, it is possible to use a relay (called a Transport Translator) (Figure 4) that takes on the role of forwarding the media to the other end-points but doesn't perform any media processing. It simply forwards the media from each end-point to all the others. Thus an end-point A will only need to send its media once to the relay, but it will still receive 3 RTP streams with the media if B, C and D are all currently transmitting.

Figure 4: RTP Translator with Only Unicast Paths
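
The trade-offs between the multi-unicast, mixer and relay topologies can be made concrete by counting the streams each end-point handles. The following is a minimal Python sketch, assuming every participant sends exactly one media stream and all participants are transmitting; the function name and topology labels are ours, chosen only for illustration.

```python
def per_endpoint_streams(topology, participants):
    """Return (streams_sent, streams_received) for a single end-point.

    Assumption: each participant sends exactly one media stream and
    all participants are transmitting at the same time.
    """
    if participants < 2:
        raise ValueError("need at least two participants")
    others = participants - 1
    if topology == "multi-unicast":
        # One copy of the media is sent to every other participant.
        return others, others
    if topology == "mixer":
        # One stream up, one mixed or selected stream down.
        return 1, 1
    if topology == "relay":
        # Sent once to the relay, which forwards every other stream unchanged.
        return 1, others
    raise ValueError("unknown topology: " + topology)


for name in ("multi-unicast", "mixer", "relay"):
    sent, received = per_endpoint_streams(name, 4)
    print(f"{name}: sends {sent}, receives {received}")
```

With four participants (A, B, C and D) the relay case gives one stream sent and three received, matching the description of end-point A above.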

To support a legacy end-point (B) that doesn't fulfil the requirements of RTC-Web, it is possible to insert a Translator (Figure 5) that takes on the role of ensuring that, from A's perspective, B looks like a fully compliant end-point. Thus it is the combination of the Translator and B that looks like the end-point B. The intention is that the presence of the translator is transparent to A; however, it is not certain that this is possible. Thus this case is included so that it can be discussed whether any mechanism specified for use in RTC-Web results in such issues, and how to handle them.

Figure 5: RTP Translator Towards Legacy End-Point

RTP Multiplexing Points

There are three fundamental points of multiplexing within the RTP framework:

  • Use of separate RTP Sessions: The first, and the most important, multiplexing point is the RTP session. This multiplexing point does not have an identifier within the RTP protocol itself, but instead relies on the lower layer to separate the different RTP sessions. This is most often done by separating different RTP sessions onto different UDP ports, or by sending to different IP multicast addresses. The distinguishing feature of an RTP session is that it has a separate SSRC identifier space; a single RTP session can span multiple transport connections provided packets are gatewayed such that participants are known to each other. Different RTP sessions are used to separate different types of media within a multimedia session. For example, audio and video flows are sent on separate RTP sessions. Completely different usages of the same media type, e.g. video of the presenter and the slide video, also benefit from being separated.
  • Multiplexing using the SSRC within an RTP session: The second multiplexing point is the SSRC that separates different sources of media within a single RTP session. An example might be different participants in a multiparty teleconference, or different camera views of a presentation. In most cases, each participant within an RTP session has a single SSRC, although this may change over time if collisions are detected. However, in some more complex scenarios participants may generate multiple media streams of the same type simultaneously (e.g., if they have two cameras, and so send two video streams at once) and so will have more than one SSRC in use at once. The RTCP CNAME can be used to distinguish between a single participant using two SSRC values (where the RTCP CNAME will be the same for each SSRC), and two participants (who will have different RTCP CNAMEs).
  • Multiplexing using the Payload Type within an RTP session: If different media encodings of the same media type (audio, video, text, etc) are to be used at different times within an RTP session, for example a single participant that can switch between two different audio codecs, the payload type is used to identify how the media from that particular source is encoded. When changing media formats within an RTP Session, the SSRC of the sender remains unchanged, but the RTP Payload Type changes to indicate the change in media format.

These multiplexing points are a fundamental part of the design of RTP and are discussed in Section 5.2 of [RFC3550]. Of special importance is the need to separate different RTP sessions using a multiplexing mechanism at some lower layer than RTP, rather than trying to combine several RTP sessions implicitly into one lower layer flow. This will be further discussed in the next section.
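
The three multiplexing points can be illustrated with a small Python sketch that classifies an incoming packet. The session is identified by something outside the packet (here the local UDP port, which is one common choice), while the SSRC and Payload Type are read from the 12-byte fixed RTP header. The names RtpKey and classify are hypothetical, and the sketch ignores CSRC lists and header extensions.

```python
import struct
from typing import NamedTuple


class RtpKey(NamedTuple):
    session: int       # supplied by the lower layer (here: local UDP port)
    ssrc: int          # identifies the media source within the session
    payload_type: int  # identifies how that source's media is encoded


def classify(local_port, packet):
    """Map a received RTP packet onto the three multiplexing points."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    # Fixed header: V/P/X/CC, M/PT, sequence number, timestamp, SSRC.
    first, second, _seq, _timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    # The low 7 bits of the second byte are the payload type.
    return RtpKey(local_port, ssrc, second & 0x7F)


# The same SSRC value may appear in two sessions without conflict,
# because each session has its own SSRC space.
audio = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234ABCD)
video = struct.pack("!BBHII", 0x80, 96, 1, 3000, 0x1234ABCD)
print(classify(5004, audio))
print(classify(5006, video))
```

Note that nothing in the packet itself names the session: if both packets arrived on the same port, the receiver would have only the SSRC and Payload Type to go on.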

RTP Session Multiplexing

In today's networks, with prolific use of Network Address Translators (NAT) and Firewalls (FW), there is a desire to reduce the number of transport layer ports used by a real-time media application using RTP. This has led some to suggest multiplexing two or more RTP sessions on a single transport layer flow, using either the Payload Type or SSRC to demultiplex the sessions, in violation of the rules outlined above. It is not the first time that people have looked at RTP and questioned the need for using separate RTP sessions for different media types, and even more the potential need to separate different media streams of the same type into different sessions due to their different purposes. Section 5.2 of [RFC3550] outlines some of those problems; we elaborate on that discussion, and on other problems that occur if one violates this part of the RTP design and architecture.

Why RTP Sessions Should be Demultiplexed by the Transport

As discussed in Section 5.2 of [RFC3550], multiplexing several RTP sessions (e.g., audio and video) onto a single transport layer flow introduces the following problems:

  • Payload Identification: If two RTP sessions of the same type are multiplexed onto a single transport layer flow using the same SSRC but relying on the Payload Type to distinguish the session, and one were to change encodings and thus acquire a different RTP payload type, there would be no general way of identifying which stream had changed encodings. This can be avoided by partitioning the SSRC space between the two sessions, but that causes other problems as discussed below.
  • Timing and Sequence Number Space: An RTP SSRC is defined to identify a single timing and sequence number space. Interleaving multiple payload types would require different timing spaces if the media clock rates differ and would require different sequence number spaces to tell which payload type suffered packet loss. Using multiple clock rates in a single RTP session is problematic, as discussed in [I-D.ietf-avtext-multiple-clock-rates]. This can be avoided by partitioning the SSRC space between the two sessions, but that causes other problems as discussed below.
  • RTCP Reception Reports: RTCP sender reports and receiver reports can only describe one timing and sequence number space per SSRC, and do not carry a payload type field. Multiplexing sessions based on the payload type breaks RTCP. This can be avoided by partitioning the SSRC space between the two sessions, but that causes other problems as discussed below.
  • RTP Mixers: Multiplexing RTP sessions of incompatible media type (e.g., audio and video) onto a single transport layer flow breaks the operation of RTP mixers, since they are unable to combine the flows together.
  • RTP Translators: Multiplexing RTP sessions of incompatible media type (e.g., audio and video) onto a single transport layer flow breaks the operation of some types of RTP translator, for example media transcoders, which rely on the RTP requirement that all media are of the same type.
  • Quality of Service: Carrying multiple media in one RTP session precludes the use of different network paths or flow-based network resource allocations where appropriate. It also makes reception of a subset of the media, for example just audio if video would exceed the available bandwidth, difficult without the use of an RTP translator within the network to filter out the unwanted media. Unless such translators are trusted devices (and included in the key exchange), this is difficult to combine with media security functions.
  • Separate Endpoints: Multiplexing several sessions into one transport layer flow prevents use of a distributed endpoint implementation, where audio and video are rendered by different processes and/or systems.

We do note that some of the above issues are resolved as long as there is explicit separation of the RTP sessions when transported over the same lower layer transport, for example by inserting a multiplexing layer in between the lower transport and the RTP/RTCP headers. But a number of the above issues are not resolved by this.
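
The timing and sequence number problem in the list above can be demonstrated with a short Python sketch. It assumes audio and video are interleaved under a single SSRC, and therefore share one sequence number space; the payload type values 0 and 96 and the function name are illustrative only, and sequence number wrap-around is ignored.

```python
def missing_sequence_numbers(received):
    """Return the sequence numbers missing from one SSRC's packets.

    `received` is a list of (sequence_number, payload_type) pairs.
    Wrap-around of the 16-bit sequence number is ignored in this sketch.
    """
    seen = {seq for seq, _payload_type in received}
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]


# Audio (PT 0) and video (PT 96) share one sequence number space.
received = [(100, 0), (101, 96), (102, 96), (104, 0), (105, 96)]
lost = missing_sequence_numbers(received)
print(lost)
```

The receiver can tell that packet 103 is missing, but nothing in the surviving packets says whether the lost packet carried audio or video, so the loss cannot be attributed to either medium.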

In the RTCWEB context, i.e. web browsers running on various end-points, it might appear unlikely that flow-based QoS is available on the end-points that will support RTCWEB. We agree that users in the common case, on home networks or at WiFi hotspots, are unlikely to have flow-based QoS available. However, if one considers enterprise users, especially those using intranet applications, the availability of QoS and the desire to use it are not implausible. There are also web users on networks that are more resource-constrained than wired and WiFi networks, for example cellular networks. The current access network QoS mechanisms for user traffic in 3GPP cellular technology are flow based.

RTP's design hasn't been changed, although session multiplexing related topics have been discussed at various points in RTP's 20 year history. The fact is that numerous RTP mechanisms and extensions have been defined assuming that one can perform session multiplexing when needed. Mechanisms that have been identified as problematic if one doesn't do session separation are:

  • Scalability: RTP was built with media scalability in consideration. The simplest way of achieving separation between different scalability layers is placing them in different RTP sessions, and using the same SSRC and CNAME in each session to bind them together. This is most commonly done in multicast, and not particularly applicable to RTC-Web, but gatewaying of such a session would then require more alterations and likely stateful translation.
  • RTP Retransmission in Session Multiplexing mode: RTP Retransmission does have a mode for session multiplexing. This would not be the main mode used in RTC-Web, but for interoperability and reduced cost in translation, support for different RTP sessions is beneficial.
  • Forward Error Correction: The RTP Payload Format for Generic Forward Error Correction and its update can only be used on media formats that produce RTP packets smaller than half the MTU if the FEC flow and the media flow being protected are to be sent in the same RTP session; this is due to the RTP Payload for Redundant Audio Data. This is because the SSRC value of the original flow is recovered from the FEC packet's SSRC field. So anything that wishes to use these formats with RTP payloads close to the MTU needs to put the FEC data in a separate RTP session from the original transmissions. The usage of this type of FEC data has not been decided on in RTC-Web.
  • SSRC Allocation and Collision: The SSRC identifier is a random 32-bit number that is required to be globally unique within an RTP session, and that is reallocated to a new random value if an SSRC collision occurs between participants. If two or more RTP sessions share a transport layer flow, there is no guarantee that their choice of SSRC values will be distinct, and there is no way in standard RTP to signal which SSRC values are used by which RTP session. RTP is explicitly a group-based communication protocol, and new participants can join an RTP session at any time; these new participants may choose SSRC values that conflict with the SSRC values used in any of the multiplexed RTP sessions. This problem can be avoided by partitioning the SSRC space, and signalling how the space is to be subdivided, but this is not backwards compatible with any existing RTP system. In addition, subdividing the SSRC space makes it difficult to gateway between multiplexed RTP sessions and standard RTP sessions: the standard sessions may use parts of the SSRC space reserved in the multiplexed RTP sessions, requiring the gateway to rewrite RTCP packets, as well as the SSRC and CSRC list in RTP data packets. Rewriting RTCP is a difficult task, especially when one considers extensions such as RTCP XR.
  • Conflicting RTCP Report Types: The extension mechanisms used in RTCP depend on separation of RTP sessions for different media types. For example, the RTCP Extended Report block for VoIP is suitable for conversational audio, but clearly not useful for video. This may cause unusable or unwanted reports to be generated for some streams, wasting capacity and confusing monitoring systems. While this problem may be unlikely for VoIP reports, it may be an issue for the more detailed media agnostic reports, which are sometimes used for different media types. Also, this makes the implementation of RTCP more complex, since partitioning the SSRC space by media type needs to be done not only on the media processing side, but also on the RTCP reporting side.
  • RTCP Reporting and Scheduling: The RTCP reporting interval and its packet scheduling will be affected if several RTP sessions are multiplexed onto the same transport layer flow. The reporting interval is determined by the session bandwidth, and the reporting interval chosen for a high-rate video session will be different to the interval chosen by a low-rate VoIP session. If such sessions are multiplexed, then participants in one session will see the SSRC values of the other session. This will cause them to overestimate the number of participants in the session by a factor of two, thus doubling their RTCP reporting interval, and making their feedback less timely. In the worst case, when an RTP session with very low RTCP bandwidth is multiplexed with an RTP session with high RTCP bandwidth, this may cause repeated RTCP timer reconsideration, leading to the members of the low bandwidth session timing out. Participants in an RTP session configured with high bandwidth (and short RTCP reporting interval) will see RTCP reports from participants in the low bandwidth session much less often than expected, potentially causing them to repeatedly time out and re-create state for those participants. The split of RTCP bandwidth between senders and receivers (where at least 25% of the RTCP bandwidth is allocated to senders) will be disrupted if a session with few senders (e.g., a VoIP session) is multiplexed with a session with many senders (e.g., a video session). These issues can be resolved if the partition of the SSRC space is signalled, but this is not backwards compatible with any existing RTP system. The partition would require re-implementing large parts of the RTCP processing to take the individual sessions into account.
  • Sampling Group Membership: The mechanism defined in RFC2762 to sample the group membership, allowing participants to keep less state, assumes a single flat 32-bit SSRC space, and breaks if the SSRC space is shared between several RTP sessions.
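
The effect of an inflated membership count on RTCP timing can be sketched numerically. The Python below is a deliberately simplified version of the RTCP interval calculation: it assumes RTCP is given 5% of the session bandwidth and a 5 second minimum interval, and it omits the sender/receiver bandwidth split, randomisation and timer reconsideration. The 100-byte average RTCP packet size and the function name are assumptions made for illustration.

```python
def rtcp_interval(members, session_bw_bps, avg_rtcp_size_bytes=100.0,
                  rtcp_fraction=0.05, t_min=5.0):
    """Simplified deterministic RTCP reporting interval, in seconds.

    Omits the sender/receiver split, randomisation and reconsideration;
    avg_rtcp_size_bytes is an assumed value for illustration.
    """
    rtcp_bw_bytes_per_s = session_bw_bps * rtcp_fraction / 8.0
    return max(t_min, members * avg_rtcp_size_bytes / rtcp_bw_bytes_per_s)


# A 64 kbit/s audio session with 100 real members...
true_interval = rtcp_interval(100, 64_000)
# ...whose members also see 100 SSRCs from a multiplexed video session.
inflated_interval = rtcp_interval(200, 64_000)
print(true_interval, inflated_interval)
```

Once the interval is above the minimum, it scales linearly with the observed number of members, so counting the SSRCs of a second multiplexed session doubles the interval as described above.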

As can be seen, the requirement that separate RTP sessions are carried in separate transport-layer flows is fundamental to the design of RTP. Due to this design principle, implementors of various services or applications using RTP have not commonly violated this model, and have separated RTP sessions onto different transport layer flows. After 15 years of deployment of RTP in its current form, any move to change this assumption must carefully consider the backwards compatibility problems that this will cause. In particular, since widespread use of multiplexed RTP sessions in RTC-Web will almost certainly cause their use in other scenarios, the discussion regarding compatibility must be wider than just whether multiplexing works for the extremely limited subset of RTP use cases currently being considered in the RTC-Web group. Any such multiplexing extension to RTP must therefore be developed by the AVTCORE working group, since it has much broader applicability and scope than RTC-Web.

Arguments for a single transport flow

The arguments we are aware of for why it is desirable to use a single underlying transport (e.g., UDP) flow for all media, rather than one flow for each type of media are the following:

  • End-Point Port Consumption: A given IP address only has 16 bits of available port space per transport protocol for any consumer of ports that exists on the machine. This is normally never an issue for an end-user machine. It can become an issue for servers that have a large number of simultaneous flows. However, in RTCWEB, where we will use authenticated STUN requests, a server can serve multiple end-points from the same local port, and use the whole 5-tuple (source and destination address, source and destination port, protocol) as the identifier of flows. Thus, in theory, the minimum number of media server ports needed is the maximum number of simultaneous RTP sessions a single end-point may use, although in practice implementations will probably benefit from using more.
  • NAT State: If an end-point is behind a NAT, each flow it generates to an external address will result in state on that NAT. That state is a limited resource: in home or SOHO NATs it is limited by memory or processing, and in large scale NATs serving many internal end-points the available ports run out. We see this primarily as a problem for larger centralised NATs, where end-point independent mapping requires each flow mapping to use one port on the external IP address, thus limiting the maximum aggregation of internal users per external IP address. However, we would like to point out that an RTCWEB session with audio and video is likely to use 2 or 3 UDP flows. This can be contrasted with certain web applications that can result in 100+ TCP flows being opened to various servers. Admittedly, TCP mappings are recovered more quickly due to the explicit session teardown when they are no longer needed, but at the same time several web sites may be communicating simultaneously in different browser tabs. So the question is whether the UDP mapping space is as heavily used as the TCP mapping space, or whether TCP will continue to be the limiting factor for the number of internal users a particular NAT can support.
  • NAT Traversal taking additional time: When doing NAT/FW traversal it takes additional time to open additional ports. This time is spent in the phase between accepting to communicate and the media path being established, which is fairly critical. Following the specified ICE procedures, the best case for the extra time is 1.5*RTT + Ta*(Additional_Flows-1), where Ta is the pacing timer, which ICE specifies to be no smaller than 20 ms. That assumes a message in one direction, and then an immediate triggered check back. This is because ICE first finds one candidate pair that works before establishing multiple flows. Thus, there is no extra time until one has found a working candidate pair; from that point there is only the time it takes to establish the additional flows in parallel, which in most cases means 1 or 2 additional flows.
  • NAT Traversal Failure Rate: In cases where more than a single flow needs to be established through the NAT, there is some risk that one succeeds in establishing the first flow but fails with one or more of the additional flows. The risk of this happening is hard to quantify. However, that risk should be fairly low, as one has just successfully established one flow from the same interfaces. Thus only rare events, such as NAT resource overload or the selection of particular port numbers that are filtered, should be reasons for failure.
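
The best-case estimate for the extra NAT traversal time given in the list above, 1.5*RTT + Ta*(Additional_Flows-1), can be evaluated directly. In the Python sketch below the function and parameter names are ours, and Ta defaults to the 20 ms lower bound mentioned in the text.

```python
def extra_setup_time(rtt_s, additional_flows, ta_s=0.020):
    """Best-case extra setup time, in seconds, for additional flows.

    Implements 1.5*RTT + Ta*(Additional_Flows - 1) from the text.
    Returns 0.0 when no additional flows are needed.
    """
    if additional_flows < 1:
        return 0.0
    return 1.5 * rtt_s + ta_s * (additional_flows - 1)


# One or two additional flows over a path with a 100 ms round-trip time.
for flows in (1, 2):
    print(flows, round(extra_setup_time(0.100, flows) * 1000), "ms")
```

For a 100 ms round-trip time this gives about 150 ms for one additional flow and about 170 ms for two, which shows that the round-trip term dominates the pacing term for the small flow counts expected here.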


As we have noted in the preceding sections, implicit multiplexing of multiple RTP sessions onto a single transport flow raises a large number of backwards compatibility issues. It has been argued that these issues are either not important, since the RTP features disrupted are not of interest to the current set of RTC-Web use cases, or can be solved by somehow explicitly dividing the SSRC space into different regions for different RTP sessions. We believe the first argument is short-sighted: those RTP features may not be important today, but the successful deployment of simple RTC-Web applications will generate interest in trying more advanced scenarios, which may well need those features. Partitioning the SSRC space to separate RTP sessions results in a new set of issues, of which the biggest from our point of view is that it effectively creates a new variant of the RTP protocol, which is incompatible with standard RTP. Having two different variants of the core functionality of RTP will make it much more difficult to develop future protocol extensions, and the new variant will likely also have a different set of extensions that work. In addition, the two versions aren't directly interoperable, and will force anyone who wants to interconnect the two versions to deploy (complex) gateways. It also reduces the common user base and interest in maintaining and developing either version.

On the other hand, we are sympathetic to the argument that using a single transport flow does save some time in setup processing, will save some resources on NATs and FWs that are in between the communicating end-points, and may have a somewhat higher success rate of session establishment.

Thus we consider it required that RTP sessions are multiplexed using an explicit mechanism. We strongly recommend that this multiplexing be accomplished by using a unique UDP flow for each RTP session, on grounds of simplicity and interoperability. However, we can accept a WG consensus that using a single transport layer flow between peers is the default, with the fallback of using separate UDP flows also supported, under one constraint: that the RTP sessions are explicitly multiplexed in such a way that existing mechanisms or extensions to RTP are not prevented from working, and that the solution does not result in an alternative variant of RTP being created (i.e., it must not disrupt RTCP processing or the RTP semantics). In this latter case we recommend that some type of multiplexing layer is inserted between the UDP flow and the RTP/RTCP headers to separate the RTP sessions, since removing this shim layer and gatewaying to standard RTP sessions is simpler than trying to separate RTP sessions that are multiplexed together in order to gateway them to standard RTP sessions.
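
To illustrate why a shim layer is easy to gateway, the Python sketch below uses a purely hypothetical one-byte session identifier placed in front of each RTP or RTCP packet. This format is an assumption made for illustration only and is not a specified or proposed wire format; the point is that the RTP/RTCP packet itself is carried unmodified.

```python
def mux(session_id, rtp_or_rtcp_packet):
    """Prefix a packet with a hypothetical one-byte session identifier."""
    if not 0 <= session_id <= 255:
        raise ValueError("session_id must fit in one byte")
    return bytes([session_id]) + rtp_or_rtcp_packet


def demux(datagram):
    """Split a datagram into (session_id, original RTP/RTCP packet)."""
    if len(datagram) < 1:
        raise ValueError("empty datagram")
    return datagram[0], datagram[1:]


# Two sessions share one UDP flow; the payloads here are placeholders.
audio_datagram = mux(0, b"audio-rtp-packet")
video_datagram = mux(1, b"video-rtp-packet")
print(demux(audio_datagram))
print(demux(video_datagram))
```

A gateway to standard RTP sessions only needs to strip the prefix and forward the remaining bytes on the UDP flow for that session; no SSRC, RTCP or payload type rewriting is involved.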