Multiplexing and RTP Sessions
7 August 2011
The following is extracted from
version -02 of our RTP Requirements for RTC-Web draft, and
describes some of the issues relating to RTP session multiplexing.
This was discussed at IETF 81 in Quebec City. The linked set of
slides on Multiplexing RTP
Sessions was prepared for the RTC-Web working group session at
that meeting, but were not presented to the working group: discussion
in a break-out meeting covered the key points.
Expected Topologies
As RTC-Web is focused on peer-to-peer connections established from
clients in web browsers, the following topologies, further discussed
in RTP Topologies
(RFC5117), are primarily considered. The topologies are depicted
and briefly explained here for the reader's convenience.
The point to point topology (Figure 1) is going to be very common in
single user to single user applications.
For small multiparty sessions it is practical enough to create RTP
sessions by letting every participant send individual unicast RTP/UDP
flows to each of the other participants (Figure 2). This is called
multi-unicast and is unfortunately not discussed in the
RTP Topologies RFC.
This topology has the benefit of not requiring central nodes. The
downside is that it increases the bandwidth used at each sender, by
requiring one copy of the media streams for each participant that is
part of the same session beyond the sender itself. Thus this is
limited to scenarios with few end-points unless the media is very low
bandwidth.
It needs to be noted that, if this topology is to be supported by the
RTC-Web framework, it needs to be possible to connect one RTP session
to multiple peer-to-peer flows that are individually
established.
An RTP mixer (Figure 3) is a centralised point that selects or mixes
content in a conference to optimise the RTP session so that each
end-point only needs to connect to one entity, the mixer. The mixer also
reduces the bit-rate needs, as the media sent from the mixer to the
end-point can be optimised in different ways. These optimisations
include methods like only choosing media from the currently most
active speaker, or mixing together audio so that only one audio stream
is required instead of 3 in the depicted scenario. The downside of
the mixer is that someone is required to provide the actual mixer.
If one wants a less complex central node, it is possible to use a
relay (called a Transport Translator) (Figure 4) that takes on the
role of forwarding the media to the other end-points but doesn't
perform any media processing. It simply forwards the media from each
end-point to all the others. Thus an end-point A will only need to send
its media once to the relay, but it will still receive 3 RTP streams
with the media if B, C and D are all currently transmitting.
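The trade-offs between the multi-unicast, mixer and relay topologies described above can be made concrete by counting streams per end-point. The following is a minimal sketch, assuming that every participant transmits one media stream continuously and that the mixer combines everything into a single outgoing stream; the function name and topology labels are illustrative only.

```python
from typing import Tuple


def streams_per_endpoint(topology: str, participants: int) -> Tuple[int, int]:
    """Return (streams sent, streams received) by one end-point.

    Assumes every participant sends one stream all the time, and that
    a mixer delivers a single combined stream to each end-point.
    """
    others = participants - 1
    if topology == "multi-unicast":
        return others, others  # one copy to, and one stream from, each peer
    if topology == "mixer":
        return 1, 1            # one stream up, one mixed stream down
    if topology == "relay":
        return 1, others       # one stream up, every other stream forwarded
    raise ValueError(f"unknown topology: {topology}")


for name in ("multi-unicast", "mixer", "relay"):
    sent, received = streams_per_endpoint(name, participants=4)
    print(f"{name:13s} sent={sent} received={received}")
```

With four participants (A, B, C and D), the relay case gives one stream sent and three received, matching the description of Figure 4.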
To support a legacy end-point (B) that doesn't fulfil the requirements of
RTC-Web, it is possible to insert a Translator (Figure 5) that takes
on the role of ensuring that, from A's perspective, B looks like a fully
compliant end-point. Thus it is the combination of the Translator
and B that looks like the end-point B. The intention is that the
presence of the translator is transparent to A; however, it is not
certain that this is possible. Thus this case is included so that it can
be discussed whether any mechanism specified for use in RTC-Web
results in such issues, and how to handle them.
RTP Multiplexing Points
There are three fundamental points of multiplexing within the
RTP framework:
- Use of separate RTP Sessions:
The first, and the most important, multiplexing point is the RTP
session. This multiplexing point does not have an identifier within
the RTP protocol itself, but instead relies on the lower layer to
separate the different RTP sessions. This is most often done by
separating different RTP sessions onto different UDP ports, or by
sending to different IP multicast addresses. The distinguishing
feature of an RTP session is that it has a separate SSRC identifier
space; a single RTP session can span multiple transport connections
provided packets are gatewayed such that participants are known to
each other. Different RTP sessions are used to separate different
types of media within a multimedia session. For example, audio and
video flows are sent on separate RTP sessions. But completely
different usages of the same media type, e.g. video of the presenter
and the slide video, also benefit from being separated.
- Multiplexing using the SSRC within an RTP session:
The second multiplexing point is the SSRC that separates different
sources of media within a single RTP session. An example might be
different participants in a multiparty teleconference, or different
camera views of a presentation. In most cases, each participant
within an RTP session has a single SSRC, although this may change
over time if collisions are detected. However, in some more complex
scenarios participants may generate multiple media streams of the
same type simultaneously (e.g., if they have two cameras, and so send
two video streams at once) and so will have more than one SSRC in use
at once. The RTCP CNAME can be used to distinguish between a single
participant using two SSRC values (where the RTCP CNAME will be the
same for each SSRC), and two participants (who will have different
RTCP CNAMEs).
- Multiplexing using the Payload Type within an RTP session:
If different media encodings of the same media type (audio, video,
text, etc) are to be used at different times within an RTP session,
for example a single participant that can switch between two
different audio codecs, the payload type is used to identify how the
media from that particular source is encoded. When changing media
formats within an RTP Session, the SSRC of the sender remains
unchanged, but the RTP Payload Type changes to indicate the change in
media format.
These multiplexing points are a fundamental part of the design of RTP
and are discussed in Section 5.2 of [RFC3550].
Of special importance is the need to separate different RTP sessions
using a multiplexing mechanism at some lower layer than RTP, rather
than trying to combine several RTP sessions implicitly into one lower
layer flow. This will be further discussed in the next section.
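To illustrate the three multiplexing points, the sketch below parses the fixed RTP header defined in RFC 3550 and classifies a packet by session, source and encoding. It assumes that the RTP session is identified by the local UDP port the packet arrived on; the `classify` helper, the port number and the example packet are hypothetical.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-octet fixed RTP header (RFC 3550, Section 5.1)."""
    if len(packet) < 12:
        raise ValueError("packet is shorter than the fixed RTP header")
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if first >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    # The low 7 bits of the second octet carry the payload type.
    return RtpHeader(second & 0x7F, seq, timestamp, ssrc)


def classify(local_port: int, packet: bytes) -> Tuple[int, int, int]:
    """Return (RTP session, source, encoding) for a received packet.

    The session comes from the transport (here the local UDP port),
    never from a field inside the RTP header.
    """
    header = parse_rtp_header(packet)
    return local_port, header.ssrc, header.payload_type


# Hypothetical packet: version 2, payload type 96, SSRC 0x1234ABCD.
example = struct.pack("!BBHII", 0x80, 96, 1000, 160, 0x1234ABCD)
session, ssrc, payload_type = classify(5004, example)
print(session, hex(ssrc), payload_type)
```

Note that nothing in the header itself names the session, which is why the separation has to happen at a lower layer.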
RTP Session Multiplexing
In today's networks, with prolific use of Network Address Translators
(NAT) and Firewalls (FW), there is a desire to reduce the number of
transport layer ports used by a real-time media application using
RTP. This has led some to suggest multiplexing two or more RTP
sessions on a single transport layer flow, using either the Payload
Type or SSRC to demultiplex the sessions, in violation of the rules
outlined above. This is not the first time people have looked at RTP and
questioned the need for using RTP sessions for different media types,
and even more the potential need to separate different media streams
of the same type into different sessions due to their different
purposes. Section 5.2 of [RFC3550]
outlines some of those problems; we elaborate on that discussion, and
on other problems that occur if one violates this part of the RTP
design and architecture.
Why RTP Sessions Should be Demultiplexed by the Transport
As discussed in Section 5.2 of [RFC3550],
multiplexing several RTP sessions (e.g., audio and video) onto a
single transport layer flow introduces the following problems:
- Payload Identification:
If two RTP sessions of the same type are multiplexed onto a single
transport layer flow using the same SSRC but relying on the Payload
Type to distinguish the session, and one were to change encodings and
thus acquire a different RTP payload type, there would be no general
way of identifying which stream had changed encodings. This can be
avoided by partitioning the SSRC space between the two sessions, but
that causes other problems as discussed below.
- Timing and Sequence Number Space:
An RTP SSRC is defined to identify a single timing and sequence
number space. Interleaving multiple payload types would require
different timing spaces if the media clock rates differ and would
require different sequence number spaces to tell which payload type
suffered packet loss. Using multiple clock rates in a single RTP
session is problematic, as discussed in
[I-D.ietf-avtext-multiple-clock-rates].
This can be avoided by partitioning the SSRC space between the two
sessions, but that causes other problems as discussed below.
- RTCP Reception Reports:
RTCP sender reports and receiver reports can only describe one timing
and sequence number space per SSRC, and do not carry a payload type
field. Multiplexing sessions based on the payload type breaks RTCP.
This can be avoided by partitioning the SSRC space between the two
sessions, but that causes other problems as discussed below.
- RTP Mixers:
Multiplexing RTP sessions of incompatible media type (e.g., audio and
video) onto a single transport layer flow breaks the operation of RTP
mixers, since they are unable to combine the flows together.
- RTP Translators:
Multiplexing RTP sessions of incompatible media type (e.g., audio and
video) onto a single transport layer flow breaks the operation of
some types of RTP translator, for example media transcoders, which
rely on the RTP requirement that all media are of the same type.
- Quality of Service:
Carrying multiple media in one RTP session precludes the use of
different network paths or flow-based network resource allocations
where these would be appropriate. It also makes reception of a subset of the
media, for example just audio if video would exceed the available
bandwidth, difficult without the use of an RTP translator within the
network to filter out the unwanted media. Unless such translators are
trusted devices (and included in the key exchange), this is
difficult to combine with media security functions.
- Separate Endpoints:
Multiplexing several sessions into one transport layer flow prevents
use of a distributed endpoint implementation, where audio and video
are rendered by different processes and/or systems.
We do note that some of the above issues are resolved as long as
there is explicit separation of the RTP sessions when transported
over the same lower layer transport, for example by inserting a
multiplexing layer in between the lower transport and the RTP/RTCP
headers. But a number of the above issues are not resolved by this.
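The sequence number problem listed above can be shown with a small sketch. It assumes a deliberately naive loss estimator that only looks at gaps between consecutive sequence numbers (this is not the RFC 3550 algorithm, and it ignores wrap-around and reordering); the sequence values are made up for illustration.

```python
def count_gaps(sequence_numbers):
    """Count packets a receiver would assume lost, judging only by
    forward gaps between consecutive sequence numbers."""
    lost = 0
    for previous, current in zip(sequence_numbers, sequence_numbers[1:]):
        if current > previous + 1:
            lost += current - previous - 1
    return lost


audio = [100, 101, 102, 103]      # audio session, nothing lost
video = [5000, 5001, 5002, 5003]  # video session, nothing lost

# Separate RTP sessions: each sequence number space is checked on its own.
separate = count_gaps(audio) + count_gaps(video)

# Same SSRC on one transport flow: the receiver sees a single,
# interleaved sequence number space and reports losses that never happened.
interleaved = [seq for pair in zip(audio, video) for seq in pair]
merged = count_gaps(interleaved)

print("separate sessions:", separate)
print("interleaved under one SSRC:", merged)
```

No packet is lost in either case, yet the interleaved stream produces a large apparent loss, which is the kind of confusion that RTCP reception reports would then propagate.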
In the RTCWEB context, i.e. web browsers running on various
end-points, it might appear unlikely that flow-based QoS is available on
the end-points that will support RTCWEB. We don't disagree
that, in the common case, users in their home networks or at WiFi
hotspots are unlikely to have flow-based QoS available.
However, if one considers enterprise users, especially those using intranet
applications, the availability of QoS and the desire to use it are not
implausible. There are also web users who use networks that are more
resource-constrained than wired networks and WiFi networks, for
example cellular networks. The current access network QoS mechanisms
for user traffic in cellular technology from 3GPP are flow based.
RTP's design hasn't been changed, although session multiplexing
related topics have been discussed at various points in RTP's 20 year
history. The fact is that numerous RTP mechanisms and extensions have
been defined assuming that one can perform session multiplexing when
needed. Mechanisms that have been identified as problematic if one
doesn't do session separation are:
- Scalability:
RTP was built with media scalability in consideration. The simplest
way of achieving separation between different scalability layers is
placing them in different RTP sessions, and using the same SSRC and
CNAME in each session to bind them together. This is most commonly
done in multicast, and not particularly applicable to RTC-Web, but
gatewaying of such a session would then require more alterations and
likely stateful translation.
- RTP Retransmission in Session Multiplexing mode:
RTP Retransmission does
have a mode for session multiplexing. This would not be the main
mode used in RTC-Web, but support for different RTP sessions is
beneficial for interoperability and reduced cost in translation.
- Forward Error Correction:
The RTP Payload Format
for Generic Forward Error Correction and
its update can only be used on media formats that produce RTP
packets that are smaller than half the MTU if the FEC flow and the media
flow being protected are to be sent in the same RTP session; this is
due to
RTP Payload for Redundant Audio Data. This is because the SSRC
value of the original flow is recovered from the FEC packet's SSRC
field. So anything that wants to use these formats with RTP
payloads that are close to the MTU needs to put the FEC data in a
separate RTP session from the original transmissions. The
usage of this type of FEC data has not been decided on in
RTC-Web.
- SSRC Allocation and Collision:
The SSRC identifier is a random 32-bit number that is required to be
globally unique within an RTP session, and that is reallocated to a
new random value if an SSRC collision occurs between participants.
If two or more RTP sessions share a transport layer flow, there is no
guarantee that their choice of SSRC values will be distinct, and
there is no way in standard RTP to signal which SSRC values are used
by which RTP session. RTP is explicitly a group-based communication
protocol, and new participants can join an RTP session at any time;
these new participants may choose SSRC values that conflict with the
SSRC values used in any of the multiplexed RTP sessions. This
problem can be avoided by partitioning the SSRC space, and signalling
how the space is to be subdivided, but this is not backwards
compatible with any existing RTP system. In addition, subdividing
the SSRC space makes it difficult to gateway between multiplexed RTP
sessions and standard RTP sessions: the standard sessions may use
parts of the SSRC space reserved in the multiplexed RTP sessions,
requiring the gateway to rewrite RTCP packets, as well as the SSRC
and CSRC list in RTP data packets. Rewriting RTCP is a difficult
task, especially when one considers extensions such as RTCP XR.
- Conflicting RTCP Report Types:
The extension mechanisms used in RTCP depend on separation of RTP
sessions for different media types. For example, the RTCP Extended
Report block for VoIP is suitable for conversational audio, but
clearly not useful for Video. This may cause unusable or unwanted
reports to be generated for some streams, wasting capacity and
confusing monitoring systems. While this problem may be unlikely
for VoIP reports, it may be an issue for the more detailed media
agnostic reports which are sometimes used for different media
types. Also, this makes the implementation of RTCP more complex,
since partitioning the SSRC space by media type needs to be done not
only on the media processing side, but also on the RTCP reporting side.
- RTCP Reporting and Scheduling:
The RTCP reporting interval and its packet scheduling will be
affected if several RTP sessions are multiplexed onto the same
transport layer flow. The reporting interval is determined by the
session bandwidth, and the reporting interval chosen for a high-rate
video session will be different to the interval chosen by a low-rate
VoIP session. If such sessions are multiplexed, then participants in
one session will see the SSRC values of the other session. This will
cause them to overestimate the number of participants in the session
by a factor of two, thus doubling their RTCP reporting interval, and
making their feedback less timely. In the worst case, when an RTP
session with very low RTCP bandwidth is multiplexed with an RTP
session with high RTCP bandwidth, this may cause repeated RTCP timer
reconsideration, leading to the members of the low bandwidth session
timing out. Participants in an RTP session configured with high
bandwidth (and short RTCP reporting interval) will see RTCP reports
from participants in the low bandwidth session much less often than
expected, potentially causing them to repeatedly time out and
re-create state for those participants. The split of RTCP bandwidth
between senders and receivers (where at least 25% of the RTCP
bandwidth is allocated to senders) will be disrupted if a session
with few senders (e.g., a VoIP session) is multiplexed with a session
with many senders (e.g., a video session). These issues can be
resolved if the partition of the SSRC space is signalled, but this is not
backwards compatible with any existing RTP system. The partition
would require re-implementing a large part of the RTCP processing to
take the individual sessions into account.
- Sampling Group Membership:
The mechanism defined in
RFC2762 to sample the group membership, allowing participants to
keep less state, assumes a single flat 32-bit SSRC space, and breaks
if the SSRC space is shared between several RTP sessions.
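The effect on RTCP scheduling described above can be sketched numerically. This is a simplified model, assuming only the deterministic core of the RFC 3550 interval calculation (group size times average RTCP packet size, divided by the RTCP bandwidth); it omits the minimum interval, the randomisation, and the sender/receiver bandwidth split, and the numbers are made up for illustration.

```python
def rtcp_interval(members: int, avg_rtcp_size: float, rtcp_bandwidth: float) -> float:
    """Simplified deterministic RTCP reporting interval in seconds.

    avg_rtcp_size is in octets, rtcp_bandwidth in octets per second.
    """
    return members * avg_rtcp_size / rtcp_bandwidth


# Hypothetical audio session: 10 participants, 100-octet RTCP packets,
# and 400 octets/s of RTCP bandwidth (5% of a 64 kbit/s session).
alone = rtcp_interval(members=10, avg_rtcp_size=100, rtcp_bandwidth=400)

# Multiplexed with a second session of the same size: each participant
# also sees the other session's SSRCs and counts 20 members.
multiplexed = rtcp_interval(members=20, avg_rtcp_size=100, rtcp_bandwidth=400)

print(f"separate: {alone} s, multiplexed: {multiplexed} s")
```

Doubling the apparent group size doubles the interval, so feedback in the multiplexed case arrives half as often as the session was configured for.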
As can be seen, the requirement that separate RTP sessions are
carried in separate transport-layer flows is fundamental to the
design of RTP. Due to this design principle, implementors of various
services or applications using RTP have not commonly violated this
model, and have separated RTP sessions onto different transport layer
flows. After 15 years of deployment of RTP in its current form, any
move to change this assumption must carefully consider the backwards
compatibility problems that this will cause. In particular, since
widespread use of multiplexed RTP sessions in RTC-Web will almost
certainly cause their use in other scenarios, the discussion
regarding compatibility must be wider than just whether multiplexing
works for the extremely limited subset of RTP use cases currently
being considered in the RTC-Web group. Any such multiplexing
extension to RTP must therefore be developed by the AVTCORE working
group, since it has much broader applicability and scope than RTC-
Web.
Arguments for a single transport flow
The arguments we are aware of for why it is desirable to use a single
underlying transport (e.g., UDP) flow for all media, rather than one
flow for each type of media are the following:
- End-Point Port Consumption:
A given IP address only has 16 bits of available port space per
transport protocol for any consumer of ports that exists on the
machine. This is normally never an issue for an end-user machine. It
can become an issue for servers that have a large number of simultaneous
flows. However, in RTCWEB, where we will use authenticated STUN
requests, a server can serve multiple end-points from the same local
port, and use the whole 5-tuple (source and destination address,
source and destination port, protocol) as the identifier of flows. Thus,
in theory, the minimal number of media server ports needed is the
maximum number of simultaneous RTP sessions a single end-point may
use, although in practice implementations probably benefit from using
more.
- NAT State:
If an end-point is behind a NAT, each flow it generates to an external
address will result in state on that NAT. That state is a limited
resource: in home or SOHO NATs the limit is memory or processing,
while in large scale NATs serving many internal end-points
the available ports run out. We see this primarily as a problem for
larger centralised NATs, where end-point independent mapping does
require each flow mapping to use one port on the external IP
address, thus affecting the maximum aggregation of internal users
per external IP address. However, we would like to point out that an
RTCWEB session with audio and video is likely to use 2 or 3 UDP
flows. This can be contrasted with certain web applications
that can result in 100+ TCP flows being opened to various servers.
Admittedly, those are recovered more quickly due to the explicit session
teardown when no longer needed; at the same time, more web sites may be
communicated with simultaneously in various browser tabs. So the question
is whether the UDP mapping space is as heavily used as the TCP mapping
space, or whether TCP will continue to be the limiting factor for the
number of internal users a particular NAT can support.
- NAT Traversal taking additional time:
When doing NAT/FW traversal it takes additional time to open
additional ports. This time is spent in the phase of communication
between accepting to communicate and the media path being established,
which is fairly critical. Following the specified ICE procedures, the
best case for how much extra time it can take is
1.5*RTT + Ta*(Additional_Flows-1), where Ta is the pacing timer,
which ICE specifies to be no smaller than 20 ms. That assumes a
message in one direction, and then an immediate triggered check back.
This is because ICE first finds one candidate pair that works before
establishing multiple flows. Thus, there is no extra time until one has
found a working candidate pair; from that point it is only the time it
takes to establish the additional flows in parallel, which in most cases
are 1 or 2 more flows.
- NAT Traversal Failure Rate:
In cases when one needs more than a single flow to be established
through the NAT, there is some risk that one succeeds in establishing
the first flow but fails with one or more of the additional flows.
The risk that this happens is hard to quantify. However, that risk
should be fairly low, as one has just successfully established
one flow from the same interfaces. Thus only rare events, such as NAT
resource overload or the selection of particular port numbers that are
filtered, should be reasons for failure.
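As a worked example of the best-case formula given under NAT traversal time, 1.5*RTT + Ta*(Additional_Flows-1), the sketch below evaluates it for a hypothetical 100 ms round-trip time and the minimum pacing timer of 20 ms. Returning zero when there are no additional flows is an assumption added here, not something the formula itself states.

```python
def extra_setup_time_ms(rtt_ms: float, additional_flows: int, ta_ms: float = 20.0) -> float:
    """Best-case extra setup time: 1.5*RTT + Ta*(Additional_Flows - 1)."""
    if additional_flows < 1:
        return 0.0  # assumption: a single flow incurs no extra time
    return 1.5 * rtt_ms + ta_ms * (additional_flows - 1)


for flows in (1, 2):
    extra = extra_setup_time_ms(rtt_ms=100.0, additional_flows=flows)
    print(f"{flows} additional flow(s): {extra} ms extra")
```

Under these assumptions, one additional flow costs 150 ms and two cost 170 ms: the round-trip time dominates, and each further flow only adds one pacing interval.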
Summary
As we have noted in the preceding sections, implicit multiplexing of
multiple RTP sessions onto a single transport flow raises a large
number of backwards compatibility issues. It has been argued that
these issues are either not important, since the RTP features
disrupted are not of interest to the current set of RTC-Web use
cases, or can be solved by somehow explicitly dividing the SSRC space
into different regions for different RTP sessions. We believe the
first argument is short-sighted: those RTP features may not be
important today, but the successful deployment of simple RTC-Web
applications will generate interest to try more advanced scenarios,
which may well need those features. Partitioning the SSRC space to
separate RTP sessions results in a new set of issues, where the biggest
from our point of view is that it effectively creates a new variant
of the RTP protocol, which is incompatible with standard RTP. Having
two different variants of the core functionality of RTP will make it
much more difficult to develop future protocol extensions, and the
new variant will likely also have a different set of extensions that
work. In addition, the two versions aren't directly interoperable,
and will force anyone that wants to interconnect the two versions to
deploy (complex) gateways. It also reduces the common user base and
interest in maintaining and developing either version.
On the other hand, we are sympathetic to the argument that using a
single transport flow does save some time in setup processing, will
save some resources on NATs and FWs that are in between the
communicating end-points, and may have a somewhat higher success rate of
session establishment.
Thus we consider it required that RTP sessions are
multiplexed using an explicit mechanism. We strongly recommend
that the mechanism used to accomplish this multiplexing is to use
unique UDP flows for each RTP session, based on simplicity and
interoperability. However, we can accept a WG consensus that using a
single transport layer flow between peers is the default, and that
the fallback of using separate UDP flows is also supported, under
one constraint: that the RTP sessions are explicitly multiplexed in
such a way that existing mechanisms or extensions to RTP are not prevented
from working, and that the solution does not result in an alternative
variant of RTP being created (i.e., it must not disrupt RTCP processing
and the RTP semantics). In this latter case we recommend
that some type of multiplexing layer is inserted between the UDP flow and
the RTP/RTCP headers to separate the RTP sessions, since removing
this shim layer and gatewaying to standard RTP sessions is simpler
than trying to separate RTP sessions that are multiplexed together to
gateway them to standard RTP sessions.
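To show why a shim layer is simple to remove at a gateway, here is a purely hypothetical sketch in which each RTP or RTCP packet is prefixed with a one-octet session identifier. This format is an assumption made for illustration only; it is not defined by the draft or by any RTP specification, and the function names are invented.

```python
from typing import Tuple


def add_shim(session_id: int, packet: bytes) -> bytes:
    """Prefix an RTP or RTCP packet with a one-octet session identifier."""
    if not 0 <= session_id <= 255:
        raise ValueError("session_id must fit in one octet")
    return bytes([session_id]) + packet


def remove_shim(datagram: bytes) -> Tuple[int, bytes]:
    """Split a received datagram into (session identifier, original packet)."""
    if len(datagram) < 1:
        raise ValueError("empty datagram")
    return datagram[0], datagram[1:]


# Placeholder 12-octet RTP header used only to exercise the round trip.
rtp_packet = bytes.fromhex("8060000100000000deadbeef")
session_id, recovered = remove_shim(add_shim(1, rtp_packet))
print(session_id, recovered == rtp_packet)
```

Because the RTP and RTCP packets are carried unmodified, a gateway to standard RTP sessions only has to strip the prefix and forward the packet on the transport flow for that session; no SSRC or RTCP rewriting is involved.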