2025-10-23 • iOS, iPadOS, macOS, tvOS, visionOS • 5:28:55
This event features sessions, community discussions, and a live Q&A with members of the Apple Streaming Media engineering team. Learn about the latest updates to HTTP Live Streaming (HLS) across all platforms, as well as HLS.js, the AVFoundation framework, Immersive Video, and Spatial Audio.
Speakers: Sergey Maliuk, Roger Pantos, Julia Xu, Ashank Verma, David Wang, Rob Walch, Dr. Deep Sen, Dr. Tejaswi Nanjundaswamy, Mickaël Raulet, Zachary Cava, Yuriy A. Reznik, Sumaya
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Hello everyone. And welcome to the Apple Developer Center. I'm so grateful you all have joined us today online and in person. We are here to talk about HLS and media technologies. My name is Sergey. I am a media technologies evangelist here at Apple. I recently joined the company, so I'm pretty new here. But I have been working on media technologies for over 20 years now.
Before we begin, I want to share a bit about this building and today's schedule. This developer center is part of Apple Park in Cupertino and one of four developer centers around the world. In our developer centers, we host designers and developers for sessions, labs, and workshops. This room is Big Sur. It's designed to support a range of activities, including in-person presentations, studio recordings, and live broadcasts. There are also labs, briefing rooms, and conference rooms that allow us to host various activities.
A quick note for everyone here in person. There is an open Apple Wi-Fi network you can use to stay connected throughout the day. No password is required. There are also power outlets in front of every seat. They are located at the front of the armrest. And now, these presentations are meant to be special just for you, both online and in person. So, I ask you all to refrain from recording video or live streaming during presentations. We have that covered for you. This event is currently live on YouTube and the developer website. And you're welcome to take photos throughout the day, so you won't miss a thing. And with that small request out of the way, let's check out the schedule, because we have a lot of stuff lined up for you. And just a quick note, this schedule is in Pacific Standard Time. We'll start the day with a presentation from the Apple Streaming Media team. Roger Pantos will start by talking about his proposal to change the HLS specification to rotate content keys more efficiently. Then we'll have a presentation covering the latest updates to HLS technologies, such as coordinated playback and interstitial skip control and preload. We'll have a short break at 11:00 a.m. And after the break, there'll be two more sessions: how to enhance analytics with the AVMetrics framework, and HLS.js player updates. At 12:00, a healthy lunch will be served upstairs.
And in the afternoon, we'll start with a session on the APAC codec and spatial audio. Another exciting part of the day will be community topics. We'll host presentations from Disney, ATEME, and SVTA on interstitials and content steering. There will be another short break at 2:15 after the community topics.
And after the break, we'll host a Q&A session. This is our favorite part, where we'll answer the questions you submitted throughout the day. And if you are joining us in person, please scan this QR code to submit your questions. And for those watching us online, there is a button next to the video player which opens the Q&A panel. There, you can also upvote questions asked by others. That will help us to prioritize the questions. And please note, we'll be answering the questions about all the sessions at the end of the day. So for example, if you ask a question now or during a morning session, it will be answered in the Q&A session in the afternoon. The only exception will be the first presentation by Roger Pantos. And finally, there will be a mixer event at the end of the day. It's your opportunity to connect with fellow colleagues from other companies as well as chat with Apple engineers. There will be refreshments served in the lobby area outside of the theater. And now, let's welcome our first presenter, Roger Pantos, with his proposal on content key rotation. Roger, welcome to the HLS Summit 2025. Thank you, Sergey.
Scale content key rotation
Hi, everyone. So in this session, I'm going to propose a change to HLS to make it easier to rotate content keys in large-scale live streams. There is going to be some time for Q&A at the end of this session. So if you scan this QR code, it'll take you to the questions page. If you're watching online, you can hit the Q&A button, and that'll let you post your questions during the presentation, and we'll go through them at the end. So, first I'm going to talk about how key rotation works in HLS. Then I'll go over the challenge that shows up when you've got tens of thousands, hundreds of thousands, or even millions of viewers watching live. And then, finally, I'll cover the proposed solution.
So, key rotation. In HLS, when segments are encrypted, the decryption key is specified by prefixing the segments with a key tag containing a key URL, as you can see in this example on the right. To switch to using a different key in the same playlist, a new key tag is placed before the first segment that uses that new key. When clients are playing a live HLS stream, they discover new segments by reloading the playlist. That means that when the server switches to using a new key, the clients will see that new key tag when the first segment using the new key appears following a playlist reload. And not a moment sooner.
This can create a scaling challenge. Let's go back to our live stream example on the right to illustrate that. In live, usually all the clients are playing at the live edge. A new segment is added to the playlist every six seconds in this case, and the clients are continuously reloading the playlist to discover them.
Now, in regular live HLS, the discovery of that new playlist is distributed pretty evenly over that six-second target duration because clients are polling independently for updates. Now, it turns out six seconds is bad enough, but in low latency HLS, all the clients will get that same playlist update at about the same time because of blocking playlist reload. So that spread, that interval, is actually much smaller. Now, consider what happens when a new key tag appears.
Most players will make a key request as soon as they see it, which they will all do over that same short interval. So, say there are 100,000 live viewers. All of them will ask for that new key over that six-second interval, or, in low latency, maybe all at the same time. And this is our scaling problem. Because the higher the load on a server at any given time, the more server capacity you need to handle it.
And for a number of reasons, key servers in particular are difficult to scale just by adding server capacity. And of course, in practical terms, difficult means expensive. So we fortunately already have a strategy for dealing with this kind of problem, one that has worked out pretty well in other contexts, and that is to spread the key requests over a longer period of time. And that enables the same server to handle a larger number of clients. HLS already has a notion of preloading using the preload hint tag. The preload hint tag is used to preannounce a resource that will be needed for playback, enabling a client to load it in advance. Now, for our proposal here to allow this tag to preload content keys, we need to add two things to it. Number one is some new attributes that we're going to borrow from the key tag, which clients will need to actually make the key request when all they see is the preload hint. And number two is a way to indicate how long it will be before the key is needed, so that the clients can distribute their requests over that interval.
Here are the syntax changes. Today, the preload hint tag has two attributes: the type of the resource that is available for preload and the URL where it can be found. To these, we propose to add, first, a date of first use attribute, and that will define the request period. And then next, three attributes we get from the key tag: the method, the key format, and the key format versions, and that will enable clients to actually make that key request when they see it. The date of first use attribute will be optional. If it's not there, it just means the client should try to load the key as soon as they can. Note that if you do use date of first use, that does require that your playlist also contain program date time tags. Here's an example as it might appear in a playlist.
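The example playlist on the slide isn't captured in the transcript. As a rough sketch, such a tag could look something like the following; the attribute spellings (in particular DATE-OF-FIRST-USE) and the key URL are assumptions based on the descriptions in this proposal, not final syntax:

```
#EXT-X-PRELOAD-HINT:TYPE=KEY,URI="https://keys.example.com/key8",METHOD=SAMPLE-AES,KEYFORMAT="com.apple.streamingkeydelivery",KEYFORMATVERSIONS="1",DATE-OF-FIRST-USE="2025-10-23T14:00:00Z"
```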
Note that it has a type of key, and it has some attributes: the key method attribute, and there is a date of first use that tells us the predicted point where we think key 8 will first be needed.

Next I'm going to talk about what a playlist packager needs to do to offer key preloading. First, it needs to add the preload hint tag to the playlist in advance. Now, how far in advance is up to the server, but we recommend that it get put there just as soon as the key server is able to handle a request for the key, and that will maximize the period over which the clients can distribute their requests. The preload hint tag should use the same URL that will eventually land in the key tag once that appears. If the same key is being offered for multiple key formats, such as FairPlay and Widevine, then the packager should add a different preload hint tag for each one. The date of first use attribute should be set to the date of the first segment that's scheduled to use the new key. Since this is a live stream, more segments may get added to the playlist using the current key that's already in the playlist. While that is happening, the packager should retain that preload hint tag in the playlist and keep it there until the first segment that uses the new key appears. Once that first segment that uses the new key appears, the packager should remove the preload hint tag and replace it with the new key tag and that first segment. So you can see here in our playlist on the right that the preload hint tag is gone, and in its place we have the new key tag with the same key URL.

So, that's the packager. On the playback side of things, here is what clients need to do to preload keys. First, when a preload hint tag appears, they should estimate the current date. And they can do that by extrapolating from the last program date time tag they've seen in the playlist up until the end of the playlist, the last segment of the playlist, and that will give them an estimate of the date that the packager finished writing the last segment that's currently in the playlist. If the playlist is being polled by the client, so like regular live HLS, then the player should increase its estimate of the current date by whatever polling interval it's using, since the playlist might have been written that much earlier to the server.
Similarly, if the playlist response has an age header, then the player should add that age value to its current date as well, to compensate for the time that the playlist has sat in an HTTP cache before the client obtained it. Once the client has a good estimate of the server's current date, it can then figure out how long it has until the key will be needed by subtracting the current date from the date of first use. And once it knows how long it has, then it can pick a random time between now and then to issue a key request. That will result in all the client requests being distributed fairly equally across the preload interval, so the server doesn't need to handle them all at once. So that's the idea.
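To make that concrete, here is a minimal Swift sketch of the client-side scheduling just described. The function name and inputs are hypothetical; it assumes the app has already parsed the date of first use and the last program date time tag from the playlist, and it simply picks a uniformly random request delay within the remaining window:

```swift
import Foundation

// Hypothetical helper, with inputs assumed to be parsed from the playlist by the app.
func keyRequestDelay(dateOfFirstUse: Date,
                     lastProgramDateTime: Date,
                     durationAfterLastPDT: TimeInterval, // media time from that tag to the playlist end
                     pollingInterval: TimeInterval,      // 0 for blocking playlist reload
                     responseAge: TimeInterval) -> TimeInterval {
    // Estimate the server's current date: extrapolate from the last program
    // date time tag to the end of the playlist, then add the polling interval
    // and any Age header value reported by an HTTP cache.
    let estimatedServerDate = lastProgramDateTime
        .addingTimeInterval(durationAfterLastPDT + pollingInterval + responseAge)

    // Time remaining until the key is predicted to be needed.
    let window = max(0, dateOfFirstUse.timeIntervalSince(estimatedServerDate))

    // Pick a random point in that window so requests spread evenly across clients.
    return window > 0 ? TimeInterval.random(in: 0...window) : 0
}
```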
We introduce a new type of preload hint tag, and have clients use that to spread out their key requests over a longer interval in order to give the key server enough time to handle them. Thank you for letting me walk you through it. Now that you've seen the proposal, I'm interested in hearing your feedback. If any of you guys operate live encrypted streams, I'd like you to just take a second and think about whether this approach is going to work for your situation. We have a few minutes now, so if you'd like to ask for any clarifications about how this would work, or bring up any problematic scenarios you foresee, then please add them to the Q&A. Sergey, can we actually check to see if we've accumulated any questions as we've been talking here? Yeah, can you please move over there? Sure. Oh, yeah. Give people the QR code. Snap the QR. And thank you for the proposal.
It looks solid. Thank you. All right. Well, you guys have submitted your questions. I have some questions of my own. So you said that there can only be one preload hint tag in a playlist at once. Does it mean that this approach can only hint one key at a time? Yes. So as I said earlier, the preload hint tag is a pre-existing part of the HLS spec, and the spec says you can only have one preload hint tag for every resource type in the playlist at any given time. We'll modify that a little bit for different key formats, but the rule will be just one preload-hinted key at a time. And we think that that is sufficient. We'd like to focus attention on the next key, and we'd like to make life simpler for client implementations so we have interoperability. That being said, if there is a demand, there are some DRMs, for instance, that can respond to a key request with multiple keys. FairPlay is not one of those today, but if there is an interest in using that kind of thing to actually allow a single preload hint to deliver, for example, several future keys at once, that's something we'd be interested in hearing your feedback on. You know, come find me, or I think we have some folks from the FairPlay team actually here today. Come find them after the break, and we'd be interested to hear about what you guys need for your solution. So tell us about that. Great. Thank you. And we have some questions from Jacob. Thank you so much for submitting the question. I don't know if you're watching here or online, but the first question is: what are the main security benefits of the new key handling? There aren't; it's not so much a security benefit as a performance benefit. The key tag itself is assumed to be fairly strong. There is a theory out there that if you have kind of a leaky DRM, then rotating your content key more frequently can help increase your security. And I want to be a little bit cautious about that, because there's really a limit to that strategy, to the benefits of that. Essentially, if someone can crack your DRM once, they can crack it a second time. And once an attacker has mastered the art of extracting your key and distributing it to all of its clients, or however, then rotating your content key just means that they have to turn their crank more frequently. And that's just not hard. That's not rocket science. So I don't want to sell this as a security benefit. I think the biggest benefit of this is that it allows you to scale your content key servers to larger audiences. Makes sense. All right. And the second question is, are there any tools or frameworks already available that allow you to test content key rotation?
There are not. Well, I don't think so. Assuming that the feedback we get on this proposal is positive, then I think what we'll do here at Apple is we'll actually implement the client side of it. And then we'll put it into an iOS release of some description.
And then, as you know, we pre-release iOS betas for seeding purposes. And so I expect, if this goes forward, our player implementation of this will end up in an iOS, tvOS, et cetera, seed, and we will send out email to HLS interest letting you know it's available. In the meantime, I would encourage you, if you're an HLS.js user, or even Shaka or other players like that, we would welcome it if folks want to go ahead and build an implementation from this presentation and the information here, and try it out. You're certainly welcome to do that. I would just caution against deploying it until we're a little more settled in terms of the names of the attributes and exactly the semantics and things like that. But certainly, if you want to build something now, go ahead. Sure. Thank you. So the next question is from Alex, and he's asking about date of first use. Is that in media time or wall clock time? So that is a good question. That is against the same clock that the program date time tags in the playlist are. Got it. So whatever those dates are, the date of first use is against that same clock. So we would expect, if a preload hint has a date of first use of 2 p.m., then the first segment that appears might have an EXT-X-PROGRAM-DATE-TIME tag of 2 p.m. attached to it. Got it. Next question is about AV Foundation support. There is a question whether AV Foundation will support the content key rotation by default. The preloading, yeah. Preloading, yeah.
Yes. Again, if this goes forward, so for those of you who may not be aware, AV Foundation is the HLS player framework on iOS and tvOS and Apple platforms. So yeah, assuming this goes forward, yes, we will build this support into those and it will be on by default. Okay. So no adoption needed by apps.
Got it. There is one more question from Jacob. He is asking for recommendations about how often content keys should be rotated in a live stream. I asked our data security folks about this one time. There is a view that if you have a solid DRM, then the only attack on these things is a brute force, known plain text kind of thing against AES, which requires a large number of megabytes of data. And so you can kind of look at your content bit rate and figure out, okay, what interval it takes to rotate every 100 megabytes or 50 megabytes of data. And if you've got a leaky DRM, then it's like, well, how much data does it need to leak the key, and stuff like that. That becomes sort of an implementation-dependent thing, because what platform is the attack on, et cetera, et cetera. I would say, for me, waving a finger in the air, every minute or two is about as aggressive as is useful. I think going more aggressively than that, you're not really getting any benefits. If someone can attack a key rotation every minute, they can attack a key rotation every 10 seconds, realistically.
And so key rotation is really not your solution. But to give people sort of a sense that their content is secure against these kinds of brute force attacks, rotating every five minutes is probably good. Great. Awesome. This is super helpful. And the next question is about quantity: how many content keys can you preload, and how many keys can the CDM hold for use? Especially if you scrub back and forth, what are the recommendations there? Yeah. So there are sort of two questions tangled in there a little bit. There's the question of how many keys can you preload at once? And I think I said, today with FairPlay, the answer is just one, so one at a time. So you preload one key at a time, then you move to preloading the next. The second question is around, so if you have the so-called DVR window, so you've got a live playlist that has like 15 or 20 minutes of content in it, how many keys is it reasonable for a client to hold on to, so that if you scrub backwards, it doesn't have to redo the key request? And that's really implementation dependent. So I don't want to speak too broadly about that.
I will say for AV Foundation, it's a pretty good size; certainly tens of keys will be held. And it's not the end of the world if a key gets ejected. It just means that if the client does scrub backward and play over something again, there might be a little bit of a delay while we go get that key again. Your key server might get a little bit of additional attention. But that's pretty much a minority case and will most likely be lost in the noise. All right.
I believe these are all the questions that we can take now. There are a few more outstanding questions in my list. We'll take them at the end of the day in our Q&A session. Sounds good. Yeah, thank you so much. And now we'd like to welcome Julia, who will talk about sign language coordinated playback. Awesome.
Coordinated multiview playback: Sign Language streams
Hello, thank you, Roger and Sergey. So I'm Julia, and today I'll be discussing coordinated multi-view playback. And specifically, I'll focus on coordinating playback of sign language streams. So first, I'll start with an overview of what multi-view and sign language streams are. Then I'll go into detail on playback coordination and how it works. Lastly, I'll discuss some of the more advanced multi-view technologies. I'll get started.
A coordinated multi-view playback experience consists of playing multiple streams of audio and video in synchronization. Examples of synchronized streams might include a sports event or a music concert which has multiple camera angles. For example, a soccer game that has an audio stream for the announcer and two video streams with different perspectives of the field. In this case, it's crucial that the streams are perfectly coordinated with each other so that all the important moments line up. Another use case, which is the one I'll focus on today, is when there's a main stream with a corresponding sign language stream. In this case, playback of the main video and sign language video should be coordinated with each other.

So sign language streams are sign language interpretations of spoken language or audio content, delivered as video streams. These are often available for various live events, broadcasts, and recorded content. And the importance of these sign language streams is accessibility and inclusion for deaf and hard of hearing individuals. In the US alone, about 11 million Americans consider themselves deaf or have serious difficulty hearing. And about 48 million Americans have some degree of hearing loss. It's crucial to make content and information accessible for all of these individuals and deliver accessible and inclusive media content. Here's an example of this. On the left, there's a video stream of the main presentation content. It's from Apple's Worldwide Developers Conference, WWDC. And on the right, there's a corresponding sign language video stream using American Sign Language, or ASL.

Currently, to show a main stream with a sign language stream, some pipelines involve stitching together a main and a sign language stream into a separate single stream that contains both, with the sign language content often overlaid over the main content. The content is coordinated, but creating this separate stitched-in stream involves complex processing pipelines. There also may be challenges related to the actual content of the streams. There are many different sign languages, such as ASL or BSL, and for global audiences, it may be necessary to create a separate stream for each of these. Additionally, since these streams are stitched in, there could be conflicts between the positioning or sizing of the sign language content against the main content. Other pipelines might simply have two separate streams: a main stream and a separate corresponding sign language stream. And while this is more straightforward than creating the separate stitched-in stream, the downside is that the streams are not necessarily coordinated.

So today I'll describe how to leverage Apple's AV Foundation framework as an alternative way of showing sign language streams with main streams. The AV Foundation Playback Coordination API enables coordinated streams. It makes things simple and avoids complex processing pipelines. Additionally, it enables greater flexibility in the content shown. Beyond sign language streams, this applies to any multi-view content where you want coordination. Next, I'll go into the details of playback coordination, how to author HLS content, and describe the AV Foundation API for synchronized multi-view playback experiences. As I previously mentioned, playback coordination is synchronized playback across multiple audio and video players. What this means is that playback behaviors, such as play, pause, and seek, are coordinated.
And beyond these basic behaviors, the Playback Coordination API also handles more complex behaviors such as stalling, interruptions, and more. To ensure that playback coordination works, you must correctly author the HLS playlist. For VOD HLS playlists, the playlists are aligned assuming time zero for the start segment of each playlist. I'll go over an example of this. Here, I have an example of a playlist for the WWDC mainstream. It's a VOD HLS playlist.
If you pay attention to the highlighted segment, that's the first segment, and time zero corresponds to segment 10. Now on the right, I have an example of a VOD ASL playlist. For this playlist, segment 200, which is highlighted, is the first segment, and time zero corresponds to the start of segment 200.
These two playlists, the WWDC playlist on the left and the ASL playlist on the right, will be coordinated assuming that time zero is the start of their respective first segments. And the durations of these playlists should match each other. While the playlists will be coordinated even if the durations aren't matched, the recommendation is for the durations to be aligned.
The requirements for live and event HLS playlists are slightly different. For these HLS playlists, they're aligned based on the program date/time information. This requires using the program date/time tag in the HLS playlist for coordination. I'll go through an example of a live HLS playlist. Here's an example of a live playlist for the WWDC Mainstream. There's a program date/time tag highlighted, and it indicates that segment 10 begins at 12:00 p.m. on October 1st, for example. Now on the right, I have an example of the live ASL playlist. This playlist also includes the program date/time tags. The program date/time tag highlighted indicates that segment 201 will begin at 12:00 p.m. on October 1st. The WWDC playlist and the ASL playlist will be coordinated according to the program date/time tags. This means that the start of segment 10 for the WWDC playlist on the left will be aligned with the start of segment 201 for the ASL playlist on the right. For VOD, live, and event HLS playlists, content must be correctly authored in order for them to be coordinated. Next, I'll dive into how playback coordination works. In the AV Foundation framework, the AV Playback Coordinator object shares playback state across multiple players to keep them in sync. Each player has an associated playback coordinator, which is accessed via the playback coordinator property. The AV Playback Coordinator negotiates between the playback state of the player and all other players.
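The playlists shown on the slides aren't captured in the transcript, so here is a simplified sketch of the idea: two live playlists whose segments line up because they carry the same program date time. The URIs, dates, and surrounding tags are placeholders rather than the actual WWDC example:

```
# Main (WWDC) live playlist
#EXTM3U
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:10
#EXT-X-PROGRAM-DATE-TIME:2025-10-01T12:00:00Z
#EXTINF:6.0,
main/segment10.mp4

# ASL live playlist
#EXTM3U
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:201
#EXT-X-PROGRAM-DATE-TIME:2025-10-01T12:00:00Z
#EXTINF:6.0,
asl/segment201.mp4
```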
It does so by communicating through a common coordination medium. In AV Foundation, the AV playback coordination medium synchronizes playback between AV players on the same device. It communicates state changes across all playback coordinators and passes state from one coordinator to other connected coordinators. In this example, I have two video players connected to a common medium. This allows playback coordinators to communicate with the other connected coordinators and send information about important state changes such as playback rate and time. For example, if one player pauses, this pause state will be sent through the coordination medium to the other connected coordinators, and those playback coordinators will handle and apply that playback state. In this way, the players stay in sync.

Using the AVPlaybackCoordinator and an AVPlaybackCoordinationMedium only takes a few lines of code. I start by setting up my AVPlayers, one for the WWDC keynote and one for the ASL stream. Next, I create the coordination medium. Then I connect each player to the coordination medium using the coordinate method. Now both playback coordinators are connected to the coordination medium, and the actions on each player will be synchronized.

In this example, I'm watching the WWDC keynote video with a coordinated sign language stream. The main presentation is in full screen and the ASL video is in the bottom right corner. First, I'll start playing both players. Throughout playback, the sign language player will stay in perfect synchronization with the main player. I'll let it play for about 10 seconds and then press pause at the end. They'll pause perfectly in sync. Next I'll play and issue a seek forwards by 10 seconds. And with each of these actions, the players will remain in synchronization. Again, I'll pause at the end. - Productivity and creativity. But what should we call it? They stumbled into their minibus and wove a trail toward the Sierras, eventually rolling to a stop in a beautiful national park. - Additionally, I also have the flexibility to move and resize the ASL player depending on my needs. If there are multiple different types of sign languages, such as ASL, BSL, and more, you can dynamically replace one with the other depending on your audience.
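For reference, a minimal sketch of the setup just described: two players, a shared coordination medium, and each player's coordinator connected through a coordinate method. The type and method names follow the talk's description and are assumptions, so verify them against the current AVFoundation documentation:

```swift
import AVFoundation

// Two players: one for the main presentation, one for the ASL stream (URLs are placeholders).
let mainPlayer = AVPlayer(url: URL(string: "https://example.com/keynote/main.m3u8")!)
let aslPlayer = AVPlayer(url: URL(string: "https://example.com/keynote/asl.m3u8")!)

// Create the common coordination medium described in the talk.
let medium = AVPlaybackCoordinationMedium()

// Connect each player's playback coordinator to the medium; from here on,
// play, pause, seek, and stalls are kept in sync between the two players.
// Note: coordinate(using:) is the method name as described in the talk; verify against the SDK.
mainPlayer.playbackCoordinator.coordinate(using: medium)
aslPlayer.playbackCoordinator.coordinate(using: medium)
```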
Next, I'll reposition the ASL player from the bottom right to top right and adjust its size larger and smaller. - Staring skyward up the towering trunks surrounding them, they felt a deep kinship with anything that could get that high. They knew they'd found their spot. Welcome to macOS Sequoia. - So if this had been a stitched-in stream, I wouldn't have the ability to easily move or scale the ASL stream depending on my needs. This gives more flexibility for the content shown, and playback coordination provides frame-level synchronization while also handling the complexities of stalling and other stream issues that may arise. After setting up coordinated multi-stream playback, the playback experience can be even further optimized. I'll discuss some of the more advanced multi-view features involving quality optimization and other features such as AirPlay. First, I'll talk about quality optimization.
When streaming content, each player consumes network bandwidth. On the AV players, it's possible to prioritize the distribution of network bandwidth among players. And also specify which stream maintains higher quality when bandwidth is limited. To support this, set the network resource priority of the AV player. The network resource priority of an AV player can be set to high, default, or low. A priority level of high means that the player requires a high level of network resources, and streaming in a high-quality resolution is crucial.
For example, the main presentation video is larger, and if it's more important to see that content more clearly, then the presentation video should play at a higher quality if bandwidth is limited. I set its network resource priority to high. A priority level of low means that the player requires minimal network bandwidth, and streaming in a high-quality resolution is not as crucial.
If the ASL video is smaller, it may not need to play at a high quality under limited bandwidth. So I set its network resource priority to low. The exact configuration of these priority levels is up to the client app and the needs of its target audience. For example, the ASL stream could be high priority and the WWDC video low. These network priorities are there to help indicate the priority of the player when the system allocates network bandwidth resources. But the exact network bandwidth distribution takes a variety of other factors into consideration, such as the number of other players, video layer size, hardware constraints, and more. Now that you've learned about optimizing quality, finally, I'll discuss how these streams work seamlessly with other features like AirPlay. The AV routing framework provides support for seamless integration of AirPlay and other external audio and video playback experiences.
With AirPlay, it's possible to cast one stream onto the big screen while keeping the other stream on your device. The AV Routing Playback Arbiter API manages all of this stream switching complexity. Let's check out an example of this. For example, I'm watching coordinated multi-view playback of the WWDC keynote with a corresponding sign language stream on my iPad. I want to AirPlay the main content to my Apple TV and experience watching the main presentation on the big screen. However, since AirPlay to the Apple TV only supports a single video stream, the ASL video wouldn't be AirPlayed. Instead, it will remain full screen on my iPad while the main content displays on the Apple TV. If instead I wanted to watch the ASL stream on the Apple TV, I can switch them. The streams will still stay perfectly in sync and all playback behaviors will be coordinated. If you're interested in learning more about all of these features, check out this year's WWDC video on creating a seamless multi-view playback experience. This dives into more detail on all the multi-view related AV Foundation and AV Routing APIs. Along with this video is a sample app that provides an example of how to use the multi-view related APIs. Additionally, learn more in depth about how playback coordination works in the 2021 WWDC video on coordinating multimedia experiences.
To summarize, playback coordination is a technology that can be used to coordinate main content with sign language streams. It avoids the complexity of having a separate stitched-in stream and eases the process of creating more accessible content. Beyond sign language streams, the API related to playback coordination, network bandwidth allocation, and AirPlay routing is applicable to a wide variety of multi-view use cases. These are all powerful tools that can be used to create compelling synchronized multi-view experiences. I encourage you to check it out. Thank you. Next I'll hand it over to Ashank for his presentation on adding skip control and preloads to HLS interstitials. Thank you, Julia.
Add Skip Control and Preloads to HLS Interstitials
Hi, I'm Ashank, and I'm an AV Foundation engineer. I'll be going over how to add skip control and asset preloading to HLS interstitials. I'll start with a quick intro of HLS interstitials and then move on to two new features that were added in iOS, macOS, visionOS, and tvOS 26.
Skip control and preloads. So first, a quick overview. HLS interstitials are a mechanism to insert additional content into your main program. For example, they can be ads, promos, recaps, or studio banners. It's a drop-in replacement for discontinuities. And additionally, they provide several benefits over traditional ad insertion methods. It's easier to serve dynamic ads. They allow for late binding ad decisioning. They replace burned in ads with dynamic ads.
And the ads do not need to be conditioned to the primary content. I'll show you an example. Interstitials are authored in an HLS media playlist via a date range tag, as shown on the right. They are of the class com.apple.hls.interstitial. They have an ID and a start date to denote when to play the interstitial with respect to the primary content. Finally, the X-ASSET-URI points to the multivariant playlist of the interstitial content, which is separate from the multivariant playlist of the asset being played. This is scratching the surface of HLS interstitials. There are several WWDC videos since 2021 that go into much more detail. Watch these videos to learn the basics of HLS interstitials and more advanced features such as the integrated timeline, which exposes UI customizability for interstitials.

So that's an overview of HLS interstitials. There's a new feature in iOS, macOS, visionOS, and tvOS 26 called skip control. Skip control gives a facility to render a skip button for interstitial playback during AirPlay, where the UI is controlled by the system. This also provides an easier way for clients to create their own skip button to match their custom UI. To do this, an interstitial now has a notion of a skippable range to define when the skip button should appear on screen for the interstitial. Clients can also customize the label of the button and provide translations to fit the system language of the device. The skip control time range of an interstitial event indicates to the client the range of time that the skip button should be shown on screen. In this diagram, the playhead is in the primary. When it reaches the eligible period, a skip button should appear on screen. When the viewer clicks on the skip button, interstitial playback will terminate and return to the primary.

So here's an interstitial date range. To indicate to the client that the interstitial is skippable, there are three new attributes. First is X-SKIP-CONTROL-OFFSET. It indicates at what time the skip button should appear relative to the start of the interstitial. Next comes X-SKIP-CONTROL-DURATION. This indicates how long the skip button should be on screen for. Note that if this is omitted, the client should render the skip button until the end of the interstitial.
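The date range on the slide isn't captured in the transcript; here is a rough sketch of how such an interstitial might be authored, combining the base attributes with the three skip control attributes from this session. The attribute spellings and values are illustrative reconstructions from the talk's descriptions, not text from a published spec:

```
#EXT-X-DATERANGE:ID="ad-break-1",CLASS="com.apple.hls.interstitial",START-DATE="2025-10-23T10:00:10Z",X-ASSET-URI="https://example.com/ads/break1/main.m3u8",X-SKIP-CONTROL-OFFSET=4,X-SKIP-CONTROL-DURATION=11,X-SKIP-CONTROL-LABEL-ID="skip-ad"
```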
Additionally, the server can indicate a label ID that the client can use to load app-customizable labels and translations. An app may want to render a button with different text based on the type of interstitial. Additionally, a viewer may have a different system language depending on their settings. The X-SKIP-CONTROL-LABEL-ID attribute allows a server to inform the client which type of label to render. For example, on AVPlayer, the app sets up an NSBundle with a localizable strings table with translations for English and Spanish for the label ID in this example, a skip ad label ID. Once the bundle has been configured, it can be set on the AVPlayerInterstitialEventController via a new API that we're calling the localized strings bundle. When the AVPlayerInterstitialEventController needs to decide the label for a playing interstitial, it will grab the relevant skip control label ID, which in this case is the skip ad label ID, the system locale, which is Spanish here, and the bundle, to generate the correct translation for the custom label. It will make this accessible via the AVPlayerInterstitialEventMonitor.
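As a very rough sketch of that configuration step (the localizedStringsBundle property name is assumed from the talk's phrasing and should be checked against the SDK):

```swift
import AVFoundation

// Hypothetical sketch: the localizedStringsBundle property follows the talk's
// description of "a new API that we're calling the localized strings bundle";
// verify the actual name against the SDK.
let player = AVPlayer(url: URL(string: "https://example.com/live/main.m3u8")!)
let controller = AVPlayerInterstitialEventController(primaryPlayer: player)

// A bundle whose strings table maps label IDs (for example, "skip-ad")
// to localized button titles for each supported language.
controller.localizedStringsBundle = Bundle.main
```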
Note that the AVPlayerInterstitialEventController is a subclass of the AVPlayerInterstitialEventMonitor. If your app has custom playback UI, AVPlayerInterstitialEventMonitor provides ways to find out how and when to draw your own skip button. There are a couple of new APIs to support the skippable time range defined by the server. The AVPlayerInterstitialEventMonitor now has an enum known as the skippable event state, whose values change depending on the playhead's position in the diagram. Any time the state changes, the monitor posts a notification to the app, which can track the value of the enum via the new current event skippable state API. In this diagram, playback is currently in the not-skippable state. When the playhead moves to the start of the interstitial, it is in the not-yet-eligible range.
The AVPlayerInterstitialEventMonitor posts a notification to the app indicating so. For the app processing the notification, this means to either not display the skip button or display a disabled skip button. When the playhead moves to the skippable range, it is now in the eligible range, prompting the AVPlayerInterstitialEventMonitor to post a notification to the app. This indicates to the app to render the skip button. And when the playhead moves to the no-longer-eligible range, the AVPlayerInterstitialEventMonitor posts a notification again, and the app should respond by either disabling or hiding the skip button. Note that if the skip button was clicked during the eligible period, the playhead would go straight back to the primary, which is in the not-skippable range. I have a couple of examples to show here today.
First, I will show a custom app that adopts the new skip control interstitial APIs to render its own custom skip button. In this example, an interstitial is scheduled at t equals 10 seconds. The skippable region of this interstitial starts at an offset of roughly 4 seconds. And the app has defined a custom label for this button, called skip ad, which appears in English, the system language of the device. So here, playback is in the primary. Soon playback enters the interstitial. And four seconds into the interstitial, the skip button appears. And when the skip button is clicked, interstitial playback terminates and goes back to the primary.

Now I'll go over a different example, where a viewer is watching the same exact content, but here they decide to AirPlay to an Apple TV where the app cannot draw custom UI. The receiver takes care of drawing the button based on the system language of the receiver and the bundle of labels set on the sender iPhone. In this example, the Apple TV's system language is Spanish while the iPhone's system language is English. Since the app has configured a custom label ID, the framework ensures that the label on the receiver is translated to Spanish, while the label on the sender is translated to English. So the AirPlay session starts. Playback soon enters the interstitial. And the skip button should appear on the Apple TV in Spanish and English on the phone. And then when the viewer clicks the skip button on the receiver, the interstitial is seamlessly terminated over AirPlay, and both UIs update accordingly.

So that's the new skip control feature. Now I'll move on to a different interstitial feature that we're calling asset preloads. So some backstory first. HLS interstitials have been available for a few years, and since then, partner adoption has steadily grown. As usage has expanded, we've continued to refine the feature by adding new capabilities and improving the experience along the way. However, certain providers operating at massive scale encountered a significant challenge when adopting interstitials.
They were running into this issue specifically in a live streaming scenario, where the audience can scale to millions of concurrent viewers. In that environment, the timing of interstitial requests becomes absolutely critical. In this case, several clients with relatively decent network connections would be requesting a late-binding interstitial at approximately the same moment. At small scale, this behavior is harmless, but at massive scale, these simultaneous requests form what's known as a thundering herd, which is a sudden surge of traffic that overwhelms the server. The result could be latency, timeouts, or even outright failures during critical ad playback moments.

To understand why this happens, it's helpful to look at how late-binding interstitials are defined in the first place. Late-binding interstitials are defined by an X-ASSET-LIST attribute in an interstitial date range. It points to a JSON file that has a list of assets, and it allows a server to defer ad decisioning until after the HLS playlist is served. When the interstitial is ready to be buffered, the client will request the JSON file, at which point the server can populate the JSON with relevant ads. In this example, there are two asset URIs for this ad pod. In a live stream, the client player will request the asset list for the upcoming event only when the primary has buffered to some time before the event's start. In this case, this happens to be roughly 75 seconds. A problem arises during a live scenario with a spike of concurrent viewers. If clients have relatively good network connections, they will all request the asset list from the server at the same time, and this can result in a thundering herd, which can overwhelm the server.

To mitigate this problem, we are introducing a new server-side directive that indicates a preload range to the client. The client's responsibility is then to pick a random point within this preload period to fetch the asset list for the specified event. Since every client picks its own random preload point within the specified range, this distributes the load on the server for ad pod requests. I'll go over an example of configuring a preload date range to avoid the thundering herd. Here is a live media playlist with one-second segments.
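The playlist on the slide isn't captured in the transcript; here is a condensed sketch of the setup walked through next, with the interstitial date range and the new preload date range together. URIs, dates, and attribute spellings are illustrative reconstructions from this session's descriptions:

```
#EXTM3U
#EXT-X-TARGETDURATION:1
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PROGRAM-DATE-TIME:2025-10-23T10:00:00Z
#EXT-X-DATERANGE:ID="event1",CLASS="com.apple.hls.interstitial",START-DATE="2025-10-23T10:01:30Z",X-ASSET-LIST="https://example.com/ads/event1/assetlist.json"
#EXT-X-DATERANGE:ID="pre-event1",CLASS="com.apple.hls.preload",START-DATE="2025-10-23T10:00:02Z",DURATION=60,X-TARGET-CLASS="com.apple.hls.interstitial",X-TARGET-ID="event1",X-URI="https://example.com/ads/event1/assetlist.json"
#EXTINF:1.0,
segment0.mp4
#EXTINF:1.0,
segment1.mp4
```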
The player joins the live stream at roughly three target durations from the live edge. So that means the player joins at t equals one second. The playlist has also defined an interstitial with an X-ASSET-LIST, which will start playback at one minute and 30 seconds into the stream. Now, I'll add the new preload date range. Note that there's a new class, com.apple.hls.preload, and its ID in this example is pre-event1. The preload start date is at two seconds into playback, with a duration of 60 seconds. This means that the preload period is between 2 and 62 seconds. This preload period is prior to the interstitial's start date. The preload indicates that the X-TARGET-CLASS is com.apple.hls.interstitial, since that's the class that's being preloaded. And the X-TARGET-ID indicates the interstitial ID to be preloaded. And finally, the X-URI indicates the URI to be preloaded, which matches up with the interstitial's X-ASSET-LIST. With this preload date range defined, the client then randomly picks a time between 2 and 62 seconds to fetch the asset list. And this has the effect of distributing the X-ASSET-LIST requests across the preload period among several clients.

So that's preloads. With these new interstitial features, your next steps are to adopt AirPlay-compatible skip controls to improve the experience for your customers receiving HLS interstitials. And finally, mitigate thundering herd issues during live streams with significant traffic by adopting the new preload date range class to distribute ad decisioning requests. Thank you.
[Transcript missing]
Enhance your analytics with AVMetrics
Thanks, Sergey. Hello, I'm David, a media streaming engineer. Today, I'll share how you can enhance your app's analytics with AVMetrics. I'll start by giving a brief introduction of AVMetrics, and then I'll dive into specifics for HLS, progressive downloads, and HLS offline downloads. So first, an overview of AVMetrics, what its capabilities are, and how to use it. AVMetrics is a framework that allows you to collect metrics about playback and downloads. It upgrades your analytics by providing insights into what AVPlayer and AV asset download tasks do. This can be combined with your own analytics about what your app does to get a full picture of the user experience.
AVMetrics can also help diagnose issues by helping you narrow down the root cause of an error or measuring the frequency of that error. AVMetrics offers insight into some of the important actions the player is taking under the hood, in the form of events, each of which has a set of relevant metrics. Events can be grouped into the following: network events, for network requests such as those for the HLS playlists or segments; startup events, which represent when the player reaches likely to keep up and include information such as startup time; playback events, which include high-level playback events such as seeks, rate changes, and stalls; variant changes, which indicate when variant switches are attempted and completed; error events, for both fatal and non-fatal playback errors; and summary events, which give overview-level information at the end of the session. With these events, you can compose an accurate picture of a playback session. Keep in mind that the presence or absence of events provides clues about what's going on too. For example, if playback fails to start and there aren't any playlist request events, including the multivariant one, then something clearly went wrong very early in the process, before the stream even got very close to being ready to play. And with AVMetrics, you can choose which events you're interested in receiving, or choose to receive them all.
I'll take you through a code example, because I'm interested in measuring variant switch events. Starting with an AVPlayerItem, a sequence of metrics can be extracted by event type. In this instance, it's the variant switch and variant switch start events. Each of these sequences can be iterated over in a for loop, where each iteration is a new metric event as the events come in. Using chronological merge, I can merge multiple sequences together. The result of chronological merge is a combined sequence that I can also iterate over in the same way. Events are published as they happen during playback, and once I receive them, I can then send them to the server, for example.
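Here is a minimal Swift sketch of that flow. The metrics(forType:) call and event type names come from the AVMetrics additions introduced at WWDC24; the sketch iterates a single sequence, and the chronological merge of multiple sequences mentioned above works the same way (check the documentation for its exact spelling):

```swift
import AVFoundation

func observeVariantSwitches(for playerItem: AVPlayerItem) async {
    // Ask the player item for an async sequence of a specific metric event type.
    let switchEvents = playerItem.metrics(forType: AVMetricPlayerItemVariantSwitchEvent.self)

    // (Per the talk, several sequences — for example, variant switch and variant
    // switch start events — can also be combined with a chronological merge and
    // iterated in exactly the same way.)
    for await event in switchEvents {
        // Inspect the event locally, or forward it to an analytics backend.
        print("Variant switch event: \(event)")
    }
}
```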
Apple values privacy, so this data is sent to the server anonymously, and only if the user opted into sharing it. On the server, I can aggregate information across many playback sessions to understand the behavior of my app. I can also combine this information with other data available to me to draw more precise conclusions. To give you an idea, I could use these statistics to determine which audio and subtitle languages are most popular. This could lead me to update my service to improve the user experience as well. For instance, I might prioritize those most popular languages when dubbing and subtitling new content. To learn more about all of the different AV metric events, watch the WWDC24 video titled Discover Media Performance Metrics in AV Foundation and check out the online documentation. Now I'll focus on AV metrics for HLS. All the metrics I mentioned earlier are available for HLS playback. I'd like to highlight a couple of updates for HLS. Firstly, segment duration is now available in the segment request event. This enables you to contextualize other metrics against the segment duration. For example, you can normalize the length of the request and response times for fetching that segment against the duration of the segment itself. Another set of additions are fields that identify which media rendition is playing.
Media renditions are different versions of a video, audio, or subtitle track that differ in aspects like language or accessibility settings. These fields have been added to the variant switch and variant switch start events. I'll go through what this new capability looks like in code. Just like before, I get a variant switch metric sequence out of the player item. When the event is published and I receive it, I can inspect it and extract a stable ID and a URL for the audio rendition. I'm showing the audio rendition here, but I can do the same for the video and subtitle renditions. Depending on how my HLS playlists are set up, this lets me uniquely identify which media rendition is currently playing at the time of the variant switch. These fields are also available on the variant switch start event.

Now, I'll talk about progressive downloads. Progressive downloads refer to when the AVPlayer plays a single media file, such as an MP4, directly from an HTTP server. When you subscribe to AVMetrics on an AVPlayerItem, you'll receive metrics for both HLS and progressive download assets. The set of available events for progressive downloads is similar to the ones available for HLS playback. The exception is variant change events, since variants are a concept specific to HLS. Network, startup, playback, error, and summary events are still available. You can interact with them in the same way you can in HLS playback.

Now, I'll talk about AVMetrics for HLS offline downloads, which is the process of persisting HLS content to disk for later playback. For these, AVMetrics only supports summary events. It allows you to receive metrics such as how long it took to download an asset and how many total bytes were downloaded. It's published at the end, upon either success or failure of the download. When it comes to HLS offline downloads, there is no subscription for AVMetrics. Instead, to receive download metrics, implement the delegate method didReceiveMetricEvent in the AVAssetDownloadDelegate protocol. I'll go through a code example.
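A rough sketch of such a delegate follows. The transcript names the callback only as didReceiveMetricEvent, so the exact signature below is an assumption modeled on the other URLSession-style AVAssetDownloadDelegate callbacks; verify it against the documentation before relying on it:

```swift
import AVFoundation

final class DownloadMetricsDelegate: NSObject, AVAssetDownloadDelegate {

    // Assumed signature: called once, when the download completes or fails,
    // with a summary metric event for the offline download.
    func urlSession(_ session: URLSession,
                    assetDownloadTask: AVAssetDownloadTask,
                    didReceive metricEvent: AVMetricEvent) {
        // Inspect the event in the app, or forward it to an analytics backend.
        print("Offline download summary: \(metricEvent)")
    }

    // Typically also implemented: locate the downloaded asset when the task finishes.
    func urlSession(_ session: URLSession,
                    assetDownloadTask: AVAssetDownloadTask,
                    didFinishDownloadingTo location: URL) {
        print("Downloaded asset stored at \(location)")
    }
}
```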
Here, I have a custom class which conforms to AVAssetDownloadDelegate. The AVAssetDownloadDelegate protocol defines methods to manage the lifecycle of asset downloads for offline playback. It's used to monitor download progress in the background, handle errors, locate assets once the download completes, and much more. My custom class implements the didReceiveMetricEvent method, which gets called when the download completes or terminates with a failure. From here, the workflow is the same as before. I can inspect the event here in the app, or I can send it to my analytics backend to do further processing.
And that's how you can use AVMetrics to enhance your analytics. So if you haven't already, adopt AVMetrics in your app. Its new capabilities allow you to obtain deeper information about HLS playback and observe metrics for progressive downloads and offline HLS. Aggregate information to understand usage patterns in your app.
Based on those usage patterns, determine if your app is well optimized for your users or if the user experience could be improved in some way. Leverage AV metrics to understand the errors that your users face. Error messages are directly available in error events. Also, the presence or absence of other events will give you clues about what's going on as well. We'd love to hear about how you're using AV metrics or how we might be able to improve it. We truly value your feedback. Thank you for coming today and have a great HLS Summit. I'll hand it off to Rob now to share some updates in HLS.js. Thank you, David.
HLS.js updates
[Rob Walch]
Hi, everyone. I'm Rob Walch. I help developers use HLS. As part of the HLS team here at Apple, you can often find me on the HLS interest mailing list, video dev Slack, or GitHub, answering questions and contributing to the open source project. HLS.js is a JavaScript library built for reliable HLS playback on platforms with MSE. Its modular, lightweight, and performant design ensures optimal handling of HLS streams.
In this talk, I'll go over some of the core principles of HLS.js's development, as well as the HLS features contributed this last year. These include interstitials, live playback, codec selection, and content protection. Finally, I'll review the roadmap and discuss next steps. Modularity is core to HLS.js. Features are encapsulated into modules, many of which can be extended, disabled at runtime, and even excluded from builds. Tailoring HLS.js to include only essential features helps to achieve faster page load, application startup, and video playback. HLS.js should be easy to adopt and upgrade. It is tested across a large number of browsers and smart devices, with the aim of providing the best possible playback experience on every platform. Runtime API compatibility is maintained between releases so that upgrading HLS.js does not require additional work. Occasionally, library types and ES modules change, requiring some migration when updating between minor dot releases. So HLS.js must also be a good citizen of the runtime it occupies and the web at large.
Keeping it small and simple addresses many performance concerns and makes it easy to debug and profile. On platforms with resource constraints, HLS.js offers options to set buffer limits, eject media more proactively, and toggle the parsing of media in a separate worker context. Each instance of the HLS.js API is responsible for the streaming of a single HLS asset to an attached HTML media element. The configurability and performance aspects I've just highlighted are what enable support for multiple instances. This plays an integral role in supporting HLS interstitials. Interstitials in HLS.js work by having the primary instance enqueue new instances, which buffer additional assets at specific timeline offsets. The integrated timelines generated by an interstitial schedule require new API to navigate. In HLS.js, this is called the Interstitials Manager. New API events track interstitial-specific network and playback progress. I'll show how one event can be used to customize asset player configuration from the primary settings they inherit by default. Finally, asset list playback has been improved to play all available assets, excluding any that error, up to the playout limit. So the Interstitials Manager allows developers to render player controls and poll progression through an interstitial schedule. Traditionally, playback position would be tracked with the HTML media element or video tag's currentTime property.
Since interstitials can occupy a different amount of time than the primary content they replace, this must be done in new ways. To get the integrated playback position, HLS.js provides the Interstitials Manager's integrated playhead times interface. There's also an interstitial player used to navigate only within an active break. The manager is the one-stop shop for the schedule, the list of interstitial events, and the enqueued asset players.
There's even a skip function used to cancel the active break. So that's the Interstitials Manager. Let's talk about events. There are a number of new API events to track interstitial network and playback progress, as I mentioned before. Interstitials-updated signals program schedule changes, such as new and removed interstitials, and also timing changes that come from obtaining more precise duration values. Buffered-to-boundary signals that the media buffer has reached the scheduled start of a break.
Typically, this is where assets are preloaded. Asset lists are a new type of resource with corresponding load events. The asset player created event happens when enqueuing new HLS.js instances. The asset players are responsible for buffering the interstitial one asset at a time. Each player will emit all of the classic HLS.js events related to streaming of an item.
At a higher level, though, progress through the scheduled program is conveyed as follows: interstitial-started signals the beginning of a break, with asset-started and asset-ended repeated for each asset, until interstitial-ended. Consecutive interstitials will repeat this sequence. When there are no more interstitials to play, primary-resumed signals resumption of the main content.
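A hedged sketch of wiring this up is below. The event constants and the shape of the interstitials manager are written as described in the talk; treat the exact identifiers as assumptions and confirm them against the HLS.js API docs for the release you use.

```typescript
// Sketch: tracking interstitial progress and rendering against the integrated timeline.
// Identifiers follow the talk; verify exact names against the HLS.js documentation.
import Hls from 'hls.js';

const hls = new Hls();

hls.on(Hls.Events.INTERSTITIALS_UPDATED, (_event, data) => {
  // Schedule changes: new/removed interstitials, refined durations.
  console.log('schedule updated', data);
});
hls.on(Hls.Events.INTERSTITIAL_STARTED, () => console.log('break started'));
hls.on(Hls.Events.INTERSTITIAL_ASSET_STARTED, () => console.log('asset started'));
hls.on(Hls.Events.INTERSTITIAL_ASSET_ENDED, () => console.log('asset ended'));
hls.on(Hls.Events.INTERSTITIAL_ENDED, () => console.log('break ended'));
hls.on(Hls.Events.INTERSTITIALS_PRIMARY_RESUMED, () => console.log('primary resumed'));

// Render player controls against the integrated timeline rather than video.currentTime.
function renderProgress() {
  const manager = hls.interstitialsManager;
  if (manager) {
    const { currentTime, duration } = manager.integrated;
    console.log(`integrated position ${currentTime} / ${duration}`);
  }
  requestAnimationFrame(renderProgress);
}
renderProgress();

// Cancel the active break, where the schedule allows it:
// hls.interstitialsManager?.skip();
```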
The asset player created event offers an opportunity to customize configuration. For example, if the primary player is configured to use CMCD, asset players will be as well, unless the CMCD settings are modified in this event handler. This is also where you could attach another HTML media element. You can do all kinds of stuff here. To support smooth play-through of long asset lists while decisioning on each item, the assets are loaded sequentially, to better support timeline adjustments. For example, when buffering an interstitial asset list, after loading the first two assets successfully, if the third fails to load before being played, it will be ignored. Its duration does not contribute to the playout or the default resumption offset of the interstitial. And this is handled prior to loading the remaining assets, to avoid additional rescheduling and buffering work.
So let's talk about live. HLS.js has always supported live playback. Improvements based on developer feedback and your contributions have helped make HLS.js a great choice for apps with long-running live content with interstitial breaks. Some of the updates to live include joining live with interstitials, sliding-window resumption, and more resilient error handling. Let's review each one individually. When joining a live stream, HLS.js loads and appends media segments before determining the start position that it will seek to. Readying content for playback takes time. To ensure this doesn't add more latency to the live experience, HLS.js by default adjusts the start time to align closely with the target latency. Trying to hit a moving target can make startup unstable, so on the other hand, if lower latency is not a priority, use the start-on-segment-boundary option to play through all buffered content. OK, so that's startup. Now, if after some time playback falls behind the live sliding window, it will have to skip ahead to catch up to the available segments in the last playlist update. By default, it will jump straight to the live edge, but it can be configured to play through all buffered content using the live sync mode option. This is new.
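The two live-join behaviors just described map onto configuration. The option names in the sketch below are taken from the talk and camelCased as an assumption; the exact config keys and accepted values may differ, so check the HLS.js config documentation before relying on them.

```typescript
// Hedged sketch only: option names and values are assumptions based on the talk,
// not verified HLS.js config keys.
import Hls from 'hls.js';

const hls = new Hls({
  // Prefer joining on a clean segment boundary and playing through buffered
  // content, rather than chasing the target latency at startup.
  startOnSegmentBoundary: true,

  // When playback falls behind the live sliding window, play through buffered
  // content instead of jumping straight to the live edge.
  liveSyncMode: 'buffered', // placeholder value; real accepted values may differ
} as any); // cast because these keys are assumptions, not verified typings
```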
In terms of error handling, additional validation on playlist updates ensures that HLS authors testing with HLS.js are producing live content that is compatible across all HLS clients, and that HLS.js is compliant with the HLS spec. Live reload scheduling now recovers from intermittent errors, like receiving an invalid or empty playlist response. That's the worst. Resumption and recovery after app suspension or connection loss have been improved. Waking or reconnecting a device running a live stream will pick back up where it left off or catch up, depending on the type of playlist, event or live sliding window. There was a bug found in low-latency HLS VTT part loading.
That's been disabled for now, since live VTT segment loading works as expected. A fix for part loading is coming soon.
So, codecs. Supplemental codecs are used to pick Dolby Vision Profile 8 and 10 enhancement layers when available. When Dolby support checks fail, setup will fall back to baseline HEVC or AV1. Unknown video and audio codecs that are supported by the browser are now allowed, but will not be selected before known ones. It was kind of a misnomer that HLS.js would play any content with codecs supported by its runtime; previously, it could not, because of a dependency on a preset lookup of audio and video codec types.
Support for FairPlay Streaming, Widevine, and PlayReady has been undergoing improvements. HLS.js can pick the appropriate key system based on a combination of license configurations and HLS key tags. Including session key tags in your multivariant playlists allows this process to start sooner. Key sessions are generated as soon as the first HLS key tag is encountered while requesting the associated media segment. HLS.js handles key status errors and reduces redundant license requests in cases where multiple keys are carried in a single HLS key tag URI.
So that was an overview of some of the most important updates. The features landing in upcoming releases include I-frame support, preload hints and date ranges (lots of preloading to do), client-side interstitial insertion, and SCORE-based selection. There are also enhancements to land, including expansion of gapless interstitial playback, enhancements to prevent live buffer underrun, text track removal between items (a community contribution already in development), and ES module composability improvements. If there's something you would like to see added or prioritized, please file an issue. In addition to reaching out to me, GitHub issues really help with tracking and prioritization. It's one of the best ways to participate in open source projects. And I look forward to seeing your contributions land in HLS.js next. Thank you.
[Transcript missing]
Stream Spatial Audio with APAC
Well, thank you, Sergey. Good afternoon. I'm Deep. I'm the lead immersive audio architect at Apple. I'll be joined later by Tejaswi, who is an R&D engineer on the immersive audio team. I, along with my team at Apple, have developed the entire end-to-end immersive audio technology that's been used for the immersive content on the Vision Pro. In this talk, I'll be providing an overview of the new Apple Positional Audio Codec, which we call APAC. In the process, I'll also briefly go over the features of the new Apple Spatial Audio Format that motivated the new codec. Then I'll cover the capabilities and architecture of the new codec. And finally, Tejaswi will discuss how to create the HLS manifest for streaming with APAC. I'll start by defining what the Apple Positional Audio Codec is and what motivated its development. APAC is a new audio codec that is capable of efficiently compressing high-resolution spatial audio signals. It facilitates advanced capabilities such as metadata-driven automatic generation of acoustics for object-based audio. APAC also supports high-order ambisonics. I'll talk more about these capabilities later in the presentation.
With the introduction of the Apple Vision Pro, Apple Immersive Video, also known as AIV, was introduced. AIV allows viewers to feel present, almost teleported into the scene. However, to be completely immersed or feel present in that scene, we also needed immersive audio over headphones. Immersion for audio is a combined experience of naturalness and externalization. Externalization is a sense of audio sources emanating from correct distances outside of the head. This level of immersive audio experience when listening with headphones is not possible with other formats. This motivated us to create a new format that we call the Apple Spatial Audio Format, or ASAF. The new codec, APAC, supports this new format on Apple devices. The APAC decoder is available on iOS 17, macOS 14, tvOS 17, and visionOS. Profile 0 of APAC is used to encode ASAF for the Vision Pro's Apple Immersive Media. This profile of the APAC codec supports up to 10th-order ambisonics, 1024 objects with metadata, and mono to 22.2 channel layouts. APAC's Profile 5, on the other hand, is available for encoding on macOS, iOS, and visionOS. This profile is capable of encoding up to third-order ambisonics and up to 7.1.4 channel layouts.
There are multiple ways to create these types of audio signals. For example, you can use specialized mic arrays for capturing high-order ambisonics. There are also microphone rigs to capture 5.1 or 7.1.4 channel beds. There are also 180-degree and 360-degree cameras with built-in ambisonic microphone arrays, such as the Insta360 and GoPro, and you can deliver those captures with the Apple Projected Media Profile. iPhone 16 and above, as well as the Apple Vision Pro, capture first-order ambisonics and encode it with this new codec, APAC. Digital audio workstation (DAW) tools, for example IRCAM Spat, can convert audio captured from mono microphones into ambisonics. For detailed information on encoding and decoding, check out the Apple Positional Audio Codec documentation on developer.apple.com. Next, before diving into APAC's capabilities, it's important to understand the intrinsics of the new Apple Spatial Audio Format, ASAF, at least at a very high level anyway.
So ASAF is a new immersive audio format that has been built to provide compelling, externalized, and natural audio experiences. All the immersive content on the Vision Pro was created using ASAF. ASAF enables natural and externalized audio experiences by ensuring that the acoustic cues that are used to render the audio are extremely accurate. It's composed of new metadata coupled with linear PCM and a powerful new spatial renderer that's built into Apple platforms. It produces high-resolution spatial audio using numerous point sources or objects and high-resolution sound scenes, or high-order ambisonics. The rendered audio is generated computationally and adaptively based on object position and orientation, as well as the listener's position and orientation relative to the acoustic environment. None of it is baked in for these objects. This allows the sounds in ASAF to come from all directions, in any position and at any distance, providing a genuine 3D audio experience. Content for ASAF is created using DAWs such as DaVinci Resolve from Blackmagic and Pro Tools, and saved as a Broadcast Wave (.wav) file. The WAV file contains both the PCM signals and the time-varying metadata. The maximum HOA order and number of objects is dictated by the capabilities of these tools, DaVinci Resolve and Pro Tools. Profile 0 APAC is supported in both of those tools as an export option, only at a single bitrate. The upcoming release of Compressor will also support Profile 0 APAC encoding.
Third-party integrators are implementing APAC encoders as well. I will now go through the end-to-end content creation and distribution process for ASAF. Content is created by the creative mixer by bringing various microphone recordings and stems into the ASAF DAW tooling. This produces a set of PCM signals representing a combination of objects, ambisonics, and channels, as well as a set of time-varying metadata such as position, direction, room acoustics, and so on. The creative mixer is able to listen to that by rendering the PCM and metadata. Once the creative mixer is done, there is a representation of that format: the PCM, the metadata, and also the renderer. But it's assumed that the renderer is present in the playback device, so the audio format is really the PCM plus metadata. This is then saved into that broadcast WAV file.
The WAV file is then encoded with APAC and subsequently converted using the HLS tools into a format that's suitable for streaming. On the playback device, the fragmented MP4 is decoded back into those PCM signals and time-varying metadata. These are then ingested into the adaptive renderer, which also gets the position and orientation of the listener. This allows the rendering of the immersive audio experience. Now that I've defined the requirements dictated by the Apple Spatial Audio Format, I'll review APAC's codec capabilities.
So APAC is designed to support very high-resolution spatial content. This is required for that immersive experience, but we try to keep the bitrate low. For the highest quality, the bitrate is dictated by the complexity of the content and maintaining perceptual transparency. So most of the content on the Vision Pro uses a combination of fifth-order ambisonics and 15 objects.
That's a total of 51 LPCM channels and metadata, providing a spatial resolution that's really unprecedented for any existing spatial audio format. At 32 bits per sample, that's an 81 megabit per second payload. That doesn't even include the metadata. Now, APAC is able to encode that payload at one megabit per second and still maintain excellent quality. That's an 80 to one compression ratio and an average of 20 kilobits per second per channel. Consider the fact that audio bitrate for transparent stereo music streaming is 256 kilobits per second. That's 128 kilobits per second per channel. 20 kilobits per second per channel is indeed industry-leading. At a slightly lower bitrate of 768 kilobits per second or 15 kilobits per second per channel, the quality is still great.
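For reference, here is roughly how those numbers relate (a sketch assuming 48 kHz sampling; the quoted ~81 Mbit/s presumably includes some overhead or rounding beyond the raw PCM figure):

```latex
\text{Raw PCM: } 51 \times 48{,}000\,\tfrac{\text{samples}}{\text{s}} \times 32\ \text{bits} \approx 78\ \text{Mbit/s} \quad (\text{quoted} \approx 81\ \text{Mbit/s})
\text{Compression: } \frac{\approx 80\ \text{Mbit/s}}{1\ \text{Mbit/s}} \approx 80{:}1, \qquad \frac{1{,}000\ \text{kbit/s}}{51\ \text{channels}} \approx 20\ \text{kbit/s per channel}
```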
Any combination of ambisonics, channels, and objects can be encoded at a total bitrate as low as 64 kilobits per second, while still allowing head-tracked spatial audio playback. For more encoding bitrate recommendations, check out the HLS Authoring Specification for Apple Devices on developer.apple.com. Next, I'll go over how APAC achieves its compression efficiency. ASAF content consists of ambisonics, objects, and channels, as well as metadata. Objects can be further classified in various ways, such as dialogue and interactive elements, tagged respectively through metadata.
The content is then ingested into the APAC encoder. The APAC encoder achieves its compression efficiency through a four-stage architecture. The first stage is a spatial analysis block. Its purpose is to classify the content as salient or non-salient. For example, objects such as moving objects, loud sounds, or speech, really those that can be easily discerned from the ambient sounds, are classified as salient. The second stage of encoding is aimed at reducing the dimensionality of the content. A matching pursuit algorithm provides a ten-to-one reduction in dimensionality. Parametric techniques can provide an even higher ratio, with a trade-off in spatial fidelity. This is a key contribution to the overall coding efficiency. This dimensionally reduced signal is then fed into various coding toolboxes. There isn't time to go over all of these in any detail. They include things like MDCT-based psychoacoustic modeling, a novel bandwidth extension algorithm, new channel and stereo coding to take advantage of the correlation between signals, as well as vector quantization tools. The last stage is metadata encoding that has been specifically designed with just-noticeable-distortion criteria for spatial audio perception. For example, how does the quantization of basic coordinates such as x, y, and z affect the perception of position? This is not just for three degrees of freedom, or 3DoF, but also takes into account that the listener may interactively move closer to an object when six degrees of freedom (6DoF) is enabled. All of these then comprise the final bitstream. Other capabilities of APAC include profiles and levels that have been designed in a hierarchical order to allow for future expansion.
I've mentioned Profile 0 and Profile 5 encoding already. However, the decoders on all Apple devices support Profile 0 bitstreams, which, because of this hierarchical design, means that all the bitstreams from Profile 0 to Profile 31 will be decodable on Apple devices. Interactive elements, multiple scenes, and languages can all be contained in one APAC bitstream. This is achieved through the support of independent substreams and an elaborate scene description mechanism. Now that I've described the capabilities of the APAC codec, I'll hand off to Tejaswi. He'll cover more details about how to prepare APAC bitstreams for streaming with HLS. Thank you, Deep. Hi, everyone.
My name is Tejaswi Nanjundaswamy. I'm an R&D software engineer on the immersive audio team. To understand how to prepare APAC bitstreams for streaming, I'll start by explaining a little bit more about the different types of APAC packets Deep just showed. Specifically, there are independently decodable packets, or IPs, audio sync packets, or ASPs, and dependent packets, or DPs. Here are some examples of valid combinations of packets for HLS delivery. The first example is a combination of ASPs and IPs. The second example has ASPs and DPs. And the third example has all three types: ASPs, IPs, and DPs. I'll next explain more about these packet types. First, there is the independently decodable packet, or IP. These packets can be decoded without any knowledge of previous packets. But if decoding begins with an IP, it does not produce valid audio output until additional packets have been decoded. The number of these additional packets to be decoded is specified as the decoder refresh count in the pre-roll sample group of the MP4 container. This decoder refresh count is typically one for APAC. Next, there is the audio sync packet, or ASP. This type of packet contains all the information required to produce valid audio output. While ASPs have the benefit of producing fully valid audio output, they carry extra information.
That is, they have a bigger packet size and result in an increased bit rate. ASPs are recommended to be included about every one and a half seconds, which is about every seventh packet for 48 kilohertz audio. This ensures that the impact of the bigger ASP packet size is negligible on the average bit rate. Finally, there is the dependent packet, or DP. These packets are encoded with a dependency on previous packets for compression efficiency. Note that starting a decoder with a DP will result in an error. Now, looking at the different combinations of packet types: the first combination, ASPs with IPs, is not very bitrate-efficient, since IPs do not exploit the dependencies across packets; but on the upside, random access requires very little throwaway decoding. The combination of ASPs and DPs is more bitrate-efficient, but for random access the decoder has to go back to the nearest ASP to start decoding and throw away samples that are not needed.
The third combination, with less frequent ASPs and more frequent IPs in between DPs, is a good midpoint in this trade-off. It provides both compression efficiency and ease of random access. Now I'll cover the steps to prepare the APAC bitstream for HLS. The first step is to use the mediafilesegmenter command shown here to segment the bitstream. To learn more about the available options for the mediafilesegmenter tool, take a look at its man page.
Download the mediafilesegmenter tool from the Using Apple's HLS Tools page on the Apple Developer website. The mediafilesegmenter command creates an M3U8 file, a plist file, and fragmented MP4 (fMP4) segments. The M3U8 file should be referenced in the multivariant HLS playlist. The plist file contains all the details required to create the multivariant HLS playlist. If you use your own segmenter, it should ensure that each fMP4 segment starts with an ASP. Apple's mediafilesegmenter tool already takes care of this requirement.
Here is an example of a multivariant playlist for the Apple Immersive Video profile with APAC spatial audio that has fifth-order ambisonic content and 15 objects. I'll break this down for you. The first highlighted line of this playlist shows how the M3U8 file created in the previous step is referenced in the multivariant playlist. The other attributes required for this line are available in the plist file created in the previous step. The second line shows how the audio-related attributes are combined with those of the video. The details about the content are included in the CHANNELS attribute of the HLS playlist. The first number of this attribute is the total number of channels. This is followed by the details of the content. Some examples of this are NOA, which means there is Nth-order ambisonic content; BED-M, which means there is a channel bed with M channels, for example a 5.1 bed; or OBJ-P, which means there are P audio objects. In the specific example shown here, there are 51 total channels, which includes fifth-order ambisonic content and 15 objects. And finally, here is an example of an HLS playlist for the Apple Projected Media Profile, where the APAC bitstream has third-order ambisonic content. Here again, the first highlighted line shows how to include the M3U8 file created by the mediafilesegmenter tool, and the second line shows how the audio-related attributes are combined with those of the video.
Here are the next steps for all of you. Use the Apple Spatial Audio Format to create the most realistic 3D audio environments possible, leveraging its support for a vast set of new metadata for audio objects, higher-order ambisonics, and adaptive rendering. Use APAC, an efficient, extensible, and versatile audio codec for transporting various immersive audio formats, including ASAF. APAC is supported in HLS, which makes its integration into existing and new streaming pipelines effortless and seamless. So go and adopt these technologies to stream spatial audio. Thanks, and I'm looking forward to experiencing the amazing spatial audio you will all create. Now over to Sergey.
Thank you so much, Tejaswi and Deep. All right, this was our last presentation from the Apple Streaming Media team, and now we are moving on to community topics. For the first presentation in this section, I would like to welcome Mickaël Raulet from ATEME to share his proposal for handling some advanced advertisement formats. Welcome, Mickaël. Thank you, Sergey.
Proposal to enable picture in picture and L-shape with interstitials (Ateme)
Hello, everyone. I'm Mickaël Raulet. I'm going to speak about the proposal to enable picture-in-picture and L-shaped ads. This is the presentation that I will be speaking about: I will start by explaining the genesis of the project, then the proposal, then two examples, picture-in-picture and L-shaped ads, and I will finish with next steps. In this proposal, we want to enable low-latency HLS and dynamic picture-in-picture and L-shaped ads all together. In the video on the left, you see the picture-in-picture HLS interstitial that is displayed. When you have the timecode, it is live; when you don't have the timecode, it is an advert that is displayed. For this content, the content producer can insert an overlay without interrupting the primary presentation, to display additional content to viewers. This work has been modeled as part of the HLS interstitials specification.
We want to enable two features. The first one is picture-in-picture, where you can target ads to the viewers. You can also target a viewer's favorite sports teams during big events like the Olympic Games or the World Cup, or prepare the viewer for their next targeted sport. That's another feature we can enable with it. The second feature you can enable is L-shape, where you can add HTML5 pages containing ads or widgets. On the industry side, to enable picture-in-picture or L-shaped ads, there are ongoing standardization activities around MPEG-DASH. A DASH extension has already been made available in SCTE 214-1, 2024 edition. Everything that I will be showcasing in the demos uses Shaka Player as a proof of concept for both DASH and HLS. We use multiple players, like what is done for HLS interstitials, but in this case to enable either L-shape or picture-in-picture.
I will now present the overlay concept. The server can schedule overlays by adding the class com.apple.hls.overlay to EXT-X-DATERANGE tags in the playlist of the primary asset. Overlays are self-contained. They can be different types of content, HTML, images, or video, and are specified by URI. Devices that are not compatible should ignore server-generated overlay events when playing a primary asset. I will now go into the details of the proposed class. We have various types of attributes. I will start with the ones describing the asset: X-OVERLAY-ID, X-ASSET-URI, and X-ASSET-MIME-TYPE.
The X-OVERLAY-ID value is a quoted string used to identify overlay events. The X-ASSET-URI value is a quoted string containing an absolute URI for a single overlay asset. The X-ASSET-MIME-TYPE value is a quoted string of the MIME type associated with X-ASSET-URI. All of these attributes are required. The second group of attributes we are proposing relates to the position of the overlay: X-VIEWPORT, X-OVERLAY-POSITION, X-OVERLAY-SIZE, X-DEPTH, X-SQUEEZE-CURRENT, X-SQUEEZE-CURRENT-POSITION, and X-BACKGROUND. X-OVERLAY-POSITION: its value is a decimal resolution giving the coordinate of the top left of the overlay, in viewport coordinates. X-OVERLAY-SIZE: its value is a decimal resolution giving the size of the overlay to be inserted, in viewport units.
This attribute requires X-VIEWPORT. Both of these attributes are optional. The value of X-DEPTH is a signed decimal floating point indicating at which depth the overlay should be displayed. Valid values are -1, 0, and 1. A value of -1 indicates that the player should display the overlay behind the main content. A value of 0 indicates that the player should display the overlay on top of the main content, but behind alternate content such as interstitials. A value of 1 indicates that the player should display the overlay on top of the main content and any other alternative content such as interstitials. Now on to the main content: X-SQUEEZE-CURRENT. Its value is a decimal floating-point percentage used to reduce the size of the main content to a percentage of its initial size. X-SQUEEZE-CURRENT-POSITION: its value is a decimal resolution giving the coordinate of the top left of the main content after it has been squeezed with X-SQUEEZE-CURRENT, in viewport coordinates. X-BACKGROUND: now we take care of the background. Its value is a quoted-string URI used to display a simple background behind the main content and all other overlays. This attribute is optional.
Now I will present two examples demonstrating the technology. In the video on the left, you have the timecode when it is live, and when there is no timecode, that means we are going back to some interstitials or overlays. I'll start at the beginning. On the right here you get some extra information. The first part indicates that the class is extending EXT-X-DATERANGE with the start date and duration. Then we provide the X-ASSET-URI together with the X-OVERLAY-ID and its X-ASSET-MIME-TYPE, so that part relates to the content. And in the next part, we have the attributes describing the position of the overlay on top of the content. In another demo, we do targeted advertisements: we demonstrate the same picture-in-picture but with targeted ads. On the left you have a different advert than on the right. So you can target the viewer exactly the same as with interstitials.
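Before moving to the L-shape example, here is a hedged sketch that consolidates the proposed attributes as a TypeScript type. The names and semantics follow the talk (this is a proposal, not shipped HLS syntax), so the exact spellings and constraints in the written proposal may differ.

```typescript
// Hedged model of the proposed com.apple.hls.overlay DATERANGE attributes.
// Names follow the talk; the actual proposal may spell or constrain them differently.
interface OverlayDateRangeProposal {
  // Asset description (all required)
  'X-OVERLAY-ID': string;       // quoted-string identifier for the overlay event
  'X-ASSET-URI': string;        // absolute URI of a single overlay asset
  'X-ASSET-MIME-TYPE': string;  // MIME type associated with X-ASSET-URI

  // Overlay positioning
  'X-VIEWPORT'?: string;                  // reference viewport for the coordinates below
  'X-OVERLAY-POSITION'?: string;          // top-left of the overlay, in viewport coordinates
  'X-OVERLAY-SIZE'?: string;              // size of the overlay, in viewport units
  'X-DEPTH'?: -1 | 0 | 1;                 // -1 behind main, 0 above main but behind interstitials, 1 above everything
  'X-SQUEEZE-CURRENT'?: number;           // percentage to which the main content is reduced
  'X-SQUEEZE-CURRENT-POSITION'?: string;  // top-left of the squeezed main content, in viewport coordinates
  'X-BACKGROUND'?: string;                // quoted-string URI for a simple background
}
```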
The last example is L-shaped ads. With the L-shaped ads, the first part is again an extension of EXT-X-DATERANGE with the start date and the duration. Then we provide the asset attributes, the X-ASSET-URI together with the X-OVERLAY-ID and the X-ASSET-MIME-TYPE. And we finish with the position of the overlay on top of the main content. The last part is next steps. We see some improvements on top of what we were doing. All of this has been in an experimental phase, but it has been demonstrated at several trade shows. There are still some open items, and maybe more. Here are some of the questions that we might have. Should we consider playing the overlay as a standard interstitial for full-screen playback when the device does not support overlays? How is the audio taken into account, since we have multiple videos that we want to display? How is skip control taken into account? So now I would like to hand off to Sergey, and thank you all.
Thank you so much, Mickaël. What I would like to ask Mickaël, in the back there, is to start a thread on the HLS interest group, which I believe everyone here is part of, so we can provide him feedback or suggest some answers to the questions he had on stage. That way we can get the communication going. And the next to present is Zachary Cava from Disney+. He'll talk about Disney's experience adopting interstitials in their apps. Welcome, Zach.
HLS Interstitials case study (Disney+)
Thank you, Sergey. Hi, everybody. Yes, interstitials have created a lot of problems for a lot of people here in the room, so I'm happy to talk about them, too. Like Sergey said, I'm from the Walt Disney Company. I lead the media architecture team there. And I wanted to give you a little flavor of HLS interstitials and how we've adopted them, used them, and seen a lot of success with them. And to really go into this, we want to talk about three kinds of things. First, I want to walk you through the history of how the experience within the Disney+ application has evolved, because it's really driven our technology adoption. From there, we'll talk about our HLS interstitials deployment, how we've done it, and what we've achieved with it. And then I want to push a little bit at the edge and talk about what we're starting to see at the forefront of our implementation and feed it back to the community for some conversation, as a little bit of a look ahead. So let's talk a little bit about our technology journey. Now, long ago, far away, in the year 2019, we launched Disney+. And it feels so much longer ago than it is. When we launched Disney+, it was meant to be the showcase for Disney content, the premier destination, the highest quality with the most features, but it was only for Disney content. And the goal in the first years was to expand that and get it as far into the world as possible. And so we focused on global expansion.
But as with all things, the needs of the consumers changed, the needs of the business changed, and so we started doing new feature development. In particular, we started with ad insertion, bringing consumers a choice of an ad-free or ad-supported experience, and at the same time brought in more experiences like live events. Going into 2023, we brought live linear support in and started rolling that out internationally within Latin American markets, and started to bring Hulu and Disney+ a little bit closer together by bringing the Hulu tile to Disney+ in the United States. And most recently, we've continued this trend of expanding features, experiences, and content by bringing the ESPN experience within Disney+ as well. This gives you a true content hub where you can experience all types of content: on demand, live linear, and live events.
But the thing that made this successful was that back in 2021, we made the choice to fundamentally change how we did media, and we moved to a concept of composable media experiences. So what do I mean by composable media experiences? Well, the first thing is that I need to have a pool of main content. This content needs to be prepared and usable in any number of experiences or use cases or scenarios that I want to use it in. To couple with that, and to actually make these experiences that consumers are going to interact with, I need dynamic auxiliary content. And what kind of content is this? Well, if you think across experiences, on Disney+ and on Hulu, the same content might actually have different branding. So you'd have a Disney+ bumper on Disney+ and a Hulu bumper on Hulu. Only Murders in the Building, if you haven't seen that.
And then there are other aspects of content too, right? Advertisements. Ads are going to be decisioned per user, and you'll have uniqueness around them. And then even content experiences and content inflation can differ per application. The Disney+ experience is actually very language-aware and has a lot more translation that happens with it, and so we actually have the concept of dub cards that can be stitched in to attribute the voice actors. And you're composing this together on the same base content in order to produce these unique experiences. And it's with these essential components that we really wanted to build a full library to support many platforms and many experiences globally.
Now, we had done a little bit of this before with Hulu. In fact, we tried every method of intermixing assets together with Hulu: client side, server side, they were all great. But at the time, in 2021, we were already exploring server-guided ad insertion, and we were already talking about it a little bit too, but mostly in a DASH context. Now Disney+, of course, was based on HLS as a platform, and we wanted to take that further. And 2021 is actually when interstitials were first introduced to the world in great form. So we decided to adopt that and try it out, and it's been great so far. So I want to walk us through how we went about deploying this.
So let's talk about HLS interstitials deployed. When we think about taking these things on, we actually had to build some foundational assumptions into the platform. We had three key ones. The first was how we conditioned the content, bringing strong content-conditioning boundaries. The next was how we dealt with auxiliary content and its codec consistency with main content. And finally, we really wanted to separate the concept of the experience from the underlying playlist that we served. So let's take a deeper look at each of these things.
So when I say content boundary, I really mean segmentation, right? If I want to insert content into the main content, or auxiliary content into the main content, or replace portions of the main content, I'm going to do that at a segmentation boundary. So what does this mean from an interstitials perspective? Every single one of our DATERANGE tags has X-SNAP with OUT and IN. There's not one that doesn't have OUT. There's not one that doesn't have IN. They're always there. It's always the same. It's always consistent. And this takes a massive amount of preparation work. When we launched ad support for our on-demand content, we already had a pool of content, and it was not prepared for ads. We could have done the harder route of trying to do it without segmenting, but that was actually going to limit our reach. So we actually had to develop a lot of advanced techniques to bring segmentation into content without reprocessing all of it. But this had a really big benefit on the actual playout side, because we were able to achieve seamless playout across the entire device landscape. Caveat: there are some old devices that are a little bit slower, but the majority are getting seamless playout. And that's very, very key to having a very, very good user experience.
But that's only half of the seamless playout picture, because the other half is the auxiliary content codec. The specification talks about this, and it highly recommends that you have codecs that are consistent with the content you're splicing into. I'm here to tell you it's really not a recommendation; it's a must. And this is because, across the environment, there are not a lot of devices that can actually transition between codecs. It's very, very hard to go from AVC to HEVC. It's even very hard to go from AAC to AAC if one is low complexity and the other is high efficiency. And so we ended up, again, having to look at processing as the point where we decided to lift the effort forward. And we did advanced techniques like taking stereo and bringing in surround sound or positional audio mixes, and taking SDR to HDR, because you don't really want to see the TV flicker and rebadge as you're going in and out of the interstitial content. But the nice thing is it's not as strict as SSAI. You don't have to have every bitrate matching exactly to the same standards as the content, so you can still produce the pools of auxiliary content independently of the main content. But it is required that we get to the same level of codec consistency and baseline. So that's really the preparation on the media side. It gets you a consolidation, a very nice pool of assets to move forward with.
But then how do we describe that? And that's where the majority of the HLS interstitials spec is, because it's all about description, right? And we made the choice that we want to keep the experience agnostic from the playlist. And this has its reason for being in scale, right? The more that we can have both the content and the interstitial assets highly cached, the better chance we have of actually serving at the global scale that we want to serve at. And this has an interesting impact on the attributes within the interstitials spec. First, you can't really use X-ASSET-URI; you can only use X-ASSET-LIST. Because the URI you're inserting there can't be custom to the user. And if it is not going to be custom to the user, you can either stitch a bunch of playlists together to make one m3u8 (again, not something we wanted to do), or you can use that asset list object as a customization vector. And that was our choice. X-RESUME-OFFSET is always set based on the content itself. So it's how we want to manipulate or move you within the content, without consideration of the auxiliary content that's being inserted. And in the live case, X-PLAYOUT-LIMIT is always set to match X-RESUME-OFFSET. That is actually just a constraint of our library: we always want you to stay within the live head of that content; we don't want to cut it off. But then there are other attributes within the spec: X-RESTRICT, X-TIMELINE-OCCUPIES, X-TIMELINE-STYLE. Those are about describing to the player how to render, or how playback control should behave. And that works well, except we're going to use that content with different apps that have different experience expectations, or different users that have different aspects of their subscription. So those values can't be set within the playlists themselves. And this really means that everything that we want to do from a customization perspective, we do via the asset list.
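Since the asset list is the customization vector, it helps to picture its shape. A minimal sketch of an X-ASSET-LIST response is below; the ASSETS array with URI and DURATION per entry follows the HLS interstitials specification, while the URLs and durations here are made up.

```typescript
// Minimal sketch of an X-ASSET-LIST JSON response. Field names follow the HLS
// interstitials specification; the values are illustrative only.
interface InterstitialAsset {
  URI: string;       // playable asset (for example a bumper or a decisioned ad)
  DURATION: number;  // seconds
}

interface InterstitialAssetList {
  ASSETS: InterstitialAsset[];
}

// Example of a per-user decisioned response resolving a single break:
const assetList: InterstitialAssetList = {
  ASSETS: [
    { URI: 'https://example.com/bumpers/disney-plus.m3u8', DURATION: 4.0 },
    { URI: 'https://example.com/ads/decisioned/creative-123.m3u8', DURATION: 30.0 },
  ],
};
```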
That's part of our service. Now, we also do some aspects via the player APIs, but the asset list is that customization vector. And with just those three foundational bits, we've actually been able to drive all the experiences that I listed earlier. This is deployed globally at scale. It is not just Disney+, but also ESPN as of this August. Every stream, VOD, live, and linear, is being done with this technology. I can tell you that it works, and you can get there too. We have one-to-one personalization across every stream for our ad decisioning. And the experiences are able to evolve without changing any content. You might have seen recently that we launched the Hulu brand internationally by replacing the Star brand. Many years ago, this would have been months of effort across many teams and a lot of encoding. This took no encoding, it took no adjustments, it was just a config change, and the entirety of the content was rebranded in that experience. It's a very, very powerful mechanism, and it's bringing a lot of customization to us. By the end of the year, we'll have a lot more of the Walt Disney Company applications running on this as well. But as always, technology continues to evolve. More scenarios, more use cases. So I want to walk you through a couple of interesting ones that we started poking at, just to throw them out in the room and try to get some understanding, some thoughts. And the first one I want to talk about is mixed-purpose interstitial resolution.
What does that mean? Well, when we wanted to put the concept of a bumper and an ad in the playlist, we broke our rule. We put two tags in and labeled one bumper and labeled one ad. So when we resolved it, we resolved the bumper and then we resolved the ad. This was great, except it broke that rule. If we wanted to change anything or introduce a new type, we'd have to go reprocess all the playlists and redo all that content again. That's not useful. So we had to go fix it. We narrowed it down to a single interstitial resolution tag for any point. And that meant that as we resolve, we do one single resolution in order to bring out that experience. So bumpers and ads resolve in a single asset list. That's great. But if we want to introduce something new, there's actually more experience coordination that has to happen here. And today, we're doing a lot of this within the client applications. So when we introduce promos, the promos are only there once we update all the clients. But we're looking to go more and more towards the service as a resolution vector for these lists, because we actually have these varying across experiences now. In some apps, you might get a promo, and then a bumper, and then some ads. And in other apps, you might get ads, and then a bumper, and then a promo. And it might vary by region. It might vary by subscription. It might vary by the content. But that's the customization and dynamism that we want to have for the product.
And so we're really trying to look at how we go further with this. And we'd like to see if there's some way to bring some level of per-asset control within that asset list. Because what we're talking about here is: can I lock the controls during the ads, but not the bumpers or the promos? Some of the mechanisms that already exist and were presented earlier, like skip control, are already starting down that path of thought, but looking at it at the individual asset rather than the holistic asset list. So that was one case. Let's go a little bit deeper. Everybody loves live TV. Everybody loves live sports.
Live sports are also the most complicated thing to insert ads into. So when we think about an ad break today in live, we think about a start and an end: we replace the whole thing, everybody's happy, we go home. Until you get to live sports. In live sports, a lot of times you might have something like an out-slate, followed by an advertisement, followed by an in-slate. Excellent. But actually, that advertisement might also be broken up: you might have a distributor placement opportunity, which means the person distributing the stream can replace those ads, or a provider placement opportunity, which means the person providing the stream can replace those ads. And you actually might need to vary what you're doing based on the experience that you're activating in, or the user subscription that you're playing with. So one user might be in Denmark and resolve the slates to country-customized slates, but then a full DAI break. And another user on the same stream, in a different app, with a different subscription, might need to do something completely different, and they might also need to be playing portions of that main content as part of that ad break. This has been a little bit mind-bending. We have tried and played with this a little bit. We have tried introducing DATERANGE tags for every little piece, but then, of course, if we want to collapse them, we want to make a decision on the whole pod and not on the individual pieces. So there really is more dynamism to it, and we are interested in: can we do it, and can we also keep it as just a single interstitial resolution? It's an area to explore and think about. But it's the next frontier, and it's part of the fun of development, right? There's always more to do.
But with that, you've reached the end of me talking. Just a couple of closing thoughts. Where we are today and where we go is always kind of dynamic, right? We're always evolving based on the experiences that we need to power and the things we need to bring to the consumers. And that's really pushing forward our innovation. That's awesome. If you are in a similar situation, you want to bring customization to streams, you want to prepare and use streams across apps, server-guided ad insertion and HLS interstitials can 100% help you succeed in doing that. We've done it. You can do it. I believe in you. But importantly, I want everybody to remember that this all started because we had conversations together. It wasn't somebody off in a silo. It was all of us coming together to have conversations. That's what I really enjoy about this forum.
I really enjoy the community. And I really want to thank everybody for letting me come here and talk today, as well as Sergey, who is going to come back out here. Thank you. Thank you so much, Zachary. Your team's work on the adoption of interstitials is greatly appreciated, and thank you for sharing this very useful insight with all of us. So the next community topic is with Yuriy Reznik, representing the Streaming Video Technology Alliance. He's presenting on the progress of content steering unification between HLS and DASH. Welcome, Yuriy.
Toward a unified standard for Content Steering (SVTA)
[Yuriy Reznik] Thank you very much. My name is Yuriy Reznik. I'm co-chair of the Players and Playback working group at the Streaming Video Technology Alliance. And I'm going to be talking about content steering, a technology that was invented here at Apple about four years ago, and which nowadays is finding much broader deployment and reach, not only in the HLS ecosystem but also in DASH, and on all sorts of devices. And I'll talk about what we are doing in the Streaming Video Technology Alliance, jointly with the HLS interest group, to make it even more ubiquitous, long-lasting, and successful in the industry. So the talk will have a few parts. I'll first talk about multi-CDN streaming systems, how this whole concept came about, and what problem it solves. Then I'll talk about how content steering technology works. Then I'll talk about the momentum we currently have in the industry, with players, technologies, and servers being built to support it. And then I'll talk about the things that we are doing in standards organizations and, of course, next steps. With this, let me jump to the main topic, the problem and context in which this technology arises, and this is multi-CDN streaming. So just a sanity check: I assume everyone knows what streaming is and what CDNs are. I see smiling faces.
Good, thank you. So everyone who does streaming and knows CDNs knows that there is no such thing as a perfect CDN. Every CDN has a different geographic reach, has limits on capacity, has limits on caching capabilities, and sometimes it just has failures. To design systems that can satisfy quality requirements, scale requirements, and geographic reach requirements at a certain level, the most straightforward way to do it is to use multiple CDNs. Multiple CDNs can be used to achieve better geographic reach, or to implement failover if one of the CDNs becomes unavailable. They can be used to do load balancing, QoE optimizations, cost optimizations, and so on. But there is one problem with designing multi-CDN systems, and that is how to implement switching.
In a world of static HTTP content, of course, you can use DNS switches. But if content is dynamic, with particular timelines of availability, or perhaps not exactly the same content coming from two redundant encodings, just switching mid-stream might bring you some surprises. There have been many other technologies tried to implement switching in the past: manifest rewrites, server-side solutions, client-side solutions. In fact, the Players and Playback group at SVTA conducted a study recently comparing all those technologies. The conclusion was that there is no silver bullet, until, of course, content steering came around. So this technology came about, as I mentioned, about four years ago. Let me just give you a brief explanation of how it works. To make things simple, I show an example of a system that is using two CDNs carrying the media streams, and another entity just managing those CDNs. With content steering, we are adding one more entity. It's called the content steering server. This is a server-side element that sits in the system, and there are declarations coming at the manifest level explaining how to use those content steering servers and where they are located. So in HLS, there is an EXT-X-CONTENT-STEERING tag, and in DASH, there is a ContentSteering descriptor, which both point to the steering server and also provide some extra instructions, for example, which CDN to use first before you start calling it. Then players, once they notice that there are such declarations, that there is a content steering server, start calling the steering server with some parameters saying which CDN they are currently using and what bandwidth they are observing. Then the steering server sends back a response, which is basically a very simple JSON file, saying: this is steering server response version 1; the TTL, the time at which you should call me back, is, for example, 300 seconds; reload me at this particular URL; and meanwhile, please use CDNs in this particular priority, for example, use the beta CDN first, and then if something bad happens to it, you can of course switch to the second one as a second-priority one. So, a very simple design. As you notice, it's just an extra server added to an existing deployment, and it allows us to programmatically control the behavior of clients and implement things like failover, load balancing, and many other cool things.
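As a concrete illustration of the exchange just described, a steering response might look roughly like the sketch below. The field names follow the HLS content steering specification; the pathway names, TTL, and URLs are illustrative only.

```typescript
// Sketch of a steering manifest (the JSON the steering server returns).
interface SteeringManifest {
  VERSION: number;               // currently 1
  TTL: number;                   // seconds until the client should reload
  'RELOAD-URI'?: string;         // where to reload the steering manifest from
  'PATHWAY-PRIORITY': string[];  // CDN pathways in order of preference
}

const exampleResponse: SteeringManifest = {
  VERSION: 1,
  TTL: 300,
  'RELOAD-URI': 'https://steering.example.com/manifest',
  'PATHWAY-PRIORITY': ['Beta-CDN', 'Alpha-CDN'],
};
```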
So, the benefits. It is, of course, standards-based. It's now standardized in both HLS and DASH. It ensures seamless switching. Why? Because it is now the clients who get hints that it's time to switch to this CDN, and they can switch properly. They know the state of the buffer, they can load ahead of time and make sure that the playback pipeline is ready to switch to the different representation, and so on. It is easy to implement and deploy; the steering servers can be added to existing deployments. You don't need to tear down anything, and you don't need to replace any elements in the system. It is supported by both HLS and DASH systems; in fact, steering servers could be the same for both HLS and DASH. And it supports the full spectrum of applications: load balancing, failover, QoE optimization, and so on. Wonderful technology. Not surprisingly, it has quite a bit of momentum.
It is supported now by a large selection of players. Of course, all video-capable Apple devices can use it. The JavaScript web players, HLS.js, dash.js, Shaka Player, Video.js, have all added support for this technology. ExoPlayer, on the Android side of the world, is also in the process of adding this technology, as was just communicated on the HLS interest group. Fantastic deployment. Now we're really talking about reaching all platforms, or the majority of platforms that matter. Equally, there is a considerable number of packagers, steering servers, and open source tools available to work with this technology, and there are integrated solutions from several companies already using it. Last but not least, I also must, of course, mention that at SVTA, we put together several open source projects and reference streams that developers can try, to see if this technology works for them. And importantly, the momentum goes beyond just the pure implementation communities. There is also interest in the academic world.
There are a number of papers that have been published recently in academic venues talking about how to design systems and how to solve different optimization problems. For example, how to solve optimizations improving QoE, how to minimize costs, or how to manage CDN commits using this technology, and there is work looking at new applications. For example, there is a European project called 5G EMERGE, talking about streaming using multiple satellites.
Apparently, switching streams between satellites is also a domain where this technology could be useful. It's a technology that's going to the stars. Space-age technology. So now, the road to standard. Naturally, the technology is out there. There is momentum. And there have been several stages in its evolution as far as standardization is concerned. Of course, initially it was introduced by Apple and was added to the IETF HLS specification, RFC 8216bis, in 2021. But then, shortly thereafter, the DASH community started looking at it and said there should be a very similar, and ideally compatible at some level, solution for the DASH ecosystem. So the DASH Industry Forum produced a specification called Content Steering for DASH, which was eventually standardized by ETSI. And there was also an update done in MPEG, in the MPEG-DASH 6th edition, declaring the manifest tags and related elements of the standard needed to practice this technology. But then, around September last year, the SVTA Players and Playback group looked at the standards and said: look, they both define technologies that are largely overlapping, and there are very large common elements. For example, the communication between server and client is almost identical. So what is the point of having two standards declaring the same thing? Looking forward, if each standards organization keeps evolving it in slightly different ways, that will create fragmentation.
So they asked at that point for unification of the standards, and we are pleased to recognize that the HLS community reacted to this and has now produced a draft standard. It's called the IETF Content Steering specification, a proposed standard in the IETF, and that is the starting point for this unification activity. So if we look at where we were initially, with two stacks of standards defining content steering for HLS and DASH separately and with overlap, where we want to go, ideally, is a world where the HLS and DASH standards of course define HLS and DASH, but there is also a common standard in the middle that defines the behavior of steering servers and that particular functionality that is definitely common. If we look at defining common specifications, then there are elements that are definitely common, for example, things like the steering manifest syntax and the general model of interaction between steering servers and clients. Then there are also elements that are mostly common: they are almost identical, but there are some small differences in the details.
Examples of this are elements of client behavior and the steering query parameters. So this is where we could probably do a little bit more work, such that the common spec we put together will be bigger, and hopefully a more solid foundation for the future. And then there are some elements that are different, which are naturally different. For example, the manifest declarations are different, so we can't fix this. But for the mostly common elements, there is a bit more work, and I have a couple more slides to bring examples of things which we hope to eventually harmonize, to make sure that we can move to a world where we have a solid unified foundation and broader interoperability between systems. So one of them is query before start. It's a feature in DASH which tells the client: before you start playing, please call the steering server and load the steering manifest. In fact, if you call the steering server and load the manifest, the steering server can at that point do pathway cloning and introduce many CDNs. You could even have a media manifest with no CDNs at all, and with this query before start, you populate your whole distribution.
So in DASH you can do that query-before-start; in HLS it's out of scope, and HLS currently says that whether to call the steering server before playback or not is entirely up to the design of the client. What we think is that we probably need one extra clause in the standard saying that it's the responsibility of the content steering protocol to do it. But I hope we can resolve this. It's a cool feature, and maybe HLS could also think about doing it. Another thing that is almost, but not quite, identical between the two is what's called the penalization delay. The HLS specification says two minutes is good enough; DASH says it's one TTL, up to five minutes.
So I'm hoping maybe we could get together and figure it out. Maybe there is something we could agree on to make the behavior of clients on both ends of the ecosystem identical. Another example: the steering query parameters. In HLS, they are called _HLS_pathway and _HLS_throughput. In DASH, they are called _DASH_pathway and _DASH_throughput. Why the heck do we need two sets? Semantically, they're identical. Maybe we could have a single set of parameters.
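To make the overlap concrete, here is a minimal Swift sketch of how a client might build its steering request using the HLS-flavored parameter names (_HLS_pathway and _HLS_throughput); the DASH flavor differs only in the prefix. The base URL is illustrative.

import Foundation

// Build the steering-server request URL, reporting the pathway currently in
// use and the most recently measured throughput (bits per second).
func steeringRequestURL(base: URL, currentPathway: String, measuredThroughput: Int) -> URL? {
    var components = URLComponents(url: base, resolvingAgainstBaseURL: false)
    var items = components?.queryItems ?? []
    items.append(URLQueryItem(name: "_HLS_pathway", value: currentPathway))
    items.append(URLQueryItem(name: "_HLS_throughput", value: String(measuredThroughput)))
    components?.queryItems = items
    return components?.url
}

// Example result: https://steering.example.com/manifest?_HLS_pathway=CDN-A&_HLS_throughput=14000000
let url = steeringRequestURL(base: URL(string: "https://steering.example.com/manifest")!,
                             currentPathway: "CDN-A",
                             measuredThroughput: 14_000_000)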
There are a couple of nuances, though. In DASH, you can in fact signal several pathway values. For example, if a long time passed between TTLs, between the client and the server, the client may have done several switches before it calls the server again. Say there were CDNs A, B, and C, the preferred one was CDN A, and the client switched to B and then to C. It could then say: "OK, I'm now using CDN C, but before that I tried A and B, these were the throughputs I saw, and this is why I switched."
So maybe, instead of signaling just the last one, signaling the whole chain would be a useful feature to have. In DASH we also allow CMCD to be piggybacked onto the parameter stream going from clients to the content steering servers. Again, maybe we should talk more, and if we agree that some of these things make sense, I think life for the implementation community will just be easier: we could design a single server that works consistently across all systems. As for next steps, of course we want to complete the work and finish the content steering specification as a formal IETF draft, then update the HLS and DASH standards, and build the next generation of products that use this technology. It really has a chance. It's almost ubiquitous. Let's make it ubiquitous. And on that note, thank you very much.
[Transcript missing]
Q&A
All right. Nice and comfy. Much better position than when you presented. Awesome. All right, let's get started. The first question is for Roger. Can you talk through different keys being used on different variants and how clients should handle key rotations on variant switch scenarios? Sure. So as most of you probably know, it's pretty common for content providers to secure different variants or different flavors of content with different content keys. And so your SD quality might have one key, your HD quality might have another.
Typically, a client is only operating on a single variant at a time, which means only that variant's playlist is visible to it; if it's sitting on the SD tier, only the SD playlist is visible. So there isn't necessarily an opportunity to preload the key for a different variant, and that largely wouldn't be helpful anyway, because the client doesn't necessarily know which variant it's going to switch to. If we get to the point where preloading is so successful that the remaining worry is key storms as individual clients switch variants, which is not as much of a thundering herd, then we're probably looking at more of a multi-key type of scenario: some kind of strategy where, when you specify a preload key for a particular HD variant, the server that gives you that preload key also hands back a selection of other keys you'd need if you switch variants. So I think that's the path. Whether we get there, and how soon, we'll see. Great. Thank you so much, Roger. And sorry, I forgot to introduce Sumaya. You probably remember her from last year: she presented on FairPlay, and she is here from the FairPlay team. So the next question is for her.
Does the HLS player handle key rotation automatically, or is it up to the implementation? - Right, it's not up to the implementation. We follow what the spec says, and so the AV player itself will handle all key rotations, as expected. - Thank you. - And the same goes for the key preloading.
- Yes.
- The player will do that for you automatically. You don't need to opt in, really. Great. Thank you for clarifying. And Ashank, a question for you. Can a skip control be configured for the main content, not an interstitial? Ashank Verma: Yeah, so skip controls: this is a topic we talked about today. That API was specific to our HLS interstitial solution, for AVPlayerInterstitialEvents in the case of AVFoundation. If you want skippable behavior for your main, primary content, we don't have a facility to provide that natively with HLS. But there are ways you can configure it, particularly via date ranges. You can, for example, add a date range that says: hey, this is the window where I want a skip button to show up, and then it's up to the app to handle rendering that skip button and the action taken when it's clicked. Thank you. And if you are in a position where you really need a skip control for an AirPlay scenario or some kind of system UI, I would say get in touch with us and let us know what you're looking for. We don't have a facility like that today, but as with anything, we're interested in knowing where your pain points are.
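To make the date-range approach Ashank described more concrete, here is a minimal Swift sketch using AVPlayerItemMetadataCollector to observe EXT-X-DATERANGE tags so the app can render its own skip button for the main content. Which date ranges count as "skip windows" is an app-defined convention, not anything defined by HLS.

import AVFoundation

// Observes EXT-X-DATERANGE tags in a playlist so the app can render its own
// skip button for the main content during the windows the packager marks.
final class SkipWindowObserver: NSObject, AVPlayerItemMetadataCollectorPushDelegate {
    private let collector = AVPlayerItemMetadataCollector()
    var onDateRanges: (([AVDateRangeMetadataGroup]) -> Void)?

    func attach(to item: AVPlayerItem) {
        collector.setDelegate(self, queue: .main)
        item.add(collector)
    }

    func metadataCollector(_ metadataCollector: AVPlayerItemMetadataCollector,
                           didCollect metadataGroups: [AVDateRangeMetadataGroup],
                           indexesOfNewGroups: IndexSet,
                           indexesOfModifiedGroups: IndexSet) {
        // Each group exposes startDate/endDate, and the tag's attributes
        // (including custom X- attributes) surface as AVMetadataItems in
        // group.items. The app decides which ranges mean "show the skip
        // button now" and performs the seek itself when the button is tapped.
        onDateRanges?(metadataGroups)
    }
}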
Cool. And a quick follow-up question on skip control. Can the skip control attribute be used with late-bound asset-list HLS interstitials? For example, to skip one ad in an ad break? So to answer the first part of the question: yes. Today, we have a facility for a provider to specify a skip control for a singular late-binding interstitial, so that's across the whole ad pod.
So if your ad pod has multiple items, the state of the spec today says you can define skip controls for that whole ad pod. I think the question is also referring to a per-item use case here, and we don't offer a facility for that today, but we'd like to hear about it. This is an evolving thing; this is the first rendition of skip controls, and we'd like to hear your use cases. So if you have a use case you think would be served by a per-asset skip control, please feel free to file feedback requests and we'll look into it. I would just add one quick thing to that: if your particular situation is that you want to allow skipping the second half of an ad pod, you can do that using the skip offset, or whatever we call it. You can still have a single interstitial event containing multiple ads, and you can set up your skip offset so that the skip button only shows up, is only active, during the second half. So if that's your requirement, it can be addressed with the current spec.
Right. And the other option is then to potentially consider splitting up your asset list into individual events if you would like to have that flexibility on a per event basis. Sounds good. Thank you for answering this. And let's switch gears a little bit. Let's talk about ASL use cases. So this is the question for Julia. How do you handle playback speed changes in trick play? This applies to both keys and ASL use cases. Yeah, so playback rate changes are coordinated.
So up to 2x speed, there will be playback coordination. That's frame-accurate. However, once you get beyond 2x speed or if you get into negative rates, then at that point it will still be coordinated. However, it won't necessarily be frame accurate synchronization between the streams. Okay. It's good to know.
And one more question on this: does the pause only go to the video streams and to the playback coordinator, and never to the ASL stream? I'm not sure what this question means, but... I think, yeah, during the presentation I had a diagram about how playback behaviors such as a pause are coordinated. The way a pause works is that both players have a playback coordinator that's connected to the coordination medium. One of the players will pause, and its playback coordinator will send that through the coordination medium to the other playback coordinator, which will apply it on its player, such that the pause, or any rate changes or seeks, are coordinated. So in the end, both players, or any other connected players, will receive that pause or rate-change signal. Yeah, thank you. And actually, we got quite a few questions on sign language, so if you don't mind, I'll ask you one more. Is there a way for the auxiliary stream, for example ASL, to appear and disappear automatically, so that it only shows up when needed, and if possible for the main stream to resize if needed, for cases where an overlay behavior is not desired? Yeah, so currently that type of behavior is not supported with AVKit and the AVPlayerViewController, so that would require some custom player behavior, or including metadata in the stream and handling that in the application. We don't have any of that default behavior currently; it would have to be handled by some more custom player logic. Yeah, it's worth noting that the multi-player stuff is still a very app-heavy kind of notion. The AVFoundation framework can coordinate two players, but it's still up to the app to create both players, set them up with the assets, and size and position the players. And so, given that the app is already doing that, it's not that big of a jump for it to also pay attention to, as Julia was saying, metadata with custom date range tags. That gives the app the signaling it needs to say: OK, once we reach this point, I want to move the player over here, or I want to hide it, or something like that. So it's definitely doable. It's just not automatic.
And adding to that a little bit: it's really about how you set up the system. You could have, for example, completely separate multivariant playlists for each of the players. Or, since these are just AVPlayers, you could have them pointing to different media renditions within these streams.
So you could have your sign language streams signaled by the EXT-X-MEDIA tag, as different languages that you can render, or have them as completely separate playlists as well. Yeah, these are great.
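As a rough illustration of the app-managed, two-player approach described above, here is a minimal Swift sketch: a main player plus a second AVPlayer pointed at a separate ASL playlist, with the app owning layout. The URLs are placeholders, and coordinating the two players (pause, seek, rate) is set up separately, for example with the playback-coordination mechanism discussed earlier.

import AVFoundation
import UIKit

// App-managed multi-player setup: the main program and an ASL rendition are
// two independent AVPlayers; the app creates, sizes, and positions both.
final class ASLOverlayController {
    let mainPlayer = AVPlayer(url: URL(string: "https://example.com/main/index.m3u8")!)   // placeholder URL
    let aslPlayer  = AVPlayer(url: URL(string: "https://example.com/asl/index.m3u8")!)    // placeholder URL

    private let mainLayer: AVPlayerLayer
    private let aslLayer: AVPlayerLayer

    init(containerLayer: CALayer) {
        mainLayer = AVPlayerLayer(player: mainPlayer)
        aslLayer  = AVPlayerLayer(player: aslPlayer)

        mainLayer.frame = containerLayer.bounds
        // The app decides where the ASL overlay lives; here, a corner overlay.
        aslLayer.frame = CGRect(x: containerLayer.bounds.maxX - 320,
                                y: containerLayer.bounds.maxY - 180,
                                width: 320, height: 180)

        containerLayer.addSublayer(mainLayer)
        containerLayer.addSublayer(aslLayer)
    }

    func play() {
        // Coordination (pause/seek/rate sync) is configured separately on the
        // players' playback coordinators; without it, these are independent.
        mainPlayer.play()
        aslPlayer.play()
    }
}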
Alright, let's talk a little bit about HLS.js. You know, people find it useful. So the first question is: does HLS.js support multi-view out of the box? So out of the box, as I spoke about earlier, we do support multiple instances: you can create multiple instances of HLS.js with simultaneous playback. There isn't a coordinator to synchronize the content those players are playing, and the same goes for bandwidth estimation; that's something you'd want to do in your application. So, setting a cap on estimates to share bandwidth more evenly, and using playback rate adjustment or seeking to synchronize. Or take a look at the browser's Media Session API; there was an intention to use that for synchronization of content, so that could be a way to achieve it. It's also worth noting, and you may already know this, that in a lot of places where HLS.js is deployed, particularly on things like smart televisions, you don't have a lot of video decoders available to you, and typically the device will only support one protected video decoder instance. So I think even a--
- Yeah, so smart TVs will be limited to one decoder, but in the browser on the web you can-
- Right, yeah, if you're running in Chromium or Safari, then you're probably good.
- And the next question you're probably gonna like the most. Do you plan to support CMCD v1 via query parameters in AVPlayer? Or actually, is this for you? Yeah, this one was about AVPlayer, but it was about CMCD, and also whether there are any plans to support CMCD v2. So CMCD v1 is supported via request headers; we don't have support for query parameters. And we don't have anything new to announce for CMCD v2 support just yet. So file issues, let us know what your use cases are and what you want to see the most. That's going to help us prioritize that work. Great. Thank you. And let's mix it up a little bit more.
So I want to go back to Roger and talk about his proposal. Or actually, it's not about the proposal; the question is about low-latency HLS and regular HLS. Roger Pantos: Oh, OK. So, in the case of a million users accessing the same stream, do you recommend using low-latency or regular HLS? Oh, okay.
Interesting question. The answer... so this is about scale, right? And it may surprise some people to know that if you have the choice, and you've got low-latency HLS working well in a small number of instances, it actually scales better than regular live HLS. Remember, live HLS is a 15-year-old thing, and when we redid it for low latency, we had the opportunity to take a slightly different trade-off: we placed a little more implementation complexity on the server. In original HLS, if you have an HTTP server, you're good, you can do live. For low latency, we changed that trade-off a little and said: if we can take a little more complexity on the server side, we can make a much more efficient delivery protocol, not just for low latency, but for live delivery in general. A lot of the blocking behaviors and the preloading that low-latency HLS offers allow us to deliver a much more reliable, much more dependable live experience. And you can certainly dial back quote-unquote low latency to give you latencies comparable to regular HLS, and then you have something that is going to be more reliable and more dependable. You will have to build that on the back end, so you would have to implement low-latency HLS. But if you do that, or if you've done that, I think it will actually deliver a better outcome for a million users than regular HLS does. It's worth mentioning that low-latency HLS is made up of different features, so you could start by adopting some of them; Delta Playlists are a great one. - Right, for sure. And we've actually had people do that. We've had people say: I'm still using regular HLS, but I have an eight-hour DVR window, so I'm using delta playlist updates so that I'm not paying the cost of reloading those eight hours every six seconds, or whatever my target duration is. And that's a good point.
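For anyone who hasn't used delta updates: the client asks for a delta by adding the _HLS_skip delivery directive to the media playlist URL, and the server replies with an EXT-X-SKIP tag in place of the older segments. A rough sketch of that exchange, with the playlist text held in Swift strings and all URIs, counts, and durations illustrative:

// Request: the client appends the delivery directive to the media playlist URL.
let deltaRequestURL = "https://cdn.example.com/live/stream.m3u8?_HLS_skip=YES"   // illustrative URL

// Response: instead of repeating an 8-hour DVR window, the server skips the
// older segments and only lists the recent ones.
let deltaPlaylist = """
#EXTM3U
#EXT-X-VERSION:9
#EXT-X-TARGETDURATION:6
#EXT-X-SERVER-CONTROL:CAN-SKIP-UNTIL=36.0
#EXT-X-MEDIA-SEQUENCE:1000
#EXT-X-SKIP:SKIPPED-SEGMENTS=4794
#EXTINF:6.0,
segment5794.m4s
#EXTINF:6.0,
segment5795.m4s
"""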
- Yeah, and thank you for this answer. It's a very important statement for the community, so they can evaluate it without trying it first. And let's move on to AVMetrics. I have a question for David about interstitial metrics: what is currently available for interstitials in the AVMetrics framework, and what will be available soon? Sure. So all the AVMetric events that were available for HLS are also available on interstitial player items, the same way they would be for a primary item. If you're asking about events or metrics specific to interstitials, there aren't any, but I'd love to talk to you to understand what might be missing that you'd want to hear about. Yeah. So if the person who asked the question is here in person, please come find David after the Q&A and talk to him about your use case. David, could you maybe talk a little bit about how they would do that? Because you had some code examples of subscribing to metrics. Can you talk a little bit about how someone might do that so it works with interstitials as well? Well, off the top of my head: when you get the interstitial item out of the interstitial event, you can apply the same logic of extracting a metric sequence out of that item, and then you'll get metrics specific to that interstitial item, such as its playlist, its segments, and so on, in the same way that you would out of the primary.
Yeah. So basically, watch the interstitial player from the monitor, I'd call it, and keep an eye on its queue. Exactly. And then subscribe item by item. Mm-hmm. If people are trying that and find it onerous, again, let us know. There is scope for improving that, but that's the way you do it today. And just a general comment about interstitial items: interstitial items are AVPlayerItems accessible to you via the interstitial AVPlayers, so you get all of the AVPlayerItem facilities for interstitial items out of the box, including AVMetrics. Yeah, that's a great clarification. And David, one more question for you. This is a more basic one: do events come in chronological order? David Wang: Yeah, they essentially do come in chronological order. And for more precision, there's also a date field on each event that acts as a timestamp, so you could sort them later if needed. So yes, for all intents and purposes, you're able to get them in chronological order.
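Putting David's description into code, here is a minimal Swift sketch that watches the interstitial player via AVPlayerInterstitialEventMonitor and subscribes to a metric event stream on each interstitial item. This assumes the AVMetrics async event-stream API; the specific event type shown is just one example, and the same pattern applies to the other AVMetric event types.

import AVFoundation

// Watch the interstitial player and attach a metrics subscription to each
// interstitial item as it becomes current.
final class InterstitialMetricsObserver {
    private let monitor: AVPlayerInterstitialEventMonitor
    private var observation: NSObjectProtocol?

    init(primaryPlayer: AVPlayer) {
        monitor = AVPlayerInterstitialEventMonitor(primaryPlayer: primaryPlayer)
        observation = NotificationCenter.default.addObserver(
            forName: AVPlayerInterstitialEventMonitor.currentEventDidChangeNotification,
            object: monitor, queue: .main) { [weak self] _ in
                guard let self, let item = self.monitor.interstitialPlayer.currentItem else { return }
                self.subscribe(to: item)
        }
    }

    private func subscribe(to item: AVPlayerItem) {
        Task {
            // Same AVMetrics API as on the primary item; the event type here
            // (playlist requests) is just an example.
            for await event in item.metrics(forType: AVMetricHLSPlaylistRequestEvent.self) {
                print("interstitial playlist request at", event.date)
            }
        }
    }
}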
Let's see what other questions we have. I have a question for Roger about his proposal. How does live key rotation affect latency or playback stability? - Well, it's supposed to improve them, right? The challenge we face with live key rotation and the thundering herd is that, in the worst case, everybody piles up against the key server, the key server gets backlogged, and it's seconds before you get a key response. In the meantime, your playhead is moving forward, and you hit the segment that needs the new key and you don't have it yet. If that happens, ultimately the player is going to have to stall, because it has to wait for the key. What the preload hinting is going to do for you, hopefully, is let you make that key request further in advance, so that when you reach the transition point where you need the new key, it's there and things proceed. Key requests, generally speaking, are very lightweight from a network point of view, so making that preload hint request isn't going to disturb anything about the current playback. It should just allow things to be set up further in advance so that we can guarantee a smooth transition over that boundary. Now let's talk a bit about key preloading. When you preload the keys, does it mean that the key server has to deterministically have all the keys ready, already generated, to serve the entire session, or can they be generated dynamically?
Right. So that is under the control of the provider. As I said when I was outlining the proposal, it's up to the packager to put the preload hint tag into the playlist. It should do it in advance of the transition, and how far in advance is up to the packager. But, to turn it around, the packager should not put the preload hint into the playlist until the key server is prepared to handle a request for that particular key. So all of that is under the control of the service provider. That being said, the farther in advance the packager can predict that there's going to be a key rotation and when that key is going to take effect, the longer the interval over which you can distribute your key requests, and so the more you reduce the load on the key server. So it's under the control of the server, but doing a better job of prediction means you'll do a better job of scaling your key server. Yeah. Thank you for that.
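For a sense of what this could look like on the wire, here is a hypothetical media playlist fragment held in a Swift string. EXT-X-PRELOAD-HINT with TYPE=PART or TYPE=MAP is what the current low-latency spec defines; the TYPE=KEY line below is only a sketch of the proposal as described in this session, not shipped syntax, and the URIs and timing are made up.

// Hypothetical playlist fragment: the packager advertises the upcoming key in
// advance (sketch of the proposal); later, the normal EXT-X-KEY tag for that
// key appears and supersedes the hint, at which point the hint is removed.
let rotationExample = """
#EXTM3U
#EXT-X-VERSION:9
#EXT-X-TARGETDURATION:6
#EXT-X-KEY:METHOD=SAMPLE-AES,URI="skd://keys.example.com/key41"
#EXTINF:6.0,
segment100.m4s
#EXT-X-PRELOAD-HINT:TYPE=KEY,URI="skd://keys.example.com/key42"
#EXTINF:6.0,
segment101.m4s
"""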
And there is a more generic question, I think for Sumaya, about preloading in general. So Alan is asking: is spreading the time of the request specific to the key preload? Could it be changed to a generic flag that lets any type of preload request be spread across a given date range? Right, so today we only have that spread-out preload behavior for keys and for interstitials; it's not provided for any other resource. But if there is another use case, please let us know, and we can talk about it and see how we can help with that use case as well. Yeah, to expand on that a little bit: today, the only other types supported by the preload hint are partial segments and map files. With partial segments, generally speaking, you're always hinting the next part that's coming up.
And so it's not desirable to delay or spread that to any extent; they're all blocking playlist reloads, or blocking segment loads. So for partial segments there's not really a use case. You could perhaps argue there's a use case for map files, to preload hint upcoming maps, but maps, because they come out of the content store, generally scale better from a thundering-herd point of view than keys do. So we haven't heard a great demand for helping scale map loading. If people are having problems with that, we can certainly look into it as a potential application.
Now, one more question on your proposal, Roger. Why should the packager remove the hint tag for the key after the date of first use hits? This seems to change a key concept of HLS playlists, that once something is in the playlist, it stays. Well, you're asking the author of the HLS spec about the details of the HLS spec. The preload hint tag, as my daughter might say, is of a class; I think we call them media metadata tags. We have very particular rules about the mutability of a playlist: what is allowed to change and what is not. Media metadata tags, and date range is an example of another media metadata tag, are actually allowed to come and go in a playlist. This is already true in low latency for the existing preload hints: they can come, they can go. So we're not really breaking new ground here with the key preload hinting proposal.
I would say, in terms of why: even if we could leave them in the playlist, would leaving them in be a benefit? I would argue no, because the key tag is there by then, and the key tag is the overriding signal to the client that it needs the key. And let me be clear about how we intend to implement this in AVPlayer, and I believe in HLS.js as well if Rob gets the time to do it: when the player sees the preload hint, it will figure out when it's going to load the key and schedule a preload. But if someone made a mistake and that key actually shows up in a key tag in the playlist before that time, then, if we haven't loaded the key yet, we're going to throw away our previous plan and load it right away; the existing behavior takes over. So the key tag supersedes the preload hint tag, and I don't see particular value in leaving the preload hint tag in after the key tag appears. And as I said, the rules allow us to tell the packagers to remove it. So it keeps the playlist small, there's no benefit in leaving it in, and I think we should just remove it. Yeah, that makes sense. Let's talk a little bit about low-latency HLS. Sean, this is a question for you: is low-latency HLS compatible with encrypted streams? Yeah, so in general, HLS's encryption solution, or key solution, is available to all forms of HLS playback, including video on demand, live, and low latency. Okay. And now let's go back to Julia with a question about AVPlayerViewController. The question is: will AVPlayerViewController provide UI to choose the sign language stream to overlay, similar to how subtitle selection works? So currently, AVPlayerViewController does not support this. We'd love to hear more about the use cases, about what you would like to see for that. So afterwards, I'd love to hear a little bit more, and please file feedback requests.
If you do want to do that outside of AVPlayerViewController, again, a lot of this is client-driven, so you could work with the AVPlayers or some custom player logic to do that handling yourself. But the short answer is that AVPlayerViewController doesn't support it today, and we'd love to hear the use cases for it. Yeah, so come find Julia after this.
Because it would kind of be an extension: it sounds like what they're talking about is adding a control. Right now we have pickers in the standard playback UI for subtitles and audio, and it sounds like they're asking for essentially a picker for video, or for visual accessibility. Because remember, as we said earlier, what Julia described is still a two-player solution that's managed by the app; it's not managed magically and automatically for you by a single AVPlayer instance. So if you want to implement this, we've provided a number of tools, but it's still an app lift to build something like it. You're not getting it for free.
And so we think we provide the tools to let you make a good experience, but certainly if you're running into roadblocks, come talk to us; maybe we can help. Yeah, thank you. And one more for you, Julia. Is there an easy way to desync and resync two AVPlayers at runtime? Desync, meaning the two players can be independently paused and play back fully independently; or sync, meaning a secondary player syncs to the main player. Yeah, so there is a mechanism in playback coordination called suspensions, where you can temporarily desync two players. You could have two players in coordination and say, for example, you want them temporarily not coordinated: you add an AVCoordinatedPlaybackSuspension on one of the players, and then any playback changes to that suspended player won't affect the group, and any playback changes in the group won't affect that player. So you can essentially have the players desynced, and once you're ready for them to be coordinated again, you end the suspension and that player joins the group again. So there's that behavior to separate a player out of coordination and then bring it back in. Great, thank you. And the other question on coordination, or I guess multi-view: is there a multi-view API that does the same thing, and can it be used in Safari? Roger, maybe you can take this one. Could you repeat the question? I thought you were going to ask Julia. No, sorry about that; the question is whether there is a similar API available in Safari to do the multi-view coordination across two instances, whether they're native video elements or something else. I think Rob already answered for HLS.js: you can have multiple elements or multiple instances. There's what they call the Media Session API; that's a browser API that might solve that problem, but it isn't necessarily exposing our multi-view coordination. So, right, I would say the answer is: I don't know. Certainly the streaming team and the HLS.js folks don't have anything for that; perhaps there's something in HTML5. Did we mention what we sync on? Do we sync on playing date, program date time? Julia, do you want to take that, how we handle syncing for different kinds of assets? Yeah, so for live we do sync on the program date time; based on that information, we know which segments should be synchronized with each other for a live stream. For VOD that's not necessary; we don't use that, we use the start time, so the initial segment will be time zero and we sync based on the time zeros of those two playlists. But for live, we do have the program date time requirement. There's actually an interesting thing there, as an aside, which is that you can have two live streams whose media timestamps are not synchronized, so you wouldn't be able to use them as, for instance, alternate renditions in a single MVP. But as long as the dates are coordinated, as long as the dates are in sync, you can use the multi-player APIs to present them in sync. The media timestamps do not need to match, as long as the dates do.
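Going back to Julia's suspension mechanism: here is a minimal Swift sketch, assuming the two players are already coordinated (for example over the coordination medium discussed earlier). The suspension reason shown is one of the system-provided ones; pick whichever fits your case.

import AVFoundation

// Temporarily detach one coordinated player from the group, then rejoin.
final class CoordinationToggler {
    private var suspension: AVCoordinatedPlaybackSuspension?

    // While suspended, rate changes, pauses, and seeks on `player` no longer
    // affect the other participants, and theirs no longer affect `player`.
    func desync(_ player: AVPlayer) {
        suspension = player.playbackCoordinator.beginSuspension(for: .userActionRequired)
    }

    // Ending the suspension rejoins the group and resynchronizes the player.
    func resync() {
        suspension?.end()
        suspension = nil
    }
}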
Great. All right, let's shift gears and talk about preloaded keys. The question is for Sumaya: where do the preloaded keys get stored on the client, and is there support for that in the CDM already, or does it need to be added? Right. So clients should handle these keys just like any regular keys; they're not much different. The difference with preloaded keys is that they are scheduled at a randomized interval so that they're requested earlier. Maybe Roger can add more about the CDM support, but to answer the question, clients are going to be the ones storing these keys. Yeah. It's worth noting that both EME and AVPlayer offer this: in EME there's the notion of a key session, a media key session I think they call it, and in AVFoundation there's AVContentKeySession. Both offer the ability for a client to initiate a key request independently of the media. This is used for preloading keys, for accelerating startup; there are a number of different uses. And all we're doing under the hood is the same thing: something initiates a key request, you go get it, and we store it away for later. This is just another application of that same mechanism; it's just that now HLS is codifying when and how that request is going to be issued and timed.
So we don't believe we're going to need any CDM changes, or any changes in the HTML5 APIs, to support HLS key preloading. Thank you. One more question for you, Sumaya. What would happen if the encryption key used for HLS media segments is lost or becomes unavailable to the player? Well, short answer: playback will fail if the key request fails. But I'm assuming we're wondering what the player will do; will there be a retry? We do not retry requests that have failed. We expect the client to take the failure and request the key again if necessary. But if a key request fails, it essentially fails; that's the end of playback there. Makes sense. All right, now let's go back to Roger with a trick question. Someone is asking about trade-offs between quality of service and security, or scalability. The question is: assuming clients want to provide the best experience to their users, they will likely just preload the keys as soon as possible, which leaves us with the same scaling problem. Is there a way to mitigate a client that chooses to immediately request the preload key? Right. So, assuming we go forward with the key preload hinting as described, the HLS spec will say how the client is supposed to behave, just as it does with interstitials and content steering. Now, of course, a particular client implementation could choose to ignore that, but then you're talking about a client that's no longer following the HLS spec, and in the case of such a client, there's not much else we can do. We could add more words to the spec, but they're already ignoring what we've got. So the short answer is that we expect clients to follow the rules we're going to lay out, and so they're not going to behave that way. We have a fair amount of faith that that will be true, at least for AVPlayer and HLS.js, and hopefully other players will follow suit. Yeah, makes sense. So it was not such a trick question. The next question for Ashank: can skip control attributes be used with late... oh, I think we already answered that one. Yeah, sorry. Let's see. So the question is: what is the recommended DVR window length to effectively use interstitials on linear streams? We encounter playback issues when the interstitial duration is longer than the DVR window. Yeah, so if you're encountering issues with a DVR window being shorter than the interstitial duration, please file a feedback request; that's probably something we should look into. It should work with a shorter DVR window. As a workaround, if it is adjustable, we might recommend increasing the DVR window to be at least the length of the longest interstitial you anticipate. But regardless, we should expect a shorter DVR window to work with a longer interstitial, so if it doesn't, please file a feedback request. Yeah, we don't place a constraint in the spec on the interstitial duration versus the playlist window size, and that was a deliberate choice. So we believe the client should handle it. That said, it is an edge case, and there may well be bugs there.
And we've had people file bugs, and we've fixed some of them. So we think it should work. Please continue to test and file bugs. As Ashank said, the workarounds are to make your DVR window longer, or, if you have the flexibility, to break your interstitials up into a larger number of shorter interstitials. But the bottom line is we think it should work, we'd like to make it work, so please file bugs. And when you file bugs, please give us sample content; that always helps expedite things. - And a sysdiagnose, right? So it's sysdiagnose, logs, sample content, the whole package, yeah. - Yeah, great, thank you. This is great. One more question about sign language for Julia. The question is: is it possible to add multiple sign language streams, like a multi-view?
- Yeah, it is possible. So I know in my presentation I only showed the main player and one sign language player, but you could have multiple. I'm not sure what the use case-
- I mean, I guess there could be use cases where you may want to show multiple. So we do recommend no more than four to five players. There's no hard limit; it's more a constraint from the hardware side and what your device can handle. But our recommendation is about four to five players maximum. So you can have multiple sign language streams with a single main stream going if you wanted to. So if you operate an accessibility-enabled sports app, then this is for you. Great. And one more for you.
A few blocks away. Yeah. One more for Julia. What is the synchronization accuracy of the AVPlayer playback coordinator? Can it handle frame-accurate sync requirements, like for an LCEVC enhancement layer? Yeah. So playback coordination is frame-accurate. During playback it will be frame-accurate, and with seeks and pauses too, all of that coordination will be frame-accurate.
So you'll have that. Let me just, because of something you said right at the end there, add a caveat: when we say frame-accurate, we mean that the synchronization will be within a frame duration. So for 60p, that's about 16 milliseconds. We do not guarantee that they will be frame-locked, frame by frame. So if that's a critical requirement, we may have to have another conversation. You will certainly have accuracy to a 16-millisecond-ish period, but if you want to put one thing on top of the other with alpha, or something like that, and you want absolutely coordinated frame transitions, I think that's still maybe science fiction. - Makes sense. All right, let's switch gears a little bit and talk about more novel technologies, and Apple Vision Pro in particular. The question is for Roger. If we are trying to livestream stereoscopic content, we would currently send left-eye/right-eye packed frames, whether flat or equirectangular. Is there an advantage, especially thinking of Apple Vision Pro, in transcoding into MV-HEVC in the pipeline, or in treating left and right eye as coordinated streams? Generally speaking, for immersive video content, we really do recommend MV-HEVC. Not only are there the benefits you would expect, the compression savings of being able to use material from both left and right eye for the macroblock prediction and stuff like that, and the delivery benefits of making sure that both left and right eye arrive at the same time, but there are also benefits in the rendering path, where Vision Pro can do kind of magical, special stuff when it knows they're actually part of a single stream. So for Vision Pro particularly, and I imagine as we do more immersive video in the future, we strongly recommend MV-HEVC, as opposed to trying to roll your own by delivering left eye and right eye and doing it yourself. Great. Thank you. So one more question about preloading for you, Roger. For key preloading, is there a consideration for extending date range rather than the preload hint? The benefit would be backward compatibility: an older OS can still read the date range and call process content key request with the ID, plus there's existing precedent for extension attributes and no multiple-tag restrictions. So I may have missed part of that, but I think they're asking: why not do key preloading via a date range instead of the preload hint tag? The biggest argument against that direction is structural. There are a number of reasons, but the big thing about date ranges is that the set of date ranges in a particular rendition, a variant that you're playing, a language, whatever, must be the same across every rendition and every variant in the playlist. So if you have HD, SD, SDR, English audio, Spanish audio, subtitles, all of those must carry the same set of date range tags. And the main reason for that is so that as a client switches around, it's presented with a consistent set of things.
Now, then we come to keys. Keys are really per playlist; in the most general form, different playlists have different keys. And so the biggest argument against structuring the key preload as a date range tag is that a key is specific to a playlist, while a date range isn't. That's another reason we went with the preload hint, as opposed to what we did for interstitials, because the argument flips around for interstitials: for interstitials, you actually want that global view across all assets in the MVP; you want them all to see the same set of interstitials. That's important. For keys, the other thing that led us in this direction is that, in some people's mental model, date ranges move in lockstep with the segments, like the position of a date range next to a particular segment means something. But that's actually not true if you look at the spec: the date range tag is position independent in the playlist. They can all be at the top, at the bottom, in the middle; they don't need to be next to a particular segment. As opposed to the map tag or the key tag, where the position of the tag relative to a segment is significant. The preload hint tag is in that second class: it is position dependent, and its position relative to what's coming next from a segment point of view is significant. And because that's the way key tags operate as well, it seemed like a better fit. So we looked at the trade-offs, thought about doing it one way and the other, and at the end of the day the balance of the trade-offs made us lean toward the preload hint as the better approach for keys. Makes sense. Thank you for clarifying this. I want to ask a question about ABR, to Rob probably; maybe, Roger, you can also chime in. If it's determined that ABR switching is happening too frequently, what countermeasures would you recommend? Well, one, make sure that your tiers are not too close to each other; you might want to work on the bitrate ladder itself. That's probably the first thing. It might also be a matter, with live, of your target latency or buffer. It all depends on the client in that case. But yeah, maybe you don't want to have tiers so close that there's bouncing between two of them. I'd say that's the main thing you can do as a content vendor: make sure there's a decent amount of spacing. And analytics: to David's talk about metrics, you can learn a lot about the ladder and the bitrates you've chosen from metrics. That's a good feedback loop to have, to know which variants your clients are sitting on, how often they're changing variants, et cetera. At the end of the day, ABR is owned by the client.
And in the case of HLS.js, you have some ability to customize the ABR, if you choose to go down that rabbit hole. For AVPlayer, of course, and for some other players, it is purely in the domain of the framework, in our case iOS and tvOS. So at the end of the day, if you think you've got a good set of tiers with decent spacing, and you're still seeing situations where it's switching more frequently than you think it ought to, file bugs. We will look at them, and we agree,
right, we don't like seeing that either. We think there should be stability, and it should perform well. So if there are scenarios like that, we'll look into them. And maybe we'll end up saying: oh yeah, the reason is that on your four-megabit tier you're constantly stalling out when we ask for segments, so what's happening is we try to switch to four, we stall, we go back to three, and we go back and forth and back and forth; and the answer is, fix your four-megabit content delivery and then things will be better. But yeah. Yeah, it makes sense. These are great recommendations. So, time flies fast. We have three more questions left, so I'll go to David with a question on AVMetrics next. Can I use AVMetrics to customize my streaming logic? For example, could I condition error-handling logic on seeing an error event? Yeah, that's actually a use case we recommend not using AVMetrics for. While metric events come in real time, the timing guarantees aren't intended for app logic, for changing what your app does based on them; it's intended more for after-the-fact analytics. That is much more what AVMetrics is designed for and what I'd recommend you use it for. You could also use it for things like what Roger was just mentioning: if you find that your variants are switching a lot, you can use it to figure out whether a particular tier is the issue; you could look at the bitrates of the different variants and figure out what's going on. But I would not use it for app logic at the time you receive a metric event.
Got it. Thank you. And the next question is about WebVTT subtitles with fMP4, so this one's for you, Roger. WebVTT subtitles with fMP4 audio and video is a popular combo. How do you translate the fMP4 media time to the WebVTT MPEG-TS anchor time? I have experience that it works to convert the fMP4 timestamp to a 90 kHz value, modulo 2 to the power of 33, the MPEG-TS wraparound. Is that guaranteed to work in all Apple HLS implementations, including HLS.js? Yeah. So, for those who aren't deep in the weeds on the VTT format: in HLS, VTT subtitles are just little WebVTT files, and the one addition the HLS spec makes is a little bit of metadata at the top of the file, which is a mapping tuple. VTT is timed in what they call cue times, which look like hours, minutes, seconds, and a decimal number of seconds, and the metadata HLS specifies is a mapping from a cue time, so hours, minutes, seconds, to a media time. And, showing our roots a little bit in MPEG-2 TS, we specified that back in 2011, or whenever it was, as an MPEG-2 time, essentially: a PTS using a 90 kHz scale, which is what MPEG-2 PES uses. We kept that when we added fMP4 segments to the spec, and we just said that, yes, it's still going to be a media time; now it's going to be an fMP4 media time, because now you're using the TFDT and the MP4 timescale to time your timestamps. So it will still be on that TFDT scale, but for compatibility, and also for interoperability in the somewhat edge case, but not terribly outlandish case, where you have some segments that are fMP4, perhaps audio because you've got Dolby or something, and some segments that are TS (you may cause your head to explode, but it can happen), we need interoperability for that story also. And so where we landed, exactly as the questioner said, is that you have that tuple that says: OK, this hours-minutes-seconds corresponds to this media time. Yes, in the case of fMP4 that media time is a TFDT-style media time, but expressed in a 90 kHz clock. And it doesn't need to wrap at 33 bits; the 33-bit timestamp wrap isn't necessary, although if you're moving beyond 33 bits, well, that's only a day or two. It's also not necessary to advance it every segment if you don't want to, as long as the origin doesn't change: as long as the mapping point remains the same, you can use the same value from VTT segment to segment. But it is going to be a 90 kHz PTS expressed on the TFDT timeline. And yes, not only should that work in Apple clients, but in HLS.js as well, because it's the same rules, and hopefully in all the other clients out there too. So yeah, that's the way to do it.
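To pin down the mapping Roger described, here is a small Swift sketch: a WebVTT segment carrying the X-TIMESTAMP-MAP tuple, plus the arithmetic for deriving the 90 kHz anchor from an fMP4 TFDT value. The numbers and cue text are illustrative.

// A WebVTT segment as used in HLS: the X-TIMESTAMP-MAP tuple anchors the cue
// timeline (LOCAL) to the media timeline (MPEGTS, a 90 kHz value).
let vttSegment = """
WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:8100000

00:00:01.000 --> 00:00:03.500
Hello from the live stream.
"""

// Deriving the 90 kHz anchor from an fMP4 segment's TFDT baseMediaDecodeTime.
// (Wrapping modulo 2^33 is allowed but, per the discussion, not required.)
func anchor90k(baseMediaDecodeTime: Int64, timescale: Int64) -> Int64 {
    return baseMediaDecodeTime * 90_000 / timescale
}

// Example: a TFDT of 4_320_000 at a 48 kHz audio timescale maps to 8_100_000
// on the 90 kHz clock, matching the MPEGTS value above.
let anchor = anchor90k(baseMediaDecodeTime: 4_320_000, timescale: 48_000)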
Great. Thank you so much. And the last question is on your proposal, Roger. I'm not sure if you'll understand the question, but basically the ask is: will content key rotation and preload work with FairPlay and static keys that are generated per event? I'm not sure what is meant by static keys that are generated per event. Yeah, right. So let's break down the question a little bit.
So for sure, key preloading will work with FairPlay, and a lot of you are probably using FairPlay in your streams. As you know, the flow there is that you get a request from AVContentKeySession that says: hey, guess what, the system wants a key. You respond by asking the system to produce a key request for you, and it gives you a bag of bytes. You send that bag of bytes up to your key server, which sends you back a different bag of bytes containing the key response. You give that to AVContentKeySession, it says thank you very much, and playback moves along. Content key preloading will essentially just do that: it will knock on the door of your app, whatever, 40 seconds before the rotation happens, and say: hey, guess what, I'd like this key. Your app makes the key request, and so on; here's the key, and life goes on. So that part should work. Then we get to the question of static, pre-provisioned keys for an event. For the most part, that's an implementation detail between the provider and the key server, in terms of how keys are provisioned in advance and how they're scheduled.
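For anyone newer to that flow, here is a minimal Swift sketch of the FairPlay round trip Roger described, using AVContentKeySession. The application certificate, key identifier, and key-server round trip are placeholders; a preloaded key would arrive through the same delegate callback, just earlier.

import AVFoundation

final class KeyRequestHandler: NSObject, AVContentKeySessionDelegate {
    let session = AVContentKeySession(keySystem: .fairPlayStreaming)

    func start(with asset: AVURLAsset) {
        session.setDelegate(self, queue: DispatchQueue(label: "keys"))
        session.addContentKeyRecipient(asset)   // keys for this asset flow through the session
    }

    // "Hey, the system wants a key": for a preloaded key this simply fires
    // earlier, ahead of the rotation point.
    func contentKeySession(_ session: AVContentKeySession, didProvide keyRequest: AVContentKeyRequest) {
        let appCertificate = Data()                                   // placeholder: your FairPlay application certificate
        let contentID = Data("skd://keys.example.com/key42".utf8)     // placeholder key identifier

        keyRequest.makeStreamingContentKeyRequestData(forApp: appCertificate,
                                                      contentIdentifier: contentID,
                                                      options: nil) { spc, error in
            guard let spc, error == nil else { return }
            // Send the SPC (the "bag of bytes") to your key server and get a CKC back.
            self.loadCKC(for: spc) { ckc in
                let response = AVContentKeyResponse(fairPlayStreamingKeyResponseData: ckc)
                keyRequest.processContentKeyResponse(response)
            }
        }
    }

    // Placeholder for the app's own key-server round trip.
    private func loadCKC(for spc: Data, completion: @escaping (Data) -> Void) {
        completion(Data())
    }
}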
So the only thing the preload scheme should change in that flow is that it might move up the timeline of when we want the key server to be able to handle a request. Normally, if you know you're going to have a four-hour event and you're scheduling content key rotations every 15 minutes, then in principle your key server doesn't need to respond to a request for a given key until that 15-minute mark. With preloading, maybe we pull that back; maybe now it's the 13-minute mark. So you may need to do some CPIX changes, or something like that, to arrange for your key server to handle requests a little earlier. The other thing that makes me think about what they might be talking about: at least on Apple platforms, there's the ability to persist FairPlay keys. This is usually used for offline, for when you download content and want to be able to watch it on the airplane or something.
And so you have the content key. In principle, you could use a similar mechanism to pre-provision keys before an event. In that case, it doesn't seem like you're using preloading, or the preload hint stuff, at all, because you've already arranged, colluded between your app and your server, to download some keys in advance and populate the AVContentKeySession with them. So if that's what you're doing, then you don't need preloading, because you're essentially doing your own preloading. Hopefully that answers the question. Yeah. All right. So that's about all the questions we were able to answer here today.
Thank you very much to the streaming media team for all your answers and for spending the time presenting your talks. And thank you to all who joined us today, online and in person. We truly appreciate this community; your input and feedback on HLS and related technologies is very valuable to us. Next, for those here in Cupertino, please join us in the lobby for some refreshments and conversations with Apple engineers. Thank you for joining, and I look forward to staying connected with you all on the HLS interest mailing list. Thank you.