Media • iOS, OS X • 58:47
AV Foundation automatically selects appropriate audio and subtitle tracks in your media. Learn how media selection works and find out how to author media with accessibility features like subtitles and closed captions. Go beyond the basics to use AV Foundation for customizing subtitle appearance and presentation.
Speakers: Kevin Calhoun, Courtney Kennedy, Adam Sonnanstine
Transcript
[Kevin Calhoun]
Hello. This is session 608, Preparing and Presenting Media for Accessibility. Welcome. We think that you're here because you are a content provider involved in the production of content and you're ready to make your content more useful, more accessible to more users in more situations, or you're an app developer and you also want to make your app more useful to more users in more situations.
We think that's great stuff, and we have technology that's going to help, and we're going to cover it in this session right now. We're going to talk about accessibility options for timed media that take the form of subtitles and captions. We're going to touch upon accessibility options that take other forms as well, such as audio forms.
We know that some of you may be motivated to hear this talk because you believe that legislation may apply to you; in particular, in the U.S., the 21st Century Communications and Video Accessibility Act and the FCC rules that have been established to implement that act may apply to you.
We want to tell you up front that the presenters in this session are not lawyers, and we're not the appropriate people to review this law and tell you how it may apply to you. We encourage you to perform that review on your own. But we believe that we do have technology that can help you meet your responsibility under that law if it applies to you.
So, more about what we will talk about in this session, since we're not lawyers but software developers: we're going to talk about how Apple's platforms, OS X and iOS, support accessibility options for timed media with built-in features, support for user preferences, and automatic application of those user preferences.
We're going to tell you, if you're one of the developers who needs to - we don't think that's all of you, but for those of you who do need to implement a user interface that allows users to select specific media options for playback - we're going to talk about how you can do that, and for content providers, we're going to talk about multiple options that you have to package your media, your audio and video, together with accessibility options. In particular, we're going to talk about HTML5, we're going to talk about HTTP Live Streaming and what you can do there, and we're going to talk about the QuickTime movie file format and its closely allied format, MPEG-4.
We're even going to go as far as to show the API you need to use to add subtitles to existing movie and MPEG-4 files. Finally, for the very few developers who need to mediate the rendering of text during playback, we're going to touch upon the APIs we have available if you need to do that. So, lots of content.
Before we move on let's talk about where in the technology stack we're focusing our attention in this session. When we talk about APIs here we're going to be talking about application programming interfaces that are available in the AV Foundation framework, Apple's foundational framework for the control of audiovisual media.
It sits in iOS below the view system, above some of the core frameworks. One of the frameworks that it supports, I want to call out, is the Media Player framework. That's an important framework to mention in this session as well because it provides standard user interface and standard behavior for the control of multimedia playback on iOS. AV Foundation sits in a similar place on OS X in the technology stack. It's a cross-platform framework. The same set of APIs is available on both platforms.
Again it sits underneath the view system and supports a framework that's new in OS X Mavericks called AVKit, which is the analog for OS X to the Media Player framework on iOS. AVKit provides standard user interface and standard behavior for the control of multimedia playback, and if you adopt it all of that standard behavior can be part of your app as well.
So we need to define some terms before we get into the meat, so let's do that real quickly. When I talk about multimedia, timed media, audiovisual media in this session, what am I referring to? Well, I'm referring to the big revolution that occurred now over 20 years ago when digital devices first began to play video and audio and even text in synchronization with each other. This is not news to any of us. We take this feature for granted in our digital devices nowadays.
But I want to point out that there are multiple pieces of media being synchronized together, and therefore when a content provider creates a piece of multimedia, there must implicitly be some decision, some partitioning of the information that's going to be conveyed into the different media types. What information is going to be presented visually and be available as part of the video? What information is going to be conveyed aurally or audibly and be part of the soundtrack? And what information is going to be provided in text form? Most content providers understand that even if they have a default partitioning of this information in mind, that default partitioning is not going to be suitable for all of the users that they want to reach. For example, there might be some users who don't understand the language that's being spoken in the soundtrack who might want subtitles for translation.
Those users might be able to hear the soundtrack and receive all of the other information that's being conveyed, either via music or via sound effects. They can even hear the speech that's being spoken and can identify the speakers who are speaking; they just don't understand the words that are being spoken. They're going to want timed text in translation so that they can understand the speech.
Well, let's go a little bit further and imagine a user who doesn't have access to that audio content at all, either because of deafness, because of hardness of hearing, or because of the situation that the user happens to be in. For example, imagine a user who's sitting in the back of an airplane, near a noisy engine, and has brought an iPad on board and wants to watch a video but doesn't have access to the audio without cranking the volume up really high in the ear buds and doesn't want to do that. How can that information that the content provider originally decided to partition into the audio content, how could that information be made available to users like that? Well typically it's done via an alternative type of media, usually via timed text.
So multimedia can have the ability to carry not just a single text option but multiple timed text options to provide different features for different users. There can be subtitles for translation, for example. There can be subtitles for accessibility. A term that's commonly used for subtitles for accessibility in the Blu-ray domain and others is subtitles for the deaf or hard of hearing, abbreviated as SDH.
So here I've got a picture of a piece of multimedia that has audio, video and multiple text options for different users. It's important to remember the distinction between timed text for accessibility and timed text for translation. The difference is the additional information that the timed text for accessibility carries, exactly those things that we just mentioned.
Identification of speakers is important, description of stuff going on in the soundtrack other than the speech, such as sound effects. Do all the characters in the frame suddenly turn to look because of the sound of a breaking glass, or the thud of a falling body? How would you know if the SDH subtitles don't tell you? It's important to describe those things.
So, the same is true for other types of information in the soundtrack as well. It's not good enough just to supply multiple text options; it's also necessary to make sure that they are appropriately labeled so the user can choose the one that's most appropriate for him or her: the language that the text uses, the features that the text carries. Is it just a transcription of the dialog or does it also include this other information as well? Other users might require other accessibility options. For example, imagine a user who lacks access to the visual portion of the multimedia content.
Such a user will want to know what's going on, what's depicted, in the visual portion of the content but requires that information to be conveyed via an alternative means. What can be done in that case? Well what's often done in that case is to include the information in an alternative audio option that includes a narration, a description of the setting and the action that's important for the user to follow what's going on.
So again, the idea is we can supply multiple audio options that are suitable for different users. We can supply dubbed audio for those users who simply want to hear the speech spoken in translation: a re-recording of the dialog mixed in with the music and the sound effects. But we also might want to provide an audio option that includes what I just mentioned, a narration, descriptive audio, often called DVS or descriptive video services in some parts of the world. This audio will include a narration that makes the visual content accessible.
Again the audio options that are made available would have to be appropriately labeled: what language do they use, what features do they have, so that the user can either choose one manually that best suits his or her needs, or even more conveniently so that the software itself can choose an appropriate option automatically according to stored user preferences.
All right, so that's basically the terminology we're going to use to talk about accessibility options. Let's get into how iOS and OS X are incorporating support for accessibility options in iOS 7 and OS X Mavericks. If you look at System Preferences in OS X Mavericks you'll note that in the Accessibility pane there is now a Captions area which exposes two different sets of preferences that the user can configure.
The first set is about the appearance of timed text when it's played during multimedia playback. Remember the timed text is essential for conveying vital information to the user, and so it's very important for the user to be able to configure its display so that it's actually legible. If the user needs a larger font it should be possible to specify that.
If the user needs a greater contrast between the color used for the characters and the color used for the backdrop that offsets the text area from whatever is behind it, then it should be possible to configure that as well. Those preferences are here. In fact, if you drill down and look at this in the seed release, you'll note that there's sufficient control to cover all of the characteristics of text that are supposed to be customizable under the FCC rules that I mentioned in connection with U.S. law.
The second preference that's available here pertains to how multimedia content should automatically be configured so that it presents options that are appropriate for the user without the user having to intervene. The option here at the bottom of this panel is "Prefer Closed Captions and SDH". Those are the two common terms for timed text for accessibility that users will recognize from other domains, television, Blu-ray, and DVD.
What this preference allows the user to do is to say, whenever it's available I want timed text for accessibility automatically to be displayed. Similarly the same preferences are now available for the user's configuration on iOS 7. If you go into Settings and drill down through General and Accessibility into Subtitles and Captioning, the very same preferences are there.
So now that they're there, if you're an application developer of course your natural question is, how can my app automatically honor those preferences? And the answer is, well, you don't have to do much to make sure that happens and many of you need not do anything at all.
Let's talk about the two different sets of preferences separately. First, the text styling preferences when timed text is being displayed together with multimedia. Apps don't need to do anything in order for that timed text to honor the user's preferences, except to allow AV Foundation to perform the rendering. So in other words, don't do extra work to do the rendering yourself and you're guaranteed that the user preferences for text styling will be honored.
End of story. There may be other styling information associated with the text, either in the media itself or perhaps applied by an API that we have available for you, but you should note that the most important thing is for the text to be legible by the actual user, and so the user's preferences trump all of the styling information and will be honored when we render it for you.
OK, the other set of preferences I mentioned was the preference regarding how media should be configured by default in order to honor the user's preference for a particular type of accessibility option. The "Prefer Closed Captions and SDH" checkbox is what I'm talking about here. To honor that preference, again, many apps don't need to do anything.
If you're using high-level frameworks to control playback of your media, such as MPMoviePlayerController on iOS or AVPlayerView from AVKit on OS X Mavericks, items will automatically be configured to honor that preference and be displayed by default with timed text for accessibility whenever it's present. However, if you are supplying your own user interface or if you are creating your own instances of AVPlayer, you may have a little bit of additional work to do in order to make sure that happens, and we'll talk about that.
In order for you to honor that user preference for that particular media option, timed text for accessibility, "Prefer Closed Captions and SDH", in the seed that you'll take home with you of iOS 7 and also in the seed of OS X Mavericks, you need to do one thing. Call the AVPlayer method -setAppliesMediaSelectionCriteriaAutomatically: and pass YES. And what that says is you want to opt in for this automatic selection behavior with respect to criteria derived from user preferences.
Even better news is that after this seed, when we get the next seeds to you and in the shipping versions, you don't even have to do that, and we will opt your application in automatically to this behavior if you merely link against the new versions of the SDKs, the iOS 7 SDK or the OS X 10.9 SDK, and your AVPlayer instances will automatically be opted in to this automatic behavior.
Therefore when we get to that point, if you have an app that doesn't want to offer the typical playback experience, for example, if your app is an authoring app but wants to allow its users to audit the content that's being authored in any number of states including those that don't correspond to current user preference settings, you'll actually have to opt out. You'll have to say -setAppliesMediaSelectionCriteriaAutomatically:NO.
Well, to summarize, it doesn't do any harm for you to call this method and say what you want: YES, I want that automatic behavior, or NO, I don't. But in the subsequent releases and when we get to the shipping release, the GM release, if you want the automatic behavior, there's no need to call anything at all.
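A minimal sketch of the opt-in (and opt-out) call, assuming a hypothetical content URL:

```objc
#import <AVFoundation/AVFoundation.h>

// Hypothetical content URL; any playable asset will do.
NSURL *contentURL = [NSURL URLWithString:@"https://example.com/movie.m3u8"];
AVPlayer *player = [AVPlayer playerWithURL:contentURL];

// Opt in: media options (for example, SDH subtitles) are chosen automatically
// according to the user's accessibility preferences.
[player setAppliesMediaSelectionCriteriaAutomatically:YES];

// An authoring app that must audit arbitrary configurations could opt out instead:
// [player setAppliesMediaSelectionCriteriaAutomatically:NO];
```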
Next, for app developers: if you need to implement your own user interface to permit the selection of specific options that are present. This is not the automatic configuration of the item; once something is prepared for playback, you want to make available all of the various options that are present, all the audio options and so forth, allow the user to pick one, and make that selection effective. How do you do that? Well, let's go back to a picture of a piece of multimedia here with multiple audio options and multiple captioning and subtitle options; some of these options are related to accessibility and some are not.
How do you know that they're there? In AV Foundation, each of the groups of options that are present will be represented by an instance of the class AVMediaSelectionGroup, and each option in the group that's present, intended to be treated as a mutually exclusive option, will be represented by an instance of AVMediaSelectionOption.
And each of those AVMediaSelectionOptions will have properties that describe what the option is all about: what's its media type, what language does it use, what features does it have. Is it for accessibility or just for translation, for example in the case of subtitles? How do you get access to those things? Well, it's pretty simple.
Like most operations in AV Foundation you start with an instance of AVAsset, probably an AVAsset initialized with a URL to a file in the file system or something out on the network, and then to get this information you start by loading the value of the property availableMediaCharacteristicsWithMediaSelectionOptions. You want to find out what groups of options are present.
The value of that property, once it's loaded, is going to be an array of strings, each of which represents a media characteristic that has options associated with it. For example, if AVMediaCharacteristicVisual is present, you know that there are video options. If AVMediaCharacteristicAudible is present you know that there are audio options, and if AVMediaCharacteristicLegible is present, you know that there are captioning or subtitle options or both.
Suppose you want to create a menu that allows the user to choose any specific option to enable it for playback. How do you do that? Well, you want to get the specific AVMediaSelectionGroup for the characteristic you're building the menu for. Suppose it's the Legible characteristic. I'll say, AVAsset, give me the media selection group for the media characteristic AVMediaCharacteristicLegible, and the return value of that method is going to be an instance of AVMediaSelectionGroup which will have one or more options present in it, each of which describes an option that's available. You can use those to populate a menu. In fact, you can use the displayName of each media selection option as the name of each menu item.
If you're playing this asset - and of course in order to do so you have to create an instance of AVPlayerItem and an instance of AVPlayer to play it - and the user selects a specific media option, suppose that one on the lower right, how do you make that selection effective? You use the AVPlayerItem method -selectMediaOption:inMediaSelectionGroup:, and the effect of that will be to enable the option that's selected and to disable all of the other options in the group.
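A rough sketch of that flow, assuming a hypothetical movieURL and an AVPlayerItem named playerItem that plays this asset; error handling is omitted:

```objc
#import <AVFoundation/AVFoundation.h>

AVURLAsset *asset = [AVURLAsset URLAssetWithURL:movieURL options:nil]; // movieURL is hypothetical
NSString *key = @"availableMediaCharacteristicsWithMediaSelectionOptions";
[asset loadValuesAsynchronouslyForKeys:@[key] completionHandler:^{
    if ([asset statusOfValueForKey:key error:NULL] != AVKeyValueStatusLoaded) {
        return; // loading failed or was cancelled
    }
    if ([asset.availableMediaCharacteristicsWithMediaSelectionOptions
            containsObject:AVMediaCharacteristicLegible]) {
        AVMediaSelectionGroup *legibleGroup =
            [asset mediaSelectionGroupForMediaCharacteristic:AVMediaCharacteristicLegible];

        // Populate a menu using the display name of each option.
        for (AVMediaSelectionOption *option in legibleGroup.options) {
            NSLog(@"Legible option: %@", option.displayName);
        }

        // When the user picks an option, make it effective on the player item;
        // the other options in the group are disabled automatically.
        AVMediaSelectionOption *chosen = legibleGroup.options.firstObject;
        [playerItem selectMediaOption:chosen inMediaSelectionGroup:legibleGroup];
    }
}];
```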
So that's a high-level description of the API that you would use in order to implement all this. Even better, we're providing a code sample that you can read that goes through exactly that process. And the code sample actually does some interesting things as well, so we're going to demonstrate it for you right now. Courtney Kennedy, one of the Engineering Managers in Media Systems, is going to do that for us. Courtney?
[Courtney Kennedy]
Thanks Kevin. [Applause] So I have an app and a special movie here that I'm going to use to demonstrate how you can use the Media Selection APIs that Kevin has been describing both to inspect the audio and legible options in a particular movie and to select between them.
So this app has four different players in it, one in each quadrant, and I have the same movie loaded up in each of them. In addition this app has a menu on each player that shows both the audio options available and the subtitle options available. So you can see this particular movie has both an English audio and a Hindi audio.
To begin with let's select the English, and then we can also select subtitles for each of the four players. So you see we have quite a list of subtitles to choose from here and, just as with audio, they're all listed by language and some of them include the SDH label on them. As Kevin said, that label indicates that there's additional information in those subtitles beyond just a simple translation.
So in this particular movie, for some languages, we have both regular translation subtitles and we have SDH, and for some languages we have only one or the other. So let's go through, for each player, and just select some different subtitles to play. And now let's play the movie and see what we get for audio and subtitles.
[ Movie Playing ]
Okay, so we can see that our media selections made through AV Foundation were honored during playback. And let's listen to a bit of the Hindi and look at a few of the other subtitles available in this movie.
[ Background Sounds ]
And play again.
[ Movie Playing ]
So if this is functionality that's interesting to you and something that you want to add to your media player, it's possible to do so using AV Foundation. And if you want to see how we did it, as Kevin mentioned, both this app, which is called AV Media Selection Demo, and this movie, which my Apple engineering brethren helped me make (and it's great test content), are available as part of the sample code for this year's show. [Applause]
[Kevin Calhoun]
Thanks, Courtney. OK, so we've talked a lot about things that application developers need to know in order to expose these options in their applications, but you must be getting nervous, application developers, because we haven't said anything to the content providers and producers sitting around you about how they can provide these options that your apps can actually make available. Don't worry, we've got material for those people as well. In fact we've got multiple choices for those content producers to use to package accessibility options with their media.
So let's go back to a picture of a piece of multimedia with multiple video options, multiple audio options, multiple timed text options in the form of captions or subtitles, some of which have accessibility features associated with them and some of which do not. How do you make media like this? Several choices.
You can choose HTML5 and its support for text tracks. You can choose HTTP Live Streaming and its support for what it calls multiple alternative renditions of media. Or you can use the QuickTime movie file format or its closely allied format MPEG-4; they have the same features, and for the purposes of this talk we're going to treat them in tandem here.
Let's go through what the features are of each one related to accessibility options. First, HTML5. The current revision of the HTML5 spec allows you to include one or more text tracks in your HTML markup. It does not currently allow you to have multiple audio options declared in your markup or multiple video options, but it does allow it for text.
What does it look like? Well, first of all I should mention that in HTML5 parlance text tracks like this are known as "out of band" text tracks, and why are they called "out of band" text tracks? Well, simply because the resource that carries the timed text, the subtitles, is a different resource from the one that carries the audio and the video.
In the example markup in the lower part of this slide, you'll see that the source of the video element is an M4V file but the source of the captions is a WebVTT file, so that is why these are called "out of band" text tracks. That has the virtue of allowing you to add text options without modifying your audiovisual resources. You just declare more text tracks in your markup as you have the text available.
What does the rest of this markup look like? Well first of all, this example declares timed text for accessibility. It carries the additional information - identification of speakers, description of music and sound - that you would expect from timed text for accessibility. The way that you declare that in HTML5 is to give the text track a kind, called "captions". "Captions" is the kind of text track for accessibility. If it's simply a text track for translation purposes you would give it a kind of "subtitles", so that's how you indicate that.
You want to indicate the source language because it's really useful for the user to pick a language that he or she can actually read. I've mentioned the default attribute here on my slide but I'm not using it in my markup. Why? Well that's because if you recall accessibility options are made available as an alternative means to convey information that otherwise is conveyed in the default partitioning of that information into audio, video and text by the content provider.
Because accessibility options are alternatives, they are not typically designated as something to be displayed by default. So I'm not using the default attribute here. I am providing a user-readable label so that if the user agent or the JavaScript provides some way for the end user to select options there's a useful, identifying label available to use in the user's language.
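A minimal sketch of markup along these lines, with hypothetical file names; the default attribute is deliberately left off, as just discussed:

```html
<video src="movie.m4v" controls>
  <!-- kind="captions" marks this as timed text for accessibility;
       use kind="subtitles" for plain translation subtitles. -->
  <track kind="captions" src="captions-en.vtt" srclang="en"
         label="English (SDH)">
</video>
```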
Okay, so what are the highlights of HTML5? Well as I mentioned there's no need to modify the main media resource that's played by the media element, in my case a video element in the example markup I showed, and because you're using an HTML5-compliant user agent you have all of the other facilities that are familiar from HTML. CSS styling will apply to the elements displaying the timed text just as it applies to other elements as well, so lots of power there. You have JavaScript available for the control of timed text.
There's a rich API defined for JavaScript on these text tracks. You have access to a cue list for each track: the text that's going to be displayed and the timing of that text are available to you. And of course because you can display the text in any element in the document you don't have to place it right on top of the video; you can put it somewhere else.
But an important consideration here is that if you choose HTML5 to declare the availability of your timed text, implicitly you require an HTML5-compliant user agent to be available to present that timed text during playback. And you should be aware that HTML5 user agents possibly aren't present in all of the environments in which you may wish your timed text to be made available. An important example is Apple TV, when media resources are played via AirPlay video.
In the implementation of AirPlay video the Apple TV has access only to the main media resource. It does not have access to the surrounding HTML5 markup or to the timed text tracks that are declared there. So if you play a media resource via AirPlay video to an Apple TV, and you're only making your timed text available via HTML5, those text tracks are not going to be available on the Apple TV.
OK, another choice that we want to review, HTTP Live Streaming. Very flexible. HTTP Live Streaming supports multiple alternative renditions for each of the media types it supports. You can have multiple video renditions, audio, subtitles, and, now for the first time in iOS 7 and OS X Mavericks, you can have multiple closed-caption renditions as well. Some of them for accessibility purposes, some of them for translation and so forth.
What does it look like in your master playlist? Well like any other rendition that you make available for HTTP Live Streaming, you simply have to declare it in your master playlist. You do that by declaring that you have something of a particular media type. If it's subtitles, you say its TYPE is SUBTITLES and usually, though not always depending on the media type, you have to declare the media playlist that has references to the media segments that actually carry the media data. Here in this example I'm saying that the media segments that carry the subtitles are in the URI, "webvtt/session.m3u8".
This example, again, is an example of subtitles for accessibility, and so we want to declare that these subtitles have the accessibility characteristics. Those are "transcribes spoken dialog for accessibility", in the parlance of this dot language, "public.accessibility.transcribes-spoken-dialog", and also you want to declare that these subtitles "describe music and sound", or "public.accessibility.describes-music-and-sound".
And that's how, in your master playlist, you say, "I have SDH subtitles, I have timed text for accessibility purposes". Let's just run through some of the other attributes for completeness. Declare the LANGUAGE of the subtitles. It's written text. Users are going to want to know. Don't forget.
Accessibility options, again, are not typically marked as to be displayed by default, since they're normally constructed as an alternative means to convey information, but we definitely do want to make them available for automatic selection in case an accessibility option matches the user's preferences best. We want the software to be able to configure the display of the item to honor those preferences automatically, so make your accessibility options automatically selectable, please. Also for manual selection, for a user interface that allows the user to do that, it's handy to supply a user-readable NAME.
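A sketch of such a declaration in a master playlist; the GROUP-ID and NAME values here are hypothetical, and the GROUP-ID has to match the SUBTITLES attribute of your variant streams:

```
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English (SDH)",LANGUAGE="en",DEFAULT=NO,AUTOSELECT=YES,CHARACTERISTICS="public.accessibility.transcribes-spoken-dialog,public.accessibility.describes-music-and-sound",URI="webvtt/session.m3u8"
```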
Similarly for closed-captions. First of all you want to mention that you have closed-captioned media and you do so by declaring the TYPE as CLOSED-CAPTIONS. But in the case of CLOSED-CAPTIONS, that media is not actually carried in independent media segments. Closed-captions in HTTP Live Streaming are carried in the video substream, in fact according to the ATSC specification.
So how do we declare their presence and where they are? What you say is not what the URI or the media playlist is - you don't need one because they're going to be carried in the video and the media playlist for the video is going to tell you where that is. But you do want to declare what closed-caption channel you're making this declaration for, closed-caption channel 1, 2, 3 or 4. In this case I've got English closed-captions in CC channel 1 so that's what I'm declaring using the INSTREAM-ID attribute.
I want to mention also that, because closed-captions were devised specifically for the purpose of accessibility, our software assumes that all closed-caption media has the characteristics "transcribes spoken dialog for accessibility" and "describes music and sound for accessibility". So this declaration is here for illustrative purposes only. It's closed-captions, it's intended for accessibility, so we assume they have those characteristics.
Declare the LANGUAGE, don't forget. Again, accessibility options are not typically the DEFAULT, but we do want them to be automatically selectable, and a name is handy. Finally, we haven't seen an example of this before: what about audio? I mentioned earlier that accessibility options that take the form of audio media are greatly appreciated when present.
Here's an example of one such accessibility option in audio. This particular one, this audio rendition that I'm declaring with a TYPE of AUDIO, and of course I'm also declaring its media playlist, has the characteristic "public.accessibility.describes-video". That means that that narration, that description of the setting and action that's depicted visually, is present in this audio rendition. It's mixed in together with the speech and the music and the sound effects. Again I want to declare the LANGUAGE, accessibility option not typically the DEFAULT, but do make it selectable automatically, and a name can be handy.
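Sketches of the corresponding CLOSED-CAPTIONS and AUDIO declarations; the GROUP-IDs, NAMEs, and the audio URI are hypothetical:

```
#EXT-X-MEDIA:TYPE=CLOSED-CAPTIONS,GROUP-ID="cc",NAME="English CC",LANGUAGE="en",DEFAULT=NO,AUTOSELECT=YES,INSTREAM-ID="CC1"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="English (Describes Video)",LANGUAGE="en",DEFAULT=NO,AUTOSELECT=YES,CHARACTERISTICS="public.accessibility.describes-video",URI="audio-description/prog_index.m3u8"
```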
So there you go for audio, closed-captions and subtitles. By the way, the additions to the HTTP Live Streaming specification that you've witnessed here, the ability to declare closed-captions, those are in this draft of the specification that's available through the developer program that we intend to finalize by fall.
So highlights of HTTP Live Streaming. It has a similar virtue to that of HTML5. You can add accessibility options without modifying existing media, without modifying existing portions of your master playlist. You simply declare the additional renditions that you have. You provide the media playlist where appropriate, and you're done.
Another point about HTTP Live Streaming: though we do have a feature that allows you to specify styling of subtitles in HTTP Live Streaming that take the form of WebVTT documents, remember that the styling you apply with the API I'm mentioning here, textStyleRules, will be overridden if the user has set preferences for the appearance of text.
All right, so let's turn our attention to the third of these three options. From HTML5 to HTTP Live Streaming, now to the QuickTime Movie file format and MPEG-4 file format. The good news about these file formats is that they have had the ability since their inception to represent all of the information that's necessary to include multiple alternative groups and to identify the purpose of any member of the group. You can have multiple video tracks that are intended to be chosen mutually exclusively to each other. You can have multiple audio tracks as well. Multiple captioning and subtitle tracks. It's all built into the file format and has been forever.
Let's talk about where that stuff shows up in the binary file format, and I'm going to defer the discussion of how you can create movie files with these features to a colleague of mine who's going to come up and describe the API that you can use for that purpose.
But let's just quickly review what's in the movie file format that you need to specify the presence of accessibility options. First of all, every track in a movie file has a language setting. It's right there in the media header atom of the media that's associated with that track. That accommodates an ISO-639 code.
If you need a larger declaration of the language, if you need to declare the writing system that's used or the regional variant of the language that's used in addition to the language code, there's also accommodation for storage of a BCP 47 language tag. Every track in these file formats has an alternate track group setting right there in the track header atom. I stole these graphics, by the way, from the online version of the QuickTime File Format documentation.
It was easy to make these slides, in other words. All this stuff is already out there. But let's talk about what the alternate track group setting means. By default for a track its value is 0, meaning that the track is not a member of a group; it will be enabled and disabled by a well-behaved implementation independently of all of the other tracks.
But, if the alternate track group setting is something other than 0, that means it's a member of a group with all of the other tracks that have the same value for their alternate track group setting. So for example I can have multiple audio tracks all of which have an alternate track group setting of 1, which indicates in a well-behaved implementation, those audio tracks will be selectable mutually exclusively. When one is selected, the others will automatically be disabled.
There's also an accommodation for track references, if different tracks have to be associated with each other in some way. This is particularly useful, for example, if you are carrying forced subtitles in your movie file and you want them to be associated with a particular audio track. I'm not going to define what I mean by "forced subtitles" here but I'm aware that some of you may be interested. If you need more details about that or other cases in which track references are handy, come to the AV Foundation lab tomorrow and we'll go through that with you.
Finally, in order to describe the specific features of a track (are these subtitles for accessibility or just subtitles for translation, for example), it's possible to attach to the track the information that makes that distinction. Every track has what's known as a user data container, a place to store additional descriptive information about the track, and we've defined a new user data type called the "tagged media characteristic" that carries exactly the information that we need in order to identify accessibility features when present. And the values of these tagged media characteristics are exactly the same as the media characteristics that we declare in our HTTP Live Streaming master playlist when the features are present: "public.accessibility.transcribes-spoken-dialog", "public.accessibility.describes-music-and-sound".
Adam has more details in a moment of how you would attach those to a track, but just note it's possible for you to tag a track with the features that the user may be interested in so the user can recognize the value of that particular choice. So in summary, there's nothing new here at all. The QuickTime Movie file format and the MPEG-4 file format accommodate multiple options and accommodate accessibility options natively, built into the file format. That's a very convenient way to package accessibility options if you want to be able to transmit a lot of options together in a single file.
However, you should be aware that the implementations of support for [QuickTime] Movie files and MPEG-4 files on iOS 7 and OS X Mavericks support only those files that are self-contained. In other words, they include all of the media they require; even if there are multiple options available, the media for each will be stored in the same file.
So, you have additional work that you need to do if you want to add for example an accessibility option with timed text for accessibility to an existing file. The good news is that we have API in AV Foundation available that makes that possible, and we're going to describe exactly what it is in just a moment.
One last note about subtitles in QuickTime Movie files and in ISO files such as MPEG-4 files - we're aware of, in fact we're actively participating in, work within MPEG, the Motion Picture Experts Group, to define the standard carriage of WebVTT for timed text in ISO files such as MPEG-4.
And when that specification is final and our implementation is ready, we expect to announce to you that we recommend the use of WebVTT for subtitles for translation and subtitles for accessibility in MPEG-4 and QuickTime Movie files. We'll have a really great story when we get there, because the same format will be supported by all three of the packaging options that we've talked about today: HTML5, HTTP Live Streaming and, when we get there, MPEG-4 and [QuickTime] Movie files as well. However, in the interim, until that specification is final, if you want to add subtitles to [QuickTime] Movie files or M4V files, we continue to recommend the media type AVMediaTypeSubtitle (the four-character code is 'sbtl') and the particular format represented by the media subtype 'tx3g'.
OK, one last note. Suppose you really like the features of HTML5 for controlling timed text in JavaScript and applying CSS, but you also really like the ability of, say, HTTP Live Streaming to carry those subtitles to the Apple TV when played via AirPlay video. And you're asking, can I have the best of both worlds? When my content is played in the web browser, I want to apply the features of the web browser to it, but when it's played in another environment I want it to be just as rich.
And the answer is: yes, you can have the best of both worlds. HTML5 defines not only what it calls "out of band" text tracks, for which you declare tracks in your HTML markup, it also supports what are known as "in band" text tracks for cases in which text is carried within the main resource itself, for example, in an HTTP Live Stream that has a subtitle rendition for accessibility or in a QuickTime Movie file that has a subtitle track for accessibility.
We've worked with the WebKit team to ensure that the versions of WebKit that will be made available with iOS 7 and OS X Mavericks support "in band" text tracks when these media types are played via a video element. Therefore when you play a movie file with subtitles or an HTTP Live Stream with closed-captions, in those versions of WebKit you have the full ability to use JavaScript and CSS to apply to that text just as you would if the text tracks were declared as "out of band". So that's a good story for integration.
Finally to review the choices that we have talked about here regarding packaging of media with accessibility options, here are some of the highlight decision points, not all of them that may pertain to you. All three of the things that I mentioned today, HTML5, HTTP Live Streaming and Movie and MPEG-4, support selectable captions and subtitles.
HTTP Live Streaming and QuickTime Movie and MPEG-4 support selectable audio options, which isn't possible in the current version of the HTML5 spec. HTML5 and HTTP Live Streaming support references to external media so that you can add timed text options without modifying your main media resource; you can't do that with Movie and MPEG-4.
And AirPlay video supports the timed text options that are carried within HTTP Live Streaming and Movie and MPEG-4. So, just a summary to give you an idea of how you might choose a particular format. Well I promised there was more information available about how to add subtitles to Movie and MPEG-4 files and here comes Adam Sonnanstine now, a colleague of mine from the Media Systems Group, in fact from the AV Foundation team, who's going to introduce you to exactly the API that you need to use. Thanks. [Applause]
[Adam Sonnanstine]
Thank you, Kevin! So as we've seen, the QuickTime file format can carry subtitles as well as all of the accessibility features that we've been talking about. That's the alternate track groups, the tagged characteristics and the track associations. So I'm going to spend a few minutes talking about the specific APIs that you can use in your applications to create Movie files that have these features. I'm going to start by talking through some of the basics, some of the specific data structures and classes you're going to use, and then talk through all of those specific accessibility features and the APIs for each one of them.
And we're going to motivate the whole thing using a scenario where we add a new subtitle track to an existing Movie file. So to see that scenario visually, we start out with a movie that has a few tracks: a video track, an audio track, and a single subtitle track.
In this case we're going to use a Spanish track, and we're going to create a new Movie and pull all of those tracks right over just as they were. And then we're going to create our own subtitles, our own translation into English, and push that into an additional subtitle track in the output file. So to see this in action I'm inviting Courtney back up on stage for a demo.
Alright, so here you see the demo app. We have a Movie loaded up. It's just as I described. If we open up the Subtitle menu here, we see that we have that Spanish subtitle track. So we can go ahead and select that and start playing until we find a subtitle that we'd like to translate.
[Movie playing] That one's kind of tough. Ah, there we go. That's an easy one. So let's do that. So Courtney will type in the translation which is "Why" and then what we're going to do is hit this button over here which is going to do this process I just described, where we create the new Movie file, and it's going to open it up in QuickTime Player. So here's our new file. We can examine the subtitles by opening up that menu again and you can see that, in addition to the existing Spanish subtitle track, we also have the new English subtitle track that we just added.
And QuickTime Player knew to put them in this menu because we used an alternate track group as Kevin described. I'll talk more about that in just a moment, but let's examine our handiwork. So let's check out the English subtitle track and start playing. [Movie playing] There it is. Alright so we just typed that in, it's in the Movie file and that's a great demonstration of how easy it can be to add subtitles to a Movie file using AV Foundation.
Thank you, Courtney! Alright, so let's take another look at that diagram here. We're going to zoom in just on the part where we're adding the subtitle into a new track, and the first thing that we're going to do is wrap that in an instance of CMSampleBuffer. In order to get that sample buffer into the output file we're going to be using a couple classes. The first is AVAssetWriter and the second is AVAssetWriterInput. They work together and you'll get one track in the output file for every asset writer input that you have.
Now for more detail on the specific format that the data should be in when you put it into a sample buffer, check out the "QuickTime File Format Specification," which is available at that URL. For more detail on CMSampleBuffer itself, it has its own documentation. But we also gave it a short introduction in 2010, in the session called "Using the Camera with AV Foundation." I think that happened around the 15 minute mark if you're pressed for time.
And lastly, for more information about AVAssetWriter, we gave it a very detailed overview in 2011, in the session called "Working with Media in AV Foundation," and this one you can actually watch directly from the WWDC app from this year. But tying it all together we do have sample code. It's called "avsubtitleswriter for OS X," so go ahead and download that to see all the details and all of these concepts working together in perfect harmony.
Alright, so now that we've seen sort of the basics, let's take a look at those accessibility features, starting with track groups. So I mentioned in the demo that we want a track group in the output file so that QuickTime Player knows how to put it in the menu to select between the two subtitle tracks, and the way you do this with AVAssetWriter is very similar. You create an input group that references the two Asset Writer Inputs that you're going to have.
To see that in code: You start out by creating an array that wraps the two Asset Writer Inputs, and then you use that array to initialize an instance of AVAssetWriterInputGroup, and then you add that into your Asset Writer. And you see here that we nominated the Spanish input to be the default, but if you do not want to specify a default you can pass in nil here.
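A minimal sketch of that step, assuming two already-configured asset writer inputs named spanishSubtitleInput and englishSubtitleInput and a writer named assetWriter (all hypothetical names):

```objc
// Group the two subtitle inputs so the resulting tracks form an alternate track
// group; the Spanish input is nominated as the default (pass nil for no default).
AVAssetWriterInputGroup *subtitleGroup =
    [AVAssetWriterInputGroup assetWriterInputGroupWithInputs:@[spanishSubtitleInput, englishSubtitleInput]
                                                defaultInput:spanishSubtitleInput];
if ([assetWriter canAddInputGroup:subtitleGroup]) {
    [assetWriter addInputGroup:subtitleGroup];
}
```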
So that's track groups. The next is track associations. We mentioned that you might want to use a track association if you want your subtitle track to get selected automatically when the audio track of the same language is selected by the user, and in the output file that looks like this. You have this association between the two tracks.
In this case it's a "selection follower" association, and you might not be surprised that with the Asset Writer we just create that same association between the Asset Writer Inputs. In code this is very simple: It's just one method, -addTrackAssociation WithTrackOfInput:, and you're going to use this to associate the subtitle input with the audio input using the type "selection follower." Alright, that's track associations. The last one I want to talk about is tagged characteristics.
Alright, that's track associations. The last one I want to talk about is tagged characteristics. We've talked at some length about SDH subtitles. If you're going to be authoring a subtitle track that contains all that extra information, the way you get it to be labeled as SDH is to use a couple of tagged characteristics, one of them being "transcribes spoken dialog for accessibility." And, as you might have guessed, to do that with Asset Writer Input you just tag your corresponding Asset Writer Input with that same characteristic.
In code, it's a little bit more involved. We're using the general-purpose metadata API, but here you just create a metadata item, you set its key space to "QuickTime user data" and its key to "tagged characteristic," and then you set the value to be the actual characteristic we want to use.
In this case it's "transcribes spoken dialog for accessibility." And then you just set that on the Asset Writer Input using the -setMetadata: method. Now, we've mentioned that there are two characteristics for SDH. To review, the second one is "describes music and sound for accessibility." Alright, so that is all of our accessibility features and how to author them. The last thing I want to mention in this section is closed captions.
Alright, so that is all of our accessibility features and how to author them. The last thing I want to mention in this section is closed captions. QuickTime Movie files can carry closed captions just as well as subtitles. Most of the concepts and the APIs you're going to use are exactly the same. The main difference is how you're going to format the data. So for more details on that, once again, see the "QuickTime File Format Specification." Alright, so that's how you can author subtitles and closed captions into QuickTime Movie files.
Let's move on to our last topic, which is "accessing the subtitle text during playback." Now why might you want to do this? Well, we'll take a quick look at the default experience you get if you adopt AVPlayerLayer or AVPlayerView. It's a great experience for most apps: You get the subtitles drawn on top of the video, they're styled appropriately according to the content and the user preferences that we've been talking about, but the one thing you don't have control over is where the subtitles are placed. After all if your video isn't taking up the whole screen, you might want to put the subtitles somewhere else. So, if you had access to the actual text of the subtitles while the movie is playing, you could do the drawing wherever you like, yourself.
Another thing you might want to do is to allow your users to interact with the text. Maybe get a dictionary definition for a really difficult word like this one, and if you had access to the text then of course you'd be able to implement this quite easily: Just slap it into a text view of some sort.
So the way you're going to do this is with a new class called AVPlayerItemLegibleOutput. This is new in iOS 7 and OS X Mavericks, and it's a companion to the Video Output class that we introduced last year. And just like the Video Output, it allows you to access that media data while the movie is playing, in real time.
In this case, instead of video frames, we are getting the text of each subtitle. And this is actually the technology that WebKit is using to do the integration that Kevin described, where the out-of-band subtitles in HTML5 text tracks and the in-band subtitles carried in HLS Streams and QuickTime Movie files can be integrated together. So I'm just going to spend a few minutes introducing this class to you.
We're going to start with the basics of how you'd use it, and then I'm going to talk in some depth about some things to keep in mind if you're going to try to draw the subtitles yourself, and finally a few scenarios we're going to highlight where you might want to use a different solution than Legible Output. It's not a "one size fits all" kind of tool.
Alright, to see this as a diagram: Just like the Video Output, you're going to attach your Legible Output right onto your player item that you're already using to play back your content, and the Legible Output is going to interact with a delegate that you provide, sending it the string for each subtitle as it goes by. The string is going to be in the form of an NSAttributedString object. It's going to have styling information and positioning information attached to it.
And you might be wondering: If you have more than one subtitle track in your Movie file, which one is going to be delivered to the delegate? Well, it's the same story as if we were drawing it ourselves. It's the one that corresponds to the currently-selected media selection option.
To see a little bit of code: You create one using +alloc and -init. Nothing fancy here. You're going to attach it to your player item using the -addOutput: method and then you're going to create a delegate that conforms to the AVPlayerItemLegibleOutputPushDelegate protocol and set that onto the Legible Output. Now you notice that the -setDelegate: call takes a dispatch queue.
You can specify any queue you want and the Legible Output will invoke all of your delegate methods on that queue. If you're going to be interacting with AppKit or UIKit in response to these delegate callbacks, you're going to want to use the main queue, as I do on this slide, to make sure that that is a safe interaction. A little bit more detail on the delegate protocol itself: This is the main method that you'll be using.
The second parameter is the one that you'll be getting the text from, those attributed strings. The third parameter, the native sample buffers, is for more advanced use; most people won't need to use it. Just know that by default you'll always get an empty array for this parameter.
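A compact sketch of the setup and the delegate callback, assuming a playerItem that's already playing your content and a hypothetical myDelegate object whose class adopts the protocol:

```objc
// Setup (wherever you configure your AVPlayerItem):
AVPlayerItemLegibleOutput *legibleOutput = [[AVPlayerItemLegibleOutput alloc] init];
[legibleOutput setDelegate:myDelegate queue:dispatch_get_main_queue()]; // main queue: safe for UIKit/AppKit work
[playerItem addOutput:legibleOutput];

// Delegate callback, in a class adopting AVPlayerItemLegibleOutputPushDelegate:
- (void)legibleOutput:(AVPlayerItemLegibleOutput *)output
didOutputAttributedStrings:(NSArray *)strings
  nativeSampleBuffers:(NSArray *)nativeSamples
          forItemTime:(CMTime)itemTime
{
    for (NSAttributedString *subtitle in strings) {
        // Draw the text wherever you like, or hand it to a text view for interaction.
        NSLog(@"Subtitle at %.2fs: %@", CMTimeGetSeconds(itemTime), subtitle.string);
    }
}
```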
It's called "AVLegibleMeanings." So go ahead and download this sample to see Legible Output in context and in full detail. So that's sort of the basics of using Legible Output. Let's move on now to "things to think about if you're going to be drawing the subtitles yourself." We don't expect that everyone is going to want to do that, in fact probably very few, but if you do, remember that you have the styling and positioning attributes attached to each attributed string. They look something like this.
They're all declared and defined in CMTextMarkup.h, but I'm just going to highlight a few things: The first is that the attributes are loosely based on the WebVTT model for styling and positioning, so if you're used to that model already then this should be pretty familiar. The second thing is that this is a common format.
We're going to give you the same set of attributes no matter where the subtitles came from, whether it was TX3G or WebVTT or CEA-608 or any of those other acronyms. So you only have to understand these attributes and not all of those individual formats. And the last thing I want to mention is this whole topic of the user preferences for subtitle styling.
We've talked about how the user's preferences are the most important set of styling information. The great news is that when we're populating these attributes, the Legible Output takes into account those user preferences. So just by following these values you don't have to do any extra work to make sure that you're respecting the user's preferences.
Of course we do expect that some people will want to modify the styling a little bit to suit their needs. If you are going to do that you're going to want to make sure that you're not accidentally overriding the user's preferences, and the way you can do that is to look at the MediaAccessibility framework. This is going to provide two services.
You can both access the current set of user preferences, and it will also give you a notification if the user changes those preferences while your application is running.
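For example, a short sketch of reading a couple of the user's caption-appearance preferences with MediaAccessibility; I'm assuming here that the change notification is observed through the default notification center:

```objc
#import <MediaAccessibility/MediaAccessibility.h>

MACaptionAppearanceBehavior behavior;

// The user's preferred text color; `behavior` reports whether the user insists
// on this value or would accept the content's own styling.
CGColorRef textColor =
    MACaptionAppearanceCopyForegroundColor(kMACaptionAppearanceDomainUser, &behavior);

// The user's preferred text size, as a scale relative to the default size.
CGFloat relativeSize =
    MACaptionAppearanceGetRelativeCharacterSize(kMACaptionAppearanceDomainUser, &behavior);

// Re-query whenever the user changes their captioning preferences.
[[NSNotificationCenter defaultCenter]
    addObserverForName:(__bridge NSString *)kMACaptionAppearanceSettingsChangedNotification
                object:nil
                 queue:[NSOperationQueue mainQueue]
            usingBlock:^(NSNotification *note) { /* refresh your styling here */ }];

if (textColor) CFRelease(textColor);
```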
So that's the story on drawing subtitles yourself. The last thing I want to mention is when you might not want to use Legible Output. Well, if you're going to do any sort of indexing operation over all the subtitle text in a Movie file, you should take a look at AVAssetReader instead. It's going to be a better way to do that for pretty much any case. If you're just going to be doing some custom styling of WebVTT content at a basic level, maybe to match the look and feel of your application, you're going to want to take a look at the textStyleRules property on AVPlayerItem that we've mentioned before.
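A minimal sketch of that approach; the particular color and bold choice here are arbitrary examples:

```objc
// Render subtitle text in bold yellow; user preferences, if set, still win.
NSDictionary *markupAttributes = @{
    (id)kCMTextMarkupAttribute_ForegroundColorARGB : @[ @1.0, @1.0, @1.0, @0.0 ], // alpha, red, green, blue
    (id)kCMTextMarkupAttribute_BoldStyle           : @YES
};
AVTextStyleRule *rule = [AVTextStyleRule textStyleRuleWithTextMarkupAttributes:markupAttributes];
playerItem.textStyleRules = @[ rule ];
```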
That's going to be much easier to use. You won't have to do the drawing yourself. And finally, if you think that the Legible Output sounds like a great way to get the subtitles and draw them with whatever styling you want without regard to the user preferences: By now you can probably guess that we don't want you to do that, but more importantly your users don't want you to do that either.
Some of your users might have very specific needs for how the subtitles should look, so for the sake of your users, make sure that their needs are taken into account as the most important thing. Alright, that's Legible Output. That was our last topic. Let's see what we learned today.
So, we started out by learning how we can support accessibility during playback either by doing very little work, maybe by adopting AVKit and letting it handle the details, or, in a more advanced case, using the Media Selection APIs to manage all the details yourself. We talked about some of your choices for content delivery and their pros and cons. That was the HTML5 text tracks, HTTP Live Streaming and QuickTime Movie files. And we talked about how to add subtitles to those Movie files using AVAssetWriter, and finally we talked about how to access the subtitle text during playback using AVPlayerItemLegibleOutput.
For more information we have our documentation, and you can check out the Apple Developer Forums to get your questions answered. There are some related sessions. If you stick around in this room, coming right up next, if they don't kick you out first, is the "What's New in Camera Capture" session. It's always a fun one.
And tomorrow morning we have a session on "Advanced Editing in AV Foundation." Alright, so that's all. Thank you very much for coming, have a great rest of your week, and if you do one thing on the way out of this session: Download our sample code and see how you can make your applications and your content accessible to the broadest range of people. Thank you!
[ Silence ]