Integrate SiriKit Media Intents with HomePod - Tech Talks

2020 • 39:29

Bring your music service to HomePod and help users enjoy your content hands-free throughout their home. Discover how to build a great integration from start to finish. We'll take you through how Media Intents work, configuring your app as a preferred music service on HomePod and integrating directly with your cloud backend, providing a personalized playback experience for individual accounts with the help of voice recognition (where available), and more. To learn more about Media Intents, check out "Expand your SiriKit Media Intents to more platforms" from WWDC20.

Speaker: Jeri Mason

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Hi, I'm Jeri Mason. I'm a software engineer on the HomePod team, and I'm going to walk you through bringing your media service into the home with SiriKit Media Intents, now powered by cloud extensions. Let's start with everything you need to start developing. First, we'll cover some important prerequisites for developing a service effectively.

Then we'll cover how to configure a test home, how to set up your HomePod for testing, how to set up your iOS device for development, and finally, adopting the Media Setup framework in your app for onboarding your service into a home. Once you've taken these steps, you're well on your way to creating an awesome playback experience on HomePod for the people who love your content.

Let's dive into the prerequisites. To support SiriKit Media Intents on HomePod, your service must already support Media Intents in your iOS app. You'll need to apply for the program on developer.apple.com/siri. After you apply, you'll receive more details about accessing the Cloud Extension API specification, which your service must implement.

Then you can register your service with us to get entitlements and profiles to begin development. And finally, before a HomePod will be able to reach out to your service via the Cloud Extension APIs, you must implement an OAuth flow and a configuration web service defined in our specification.

Let's talk more about registration. To register your service, you'll need to provide your service name, a public signing key that you create to verify the service on devices, a static URL to retrieve an icon for your service from iOS devices in the home, and a list of bundle IDs of apps which are using the service. This information will be used to create a development profile you can install on your test and development devices. Only devices with a profile will be able to test your service.

It should come as no surprise that, since the Cloud Extension APIs use web technology, we've adopted authentication using an OAuth 2.0 endpoint. But since your app takes the role of a trusted client, a service that implements the client credentials flow is required. A unique client ID and client secret are required for each service account that is added to a home.

Even though it's uncommon for this flow, the issuing of renewal tokens is recommended so as to avoid a common reuse of these credentials provided directly to the OAuth service. This is because without a renewal token flow, the client ID and client secret could be long-lived. As long as someone is using your service from a home, either a renewal token or the original authentication must be supported. Your iOS app must be able to supply these credentials by requesting them from your service and providing them to iOS Media Setup APIs.
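
As a rough illustration of the flow described above, the following Python sketch builds a client credentials token request and a follow-up renewal request in the shape defined by RFC 6749. The use of HTTP Basic client authentication and the example credential values are assumptions; the Cloud Extension specification defines what HomePod actually sends.

```python
import base64
from urllib.parse import urlencode


def build_token_request(client_id, client_secret):
    """Return (headers, body) for an RFC 6749 client credentials request."""
    # Assumption: the client authenticates with HTTP Basic (RFC 6749, 2.3.1).
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {basic}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials"})
    return headers, body


def build_renewal_request(refresh_token):
    """Return the form body that trades a renewal (refresh) token for a new token."""
    return urlencode({"grant_type": "refresh_token", "refresh_token": refresh_token})


# Hypothetical per-home credentials issued by your service to your iOS app.
headers, body = build_token_request("home-123", "s3cret")
print(body)
```

Issuing a renewal token with the first response means later requests can use `build_renewal_request` instead of resending the long-lived client secret.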

HomePods reach out to your service to resolve intents. On iOS, the details of what intents you support are part of your app's bundle. In Cloud Extension, you must implement one or more configuration resource endpoints, which identify supported endpoints and what URLs you use to reach them. There's a lot of flexibility here. You can have multiple static URLs or implement an endpoint that creates them dynamically, all of which is useful for development, staging, and test configurations. Your endpoint can even check a request's authorization header, identifying the account retrieving the configuration.

The body of the configuration is encoded using the JavaScript Object Signing and Encryption (JOSE) standards, and it must be properly signed for the device to accept it. However you serve the resource, it must provide standard HTTP cache control headers so that clients do not have to retrieve it often. Every time your app adds your service to a home, it provides the URL for this resource, as well as the client credentials. With these prerequisites met, let's talk about configuring a test home.
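
To make the idea of a signed configuration resource concrete, here is a minimal sketch of the JWS compact serialization from RFC 7515, served with a cache control header. It uses HMAC-SHA256 only because that is available in the Python standard library; the real specification may well require an asymmetric algorithm tied to the signing key you registered. The payload fields, the URL, and the `max-age` value are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def b64url(data):
    """Base64url-encode bytes without padding, as JWS requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text):
    """Decode unpadded base64url text back to bytes."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_compact(payload, key):
    """Produce header.payload.signature (JWS compact form) using HS256."""
    header = b64url(json.dumps({"alg": "HS256"}, separators=(",", ":")).encode())
    body = b64url(json.dumps(payload, separators=(",", ":"), sort_keys=True).encode())
    signing_input = f"{header}.{body}"
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"


def verify_compact(token, key):
    """Return the payload if the signature matches, otherwise None."""
    signing_input, _, signature = token.rpartition(".")
    expected = b64url(hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest())
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(b64url_decode(signing_input.split(".")[1]))


# Hypothetical configuration payload; real field names come from the specification.
config = {"intents": {"playMedia": "https://api.example.com/intents/play"}}
token = sign_compact(config, b"demo-key")
response_headers = {"Cache-Control": "max-age=3600"}  # let clients cache the resource
print(verify_compact(token, b"demo-key"))
```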

While you're developing and testing your service, having a dedicated test home in the Apple Home app is strongly recommended. That way, you won't affect any HomeKit homes you may use personally, and you only have to manage the configuration of the devices in your test home. You'll also want to have multiple test iCloud accounts, so you can test your service with different accounts in the same home.

And naturally, you want to make sure that you're using the same accounts on the HomePod and your iOS test devices. Creating a new home in the Home app is easy. First, tap Home Settings, choose to add a home, and type in the name of the home. Now you can add people who are your other test users.

Okay, now that you've got your test home set up, you'll need to add a HomePod to it and then set it up. You'll usually want to install the latest tvOS developer beta release on your HomePod, or a release Apple currently advises for development. It's important to note that you cannot use a regular tvOS release until you're ready to submit your app.

Until then, you'll also need to install a development profile provided by Apple for your service in the Home app on iOS. Be sure to repeat this process for each device you plan to test in your home. When you update to a new version of tvOS, you may have to reinstall the profile. To install a profile on a HomePod, first make sure your test home is selected in the Home app.

Open the profile on an iOS device. Select HomePod as the device type to install. Tap "Install." Enter your passcode. Consent to the agreements. Tap "Install" to complete installation. You can then choose which HomePod in your test home to install the profile to. That's it. You'll need to set up your iOS devices for testing and development as well.

You'll need to be using a developer beta on your iOS device. Usually you want the version that was released at the same time as tvOS, but check with Apple for the latest recommended version. Like the HomePod, you'll need to install your developer profile. Tap on the mobile config bundle in an email or iMessage attachment.

Choose to install it on your iOS device and follow the typical flow for accepting and installing profiles in Settings. It's a good idea to check that your iOS device and your HomePod devices are running the expected releases at this point. You can repeat the setup for any test devices you may need. They can be iPhone or iPad devices.

With your devices configured, you can now adopt the Media Setup framework in your app. This is what ultimately will allow people to set up your service for use in their home. To use Media Setup, you'll need to implement functionality in your app to retrieve or create OAuth material, choose a configuration resource URL appropriate for the account being onboarded, and then you'll have to add the entitlement you received after registering your service with Apple. Otherwise, you won't be able to use the framework. Make sure to add a prominent, discoverable onboarding presentation context in your app to trigger the account installation flow using MSSetupSession.

People will be guided through the home account setup process. The setup process will use your OAuth service to retrieve API tokens and verify your configuration resource URL is properly signed and parsable. So all the prerequisites I mentioned earlier must be met before a successful setup will complete. You can get more details at developer.apple.com.

Now that your development and test environments are ready, a HomePod in your test home can reach out to your service. Let's go over the process: the system architecture of handling intents in the cloud. On iPhones, iPads, or Apple TV, when you ask Siri to play media in your iOS app, you either handle intents directly or via an extension in your app bundle. Typically, when someone asks Siri to play content from your app, Siri processes this request, identifies your app bundle as the target of the intent, and loads your app extension, which implements the intent resolution.

Your extension can communicate with your services, and the handling of the intent will trigger launching of your app to begin playback, or just signal it to handle changes. With cloud extension, the process of resolving those intents moves from your extension implementation on device to an API on your service, as the HomePod directly reaches out to your cloud extension intent endpoints instead.

The Cloud Extension API is a web API specification designed for intent handling. It uses OpenAPI schemas to define objects that closely follow the SiriKit interface. It reuses many of the concepts from the APIs you're already familiar with on iOS. The configuration URLs I discussed as we were getting started are also part of this specification, and their content is defined in it. Each SiriKit media intent type has a protocol endpoint configuration, and we can reach out to different URLs as necessary for your implementation. The protocols supported are play media, add media, and update media affinity.

You may notice that the search protocol is missing from this list. The search protocol is not provided on HomePod, because there's no context to display the results on HomePod, unlike on iOS devices. Let's look at what happens when I tell Siri to play content. When I tell Siri, "Play 'Strange' by Celeste," this triggers the play media intent handling flow.

The HomePod starts resolving a Play Media Intent with a Media Search object. It's serialized in JSON and sent to your service. Once you match the Media Search to something appropriate in your catalog, you'll create a response object. The Media Item has an identifier, type, and additional metadata that Siri may use to formulate dialogue to tell the user, "Now playing 'Strange' by Celeste." Additionally, this protocol has a handle method that is called after resolving other fields on the intent. If the fields on the intent are already acceptable, there's no reason to delay returning the final handle method, so include it in your response to the initial request.

Not doing so causes extra round trips to your service unnecessarily and slows down the intent response time. Inside a handle response, you can include an object called "UserActivity," which contains a persistent identifier and user info that will be returned to your service when the HomePod starts playing your content. Your service replies with these results and may begin to formulate a queue of content to play based on the request as the HomePod processes these results and starts a playback session.
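
The advice above, returning the handle result together with the resolution result, might look like the following server-side sketch. Every JSON field name here (`results`, `method`, `userActivity`, and so on) and the tiny in-memory catalog are hypothetical stand-ins; the actual shapes are defined by the Cloud Extension specification.

```python
# Hypothetical in-memory catalog keyed by (media name, artist name).
CATALOG = {
    ("strange", "celeste"): {
        "identifier": "track-001",
        "title": "Strange",
        "artist": "Celeste",
        "type": "song",
    }
}


def handle_play_media(request):
    """Resolve the media search and, when it succeeds, include the handle result too."""
    search = request["intent"]["mediaSearch"]
    key = (search.get("mediaName", "").lower(), search.get("artistName", "").lower())
    item = CATALOG.get(key)
    if item is None:
        # Nothing matched: report only the failed resolution.
        return {"results": [{"method": "resolveMediaItems", "result": {"unsupported": True}}]}
    user_activity = {
        "persistentIdentifier": f"activity-{item['identifier']}",
        "userInfo": {"contentIdentifier": item["identifier"]},
    }
    return {
        "results": [
            {"method": "resolveMediaItems", "result": {"success": [item]}},
            # Returning handle now avoids a second round trip from the HomePod.
            {"method": "handle", "result": {"status": "success", "userActivity": user_activity}},
        ]
    }


request = {"intent": {"mediaSearch": {"mediaName": "Strange", "artistName": "Celeste"}}}
print([entry["method"] for entry in handle_play_media(request)["results"]])
```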

Now, when media is playing from your service, let's consider what happens when I tell Siri, "Add this to my library." The add media intent handling flow starts with resolving media items as before, except now that Siri knows what's currently playing from your service, an identifier for the currently playing content is also part of the media search parameters.

As before, your service must identify the content playing and add it to the library of the account used to make the request. You'll then create an object to represent the resolved media item and bundle together the handle method result that indicates this operation was successful. Then you'll return that response to the HomePod, where Siri will confirm the result.

And finally, when I tell Siri, "I like this," the update media affinity intent handling starts again with resolving media items as before. Your service will identify the content, create a resolved media item result, register this affinity against the taste profile of the account used to authenticate the request, and, like the previous examples, include the handle method result.

These context-specific requests will only include a content identifier if your service is the service playing on the HomePod when the request is made. Also, don't forget that listeners can just say that they want to add something that's not currently playing to their library. In this case, you'll be provided with media names and artist names to resolve. This is a less common use case, though.

Let's see a basic play media request and response as it looks in the protocol, as JSON. Suppose I ask my HomePod the most basic request: to play music on your service. Every request to the intent endpoints on your service provides a session description before the first item to resolve. Most objects in the specification include the Cloud Extension version number of the object being presented. Your service should support older versions of these objects when possible. Newer versions also should be backward compatible up to a new major version number.

The intent resolution session has an identifier, which is also carried in a header on HTTPS requests. It will remain the same during resolution of the intent interaction, and if multiple requests are made to resolve this intent, this identifier will be consistent between them. Two timestamps are provided: the UTC time when the request was prepared and a deadline time, after which the client will time out the request.

If the deadline is missed, Siri will explain the service cannot resolve the intent. Providing a timely response is important. Timeouts happen after 10 seconds. So consider how you can break up the work of returning intent results and preparing content for playback, so that your service can take advantage of the time between requests from the HomePod. These constraints and player context objects will be discussed in more depth later. They provide important information about what's happening on the HomePod when the request was made.
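
One way to act on the two timestamps is to compute the remaining budget and decide whether slower work, such as building the full queue, should be deferred until the HomePod's next request. This is a sketch under assumptions: the ISO 8601 timestamp format, the safety margin, and the function names are illustrative and not taken from the specification.

```python
from datetime import datetime, timezone


def remaining_seconds(deadline_iso, now):
    """Seconds left before the client times out the request."""
    deadline = datetime.fromisoformat(deadline_iso)
    return (deadline - now).total_seconds()


def should_defer(estimated_work_seconds, deadline_iso, now, safety_margin=1.0):
    """True if the work will not comfortably fit before the deadline."""
    return estimated_work_seconds + safety_margin > remaining_seconds(deadline_iso, now)


# Hypothetical values: the request was prepared at noon UTC with a 10 second deadline.
now = datetime(2020, 11, 1, 12, 0, 0, tzinfo=timezone.utc)
deadline = "2020-11-01T12:00:10+00:00"
print(remaining_seconds(deadline, now), should_defer(12.0, deadline, now))
```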

Most requests start with a resolveMediaItems method on the protocol. The most important part of this request is in the parameter, an intent object. This is a playMedia intent, and all playMedia intent objects have a media search that defines attributes of media parsed by Siri from speech. Other parameters of the intent include information about whether I asked to play the content shuffled or repeating, or to resume playing the content, or whether I said to play something next or later instead of now. This object should appear familiar to you if you've already handled playMediaIntents on iOS.

The response you return from this request should look familiar too, but there are a few new twists. Inside an array of results, you provide the method name and a result object. In this case, the result is a resolutionResult array with one item of the resolution result class appropriate for this method, which here carries a success property. As you can see, this resolved media item represents "Your Favorites," a station based on my taste profile.

Here is something new for cloud extension. Each method you process should return a metrics object that describes how long in seconds it took for your service to provide results. Apple may use this data in aggregate to roughly determine the experience of listeners and better evaluate when delays are caused by network connectivity. You can see the next result item is peeking around the break here: the handle method for this protocol.
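
A service could attach such a metrics object by timing each method as it runs. In this sketch the wrapper function and the `durationSeconds` field name are hypothetical; only the idea of reporting elapsed seconds per method comes from the talk.

```python
import time


def timed_result(method_name, resolver):
    """Run resolver() and wrap its result with the elapsed time in seconds."""
    start = time.perf_counter()
    result = resolver()
    elapsed = time.perf_counter() - start
    return {
        "method": method_name,
        "result": result,
        "metrics": {"durationSeconds": round(elapsed, 3)},  # hypothetical field name
    }


entry = timed_result("resolveMediaItems", lambda: {"success": []})
print(entry["method"], entry["metrics"])
```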

The result of a handle method is a response object appropriate for that intent, which has a status code and then a description of a user activity object. Think of user activities as a memento representing the intent result. For Play Media Intent results, the user activity is used when starting or modifying playback of your content. Whereas on iOS devices the user activity object is made available to your app when it's launched, on HomePod your service defines the listening experience with another Cloud Extension API for media queue playback.

Up to this point, intent handling should have been super familiar if you've implemented SiriKit Media Intents on iOS previously. Playing media from your cloud service is the driving feature of Cloud Extension, and in this part, I'll provide you the concepts and specifics to implement media queues in the cloud.

Let's review again what happens when a person asks Siri to play media on a HomePod. The request is processed by Siri, and our Cloud Media Service Extension is chosen to handle it on behalf of your service. The HomePod uses the Cloud Extension APIs implemented by your service to resolve an intent. Once we've resolved the intent, a Cloud Media Player component will contact your service's queue endpoints. And it's those queue endpoints which determine what content to play.

In most services, a content distribution service often provides the actual audio files to play. You might notice that there are two HomePods depicted here. In certain situations, the HomePod that resolves the intent may not be the same HomePod that is playing your content. If a family has several HomePods, the HomePod closest to someone in the living room might respond to their request, even if your service had already been playing to multiple locations in the house and that session had started in the kitchen.

Let's see how your service provides content for your queues. I'm going to walk you through some high-level concepts of media queues, including various idioms and use cases for different listening experiences, and then dive into details of managing queues and their lifetime. And finally, I'll explain how your service defines playback controls on HomePod. Okay, let's explore what it means to provide a queue from your service.

All listening experiences supported by Cloud Extension are defined by playing a queue of content items. Queues are created or modified by submitting user activity objects resulting from intent resolution to your service. Your service returns a queue of content items all at once or, more likely, in segments. Let's look at some use cases and how to conceptualize them as a queue. Internet radio content often starts with a promotion track, then many songs, quite often an advertisement, more songs, and then it continues indefinitely until a listener stops playback.

On-demand music requests can resolve to a single album or other collection like a playlist and be constrained to a finite queue. And live streaming content will often have a single content item which forms a minimal queue response that has one piece of content referencing the live stream source.

Content items in a queue can be described with an identifier, a URL, and various descriptive attributes like the name of the content, the artist, and so on. Content selection can take any number of algorithmic forms in your service, from a rigid album track order to a completely dynamic selection.

Queues are intended to support all these use cases. Now let's talk about the various ways you can represent your sequence of content items in a queue idiomatically. The idioms I'll be describing are variations on using a queue broken into segments for different purposes. So here we have our content conceptually realized over time, described by objects.

Say these four pieces of content are the complete listening experience you want to represent as the result of an intent resolution. You can package them into a queue object, which has its own identifier and a simple array of all these content items. That's a pretty basic queue object.

This is the static content idiom, where a queue contains a finite, non-changing list of content. It may fit into one queue segment, or it might not, but conceptually, it is a rigid list of content items that can be played in order. Use cases for this idiom include the on-demand music request for an album or a playlist, or the live stream use case where the queue is a simple wrapper for one piece of content that streams indefinitely.

Your service can have more complicated queues, however. Queues can be split into segments, and they internally link to the next content and previous content items via URL references. The HomePod will retrieve these segments as your service defines them, and any particular segment can be dynamically generated and return different results as the listening session moves to the new segment, either by transitioning naturally or by skipping forward or backwards through your content items.
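
The segment linking described above can be sketched as a helper that splits a list of content items into segments joined by next and previous URL references. The field names and URL layout are hypothetical; the specification defines the real queue object. A segment size equal to the full list gives the static idiom, and a segment size of one approximates a fully dynamic queue in which each segment is fetched separately.

```python
def build_segments(queue_id, items, segment_size, base_url):
    """Split items into linked queue segments with next/previous URL references."""
    chunks = [items[i:i + segment_size] for i in range(0, len(items), segment_size)]
    last = len(chunks) - 1
    segments = []
    for index, chunk in enumerate(chunks):
        segments.append({
            "queueIdentifier": queue_id,
            "content": chunk,
            # None marks the start or end of a finite queue.
            "previous": f"{base_url}/{queue_id}/segments/{index - 1}" if index > 0 else None,
            "next": f"{base_url}/{queue_id}/segments/{index + 1}" if index < last else None,
        })
    return segments


tracks = [{"identifier": f"track-{n}"} for n in range(1, 5)]
for segment in build_segments("q1", tracks, 2, "https://api.example.com/queues"):
    print(len(segment["content"]), segment["previous"], segment["next"])
```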

This particular example might fit a use case where a service provides an advertisement inside a set of fixed content segments, and each time the player encounters queue segment B, a different advertisement might play. This means your service queue content is retrieved dynamically, so your service will experience more requests for each queue segment as it is entered.

Your service may want to have the most discrete control over content and playback options. In the fully dynamic idiom, every queue segment contains one content item, and new content for the queue can be retrieved dynamically as the listener proceeds through it. This is a popular idiom for internet radio use cases because the queue is typically of indefinite size and also because the queue content itself is a function of listening behavior. Skipping tracks and providing content affinity through Siri Intents may change the taste profile associated with the account, and therefore upcoming content items may be selected differently. Naturally, this is also the idiom that represents the most request processing on your service.

Since these idioms are different ways of using the same queue architecture, your service can provide queues using any combination of these idioms, and you can even choose alternate idioms to implement for a particular kind of request. They're useful for considering trade-offs in complexity and capability. Enough theory. Let's talk about how your service creates and manages these queues. When a Play Media Intent Handling request returns from your intent endpoint, your service should start creating your representation of a queue, because shortly afterward, a request on your queue Play Media Endpoint will be made to retrieve the initial or complete queue segment.

Each segment can link to a next and previous segment as your service desires, providing a set of content that is potentially endless or circular. Usually, a newly retrieved queue starts playing from the first content item. To support resuming playback from a prior session, a Play Pointer object can be provided in the initial queue segment, and playback begins from the content item indicated, at a time offset you provide. To support smooth, gapless playback, as the listening point is near the end of the last content item in the queue segment, the next segment will be retrieved and appended at a configurable pre-roll interval.

[Transcript missing]

I hope you have a good grasp of the rich set of queue management tools your service can use to provide flexible and timely content retrieval and updates. Finally, in this section, I'll expand on controlling playback and defining how listeners interact with your service. Not every kind of queue is the same, and not all content in a queue can be treated the same.

For example, the internet radio use case typically supports only forward motion through the queue. Additionally, when an advertisement is playing, you may not want to support skipping that content at all. Skipping ahead in radio station queues may be limited by many variables: premium subscriptions, freemium models, and other more esoteric limitations. The fully dynamic idiom is often the only way to support these requirements. Moving between different content in a queue may be limited arbitrarily. Let's talk about how you define this interaction.

Cloud Extension queue APIs let your service define interaction with PlayMedia Control objects you provide in the queue object. Each PlayMedia Control object can choose a predefined scheme which supports several common use cases. This set of defined interactions can be specified on each content item with a control attribute.

Let's look at an example control definition which might be appropriate for an on-demand service with advertisements interspersed with songs. You include these definitions inside a controls attribute for every queue object. In this example, there are two definitions: the required "default" definition, which is used when any content item does not specify a control, and an "ad" definition, which is specified on all content items that are advertisements.

All control definitions have a prototype scheme. By default, the content items in the queue are going to be the on-demand scheme, and for the ad content, the advertisement scheme. The scheme modifies the Siri dialog when a user asks to move to the next content item. Siri may then reply that skipping advertisements is not permitted. The scheme also defaults all navigation commands to be disabled, though you can selectively enable them again.

An example of enabling commands is also provided here, where the dislike and like track commands, normally disabled by default, are being enabled for your on-demand content. You can see how with a small set of control definitions linked to items in your queue, you can substantially customize interactions with your content. Finally, the control definitions also allow specifying options for user activity reporting, which is how your service receives optional command events, navigation information, and play status of your content. These are all topics I'm about to cover next.
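
The slide with the control definitions is not captured in this transcript, so the following is only a guess at the general shape: a "default" definition and an "ad" definition, with each content item falling back to "default" when it names no control. The scheme names, command names, and field names are hypothetical.

```python
# Hypothetical control definitions for a queue object.
CONTROLS = {
    "default": {"scheme": "onDemand", "enabledCommands": ["likeTrack", "dislikeTrack"]},
    "ad": {"scheme": "advertisement", "enabledCommands": []},
}


def control_for(item, controls):
    """Pick the item's named control, or the required default definition."""
    return controls[item.get("control", "default")]


def can_skip(item, controls):
    """Assume the advertisement scheme disables navigation unless re-enabled."""
    control = control_for(item, controls)
    if control["scheme"] == "advertisement":
        return "nextTrack" in control["enabledCommands"]
    return True


song = {"identifier": "track-001"}
advert = {"identifier": "ad-17", "control": "ad"}
print(can_skip(song, CONTROLS), can_skip(advert, CONTROLS))
```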

So far, we've seen how a HomePod reaches out to your service to resolve intents and manage a queue of content to play. Now, I'll talk about how your service is made aware of playback events and can contextualize requests over time, starting with how listening activity can be reported to your service, how playback context is provided when already playing any queue from your service, and finally, how your service must respect certain playback constraints. So let's start with reporting.

Listener activity is reported using the Media Queue Update Activity Endpoint in the Cloud Extension API. This endpoint is technically optional, but most services implement it to get status reports from the HomePod as people listen to content. Your service receives reports for transitions and media control events such as skips, pausing, continuing playback, and scrubbing in a track.

Your queue content can be affected by these reports, especially if you've implemented a partially or fully dynamic queue idiom. Your service can even return new queue segments as a response. Since update activity requests are made asynchronously from content retrieval for your queue, returning a new queue on the report of a track skip may not present your queue content cleanly and might fight the normal transition logic. Some audible jump in playback would likely occur.

Use queue responses from reporting only to completely change the playback or insert new content. Use content URL templating to get more context of why a queue segment was retrieved. Let's look at an example request made to the update activity endpoint. In this example, I asked Siri to skip to the next track.

These simple interactions are not intents but media controls. In JSON, the report looks like this. As you can see, the type of report being made is tagged as a "Skip Next." The full set of reporting types is defined in the specification and includes natural transitions between tracks, scrubbing within a track, and when a new queue replaces the existing queue.

The now playing context is provided, including the activity, queue, and content identifiers, and an offset into the content that has currently been played. For transitions, a previously playing context is also provided. It represents the identifiers associated with the last content being played and the offset inside the content when playback transitioned.

You might use this to update your listener's taste profile if they skipped quickly. But in this case, I listened to a good portion of that song, as you can tell in the offset in millis value. So your service may assume I probably liked it. At least enough not to lower its relative rating.
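
A service might turn a skip report into a taste signal along these lines. The report field names (`type`, `previouslyPlaying`, `offsetInMillis`), the "skipNext" tag, and the 25 percent threshold are all assumptions made for illustration.

```python
def affinity_signal(report, duration_ms, quick_skip_fraction=0.25):
    """Return -1 for a quick skip, 0 when no adjustment seems warranted."""
    if report.get("type") != "skipNext":
        return 0
    played_ms = report["previouslyPlaying"]["offsetInMillis"]
    # A skip early in the track suggests dislike; a late skip is treated as neutral.
    return -1 if played_ms < duration_ms * quick_skip_fraction else 0


late_skip = {"type": "skipNext", "previouslyPlaying": {"offsetInMillis": 150_000}}
early_skip = {"type": "skipNext", "previouslyPlaying": {"offsetInMillis": 8_000}}
print(affinity_signal(late_skip, 200_000), affinity_signal(early_skip, 200_000))
```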

And like other endpoints in the queue API, the user activity object is presented in full, in case the user info data is useful for your service to recover state and quickly process this information. There are more advanced capabilities your service can use as well. For example, you can request playback interval reports as content is playing, to know that a certain point has been reached, or each time a particular repeating interval has elapsed. You can enable and receive reporting of certain MPRemoteCommandCenter commands when iOS devices control playback on the HomePod.

Note that the remote commands for like track and dislike track are not the same as update media affinity intent handling. They are not voice intents, but simple taps from an iOS device anywhere in the house. While I discussed returning new queues previously, you can also update user activity objects during playback by returning an updated version of the object.

Here is an example we saw earlier of custom control schemes which include activity definitions. The default control includes a play elapsed interval of 60 seconds, which means that a report is generated every 60 seconds as content items that have no control attribute are played. But for items with a control attribute of "ad," as soon as 5 seconds of the content have elapsed, a report is generated, and then another report is generated every 10 seconds. There are several limits on reporting intervals defined in the specification.
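
To see what these intervals imply for request volume, this small sketch lists the playback offsets at which reports would be generated, reading the description as "first report at an optional initial offset, then one per interval." The function and parameter names are hypothetical, and the example track lengths are made up.

```python
def report_offsets(duration_s, interval_s, first_at_s=None):
    """Offsets (in seconds) at which interval reports would fire during one item."""
    offset = interval_s if first_at_s is None else first_at_s
    offsets = []
    while offset <= duration_s:
        offsets.append(offset)
        offset += interval_s
    return offsets


# Default control: every 60 seconds over a 200 second song.
print(report_offsets(200, 60))
# Ad control: first report at 5 seconds, then every 10 seconds over a 30 second ad.
print(report_offsets(30, 10, first_at_s=5))
```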

Reporting provides your service a lot of data, but tracking that context in your service may be expensive as millions of listeners process intents. A session context can be provided to your service to help bridge the context of the queue playback and intent handling endpoints. The session context is only available if the HomePod is currently playing media from your service. When I tell Siri I like a particular piece of content, your intent service endpoint receives a player context object, much like how the "now playing" object discussed in the reporting section is sent on every report.

Player context includes content, queue, and user activity persistent identifier values. These encompass three levels of definition I have introduced previously: the activity persistent identifier, which defines a potentially long-lived listening session, the queue, which defines the current listener's experience, and the particular content, which is now playing at the time of the intent resolution.

This item is also carried in the intent media search object, and you may be familiar with using it on iOS devices already. Consider whether session context gives your service enough information to avoid implementing the update activity endpoint, and handling more complex reporting events, at all.

In this example request for update media affinity intent handling, the HomePod was already playing a queue from your service so a player context object is provided to the intent endpoint. Inside the session object, you will find player context, or a null value if your service is not currently playing.
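
A handler for "I like this" could branch on whether a player context is present, falling back to the media search names when it is null. The JSON layout below (`session`, `playerContext`, `contentIdentifier`) is a hypothetical rendering of the objects described in the talk.

```python
def content_to_rate(request):
    """Decide how to identify the content an affinity request refers to."""
    context = request.get("session", {}).get("playerContext")
    if context is not None:
        # The HomePod is playing from this service: use the identifier directly.
        return {"by": "identifier", "value": context["contentIdentifier"]}
    # Not currently playing from this service: resolve by spoken names instead.
    search = request["intent"]["mediaSearch"]
    return {"by": "search", "value": (search.get("mediaName"), search.get("artistName"))}


playing = {
    "session": {"playerContext": {"contentIdentifier": "track-001", "queueIdentifier": "q1"}},
    "intent": {"mediaSearch": {}},
}
idle = {
    "session": {"playerContext": None},
    "intent": {"mediaSearch": {"mediaName": "Strange", "artistName": "Celeste"}},
}
print(content_to_rate(playing), content_to_rate(idle))
```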

If your service's taste profiles are explicitly driven only by media affinity and play media requests, player context may handle all of your requirements without implementing full reporting. But before your service uses taste profile information from these tools, it also must honor constraints on usage sent with every request. In a home, there is a different kind of contextual awareness your service must know about and honor.

Each HomePod can be configured with certain options that form constraints on your service. For example, a HomePod in a child's room might have an option to exclude explicit content from results. Whenever a request is made on this HomePod, it should not return explicit content. But a request to a HomePod in a bedroom that does not have this setting enabled should be allowed to return explicit content.
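
Honoring the explicit content constraint can be as simple as filtering candidates before returning results. In this sketch the `allowExplicitContent` constraint name and the `explicit` flag on items are hypothetical, and a missing constraint is treated conservatively as "not allowed."

```python
def apply_constraints(items, constraints):
    """Drop explicit items unless the requesting HomePod allows them."""
    if constraints.get("allowExplicitContent", False):
        return list(items)
    return [item for item in items if not item.get("explicit", False)]


candidates = [
    {"identifier": "track-001", "explicit": False},
    {"identifier": "track-002", "explicit": True},
]
print(apply_constraints(candidates, {"allowExplicitContent": False}))
print(len(apply_constraints(candidates, {"allowExplicitContent": True})))
```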

[Transcript missing]