WWDC18 • Session 236

AVSpeechSynthesizer: Making iOS Talk

App Store and Distribution • iOS • 8:48

Speech can enhance the audio experience of your app, whether you are generating spoken feedback for accessibility, or providing critical information beyond simple alerts or notifications. AVSpeechSynthesizer produces synthesized speech from text and allows you to control and monitor the progress of ongoing speech. Learn the ins and outs of AVSpeechSynthesizer and how to add computer-generated speech output to your app.

Speaker: Chris Fleizach

Unlisted on Apple Developer site

Transcript

This transcript has potential transcription errors. We are working on an improved version.

Hello. My name is Chris Fleizach, and let's talk about making iOS talk using AVSpeechSynthesizer. So, our agenda for today: what AVSpeechSynthesizer is and why you'd use it, then some of the basics. We'll talk about choosing the right voice, some properties that are available like rate, pitch, and volume, and finally attributed strings.

AVSpeechSynthesizer is an API for generating computer-synthesized speech on iOS devices. It has many uses. For example, you might want to post announcements within your app, you might be creating an interface that's not meant to be looked at, or you might be creating an education app where synthesized speech reinforces the learning material. As one example, audio updates during a workout can be communicated effectively using synthesized speech.

A note about synthesis and accessibility: speech synthesis is a powerful tool for helping users with many kinds of disabilities. For example, users with cognitive disabilities can get audio reinforcement of the content they are experiencing. Users who have trouble vocalizing can use synthesis to generate speech for them. And non-sighted users use speech synthesis to consume their interfaces.

However, it's important to note that it is not a replacement for VoiceOver or other screen-reader technologies. For example, synthesized speech can overlap with what VoiceOver is speaking at the same time, and most of that speech won't be available on connected devices such as a braille display.

Instead, you should make your app accessible using the UIAccessibility API. Okay, so let's get to the basics. The first step is to create an AVSpeechSynthesizer. One thing to note is that you want to make sure it's retained until speech is done; if the synthesizer goes out of scope, any synthesis in flight will be canceled. Now that we've created a synthesizer, the next job is to create an utterance, which we can then dispatch to the synthesizer. So, in this example, we create an utterance with the string "Hello" and then dispatch it using the speak method.
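A minimal sketch of those two steps in Swift (the wrapper class and the string are just placeholders):

```swift
import AVFoundation

final class Speaker {
    // Keep the synthesizer retained for the duration of speech;
    // if it is deallocated, any in-flight synthesis is canceled.
    let synthesizer = AVSpeechSynthesizer()

    func sayHello() {
        let utterance = AVSpeechUtterance(string: "Hello")
        synthesizer.speak(utterance)
    }
}
```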

A note about audio sessions: when speech is started using speak, the shared audio session is automatically set to active. If you want to mix with other audio, you can call setCategory on the shared AVAudioSession with the mixWithOthers option. If you want to duck other audio instead, so that your speech is primary and other audio becomes lower in volume, you can use the duckOthers option.
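A sketch of that configuration; the playback category and spoken-audio mode here are assumptions, so swap in whatever fits your app:

```swift
import AVFoundation

let session = AVAudioSession.sharedInstance()
do {
    // Lower other audio while speech plays; use .mixWithOthers
    // instead if you don't want the ducking behavior.
    try session.setCategory(.playback, mode: .spokenAudio, options: [.duckOthers])
} catch {
    print("Failed to configure audio session: \(error)")
}
```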

Another note is that the audio session will not be set to inactive after speech is done. That's because the session is shared: if any other audio is playing at the same time, we would not want to stop it. So if you want to set the audio session to inactive, you have to do that yourself.
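A hedged sketch of deactivating it yourself, once you know nothing else needs the session:

```swift
import AVFoundation

do {
    // Release the session and let other apps resume their audio.
    try AVAudioSession.sharedInstance().setActive(false, options: .notifyOthersOnDeactivation)
} catch {
    print("Failed to deactivate audio session: \(error)")
}
```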

There are some callback methods available on AVSpeechSynthesizer that help inform you about the life cycle of an utterance. These are optional methods in the AVSpeechSynthesizerDelegate protocol. For example, you can know when speech starts, when speech finishes, and even the character ranges that are about to be spoken. You can also know when speech is paused or continued.

As an example of this in code snippet form, you set your synthesizer's delegate to an object and then implement these methods: didStart, which hands you back the utterance; didFinish, where you can do the appropriate cleanup; and willSpeakRangeOfSpeechString, which gives you back an NSRange that you can convert into a string range to use with your string. Now, a little demo.
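A minimal sketch of such a delegate; the print statements stand in for whatever your app would actually do:

```swift
import AVFoundation

final class SpeechDelegate: NSObject, AVSpeechSynthesizerDelegate {
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didStart utterance: AVSpeechUtterance) {
        print("Started: \(utterance.speechString)")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        print("Finished speaking")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           willSpeakRangeOfSpeechString characterRange: NSRange,
                           utterance: AVSpeechUtterance) {
        // Convert the NSRange into a Swift string range, e.g. to
        // highlight each word as it is spoken.
        if let range = Range(characterRange, in: utterance.speechString) {
            print("Will speak: \(utterance.speechString[range])")
        }
    }
}

// The synthesizer holds its delegate weakly, so retain it yourself.
let synthesizer = AVSpeechSynthesizer()
let speechDelegate = SpeechDelegate()
synthesizer.delegate = speechDelegate
```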

  • Hello, my friends. I'm inside the iPhone.

Great. Those are the basics of AVSpeechSynthesizer and, as you can see, it's very simple to get off the ground. The next topic is how to choose the right voice. There are many built-in voices, one for each supported language. Something to note is that Siri voices are not available through this API. Users can, however, download higher-quality voices, and when those are downloaded, they will appear in the list of voices.

So you can select a voice using either an identifier or a language. If you select a voice by language, it will select the user's default voice for that language. So how do we do that? First you make your utterance, and then you set the voice property using AVSpeechSynthesisVoice. As an example of using the identifier initializer, you can call speechVoices, which returns an array of voice objects, choose one, say the first, and pass in its identifier property. Here are some of the languages that are supported on iOS by default.
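Both selection paths as a Swift sketch; the language code is just an example:

```swift
import AVFoundation

let utterance = AVSpeechUtterance(string: "Hello")

// By language: picks the user's default voice for that language.
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")

// By identifier: enumerate the installed voices and choose one explicitly.
if let voice = AVSpeechSynthesisVoice.speechVoices().first {
    utterance.voice = AVSpeechSynthesisVoice(identifier: voice.identifier)
}
```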

Now let's talk about some of the properties that you can use on AVSpeechUtterance. Speech rate can be controlled using the rate property, which goes from 0 to 1 and scales speech to go slower or faster. Values from 0 to 0.5 scale speech from very slow up to the normal speaking rate, so if you pass in 0.5, you get the default normal speaking rate.

If you want to go faster, you can pass in values from 0.5 to 1, which scale speech from the normal speaking rate at 1x all the way up to 4x. To do that, you create a speech utterance and set the rate to a float value, or use the constants AVSpeechUtteranceDefaultSpeechRate, AVSpeechUtteranceMinimumSpeechRate, and AVSpeechUtteranceMaximumSpeechRate.
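For example, a short sketch:

```swift
import AVFoundation

let utterance = AVSpeechUtterance(string: "Hello")

// 0.5 is the default 1x speaking rate; values up to 1 scale toward 4x.
utterance.rate = 0.5

// Or use the provided constants instead of raw float values.
utterance.rate = AVSpeechUtteranceDefaultSpeechRate
```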

Let's talk about pitch and volume. These are also properties you set on the speech utterance. Pitch controls how high or low the voice is, and volume controls the volume of the voice itself. So, for example, we can set the pitch multiplier above its default of 1, up to 2, for a very high-pitched voice, and we can set a lower volume using 0.25. Note that lowering the speech volume does not affect the system volume.
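Both as Swift properties on the utterance; the values are illustrative:

```swift
import AVFoundation

let utterance = AVSpeechUtterance(string: "Hello")

// pitchMultiplier ranges from 0.5 to 2.0, with 1.0 as the default.
utterance.pitchMultiplier = 2.0

// volume ranges from 0 to 1, independent of the system volume.
utterance.volume = 0.25
```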

Finally, let's talk about attributed strings. Attributed strings allow us to customize how speech sounds using different attributes. One available attribute that I'd like to discuss is IPA notation. IPA stands for International Phonetic Alphabet, and it allows us to customize how specialized names, nouns, and other things are pronounced. It's available in these languages.

And here is an example of what it might look like. To pronounce "iPhone" the way we are used to, you would use this IPA notation. Now, the obvious question is how do you generate this IPA notation? It is not easy to type on a keyboard.

One way to do that is to use the pronunciation editor in accessibility settings. When you go to the Settings app and into Accessibility and Speech, you will find Pronunciations, with a screen like this. You can enter the phrase that you want the correct pronunciation for, tap the microphone button, and then speak it.

After you're done, you'll be presented with options for possible variations. As you tap them, they are spoken out loud. You can choose the one that sounds correct and then copy the value you get into your code. So let's see how that's done. First, we make an attributed string, then we add the attribute for the speech IPA notation using the value that we got from the settings. Finally, we can create a speech utterance with the attributed string.
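A sketch of those steps; the IPA string here is an approximation of the notation shown in the session:

```swift
import AVFoundation

// Apply the IPA pronunciation attribute to the word that needs it.
let text = NSMutableAttributedString(string: "Hello, iPhone")
let ipaKey = NSAttributedString.Key(rawValue: AVSpeechSynthesisIPANotationAttribute)
text.addAttribute(ipaKey,
                  value: "ˈa͡ɪ.ˈfo͡ʊn",  // approximate IPA for "iPhone"
                  range: NSRange(location: 7, length: 6))

// Create the utterance from the attributed string and speak it.
let utterance = AVSpeechUtterance(attributedString: text)
let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(utterance)
```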

So in summary, AVSpeechSynthesizer is a great way to augment your app experience if you add speech at the right time. Multiple languages and voices are available and can be selected dynamically. Finally, you can customize how words are pronounced using the IPA notation attribute. For more information, visit the session URL, and thanks for watching.