
WWDC09 • Session 129

Text-to-Speech: Adventures with Alex

Mac • 57:30

Speech technologies have matured and are now mainstream, pervading personal computing, cell phones, cars, and telephone networks. Bring your application into the 21st century by adopting these technologies in your application. Speech synthesis, also called text-to-speech (TTS), converts text into human-sounding audible speech, delivering information to users without requiring them to shift attention from their current task. Learn about new speech synthesis features and APIs on Mac OS X Snow Leopard that help you easily incorporate speech synthesis into your application.

Speakers: Matthias Neeracher, Kim Silverman, Kevin Aitken

Unlisted on Apple Developer site

Downloads from Apple

SD Video (122.6 MB)

Transcript

This transcript has potential transcription errors. We are working on an improved version.

Good morning.

Good morning.

So it's 9 AM on Friday morning, last night was the party. Who stayed up too late? Yeah, thank you so much for coming. My name is Kim Silverman. I'm manager of the Spoken Language Technologies and principal research scientist here at Apple. So who are we? We are a team of research scientists and software engineers; organizationally we are part of the Mac OS X division, but we work on technologies that are used across a wide range of Apple products.

Our technologies include Text-to-Speech synthesis, Speech Recognition, Latent Semantic Analysis, which is used among other things for the junk mail filter. And integration of these technologies into quite a number of Apple products. Today, we're going to focus on Text-to-Speech synthesis. This technology has been around for quite some time, but it's become mainstream in the last year or so.

The New York Times, for example, published that Text-to-Speech is the new black, which we thought was kind of cool. So what Text-to-Speech is there in OS X? At this stage, I'd like to point out that 55% of the people attending this conference are at WWDC for the first time. A good number of the people who are here are new to the Mac.

So for that subset of you who've been programming with our APIs for years, some of the things I'm going to say might already be familiar to you. If so, please bear with us. We've been shipping OS X since the Mac first came out, sorry, shipping speech synthesis since the Mac first came out. There is a range of different voices available in OS X which contain different technologies, representing different tradeoffs of quality and footprint. Although people are not aware of this, the quality of all of the voices improves with every release of the operating system. Rather, oh thank you.

Rather than talk you through it, let me just let some of the voices introduce themselves. We'll just do a quick sampling of some of them.

I can read books out loud.

We do not communicate with carbon-based life forms.

Speech synthesis music.

Give me the plan captain.

You have mail from your boss.

Relax, don't do it.

Hi, I'm Alex. Some people recognize me by my voice.

Thank you.

[ Applause ]

So Alex is our highest quality voice. He's got very natural articulation. The intonation, the way he says what he says conveys the topic structure of the text that he's speaking. And as you may have noticed, Alex breathes. Those breath sounds are not just audio candy.

They help you to understand the structure of the text, they make the speech more intelligible and they make listening, the task of listening more pleasant. So what can be done with our speech synthesis? Well, as you probably know, it's used a lot in accessibility. For example, VoiceOver that we ship is a screen reader that uses our speech synthesis to enable people who are blind to use the Mac. A lot of other applications that you folk are writing also help out folk with disabilities. For example, people with dyslexia, people who have trouble communicating.

The point is that technologies that are developed for people with disabilities often become mainstream because people without those disabilities also say that's cool, I want some of that. For example, the smooth ramps that go down from footpaths onto the road were designed originally for people with wheelchairs, but we all like them. So the things that you see being done for people with disabilities using speech, you might want to think about can I do some of that in my application? From accessibility, we then generalized to using speech in general to enhance the user interface.

For example last year we shipped the iPod nano, which added speech so that you could control it without looking at it. If you're in a situation where it's not convenient to see the screen, like driving your car or you're out jogging, you can still know where you're navigating. You hear the song titles, you hear all the menu items spoken by our speech synthesis.

We won awards for that product. And so we then took that to the next level, and rather than just enhancing the user interface, speech became the sole user interface. In the iPod Shuffle, which we shipped earlier this year, there is no screen. Speech synthesis is the only way the device can tell you its status, tell you what you're navigating to, and so on.

Speech is used in other areas as well. It's used, for example, in the creative arts; in the broadcast media there are lots of radio ads, TV ads, and movies that use our speech synthesis. One case that we're particularly proud of is that the award-winning animated feature Wall-E used our speech synthesis for one of the major characters, Auto, the ship's computer. The tool that they used to add emotion to that voice is the very same tool that you all have on your developer CDs.

Or rather, on your developer tools. So you can play with exactly the same thing that was used, and that you'll see in the making-of videos that accompany the Wall-E DVD. It's called Repeat After Me. And it lets you say something the way you want it spoken, with all of the emotion and interest that you're concerned with, and then it will analyze your recording to find the pitch, the phonemes and the durations, and impose them onto the synthesizer, so the synthesizer will say the phrase, the sentence, the way you said it.

We've used speech in a number of ways in the Mac OS X user interface. For example, you go to System Preferences, find the Speech preferences, and from there you can go to the Text-to-Speech panel, where there are some checkboxes that allow you to turn on a couple of simple features. For example, every time your application puts up an alert, if the user doesn't respond to it after a certain amount of time, say 30 seconds, then we will read it out. This is useful when a person is not attending to the screen.

For example, I have received email from a user who said I was crawling around on my hands and knees under my desk and rearranging my cables and suddenly this voice came out from the computer behind me saying, the network has been disconnected. I turned around and sure enough, I had unwittingly kicked the Ethernet cable out of my computer.

It's useful if your application is in the background and the user has something else on the screen. Safari or Xcode, you can still attract their attention. You put up a dialogue and even though they don't see it, they'll still hear it and know whether they want to respond to it or not.

And it's useful in K12 education. This is funny. Educators tell me this. When we read out the contents of your dialogues, we precede them by a short phrase. You can edit that phrase. Kids in schools find out that you can do that and they replace those phrases that we ship with some rather creative alternatives as you can imagine.

The point is now they care about the spelling. Because if they don't spell the words correctly, they aren't spoken correctly. Now they care about the punctuation. The teachers love this because it's what we call guerilla literacy.

[ Laughter ]

And teachers tell me if you have an application for education and they sit a kid down in a class with that application, the kid will stay engaged for about 5 minutes.

But if they turn on these interactive speech features, the kids stay engaged with your application for 20 to 30 minutes. That's 20 minutes of extra time that the teacher has to attend to other kids in the class. It's possible to select any text and have it spoken with a hotkey.

This is useful, for example, when you're viewing a news article on the web. You can select it and have the computer read it out to you while you do some other task. That also saves eyestrain, so you don't have to squint at the screen to try to read some of those tiny fonts. It's useful for proofing messages that you type.

A guy stuck his head into my office soon after we shipped that and said, Kim, your speech synthesis saved my butt. You see, when we type email messages, we can see the typing errors that create non-words. But it's much harder to become aware of our incorrect grammar, our poor word choice, our incorrectly assembled sentences. But when you hear it spoken, it's glaringly obvious. Some parents tell their kids to have the computer read out their homework before they submit it because otherwise as you know, kids just don't care about spelling and sentence length and so on.

But when they hear it, they care. There are a lot of people out there in the world who want to improve their pronunciation of English and of other languages. And so with this feature, if they see a word and want to know how's that pronounced, they can just hit the hotkey and listen to it. Language education is most effective when the students can bring their own material.

[ Period of silence ]

In iChat, it's now possible for you to configure it so that when a person sends you a text message, iChat will tell you with our speech synthesis. In fact, it can be configured even to read out the message. So imagine the scenario: you send a text message to somebody, they don't respond, so you bring up Xcode and you start typing away, and five or ten minutes later, they respond.

iChat's in the background, but you hear the content of what they've typed and you know what they said and you don't have to do a context switch. Some people each morning turn their email into podcasts, download them into iTunes using our speech synthesis, and have their email read out on their iPod on the way to work so that by the time they get to work, they're already ahead.

So what did we do for Snow Leopard? Well, one thing we worked on was the size of Alex. If you look at all the files in Leopard, the biggest single file is the Alex voice file, 670 megabytes. So we asked ourselves, can we get the size down while at the same time minimizing any negative impact on quality? Well, we got the size down to 382 megabytes, a little over half, and not only did we maintain the quality, but actually, on average, the quality of Alex has increased in Snow Leopard relative to Leopard. How did we do that? Well, a number of things.

We made the intonation much more natural. So the way Alex speaks more directly conveys the meaning of the texts. His rhythm is more even as well. We significantly improved the pronunciations of things like names, technical terms, ambiguous words. And we significantly reduced the dropouts, the discontinuities, the missed articulations, the disfluencies, the hesitations, the glugs, the pops, the clicks, the gurgles.

This was a lot of work. It was not trivial. And we added a few good ideas for Snow Leopard that we're talking about today. We've added more to the API so that you can do more with less code. OS X has been shipping for several years now with rich audio processing.

It's now possible to redirect the output of the speech synthesis straight into that audio processing. We'll demonstrate that for you. And because we have a plug-in architecture, other companies who have speech synthesizers have ported their synthesizers to the Speech Synthesis Manager, so you call our API and lots of extra voices in other languages become available. Let me introduce some of them to you.

From Japan, we have Create System.

[ Foreign language ]

Thank you.

[ Foreign language ]

Some of you know of Cepstral who have some mainstream voices and some novelty voices.

[ Period of silence ]

I'm Lawrence.

My name is Millie.

I'm David.

I'm Kelly, the steady choice.

And last, I'm Alison, welcome Mac Users.

[ Foreign language ]

I'm the Kevin voice. Where's the party headed later?

I am Conrad.

Unfortunately, there is no party later. Just work.

I'm Tamika, an urban voice, word.

And I'm Princess. OK, like what's urban actually mean?

I'm a wise guy, I'll be seeing you in Jersey.

And I'm Top Hat, an old school gangster type.

If you live on the wild side, I'm Evil Genius.

I'm French Fry, a voice with an amusing French accent.

I'm Vlad, the voice that talks with a Siberian accent.

[ Applause ]

From Europe, our friends in the Acapela Group have brought over their voices. Here's a subset of them.

I'm Peter and this is an advertisement for my voice.

I'm Heather and this is an advertisement for my voice.

[ Foreign language ]

From Edinburgh, CereProc has brought over a few other varieties of English.

Hello, I'm Heather. I'm a Scottish voice.

My favorite foods are tatties, mince, haggis and neeps. Now to introduce Sarah. She's a bonnie wee lassie.

Thanks Heather. My name's Sarah. I'm a voice from London, England. Sometimes I have tea with the Queen.

My name's Sue. I am a Black Country accent. [Foreign language] stop your [inaudible] now and get on with it.

Hi there, I'm sorry, I didn't mean to stare. My name's Katherine. I'm an American voice. Have a nice day.

Hi there, Apple guys. It's a bit lonely being the ex-president of the United States of America. However, at least I have more time to listen to my iPod.

[ Laughter ]

Thanks.

[ Applause ]

So you can call the speech synthesis APIs confident that your app can ship in lots of countries and that your users can have a rich experience. So how do you get started? To show you more about the new API, let me ask our senior software engineer, Kevin Aitken, to come up on the stage. Kevin.

[ Applause ]

Thanks.

OK. Good morning. So hopefully we're going to get into some code here, so hopefully it'll be exciting. So as Kim said, I'm Kevin Aitken, one of the engineers in the Spoken Language Technology Group. And what I'm going to do is give you a number of examples of the various ways that you can access the text-to-speech system in Mac OS X, whether you're writing a simple script or a full-blown Cocoa application. So before we get into the code, let me give you a quick summary of our text-to-speech system.

So as Kim mentioned and played for you, there are a number of third party synthesizers out there that plug into our system. The great part of it is you really only need to worry about a single API, regardless of which synthesizer you're talking to. So that greatly simplifies that.

And the synthesizer and the language are actually determined by the voice, whether you set the voice specifically using the API or just use the default system voice that the user has selected. Now, our API accepts just a simple string, but you can add a number of embedded commands in there to control exactly how it's spoken.

And we have a plethora of options for setting the speaking rate, the volume, the pitch, creating audio files, and sending the audio to places other than just out the speakers. And our system is naturally asynchronous. That means that your application can be speaking while the user continues to use it, and you could even have several voices speaking at the same time if you choose. So what I'm going to do is take you through examples in five areas, mainly different languages or environments from which you can access the text-to-speech system.

So let's start with AppleScript. So as you might know, AppleScript is a great language for combining the functionality of a number of applications together in a single workflow solution. It's been used for a number of years by the publishing industry and others that work with a plethora of graphics files and others.

The great thing is if an application can run an AppleScript then it can speak using the say command. So for example in Mail you can have an AppleScript run when you receive mail from a particular person. AppleScript say command also integrates well with another command that we provide for listening to spoken commands. So you can create a simple question and answer scenario that the user can use to communicate with the computer.

As I said, there are a number of options that you can specify. You can specify a voice, or you can send the audio to a file. And since the AppleScript say command is part of the Standard Additions OSAX, you can just go into Script Editor and look at its dictionary to see all of the options. So let's see a couple of examples of the AppleScript say command. In the simplest case, you can just give a single string. Here, we might want to provide some progress during a lengthy AppleScript.

Now compressing image files.

And you can also specify a specific voice to use. So maybe you want to be warned when you get email from your boss.

You have mail from your boss.

And you can also specify a file path, so that you can send the audio to a file that you can listen to later on your iPod.

More drivel from yours truly.

So that's a quick sample of the AppleScript say command. We also provide a say command-line tool that you can use in Terminal and in simple scripts. It's just a simple application that starts speaking, waits until it's finished, and then quits. It allows you to specify the voice on the command line, and you can have it read from a file.

It also has a number of flexible arguments for sending the audio to something other than just the speakers. And in terminal for the documentation, you can just easily type man say. So let's look at some examples of this. In a simple case, we'll just give it a string.

It's 12:30, time for lunch.

Or we can use the -v option to specify a particular voice.

Your status report is due.

Or we can read the text from a file using the -f option and then send the audio to an audio file using the -o option.

It was the best of times, it was the worst of times.

Now something we do in the group every once in a while when we start off a lengthy command such as a GREP or a BUILD or whatever, we'll just add the say command after it so that if we're off working in another application, we'll know when it's finished.

And we hear something like this.

BUILD complete.

So, I've given you examples of the AppleScript say command and the say command-line tool. These are great, flexible tools that get you speaking really fast, but they have their limitations. So if you want full access to the synthesizers, then we encourage you to use the Cocoa or the Core Foundation APIs that we provide. They allow you to directly set and get all the synthesizer properties that are provided. You can receive a number of notifications during speaking, such as when speaking is finished or when a word or a phoneme is about to be spoken.

You also have access to a list of voices that are installed in your system along with their properties, such as the name, age, gender of that voice. And we also have some specialty routines for converting text into the phonemes or for adding a custom dictionary. So you can fine-tune how a particular word is spoken, like your company name. So let me talk about the Cocoa class NSSpeechSynthesizer.

So all you need to do is link with AppKit.framework or Cocoa.framework. You should use a run loop so that you'll get messages back. So let's see a quick example of this. We'll start off by creating an instance of the NSSpeechSynthesizer class. We just use the default initializer here, so that means we're going to use the voice the user has selected in System Preferences. We're going to set the delegate object, and that's the object that's going to receive a notification when speaking is finished. And then we're going to begin speaking by calling startSpeakingString.

So it sounds a little bit like this.

Don't object to my objects.
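In code, those steps look roughly like this (a sketch in Snow Leopard-era Objective-C; self stands for whatever controller object acts as the delegate):

```
#import <AppKit/AppKit.h>

// Create a synthesizer with the user's default system voice, hook up a delegate,
// and start speaking. startSpeakingString: returns immediately; synthesis is asynchronous.
NSSpeechSynthesizer *synth = [[NSSpeechSynthesizer alloc] init];   // default system voice
[synth setDelegate:self];              // will receive -speechSynthesizer:didFinishSpeaking:
[synth startSpeakingString:@"Don't object to my objects."];
```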

So, as I said, we want to be notified when the speaking has finished. And so how we do that is we implement the didFinishSpeaking method in our delegate object. So this is handy for cases where you want to update your interface. So for example in system preferences, when you click the play button it turns to stop and then when speaking is finished, I turn that back to play.
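A minimal sketch of that delegate method; playButton is a hypothetical outlet standing in for the Play/Stop button just described:

```
// Called by the synthesizer when speaking finishes (or is stopped).
- (void)speechSynthesizer:(NSSpeechSynthesizer *)sender
        didFinishSpeaking:(BOOL)finishedSpeaking
{
    // Flip the button in the UI back from "Stop" to "Play".
    [playButton setTitle:@"Play"];
}
```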

And so that's where I do that in a routine like this. Now as I mentioned you'll have access to all available voices in the system in the NSSpeechSynthesizer class. And so what's really nice, what you may want to do in your application is provide a popup menu. In the case of Apple's chess program, it enables you to speak the plays as they're performed. And so what it does in this preferences dialogue is it allows you to select a voice for both players.

So let me show you how you populate this. It's really easy. You'll call availableVoices, which will give you a list of the voices installed in the system. And then using fast enumeration, we're just going to ask for the attributes of each one of those voices. We'll get a simple NSDictionary back. And then from that dictionary, we're going to ask for the name value for each one of those. And for that name string, we're just going to add it to the popup menu.
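Roughly like this, where voicePopUp is a hypothetical NSPopUpButton outlet:

```
// Fill a pop-up menu with the display names of every installed voice.
[voicePopUp removeAllItems];
for (NSString *voiceIdentifier in [NSSpeechSynthesizer availableVoices]) {
    NSDictionary *attributes = [NSSpeechSynthesizer attributesForVoice:voiceIdentifier];
    [voicePopUp addItemWithTitle:[attributes objectForKey:NSVoiceName]];
}
```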

So it's really quite simple. Now, if you're writing a C or a C++ application, then you'll probably want to use our Core Foundation-based API. You'll need to link with ApplicationServices.framework, and although a run loop is a good idea, it's not required in every situation. So let's see a simple example of this. We'll start by calling NewSpeechChannel.

That'll give us an instance of a speech channel. That's just kind of like an open connection with the synthesizer that you'll normally dispose of later when you're finished with it. We're passing in NULL as the voice so that we use the voice the user has chosen in System Preferences.

And imagine we're going to write an application for kids in the classroom to learn language better. So we're going to highlight the words as they're spoken on the screen. To do that, we're going to define a word callback routine called HighlightSpokenWord. And before we pass that in, we're just going to take that function reference and wrap it in a CFNumber object.

And then we're going to set our word callback by calling SetSpeechProperty. And then finally we're going to begin speaking by calling SpeakCFString. So it sounds a little bit like this.

See Spot code.
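Put together, that setup looks roughly like this; the property constant for the CF word callback and the CFNumber wrapping follow the pattern in SpeechSynthesis.h, so treat the exact names as something to verify against the header (the callback itself is sketched a little further down):

```
#import <ApplicationServices/ApplicationServices.h>

static void HighlightSpokenWord(SpeechChannel chan, SRefCon refCon,
                                CFStringRef text, CFRange wordRange);  // shown below

static SpeechChannel StartSpeakingWithHighlights(void)
{
    SpeechChannel chan = NULL;
    NewSpeechChannel(NULL, &chan);           // NULL voice = the user's default voice

    // Callbacks are set as properties; the function pointer is wrapped in a CFNumber.
    long callback = (long)HighlightSpokenWord;
    CFNumberRef wrapped = CFNumberCreate(kCFAllocatorDefault, kCFNumberLongType, &callback);
    SetSpeechProperty(chan, kSpeechWordCFCallBack, wrapped);
    CFRelease(wrapped);

    // Speaking is asynchronous; keep a run loop going so the callbacks can fire.
    SpeakCFString(chan, CFSTR("See Spot code."), NULL);
    return chan;                             // DisposeSpeechChannel() when finished with it
}
```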

So let's look at the implementation of the word callback, which is called just before each word is about to be spoken. It passes us four parameters.

The first is the speech channel that we set up earlier. The next is an optional refcon. That's kind of like a context or user data. You can use it for whatever you want. But it's optional. Then the string that we originally started speaking and then finally a CFRange which describes the location and the length of the word within that string that we're about to speak.
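A sketch of that callback, with the four parameters in that order (the exact typedef lives in SpeechSynthesis.h):

```
// Called just before each word is spoken. wordRange locates the word inside `text`,
// so the application can highlight it on screen.
static void HighlightSpokenWord(SpeechChannel chan, SRefCon refCon,
                                CFStringRef text, CFRange wordRange)
{
    CFStringRef word = CFStringCreateWithSubstring(kCFAllocatorDefault, text, wordRange);
    // ...tell the view (perhaps passed in via refCon) to highlight `word` at wordRange...
    CFShow(word);      // placeholder: just print the word
    CFRelease(word);
}
```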

So that's a glimpse of using the word callback. So let's talk about one other example of accessing the text-to-speech system. And that's from languages that use Mac OS X's bridge support, such as Ruby, Python and Lua. And so what they allow you to do is actually access Cocoa's NSSpeechSynthesizer class from a language other than Objective-C.

So let's look at a quick Ruby example. So we initialize or instantiate our NSSpeechSynthesizer object for the default voice. Then we start speaking.

Calling Cocoa from Ruby is easy.

Now instead of using a notification to tell us when speaking is finished, what we're going to do is we're going to call the synthesizer's isSpeaking method in a loop.

And then we're going to call NSRunLoop to wait for a second each time we call that. And so that's one way of determining when speaking is finished. So as Kim mentioned, we made a number of refinements in Snow Leopard. And we did add a few new APIs. So let me talk about one specifically. We're calling it On-the-Fly Voice/Synthesizer Loading.

And so it's a set of APIs that allows you to access voices and synthesizers outside of the standard directories. So what this allows you to do is bundle a third party synthesizer inside your application and then ship that separately. So this greatly simplifies installation so that your user doesn't have to install that third party synthesizer before they can begin using your application. It also really simplifies licensing issues because oftentimes you're going to license that third party synthesizer just for your application.

And before this, it was kind of confusing because the user would see those voices listed in system preferences and be confused why they couldn't use them in other applications. So how do you do that? Well it's quite simple. You'll just start with your application, then you'll go license one of those third party synthesizers and a voice. And then you're going to embed that synthesizer and voice inside your application bundle.

It's easy to do this with Xcode and set it up. And then you'll just need to add a couple of lines of code to register the location of that synthesizer and voice so that they'll show up in the listed voices for your application. So the third party synthesizer developers will have to make some minor changes to support this, so they should be rolling this out in the next year. We've been working with the Acapela Group, which you heard some of those examples. And they've given us a test French synthesizer that we would like to demo today. So let me just give you a quick demo of this.

I'm going to come here to demo 2. OK, great. So before I begin, let me just show that on this system, we have just the stock Apple voices. So there they are, nothing else. So what I've done is I've created an application. Actually, as you'll see in a minute, I used one of our example applications. I've gone through the steps that I described and embedded their synthesizer. And so I stuck it on the flash drive here, so let me stick it in the machine.

[ Period of silence ]

It should show up here any moment, there it is.

[ Period of silence ]

And so here, it is. So let's go choose the voice and so we should have the Bruno voice.

[ Period of silence ]

And we'll play some French text here that we have.

[ Foreign language ]

So that's an example of how we took one of our examples, embedded their synthesizer, added those two lines of code that I showed you, and created something we can easily distribute to users. You also saw an example here where we were highlighting the words using the word callback. And also, the little guy here animates his mouth using the phoneme callback.

So I'll go back to the slides here, and I'll just finish up by saying that we've shown you several examples of the numerous ways you can access the text-to-speech system in Mac OS X. We have a number of examples that we ship with Snow Leopard. You can access those at /developers/examples/speech. And of course, all of our documentation is online at developer.apple.com.

Now, I showed you a couple of examples where I was sending speech out to an audio file, but there are a number of ways that you can take the audio from a synthesizer and massage it and do fun things with it. So with that, I'd like to bring up Matthias Neeracher, senior software engineer in our group. And he's going to show you some really cool things to do with the audio output.

[ Applause ]

Thank you, Kevin. Good morning, everybody. So far, we've mostly seen speech that goes through the default speakers attached to the system. And that of course is what you want to do in many situations. But there are also a number of other output methods.

And I'm going to go through these now. So as we said, by default you just go through the default sound output device. In the Core Foundation-based API, once you have established your channel, you just call SpeakCFString, for instance. On the command line, you would just use say and then give it the text you want to be spoken.

It's going to be sent to your default sound output device. But what we've also seen is sending the speech to an AIFF file instead of speaking it directly. And you will actually find that this is a lot quicker than real time; it doesn't take a minute to send a minute of spoken text to an AIFF file.

To do that, all you need to do in the CF API is add one extra line: before speaking on the channel, you set the speech-output-to-file-URL property on the channel to a URL pointing at the file you want the output sent to. And then you can speak on the channel just like you did before.
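As a sketch, that one extra line is a SetSpeechProperty call (property name as it appears in SpeechSynthesis.h on 10.5 and later):

```
#import <ApplicationServices/ApplicationServices.h>

// Speak a string into an AIFF file instead of through the speakers.
static void SpeakToAIFFFile(CFStringRef text, CFStringRef path)
{
    SpeechChannel chan = NULL;
    NewSpeechChannel(NULL, &chan);                      // NULL voice = user's default voice

    // The one extra line: point the channel's output at a file.
    CFURLRef url = CFURLCreateWithFileSystemPath(kCFAllocatorDefault, path,
                                                 kCFURLPOSIXPathStyle, false);
    SetSpeechProperty(chan, kSpeechOutputToFileURLProperty, url);
    CFRelease(url);

    // Speak exactly as before; the audio lands in the file.
    SpeakCFString(chan, text, NULL);
    // ...wait for the speech-done callback (or poll SpeechBusy()) before disposing of the channel.
}
```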

On the command line, you would give the -o output option and then the file name you want to speak to. So those were the two traditional methods for speech output. And over the years we've added some more. One of the most sophisticated ones is post-processing of the speech output. Let's say you want to write the next great railroad simulator, so you want to have a station announcement that sounds like it's really bouncing off the walls everywhere. You want it to sound something like this.

[ Inaudible ]

You do this by sending your speech output through, in this case, a matrix reverb audio unit and then to the default output unit. Now, to set this up, you generally want more than just the speech audio unit. You also want an audio unit that processes the speech and, generally, some output audio unit to pick up the sound. Here I'm only going to show you how to set up the speech audio unit itself.

You start by defining an audio component description, stating that you want the speech synthesis generator audio unit made by Apple. You then instantiate the component you find with that description, which gives you the speech audio unit. And now the big difference between this output method and any other method that I'm going to show you is that here you don't create the speech channel yourself.

The audio unit has created that speech channel, so instead of calling NewSpeechChannel, you retrieve the speech channel from the audio unit with the AudioUnitGetProperty call. Using the AudioUnitProperty_SpeechChannel property, you will get the speech channel that has already been created. And once you have that channel and once you have hooked up your speech audio unit to other audio units, you use it just like you would use any other speech channel.
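Before moving on to the command line, here is a rough sketch of that setup, assuming the constants named in the talk (kAudioUnitSubType_SpeechSynthesis and kAudioUnitProperty_SpeechChannel); connecting the unit to a reverb and an output unit is left to the caller:

```
#import <AudioUnit/AudioUnit.h>
#import <ApplicationServices/ApplicationServices.h>

// Obtain the speech channel owned by Apple's speech synthesis generator audio unit.
static SpeechChannel SpeechChannelFromAudioUnit(AudioUnit *outSpeechUnit)
{
    AudioComponentDescription desc = { 0 };
    desc.componentType         = kAudioUnitType_Generator;
    desc.componentSubType      = kAudioUnitSubType_SpeechSynthesis;
    desc.componentManufacturer = kAudioUnitManufacturer_Apple;

    AudioComponent component = AudioComponentFindNext(NULL, &desc);
    AudioUnit speechUnit = NULL;
    AudioComponentInstanceNew(component, &speechUnit);
    AudioUnitInitialize(speechUnit);

    // The unit already owns a speech channel; fetch it instead of calling NewSpeechChannel.
    SpeechChannel chan = NULL;
    UInt32 size = sizeof(chan);
    AudioUnitGetProperty(speechUnit, kAudioUnitProperty_SpeechChannel,
                         kAudioUnitScope_Global, 0, &chan, &size);

    *outSpeechUnit = speechUnit;   // caller wires this into a reverb/output unit, e.g. with an AUGraph
    return chan;                   // then speaks on it with SpeakCFString as usual
}
```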

Now, obviously for prototyping we would also like to make this available from the command line. But how do you specify arbitrarily complex graphs of audio units on the command line? What we do is something simple yet suitable for prototyping: we allow you to set the speech output to an AUNetSend audio unit, which feeds your audio into a pipeline. And then you can hook up another application on the receiving end of that pipeline and define a more complex audio graph there. Now, this doesn't have a way of synchronizing the audio, so you want to test this with a lengthy file. And I'm going to demo a little bit of what you can do with these.

[ Period of silence ]

So here we already have a command line available for us: say -n, for send to network. We can give it any name; a colon is kind of the simplest name for a locally named pipeline. And then we say we want an input file, and I have something lengthy enough available right now. So we've hooked up the sending end. We don't hear anything because right now that pipeline is just going into the big void. We want to hook up the receiving end.

And Apple has a convenient application available for this, AU Lab, which you will find in Developer/Applications/Audio. It's an incredibly versatile application for testing audio units. Let's first create a document by hand. The output setup like this is good. We don't really need an input setup. For the audio device, you generally want Built-in Output.

In this room here, we need Line Out. And now you have your output set up, and you want to hook up the speech that's coming in by adding a generator, which in this case is AUNetReceive, the receiving end of the pipeline where the audio has been going in and luckily not really piling up over time. And now guess what happens if I hit Connect.

Man in their degree.

Sometime or other, cherish very nearly the same feelings towards the ocean with me.

Now, you're hearing unadulterated audio.

Of the Manhattoes, belted round by wharves.

[ Inaudible ]

Now let's process this by adding an audio effect. Let's start with a matrix reverb. You can try all these sliders yourself, or you can try some of the presets. Here we are in a large hall, so let's add some extra reverb.

[ Background noise ]

Then northward.

What do you see?

[ Inaudible ]

So you can see that with a reverb, you already have quite a few options, and if you go for the big reverb, obviously you're going to get rather bombastic effects. But sometimes it's sufficient to add just a tiny bit of reverb to achieve a somewhat different effect. Or let's say you want to alter your voice, make it somewhat deeper.

[ Period of silence ]

Let's first compare again how Alex sounds by default.

[ Period of silence ]

East, south and west. Yet here they all.

So let's make Alex a little bit deeper.

The compasses of all those ships attract them thither? Once more say you are in the country in some high land of lakes.

Take almost any path you please and.

So here, we added a pitch change audio unit in the post-processing. And finally, let's say you want to write the next big violent computer game and you want to have some kind of combat-oriented radio transmissions with all sorts of distortion going on.

You might want to go for something like this.

American desert, try this experiment if your caravan happens to be supplied with a metaphysical professor. [beep] Yes, as everyone knows, [beep] meditation and water are wedded forever. [beep] But--

So here we are, [beep] so here we add a beep in the pauses of speech.

We add a band-pass filter to limit the range of audio that's transmitted, and we throw in some extra distortion as a bonus. So you can see that you can play with AU Lab and get a rather wide range of audio effects, all applied in real time to your speech output.

[ Applause ]

Thank you.

Now, in Snow Leopard we've added two more output methods. For one of them, we noticed that of those people who use the speak-to-AIFF-file output path, many of them then immediately turned around and put that AIFF file through another encoder, because if you download to your iPod you don't really want AIFF.

You would have your audiobooks taking up more space than the rest of your music library. You want to encode it somehow. So in Snow Leopard we are offering you a way of directly encoding your speech output without going through an intermediate step. And we do that by allowing you to pass in a reference to an open Extended Audio File. Extended Audio File is an API in the Audio Toolbox framework. So what you do here is use the Audio Toolbox framework to open an extended audio file with a URL you pass in and a file format, which is the container format.

Let's say you want a Core Audio file, or you want an MP4 container. You pass in a data format: let's say you want Apple Lossless, or you want AAC. You can specify any codec that is installed in the system. And that gives you an audio file. You pass that audio file to the speech-output-to-extended-audio-file property, and then when you use the channel, the output gets encoded to that audio file.
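A hedged sketch of that path; the 10.6 property name and, in particular, the way the open ExtAudioFileRef is handed to SetSpeechProperty are assumptions to check against SpeechSynthesis.h:

```
#import <AudioToolbox/AudioToolbox.h>
#import <ApplicationServices/ApplicationServices.h>

// Speak a string directly into a Core Audio (.caf) file holding Apple Lossless.
static void SpeakToCompressedFile(CFStringRef text, CFStringRef path)
{
    // Container format: CAF. Data format: Apple Lossless (could be kAudioFormatMPEG4AAC, etc.).
    AudioStreamBasicDescription format = { 0 };
    format.mFormatID         = kAudioFormatAppleLossless;
    format.mSampleRate       = 22050.0;       // choose a rate/channel count suited to the voice
    format.mChannelsPerFrame = 1;

    CFURLRef url = CFURLCreateWithFileSystemPath(kCFAllocatorDefault, path,
                                                 kCFURLPOSIXPathStyle, false);
    ExtAudioFileRef audioFile = NULL;
    ExtAudioFileCreateWithURL(url, kAudioFileCAFType, &format, NULL,
                              kAudioFileFlags_EraseFile, &audioFile);
    CFRelease(url);

    SpeechChannel chan = NULL;
    NewSpeechChannel(NULL, &chan);

    // ASSUMPTION: the open ExtAudioFileRef is handed over wrapped in a CFNumber, the same
    // way callback pointers are; check SpeechSynthesis.h for the exact value type expected.
    long fileRef = (long)audioFile;
    CFNumberRef wrapped = CFNumberCreate(kCFAllocatorDefault, kCFNumberLongType, &fileRef);
    SetSpeechProperty(chan, kSpeechOutputToExtAudioFileProperty, wrapped);
    CFRelease(wrapped);

    SpeakCFString(chan, text, NULL);
    // ...wait for completion, clear the property, then ExtAudioFileDispose(audioFile).
}
```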

We also allow this on the command line. We allow you to access pretty much any codec installed on the system, again with a number of options; if you want to get into very specific PCM encodings, like 16-bit little-endian floating point or whatever, then you might have to play with the options a bit and look at the man page. Here we want to store Apple Lossless into a Core Audio file, which we achieve by giving just the right extension to the output option and then specifying the data format stored in it, because there are many encodings you can store in a Core Audio file.

And one final point here: it doesn't need to be a file. If you look at the audio file API in Audio Toolbox, you can see that you can also open an audio file simply by passing in a whole bunch of callbacks for what is supposed to happen when you write to the file and such.

You can then wrap that audio file in an extended audio file and pass that to us, which would allow you, for instance, to render speech to a memory buffer and then process it in memory if that's what you desire. And the final output method: we've generalized the default output method a little bit so that you can now also speak to audio devices other than the default.

If you, for instance, have a mixing desk plugged into your computer where you would like to send your audio for post-processing, or if you want to send some speech to headphones and some speech to a speaker, you simply find out the audio device ID of the output you want to go to and then pass that audio device ID to the output-to-audio-device property.
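Roughly like this, assuming you have already looked up an AudioDeviceID with Core Audio and that the 10.6 property constant is named as described:

```
#import <CoreAudio/CoreAudio.h>
#import <ApplicationServices/ApplicationServices.h>

// Speak a string on a specific output device (headphones, a mixing desk, ...).
static void SpeakToDevice(CFStringRef text, AudioDeviceID device)
{
    SpeechChannel chan = NULL;
    NewSpeechChannel(NULL, &chan);

    CFNumberRef deviceNumber = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &device);
    SetSpeechProperty(chan, kSpeechOutputToAudioDeviceProperty, deviceNumber);
    CFRelease(deviceNumber);

    SpeakCFString(chan, text, NULL);   // dispose of the channel once speaking is done
}
```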

And again, you then speak. So now that I've shown you all these methods, you might want to ask yourself when you can use them. We here at Apple usually have the luxury of just writing our code to run on Snow Leopard and later; you generally want to be somewhat backward compatible. So it's important for you to know when these methods became available. Obviously, speaking to the default audio device has been available all the way back to 10.0.

Is anybody here still supporting 10.0 in their apps? Raise your hands. Speaking to an AIFF file has been available all the way back to 10.2. Speaking to audio units is available in Tiger or later, 10.4. Speaking to an extended audio file and speaking to an audio device, those two are available in 10.6. When it comes to third-party synthesizers, the picture is a little bit more complex.

Obviously they all support speaking to the default audio device. And I'm pretty sure they all support speaking to an AIFF file. However, the other methods shown here might not be available yet with third party synthesizers. And I hope they will be more widely adopted in the future. With that, I would like to bring back Kim to wrap up this presentation.

[ Applause ]

So, I don't know about you, but I think that stuff is pretty cool. What do you think?

[ Applause ]

Yeah.

[ Applause ]

So you've got three steps. Step one, there is no step one. Your application already speaks. I showed you all those items that are in the user interface.

That means all of your user interface controls, all of your alerts, all of your dialogues will already be spoken. It's up to you to make sure that your app doesn't sound stupid. Make sure they're spoken well. We haven't talked about speech recognition today, but speakable items allows users to control your application by speaking the names of your controls.

Again, make sure that that works well. So since your application already speaks, make sure it sounds good. Step two, differentiate your application. By that, I mean use speech in ways that are specifically appropriate to how you know your users interact with your application. The things that we do in the user interface are very general. You can do better. Let me give you a few guidelines about where you might want to do that. One is asynchronous notifications.

When something happens outside of the user's immediate control, tell them. For example if they're doing some action that's timed and they run out of time, let them know.

Time's up.

If you've got some task that's been happening that will take a while, when it completes, let them know.

Data backup is complete.

The point is, up until now, we have tended to encode messages like this to our user by playing arbitrary sounds. The burden is on the user to learn what message is associated with what sound. If your application knows why it wants to get a person's attention, tell them. If something happens that you need their attention really quickly, let them know.

Urgent email.

You can even do this in games.

Enemy planes at 12 o'clock.

So, after asynchronous notifications, think about other ways of giving feedback to the user when the user does something with your application. For example, when they're playing a game and it finishes, let them know.

Game over, game over, game over.

That by the way was one of Matthias's audio effects.

If you have a user who's doing some interaction with an application at the point that they make some choice, you need to feed that back to them. For example, I've noticed in first person shooting games, often you'll need to change weapons as you enter a new phase in the game where a different weapon would be appropriate.

So you can have the application speak them. Here's an example.

[ Background noise ]

M16, machine gun. Magnum, semi automatic, plasma rifle, boomerang, water pistol.

[ Laughter ]

The water. Thank you.

[ Applause ]

The water pistol was for those who didn't get the joke about the boomerang.

[ Laughter ]

Yeah, I'm an Aussie.

That also was using Matthias's audio effects. If a user interacts with your application in a way that takes a number of steps, you might want to summarize all of those into a single sentence. Suppose, for example, I'm working in a finance application and I'm organizing a payment.

So I have to specify the from account, the to account, who the payee is, and the amount of money to transfer. So just before it finishes, you can actually say back to the user, "Sending a check to AT&T for $60." Just summarize it. Say that again? Oh, all right. If a user's entering information, you can use this to help them.

For example, it's tax time. I have a pile of receipts from charitable donations. I laboriously read each receipt and type it into my application. It would be great if I could just press a button and have it read back the column of the information that I've typed in without me having to look at the screen, so I could just look at my receipts as it reads it back and make sure I've got the amounts right.

[ Period of silence ]

And there are other ways that you can add speech that's specific to how you know your users interact. For example, suppose you have an application that allows users to write a script for a play. At some stage, they might want to hear it read back. Well using our hotkey, they could select all the text and have it all read back. But you can do better.

For example, you could read each character with a different voice. You would know that you don't want to read back the acting directions, the blocking cues, the lighting instructions. And here's an idea for those of you who have such an application that you might want to go and implement, and this idea came from Kevin Aitken: allow users to select just one character not to be read out.

So then, you could use that application for rehearsal. So the third step is customize. Sometimes speech could say things better than it does by default. So what kinds of things would you want to customize? Well, what words are spoken. If you just take the strings in your application and pass them to our speech synthesizer, it will speak them. That may not be what you really want. Here, for example, is the name of a weapon in a common first-person shooter game. If you pass that to the speech synthesizer, it will dutifully read it out.

USC.44, Magnum mega class A1.

But wouldn't it be better if you could just say.

Magnum.

After you've figured out what words you really want to speak, you might want to check how they're pronounced. Sometimes there might be words that are specific to your application, or company names, that are not pronounced the way you want them to be pronounced. For example, Raiden, which is actually, I believe, a character in a game, is not pronounced by our synthesizer the way we'd like.

So you can specify the phonemes to indicate precisely how you would like it pronounced.

Don't say Raydon, instead pronounce it Raiden.
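One hedged way to do that from Cocoa is to let the synthesizer turn a spelling that sounds right into phonemes with phonemesFromText:, and then embed those phonemes with the [[inpt PHON]] command; the respelling "Ryden" below is just an illustration, not an actual game string:

```
NSSpeechSynthesizer *synth = [[NSSpeechSynthesizer alloc] init];

// Get the phoneme string for a spelling that *sounds* the way we want.
NSString *phonemes = [synth phonemesFromText:@"Ryden"];

// Switch to phoneme input just for the name, then back to regular text.
NSString *line = [NSString stringWithFormat:
    @"You must defeat [[inpt PHON]] %@ [[inpt TEXT]] to finish the level.", phonemes];
[synth startSpeakingString:line];
```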

If the synthesizer does not say things the way you want, you'll be tempted to use funny spelling, to spell things differently. Well, we advise against that, because we cannot guarantee that something that's spelt with a nonstandard spelling will be pronounced the same way with the next release of our synthesizer.

However, if you use phonemes like this, then you can be absolutely sure that all future versions of our synthesizer will maintain the correct pronunciation. After you've specified how the words are pronounced, then you want to customize the speaking rate.

Sometimes, the default speaking rate might make the speech sound somewhat tedious and pedantic.

But you can control the rate, as I have there, with a rate command embedded into the text-to-speech.

The general principle is, the more predictable something is, the faster you want to speak it. And then, customize the intonation. Now, the intonation is the way we say what we say. You know the expression, "It's not what you said, it's the way that she said it." By intonation we mean the pauses; the pitch, where it goes up and where it goes down; and which words are accented and which words are not.

One piece of low-hanging fruit that is very effective and easy is to liberally sprinkle punctuation through your strings. Things like commas, periods, question marks, and double quotes are all paid attention to by our synthesizer and will change the way it reads things out. There are other commands that you can embed into the text that you send to the synthesizer to further control the intonation.

One of those is the emph - command, which takes the stress off the next word. This is particularly useful for noun compounds. Rather than me giving you a linguistics lesson, let me just illustrate it.

[ Period of silence ]

Flame thrower, flamethrower, plasma grenade, plasma grenade, project list, project list, street view, street view.

Do you hear a difference? Yeah, emph - is your friend.
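For reference, a small sketch of how those embedded commands ride along in the string you pass to the synthesizer, assuming the default [[ ]] command delimiters:

```
NSSpeechSynthesizer *synth = [[NSSpeechSynthesizer alloc] init];

// [[rate n]] sets the speaking rate (roughly words per minute) from that point on;
// [[emph -]] takes the stress off the next word, turning "flame thrower"
// into the compound "flamethrower".
[synth startSpeakingString:
    @"[[rate 220]] Reloading. [[rate 170]] You picked up the flame [[emph -]] thrower."];
```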

So let's do a Venn diagram. This circle represents the set of all the things that could be done with speech in the user interface of your application. This smaller circle represents the tiny subset of those things that we have done by the general speech user interface features in OS X.

The rest is what you can do. At least test your application with speech turned on so that you don't embarrass yourself. Then add extra speech in ways that are specific to how you creatively think about the way your users interact with your application. And finally, customize. Do all of those things and you'll be building speech into your application in a way that gives your users a more engaging, engulfing, multimedia experience, and differentiates your application from some of your competitors. So that's what we have to say today about text-to-speech synthesis. At this stage, I'm going to invite the rest of the team back onto the stage and we'll take any questions you have. Thanks for your time.