WWDC03 • Session 402

Mac OS X Speech Technologies

Application Frameworks • 1:08:39

This session provides an overview of speech technologies on Mac OS X and introduces significant enhancements to the quality of Text to Speech. Learn about Cocoa classes that enable quick and easy integration of speech within your application, find out about a new tool enabling the speech synthesizer to copy the emotion and personality of a recorded human voice, and understand what speech functionality you get for free in Mac OS X and what you need to do to deliver the best speech experience for users--a key component in fulfilling accessibility requirements.

Speakers: Kim Silverman, Jack Minsky, Kevin Aitken

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.

So we're going to talk about the OS X Speech Technologies today. To introduce myself, as you just heard, I am Principal Research Scientist and Manager of the Spoken Language Technologies. Spoken Language Technologies means the speech technologies that you've heard of, the speech recognition and the speech synthesis. And also we have language technologies because we believe you can't deal with speech without dealing with language.

So for example, the junk mail filter is one of our spoken language technologies. And if you heard the State of the Union address about OS X, the Japanese input method is now using our speech technologies. And that's why they're doing so much better than Windows at the moment.

So before we start, I think I'll go straight into a demo. I'm going to show you some of the ways speech is usable in Panther, and you guys being developers, I'm allowed to give you the caveat--this is beta software, right? So I've been through this demo lots of times, and it always works. But because speech sits on top of every other component of the operating system, it means that if anything goes wrong, something could break here. So I have to ask you to bear with me.

The first thing you should do when turning on speech recognition is let the machine adjust to the acoustic environment in which it's being used. We have a speaker-independent recognizer. I'll talk about what that means a bit more later. But even though it's independent of who you are, it does need to sample and adapt to the acoustic environment. We have made it track the acoustic characteristics of most places where people would use it.

But this is outside of the parameters for which we've developed it. The distance between these walls combined with the positions of the PA system speakers mean the durations of the echoes are a little bit outside of the spectral range that we're looking at. So I'm going to need to adapt it. So what you do, any time you use speech recognition in a new place, is go to the speech preferences, to the speech recognition tab, listening, and click on. And then you can go to the volume.

This ostensibly lets you set the volume on the microphone and also gives you a chance to make sure that you've got the right microphone connected. Our speech recognition works with a far-field microphone. That means a desktop microphone that's about this far from you. In this particular situation, because of these echoes, I'm going to use the alternative, which is a headset microphone, which I'll put on now. You can purchase these pretty cheaply from the Apple stores; there are a few brands. This particular one is by VXI.

Your task is just to read down this list of commands. As each command is recognized, it will flash. If it doesn't flash, you just repeat it until it is. If it doesn't flash after one or two repetitions, go on. So I'll do that now. It's actually sampling my voice in this environment while I'm talking right now. What time is it? Quit this application. Quit this--there we go. Quit this application. Open a document. Show me what to say. Make this page speakable. Move page down. Move page down.

Hide this application. Switch to Finder. I'll go through that again. What time is it? Quit this application. Open a document. Show me what to say. Make this page speakable. Move page down. Hide this application. Switch to Finder. Good. So let's try it. What time is it? The audio is not plugged in. Just a moment while I plug the audio in. Guys, this might go bang. I'm still here even though you can't see my face.

Okay, audio is plugged in. Let's try that again. What time is it? What day is it? Quit this application. Quit this application. Open my browser. Close this window. Close this window. Go to Google News. Go to Google News. Go to Google News. You can set up any web page to be speakable. Quit this application. Quit this application. Get my mail.

So suppose I've got a message here and I see some text that I'd like to send to somebody else. I can do that by speech and we do that by integrating with the address book to find out the person's email address. The whole address book is speakable. So I can say, for example, "Send this to speech dude." Send. Send. And this should complain. Ooh, I love that sound. OK. Hide this application. Hide this application.

Switch to chess. Open chess. He's the man down there, Matthias. Pawn D2 to D4. Pawn D2 to D4. Knight B1 to C3. Knight B1 to C3. Knight G8 to R6. You see these ghost pieces wandering around, it's great. Quit this application. Phone for Tom Bonura. Phone for Tom Bonura. Thanks.

OK, so for those of you who know about our speech, there's some stuff that you've seen before and there's some stuff that's new there. Can I ask how many people have not sat down and used Speakable Items on OS X before? Okay, so for you guys, I'll walk through quickly what we've done here. What we've seen is a bunch of different things.

We have a speech recognition engine, which has a robust API. You can call that API from your applications. Many of you do already, and more are all the time. In addition, we ship a couple of applications that also use that API. One of those is called Speakable Items, and that's largely what I've been showing you now.

You turn that on from the system preferences, from the speech preferences, speech recognition in the on/off window pane. Speakable Items is a very simple idea that's been developing over the years, and it's turning out to be quite powerful. When you turn it on-- We create a folder in your home directory called the Speakable Items folder. Open the Speakable Items folder.

Anything that's in that folder can be launched by speaking it. It's just the same as double-clicking on it. Applications, documents, templates, aliases, stationery, URLs--anything that you can launch in the Finder by double-clicking, you can launch by speech just by putting it into that folder. Most of the commands that I use now are part of a kind of a starter kit that we ship.

We pre-populate that folder with a few items that do generally useful things, and that's what I've been using. So for example, there's "What day is it?" The real power of this is that you can add your own items to that folder and make your own things speakable, so you can customize the speech recognition according to the kinds of things that you do and how you work. Within that folder, the Speakable Items folder, there is itself another folder called Application Speakable Items. That folder contains folders named by applications. The items in those folders are only speakable when that application is in the foreground.

That provides a framework where you developers can ship commands that are specific to your applications, and you don't have to worry about accidentally using the same wording as somebody else with a different application, because if you put them into that framework, that folder, then they'll only be speakable when your application is in the foreground. And in fact, you don't need to install them into this folder. You can put them into your own application bundle, and speakable items will find them. We have documentation on how to do that.

So the kinds of things you can put in here are scripts that send AppleScript commands or Apple events to your application, or keyboard shortcuts. What do I mean by that? What I mean is that anything that is speakable as--sorry, anything that's a menu item with a keyboard shortcut can have a spoken command associated with it. So let me give you an example of that. First of all, we'll go back to the web. Open my browser.

What have we got here? I'm looking for a picture. Of course, the web is slow. One of the accessibility features is that you can zoom in on the screen, and that's a keyboard shortcut. So I've attached a spoken command to that, so now I can say, "Zoom in." Zoom in. Zoom out a bit.

So anything that has a keyboard shortcut you can attach speech to, and that's one easy way that you can have speech control of your application without doing a lot of extra work. Speech is really important for disability solutions. Let me just give you a story. I just heard this just as I was sitting up here--I kid you not, this is absolutely true--just as I was sitting up here about half an hour ago. I think his name's Bill. Bill, was it you? A guy from the projection booth behind came out and said, look, I've just got to tell you--excuse me--he interrupted me while I was setting up.

He said, I saw you give a demo of this stuff a couple of years ago at Macworld, and so I showed it to a blind friend. I turned on his machine, and I said, open my browser, and the browser opened, and we got his mail, and we got it to read his mail out to him. He said he was just blown away. Well, it turns out that this guy teaches an exercise class full of blind people.

And so they were doing their exercise class, and they had an iMac over on the side, and then they would go over to the iMac after their exercise class and surf the web, a bunch of blind people. And they do this every Monday, have their exercise, and then go and surf the web by voice using just the things that you're seeing here.

Yeah, I was touched, really. I wanted to share that with you because that means that you can make your applications available to folk with disabilities through technologies like this. The Section 508 rule, as I understand it, of the Americans with Disabilities Act, is that everything that can be done with your application must be possible to do without requiring the keyboard or the mouse or that you see the screen. And we have an accessibility API that lets you get at screen controls and provide alternative ways of controlling those things. We ship one method already built in, and that's speech. So with very little effort, you can have stories like that circulating about your applications, too.

Within the speakable items framework, you can choose what kinds of commands you can give. This is controlled by the Commands tab here. And the reason I want to tell you about this is because of this particular guy, who's off by default, which is front window commands. This lets you speak any of the controls in the front-most window of whatever application's in the foreground. So with this on, I can navigate the preferences here. For example, speech recognition. Speech recognition. Well, we're already there, so that doesn't show much. Default voice. Default voice. Spoken user interface.

Speak the phrase. Speak the phrase. See, I'm going down the check boxes here. Speak the alert text. Speech Recognition. We did not build this specially into the speech preferences. This is just the general accessibility features that use speech. So as long as you use standard Apple controls, you get that for free.

In addition, you can have the computer read out text that appears on the screen, and there are a few different ways of doing this. One that I just turned on is talking alerts. Here's the mentality we have here. The philosophy is when you're interacting with a computer, sometimes the computer needs to tell you things, and the standard way that we'll do that is by putting up a sheet or an alert dialogue in front of you. You should be able to read that, think about it, and respond to it.

But sometimes your attention is elsewhere. I'm typing on my machine, I type quit, I turn around and have a conversation with somebody, and I don't realise that there's an alert saying that I need to save my changes before quitting. Perhaps I go away to lunch and I come back and I find that my work wasn't saved and somebody couldn't get at it. So what happens is if the alert has been up for a certain amount of time and you haven't responded to it, then we read it out to you in order to get your attention back. So let me demonstrate that. Switch to TextEdit.

So I'll type something here and then try to quit from it, close it. Oh, before I do, we ship it with a delay of about 20 seconds by default between the alert appearing and it being spoken. I'll put that back to zero for now so that you guys don't have to wait around so long. So now we'll quit or close this document.

So that's Talking Alerts. People tell us that they love it. The kinds of scenarios I hear feedback from users are, um... Some guy said to me he was crawling around on his hands and knees under his computer--under his desk and accidentally kicked out the ethernet cable and he didn't know that he had done that and then he heard this voice come from his computer saying, "The network has been disconnected." And he turned around and thought, "Oh, yes, indeed it has." I was giving a keynote at an international conference in Berlin.

I arrived the night before. Actually, I wrote my talk in the flight lounge in Heathrow on the way over there. And when I arrived, I thought I'd better get my presentation printed out on transparencies just in case my computer won't plug into their projection system. So I went to a print shop that had maybe two dozen different kinds of printers.

They had big printers and little slide printers, high-quality stuff, kind of like Kinko's on steroids. And we started printing these things off. And while we were doing that, I demoed this to them, and they were delighted. So they had two dozen Power Macs. They put talking alerts onto each of their Power Macs because their typical business model is somebody arrives at about 9 o'clock in the morning with a CD containing a large image that needs to be printed out on a large poster, camera-ready, and it will take about two hours.

So they put it onto an appropriate machine and start printing. And then at about 11 o'clock, they go to make sure it has finished okay. And there's a message on the screen that came up five minutes into the printing saying that there was some problem with the cyan ink.

And so they lose the business because they don't have it done in time. And they found that this was great. With 24 machines going all the time, there were voices coming up from all over the place saying the printer's out of paper, the network is down. And this saved them a lot of time. But there was a problem.

They all had the same voice. So a voice would come out of this massive machine saying, the printer is out of paper. And who said that? Who said that? You can set the default voice. So they put a different voice onto each machine. And you can set what phrase is spoken before the alert is read out. By default, we choose stepping through a small list that we have here. And so they made each machine announce itself by name. So we think talking alerts is useful.

I'll put this delay back so it doesn't keep talking all the time here. Another disability feature that we have is the ability to speak any text that's under the mouse. This again uses the accessibility API. So if I turn this on and slide the mouse around-- - Slide the mouse. - Yep. So if I--as I slide the mouse around, you'll hear text being read out. - Announcement of an application requires your attention. - I'm not clicking here.

Okay, so if you use standard controls, you'll get that for free as well. Okay, so let's go back and talk a bit more about what we've actually been seeing here. We'll go back to the main machine now. So I want to talk a little bit about why you should adopt speech, things to think about about the reasons for putting it in there rather than just being a novelty.

First of all, speech gives you a way to get beyond the limits of a graphical user interface. Graphical user interfaces are mature, and they present well the information that they can present, but there are limits to what they can do. Screen real estate is at a premium, and so we have all sorts of technologies to try to squeeze a little bit more power out of our screen real estate, but no matter how big my screen, no matter how many monitors I have in front of me, there are always things that I need to get to or see that are behind other things, and I can't see them. And, of course, there's always the issue of what happens when the user's not attending. No matter how cool the graphics are, the user just might be staring out the window. Speech gives you an extra modality.

It gives your user more choices about ways of interacting with a computer. It's more natural. That is, we've all been speaking and listening since we were about two years old, and it's something that comes to us without too much thinking, whereas working through a typical computer user interface requires some training. So if you put speech into your applications as an alternative control modality, you'll find that some users who are new to computing will be more likely to try out your application.

It's particularly appropriate in an eyes busy, hands busy scenario. So think about your application. Think if there's any time where the user is looking at something on the screen and their hands are busy. For example, they're drawing something, and they need to control the computer. For example, I'm drawing a line, and I want to change the brush size or increase the amount of blur.

Normally, with our graphical user interfaces, I'd have to stop drawing, go up to a menu, pull up a dialog, set some settings, and then click out of that, then return to drawing. So speech is good when the hands and the eyes need to keep busy with what they're doing. Think about that in your applications. And finally, speech gives us a way to move out of the 1980s.

Back in the 1980s, computers had a little weak speaker soldered onto the motherboard, and all it could do was go beep. And so we got into the habit of writing our programs with a beep written into them. Whenever we needed to get the user's attention, we'd put up an alert and go beep.

Well, I like to think that life has moved forward somewhat since then. Of course, now, instead of just going beep, we play lots of different sounds, but the burden is still on the user to understand what all those sounds mean. So, for example, if we want to let the user know that mail has been sent, we play one sound.

If we want to let the user know that somebody has logged on to iChat, we play a different sound. It seems to me we should be able to do better than that. The application developer who is playing a sound knows the meaning, knows what information he's trying to convey to the user, so why not just say it? And think about the mouse.

The mouse is essentially the equivalent of doing this. It's such a narrow, I guess, 1.2-bit interface. We should be able to do better than that. If I want to do something with a computer, rather than just poking and grunting, I should be able to say what I want to do.

So we have a couple of engines that I've mentioned and shown you already. The speech recognition is speaker independent. That means you don't have to train it to your voice. There are speaker dependent speech recognizers around that have got different characteristics. And one of those characteristics is that you need to spend at least four hours of speaking to train it to your voice, which takes more than four hours, and you get tired at the end of it. Then at the end of that four hours, you've still got to use them for a month or two before they can finally adapt to your particular voice. We think that the kinds of users that buy Macintoshes expect to just walk up to it and have it work.

So we make it speaker-independent. It works with a far-field microphone. We have layers of software that are tracking, adapting to, and compensating for the background acoustics and the microphone characteristics. You can also use it with a head-mounted microphone, as you just saw me do over there. It's robust against background noise.

I use it now at Apple at the cafeteria at lunchtime, and to my delight, it works. The secret there is that the kind of noise that's easiest to compensate for is noise that's steady-state. So in the cafeteria, when there are hundreds of people talking, the overall spectrum tends to be fairly constant.

One situation that we have not solved is if I'm in front of the computer trying to talk to it, and right next to me there's somebody else talking, because then there are two voices at once, and the spectrum of the distracting voice is changing all the time. So we don't claim to have solved that one yet.

It's a large vocabulary speech recognizer. We have over 121,000 words in the dictionary, and we have layers of software to figure out how to pronounce words that aren't in the dictionary. And it's a continuous speech recognizer. You don't have to pause between words, which is a great relief. It's driven by a finite state grammar. That's how your application tells the recognizer what to listen for.

Why should you use speech recognition? Well, as I've mentioned, speech is a very natural way of controlling a computer. It gets you beyond the limits of point and click, because you can't click on what you can't see to point to. And conversation is a particularly appropriate modality for delegating goals to a computer. You can tell a computer what you want to do.

If you haven't specified enough, it can then come back and ask you questions to refine the nature of the goal and can then do what it's good at, which is figuring out the steps necessary along the way to get there. And of course, speech recognition is great for accessibility, the latest story being that one that's only 30 minutes old.

Okay, we have speech synthesis in there. It takes any text and converts it into American English speech. I have to say that because I'm getting requests all the time for other varieties of English and other languages. There's a range of different voices. You can control the speaking rate, and that is important because there is no correct answer.

There is no single answer to the question, "What's the appropriate speaking rate for a speech synthesizer?" The rate at which the speech synthesizer speaks should depend on why it is speaking. We'll talk more about that later. I do want to let you know that we are working steadily all the time on improving the quality and the naturalness of the speech synthesis. We did a lot going from Puma to Jaguar, and we got good feedback from folks who listened to it and said, "Oh, wow, that's a lot better now." And we're still doing a lot more work.

So when should you use speech synthesis? There's a bunch of different areas. I won't go through all of these now, but one thing that I think is useful is when something happens inside the computer that's outside of the user's control or not directly relevant to the current task at hand, then speech is an appropriate modality for letting them know. For example, you have new mail from your boss or your compile failed.

Another area is proofreading. Creation of documents used to be an art form, and people would spend a lot of time crafting them, but the world's got too busy for that. We don't have the time. So we have tools like spellcheckers and grammar checkers. Well, grammar checkers don't do very well. Often they don't catch awkward constructs, and the constructs that they do catch, we don't always agree are incorrect.

And spellcheckers can only find a word that is not in the dictionary. Often when we make typing mistakes--the psychologists tell us there is good evidence on this--we are much more likely to transpose letters if doing so creates another real word. And spellcheckers can never catch that. But if you have text read out to you, you immediately spot it. It just becomes so painfully obvious.

People have asked me to talk a little bit about why should I use speech synthesis versus recorded speech. There are a few reasons. If you only have a small amount of things that you need to say to your user, then I'd say go ahead and record them. Get your voice talent. But sometimes recording is impractical; for example, if you have a huge amount to read out or to say to your users, then it takes ages to record it and it takes a huge amount of storage.

The average CD is about 640 megabytes and usually about two-thirds of that are media content. So if you can reduce the audio by a factor of typically about 80 by going from audio recordings down to text, then you have much more space for real content on your titles, on your CDs.

You also get a consistent voice. If you record voice talent and later bring them back to record some more--or even from one day to the next--their voices tend to be inconsistent. They're speaking louder one day, they're a bit more relaxed the next day, and then in the user interaction the voice sounds like it's going up and down. With speech synthesis you get a consistent voice. You can save costs because you don't have to hire a voice talent, you don't have to hire a recording studio. It's flexible.

I don't know whether this has ever happened to you; it's happened to me a lot. You're working on an application, you're about to ship it, and just as you're about to go GM, somebody says, "Oh, we have to change some of the strings." So you have to call up the voice talent to get them back into the studio to record something different, but no, they're on vacation in Brazil now or they've got a sore throat. It's a real pain. With speech synthesis, you just type in the new strings and you're done.

Another important reason for using speech synthesis is if the things that you're saying to your users are longer than a single short sentence, then you need to control the intonation to make sure they're spoken in a way that people can track the meaning across the longer sentences. And you can't do that if you're piecing together sentences that were recorded at different times and you're just concatenating them together.

And you get lip synchronization for free. All right, at this stage, I want to invite up Jack Minsky, who is the president of Software MacKiev. Jack--his company has produced World Book, which, as you may have seen, is this wonderful application. I think it's about the best OS X UI on any application that I've seen. He'll show it to you. It's gorgeous. And these guys have been using speech, and he's going to tell you about it.

Can you swap my headphones? Yes, I'll swap the headphones. Thank you. Good morning. We had a pretty simple goal in mind at the Creative Labs of Software MacKiev when we set out to build the World Book Speech Edition, and that's that we wanted visually impaired users or even blind users to be able to use the World Book, to be able to search all 22 volumes, 18,000 articles, on their own without assistance.

And that meant we really had to be kind of creative, not just have text be able to be read by passing a cursor over it or highlighting something, but build in the kind of interaction that would allow a user really to be able to do this on their own. And I'd like to show it to you. First, I'm going to do as Kim did to adjust the speech here.

I'm going to let this Mac adjust to my voice in this room. What time is it? Quit this application. Open a document. Open a document. Show me what to say. Make this page speakable. Move page down. Hide this application. Switch to Finder. So that's done. And then just to get started, we wanted the user even to be able to launch this from the Finder. And we were thinking launch, start. But we wanted to use something even friendlier.

So we chose Hello World Book as our starting one. Let's try and see if that works. Hello World Book. And immediately they get the feedback of the music of World Book starting up, so a blind person already knows they're in. We'll let this go by just for a second.

And then the next step would be to go and-- So you can see there's all kinds of sounds and things built in there, so even someone who can't actually see the screen can hear some of the things going on. So the next step was to be able to get them to be able to search through the encyclopedia for a particular article that they're looking for. And here you're going to hear me say, "Search, please." A window will open, and blind users can touch-type things in. So let's see that work. Search, please.

So now I have a window open. And if you heard that, it said, ready to search, letting the user, again, have feedback to know that the thing is working. I'm going to type in a simple word here, horse. Picking horse in particular, a whole bunch of articles, what we've done is to embed sounds at the top of the articles.

Again, oral feedback so they know when they reach that article, they'll actually hear the animal noise or whatever that's going on. You'll also hear more feedback. When I hit return, because we're dealing with a blind user who might not be able to type in successfully the right word, we wanted to give them feedback. So it will actually say, searching for horse. And then at the end, if the horse article is found, it will say, search complete. So let's try that. When the horse stops running, they can now simply again ask the computer to read to them.

So I say, "Read to me." So in this way, assuming they've typed incorrectly, they can get to any article that they can think of the name for. Now, of course, we thought people aren't going to be able to necessarily type incorrectly. I know I mistype all the time, and I can see just fine. So what we did was to build in a catch for that. So I'm going to type in and misspell Apple Computer here. And you're going to see on this one that it's going to come up with a series of suggested alternative words.

And we thought even beyond that: as you'll see from this example, it will read through the instructions once first and then go through the list one by one, pronouncing the alternatives that the user might have meant to type in the first place. It will then pause briefly at the end of the list, assume that the user didn't hear what they wanted--or maybe they did and weren't sure--and start the list again, but without that long introductory explanation of what they need to do; it will simply repeat the words again. So let's try that.

And there I go, and I've got my Apple computer article, so the user can do that quite nicely. We also built in a lot of other speech technologies to try and go to the maximum of what Mac OS X has to offer. Kim showed the speech on the mouse, so I won't show that, but we have custom controls in some places here. It comes for free if you simply enable that for all the dialogues with normal tabs and so forth. But if you build custom controls, you can go the extra step of making sure those will also work with text under mouse.

And then we've done one more thing. I'm going to pull up another page here. It's already set up. And that's to take a bunch of the abbreviations that are very common in an encyclopedia, which won't mean anything to a blind person. For example, population is a very common thing in an article about cities. So what we've done here is made it so that population will not read as "pop" here, but read out the word, and I'll just show you that.

Let's try that again with that on. Or something like this instead of food. So all of those things have been, for us, the way that you can see someone who is unsighted could navigate this encyclopedia, really use it on their own without assistance, without someone standing over their shoulder.

We've gotten a lot of recognition for this. This is the first encyclopedia, and the only one, that's fully ADA compliant with Section 508, and that's resulted in a number of magazine articles written in the education space about this application. Also, just three weeks ago, we had the great honor that the American Association of Education Publishers voted this the best children's software of the past year.

That's the first time that a Mac-only application has ever been nominated for this. This prize is open to Windows, cross-platform, everything, but a Macintosh-only product won that category. And also, another great reason to do this: Apple has put this application on every eMac, iMac, and iBook they sell.

And probably the best thing of all is that we know at Software MacKiev, because of the work we did in implementing the speech technologies, which were already set up for us with all the things that are built into Mac OS X, that there are literally tens of thousands of visually impaired users and even blind people out there who now have a whole new world opened up to them to be able to explore independently the World Book Encyclopedia, and we feel really great about that. - Thanks, Jack.

Thank you. Thank you. You can purchase World Book on the Apple Store online or in the retail stores. Check it out. Alright, so now it's your turn. We want to talk to you a little bit about what you can do to incorporate speech into your applications. And we'll start with talking about customizing speech synthesis.

What I mean by this is that when we send text to a speech synthesizer, the speech synthesizer looks at each sentence, scratches its head, and says, hmm, how should I speak this? The answer is, the way a sentence is spoken depends on why it's being spoken and the intention of what it's conveying to the user.

The problem is difficult in the general case, but you guys have an advantage. Your application knows a lot more about how things should be spoken than the text-to-speech engine does. For example, Jack's application knew that pop, when it was in brackets followed by digits, should not be spoken as pop, but should be expanded to population. The speech synthesizer could never figure that out by itself. So there are three things that you can do.

One is filter the text the way the MacKiev guys did. Another example would be stock quote abbreviations. Then you can customize the pronunciations, and you can customize the intonation. Let's talk about that in a little bit more detail. To customize the pronunciation, you're dealing with the problem that the way the synthesizer pronounces a word is not the way that you want it pronounced. This is most often a problem with names or invented names of characters; if you have a fantasy game, I'm sure you've got some character names in there that are written to be difficult to pronounce.

Some developers send special strings to the synthesizer that just use funny spelling. We don't recommend that, because the way we pronounce unorthodox spelling might change from version to version. Instead, we recommend that you use what we call phoneme input, which looks obscure but is actually very quick to learn, and is a totally precise, unambiguous way to specify how words ought to be pronounced. You can embed phonemes like this into the text, or you can load a custom dictionary into the synthesizer that has these mappings already in it.
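To make that concrete, here is a minimal sketch, not from the session, of what phoneme input can look like when passed to the Cocoa synthesizer class introduced later in this session. The character name "Zaldar" and its phoneme spelling are invented for illustration; [[inpt PHON]] and [[inpt TEXT]] are the embedded commands that switch phoneme mode on and off.

```objc
#import <AppKit/AppKit.h>

// Keep the synthesizer around (e.g. as an instance variable) because speech is asynchronous.
static NSSpeechSynthesizer *synth = nil;

void speakWelcome(void)
{
    if (synth == nil)
        synth = [[NSSpeechSynthesizer alloc] initWithVoice:nil];   // nil = user's default voice

    // [[inpt PHON]] switches the synthesizer into phoneme mode; [[inpt TEXT]] switches back.
    // "zAOldAAr" is a hypothetical phoneme spelling for the made-up name "Zaldar".
    [synth startSpeakingString:
        @"Welcome back, [[inpt PHON]]zAOldAAr[[inpt TEXT]], your quest awaits."];
}
```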

Then you should customize the intonation. The intonation is the pitch and the timing that we use when we speak. It's not what we say, it's the way that we say it. And the problem is that once you've synthesized the words so that they are clear, you've not synthesized enough. Consider the sentence, "John only introduced Mary to Bill." Now if I say it like that, it means he didn't introduce Mary to anybody else. John only introduced Mary to Bill.

But suppose I say, "John only introduced Mary to Bill." Then he might have introduced her to other people as well, but to Bill, he only introduced Mary. Quite a different meaning. And if I say, "John only introduced Mary to Bill," then it means he didn't encourage them to go off and form a partnership together.
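The transcript cannot show which word carried the stress in each reading, but one plausible way to force the three readings with the synthesizer's emphasis command (assuming [[emph +]], which emphasizes the word that follows it) would be:

```
John only introduced Mary to [[emph +]] Bill.   (he introduced Mary to Bill and to nobody else)
John only introduced [[emph +]] Mary to Bill.   (to Bill, he introduced only Mary)
John only [[emph +]] introduced Mary to Bill.   (he introduced them, but did nothing more than that)
```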

So the problem is the meaning of a sentence depends crucially on the intonation. It's difficult to generate in the general case because we need to know what's the intended meaning. But your application often knows that. And so your developers can employ your domain knowledge within your application to do a better job. Let me work through an example.

So here is a text from an application that people are using to book flights. And here's the confirmation that's being sent to the user. I will read this out first by just passing the text as you see it directly to the speech synthesizer. And it will sort of do OK, but it won't sound all that great. Here we go. Oh, the audio is not coming out from your demo machine. Is there a reason for that? All right. Well, what can you do? Can you put a microphone on it? Yeah. Is this mic working? This is high tech. Let's see if this works.

Your first flight is with Alaska Airlines flight 2762 departing from San Jose on Monday, May 24th at 6:10 PM, landing in San Francisco at 7:10 PM. Thank you for choosing TTS Travel. Whoa. Okay. Was that hearable? I bet it didn't sound that good. So let's talk about what you can do about that. There are commands that you can embed into the text that you send to the synthesizer. You can embed those commands by rule, and they'll give the synthesizer hints about how to speak the text.

Last year, and in prior developer conferences, we've given some instruction on how to use some of these commands, and according to those, this would be the kind of way that you would annotate the text for the synthesizer. I've put the embedded commands into a smaller font so that you can see them.

But we've been working on the front end of the synthesizer, and some of this information we can now infer because we're now tracking the topic as we go through texts and modifying the way we say it according to the topic structure and the block structure. That means that some of these are no longer needed.

So those ones you get for free. But there are others here that I've left behind which do depend on domain knowledge. Let's take some examples. I'm going to go through these by laying out some simple principles you can use. The first one I'm calling "let the user catch up." What you should do is add pauses at major sense units, so that pieces of information that cohere together are separated from other pieces of information.

And you can do that just by sprinkling punctuation around there. If you want to make a pause longer, you can add the embedded command I've got there, slnc, which means add, in this case, 500 milliseconds of silence. You can also adjust the speaking rate to be appropriate for the purpose of the speech. In this particular case, the user needs to transcribe the information, and so you want to read it out a little more slowly.

If the user already knew that information and you were just reading it back for confirmation, then you would read it back more quickly. So here, for example, is one of those sentences. With just the plain text, it sounds like this. I'll play this out through the demo machine again.

Oh, hang on, it didn't play. All right. So what I've done here is added a command to slow down the rate a little bit and added some colons and commas that you can see at the ends of those lines and a little bit of extra silence. Let's see if this one will play out.
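The exact slide text is not in the transcript, but annotated along the lines just described--a slightly slower rate, extra punctuation, and a little added silence--it might look something like this, where [[rate 150]] sets an absolute rate of roughly 150 words per minute and [[slnc 500]] inserts 500 milliseconds of silence:

```
[[rate 150]] Your first flight is with Alaska Airlines: flight 2762, [[slnc 500]]
departing from San Jose, on Monday, May 24th, at 6:10 PM, [[slnc 500]]
landing in San Francisco at 7:10 PM.
```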

That's unadorned text. Now with those commands that you can see, it sounds like this. Do you hear a difference? Okay, let's go on. The second principle is: familiar things go in the background. When we speak, we don't equally highlight every word. We mark for our listeners which parts of what we're saying refer to things they already know, and which parts are new and important. And the way we do that is by reducing the emphasis on things that listeners already know. So you can do that by de-emphasizing repeated words. For example, departing at 6:10, landing at 7:10.

But that one you now get for free because we're tracking things like that in the synthesizer. But in addition, you can de-emphasize words that could be inferred from the overall application scenario. So for example, in this case, the text started with "Your first flight is..." but the user already knows that it's talking about flights, and so it's appropriate to de-emphasize flight.

So it should be spoken as "Your first flight is..." I'll play that first of all without that embedded command, and you'll hear there's equal emphasis on the words "first" and "flight." Then I'll play it immediately afterwards with this embedded command, which takes the emphasis off the word "flight." See if you hear a difference. Did you, uh, was the audio off again? Could we have the audio on? All right, we'll have to go through this again. Go on.

Okay, first without that embedded command and then with. Your first flight. Then with. Your first flight. Do you hear a difference? Okay. The third principle is: liven it up. If you add an exclamation mark at the end of a sentence, then that stops us from gradually rolling the pitch off all the way through the sentence, and it makes it sound a little bit more involved, a little bit more lively. So if you're hearing your synthesizer having a kind of a bored sound, this is one way that you can reduce that. Don't use it everywhere. Use it judiciously.

Then you can focus the user's attention on what's important by adding extra emphasis on the most important words, by embedding "emph +" just before them. And finally, we suggest using what we call paragraph intonation. When we speak, we don't string all of our sentences together into one long, undifferentiated stream of speech, but rather we group our sentences together into larger units that span multiple sentences and relate to the topic structure, and we mark that for our listeners.

For example, when I start talking about a new topic, I raise my voice just a little bit, and then as I talk about that topic, I lower my voice down to its normal voice range, and then towards the end of that topic, I kind of roll my voice off.

Then for the next topic, I raise my voice again. You hear that? We all do this. Listen to people at lunchtime. You'll hear it going up and down all the time to signal the topic structure. So you can do that. We have told people that you should raise the pitch range at the first sentence of a paragraph with some embedded commands, then lower the pitch range at each subsequent sentence, and then put extra silence in. Well, now you get all that for free.

So what you need to do is put in a blank line between sentences and we will do the rest. So in this particular case, the last sentence, thank you for choosing TTS travel, is not related to the topic of the previous information. And so we can separate it just by a blank line. And that now sounds like this.

Alaska Airlines, flight 2762 departing from San Jose on Monday, May 24th at 6:10 PM, landing in San Francisco at 7:10 PM. Thank you for choosing TTS Travel. For comparison, I'll just play that text again unadorned so you can see where we've come by accumulating all these commands. Oh no, I went flat. All right, we'll go on. All right, enough of this. So to summarize, customize the pronunciations when you're using speech synthesis, customize the intonation using those principles, and together those things will help you to give your users a better experience.
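Pulling those principles together, a hedged sketch of the fully annotated confirmation (again, not the literal slide text) might read as follows; [[emph -]] de-emphasizes the word that follows it, the blank line before the closing sentence triggers paragraph intonation, and the exclamation mark keeps that last line from sounding flat:

```
[[rate 150]] Your first [[emph -]] flight is with Alaska Airlines: flight 2762, [[slnc 500]]
departing from San Jose, on Monday, May 24th, at 6:10 PM, [[slnc 500]]
landing in San Francisco at 7:10 PM.

Thank you for choosing TTS Travel!
```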

Now I'd like to introduce a new tool that we're making available to you guys starting today to further customize the intonation. I'd like to go back over to the demo machine, please. The problem that we're addressing is that sometimes, no matter how many embedded commands you put in the text, you can't quite get it to be spoken the way you want it to be spoken, with the personality or the emotion that you want. Wouldn't it be great if you could just record yourself saying a sentence the way you'd like the synthesizer to say it and have it copy you? Well, that's what this tool does. Let's start up over here. Oh, it's not there.

All right. It's called Repeat After Me. We've had this tool going in the lab for quite some time; it was an internal tool that ran on Mac OS 9. It has been ported to Mac OS X, and a new user interface has been put onto it to make it easier to use and more consistent with Mac OS X. And this work was done for us by the folk at Software MacKiev, and so we're very grateful to Jack. Let's give him a hand for doing this.

and Sainz down here who did the work. So you can type in some text. We are at WWDC. And this will tell you, first of all, the phonemes that the synthesizer used to pronounce it. There's the "we," here's the "are." Then down here, it plots, with time going this way and pitch going this way, the fundamental frequency, the tune that's generated by the synthesizer for that sentence. So if I speak it, it will sound like this. Is this machine going through the sound system now? Okay, here we go.

I'll play that again. We are at WWDC. Now suppose I think that that's all spoken a little bit too quickly and I'd like to slow it down, perhaps to time it with an animation that I have. Well, I can just click on the end up here and drag it out and make it take longer. We are at WWDC.

Or if my animation is really quick, I can have it quite short. We are at WWDC. Let's go back to the default. If I want to emphasize the "we"--which is, here's the W and here's the E of "we"--let's zoom in on that a little bit more. Then I can just raise the pitch.

[Transcript missing]

And make it take longer. And now that will sound like this.

[Transcript missing]

and have it give me a recording. All right, let's try this. Hello? Hello? Is the audio input working? One, two, three. Let's check the sound preferences. Sound, ta-da. Sound input. Let's plug this microphone into the correct computer. I'm going to disappear again, but I'm actually still here. So you can't escape yet.

and the connection. Aha. All right. Oh, there it is. See, sound input works on OS X. Hello, there we go. We at--oh, let's try that again. We are at WWDC. You save that. My audio file comes up. Where is it? All right, audio isn't completely working. As I said, this is Panther, bear with it.

But I've got some pre-prepared ones here to show you just in case that happened. So my recording--my sound wave should have come up down here. So we'll show you one that I've previously prepared, which is-- Here we go. So here's the original signal. I'm sorry, Dave. I'm afraid I can't do that. And here's Victoria now copying that. I'm sorry, Dave.

We're going to make this available to developers. Watch the speech developers mailing list to find out the method by which you can get hold of this. We're also planning on running a kitchen on it. Can I see a show of hands of people who would be interested in coming to a kitchen to learn how to use this? Define Kitchen. A bunch of developers--that's a good question.

A bunch of you come along to Apple as our guests and sit there and we teach you how to use it. You bring along text from your application and sit down with some machines and we sit with you all day and teach you how to use it. So who would like to come along and do that? Quite a number of you. Okay, cool. Let me just give you a couple of examples of what you can do with this.

I've queued these up in iTunes. Well, our buddies in World Book used this for the speech that you heard, although it wasn't very loud, that's spoken back while you're doing a search. For example, you type in "panther" and the computer will say, "Searching for panther." If you just send the text to the synthesizer, it sounds like this. But with customization using this tool, they got it to sound like this. Searching for panther. Do you hear the difference? When the search is complete, it would say... Search complete. which didn't sound that natural, so they used this tool and now it sounds like this. Search complete.

Yeah. I had an application where people would call up an information system, type in their ID number, and it would then read out news and email and so on to them, and it would greet them by name. The developers of this system got a voice talent to record greetings to about 5,000 different names, and they found to their dismay that this has very little coverage for names. We have actually 65,000 names in our dictionary. That gives us about 80% coverage of English names. If we increase it by another 65,000, that would put it up to about 89% coverage.

So names are difficult. Right, right. That's the statistics of names. So they used our speech synthesizer, and when they would pass text to the synthesizer, it didn't sound the way they wanted it to sound. Here's an example of some names being spoken just from text. Dr. Scott, Kai Jin, Kai Junet and Robin Lewis. Good afternoon, Dr. Livingston. A bit tedious. So, using this tool, we've got it to now sound like this.

and Dr. Lippen's tool. So that's the tool that you guys can use. Okay, let's go back to the main machine. Another thing we want to introduce for you today is Cocoa classes. And our philosophy here is that they should be simple to use. Inspired by Alan Kay, we think simple things should be simple, and complex things should be possible. And so here to tell you about them is Kevin Aitken, who is the author, and you can blame him.

And yes, feel free. There's definitely a lot of people who contributed to this, but I'm willing to take the blame, I guess. So let me get started into this. First of all, we've worked really hard on this. Panther now offers Cocoa developers the ability to easily access the most popular features of our speech engines.

So over the next few slides, I'm going to take you through the NSSpeechRecognizer class, which allows you to listen to and respond to the user's spoken commands, and then the NSSpeechSynthesizer class, which will allow you to generate synthesized speech either through the computer speaker or to a file. So let's get started with the NSSpeechRecognizer class.

So first of all, we designed this to be really easy. Virtually all you do is give it a list of strings and tell it to start listening. That means that you don't need to understand concepts like language models and recognition results just to get started. But we've made sure that it's dynamic, that you can change it on the fly, and you can have several recognizer objects running at the same time. So it's very flexible.

So what I'm going to take you through is a couple of coding examples with NSSpeechRecognizer. Imagine you're writing an application, a game that allows the user to move through a maze using four commands: north, south, east, and west. And so let's get started. I've broken these into two sections. The first section will just get us listening, and the second section will handle the result.

So first thing we're going to do is we're going to create a Recognizer object. And then we're going to set the delegate. Remember, a delegate object is just a helper object. In this case, it's going to receive the message when the Recognizer system has heard something. Then we're going to set the commands. And as I said before, this is just a simple array of strings, in this case, north, south, east, west.

And then we're going to start listening. So now your application is listening for those four commands. So the user starts navigating through that maze, and they say one of those. So what happens? What's going to happen is your delegate object is going to receive a didRecognizeCommand message, and as the command parameter, you're just going to receive one of those strings that you originally gave it.
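A minimal sketch of the code being described, in 2003-era Objective-C (no formal delegate protocols or ARC); the MazeController class and the logging are placeholders, while the NSSpeechRecognizer calls and the delegate method are the ones discussed here:

```objc
#import <AppKit/AppKit.h>

@interface MazeController : NSObject
{
    NSSpeechRecognizer *recognizer;
}
- (void)setUpSpeech;
@end

@implementation MazeController

- (void)setUpSpeech
{
    // Create the recognizer, make ourselves its delegate, hand it the four
    // command strings, and start listening.
    recognizer = [[NSSpeechRecognizer alloc] init];
    [recognizer setDelegate:self];
    [recognizer setCommands:
        [NSArray arrayWithObjects:@"North", @"South", @"East", @"West", nil]];
    [recognizer startListening];
}

// Called when one of the commands is recognized; the command parameter is
// one of the strings we passed to -setCommands:.
- (void)speechRecognizer:(NSSpeechRecognizer *)sender didRecognizeCommand:(id)command
{
    if ([command isEqualToString:@"North"])
        NSLog(@"move north");       // placeholder for the real maze action
    else if ([command isEqualToString:@"South"])
        NSLog(@"move south");
    else if ([command isEqualToString:@"East"])
        NSLog(@"move east");
    else if ([command isEqualToString:@"West"])
        NSLog(@"move west");
}

@end
```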

So you can use that string to compare to one of your known strings. I've just used a simple if-then-else; I'm sure there are more efficient, more exciting ways to do it. And then it converts that into some action. Okay. So that's pretty easy. So let's go on and talk about NSSpeechSynthesizer. It's going to allow you to speak asynchronously either to the computer's speaker or to a file.

Because it's speaking asynchronously, you can handle certain events during the speech generation process. Specifically, you can get notification when the speech is finished. You can get notification when a phoneme is about to be spoken and when a word is about to be spoken. We give you access to all the voices that are installed in the system, so you can get information about each one of those and create a pop-up for the user to select one.

And finally, you can combine both the NSSpeechSynthesizer and NSSpeechRecognizer classes to create spoken user interactions--those are kind of dialogues between your application and the user. So let's look at a coding example of that. We're just going to instantiate our synthesizer object using the default initializer here, so it's going to use the default voice the user has chosen in the Speech preference pane. We're going to set that delegate object, and then we're going to start speaking by calling startSpeakingString:. By default, this is going to come out of the default output device; alternatively, we can call startSpeakingString:toURL: to have it written to a file.

And then now your application is speaking away, you can handle some of those events. So we can implement the didFinishSpeaking method in our delegate object so we know when it's finished speaking. So you could, say, update your user interface. You can be notified when it's about to speak a word so that you could follow the bouncing ball on the screen or highlight a word on screen as it's being spoken.

And you can also find out when it's about to speak a phoneme, so you can animate a mouth on screen or avatar or some character or whatever you like. So anyway, that's a wrap-up of the speech classes. I'm going to go over to the demo machine really quick and show you the example.
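And a matching sketch of the synthesizer side, using the calls just listed; the spoken string, the class name, and the file path are made up for illustration, and the availableVoices class method is the standard way to list the installed voices mentioned above:

```objc
#import <AppKit/AppKit.h>

@interface Talker : NSObject
{
    NSSpeechSynthesizer *synth;
}
- (void)speak;
@end

@implementation Talker

- (void)speak
{
    // The default initializer uses the voice chosen in the Speech preference pane.
    // (+[NSSpeechSynthesizer availableVoices] lists installed voices for a pop-up.)
    synth = [[NSSpeechSynthesizer alloc] init];
    [synth setDelegate:self];
    [synth startSpeakingString:@"Your compile has finished."];

    // Alternatively, render to a file instead of the default output device:
    // [synth startSpeakingString:@"Your compile has finished."
    //                      toURL:[NSURL fileURLWithPath:@"/tmp/compile-done.aiff"]];
}

// Called when speech finishes, so the user interface can be updated.
- (void)speechSynthesizer:(NSSpeechSynthesizer *)sender didFinishSpeaking:(BOOL)success
{
    NSLog(@"finished speaking (success = %d)", success);
}

// Called just before each word is spoken -- e.g. to highlight it on screen.
- (void)speechSynthesizer:(NSSpeechSynthesizer *)sender
            willSpeakWord:(NSRange)wordRange
                 ofString:(NSString *)text
{
    NSLog(@"about to speak: %@", [text substringWithRange:wordRange]);
}

// Called just before each phoneme -- e.g. to animate a character's mouth.
- (void)speechSynthesizer:(NSSpeechSynthesizer *)sender willSpeakPhoneme:(short)phonemeOpcode
{
    NSLog(@"phoneme opcode %d", phonemeOpcode);
}

@end
```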

Is this guy up and going? Oops. Everyone? All right. So let me quickly take you and show you where this is. So we have example applications in here. Under Speech, we've added some for recognition, and we've added the NSSpeechSynthesizer example here. So let me show you what we built with this, using most of the callbacks. So let me choose a voice here and start him speaking.

So that's what we created using the NSSpeechSynthesizer class, and it was really fast, really easy, and hopefully you'll find that as well. So that example is on your Panther CD, and there are some other examples in there, so go take a look. Do you want to finish that? Okay, good.

So that brings us to the end of the material we've prepared for you today. To summarize, we've introduced the speech technologies for those who aren't quite familiar with them. We've introduced a tool for customizing speech synthesis, which we're going to make available to all developers. We've introduced Cocoa classes, and we have given some guidelines about when you should use speech and what kinds of principles are behind your adoption of it.

For those that are interested in more background information about this, you might want to look at the introduction to developing applications with Cocoa to find out about Cocoa programming. Or you might want to see the AppleScript update, because speech and AppleScript have such a strong synergy that many folk at Apple say those two together are the two most strategic technologies at Apple. And you can find out more about the accessibility API at the Mac OS X accessibility session.

We don't have time for questions now, but we, the team, will be gathered just outside there and are happy to stay as long as any of you would like to answer any questions. If you have any questions subsequently, we suggest you contact John Geleynse. He's the manager of software evangelism, and his email address is up there. If it's hard to read, it's geleynse, G-E-L-E-Y-N-S-E, at apple.com. And go to the speech web page to find out about the speech developers list and documentation of all the things that we've shown you and more. The URL is up there. Thanks a lot.