Mac OS • 59:34
This session provides an overview of current speech technologies, describes what's new in Mac OS X, and goes in-depth on lots of material. Featured topics include new APIs, ideas for using speech, our new pronunciation subsystem, how to customize speech for your application, and guidelines and tips for getting the best possible performance and sound.
Speakers: Kim Silverman, Sal Soghoian, Matthias Neeracher, Kevin Aitken
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.
Thank you. Thank you. So the good news is that--can you all hear me? Yep. The spoken language technologies, in case you haven't noticed, are in Mac OS X. We got them there. Thank you. What we're going to do today is describe briefly the speech recognition and the speech synthesis that are there.
We're going to give you guidelines about where to use them in your applications and why you would want to use them in your applications. And we're going to actually lead you through the process of getting your applications talking and listening. So let's start off by doing a demo of what we've got over here.
Can we switch to-- oh, I'm supposed to do it, aren't I? I'm supposed to do it. There we go. The user interface to speech recognition is this round window that you may have seen. This replaces the face that some of you might be familiar with in OS 9. It consists of three parts.
The middle, you might be able to see, says Escape. This shows you the listening mode. There are two different listening modes with speech recognition, what we call Push-to-Talk mode and Continuous Listening. In Push-to-Talk mode, it's only listening when you hold down a key. It says ESC, which means that by default it's the escape key.
Users can configure that. In Continuous Listening mode, it's listening all the time, and optionally you can have it wait for a keyword, like computer, before you speak your commands. I am using Push-to-Talk mode here, and I recommend that you do as well when you're demoing, so that when you're explaining things to people, it's not trying to recognize the things you're actually saying to other people and not to the computer.
What time is it? It's 10:30. What day is it? It's Thursday, May 24th. Show me what to say. Okay, the other part of the feedback is this speech commands window, which has two halves. The top half, which is scrollable, shows what it has recognized, and if it speaks back to you, what it says to you. The bottom half, which is also scrollable (and all of which is at last resizable by the user; thank you, Cocoa), shows what you can say. There are now disclosure triangles, so that it no longer scrolls off the bottom of the screen.
The middle item there, I don't know whether you can see, says Speakable Items. That shows the commands that can be spoken all of the time, no matter what application is running. At the start, I spoke "what time is it" and "what day is it." They're items down in there. These are actually kept in the Speakable Items folder. Let's take a look at it. Open the Speakable Items folder.
There it is. So any item that is in this Speakable Items folder can be launched by speaking it. And it's just the same as double-clicking on it. Applications, aliases, documents, servers, URLs, anything that you can launch by double-clicking, you can now launch by speech on OS X. The real power of this is that users can customize it to the way they work by dragging their own items into the Speakable Items folder.
In addition, just like in OS 9 now, the Speakable Items folder itself contains a folder called Application Speakable Items. This contains folders which are named by applications. The items in those folders are only speakable when that application is in the foreground. That's shown in the speech commands window in the top item, the top disclosure triangle. So you see at the moment-- let me close the others-- that it says finder, and the finder is in the foreground. Let me demonstrate this in action.
I'll switch to my browser, and as I do, you watch the items there change. Open my browser. There, so it now says Internet Explorer, and there are some different items there. This is an opportunity for you developers. You can make application-specific folders for your applications and populate those folders with scripts or other commands that control just your application. They won't be speakable when your application itself is not in the foreground. Oh, I'd like to encourage you to explore this by yourselves. I just want to show you one other thing.
Hide this application. And that is that you may have noticed we shipped one game, and that is chess. Oh, in telling you this, I should point out, we keep track of all the applications that you've launched since we started--actually, restarted the machine, and you can switch to any application, running or not, just by saying its name.
Switch to Chess. So the chess application you might have seen mentioned and demoed in the keynote on Monday, one of the keynotes, as being an example of a good user interface, Aqua quality. And one of the things about it is that you can control it by speech. Pawn D2 to D4.
Knight B1 to C3. What's it going to do to that? Let's see. It's thinking. It's thinking harder than I am. But it's a better player than I am, too. One of the things I like about this is there's one thing I can do with this that I can't do when I'm playing with real people, and that's the following. I can say, take back move. Whoa. My 12-year-old son does not let me do that when I'm playing with him. Okay, let's move on.
So what speech is there in Mac OS X? Bear with me. There are a lot of people at this conference this year who are new to the Mac OS platform. Some of you who are already familiar with the platform will know already what we have. So here's just a brief mention for those that aren't yet familiar with it. We have Speech Synthesis and Speech Recognition. The Speech Synthesis will take any text and convert it into audible speech.
There are 22 different voices. They range from adult male and adult female through to voices that sing and sound like aliens and novelty voices. And we have Speech Recognition. There are a number of characteristics about the Speech Recognition that are important. First is it's speaker independent. That means you don't have to train it to your voice.
You just take it out of the box and it just works. The Mac was and remains the only computer platform I know of that you can just take out of the box, put on the desk, and command by voice. It's continuous speech. You don't have to pause between words.
It uses a far-field microphone. This is a technical point, but a very important point. Speech recognition is very sensitive to background noise. So sensitive that all other recognizers that I know of require that the users purchase and use a noise-canceling, close-talking, head-mounted microphone. We tune up our recognizer to work with the in-built microphones that are built into the iMacs and the other CPUs that Apple delivers. That means that we are getting the background noise that other recognizers aren't getting. So we have several layers of software to adaptively model and subtract and deal with that background noise.
But there are limits on how much background noise we can deal with. So we've tuned it up to work well in the situations where most of our users are using their computers: in the office, at home. If you have users or customers who are using speech recognition in noisy environments, such as in classrooms, then it might be pushing the limits of the built-in microphones a little bit too far. One such environment, for example, is giving a presentation in an auditorium with a 300-watt sound system that's providing a slapback echo from the back of the hall.
And so this is one of those environments where I've been using this head-mounted microphone. This one is produced by VXI. They have worked quite a lot with us over the last couple of years to optimize their microphones to work with our speech recognition. And so you can direct your users, if you want, to those as an alternative solution.
Our speech technologies currently are US English only on OS X. So what's new? In speech synthesis, we've done a few things. We merged the MacinTalk 3 and the MacinTalk Pro codebases into a single codebase. This has a couple of advantages. One is it means that we have, at last, a totally new codebase. We have divested ourselves of deep, intricate, uninterpretable legacy code and positioned ourselves to, at last, be able to fold in the research improvements that we've been making in speech synthesis in the last few years.
So we've got ourselves onto a new platform ready to go forward. An immediate benefit right now is that you get consistent behavior across voices. A lot of developers have said to me that they change from one voice to another, and words are pronounced differently, or the intonation is different. That will no longer happen.
All the voices are now just the high-quality versions. So what was Victoria high-quality in OS 9 is now just called Victoria, and there is no lower-quality version in OS X. That means the speech is crisper, it's easier to understand, it's more robust in the presence of background noise. People can understand it without it being so loud. And we've improved the pronunciation in a number of ways. We've enlarged the dictionary from about 20,000 words to about, what, 120,000? That's Jerome's work, down here.
The morphological decomposition is now recursive. If you want to know what that means, ask us in the question time. It's fascinating. Really, really, it's way cool. It's a multiplier on the effectiveness of the dictionary. And the letter-to-sound rules are now automatically trained based on the new large dictionary, rather than handwritten based on one linguist's intuitions.
For speech recognition, we also have a benefit here. We have factored out that pronunciation subsystem from the speech synthesis and made it a separate subsystem that's now shared between speech synthesis and speech recognition. This means, first of all, it reduces the overall RAM usage, which improves performance of everything running on the platform. It means the recognition is more accurate because the pronunciation subsystem is expecting the correct pronunciations for more words.
And it gives consistent behavior across speech recognition and speech synthesis. Developers have said to me, I was prototyping some spoken commands for my application, and one of the commands was not recognized very well. And I thought that perhaps the recognizer was not listening for the correct pronunciation. So to find out what the recognizer was listening for, I typed my command into your text-to-speech system, and sure enough, it was spoken with an incorrectly pronounced word. Well, up until today, that's been irrelevant. Now it is relevant.
Now, if you want to know whether the recognizer is expecting the correct pronunciation, then type the word or the command into the speech synthesis, and it will tell you the way the recognizer thinks people will pronounce that word. The user interface has been completely revised, as I showed you. And there's an improvement to speakable items. We've added XML-based command files that let you associate a spoken command with a keystroke sequence. Kevin Aitken will show you that in some more detail later in this presentation.
I want to talk about why you should use speech in your applications. There are two classes of applications, I think, from our perspective. There are applications that are centered around speech, where speech technology is central to the user's experience and central to the value that application delivers. And then there's a huge number of applications, I think most of you write, for which speech is not centrally relevant at all.
And I'd like you to think about places that you can use speech in those applications as well. Chess is an example. Speech is not really relevant to chess, but we added it and people say, "Hey, that's cool." If you add speech to your application, then you'll increase the number of potential users and that increases your market. For example, younger users will find your application more approachable. People with disabilities will be able to use your applications. And people who are less familiar with computation will be less scared of trying out your application.
Speech is a very natural form of communication. We've all been talking and listening since we were, what, two years old? Wait, one and a half years old? Speech enables you in your application to move beyond the limits of point and click. There's nothing wrong with point and click. In fact, it's very good at letting you control things that you can see on the user interface and reach with a single gesture.
But there are lots of things that you want people to control that they can't see to point at and click on. Speech gives you a way to get past that. If you think about it, clicking is rather like grunting. We're turning our back on, what, about 200,000 years of human evolution. Because when I click on things, I'm just going, uh, uh, uh.
That's true. I'd like to think that we've come somewhat further than that. Similarly, speech output, I think, can be a lot better than just a beep. So many of us are still using alert sounds. Well, beep was the mentality of the 1960s, when all the computer had was a tiny little speaker. We've come forward since then. Speech gives you a way to bring yourselves into the 21st century.
And conversation is an appropriate modality for delegating tasks to a computer. We'll illustrate that a bit more shortly. So what are some of the ways that you can use speech synthesis in your application? One is notifications. We recommend that you judiciously get the user's attention back if the user's attention has wandered away from your applications. Some of you may have known about or experienced talking alerts that we did in OS 9.
Now, there's a slider that lets you set the delay between an alert coming up and it being spoken. You may want that longer or shorter than the default, but the point is you normally should not hear any speech, except when your attention wanders and the computer wants your attention, and then if you don't respond, it gets your attention back. You can use notifications for asynchronous events. For example, in the AOL Instant Messenger, if your buddies enter your chat room, then it will announce that with our speech synthesis. It will say, Buddy Smith just entered the room.
You can use speech synthesis to give additional feedback for younger users, and we have found, and our users have found, that applications for the 6 to 12 space become accessible for the 3 to 6 space if they don't change anything but just read out the text messages that they're putting up on the screens.
And you can use speech synthesis for proofreading. For example, you have an application where people are entering data, say a spreadsheet, and people are entering budget figures in a column. Allow them to select that column and have it read back so they can just check what they were entering to make sure that there aren't any errors.
Speech will give your application more accessibility for those with disabilities. I think that's pretty obvious. And it's really cool in games. You saw that we did it with chess. There are a lot of games already that are using speech for things like cheat codes and changing weapons. Use speech for non-time-dependent control. I don't think it would be appropriate to use speech recognition in your game to say, "Fire, now! Quick, quick, left, quick, right! No, no, left!" But you can do it for that as well if you want to try it.
In noisy games, these headsets are probably the right thing to do so the speech recognition doesn't confuse your commands with the game's sound effects. There are a lot of people who are successfully using speech in education applications. I think there are a lot more of you with education applications that could take advantage of it as well. All right. One example would be the DynEd product, which is using speech recognition for pronunciation correction for adults learning English. It's way cool. Check it out.
You can use speech to enhance the web browsing experience. For navigating within a browser, those of you who've explored will have seen that we actually ship a couple of spoken commands for Internet Explorer that let you do some simple navigation. If you have a browser, you can do a much better job than we do by working within it. For example, people could speak the links, jump to pages by topic, read out web pages.
And there's a big opportunity with VoiceXML. The enterprise industry that's moving more and more of its information onto web access is now doing two or three different versions of all of their websites. They're doing the HTML version, they're doing a WAP version for personal digital assistants to access their stuff via wireless, and they're making VoiceXML versions so that people can ring up the web page and have the information read out over the phone.
The way this is done is with an extra set of tags in the web pages, which are just the thing that you need to interpret for speech access. And you can do that on the desktop with a Mac using our APIs. It would be pretty straightforward because the infrastructure and the hard work has all been done for you by the web developers.
We recommend that you think about using speech for form filling as an alternative to people filling out things with pop-up menus. You can now have people speak the contents of each field, and you can use a constrained language model for each field to increase the recognition accuracy. For example, the person could say, create a new customer record, and the computer could then respond, what type? And then narrow down its search to just the alternative customer records.
The person could say, corporate account. And then the next field would be payment schedule, and the person could then say, 30 days. And at that stage, the recognition model would be changed again to just listen for possible payment schedules. Okay. There are a lot of tasks where people's eyes are busy, their hands are busy. For example, you're in a graphics program. You're drawing. You've got the mouse down. You're putting a line across an object.
And you want to move it around, send it to the back, or change the brush size. When the eyes are busy and the hands are busy, speech gives you another way for your users to control your app. So at this stage, I'd like to invite Sal Soghoian up. Sal Soghoian is the AppleScript product manager, and he's going to show you some way cool ways that he's been using speech. Sal.
Is this on? Great. Hi. This is amazing that I got up this early for this whole thing. Those that know me know that it's an impressive feat. What we're going to be showing today (can you switch me to this, or do I switch myself to this?) is how to use AppleScript with speech.
One of the best integrations on Mac OS is the ability to use these two technologies together. On Mac OS 9, we introduced the ability to have a script listen for a response and, based upon the user's response, perform a different set of actions. It incorporated a technology called the SpeechListener.
So I'm going to show a couple scripts today that use this technology on Mac OS X. And both scripts will involve a conversation to get a task done. The first one is a rather straightforward example where I ask the script for some music and it prompts me with a series of questions and we have some music played. So let's see if my voice is back where it should be and we'll try this out. Some music please.
[Transcript missing]
So in this example, the script starts up, it got the information about which artists were available, held it in memory, and said which artist or category. When I said Christine Kane, it matched that, then queried and found out which songs were available by Christine Kane, held that in memory, and then said which song. I said Tucson, that matched, and then it had the song play with iTunes. So this is a simple example.
You've got to program in a certain amount of grief with these things, just to keep you honest. So there's an example of being able to carry on a conversation with a script. It's a limited conversation, but it is a way of gathering information and moving forward. And the, uh, hide this application.
In the next example, I'm going to use a program from BiApp Systems. It's called GoTreef, and one of the things that this application does is it accesses information over the Internet. And I'm going to use a script acting as a person, a personality called Victoria. And Victoria will act independently of the speech recognition speakable items in that she will have her own set of scripts that she's going to use in conversing with me. So here we go. Let's try this out and see if she's awake too. Victoria? Yes, Sal? Show my newspaper. Here you go.
Something else? Clear all stories. All stories have been removed from your newspaper. Add multiples. Anything else? Add multiple stories. Ready. Motley Fool. Adding Motley Fool. Ready. Apple Stock Quote. Adding Apple Stock Quote. Ready. Apple Top Story. Adding Apple Top Story.
[Transcript missing]
So in this instance, the script is called Victoria, and it exists in the Speakable Items. When the script loads up, it goes to a subfolder of the Speakable Items folder called Tasks. And within that folder are individual scripts, the names of which she holds in memory. When I say show my newspaper, she loads that script and then executes it.
So you have a script running a script. And all of the commands that Victoria was handling are not included in the standard Speakable Items commands; they are her own subset. So you can create these individual personalities. So those are just two examples of how you can use AppleScript with speech. And if you're interested in how to do this, the AppleScript website has a complete overview. It has an AppleScript guidebook on how to use speech and AppleScript together. Thank you.
Some issues. If you're going to include speech in your application, then there are a few things that you need to keep in mind. Educate your users about how to speak. A good example: go to the Speech preference panel and turn on Speakable Items, and you'll see a sheet come down showing how we explain to users that they shouldn't pause, and so on. Let them know about background noise being a problem. You might want to refer them to head-mounted microphones.
We train the speech recognizer on North American English. And so officially, that's what we say we support. It happens that it is somewhat forgiving. And so, I'm Australian. In our group, we have Jerome from France. We have Devang. Where are you, Devang? A native Gujarati speaker. We have Matthias, who speaks Swiss German as his native language.
We even have Tom from the Bronx, and it understands all of us. But again, there are limits. And localization is an issue. Currently, as I said, we are US only. So it's time to code. For Speech Programming 101, I'd like to invite Matthias Neeracher up onto the stage. Matthias.
Okay, now that Kim has told you what to do with speech, Kevin and I are going to talk about how to do this. Using our speech technologies in Mac OS X is pretty simple. We are installed in every install of Mac OS X. To use us, just link with the Carbon framework, or, if you have a CarbonLib-based application, with CarbonLib. Our APIs are identical for Cocoa and for Carbon, the same APIs that you used on Mac OS 9. You can use them from Objective-C, from C, or from pretty much any language that we ship on Mac OS X.
Let's start with speech synthesis. Let's say you want your application to say something. How difficult can this be? Turns out it's not difficult at all. It's a single line, and you will get... Hello, world. Okay, that was simple enough. If you want to have a little bit more control, you open a speech channel, giving a voice. This can either be something that you get from a menu you give to the user, or if you pass in null, you get the default voice.
You probably shouldn't hard-code a voice unless you know exactly why you would want to do this. Then you can adjust parameters as you like them, and once you have them to your liking, you can speak the actual text by calling SpeakText. All of these calls are asynchronous, so it will actually return control to you before the text is entirely spoken.
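In rough outline, that flow is just a handful of Speech Synthesis Manager calls. The sketch below is illustrative rather than the session's demo code; the text, the rate value, and the minimal error handling are placeholders.

```c
#include <Carbon/Carbon.h>
#include <string.h>

int main(void)
{
    /* The one-liner: speak a string with the default voice. */
    Str255 pascalText;
    CopyCStringToPascal("Hello, world!", pascalText);
    SpeakString(pascalText);
    while (SpeechBusy() > 0) {}        /* even SpeakString returns immediately */

    /* For more control, open a channel; passing NULL selects the default voice. */
    SpeechChannel chan = NULL;
    OSErr err = NewSpeechChannel(NULL, &chan);
    if (err != noErr) return 1;

    /* Adjust parameters before speaking; 90 words per minute is just an example. */
    SetSpeechRate(chan, Long2Fix(90));

    const char *text = "Welcome to speech synthesis on Mac OS X.";
    err = SpeakText(chan, text, strlen(text));   /* asynchronous: returns at once */

    /* A real application would go back to its event loop; we just spin here. */
    while (SpeechBusy() > 0) {}

    DisposeSpeechChannel(chan);
    return 0;
}
```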
We offer a lot of control. You can control the speech rate to speak slower for younger users, for instance, or quicker in a game situation, for instance. We can control the speech pitch, modulation, so it sounds more lively, the volume, to customize the way speech sounds in your application.
We also give you callback routines, so when you have a screen reader, you can highlight words on screen as they are spoken, or if you have an animated character on screen, you can animate the lips of the character as the phonemes are spoken. You can see many of these options in action in our Cocoa Speech Synthesis example, which ships on the developer CD.
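As a rough sketch (not the Cocoa example itself), hooking up the word callback for on-screen highlighting might look like this; check SpeechSynthesis.h for the exact callback prototype on your SDK, and note that the refCon trick is just one way to get the text buffer into the callback.

```c
#include <Carbon/Carbon.h>
#include <string.h>
#include <stdio.h>

/* Called by the synthesizer as each word is about to be spoken.  wordPos and
   wordLen give the byte range of that word in the text buffer, which a screen
   reader could use to highlight it on screen. */
static void MyWordCallBack(SpeechChannel chan, long refCon,
                           long wordPos, short wordLen)
{
    const char *text = (const char *)refCon;     /* the buffer we stashed below */
    printf("speaking: %.*s\n", (int)wordLen, text + wordPos);
}

/* Speak a buffer with per-word notifications (error handling trimmed). */
static OSErr SpeakWithWordCallbacks(SpeechChannel chan, const char *text)
{
    /* Stash the text buffer in the channel's reference constant so the
       callback can find it again. */
    SetSpeechInfo(chan, soRefCon, (const void *)text);

    /* Register the word callback; the UPP wraps the plain C function. */
    SetSpeechInfo(chan, soWordCallBack, NewSpeechWordUPP(MyWordCallBack));

    return SpeakText(chan, text, strlen(text));
}
```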
Most of these controls, you actually don't have to write any code because you can simply embed them in text. So, for instance, in this sentence, if you want to emphasize the word next, you just embed an emphasis command in front of the word. Don't meet till next Tuesday. This is very important because to really have speech synthesis work for you the best it can, you should customize what is spoken.
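For instance, the emphasis in the sentence above can be written straight into the text buffer you hand to SpeakText; the bracketed commands below use the embedded-command notation and never show up in the audio.

```c
#include <Carbon/Carbon.h>
#include <string.h>

/* Speak a sentence with the word "next" emphasized and a half-second pause
   added.  [[emph +]] emphasizes the word that follows it; [[slnc 500]]
   inserts 500 milliseconds of silence. */
static OSErr SpeakReminder(SpeechChannel chan)
{
    const char *text =
        "We don't meet till [[emph +]] next Tuesday. "
        "[[slnc 500]] Please update your calendar.";
    return SpeakText(chan, text, strlen(text));
}
```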
Basically, your application just knows a lot more about how things should be spoken than our engines can know by default. So, there are a number of things you can and should do with your application. First of all, you should filter the text that is passed on to the text-to-speech engine.
For instance, if you have a stock ticker application and you come across the acronym AAPL, what you should do is tell the text-to-speech engine to say Apple Computer instead. Second, you should customize the pronunciation of words that don't come out right. And last of all, you should customize the intonation of what is spoken.
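A minimal sketch of that kind of pre-filtering, assuming a hypothetical stock-ticker application; the ticker table is invented for illustration, and a real filter would scan whole sentences rather than single tokens.

```c
#include <stdio.h>
#include <string.h>

/* Map ticker symbols to speakable names before the text reaches the synthesizer. */
static const struct { const char *ticker; const char *spoken; } kTickers[] = {
    { "AAPL", "Apple Computer" },
    { "ADBE", "Adobe Systems"  },      /* illustrative entries only */
};

static const char *SpokenNameForTicker(const char *ticker)
{
    size_t i;
    for (i = 0; i < sizeof(kTickers) / sizeof(kTickers[0]); i++)
        if (strcmp(ticker, kTickers[i].ticker) == 0)
            return kTickers[i].spoken;
    return ticker;                      /* fall back to speaking it as-is */
}
```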
Now, we try to have a huge dictionary, as Kim already said, but even the biggest dictionary cannot possibly handle all the words, and especially not all proper names. For instance, my first name is tricky. It's certainly not something our system can pronounce by default. In the past, some developers have just used funny spellings to get it to work approximately the right way, like this. My name is Matthias.
Sounds almost right, but we don't recommend this, because if you use words that are not part of the English language, or strange combinations, we might change in the future how this is pronounced. Second, this is not a very precise way of specifying what you want said. So, instead, what you should do is use embedded commands to temporarily switch to phoneme input, using the phoneme notation. We describe this notation in Inside Macintosh: Speech. It's explained with examples on a single page. It doesn't take very long at all to learn how to use it, and the result is something like this. My name is Matthias. Sounds somewhat better.
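In sketch form, the switch into phoneme mode looks like this. The phoneme spelling of the name below is only a rough guess for illustration; work it out from the phoneme table in Inside Macintosh: Speech and tune it by ear.

```c
#include <Carbon/Carbon.h>
#include <string.h>

/* Switch to phoneme input just for the tricky word, then back to text mode.
   The phoneme string is an approximation, not the speaker's actual markup. */
static OSErr SpeakMyName(SpeechChannel chan)
{
    const char *text =
        "My name is [[inpt PHON]] mAXtIYAXs [[inpt TEXT]] .";
    return SpeakText(chan, text, strlen(text));
}
```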
Second, you should customize the intonation of the text you pass on to text-to-speech, because the written words alone are not always enough to convey the meaning. For instance, if you see this sentence, you can read it as, John only introduced Mary to Bill. He didn't introduce her to anybody else. You could read it as, John only introduced Mary to Bill. He didn't introduce Caroline to him. Or you can read it as, John only introduced Mary to Bill.
He didn't ask her to marry him. So, these distinctions can be very important. So, you should annotate the text you pass on. Our system tries to do the best it can to find out how a sentence should be spoken, but this can be very difficult, if not impossible, to do in the general case. Your application has domain knowledge of much of the text that is spoken and has the potential to do much better.
For instance, take a flight reservation system. At the end, it gives a confirmation text. And I'm going to play you two different versions of saying this confirmation text. First of all, it's not annotated at all, and the second is annotated. I'm not going to say anything between the two versions. Your first flight is with Alaskan Airlines Flight 2762 departing from San Jose on Monday, May 24th at 6.10 landing in San Francisco. Thank you for using TTS Travel.
Your first flight is with Alaskan Airlines, flight 2762. Departing from San Jose on Monday, May 24th at 6:10 p.m., landing in San Francisco at 7:10 p.m. Thank you for choosing TTS Travel. So did you hear a difference? Raise your hand if you heard a difference between the two versions.
Excellent. I see that the hands are up. So we did this with quite a bit of annotation. And this can basically be distilled to five principles of how to improve the intonation of the spoken elements. The first principle is let the user catch up by adding pauses at strategically important points.
Add punctuation wherever appropriate, and appropriate does not mean appropriate in the sense of English grammar. Nobody is going to see what is spoken, so feel free to add a comma if you think a pause is necessary at a point. Break up larger sentences into smaller ones, and insert some explicit pauses with the silence command at major pause points. So in our example, we added punctuation and we added pauses. All of this lets the user catch up.
Second principle is to let familiar things go into the background by de-emphasizing repeated words. For instance, if the minutes are identical, you should de-emphasize the second instance. Also, de-emphasize items inferable from your overall application scenario. You know that you're booking a flight, so you don't have to emphasize this word. Third principle is to liven it up simply by adding an exclamation point at the end. Fourth principle is to focus the user's attention by emphasizing the important words. This can be done with an emphasis command or simply by inserting a colon before the most important item.
And fifth, and maybe most important, use paragraph intonation. Group your sentences together into intonational paragraphs. For the first sentence in each paragraph, you should raise the pitch range and then reset it for the rest. This makes quite a bit of difference for longer texts that are read. You raise the pitch base and increase the pitch modulation, and then decrease it after the first sentence.
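Pulling those principles together, a hand-annotated confirmation in the spirit of the demo might look something like the sketch below; the command values and the exact wording are illustrative, not the actual demo markup.

```c
#include <Carbon/Carbon.h>
#include <string.h>

/* Short sentences, explicit pauses, a colon and emphasis before the new
   information, de-emphasis of the inferable word "flight", an exclamation to
   liven it up, and a raised pitch range on the paragraph-opening sentence.
   All numeric values are illustrative. */
static OSErr SpeakConfirmation(SpeechChannel chan)
{
    const char *text =
        "[[pbas +2; pmod +2]] Your first flight is with Alaskan Airlines: "
        "[[emph -]] flight [[emph +]] 2762! "
        "[[pbas -2; pmod -2]] [[slnc 300]] "
        "Departing from San Jose on Monday, May 24th, at 6:10 p.m. [[slnc 300]] "
        "Landing in San Francisco at 7:10 p.m. [[slnc 500]] "
        "Thank you for choosing TTS Travel!";
    return SpeakText(chan, text, strlen(text));
}
```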
Between paragraphs, add an extra pause as well. So to summarize, you should customize the pronunciation of the words that you say. If you notice that a word you have hardcoded in your application gets mispronounced, use phoneme mode to get it pronounced correctly. You should customize the intonation of the text that is said, which helps the user understand the text a lot better and gives the user a much better overall speech experience. Now, let's move on to speech recognition with my colleague Kevin Aitken, who is not lazy at all, I might add.
Can you all hear me? Yes, I can hear myself. Well, my manager, Kim, has actually assured me that this is not necessarily a commentary on my work ethic, so I feel better. But if you're like me, once in a while, I do feel a little bit lazy. And at those times, I just love having a simple solution to be really productive.
And so in the next 15 minutes or so, I'm going to show you two easy methods that you can use to add spoken commands to your application. And so hopefully, in an afternoon's worth of work, you can walk into your manager's office or your co-worker's office and say, Oh, by the way, our Mac OS X application understands spoken commands. So let's get started.
So as I mentioned, I'm going to provide two methods. So the first method is to use the Speakable Items application that's built into Mac OS X. As Kim demoed in the beginning, it's designed for end users, so they can easily add spoken commands to any application. And so you as a developer can use this also.
So it's great because you don't have to write any speech code. Speakable Items takes care of this for you by taking a list of items, building a language model, and then waiting for the recognition result. And as Sal showed in his demonstration, it understands how to execute AppleScript so that you can easily send Apple events to your application or even other applications for that matter.
Now, the second method I'm going to describe is to use the Speech Recognition API. You may be familiar with this in Mac OS 9. It gives you a little bit more flexibility. You can have multiple command lists, so you can have one set of commands for when the user has selected an object, and one set of commands when they haven't.
And I'm going to show you in a little bit an example that gives you a really easy three-step approach to adding spoken commands to your application using the Speech Recognition API. Well, one of the things that both of these methods have in common are commands. And so let's talk just for a second about what makes a good command.
Well, commands are like menu items, but we suggest that they're normally from three to six words long. The longer, the better, generally, because the recognition system can understand them more easily, and they're more unique amongst your other commands. But you don't want them so long that the user has a hard time speaking them fluently. Also, you should avoid single words, and especially words like hot, cut, and quit, because those are oftentimes misrecognized or they sound alike to the recognition system.
The other important item is that you should test your commands, especially test them together to make sure they're not confused with each other, and test them with the global commands that are shipped with Mac OS X. And to prototype your commands, you can use Speakable Items or the SR Language Modeler application that you'll find on the developer CD.
Okay, let's talk about Method 1 in a little more depth. And this is using the Speakable Items application. So the first thing, as I mentioned, you want to create a number of items. You can easily do this by bringing your application to the foreground and speaking the command, make this application speakable. That creates a folder in the speakable items directory, as Kim showed you earlier, inside that application speakable items folder. And then you can begin adding your items as he showed you.
So once you have all your items together, the next thing is you want to bundle those inside of your application. So I'm going to show you a little bit of an example. You can use Project Builder to easily copy these files into your application bundle at the time you build it. And then finally, you need to install those items.
Now, we really suggest that you install them at runtime. This gives you a couple of added benefits. It allows your application to be drag-and-drop installed, so therefore, in order to support speakable items, you don't have to have a separate installer to install the items. And it's great for Mac OS X support for multiple users, because after your application has been installed, let's say the administrator creates new users, well, that new user is just going to get those speakable items the next time they run your application, because you'll automatically install them.
Well, let's talk about items real quick. Kim briefly mentioned those. It's basically any file that can be opened. But there's really two types that are best for you developers. The first are AppleScript files, as I talked about, that allow you to send Apple events to your application. The other one that Kim mentioned are the new XML-based command files, and they allow you to send keyboard events to your application so that you can activate menus or controls via keyboard shortcuts.
Well, one of the things that I wanted to do in preparing for WWDC this year is create an example that really showed how easy it is to add spoken commands to your application. I really wanted to make it as simple as copy-paste-go. So, as you saw in Sal's demonstration, he used iTunes, and that's a pretty good real-world application. It's shipping, and you can see how those commands are integrated into the application. Well, since I can't ship or give you the source code for iTunes, I thought, well, let me create a clone of it.
So, I've named mine FauxTunes, courtesy of our French person in the group. And it's up on the web right now. You can go grab it at this URL. And hopefully you can either grab it this week or right now, or when you get back, and start taking a look at it. I believe it shows a really easy way of getting going. So, let me go show it to you. Here we go.
[Transcript missing]
So let me show you the application, my clone of iTunes real quick. It has the identical menu items. And the window is, you know, pretty close. I mean, if you haven't done anything in Cocoa or Interface Builder, I basically took 15 minutes, threw all the menus in there, laid out the window, and pretty much got an automatic resizing window. It's really awesome. Okay, let me show you that it really is listening for commands.
Show commands window. Show speech commands window. Oh, okay. That's the first one. Display visual. Oops, let me bring this forward. Get song info. Get Song Info. There we go. As you see, it doesn't do anything; it just shows the command down below. Okay. So it really is FauxTunes. Okay. Let's switch into Project Builder, and I'll show you how this is set up. Let me try the command. Switch to Project Builder. Yay. Okay. Cool.
I'll move those out of the way so we can see the window here. I'll put this down since I'll be giving it more commands. Okay, so let me show you real quick what this basic object looks like that manages the window. It's really simple. It has a couple of instance variables, and then it pretty much has a method for each one of the menu items, and a couple extras to handle some of the controls in the window. So it's really simple.
All these methods do is basically display at the bottom of that window what's happening. Okay, so as I mentioned, the first step was creating the items. So we've created those items. The next step is we need to include them in our application bundle. So what we'll do is we'll go to the active target. And the way we do this is we use a Copy Files build phase.
And so we include those down here. Let me show you where you do that. And here, if you haven't already seen it, you go New Build Phase, and you go New Copy Build Phase. It's not highlighted right at the moment because I haven't selected a particular item. But as you can see, I've included the items here. I have two sets of items. I have the command files, the majority of them.
And so I'm saying place these in a folder named Command Files inside the resources directory of the application bundle. And then I have a single AppleScript file that I've included in here as well. Okay, so now Project Builder has made that easy. Now they're being copied when I build it, being copied inside the application bundle. The next task is to install them at runtime. So we've tried to simplify this a lot by providing a single routine that you can call.
So here it is: install speakable items for this application. You'll pass in the name of those folders where you placed your items in your Resources directory. And then you'll call it, and it's smart enough to go out and create the folder. If the folder's already there, it doesn't create it again. In the case of this demo, I actually call this routine every time at startup, but you could choose to call it lazily later, or you could call it in response to a user specifying it in a preferences dialog, or something like that.
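The sample's actual install routine isn't reproduced here, but a hypothetical version of the same idea might look like the sketch below. The destination path and routine name are assumptions, and a real version would check errors and skip items that are already installed.

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Copy one file (sketch: no error recovery). */
static void CopyFileTo(const char *srcPath, const char *dstPath)
{
    FILE *in = fopen(srcPath, "rb"), *out = fopen(dstPath, "wb");
    char buf[4096];
    size_t n;
    if (!in || !out) { if (in) fclose(in); if (out) fclose(out); return; }
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
        fwrite(buf, 1, n, out);
    fclose(in);
    fclose(out);
}

/* Hypothetical stand-in for the sample's install routine: copy every file
   from a folder inside our app bundle's Resources directory into this user's
   application-specific speakable items folder.  The destination path is an
   assumption about where Speakable Items keeps per-user items. */
static void InstallBundledSpeakableItems(const char *appName,
                                         const char *bundledItemsDir)
{
    char destDir[PATH_MAX], src[PATH_MAX], dst[PATH_MAX];
    DIR *dir;
    struct dirent *entry;

    snprintf(destDir, sizeof(destDir),
             "%s/Library/Speech/Speakable Items/Application Speakable Items/%s",
             getenv("HOME"), appName);
    mkdir(destDir, 0755);   /* parent folders exist once Speakable Items is on */

    dir = opendir(bundledItemsDir);
    if (!dir) return;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.') continue;        /* skip . , .. , hidden */
        snprintf(src, sizeof(src), "%s/%s", bundledItemsDir, entry->d_name);
        snprintf(dst, sizeof(dst), "%s/%s", destDir, entry->d_name);
        CopyFileTo(src, dst);
    }
    closedir(dir);
}
```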
The rest of this file has a tutorial or documentation in more depth than I've talked about here, about creating the items, adding those items to your application bundle, and then how to call this routine and special notes. So it's all there. Okay, let's go back to the slides for a second.
Okay, great. So let's touch on the second method for a minute here, and that's using the Speech Recognition API. Well, you know, just like the previous example, I want to make it a copy, paste, and go solution. So we're trying to provide some really simple routines that you can use. So what I've done is I've broken down the process of recognition, setting up for recognition, and handling the recognition into three easy steps.
And you'll see in a minute where I provide you with a single routine to execute each one of these steps. So let's talk a minute about what the recognition process basically looks like. This is a very simplified, graphical version of it that just kind of cements in your mind what this example is doing and, in general, what the recognition process is and what happens during it.
Okay, so step one, we provide a routine that basically sets up all the recognition objects: it instantiates the recognition system, a recognizer object, and a language model object, and hooks them all together, so it's all set up and ready to go. That's something that virtually every developer has to do when they adopt the Apple Speech Recognition API. The second step is that you need to tell it what commands to listen for.
So in the routine that we give, you pass in the recognizer object, and then you pass in an array of commands. It basically gives the recognizer object those commands to display in the speech commands window. Then you also pass in the language model object, and it gives the commands to that so the recognition engine knows what to listen for. And then for the third step, you need to implement one Apple Event Handler, the SpeechDone Apple Event Handler.
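The sample wraps these steps in utility routines; underneath, steps one and two boil down to a handful of Speech Recognition Manager calls, roughly as sketched below. The command strings and refCon values are illustrative, and error handling is minimal.

```c
#include <Carbon/Carbon.h>
#include <string.h>

static SRRecognitionSystem gRecSystem   = NULL;
static SRRecognizer        gRecognizer  = NULL;
static SRLanguageModel     gCommandsLM  = NULL;

/* Step 1: open the recognition system and create a recognizer and an
   (initially empty) language model.  Step 2: add the commands we listen for,
   tagging each with a refCon we can switch on later. */
static OSErr SetUpSpeechRecognition(void)
{
    const char *commands[] = { "play this song", "get song info", "display visual" };
    OSErr err;
    long i;

    err = SROpenRecognitionSystem(&gRecSystem, kSRDefaultRecognitionSystemID);
    if (err != noErr) return err;

    err = SRNewRecognizer(gRecSystem, &gRecognizer, kSRDefaultSpeechSource);
    if (err != noErr) return err;

    err = SRNewLanguageModel(gRecSystem, &gCommandsLM,
                             "<commands>", strlen("<commands>"));
    if (err != noErr) return err;

    for (i = 0; i < 3; i++) {
        err = SRAddText(gCommandsLM, commands[i], strlen(commands[i]), i);
        if (err != noErr) return err;
    }

    err = SRSetLanguageModel(gRecognizer, gCommandsLM);
    if (err != noErr) return err;

    return SRStartListening(gRecognizer);
}
```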
And so now, your application is just sitting there, you've set up, it's running, ready to handle the user's spoken command. So when the user says something, the recognition engine passes this off to the recognizer object, it then sends an Apple Event to your application. In this case, play this song.
So we provide you a single routine, you pass in the Apple Event that you get, and it returns an ID. And then you can take this ID and map it through a switch statement or a table lookup or however you'd like. To a particular action or routine. Okay, so let's go back to Project Builder. And I'll show you how this is done.
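In sketch form, such a speech-done handler built directly on the Speech Recognition Manager might look like this, continuing the setup above; error handling is trimmed, and the sample's own utility routine will differ in detail.

```c
#include <Carbon/Carbon.h>

/* Installed once at startup with:
   AEInstallEventHandler(kAESpeechSuite, kAESpeechDone,
                         NewAEEventHandlerUPP(HandleSpeechDone), 0, false);    */
static pascal OSErr HandleSpeechDone(const AppleEvent *theAEevt,
                                     AppleEvent *reply, long handlerRefCon)
{
    SRRecognitionResult result    = NULL;
    SRLanguageModel     resultLM  = NULL;
    SRLanguageObject    firstItem = NULL;
    long commandID = -1, itemCount = 0;
    Size propLen = sizeof(commandID);
    DescType actualType;
    Size actualSize;
    OSErr err;

    /* Pull the recognition result object out of the Apple event. */
    err = AEGetParamPtr(theAEevt, keySRSpeechResult, typeSRSpeechResult,
                        &actualType, &result, sizeof(result), &actualSize);
    if (err != noErr) return err;

    /* The result's language model mirrors what was recognized; the refCon we
       attached with SRAddText tells us which command it was. */
    if (SRGetLanguageModel(result, &resultLM) == noErr) {
        if (SRCountItems(resultLM, &itemCount) == noErr && itemCount > 0 &&
            SRGetIndexedItem(resultLM, &firstItem, 0) == noErr) {
            SRGetProperty(firstItem, kSRRefCon, &commandID, &propLen);
            SRReleaseObject(firstItem);
        }
        SRReleaseObject(resultLM);
    }
    SRReleaseObject(result);

    /* Dispatch on commandID here: a switch statement, table lookup, etc. */
    return noErr;
}
```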
[Transcript missing]
Then I call the routine to set up speech recognition that we provide in that utility file. I create an array of the commands because I need that to pass to the add-commands routine that we provide. And then finally, I call the SRStartListening routine that's part of the API, and now the application is up and running.
Page down here, this is the Apple Event Handler. Just a couple of lines of code. I call the routine that we provide in the utility file to convert the Apple Event into the ID. And then through the wonders of Objective-C, I just use it as an index into a table and then basically go off to the appropriate method selector. So that's pretty much it. I really urge you to go out and grab this and see how this can be applied to your application. So let me summarize real quick.
So we saw in the first method, using the Speakable Items application, that it's really easy because you don't have to write any additional speech code. All you need to do is include those items inside your application bundle and then install them at runtime with the routine that we provide. The second method is using the Speech Recognition API.
As I explained, that's an easy three-step process that we give you a single routine to execute each one of those steps. So I've discussed a lazy way to do it. There's more things you can do with the Speech Recognition API, and Matthias is going to come up here and talk about what to do if you're feeling a little bit more ambitious. So thank you.
Thank you, Kevin. My manager assures me that "overachiever" does not necessarily apply to my performance either. So Kevin has shown you how to get 95% of the benefits with 5% of the work. However, there are some situations where you might need the extra 5%. One example of this is Chess. You've seen it demoed. It ships with Mac OS X. As of this week, you can get the source code from this URL here. You will find the speech-related code in Chess Listener.
And Chess illustrates important lessons in language model design. Now you might think, the language model of Chess is not very complex, right? Pawn, D2 to D4, all simple sentences. The problem is, if you just do this as a list of possible moves, it gets out of hand pretty quickly.
If you do the math, you find out that if you just have a model with all the possible moves, you end up with more than 24,000 moves. And clearly this is unacceptable. It doesn't help accuracy, plus you're not doing the user any favor if you're listening to stuff like Rook A1 to H8. This won't do him any good at all.
In fact, it turns out that in each chess position, there are only 20 to 30 moves that are actually legal. So there is no reason whatsoever to include the extra moves. Performance is going to go way up, and user satisfaction is going to go way up, if you only include legal moves.
However, you shouldn't quite over-constrain your model. There are some illegal moves which are still plausible. For instance, people frequently put their king into check accidentally, even experienced chess players. So what you would want to do is to leave a move like this in so you can say, I heard you, but I won't do it.
Another technique that we use in chess is to use prefabricated parts. There are not so many words that are actually used in this language model, so we fabricate them at startup by calling SRNewWord to get these word objects. And then when we come to a position and see that, for instance, pawn D2 to D4 would be appropriate, we simply grab these prefabricated objects and paste them together to form this command.
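A sketch of that technique, showing pawn moves only: build the word objects once at startup, then reassemble a small language model of just the currently legal moves whenever the position changes. The names and structure here are illustrative, not the actual Chess source.

```c
#include <Carbon/Carbon.h>

/* Word objects built once at startup with SRNewWord: "pawn", "to", and the
   64 square names "a1" through "h8". */
static SRWord gPawnWord, gToWord, gSquareWords[64];

/* Rebuild the active language model so it contains only the moves that are
   legal (or at least plausible) in the current position.  The recognition
   system, recognizer, and an empty model are assumed to exist already. */
static OSErr RefreshMoveLanguageModel(SRRecognitionSystem recSystem,
                                      SRRecognizer recognizer,
                                      SRLanguageModel movesLM,
                                      const int *fromSquares,
                                      const int *toSquares,
                                      long moveCount)
{
    OSErr err = SREmptyLanguageObject(movesLM);   /* throw away the old moves */
    long i;

    for (i = 0; err == noErr && i < moveCount; i++) {
        SRPath path;
        err = SRNewPath(recSystem, &path);         /* one path per legal move */
        if (err != noErr) break;
        SRAddLanguageObject(path, gPawnWord);                    /* "pawn"    */
        SRAddLanguageObject(path, gSquareWords[fromSquares[i]]); /* e.g. "d2" */
        SRAddLanguageObject(path, gToWord);                      /* "to"      */
        SRAddLanguageObject(path, gSquareWords[toSquares[i]]);   /* e.g. "d4" */
        err = SRAddLanguageObject(movesLM, path);
        SRReleaseObject(path);           /* the model keeps its own reference */
    }

    /* Point the recognizer at the updated model. */
    if (err == noErr)
        err = SRSetLanguageModel(recognizer, movesLM);
    return err;
}
```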
So to summarize, for complex language models, you will want to constrain your language model to only those commands which are plausible in each situation. And consequently, you adapt the language model when the situation changes. Furthermore, in very complex situations, you might consider using prefabricated language objects to quickly build your list of commands. To build these language models, we've included a tool called SR Language Modeler, which helps you to quickly experiment with different language models and see how well they work for your users.
SR Language Modeler allows both live microphone tests for rapid turnaround, if you want to try something or grab somebody into your office to have him try something. If you want to do systematic scientific research tests, you can record your users saying these commands, record them into AIFF files, and feed those files into SR Language Modeler to get a systematic evaluation of how well this performs.
This tool and all of our sample code, which we shipped with Mac OS X, you will find on the developer CD in the examples/speech folder. And we encourage you to start with that if you want to do anything with speech. So let me now turn our session back over to our fearless leader, Kim Silverman.
For some values of fearless. So to summarize, speech synthesis and speech recognition are there. We've given you a conceptual overview of the APIs, and I tried to give you some ideas about why you would want to use them. And Matthias and Kevin have followed up with how to use them well.
So, I want to encourage you all to speech enable your apps. At this stage, I'd like to single out just a couple of developers who've been doing this. You might remember Thinking Home, which got the Apple Design Award last year. They've ported their application to OS X, and they added speech to it, and they found that that adds a lot of value to their users. I was talking to one of their developers on the phone yesterday.
He said that they're getting a lot of feedback, saying that users just think it's great when they can walk into the room and say things like, "Dim the lights in the living room," or, "Turn the upstairs thermostat to cool." The folk who are working on OmniWeb, you may have seen their cool browser on OS X.
They've been experimenting also with using speech to integrate speech into the browsing experience for those that don't want to or can't deal with the keyboard and the mouse. We saw a prototype of that this morning. It's looking really good. They've got some great ideas about how to do it.
So, I'm going to give you guidelines of good, better, and best about how to put speech into your apps. Good is the easy way. Use speech recognition to allow people to speak the visible controls on the screen, things that they would normally manipulate and say them. And use speech synthesis to speak simple alerts and alert panels when they come up. You can do these, by the way, with either the speakable items framework or by calling the API directly.
If you want to go better, then use delegation. I've mentioned this a few times. You've probably inferred what I mean. Normally, when we interact with a computer, we specify explicitly each step we want the computer to take in order to reach a goal that we have in mind.
With delegation, we delegate the goal to the computer and have it figure out the steps about how to get there and then execute them for us. So, group what would be otherwise multiple interactive actions into one spoken command. For speech synthesis, start to customize your texts using the guidelines that Matthias went through.
If you want any help with those, come and ask us. Read back information to your users. And if you want to be best, then move to interactive spoken dialogues, like you saw Sal demonstrating, where you delegate a goal to the computer or your agent, and it then comes back and asks you questions to refine that goal. And think about using speech for form filling. So that's it. Thanks a lot for coming. I am always--