Application Technologies • 58:22
Speech Technologies have matured and are now mainstream, pervading personal computing, cell phones, cars, and telephone networks. Bring your application into the 21st century by adopting these technologies in your application. Speech synthesis, also called Text-to-Speech (TTS), converts any text into human-sounding audible speech, and thereby delivers information to users without requiring them to shift attention from their current task. Mac OS X Leopard includes new speech synthesis features and APIs that help you easily incorporate speech synthesis into your application. Learn about these new capabilities, and guidelines for how best to use speech synthesis and recognition to give your users exciting modern ways to interact with your application.
Speakers: Kim Silverman, Kevin Aitken, Wil Shipley
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Welcome to the session on speech technology. We call this Let's Talk. I am touched that you guys all came out, and ladies, came out here at 9 o'clock in the morning for us. You must really want to be here. My name is Kim Silverman, and I'm the manager of the spoken language technologies at Apple.
Today we're going to talk about speech synthesis. We're going to talk about speech recognition. I'll give you overviews of those features in the Mac, and give you some guidelines about how to use them well in your applications, and then we'll talk through the APIs in a bit more detail to show you some real code. So let's start with speech synthesis.
Speech synthesis has been on the Mac since the Mac was first launched in 1984, and particularly starting around the time of System 7, it started getting better and better with each release. We currently have quite a range of voices that give you different trade-offs of footprint and technology.
Speech synthesis is the core technology in VoiceOver, which is Apple's screen reader. And today, I'm delighted to be able to announce that we have new speech synthesis. You might have heard this in the Steve Jobs keynote on Monday morning, and I'd like to introduce our new voice, whose name is Alex. And rather than me telling you about it, it seemed to me that I should let Alex introduce himself. So could we shift to the demo machine, please? Let me just plug in the audio, just a moment. I'm going to plug in the audio, this might click.
Okay, we are in. Ready? My name is Alex, and I am the new speech synthesis voice in Mac OS X. I'm not yet finished, so you will probably notice some places where I still need more polishing. Apple has worked on a different approach to speech synthesis technology in order to be able to produce me. People say that I sound more natural and easier to understand than Apple's prior voices. Like some of those voices, I'm based on recordings of a real human.
But unlike those voices, I'm based on much longer recordings and have a very different approach to the signal processing. Another improvement is that I can speak very quickly, and yet people can still understand my speech. Here is a sample. Make sure to be there Monday, July 5th, about 8:30 in the morning.
Try to be on time and bring the most recent results with you. And now, to tell you more about me, let me introduce a man so deep, he is almost unfathomable. A man so quick, he is even fast asleep. A man who is a legend in his own mind. Gandalf. Er, sorry. I mean, Kim Silverman.
So Alex is being a bit of a smart alec this morning. I mentioned that Alex has been designed to be the core technology and work well in VoiceOver. For those of you who are unfamiliar with it, VoiceOver is a feature of Mac OS X for people with low vision or no vision. And it uses speech synthesis to feed back to a person who can't see the screen information about everything going on on the screen, what the front most window is, where the mouse is, and so on. Let's turn it on.
Welcome to Macintosh. VoiceOver is running. System Preferences, Universal Access Window, Back button. So I'll navigate around a few of these controls and you will hear how it sounds. Forward, Dimmed button, Show All button, Seeing, Selected tab 1 of 4, Hearing tab 2 of 4. Now, people who use these things all day, every day, want the speech usually to be faster than this. So let's take the rate up. You can do this with keystroke commands that I'm using here.
Faster, faster, faster, faster, faster, faster, faster, faster, faster, faster, faster. Let's try that. Seeing, Selected tab 1 of 4, Hearing tab 2 of 4, Keyboard tab 3 of 4. But they like it even faster than that, so let's go further. Faster, faster, faster, faster, faster, faster, faster, faster. Turn VoiceOver on or off, Command F5. VoiceOver on, Selected radio off, Radio button 2 of 2. go with voiceover. Thank you.
Now, while I'm in the System Preferences, I'd like to show you one of the features that we've used to incorporate speech synthesis into the user interface. And the reason I'm showing you this is to inspire you with one use of speech synthesis you could use in your own applications.
If I go to the Speech Preferences and the Text-to-Speech tab, there's a control here that's labeled "announce when alerts are displayed." We call this "talking alerts." Here's what this is about. Suppose I'm working on the computer, and then an alert comes up on my screen that wants some kind of response from me, but I'm not looking. My attention is out the window, or I'm daydreaming, talking to somebody else, and I miss the alert and miss responding to it.
In that case, if I don't respond after a certain amount of time, the computer could read out the alert to me in order to get my attention back. So that's what this does. In the alert options, you can set a delay until it starts speaking. By default, it's about 20 seconds. I'll put it back to zero now, and set the phrase to, say, one of these.
So suppose I modify this document. And then I go to close it without saving it. Here's how it sounds. Alert. Do you want to save changes to this document before closing? We've been getting a lot of feedback about this. One guy said to me, "I was crawling around on my hands and knees underneath my desk rearranging some cables, and suddenly this voice unexpectedly came out of my computer saying, 'The internet is being disconnected.' I turned around and, oh, sure enough, my foot had kicked out the ethernet cable." And so, the principle here is, if there are things going on in your application that you'd like your user to respond to, but your user may not be attending, then you can add speech as a way of getting the user's attention back.
Now, another announcement we have today is that we have modified the API to allow a plug-in architecture. What that's all about is that there are other companies like yourselves out there who have speech synthesis technology that has different characteristics from ours. Different voices, different genders, different trade-offs of quality and footprint, or, most relevantly to you guys, different languages.
Thank you. So it's now possible, oh let's just close this, for them to port their speech synthesis to the Speech Manager, and their voices show up in the list of voices just like our own voices do. One of those companies is cepstral.com. You may have heard of them. So I have here a couple of voices from Cepstral. One of them is Katrin, their German voice. "This is the Cepstral Katrin voice." So let's have her introduce herself over here.
Hello, I am Katrin from Cepstral. I am always kept very busy, and so I have not given a talk at an interesting conference for a while. That is why I was deeply moved when I was asked to speak here for Apple. And I have to say, on a Mac I really do feel at home.
For those of you that don't understand German, she just said some very complimentary things about you and about us. So this means if you've got an application that you want to sell into a German-speaking market, and you use speech synthesis, here's a way that you can go.
If you're selling into the English market, there are also some interesting things that you can do with these voices. For example, if you pass them English text, then they will speak that text as a native speaker of that other language would try to speak it. And that can sometimes sound interesting. We think this is very interesting.
We have another voice over here, Miguel, who speaks Spanish. So let's get him to introduce himself as well. "Hello, my name is Miguel. I am Cepstral's Latin American Spanish voice. Delighted to meet you. This is my first time at a WWDC conference. The world of the Mac truly fascinates me." So that's speech synthesis. Thanks.
Can we go back to the main slides, please? So, you use speech synthesis just by passing it text and having it speak it. And the speech synthesizer will analyze the text and make its best guess regarding how you would like that text spoken. And usually the guess is pretty good, but you can do even better.
The problem is, the way we say something depends on why we're saying it and on how it relates to what the listener already knows. So really, to get it right, the speech synthesis engine has got to be a mind reader. And we're not going to release that one in Leopard.
But the solution, the workaround, is that your application knows a lot more about what it's saying and why it's saying it than the speech synthesizer could ever guess. And so you can use that domain-specific knowledge that's inherent in your application to improve the speech. For example, one thing you can do is filter the text.
Take, for example, an application that lets people deal with stocks. It could substitute the actual company name for each of the ticker symbols, such as, rather than saying AAPL, it could say Apple Computer. You can customize the pronunciations. You can customize the intonation. Let's talk about those in a bit more detail. In the case of pronunciations, there are some words that a synthesizer may not pronounce the way you know they ought to be pronounced in your application. Some names are a particular problem like that. Some technical expressions. Some words that are specific to your application.
Well, what developers have done in the past has been what we call "funny spelling", which is they retype a word with different spelling in order to try to get it to be pronounced the way they want. We don't recommend this because pronunciation of non-standard spellings might change as we make different releases of the synthesizer.
And besides, there's a limit to how far you can go with a spelling because you're still working through the letter-to-sound rules. The better way is to use part of the API that's called "phoneme input". That lets you embed in the input to the synthesizer the exact specification of the phonemes that you want it to use when pronouncing something. Let me give you an example. This is a real example.
The name "beardno" was mispronounced by our synthesizer up until about a week ago. Here's how it was pronounced. It's saying something like "by I are now". So, I... substitute in our dictionary this particular phoneme string, and now it says: "My name is Professor Beardno". "Beardno". "My name is Professor Beardno". This phoneme input is very easy to learn. There are about 40-something, maybe 42 symbols, and it's all documented in the speech documentation. This is your insurance against the future. Use this method to customize the pronunciations.
Now, to customize the intonation, the problem that we're addressing here is that once we have intelligibly synthesized the words, we haven't done enough. You know the expression: it's not what he said, it's the way that he said it. The way that you say something can change its meaning. Take, for example, the sentence, "John only introduced Mary to BILL." That means he didn't introduce Mary to anybody else. But if I say, "John only INTRODUCED Mary to Bill," that means he didn't actually, say, encourage her to have a business relationship with him.
Now the problem here again is that automatically determining the correct intonation is difficult in the general case. You've got to have ESP. But your application knows why it's saying what it's saying, and it knows what the listener has already done in its prior interaction. And so you can do better.
And one principle that you can apply is what I call let the user catch up. That is, add pauses at the boundaries between major information chunks in your texts. One easy way to do that is just liberally sprinkle punctuation around and break up larger sentences into smaller sentences.
When we developers and technical folk write the text in our applications, we tend to write it in the style of written text. Written text tends to have long sentences. Human speech tends to have short sentences. So break them up. If there are any pauses that the synthesizer inserts that you'd like to be even longer, you can lengthen them by embedding the silence command shown in yellow up there, [[slnc 500]], into your text. That inserts silence. In this case, it adds another 500 milliseconds of silence.
Let me illustrate the issue here. Suppose you have an airline reservation application that reads back information about a booking to a user. You might populate a text sentence by filling out fields from a database. The values in yellow would be pulled out of a database, put into this frame sentence, and then the text sent to the speech synthesizer. If you do it with this text as written, it sounds like this. Your first flight is with Alaska Airlines Flight 2762, departing from San Jose on Monday, May 24th at 6:10 p.m., landing in San Francisco at 7:10 p.m.
Thanks for choosing Apple Travel. So you can get all the words, but it kind of barrels through it pretty quickly. If the user were trying to jot down the information there, they might have trouble keeping up with it. So in this version, I've added commas, some periods. I've separated out that last line into a separate paragraph, which adds a bit more of a pause and changes its pitch.
And I've added an extra silence of 300 milliseconds before the last phrase of the first paragraph. So that now sounds like this. Your first flight is with Alaska Airlines Flight 2762, departing from San Jose on Monday, May 24th at 6:10 p.m., landing in San Francisco at 7:10 p.m. Thanks for choosing Apple Travel. That gives people a bit more of a chance to catch up and maybe jot down some notes while they're listening. The point is: customize.
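As a rough sketch, here is what assembling and speaking that kind of itinerary might look like in Cocoa, with short sentences, liberal punctuation, and an explicit [[slnc 300]] pause before the final phrase. The flight details are placeholder values standing in for whatever your database returns.

```
#import <Cocoa/Cocoa.h>

// Build the readback text from database fields, with short sentences and
// an explicit 300 ms pause before the final phrase, then speak it.
static void SpeakItinerary(void)
{
    NSSpeechSynthesizer *synth = [[NSSpeechSynthesizer alloc] initWithVoice:nil];

    // Placeholder field values; a real app would pull these from its database.
    NSString *itinerary = [NSString stringWithFormat:
        @"Your first flight is with %@, flight %@, departing from %@ "
        @"on %@, at %@, [[slnc 300]] landing in %@ at %@.\n\n"
        @"Thanks for choosing Apple Travel.",
        @"Alaska Airlines", @"2762", @"San Jose",
        @"Monday, May 24th", @"6:10 p.m.", @"San Francisco", @"7:10 p.m."];

    [synth startSpeakingString:itinerary];
}
```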
Okay, so let's talk about speech recognition. There's a speech recognizer built into every Mac. It's part of OS X. It has a well-documented API that you'll hear about a little later this morning. But there are a few characteristics of the recognizer that I'd like you to know about now. One is that it's speaker independent. Your users don't have to spend hours training it to their voices. They can just walk up to it and use it.
The second is that it's driven by a finite state grammar. What that means is you load up the recognizer with a representation of everything that you want it to listen for, and then tell it to start listening, and it tells you when it hears a string that's part of that grammar. That means it works really well for command and control, for spoken interactive dialogues, but it doesn't do unrestricted dictation. For that, there are some third-party solutions available, such as iListen from MacSpeech.
The third thing is that it works with a far-field microphone. That means for the general case, you don't need to tether yourself to the machine with a head-mounted, close-talking, noise-cancelling microphone. You don't have to go out and buy extra hardware. You don't have to have extra cables dangling across your desk. Now, because it's working with a far-field microphone, that means it hears all the background noises in the room. The reverberation, the echoes, the computer fans, the air conditioning, the traffic going past, all the stuff that we as people are not aware of, it's hearing.
How many of you have ever recorded a meeting on a cassette deck or a dictaphone and then played the tape back later? Who's done that? It's amazing isn't it? You listen to it and think how on Earth did I understand anything? Well our speech recognizer has to deal with that as well, so we have layers of algorithms to dynamically track, adapt to and compensate for the room acoustics.
There are, however, limits to how far it can compensate and we're going to be demoing it in here where the distance between here and the back wall is causing this enormous echo which is just on the limit, actually a little bit past the limit of what we've designed it for.
So I am going to be using a head-mounted microphone for the demos today. And there's a principle there for your users: give them some kind of guidance to know whether they're going to be using it in a place that's too noisy, in which case there are a lot of third-party head-mounted microphones out there.
Now, because it's adapting, that means that if you use it in one place and then go to a different place, a different acoustic environment, it will still think it's in the old place and it may not work so well. For example, you're using it at home, on the kitchen table, at night after the kids go to bed, and then you close up your MacBook, take it to the office next morning and open it up. It still thinks it's at home, and so it takes a while before it can adapt to your voice in the new acoustic environment. And you might think, oh, the recognizer's not working so well.
Well, we have a user interface that speeds up that adaptation, and I'd like to start off by showing that to you now. So could we go back to the demo machine, please? If you go to the speech recognition preferences, then there's a button there called "Calibrate". If I click on that, I get a dialogue that lets me set the volume sensitivity of the microphone. And I should set that so that the peaks of my speech are just coming up to the end of the green or bumping into the red.
Then there is a list of commands on the left. After you have set the volume correctly, you read down these commands, and each command, if it is recognized, will flash. If a command doesn't flash, you say it a second time. If it doesn't get it the second time, just go on to the next command. When you get to the bottom of the list, if it missed any of them, go back up and go through the list a second time.
So let's give it a try. What time is it. Quit this application. Open a document. Show me what to say. Make this page speakable. Move page down. Hide this application. Switch to Finder. Done. We just set a couple of hundred parameters in there. So let's put it through its paces a little bit before I talk any more, and so you can see what happens.
[Transcript missing]
What time is it? It's 9:24. What day is it? It's Friday, August 11th. Quit from System Preferences. Quitting System Preferences. Get my mail. Open my browser. Open my browser. Get Apple stock price. Get Apple. Let's see how we're doing. Ah, 63. Hey, the stock's a bit low, guys. Could you write lots of cool applications for us and get our stock price up? Oh, look at that.
Astronomy Picture of the Day. Astronomy Picture of the Day. Oh, this is cool. Look at that shooting star. Zoom into this. Cool. Zoom back out. Suppose I'd like to mail this web page to a friend. Well, I could go through the steps of copying the URL, finding his name in the address book, launching mail, or I could just say, Mail this to Kevin Aitken. Mail this to Kevin Aitken. One moment please.
Created a message to Kevin Aitken with this URL. So it goes into the address book, finds his address, goes to mail, creates a new message, goes back to the browser, copies out the URL, goes back to the message, pastes it all in, and puts it up in front of my face, ready for me to either type some explanatory text, or just say, Send this message. Send this message. And that goes out as real mail.
Now at this stage, Kevin is very patient, but he's probably sick of getting all these stupid demo messages from me. So I want to telephone him and let him know that yet another one's on its way, but I can't remember his phone number. So I say, phone for Kevin Aitken. Opening TextEdit. Oh. Phone for Kevin Aitken. Phone for Kevin Aitken.
And were Address Book set up properly, it would do it. Why is that not working? Every time I do this, it works. Then I demo it. I have to tell you that this is my personal PowerBook here, and some of the components here are parts of software that we've been working on as we download new versions of Leopard, and sometimes I get interactions like this. Oh well, it would have displayed it on the screen. Try it on your own machines, and it will.
Quit from Address Book. Quitting Address Book. There are some applications that call our API directly. Let me just show you one of them.
[Transcript missing]
Knight G8 to F6. I don't like it when a computer this powerful takes six and a half seconds thinking about what it's going to do to counter my moves. This lets me do something that my 17-year-old son won't let me do when I'm playing chess with him. This lets me say, "Take back move." Take back move.
Now, I've shown you a whole bunch of things; let me explain a little bit about what you've just seen. We have a speech recognition engine that has an API. Applications such as your own can call that API, and already many of you do, and use it to add speech control to your own user interfaces.
But it seemed to us that it's not very good if our users have to go out, track down, and purchase third-party software before they can have any experience at all with speech. So we also make some simple applications that sit in the user interface and do some things with speech. One of those is called Speakable Items. Open the Speech Preferences.
You turn speakable items on in the speech recognition preferences. That's up here. When you turn on speakable items, We create a folder in your home directory called the Speakable Items folder. Open the Speakable Items folder. So here it is, Home, Library, Speech, Speakable Items. Anything that's in that folder can be launched by double-clicking it. And it's exactly the same-- sorry, it can be launched by saying it. And that's exactly the same as double-clicking it.
We pre-populate that folder with a number of items that kind of get you up and going, and some of the things I've been saying are in there. For example, what time is it, and zoom into this. Anything that can be launched by double-clicking, you can now launch by speech, by putting it into that folder.
The real power is that you can add things to that folder to customize it to work well with the things that you do most often. Let me give you an example. Suppose that I have a standard workflow that I always go through, which requires me to always create a standard set of folders with standard names.
Make a new folder. Hang on. Make new folder. There we go. Open this folder. Open this folder. Make a new folder. Make a new folder. Call this "Data To Be Analyzed." Make a new folder. Make a new folder. Call this "Work In Progress." Make a new folder. Make a new folder. Call this "Finished Publications."
Close this window. Move this to the trash. Move this to trash. Empty the trash. Empty the trash. So it's pretty easy to customize it like that. One of the things that you can do to customize it like that is you can add a spoken command to send a keyboard shortcut. And that means that you guys can very easily add speech commands to your applications if you have keyboard shortcuts. Let me give you an illustration of that. Get my mail.
So suppose I want to be able to mark a message as being unread. Now I know it's possible, and I know it's in one of these menus. Let's try Edit, no, View Menu, Message, Mark. There it is. Down in this submenu, Mark as Unread. It's Shift, Command, U. These kinds of keyboard shortcuts are sometimes hard to memorize. So I can attach a spoken command to it by saying, define a keyboard command. And it asks me, "What's the keyboard shortcut I want to send?" Well, I type Shift-Command-U.
And then it asks me what is it that I want to say? So I'll say, "Mark this as unread." That's easier for me to memorize than the keyboard shortcut. So now let's try it. I'll zoom in over here. Zoom into this. Mark this as unread. Mark this as unread. There we go.
Zoom back out. Can I go back to the slides please? So when and where should you use speech? Well, a lot of us play out recorded sounds, alerts, beeps. Back in the 1960s, beep was about all that a computer could do. And somehow we're stuck in that mentality.
But it seems to me we're in the 21st century now. When your computer makes a sound to get the user's attention, you know why you're doing it, so why not tell the user? Use speech. And if you think about the mouse, the point-and-click interface, it's like pointing at things and grunting.
"Ugh! Ugh! Ugh!" There's a lot you can do with it, but I like to think that we've gone beyond just grunting. So how can you go beyond beep and grunt in your applications? Well, one thing you can do is notifications. Talking alerts was an example of that. Another one would be that in iChat, you can set it up to speak to you when somebody wants to chat with you. Mail allows you to set up rules that could say, for example, you've got new mail from your boss.
You might want to hear that rather than discovering it two hours later when you next get to looking at your mail program. So if you have things that you want to notify your people about, tell them. Gosh. Proofreading. You know, we all have spellcheckers, but there are limits to how far a spellchecker can go.
Many of our typing errors make real words, such as form instead of from. Spellcheckers don't find those. Grammar checkers don't do much better. They're not very satisfying. But when you hear your prose read out to you, those errors jump out. You hear the grammar mistakes immediately. You also get a much better feel for the overall flow of the text.
Games are a good opportunity to use speech. If you have non-mission-critical interactions that you want your user to have, speech can be a good alternative modality for that. For example, a few years ago there was a car racing game where you could radio ahead to your pit stop crew to tell them what they needed to be ready to do to your car when you next came around, such as changing the front right tire and topping up the oil. So speech was good for that.
Education: teachers keep telling us they love speech synthesis, because when kids interact with an application, if it talks, it engages them much more. A talking math application will engage a kid for 20 to 25 minutes, whereas without the speech, it only engages them for about five minutes. And while kids are hearing things spoken back, it's reinforcing their literacy, whether they know it or not.
Web browsing: you saw me surfing the web by voice. There's a lot more that could be done in that direction. Speech is good for form filling, if you have an application where there are fixed alternatives that could go into a field. For example, a finance application where people can write checks and have a list of people that they normally write checks to. So you could say, make out a check to Macy's. In general, if you have any user interaction scenario where people's eyes are busy and their hands are busy, speech provides another modality for you to allow them to interact. And it's kind of cool, you know.
And finally, Delegation. Normally we interact with a computer by specifying the steps we want the computer to take to achieve a goal that we have in mind. A better user interaction model is to delegate the goal to the computer and have it do what it's good at. Have it figure out the steps along the way to get there. If, along the way, it needs more information to refine how it gets to that goal, it can come back and ask you.
Sending a web page to someone was an example of delegation. If you're going to be successful with speech recognition in your app, then here's a few things you should do. First of all, educate your users. Let them know what the user interface elements are, like the speech commands window, which shows people what they can say. Explain about calibrating, the way I just explained it to you.
And when you go back to your hotel rooms tonight to try out the speech, what's the first thing you're going to do? Give them examples of how to speak. For example, ship an AIFF file recording of you saying something to your application so they hear the style of speech, so they know that they shouldn't pause between words. They shouldn't shout at the computer. They shouldn't mumble. They shouldn't over-articulate.
If you want some guidelines, look at the helpful tips in the speech preferences for what we say. Make users aware of background noises. Often people say to me, the speech recognition wasn't working for me. And when I look into why, well, they have the microphone listening all the time. And they're using speech recognition. Well, somebody comes into the room, tells them a joke. The computer thinks it hears the other person, responds to it. They both start laughing. There's overlapped speech going on. And then they say, the thing doesn't work.
Make people aware of background noises. And in the context of doing that, help your listeners to know when they should be considering using a head-mounted microphone. Such as, if you have an application to be used in a warehouse where there's a lot of machinery around, you might want your users to wear a head-mounted microphone in that case.
There are two listening modes: Push-to-Talk and Continuous Listening. Push-to-Talk is the mode that I was using, where it's not listening unless you press a hotkey. That addresses the following problem: speech recognizers are great at recognizing what they're supposed to recognize, but terrible at knowing when you're not talking to them. And so in situations like this, where I'm doing a demo, 99% of the words that I say are not directed to the computer, and so I have it in Push-to-Talk mode.
You can tell it to listen continuously and if you do, then it will require a key word before each spoken command. You can even turn that key word off and then the microphone is listening all the time. Very dangerous. If you're in your office by yourself, that will work quite well until the phone rings and you have a conversation on the telephone and then it's dangerous again. So educate your listeners about the costs as well as the benefits of continuous listening.
And give them some guidance about non-native speakers. We have trained the speech recognizer on 750 speakers of varieties of North American English, including a few non-native speakers in there. Because we're doing statistical modeling of the acoustics of speech with hidden Markov models, there's some forgiveness in the recognizer. So as you can hear, I'm Australian. In my group, Jia is a native Chinese speaker. Matthias is a native speaker of Swiss German. Jerome's a native French speaker. Devang is a native Hindi and Gujarati speaker. Heavens, Kevin Aitken's from Oregon.
And it understands all of us. So there is some forgiveness in the models. Nevertheless, let your users know that the further their English is from North American adult English, the more chance there is that the recognizer might trip up. So, we've talked a lot about speech synthesis and speech recognition. I'd like to hand over to Kevin Aitken, one of our senior engineers, who will tell you about how to put this into your own applications.
Kevin? Good morning. As Kim said, I'm Kevin Aitken. So, what I'm going to do is dive right into some of the programming details. That's why I'm here. First, I'm going to start talking about speech recognition, and then I'm going to talk about the APIs for speech synthesis.
So let's start with accessing speech recognition. There are four main ways that you can access our speech recognition engine, and I'll go through each one of those, giving you some ideas of what the API looks like. So if you're writing with AppleScript, we have two commands. The first one is the "listen for" command. It's great for a simple Q&A. You give it a list of commands to listen for. It stops listening once it's heard something, or after its optional timeout, and then you can take the result and do something with it.
If you need a little more flexibility, you can use the listen continuously for command. Unlike the other command, it continues to listen even after it's heard something. That way, you can update the commands depending on what you want to do. And then once you're finished listening, you can just call stop listening.
So now, Kim gave you an overview of Speakable Items, demoed what it looks like to the user, and then he also delved into the Speakable Items folder, so he showed you what an item is. If you have a scriptable application, you can use AppleScript files. That's really great.
Otherwise you can use keyboard command files to map a keyboard shortcut to a particular spoken command. And you can even create many little applications for your spoken commands that can communicate with your application in whatever way you want. The simplest way to get started is just to embed those items inside your application's bundle.
You just put them in a folder inside the Resources directory, set a key in your Info.plist, and when Speakable Items is running, it finds that in your application and puts those commands in a list. Or, if you want to give the user a little more flexibility to rename and delete your items, you can actually install those into the user's Speakable Items folder that Kim showed you.
If you're writing a Cocoa application, we have a great class called NSSpeechRecognizer. It's designed for fast, easy implementation of speech recognition. It has a number of ways that you can change what it's listening for as it's running. So, I'm going to give you a little bit of code as an example of what it looks like. Let's say you're writing a game program. It's a maze.
The user can move in four directions to move their little token from one side of the maze to the other. And so you've set it up so they use the keypad or the arrow keys. And then you go to a talk like this and you go, wow, you know, I can have the user say one of those directions. So, what might the code look like? So we're going to start off by instantiating a speech recognizer. And then we're going to set the delegate to an object that you want to receive the message of when one of those commands is recognized.
Then we're going to tell the recognizer what to listen for. In this case, it's just an array of strings. Each one of those strings is one of the directions that you specified. And then finally, we're going to start listening. So now the user is using your maze, having fun, making their way to the end. And they say one of those commands. And so your delegate object is going to get the didRecognizeCommand: message along with the string that it heard.
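Put together, a minimal sketch of that flow might look like this. The MazeController class, the command phrases, and the moveTokenBy: helper are made up for illustration; the NSSpeechRecognizer calls are the ones just described.

```
#import <Cocoa/Cocoa.h>

@interface MazeController : NSObject {
    NSSpeechRecognizer *recognizer;
}
- (void)startListeningForDirections;
@end

@implementation MazeController

- (void)startListeningForDirections
{
    // Instantiate the recognizer and make ourselves its delegate.
    recognizer = [[NSSpeechRecognizer alloc] init];
    [recognizer setDelegate:self];

    // Tell it exactly what to listen for: just the four directions.
    // (Note the nil terminator on arrayWithObjects:.)
    [recognizer setCommands:[NSArray arrayWithObjects:
        @"Move up", @"Move down", @"Move left", @"Move right", nil]];

    [recognizer startListening];
}

// Called when one of the commands above is heard.
- (void)speechRecognizer:(NSSpeechRecognizer *)sender
     didRecognizeCommand:(id)command
{
    if ([command isEqual:@"Move up"])         [self moveTokenBy:NSMakePoint(0, 1)];
    else if ([command isEqual:@"Move down"])  [self moveTokenBy:NSMakePoint(0, -1)];
    else if ([command isEqual:@"Move left"])  [self moveTokenBy:NSMakePoint(-1, 0)];
    else if ([command isEqual:@"Move right"]) [self moveTokenBy:NSMakePoint(1, 0)];
}

- (void)moveTokenBy:(NSPoint)delta
{
    // Hypothetical game logic: move the player's token and redraw.
}

@end
```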
And then what you can do is you can just easily compare that to one of the known directions and move accordingly. So that's a really simple example of a fictitious program. What I'd like to do now is show you a real cool application that a developer has used this API to enhance with speech. So what I'd like to do is bring up Wil Shipley, CEO of Delicious Monster, to give us a demo of that. Thank you.
Hello. Thanks for coming out at 9:00. This is probably the only time you'll ever see me before noon, so live it up. We just bought some books and CDs to test this to show you guys that this is not a canned demo. If you guys want these, see me after the session, because we don't really want to take them back with us. Let's see. I need to change-- can I have demo whatever it is thingy? I've always wanted to say that. That's so fun.
The power, I'm going to be like, could you switch it to the demo in the next room? Can you go back? I must use this only for good. Um... So I was instructed to, uh... Yeah, can I go back to demo again? I was instructed to calibrate this before I use it.
I had no idea about this calibration stuff, and I put the voice stuff in two years ago, so that tells you something. I don't know what. But it's probably something really interesting. So I'm going to calibrate now, which is really exciting for all of you. Actually, I think this is kind of magic every time I do it.
So there we go. What time is it? Quit this application. Open a document. Show me what to say. Make this page speakable. Move page down. Hide this application. Switch to Finder. That was pretty neat. That was 100%. I mean, I don't know. I just got up here, I set it up. I'm just like, I find that completely magic.
Okay, so here's the point where I'm going to do something which you're never supposed to do, which is I have not tested these. But I was told to show the app a little bit. I don't want to turn this into an ad for my app, because you guys really don't want to be sitting here paying to watch me do ads.
But the basic idea of the app is this gives you a physical catalog or a virtual catalog of all your physical stuff. So it's sort of like iTunes for your physical items. It doesn't do your electronic items, and it doesn't try to actually have the content of the items. It's just like a card catalog into those items.
So you can catalog the books that are on your shelf and the CDs that are on your shelf and your video games and your blah-biddy-blah-biddy-blue. The cool thing about it is these apps have existed for 30 years. Really, all we ever wanted to do with computers is catalog our crap.
The cool thing about it is this is really easy to get it in. So what I do is it scans the barcode. So I'm going to show you the first part of voice feedback, assuming this machine is mic'd up. Hopefully the audio is on this too. So I'm going to pop up the iSight camera on this, point it at this barcode. And please, please, please, please.
Now I think today we've all learned a valuable lesson about licensing software. Stealing is wrong, and I want you to take that home with you. It's not something you want to do. So what I'm going to do, and this is actually really teaching me a lot about how I should ship the program.
What I'm going to do is go ahead and I'm going to delete a bunch of items so that I'm under the 25-item limit. But you saw how fast that would have worked if it had worked. So here we are. This is the barcode. Hold it up to the iSight. Scans it in. I can go on, do other items. Look, I failed. Is this on the net? It's really slow. I see. Wait.
Me and My Gang. All right, so there's that. Did that. Mm-hmm. Mm-hmm. Yes, we're not going to go over that now. This is why you never do demos on items that you haven't actually checked in the lighting conditions that you're going to do them in. Game of Shadows: Barry Bonds, BALCO, and the steroid scandal that rocked professional sports. All right, so that's scanning items in. And you can see I'm using the feedback there to read out the item title, which seems kind of neat and kind of magic.
And the big point behind that was I did want to make it seem magic. I wanted the user to understand what they did, because with just the beep, it would take a little while for it to load in, and they weren't really sure what had happened. And they'd usually moved on to another item, because it is threaded, as you may have seen. I was actually able to read in other barcodes while it was looking up.
And so you get behind. And so this isn't just adding flash. It actually adds something useful, which is, if you're sitting there and you go beep, beep, beep, beep, and then it fails, it says "lookup failed." And so you'd hear, oh, Lost Lake, and then lookup failed. And you go, oh, Lost Lake.
Oh. This one didn't work. I better type it in by hand. So we're not just talking about flash. We're talking about actual functionality that's enabled by speech when someone's not looking at the screen. And similarly, you can actually look up things by title here. And I'll show this.
"Was I supposed to do something before this? No, not. So I'm going to hold down escape and I'm going to just say the name of one of these. Mario Party 5." Mario Party 5. "Grand Theft Auto: Vice City." Grand Theft Auto: Vice City. "So you can see it's just selecting the item that I say.
Now that's really cool because we actually tested this, and you can actually be, and they don't guarantee this, so these aren't Apple's words, you can actually be like 30 feet away from a microphone in one of these machines and say fairly clearly, in a relatively quiet environment, what you want, and it'll get it.
So you could actually have this continuous listen and be in your library and say, 'Oh, did I scan in Mario Party?' And the thing will go, 'Mario Party?' And you're like, 'Oh, yeah, I got that one.' And so you can just read off your bookshelf, and it actually works. It's really, really cool.
Something I just discovered when I was doing all this, when they were showing it to me. Whoops, there's the source code for it. Well, I'm going to show you the source code; it's not too long. It's not that bad. Did anybody notice the crasher in the code earlier? No one? He was showing Cocoa code. Anyone see the crasher he did? What was it? He didn't nil-terminate his array when he called arrayWithObjects:.
So one of the things I discovered is, oh, Speakable Items on/off. That's supposed to turn it on. So now it's on, which it was before, but I'm pretending it wasn't. So yesterday, he's going through this. I just got out of this session. That's why this is so random. Yesterday, we were going through this with Kim, and he's showing me the Speakable Items.
Where's the other thing? He was showing me this, and as you can see here, let me see if I can do this little trick. Isn't that neat? All these things that are just coming from my app for free because it's a Cocoa app. Now, I didn't program these, and it really freaked me out because, I mean, you can see things like open page on Amazon's with this title, which I didn't add.
I mean, I added it to my app, you know, as a menu item, but I didn't add that as a speakable item. But it turns out in Cocoa, you just get it: your buttons, the titles or the labels, actually get used as speakable items for free. Which is really amazing. So our apps are actually already more speakable, or unspeakable in some cases, than we thought.
So, for example, I'll actually do this and I'll say, "Hey, uh, my info." My info. And it switches to my info tab. Details. Details. Similar. Similar. I forgot what that command was now, sorry. I was going to do the show info, blah blah blah blah blah blah blah. Yes. View other formats on Amazon.com. I probably said dot wrong.
Oh, I'm looking at the wrong one. Yeah. What did I do with it? Speech Commands Window. Right, right, that Speech Commands Window. Oh, oh, oh, yeah, I got it from a different place. That's why we didn't see that. Toggle iSight Monitor. Toggle iSight Monitor. Open page on Amazon on books with this author. Open page on Amazon on books with this author. So that's pretty cool.
Now the big story here is I've always been a skeptic on speech technologies and with apologies to the people who actually wrote all this stuff. I've always been a huge skeptic and I think a lot of people are. I think a lot of people think, you know, well, that's not really what I'm going to do with it when you see demos where you're like, move up a pixel, move up another pixel, move up, oh, damn it, I'm just going to use the mouse, you know? And I think the story is we need to start thinking of this as a way to do hands-free input when your hands are doing something else and we need to think about doing output feedback when your eyes are doing something else.
And the example Kim gave was an awesome one where the computer says, hey, I just came unplugged. How great would it be if you had a server machine and when you unplugged that server, it said, hey, hey, I'm unplugged, you know, you unplug the network, I'm not, you know, I'm not serving web pages anymore. Hey, hey, and it just sat there.
Hey, hey, hey. That'd be good feedback, right? I could really use that. I'd save a lot of money. And I see my time is up. So, yeah, so in conclusion, I think this is really great technology. I hope we get a lot more APIs in Cocoa. Please, please, please, please, please. And I hope you all use it. Thanks.
Thanks a lot. Okay, slides. Slides once again. Alright. Thanks a lot, Wil, and especially for debugging our demo code. That's what we get for writing our demo code without compiling it. Okay, so if you need more functionality than what we provide with the Cocoa API, you can use the Carbon Speech Recognition API. It gives you full access to the speech recognition engine, access to a very powerful way of defining what it should listen for, and a little bit of control over the user interaction.
But because we give you all that power to define what it's going to listen for, you need to be careful. So in the case of chess, it's easy to just say, oh, well, that's just a piece, followed by a square, followed by the word "to", followed by a square. Well, if you multiply that all out, you've got over 24,000 spoken commands. The message here is: constrain your language model and listen only for what's appropriate. That improves accuracy, because we're not listening for all those invalid commands in the case of chess.
But, you know, you'll want to leave a few alternative phrasings in there just to make it easy on the user. So a language model, again, is just a way of telling the recognition system what it should listen for. One way we might define the language model for chess is to have a branch for each one of the piece types, and under that, the moves that are valid for that piece at that moment. As the user is playing the chess game, we're going to update this so it's always correct.
So what we do is we give this to the speech recognition engine and it searches through trying to find the best match, and then once it does, it's going to make a copy of that original language model and hand that back to you, trimming away everything that wasn't valid, just the valid portion. And then you're going to take that and then map that onto a particular move.
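As a rough sketch, loading a constrained language model with the Carbon Speech Recognition Manager looks something like the following: a simplified, flat version of the branching model Kevin describes. The chess phrases and the helper function are illustrative, error handling is omitted, and the exact parameter types should be checked against SpeechRecognition.h.

```
#include <Carbon/Carbon.h>
#include <string.h>

static SRRecognitionSystem gSystem;
static SRRecognizer        gRecognizer;

// Rebuild the language model from the currently legal moves and start
// listening. A real chess app would call this again after every move.
static void ListenForMoves(const char *movePhrases[], int phraseCount)
{
    SRLanguageModel model;

    SROpenRecognitionSystem(&gSystem, kSRDefaultRecognitionSystemID);
    SRNewRecognizer(gSystem, &gRecognizer, kSRDefaultSpeechSource);

    // A small, flat model: one phrase per currently valid move.
    SRNewLanguageModel(gSystem, &model, "chess moves", strlen("chess moves"));
    for (int i = 0; i < phraseCount; i++)
        SRAddText(model, movePhrases[i], strlen(movePhrases[i]), i); // refCon = move index

    SRSetLanguageModel(gRecognizer, model);
    SRStartListening(gRecognizer);

    // Recognition results arrive asynchronously (via your callback or an
    // Apple event); the refCon on the matched phrase maps it back to a move.
}
```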
So as far as just a summary of how you use speech recognition API, you're going to create a recognition system and a recognizer object, set up your callback routines, set up the language model, and begin listening. And then finally, as those commands come through, then you're going to handle the recognition result. Let's talk about speech synthesis. Very similar. Four main ways to access it.
AppleScript: we have the say command. Here are a couple of examples. You can just say a string. You can say it using a specific voice, if you want perhaps a female voice. You can even, let's say, convert some of your news articles into a podcast by saving the speech to an AIFF file and throwing it on your iPod.
We also have a say command line tool. Similarly, you can just say a string, say it with a particular voice, and save it off to a file. One thing Kim loves to do is, if you've got a long compile or something, you put that command in and then follow it with the say command. That way, he can kick it off, move the Terminal to the background, answer some emails from his team, and then he hears the computer say, in whatever voice he's chosen, "Your compile is finished."
If you're using Cocoa, just like with speech recognition, we have a class for that. It allows you to do asynchronous speaking. You have a number of notifications. You can do the bouncing ball across the words. You can animate a mouth to speak along with it. You have access to the voices, and you can integrate it with the speech recognizer objects as well.
So here's a little bit of an idea. Hopefully there won't be any bugs in this one. Instantiate the synthesizer, set the delegate if you want to get those notifications, and then you just start speaking the string. And if you've signed up for one of the notifications, such as the "Did Finish Speaking" message, you'll get that, so you can update your UI or whatever you want to do.
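That pattern might look something like this; the StatusAnnouncer class and its contents are hypothetical, and the delegate method shown is the "did finish speaking" notification just mentioned.

```
#import <Cocoa/Cocoa.h>

@interface StatusAnnouncer : NSObject {
    NSSpeechSynthesizer *synthesizer;
}
- (void)announce:(NSString *)message;
@end

@implementation StatusAnnouncer

- (void)announce:(NSString *)message
{
    // Instantiate the synthesizer (nil voice = the user's default voice),
    // sign up as its delegate, and start speaking asynchronously.
    synthesizer = [[NSSpeechSynthesizer alloc] initWithVoice:nil];
    [synthesizer setDelegate:self];
    [synthesizer startSpeakingString:message];
}

// Delegate notification: the utterance finished (or was stopped), so you
// can update your UI, speak the next queued item, and so on.
- (void)speechSynthesizer:(NSSpeechSynthesizer *)sender
        didFinishSpeaking:(BOOL)finishedSpeaking
{
    // Hypothetical follow-up work goes here.
}

@end
```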
So we also have a Carbon version of the speech synthesis API. That allows you full access to the synthesizer. Oops, let's go back. Direct access to the synthesizer, so you can set and get properties. We have more notifications. And there's a few utility routines for converting text into phonemes and support for custom dictionaries, so you can fine-tune particular words.
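Down at the Carbon level, the basic shape is: open a speech channel, set whatever properties you need, speak, and then wait for (or get called back about) completion. Here is a minimal sketch, assuming the default voice; a real application would install a speech-done callback rather than busy-wait, and would check the returned error codes.

```
#include <ApplicationServices/ApplicationServices.h>
#include <string.h>
#include <unistd.h>

// Open a channel on the default voice, bump the speaking rate, speak one
// string, and wait for it to finish.
static void SpeakStatus(const char *text)
{
    SpeechChannel channel;

    NewSpeechChannel(NULL, &channel);   // NULL = default system voice
    SetSpeechRate(channel, 200 << 16);  // rate is a Fixed value, in words per minute
    SpeakText(channel, text, strlen(text));

    while (SpeechBusy() > 0)            // crude wait; real code would use a done callback
        usleep(10000);

    DisposeSpeechChannel(channel);
}
```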
So like speech recognition, it follows a general pattern of creating a speech channel, setting a callback routine, setting any properties you need, such as rate or pitch, and then you start speaking, and then you handle any of those notifications. So that's just a summary of the APIs; there's a lot more detail in the documentation. So with that, I'm going to let Kim sum up and finish out the session.
Thanks, Kevin. So, in the last couple of minutes, to summarize: you've seen the new speech synthesis, and you've got some guidelines about how to use it, and speech recognition. Regarding that new speech synthesis, this is a work in progress. We have a lot more work that we want to do on it.
So please, please, please help us out. Run it. Listen for things you don't like. And file bugs and send them to us. We need your help. We can't test all of the text that you'll be testing. And in particular, make sure it works well on the kinds of strings that are in your applications.
For speech recognition, We don't think there's a lot of value in trying to use speech recognition to replace the mouse completely. The mouse allows you to point at and click on an object on the screen in a single gesture. The value of speech recognition is for things that you can't see to point to, to click on. If you have items within menus, within submenus, within other submenus that would scroll off the screen, speech recognition is a good way to get to those.
Consider at least adding keyboard shortcuts to all of your buried items, so that you can then ship Speakable Items so that people can speak them. Look at other ways: you can maybe ship some AppleScripts as Speakable Items that drive your commands, or, even better, call the API directly.
We have more information. If you have any questions, don't call me. I'm terrible at answering email. Contact John Galenzi, because he's good at getting back to people, and he bugs me. There are some great examples of code in /Developer/Examples/Speech. Download some of those. Those do compile, and they do run.
And we've documented all the speech APIs, speech recognition and speech synthesis. We are holding a lab session from eleven o'clock until one o'clock today. We don't have time for a public Q&A right now, but we'll be hanging around outside this room from the end of this presentation, including Kevin Lenzo. Kevin, could you stand up please? Kevin Lenzo is the CEO, founder, and inspiration of cepstral.com, and he's here to answer any questions about his voices, how to get them, or the experience of porting them to the Mac. So we'll be out the back. Thanks a lot. Thanks for coming.