General • 1:15:23
This session is an in-depth look at Speech Recognition and Synthesis (Text-to-Speech), offering high-quality, natural-sounding output. Speech-enabled applications offer many benefits to Mac users, especially those with accessibility needs. Learn what you get for free with your Mac OS X applications and what you need to do to deliver the Speech experience to your customers.
Speakers: Kim Silverman, Tom Bonura
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Well, welcome to the last session of the day and the last session of this week, I think. I think we're the last on the calendar. I mean, there are lots of other sessions running at this time, but this is the last one in this hall. And this session is really important in many ways, primarily because Speech Recognition and Synthesis plays such a key part in the universal access component of the next release of Mac OS X, known as Jaguar.
And your application can build on the speech technologies that are in Jaguar and get a lot of stuff for free. We're going to explain what those things are, and then you can do a lot of extra stuff to make your speech synthesis or recognition work even better and get better performance out of it.
So speech is a really important component in terms of our universal access. It's also a really important component in terms of providing a full, complete user experience on Mac OS X because the technology is really great. So to go more in-depth on this stuff, I'd like to introduce Kim Silverman, who runs the Speech Technologies Group at Apple. Thanks, John.
Thanks. Hi. First of all, I'd like to thank you all for staying till the end of a very long and very exhausting conference. You know, I think four out of the last five WWDCs, Speech has been the last session of the conference, and we've learned that one advantage of that is that only the people who really want to know about Speech are there. So thanks for being here. Can I ask, since it's a little sparse in here, that you come forward?
Move forward. I know you're probably tired and sitting and comfortable, but if you can find it within your legs to move, come on forward so that we can have a more intimate session here. Thanks. Yeah, come on up, come on up. There are plenty of seats down the front. You'll enjoy yourselves more. Okay.
So what we're going to talk about today is a little bit about why to adopt Speech, but we're not going to spend much time on that because we think it will become obvious as you come through the presentation. Then we'll talk mainly about introducing what's new in Jaguar, and then take you through some scenarios, as developers, of what you should adopt first. We'll describe what we would consider to be a good adoption of speech, a better adoption of speech, and then the best adoption of speech. So this is me.
Okay, to start off, I know that there are a bunch of people here who are familiar with the Speech that we've been shipping for a while, and there are also people here who are new to the platform. I know there are a whole lot of people who've come over from Unix and people coming over from Windows.
So I'd like to first of all bring everybody up to speed. Those of you who are already familiar with the Mac, please bear with me. I'm going to talk through briefly what we have been shipping in the previous version of OS X. Is this microphone on? Great. I'd like demo machine number three, please.
Great. Okay. You turn on Speech from the Speech Preferences panel. Under the Speech Recognition tab, there's an On switch. Let's get it to say a few things and then we'll talk about how it works. It dynamically adapts itself to the acoustic environment in which it's finding itself. So I have to now give it a chance to adapt itself to me being up here speaking into a large auditorium with echoes of whatever the wavelength of this room is and through a PA system. So bear with me for a moment. What time is it? What time is it? It's 5:03.
What day is it? It's Friday, May 10th. Usually it takes between one and three sentences. If you've been using it successfully in one room and then you shift to another, expect that for the first one or two utterances it will act like it's not hearing you. That's when the adaptation happens.
[Transcript missing]
Okay, let's see. Quit this application. Quit this application. Get my mail. Open my browser. Open my browser. Hide this application. Open my browser. Let's scroll this page down a bit. Actually, I can do this by speech. Move page down. We'll select some text here. Copy this to the clipboard. Switch to Mail.
Reply to this message. Reply to sender. I'm glad that happened. We'll refer to what just happened in a moment. Paste the clipboard here. Send this message. Send this message. I'm going to quit this application. Alright, let's try: switch to Chess. Pawn D2 to D4. Oh boy, I'm thinking on my feet here. Knight B1 to C3. That made it think. Look at that. Uh-oh, I'm in trouble. He's going to bring his bishop out. This lets me do something that my 13-year-old son will never let me do when I'm playing chess with him. It lets me say: take back move.
Okay, let's talk a little bit about how this works. We have a Speech Recognition engine which has a well-documented, mature API, and applications can call this engine. We ship some applications that call it, and some of you guys do, and we hope after today more of you will.
The Speech Recognition Engine itself has a number of characteristics that are important. First of all, it's speaker independent. You don't have to train it to your voice. You just open it up and turn it straight on and use it. For example, I had trashed the preferences just before we came up, so what you've seen me do here is straight out of the OS X CD. There's no adaptation to your voice. Secondly, it's a continuous speech engine. You don't have to pause between words, which is a great relief.
It's a continuous speech engine and it has a large vocabulary. We have 120,000 words in the dictionary. If a word is not in the dictionary, then we decompose it linguistically. We know about known endings and word components and look up its stems in the dictionary and then reconstruct its pronunciation. If that doesn't work, then we fall through to statistically trained letter-to-sound rules, trained with classification and regression trees, and we have different sets of rules for names and for the rest of English. So it's a large vocabulary engine.
And we use it for command and control. The way we do this is by driving it with a finite state grammar. That means to call the engine, you tell it what to listen for. You express that as, in the simplest case, a flat list of commands, or you can give it a full grammar, such as "set up a meeting with <name> at <time>", where within the grammar, <name> would then expand to, say, all of the names in your address book.
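For reference, the flat-list case looks roughly like this through the Carbon Speech Recognition Manager. This sketch is not from the session; the calls are the ones in the shipping SpeechRecognition.h headers, the command phrases are just examples, and error handling is omitted.

    #include <Carbon/Carbon.h>
    #include <string.h>

    static SRRecognitionSystem gSystem;
    static SRRecognizer        gRecognizer;

    // Build a flat list of command phrases and start listening.
    // Results arrive as kAESpeechSuite / kAESpeechDone Apple events.
    void StartListeningForCommands(void)
    {
        SRLanguageModel commands;

        SROpenRecognitionSystem(&gSystem, kSRDefaultRecognitionSystemID);
        SRNewRecognizer(gSystem, &gRecognizer, kSRDefaultSpeechSource);

        SRNewLanguageModel(gSystem, &commands, "<commands>", strlen("<commands>"));
        SRAddText(commands, "what time is it", strlen("what time is it"), 0);
        SRAddText(commands, "open my browser", strlen("open my browser"), 0);
        SRAddText(commands, "quit this application", strlen("quit this application"), 0);

        SRSetLanguageModel(gRecognizer, commands);
        SRStartListening(gRecognizer);
    }

Roughly, for the "set up a meeting with <name> at <time>" grammar, you would instead build a path with SRNewPath, add the fixed words and an embedded sub-language-model containing the names with SRAddLanguageObject, and install that path as the language model.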
Okay, so what I've been showing you here has been a combination of a couple of things. First of all, the chess application, since it's in front of us, is an application that's a front end to GNU Chess and that calls the API directly. The rest of what I was doing, which I turned on from the speech preferences, is an application called Speakable Items.
From the Speech Preferences, you turn that on under the On/Off tab of the Speech Recognition tab. Now, Speakable Items is a very simple idea. When you first turn it on, we create a folder in your home directory called the Speakable Items folder. Open the Speakable Items folder. Anything that's in this folder can be launched by speaking it, and it's just the same as double-clicking on it. So anything that you can launch by double-clicking, you can launch by speech. Applications, documents, folders, anything that you can double-click on.
We pre-populate that Speakable Items folder with a kind of a starter kit of a few basic commands that can get you going. The ones I've been saying today are all there. What day is it? What time is it? But the real power of this is that you can add your own items to it, and thereby customize the Speech Recognition for the way you work. All right. Let's talk about the structure of the user interface for a little bit. This is our Speech Feedback Window. It has a number of components.
The Speech Feedback Window, as we have found over the years of using Speech Recognition, gives users the minimum amount of information they need to have a successful experience with Speech. If I turn it on, it should come back up. Okay. You can see that it says ESC in the middle. Is the screen large enough? Can you see that? Who can't see the ESC? I don't see any hands, so I guess you can see it. Good.
That means that by default, the push-to-talk key is the escape key. That means when the escape key is pressed, it's listening, and when the escape key is not pressed, such as now, it's not listening. Good thing. Otherwise, it would be trying to recognize me while I'm talking to you. And there's an important point there. If you put Speech Recognition into your products and you're demoing it, demo it in push-to-talk mode, because while you're demoing, you'll be talking about it.
Across the industry, Speech Recognition technology is good at recognizing what you say to it, but it's not good at knowing when you're not talking to it. When I press the push-to-talk key, the escape key, the ESC disappears and the microphone stops being grayed out. You see that? That shows me that the application is not wedged, it's running, and I've got the right push-to-talk key. And then there are some lines, colored lines down the bottom. You see blue, green and red? They are a VU meter.
They're just like the VU meters or the LEDs on your home stereo. And they tell you that the microphone is connected and it's working. They also give you a measure of the background noise and of how loud you're speaking. If I press it now and don't say anything, you see there's about one line of blue there? That's measuring the background noise; although we may not notice it, there's quite a lot of it up here. The building's air conditioning, the noise coming from next door, or the hums from... oh gosh, far more computers here than I would like to own.
When you're using it, we can cope with one, maybe two lines of noise. Above that, performance starts to degrade. So if you try it on the train, in the subway, you might find that there's too much noise. As I speak, if I get too close to the microphone, I see lots of red. That means it's clipping. If you speak to it and it doesn't recognize you, don't yell, because that also will make it clip. You don't want there to be any red.
While I speak, you can see that there are some black triangles moving towards the microphone. See them up there? They kind of look like angelfish. That shows that the Speech Recognition search is happening. So whenever you see those, you know it's trying to recognize me. One of the user interface issues with Speech is knowing what to say. We'll have more interesting things to say about that later, but let me just show you the current user interface for dealing with that. Open the Speech Commands window.
Whenever any application runs the Speech Recognition Engine and calls the API, it gets these two pieces of user interface for free: the Speech Feedback Window and the Speech Commands Window. The Speech Commands Window is a nice Cocoa application. Love Cocoa. It's divided into two parts. The top part shows you what has just been recognized and if it spoke back to you it shows you what it said.
The bottom part is broken into different sections. This one called Speakable Items with a disclosure triangle shows us what's in the top level of the Speakable Items folder. For example, well, cancel last command is up there. That's one of the items in the Speakable Items folder. In OS X, unlike in OS 9, we keep track of all the applications that you ever run while Speech is running. Now let me talk a little bit about user interface issues here and be a little controversial, perhaps even with my own management.
The Aqua user interface is designed to be uncluttered and very elegant. And I think it succeeds in doing that. There's not much clutter on the screen and the dock lets you very quickly get to those things that you've put there that you run most often. The problem though is that there are some things that are more difficult to get to because of this uncluttered interface.
So with Speech Recognition running it means that you can go directly to any application that you've ever run. If I open this out we'll see, this is just what I have running on my desk, this is my development machine actually, and you can see there's lots of stuff in there.
And so I can say any of these commands and it will switch to that application. Let me just try: switch to Internet Explorer. And there it is, it comes to the foreground. And that leads me to the next thing to show you about Speakable Items. Open the Speakable Items folder.
In the Speakable Items folder, there is a folder called Application Speakable Items. That folder contains folders that are named by applications. The items in those folders are only speakable when that application is in the foreground. So for example, when I said switch to Internet Explorer, you may have noticed another section appeared here. Here it says Finder.
The Finder is in the foreground, and there are a bunch of commands here that are in the Finder-specific application speakable items, and so are only speakable at the moment. If I switch applications, you watch, you'll see that the recognizer's language model dynamically changes. Switch to Internet Explorer. Switch to the Finder. You see it changing there? Switch to Internet Explorer. Close the Speech Commands window.
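To make the layout concrete, the folder structure being described looks roughly like this. The item names are ones used in this session's demos; your own folder will differ.

    Speakable Items/
        What time is it
        What day is it
        Cancel last command
        Application Speakable Items/
            Finder/
                (items only speakable when the Finder is frontmost)
            Internet Explorer/
                Make this page speakable
                Go to the Znet page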
Get Apple's stock price. Let's check it; I actually don't know what it is. Move page down. That looks too far. Oh, our stock's doing pretty well, look at that. Okay, what did I just do there? You can make web pages speakable.
[Transcript missing]
All right. Within the application-specific Speakable items that we ship for Internet Explorer, there's one that's called Make This Page Speakable. This is really cool. What's your favorite web page? You, what's your favorite web page? Yes, you, in the gray top. I'm sorry? ZMAC.
[Transcript missing]
and it will make a guess as to what I would like to call it by looking at the content of the page. I can just go with that or I can change it. Let's call this "go to the Znet page". And that's it. So let's see, I'll go to another page so we know we're not cheating. Go back. Go back. All right. Go to the Znet page. There we go.
Thanks. So the way that that happened was the application-specific item for Internet Explorer that we ship, called Make This Page Speakable, took a copy of the URL for that page and put it into the Internet Explorer application-specific folder. And there it is, right there. We find our buddies at Apple that use this generally have 10 to 30, sometimes 50, webpages speakable by voice.
Okay, so that's a brief walk through the structure and architecture of what we've been shipping. Now let's talk about... Let's talk about what's new in Jaguar. We've improved the Speech Recognition, we've improved the Speech Synthesis, we've improved the User Experience, and we've improved the Developer Experience. Let's talk about the Speech Recognition first. We've improved the accuracy, we've been working a lot on that. We've increased the robustness against background noise. If you were up here, you would hear how loud it is.
We've continued to work closely and painfully hard with the hardware groups to make sure that the microphones that ship in our hardware are optimized to work with Speech Recognition. This is important. This is an advantage that Apple has. Since we own the hardware and the audio layers and the operating system and the Speech Recognition, we can make them all work together in a way that's very hard for other developers.
and we have, you'll be glad to know, improved the compatibility with third-party USB microphones. Let's talk about the Speech Recognition accuracy, its performance for a moment. Now, Speech Recognition performance is notoriously difficult to measure. The way we measure it is, traditionally, we take a recording of several hundred spoken commands, spoken by six different people, three female and three male, at different volumes, and we play these several hundred commands over and over again, overnight, from a loudspeaker in front of the Mac we are testing, at the volume of a normal person.
We take a couple of measures. First of all, you may have noticed while we were working up here, and while you were working at your desk, sometimes there will be noises in your office that are not speech. Door slams, telephone rings, coughs, paper shuffling. We have to make sure that we don't try to search those against the speech grammar, and so we have models for lots of sounds like that.
The first thing that happens when speech comes into the Recognizer is we test it against those models of noise. And if it matches one of those better than it matches speech, then we reject it as being background noise. Now, we have tuned up the Speech Recognizer to prefer to reject rather than to make a wrong choice. And so it turns out that this rejection performance is the most sensitive measure of the performance of a microphone and of the Recognizer.
When we started work on this back in the OS 8, OS 9 timeframe, the percentage of times that it falsely rejected, that is, that it heard Speech and thought, no, I'm not confident enough, I'll class that as noise, was 22.5%. Then when we released OS 10.1, we managed to halve that down to about 10%. And now currently in Jaguar, our latest measurements on an iMac, we've halved it again down to 5%. Thank you.
Now what's really important is that if it thinks something is speech and tries to recognize it, it had better make the right decision. You don't want to say "open my browser" and have it empty the trash. So the next thing we measure is, of those spoken commands that it accepted as being speech, how accurate is it? Well, back in the OS 8, OS 9 time frame, the error rate was 4.5%. When we got to OS 10.1, we'd reduced that by a factor of 3, down to 1.5%. And in our current measurements with Jaguar, we've reduced that again by another factor of 3; it's down to 0.5%. Devang, he's the man, Devang. Dr. Naik.
Okay, we've been working on Speech Synthesis. In the engine itself, we've been working on the quality to make it sound more natural and easier to understand. We've been improving the voice signal processing. We've put a lot of work into the intonation, the tune, the rhythm, the pitch. Now that's important for you to be able to follow the meaning and for it to sound more natural.
We've reduced the number of pronunciation errors and we have actually modelled intonation now unlike, I think, just about any other synthesiser that I know. Not at the level of the sentence, but at the level of the paragraph. Now let me talk about this for a moment. We're working hard on the synthesis. We have put more work into Speech Recognition than into Speech Synthesis over the last few years because that's where the greatest market need has been. And now we're correcting that balance. So we've been working hard on the synthesis.
We continue to work hard on it for Jaguar. What I'm going to show you now is a work in progress. Can we switch back to demo machine number... No, no, I'm right here. I have the stuff right here. What am I talking about? Okay, so one thing we've added is paragraph intonation and topic tracking. Let me be clear about what we mean here. As we read out a paragraph, we build up a model of the listener's internal representation of the concepts introduced in that paragraph. And we annotate that in the intonation.
So I'll play this paragraph for you. This is a paragraph that I pulled out of the Mac OS X edition of the World Book Encyclopedia. It's a great application. Check it out. I'll play this in two versions. The first version is passing that text to the speech synthesis in the shipping version of OS X, and then the current work in progress with Jaguar. So here's the shipping version first.
As computers have grown more powerful and their functions more various, the size of computer programs has grown dramatically. Because of their large size, many modern programs are distributed on CDROMS. Okay, let's see how we do now. See if you can hear a difference. As computers have grown more powerful and their functions more various, the size of computer programs has grown dramatically. Because of their large size, many modern programs are distributed on CD-ROMs.
"Hold your applause, hold your applause. Okay, there have been some discontinuities in the signal. We've been working on those. Here's an example of a rather strange sentence. You'll hear this on OS 10.1 and then in the Jaguar version. First, 10.1." Without these problems, go at those targets, get the news. And now in Jaguar. Without these problems, go at those targets, get the news. Hear a difference? No, hold your applause, hold your applause.
We've worked on word pronunciation because, although the dictionary is large, some words have been mispronounced. So here I've gathered together a few examples of things that we've got wrong. Listen to this. This is, first of all, what's shipping in 10.1. May zillions and deathy a potions with al-al-he-amers get a remail with that look via pee-pee-poey.
I've got to do that again, that's so much fun. Listen to this. Those of you... has anybody been playing with the synthesis? Are these problems representative of what you've been hearing? Yes, yes. Okay, let's just listen to that one again. Maze Aliens and Ethia Potions with Al-Al-Hiamers get a remail with that look via PP Poey. Alright, I love that PP Poey. Ha ha ha.
Alright, here's the current version of Jaguar. Brazilians and Ethiopians with Alzheimer's get their email with Outlook via PPPoE. Thanks. No, no, hold your applause. Hold your applause. A big problem for word pronunciation is no matter how extensive your dictionary, people create words on the fly. It's one of the wonderful, rich things about language, particularly Germanic languages where we concatenate items together in text.
So here's a bunch of examples of things that we tend to encounter, and if I added all of these to the dictionary, it wouldn't give me any help, because next time I have some text, there'll be more of them. I have these in increasing difficulty. We'll just read them out with the OS X version. I'm sorry, the Jaguar version. Myfile.cp. Readme.txt. Smiley Face. Apple Lovers. Jobsification. George Bush for President.info. SummonAccreditedUniversity.edu/phonenumbers.html. Okay, now you can applaud.
As I said, this is a work in progress. This is a little bit further ahead from what you have on the Jaguar CDs that you've got, but we hope that you'll track this and we're having a great time finally folding into the code base research advances that we've been making over the last few years.
Alright, we've made some improvements to the user experience. One, as you may have seen already, is that there's a screen reader built in for free. For free for you. That means that you with your applications will get the benefit of it. It reads any text on the screen, including any selected text.
Users can navigate the entire system. Menus and front windows, controls, all of the buttons and so on can be spoken by the system and can be spoken by the user. And we have talking notifications, so applications that need your attention can do more than just jump up and down in the dock. They can talk to you. Let's demo some of those improvements. Could I have demo machine number four, please? Okay.
Okay, once again I'm going to have to let this machine adapt to my voice as it sounds in this acoustic environment. What time is it? It's 17:29. What day is it? It's Friday, May 10. Okay, I think we're there. Alright, we have added a tab at the top level of the Speech Preferences called Spoken User Interface. This tab controls how the computer speaks to you. For those of you who remember from OS 9, Talking Alerts is back. Thank you.
So any time an alert or a sheet comes up, you can have it spoken to you. Now, the aim of this is that you should never hear it. That is, alerts or sheets should come up on the screen, you read them, you think about them, you respond to them, they go away. There shouldn't be any speech.
But if your attention is away from the machine and you're staring out the window or crawling around on the floor plugging in cables and you don't respond after a while, the machine can then get your attention back, not just by going beep, but by reading out the content of the alert. Because of that, we ship it with a delay between the time the alert appears and the time it gets spoken. By default, that delay is 10 seconds. I will move it back to zero for now so that we don't have to wait.
For those of you who remember Talking Alerts, you can now give it a different voice from the rest of speech output in the system. The voice of the messenger already contains some information for the listener about the identity of the message. Let's do a demonstration. Alert. This is a demonstration of your Talking Alert settings. Use these settings to customize how your alerts are spoken. Okay.
Now there's some other Speech output as well. One is that we can get the computer to talk to us when an application in the background requires our attention. Let me give you an example. I'll turn that on, and let's switch to Internet Explorer, and I'll open up a new page here. So let's try going to a web page that doesn't exist. Let's go to some-nonsense.com.
[Transcript missing]
Excuse me, Internet Explorer needs your attention. There it goes. Okay. I can see it saying "meep, meep." Okay.
You can use a hotkey to speak any selected text in any application. We ask you to define what the key would be. For example, I'll make it Command F11. Now this is useful for people with disabilities, but we think it's also got general usefulness. For example, you'd like your email read out by the computer while you stare out the window, while you shuffle papers on your desk. Rather like hearing your voicemail over a speakerphone. Or perhaps, here's an example, you're looking at a spreadsheet.
You've just typed in your budget and you want to make sure you've got the figures right. Well checking them against the paper and the screen, who's done that? It kind of gets a bit tedious, right? So now you can select the numbers that you've typed in and just hit that hotkey and they'll be spoken.
[Transcript missing]
Now there are already two important messages for you developers. One is every dialogue, every alert, every sheet that you have in your application can now be spoken.
So test with that and make sure they sound okay. You might want to reword some of them. The second message is every piece of text that's selectable in your application can now be spoken. So test with that to make sure it works the way you want it to.
[Transcript missing]
The message here is: Even if text is not selectable in your application, it's likely to be spoken anyway. So test with that so that you as an application developer don't get embarrassed when your application sounds stupid.
Notify when an application requires your attention. Okay. So far we've been talking about the computer talking to you, but it's also possible now for you to talk to the computer. That is, anything that can be spoken by the computer to you, you can speak back. We have two extra controls.
There's an extra tab under Speech Recognition called Commands, and that lets you turn on and off different sections of speakable items. Do you remember that on the Speech Commands window there are different sections, such as the Speakable Items section and the Application Switching section? Well, each of those can be independently switched on and off, and we've added a couple.
One is Menu Bar Navigation, the other one is Front Window Controls. Let me demonstrate that one for you. I'll turn it on. I'll try speaking those check boxes. System-wide Specific Icon. Whoops. It got me right anyway. Good. The recognizer is a little forgiving, and that's usually a good thing.
[Transcript missing]
You can see things are changing over in the Speech Commands window as I say this. And the front window controls appear in the Speech Commands window, so again users can know what to say. So we've seen check boxes are working; let's try some tabs. Spoken User Interface. Speech Recognition. Spoken User Interface. Spoken User Interface. This is kind of fun. Buttons work. Demonstrate current settings. It's not my fault. This is a demonstration of your talking alert settings. Okay, okay, okay. Speech Recognition. System-wide Speakable Items. Application-specific Speakable Items. Application Switching. Application Switching. Front window controls. Front window controls. There we go. Okay, cool.
Let me just say something about the acoustics up here. We're always a little nervous about demonstrating Speech Recognition for the following reason. We have designed it to work in the acoustic environments in which people usually use their computers. That is in homes, in offices, where there's a kind of background noise and echo structure of rooms of about that size. Up here the noise is quite different and the echo structure, particularly with a PA system, is quite different.
And as you can see, it didn't misrecognise. Did it mis... no, not once. It didn't once misrecognise. Again, it preferred, a couple of times, to ask me to repeat myself. We think that's the right thing to do. Oh, that was that demo. Okay, we have improvements for you, developers.
The first one, Oh, could we go back to the main slide machine please? Thank you. We have improvements for you guys. The first one you've been requesting for years and we finally released it for you. We're calling it Tune Format. We'll go into that in a moment. Another developer request has been output to an audio file. People keep asking us, how do I get text-to-speech into a file? Well, you can do it now. We'll show you.
We allow you to embed speakable items into your own application so you don't have to worry about installing them in the speakable items folder. And we have some Cocoa classes. Let's first of all talk about Tune Format. Those of you who have been coming to WWDC for a while might remember this as being called Repeat After Me. Here's the problem.
When you send text to text-to-speech, we try to make some rough guesses as to how you might like it pronounced and how you might like it intoned. Which words you might like to emphasize, where you might like it to pause, where you might like it to raise its voice, where you might like it to lower its voice.
Those guesses may not be what you intend. There are a lot of controls you can embed in the text to give the synthesizer hints and move that pronunciation, I'm sorry, that intonation around. However, you may get frustrated that you don't quite know how to use these to get what you want.
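For context, the controls being referred to here are embedded speech commands: bracketed directives mixed into the text you hand to the synthesizer. A few of the long-documented ones, with a made-up sentence purely as an illustration:

    You have [[emph +]] three [[emph -]] new messages. [[slnc 500]]
    [[rate 100]] This sentence is spoken more slowly.

Here slnc inserts a pause in milliseconds, rate sets the speaking rate in words per minute, and emph raises or lowers the emphasis on the following word.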
Well, wouldn't it be great if you could just take a recording of you or your favorite voice talent saying something the way you want it spoken and then have the computer copy it? Well, that's what this does. You specify the exact tune and timing details, bypassing the front end of the synthesizer and sending them straight to the back end. Let me give you an example. What we have here is an internal data visualization tool that we use in development. The author was a little unsure whether we should show this to you because it has not been beautified to his standards yet.
I will play you... Well, let's play the demo first. I'll play you a sentence where the synthesizer makes the wrong choice about the intonation. First of all, we'll pass the text directly to the Jaguar version of Speech Synthesis.
[Transcript missing]
I like the good old nuts and bolts approach. And here she is copying me. I like the good old nuts and bolts approach.
So here's how it works. Here's a plot of the relevant internal parameters in the synthesizer for that sentence. The horizontal axis is time going through that spoken sentence. The vertical axis is the pitch, the difference between a higher and a lower voice. This line represents the pitch contour that's been generated by the synthesizer's rules throughout that spoken sentence. Can you see vertical striations of light and dark grey? They represent the boundaries between the different speech sounds. If you squint, you might see that it says "I" there, and then there's an "L" and then there's an "I" and then a "K". So that's "I like".
Now, these little X's are mathematically fitted key anchor points to represent that contour. We generate this contour, then fit these to it, and output for you a text format where there's one line per phoneme. Here, for example, is the "I", here's the "L", here's the "I".
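To give a feel for what "one line per phoneme" means, the tune format that eventually shipped looks something like the fragment below for the start of "I like": each line is a phoneme with a duration in milliseconds and pitch anchor points (Hz at a percentage of the phoneme's duration). Treat the exact syntax as an approximation reconstructed from later documentation, not as the session's own output.

    [[inpt TUNE]]
    AY {D 140; P 176.9:0 171.4:22 161.7:61}
    l  {D 60;  P 161.0:0}
    AY {D 210; P 157.4:0 103.6:90}
    k  {D 70;  P 103.6:0}
    [[inpt TEXT]]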
[Transcript missing]
I want to show you a couple of interesting things that you can do with this, because I think this tool itself produces lots of opportunities for developers. One of them, of course, is... yep, here we go, good.
It's customizing the Speech output so that within your applications, if there are any strings that are going to be spoken that you know are going to be spoken, you can use this tool to have them spoken according to the personality of your favorite voice talent and therefore associate a personality with your application. And they'll be spoken in the same synthesis voice, but with that personality as you just heard. And so the voice is consistent with any other synthesis that occurs in your application.
But there are other things that you can do with this. I thought about applying this to lip syncing. Lip-syncing is a large problem in the animation industry, as I'm sure some of you painfully know. You have recordings, you have animated characters, you want to sync them together. There are a few tools available that make rough guesses just by things like the amplitude of the signal. Well, it seemed to me that just like this Speech Recognizer, this special one, can find the phonemes in the signal to control this tune format, similarly it could find the phonemes in the signal to control lip-syncing.
When I looked at this, I didn't have a talking head that I could test this on. So I tested it in what turns out to be a more rigorous, stricter way, and that is by listening. When you're doing animation, there is some forgiveness, viewers have some forgiveness, in the lip-syncing being slightly out of sync with the speech, but our ear is not that forgiving.
Here's an example. This is a recording sent to me by our buddies over in Not Borland. Oh, I'm sorry, I've forgotten the name of the company. Big company that makes lots of multimedia applications. From an application that they were working on, and their problem was that they wanted to do lip-syncing and automate it.
So... Here we go. The sentence is, well, I'll let you see if you can figure out what the sentence is, but what I'm going to do here at the moment is play you out of one speaker their recording and at the same time you'll hear the speech synthesis saying the same sentence with the durations of each of the phonemes generated by the speech synthesizer itself to let you know what the wrong answer sounds like. These guys are not in sync with each other.
Do be careful as you make your way up the paths. Do you hear that they're different? Do be careful as you make your way up the paths. Do you hear that they're out of sync with each other? A few nods, no? Yep, alright, alright. So, I passed this through that tool and without touching the data, just got back what that Speech Recognizer did and told the Speech Synthesizer to use those durations that it found in that recording that they sent me. Now it sounds like this.
Do be careful as you make your way up the path. Do you hear that now? They're now more aligned. But it still seems to me that there's a little bit of misalignment there. Do you hear it? I'll play it again. Do be careful as you make your way up the path. So I thought, well maybe the Speech Recognizer wasn't quite accurate on where it found those phonemes.
Or, maybe, what I'm thinking is differences in alignment are really differences in the pitch. So using this tool, again, without touching the data myself, I took the pitch from their recording and put that onto the synthesizer. Now, it sounds like this. Do be careful as you make your way up the path. Now they're aligned. Thank you.
I showed you this because I think this tool is giving you guys some opportunities to really make some money here. Now, this tool lets you customize the way a particular string sounds, a particular sentence. That's fine if you know ahead of time what that sentence is, but then if you have to synthesize something else that you haven't seen before, you can't take advantage of that.
Well, it seemed to me we ought to be able to generalize from what we have seen to other sentences if they have a similar syntactic structure and are used for a similar communication purpose. We should be able to generalize from a few spoken examples. So, I worked with CompuServe when they were making a service for telephone access to information over the internet. You get a PIN number from them, you call a special phone number, type in your PIN number, and then you can get your email read out, you can hear the news, you can hear the weather.
The important thing is, by typing in your PIN number, they know who you are, and so they want to greet you by name. And they wanted to sound cheery and friendly. So they, in their eagerness but naivety, recorded their voice talent saying greetings to about 5,000 names in the hope that 5,000 would give them some kind of coverage.
How many names, Kevin, do you think you need to have 80% coverage? You'd probably know about 40,000? Oh, way more, maybe 90,000 I'd say. 90,000 names to give you 80% coverage. So they have a problem here. So they thought, oh, that's okay. What we'll do is we'll use our recordings from our voice talent if somebody whose name we recognize calls up, one of those 5,000. And if you happen to be not in that 5,000, we'll use the Speech Synthesis. But it didn't sound the way they wanted it to sound. And that's when they came to me for some help.
I tell you this because this is what we're making available for you guys now. Okay, so the first thing I did was use RepeatAfterMe to take some of their recordings and get the synthesizer to copy their recordings. So what I'll play you first is the greetings from their voice talent, but this is not by their voice talent, this is by the Victoria Voice with its input controlled by the output of this RepeatAfterMe, this tune format tool.
Hi Lenny, good afternoon Bolin, good morning Greg, hi Jim, hi Tom, good morning George. Okay, so that is synthesis, and I think I can hear in that the personality that they're trying to convey. The problem is, if somebody calls up with a different name, then they have to use synthesis. Here I have typed in greetings, and I've got a
[Transcript missing]
Okay, well, imagine bad synthesis. So I took names that had very different length and very different syllabic structure from the names that they sent so that I knew I wasn't cheating. Then, when I just synthesized those, they didn't sound anything like that same personality. But then I generalized from the recordings we have using a combination of the output in that tune format with modifying the intonation of the new text by a simple linear transform from what came out of the synthesizer when it saw that text. This is what it sounds like. Now, this is totally new strings that the synthesizer has not seen before. Hi, Kevin. Good afternoon, Dominic. Good morning, Dr. Scott. Hi, Jean. Hi, Jean. Good morning, Lewis. Good afternoon, Dr. Livingstone.
So I've spent a bit of time talking about that tool and this tune format. May I have the main projector again, please? Because we're pretty excited about it and we're delighted that at last, after a few years, we can move towards making it available for you. This tune format is not yet working in the version of the CD that you guys have, but it will be soon. Okay, we can now do output to an audio file. And you can do this a number of ways, but one way that's particularly easy is through AppleScript. May I have demo machine number four, please? Okay, so let's run Script Editor.
Who has not used Script Editor? Okay, do you know what AppleScript is? AppleScript is a scripting language that Apple makes available for controlling applications across the whole operating system. One of its characteristics is that it's very easy to read an AppleScript program and understand it because the language looks sort of like English. We have added a command to that, we added this a little while ago, called the SAY command.
[Transcript missing]
If I run that, we should hear it. Hello, everybody. Okay. We have now added to that "saving to", and then you put in a file name.
[Transcript missing]
and whatever you say here will be recorded into that file rather than spoken. Let's not do something as simple as "Hello everybody." Let me go to the web and we'll grab a piece of text from here.
Let's talk about a paragraph here. Something that would take a while to speak. Copy this to the clipboard. Switch to Script Editor.
[Transcript missing]
Get rid of that return. Run it again. Sorry, I've got a close quote there. We have a close quote there. I've got an open quote there. Mac OS is helping out. What am I doing wrong? There's new lines. Oh, oh. Got to watch that.
Now, is there one there? And is there one there? Let's try that again. Alright, let's try something smaller. Let us try something shorter for now. I don't want to go searching for these things. Alright, let's try that. Let's retype that. Say, "Will it work this time?" It's been working for me for weeks.
[Transcript missing]
Imagine something that could take people's email. May I have the main machine again? You could take people's email and read each email message into an audio file and then download that onto a person's iPod, so that as they're cycling to work in the morning they could be listening to their email. We think there's an opportunity here for you developers.
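If you'd rather do this from code than from AppleScript, the same capability is exposed through the Speech Synthesis Manager. A minimal sketch, assuming the soOutputToFileWithCFURL selector that shipped for this purpose; treat the details, especially the selector and the crude polling, as an approximation rather than the session's own example.

    #include <Carbon/Carbon.h>
    #include <string.h>

    // Synthesize a C string into an AIFF file instead of playing it aloud.
    void SpeakToFile(const char *text, CFURLRef aiffURL)
    {
        SpeechChannel chan = NULL;

        NewSpeechChannel(NULL, &chan);                          // NULL = default voice
        SetSpeechInfo(chan, soOutputToFileWithCFURL, aiffURL);  // redirect output to the file
        SpeakText(chan, text, strlen(text));                    // asynchronous

        while (SpeechBusy() > 0)                                // crude wait; a speech-done callback is nicer
            ;

        SetSpeechInfo(chan, soOutputToFileWithCFURL, NULL);     // flush and close the file
        DisposeSpeechChannel(chan);
    }

The AppleScript route shown in the demo amounts to the say command with a file parameter, which is the one-liner most scripts will actually use.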
Okay, so we've talked about the tune format, we've talked about output to audio files. Let's talk for a moment about embedded speakable commands. I mentioned that there are application-specific folders that can contain commands only speakable in those applications while that application is in the foreground. Several of you have already been using that by shipping with your applications speakable commands for those applications, and it's sometimes tricky to install them.
You install them in the Speakable Items folder if the user has one; if the user doesn't have one, it's a little unclear what to do. So now, you don't have to install them at all. You just bundle them into your application package in a folder, and we will find them. And we'd like... well, thank you. Obviously somebody's been doing this. And we've got Cocoa classes at last. There's been a lot of rumours on the net. Thank you.
This has been requested by lots of developers, and we've tried to make these simple to adopt and simple to code, and to do them in such a way that we're prepared for the future, because in this format it should be pretty straightforward to incorporate these into Interface Builder. But at the moment, they're just available as classes; they're not in Interface Builder. There are two main classes, Speech Synthesis and Speech Recognition. The Speech Synthesis is a single class and it can speak any string synchronously. It's really easy. It's one line.
If you need asynchronous speech, that is, you want to animate the screen or do something else in your application while it's speaking, or you want more control, such as highlighting words while they're spoken or doing lip syncing, then you can set up a delegate for notification and there's a lot more control like that that you can get.
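The session doesn't spell out the class names, so here's a sketch using the Cocoa speech classes as they later shipped publicly (NSSpeechSynthesizer, which speaks asynchronously and reports progress to a delegate). The Jaguar classes being described, including the synchronous simple case, may have looked somewhat different, so take the names and signatures as assumptions.

    #import <AppKit/AppKit.h>

    @interface Talker : NSObject
    {
        NSSpeechSynthesizer *synth;
    }
    - (void)announce:(NSString *)message;
    @end

    @implementation Talker
    - (id)init
    {
        self = [super init];
        synth = [[NSSpeechSynthesizer alloc] initWithVoice:nil]; // nil = the user's default voice
        [synth setDelegate:self];
        return self;
    }
    - (void)announce:(NSString *)message
    {
        [synth startSpeakingString:message];   // the simple one-line case
    }
    // Delegate callbacks are the hooks for word highlighting or lip sync.
    - (void)speechSynthesizer:(NSSpeechSynthesizer *)sender
                willSpeakWord:(NSRange)wordRange
                     ofString:(NSString *)text
    {
        NSLog(@"speaking: %@", [text substringWithRange:wordRange]);
    }
    - (void)speechSynthesizer:(NSSpeechSynthesizer *)sender
            didFinishSpeaking:(BOOL)success
    {
        NSLog(@"finished speaking (%d)", success);
    }
    @end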
For Speech Recognition, again, there's one class to use in the simple case. You get a shared instance of that class, you specify which commands you want to listen to, and then you just begin listening. This does in three lines what in the Carbon API would take somewhere between three and five pages of code. Is that fair, Tom? So we think we've made it easier for you.
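And the recognition side, again sketched with the class that later shipped publicly (NSSpeechRecognizer). The Jaguar class is described as having a shared instance, so the exact spelling here is an assumption; the three essential lines are setting the commands, setting a delegate, and starting to listen.

    #import <AppKit/AppKit.h>

    @interface Listener : NSObject
    {
        NSSpeechRecognizer *recognizer;
    }
    @end

    @implementation Listener
    - (id)init
    {
        self = [super init];
        recognizer = [[NSSpeechRecognizer alloc] init];
        [recognizer setCommands:[NSArray arrayWithObjects:
            @"what time is it", @"open my browser", @"quit this application", nil]];
        [recognizer setDelegate:self];
        [recognizer startListening];
        return self;
    }
    // Called when one of the commands above is heard.
    - (void)speechRecognizer:(NSSpeechRecognizer *)sender
         didRecognizeCommand:(id)command
    {
        NSLog(@"heard: %@", command);
    }
    @end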
If you want to do something more sophisticated, such as having the recognizer parse the spoken string and give you back the parsed version, or make some things that people speak optional so that the recognizer can recognize fragments, then there's a handful of classes available to build a language model and deal with those cases.
So we've tried to set this up to be very quick and easy to adopt, just a couple of lines for the simple case, but all the power of the original API there if you need to do the things that we have found developers need to do for more sophisticated uses of the speech technology.
and we've improved the user experience. I've already shown you some of the accessibility features we've done for users. In addition, we have some new speakable items and we have something called Semantic Inference. To demonstrate the speakable items, I'd like to invite Dr. Tom Bonura up onto the stage.
Tom. Thank you. Hi. I'm going to try to whip through this. There are a number of interesting new speakable items in Jaguar, and I'm just going to show you one that I think is particularly interesting. So let's just go right into it. Could I have, okay, we have the right machine. Hide this application. Switch to TextEdit.
The first thing I'd like to show you is something that's actually in PUMA, or in the released version of Speech Recognition. Define a keyboard command. And the idea here is that you should be able to define your own spoken commands by simply saying, "define a keyboard command." And so this is just a way of defining a keyboard shortcut. So I'm going to say Command-A is going to be represented as Select All.
Which it usually is in applications, and I'll say, well, in TextEdit only we'll refer to Select All as Command-A. So now we have created an application-specific speakable item, I just did it on the fly, and I can just simply say "select all". And it did. Okay, well, that's cool, and it means that all of your users can now define their own commands if there's a keyboard equivalent for a menu item, which is a good thing. So let's go a little further. Let's say I'm interested in this piece of text and I'd like to use it over and over again. Make this speakable. Make this speakable.
Make this speakable. Okay, so this little bit of functionality says when I say, now I'm going to type the phrase that I'm interested in.
[Transcript missing]
[Transcript missing]
Insert my work address. Okay, so it did the right thing. Now, let's go a little further. And here I have a picture that I took last year in Hawaii. Make this speakable.
[Transcript missing]
Let's save it away. So now, if I want to use that in the context of an application: switch to Mail. Switch to Mail. Insert my great golf picture. It does the right thing. So, there's... There's a lot of fun stuff like this. Shh, quiet. There's a lot of fun stuff like this in Jaguar, and we think that users will be able to build a whole lot of very compelling and very useful little speakable macros, if you will, using this kind of functionality. Let me go back and turn things over to Kim. I'll be back.
Thanks Tom. Okay, so let's talk about semantic inference. We are so excited about this. Here's the problem. Back when we were less confident about our Speech Recognizer and it was less robust, we only shipped about 20, 30 commands. Users would only have about 50 commands of things that were speakable.
That's all the Recognizer could reliably handle. And that was fine. It was easy to remember the set of commands. But now that the Recognizer is so much stronger, we're finding we and others have hundreds of commands. The burden is now on the user to remember the exact wording of those commands. And if you get it wrong, you're out of luck.
Here's why. The way the Recognizer works is that you load it up with a language model, which is a list of what it's supposed to listen for. Then if it hears something that's in that language model, such as "what time is it", that goes into the search, and the search finds which path through the language model most closely matches that utterance.
Associated with that path is a speakable action, such as, in this case, speaking the time. The problem is, what happens if the person accidentally says, what is the time? Now you can feed that into the search, and you know what'll happen? Nothing. Because there's no path through the grammar to match that. So, our solution is to use the Recognizer. Thank you all.
That was not talking alerts. Okay, we're about on time. We're doing okay here. So the solution is: we take out that whole approach to language modeling and replace that language model with a more flexible, extended language model, the details of which we'll be happy to talk about in the question time or afterwards if you want to know.
Now, when we pass "what is the time" in there, we do get a string of words that comes out of the recognizer. But now we have a problem. Since it's not necessarily a string of words that we've seen before, how do we map it to the speakable action that was associated with "what time is it"? Well, basically, a miracle occurs.
And because of that miracle, we can indeed still map it to Speak the time. The truth is, it's not actually a miracle. It's a technology called Latent Semantic Mapping, and it was introduced in the keynote. This is the technology that we gave to the mail group to use for the junk mail detection.
In the case of junk mail detection, we are looking for two categories: junk or not junk. Here, by looking at a statistical representation of the components of its meaning, we can map any string to a set of categories, which are defined as being the speakable actions. Do we have time for Tom to demo this? You want that? Alright, applaud Tom.
We're going to really skate on a little thin ice here. We are really going to push things. This is brand new. This is very exciting technology. This is going to appear in Jaguar. I don't know if this actually is part of your release, but it's very exciting, and it's... it's very, very... how shall I put it? This is somewhat more taxing on our Speech Recognition engine. And so, because of the room acoustics, I'm going to change my strategy a little bit, and rather than use the built-in mic, I'm going to try this with a close-talking mic, which none of us actually likes to use.
[Transcript missing]
is something that we're referring to right now in this control panel as "allow command name variations". So the idea, as Kim said, is that a lot of times you know what you want to do, but you don't quite remember how to say it. And we want to provide a mechanism that allows the user to be very forgiving, or to be forgiven by the recognition system. So let's turn this on and see how far we get.
Let me just test it a little bit. What time is it? By the way, we also can adjust the gain on these microphones. What time is it? What time is it? What time is it? What time is it? Quit this application. Quit this application. Quit this application. What time is it? Well, we may be having some problems. What time is it? Okay, I'm going to really go on thin ice and I'm going to try the built-in mic. What time is it? What time is it? What time is it? Now, I'm going to try the built-in mic.
What time is it? What time is it? It's 18:10. What time is it? It's 18:10. Well, we'll see how far we get. Switch to Finder. Switch to Finder. Switch to Finder. Kim, I think I'm going to punt on this because we're really short on time and I don't want to push it too much.
The normalization started moving towards the head-mounted microphone, and then when he was moving it back, it started moving back. If we persevere with it, we've done some tests; sometimes, in this room, it takes up to 20 repetitions of some commands before the normalization has reached the right space, and then it seems to perform fine. We won't keep you with that. Kim, let me just try one more thing. I'm going to start it once again, and if it doesn't do anything, then we'll punt.
What time is it? What time is it? No. Let's not go with it. We won't text you guys. Here's the message for you developers about this. This is a change in the overall concept of how speech recognition works. We're not trying to recognize the words. We're trying to recognize the meaning by a technology that's actually been quite well proven now in the way that it works for classifying word strings according to their meaning, and the recognizer is capable of catching those word strings. We'll encourage you to try it out.
Is this in the Jaguar WWDC CD? This is not. Look for subsequent seeds. The way this works is the standard approach to dealing with this problem is to sit down ahead of time and try to imagine every way that people might say every command. Right, exactly. You end up with this huge list, and the next person who comes along says something that you didn't anticipate.
Instead, you gather examples of the different ways that people might say each of the spoken commands for your application. Then you train what we call a latent semantic map, which can generalize from those and look at other strings and see that they are similar in structure to those examples by looking at word pairs, word triples, and co-occurrences of words statistically represented.
We're not making the API available to developers in Jaguar because we want to iterate on that API ourselves a little bit and get more experience with the technology and understand the constraints in using it. But you can start thinking about it because our intention is indeed to make it available. And the way to start thinking about it is to start gathering alternative ways to say the commands in your applications. Don't go crazy and try to get as creative and as imaginative as you can.
We know you're a very imaginative bunch of folk. But look at what a naive user is likely to say if they know what can be done by speech but don't remember how to say it. And start building up those lists so that you're ready for us by the time we're ready to release this to you.
So what we've done, let me summarize, is talked through a whole bunch of things that are new in Jaguar. We're working on Speech Synthesis. You've seen a work in progress. Expect that work to keep going. You've seen widespread incorporation of speech into the user interface, which means that your applications are speakable, whether you do anything about it or not.
What you can do about it is make sure that they work well, by making sure that your text and strings sound okay when they're spoken, and by using standard sheets, standard alerts, and standard controls, because that's what will be spoken. If you really want to use a custom control, then you can associate an alternative string with that control, and we will see and speak that string if people select that control.
And test that when people say these controls themselves, that the speech recognition works. We've shown you that people can navigate the user interface by voice, and we've shown you semantic inference, and Cocoa classes. We have some more information here about how to use these, but I think you guys can figure it out for yourselves. Let's end up the main session now, and we'll hang around here and outside to talk about this further. Once again, I want to thank you for staying so late. Thank you.