Adopting Modern Internationalization APIs - WWDC 2007

Mac OS X Essentials • 1:03:43

Prepare your application for the global marketplace with modern internationalization APIs. Learn how to replace the deprecated Script Manager APIs with the new Text Input Sources Services and other modern Leopard APIs to simplify and enhance your code base. Learn about the new Leopard Dictionary and Language APIs and how to create your own dictionaries and glossaries.

Speakers: Peter Edberg, Deborah Goldsmith, Yasuo Kida, Mike Grady

Unlisted on Apple Developer site

Transcript

This transcript has potential transcription errors. We are working on an improved version.

[Peter Edberg]

Hello, I'm Peter Edberg, and I'm going to give you just a brief introduction to the presentation before we get into the actual meat of the presentation. So what we're talking about here are the core internationalization services that are used by both Cocoa and Carbon. And these consist of elements in several different layers, part of several different frameworks. This is designated by the yellow in the slide.

And they extend all the way from core foundation up through Carbon. Now AppKit is sitting on top of all of this. It actually uses some of the internationalization services from Carbon. And so if you're a Cocoa developer and you want standard internationalization behavior, you get that by default from Cocoa applications. If you want something that's beyond standard behavior, or if you're a Carbon developer, then you may need to call these services directly and it's for you that we're directing this session.

Now we're not talking about all the pieces in all the frameworks, we're really focusing on a couple of areas. One are the internationalization services at the CF, or core foundation, layer. That's the bottom part of the diagram. And we're also going to be speaking a little bit about some of the services in the Carbon layer. The services in the core services layer are mostly deprecated and we'll be talking about that in a minute. And the ones in the application services layer are mostly about text layout and those are covered in the core text session later today.

Okay, so what we're going to be focusing on are the modern services. And what do we mean by modern? Well, there are two key aspects of these. One is that they use Unicode and another is that they are available in 64-bit. There are some other services and we're going to be providing some guidelines on how to shift away from using the deprecated services like the script manager and so on.

Now what are the important reasons for doing this? Well, of course, one of the key issues is that you can use Unicode text. And this is not just so that you can display (inaudible) form or some more obscure scripts in Unicode. Unicode text actually provides a lot of characters for East Asian languages, for example, that were not available in the legacy Mac encodings.

It also means that all characters are always available. In the Mac encodings, for example, some characters, and not necessarily ones associated with specific languages or scripts, were only available in certain contexts. For example, the registered sign was in MacRoman, but it was not in MacJapanese. So if you happened to be drawing text with MacJapanese, you didn't have that. The contour integral is in MacJapanese, not in MacRoman. And these are not specifically things you think of as associated with either Japanese or the Western European languages.

Another big advantage of using the modern services that we'll be focusing on is that it gives you access to more languages and locales, and it also gives you a more standard way of designating those languages and locales using the language tagging facilities described by BCP 47. Of course another reason for using the modern APIs is that they're the ones that we're focusing development on. They work better, we're fixing more bugs in them, and they also support the new system initiatives that you've heard about, so 64-bit and resolution independence and so on.

Now there are some cases where we haven't completely filled in all the functionality that you might need with one of our modern APIs that we're focusing on. In some cases we have some transitional APIs that fill these gaps. And these transitional APIs typically do support Unicode, but they may not, for example, represent strings using CFStrings and so on.

Okay, now I'm not going to go through this diagram in great detail. You'll get a copy of this later. But this is just to give you a sense of what things are in which category. What's deprecated, transitional and modern? And so for example, for the text utilities area, what's deprecated is certainly the script manager and the legacy encodings associated with it, the old international and text utilities, language analysis manager and so on.

In the modern category we have a lot of services at the core foundation layer. And you'll be hearing about those, CFLocale, CFCalendar, CFDateFormatter, and so on. Now an example of something that's in the transitional category would be use of Unicode utilities to get a collation key if you wanted to do optimized sorting, and Deborah will be describing a little bit more about that later. That's something we haven't got available in the CF domain at this point.

Okay. So the topics that we're going to be covering in this presentation are these. The three middle categories on this slide are things that are all new for Leopard. And the first and last topics also are areas in which we've made some enhancements for Leopard. Anyway, now to move into the meat of the presentation, I'd like to bring up Deborah Goldsmith who's going to talk to you about the text utilities, international formatting and so on. Deborah plays many roles at Apple.

Deborah is the representative to Unicode and is the, I guess Vice Chair of the CLDR Technical Committee and involved with ICU, is also on the ICU board and sort of the person who knows the most about how the internationalization pieces at Apple fit together. So, with that, Deborah. Thanks.

[Deborah Goldsmith]

Thank you Peter. ( Applause ) Okay. So we're going to talk a little bit about the modern APIs that you can use to process text in your application. Now many of these APIs appeared in Tiger and some even before that. So this is going to go very quickly. It's going to be part review and in a few places introduction to some new features that are coming in Leopard.

So before we start talking about the APIs, we need to talk about what the APIs deal with. On the old world side for the script manager, we have things like MacRoman and MacJapanese, which are the legacy character encodings. In the new world, all of the new APIs use Unicode.

So if you haven't already transitioned your application to Unicode, that's what you should be focusing on. The script manager used ScriptCodes, RegionCodes, things like that to represent locales or languages. In the modern world we have CFLocale, CFStringEncoding to denote legacy encodings and CFCharacterSet to keep track of sets of characters or repertoires. With the script manager legacy encodings you used Pascal or C strings.

In the modern world we use CFString or its toll-free bridged equivalent NSString. And the API sets you used in the Mac OS 9 and Script Manager world are the Script Manager, Text Utilities, Date/Time Measurement Utilities and so on, and those are replaced by the core foundation API sets that you see here.

So moving right along, we'll go through a number of scenarios. These don't cover all of the APIs by any means, but it's just to give you an idea of how to move your application forward. So if you are interested in getting locale data, things like the current currency symbol or localized decimal point or what have you, you might call GetScriptVariable or GetScriptManagerVariable. Those are now deprecated. And there are several API sets that you can use as the modern equivalent. For anything that has to do with locale data, the place you start is CFLocaleCopyCurrent, which gives you a copy of the user's current locale settings.

In order to disassemble that, or put it back together, there's CFLocaleCreateComponentsFromLocaleIdentifier, which lets you pick out the pieces of the locale ID. And there's another API for putting it back together. You can use CFLocaleGetValue to get particular attributes of the locale. And you could use CFLocaleCopyDisplayNameForPropertyValue to get a localized name. And we'll give a little example using that.

So say we want to get the name of the user's primary language, the user's most preferred language in Polish. Well, we can use CFLocaleCopyPreferredLanguages, which is actually a new API in Leopard. It returns an array, which is the list of the user's language preferences from most preferred to least preferred.

We then get the first element of that array, which is the user's most preferred language. Then we need a Polish locale. Now the user may not have their locale set to Polish, so since we want Polish in particular, we'll create one, instead of using CFLocaleCopyCurrent. We create a Polish locale and then we call CFLocaleCopyDisplayNameForPropertyValue. That's a mouthful. And we pass our Polish locale.

And we say we want the localized version of the locale identifier and then we pass in the language. Now why do we say locale identifier instead of a language identifier? Well you can see the reason below. Because a language tag doesn't just consist of a language, it can also have a script code and possibly other things associated with it. In this case, the user's primary language is Simplified Chinese, which is denoted by the language tag and the script tag. And the result in Polish is shown below. And the English equivalent of that is Chinese in Simplified Han.

Okay, here's another thing that you might need to do in your application. You might need to get information about characters. Now in the script manager world you would have used CharacterType to do that. And there are several replacements in the modern API set. First of all, CFCharacterSet, in addition to letting you create your own sets of characters, lets you get predefined sets that represent various different kinds of characters like uppercase or lowercase or digits or what have you.

In addition there are a few things you can't do yet with core foundation like get nonbinary properties of a character. Like, for example, a decimal digit, what is the value of that digit zero through nine? So in this case we're going to test a particular character to see whether it is a decimal digit.

We call CFCharacterSetGetPredefined to get the predefined set of decimal digits and then we can use CFCharacterSetIsLongCharacterMember to test a particular character for membership in that set. And then you could go on to do other things such as get the actual value using UCGetCharProperty if, for example, you want to scan a number.
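The digit test and value lookup described above can be illustrated outside Core Foundation. The following is a minimal Python sketch of the same idea, not the CFCharacterSet or UCGetCharProperty calls themselves; the helper names `decimal_digit_value` and `scan_number` are hypothetical and exist only for this illustration.

```python
import unicodedata


def decimal_digit_value(ch):
    """Return 0-9 if ch is a Unicode decimal digit, otherwise None."""
    # unicodedata.decimal returns the default (None) for non-digits.
    return unicodedata.decimal(ch, None)


def scan_number(text):
    """Accumulate the leading run of decimal digits (from any script)."""
    value = 0
    seen_digit = False
    for ch in text:
        digit = decimal_digit_value(ch)
        if digit is None:
            break  # first non-digit ends the number
        value = value * 10 + digit
        seen_digit = True
    return value if seen_digit else None


print(scan_number("42 apples"))     # ASCII digits
print(scan_number("\u0664\u0662"))  # Arabic-Indic digits four and two
```

Membership in the digit set and the numeric value are two separate questions, which is why the talk pairs a set membership test with a property lookup.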

Another scenario is sorting text strings, or searching through them. The script manager had StringOrder and TextOrder for that, and there's a large set of APIs that replace that, not all of which are listed here. We'll just go through a few examples to give you some of the flavor of how you do things with core foundation. You can use CFStringCompare to compare two strings with various options. There's CFStringFind for finding a string within another string.

And there's also a find and replace option that works on mutable strings. Sorry, find and replace API. And if you're looking for a particular character in a CFString, like say you want to find the next white space, you can use CFStringFindCharacterFromSet, you can call CFCharacterSet and get the predefined character set of all white space and then use that to scan through a string looking for the next one.

Now we have one API here that's in the transitional set and that's UCGetCollationKey. And let me explain what you might want to use that for. If you're sorting a large set of strings, you're going to be comparing strings more than once because the fastest sorting algorithms require N log N operations.

Now if you're doing a localized comparison, one of the expensive parts of that comparison is looking up the localized ordering information. And the technical term for that is developing a string of collation elements. What a collation key lets you do is do that work up front and get a binary string that's much faster to compare. So if you're doing a sort or other large operation where you're going to be comparing strings multiple times, this lets you amortize that localized information up front and that'll speed up your sorting operation.
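The amortization argument above can be made concrete by counting how often the expensive step runs. This is a Python sketch under stated assumptions: `collation_key` is a hypothetical stand-in (accent stripping plus case folding), not UCGetCollationKey and not a real localized ordering.

```python
import functools
import unicodedata

key_calls = {"count": 0}


def collation_key(s):
    """Hypothetical stand-in for an expensive localized key computation."""
    key_calls["count"] += 1
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The original string is a tie-breaker so the ordering is total.
    return (base.casefold(), s)


def compare(a, b):
    """Comparison that redoes the key work for both strings every time."""
    ka, kb = collation_key(a), collation_key(b)
    return (ka > kb) - (ka < kb)


words = ["zebra", "\u00c9mile", "apple", "eagle", "Zo\u00eb",
         "\u00e9migr\u00e9", "Apple", "echo"]

key_calls["count"] = 0
sorted_by_compare = sorted(words, key=functools.cmp_to_key(compare))
calls_when_comparing = key_calls["count"]

key_calls["count"] = 0
sorted_by_key = sorted(words, key=collation_key)  # one key per string
calls_with_keys = key_calls["count"]

print(sorted_by_key == sorted_by_compare)
print(calls_with_keys, "key computations versus", calls_when_comparing)
```

Both sorts produce the same order, but the keyed sort computes exactly one key per string, while the comparison-based sort repeats the work on every comparison.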

Another thing you might need to do is to convert strings to uppercase or lowercase. And CFString has a variety of modern APIs that you can use to do that. You can convert to uppercase or lowercase, as you could with the script manager, but you can also do capitalization, which converts the first character of each word to uppercase.

Now you may notice that these APIs all pass the locale and you might wonder why you need that for uppercase and lowercase. And the answer is that the rules for converting between upper and lowercase vary from language to language and country to country. For example, the rules for Turkish are not the same as the rules for English. So if you're doing this to a string that is going to be displayed to an end user, you're going to want to pass the locale in to make sure that it's done properly.
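To see why casing needs a locale, consider the Turkish dotted and dotless i. Python's built-in `str.upper()` takes no locale, so this sketch hand-codes the one Turkish rule it needs. The function `upper_for_language` is hypothetical and covers only this single special case; it is an illustration of the problem, not how CFString implements it.

```python
def upper_for_language(text, language):
    """Uppercase text, applying the Turkish/Azerbaijani dotted-i rule."""
    if language in ("tr", "az"):
        # In these languages lowercase i uppercases to I with a dot above
        # (U+0130), and dotless i (U+0131) uppercases to plain I.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()


print(upper_for_language("istanbul", "en"))
print(upper_for_language("istanbul", "tr"))
```

The same input string produces different uppercase results depending on the language, which is the reason the casing APIs accept a locale argument.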

Another operation that applies to Unicode that wasn't necessary in the script manager world is normalization. Now as you may be aware, Unicode can have multiple ways of representing the same character sequence. For example, an E with an acute accent can be represented either as a precomposed form, E with an acute accent as one character, or an E base character with a combining acute accent.

And what CFStringNormalize does is let you convert between those two forms because different systems and different applications have a preference for one form or the other. For example on the Internet or on Windows, you'll typically prefer the precomposed form, which is normalization form C, NFC. But HFS Plus disk volumes prefer file names in the decomposed form, NFD. And there's a few subtleties there, but I'm not going to go into them now.
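The two representations of the accented E, and the conversion between them, can be shown with Python's standard `unicodedata` module. This is a sketch of the concept only; it is not the Core Foundation call.

```python
import unicodedata

precomposed = "\u00e9"   # e with acute as a single code point
decomposed = "e\u0301"   # base letter e followed by combining acute accent

# The two strings look identical on screen but compare as different.
print(precomposed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)   # compose
nfd = unicodedata.normalize("NFD", precomposed)  # decompose

print(len(nfc), len(nfd))
print(nfc == precomposed, nfd == decomposed)
```

Normalizing both sides to the same form before comparing or storing text avoids treating visually identical strings as different.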

Something else you might need to do in your application is perform transformations on your text and the script manager had a few of those. You could strip the diacritics off or you could transliterate text to Roman from other scripts. And CFString has a very powerful API that lets you do those transformations and more called CFStringTransform. It can strip diacritics and this is actually a new operation in Leopard. In Tiger we had an operation for stripping combining marks, which is a little bit more enthusiastic than stripping diacritics. So now we have just stripping diacritics.

It also supports transformations to Latin, but actually you don't want to use that if your application is running on Leopard because starting in Leopard we have CFStringTokenizer, which does a much more detailed analysis of text and looks at grammatical information in dictionaries and so on to convert text to Latin. So it does a much better job with things like Japanese or Chinese than CFStringTransform can do. So if you're running on Leopard, use CFStringTokenizer instead, and Kida-san is going to be talking about that in a few minutes.

However, if you need to do other kinds of transformations, CFStringTransform is your API. It also does bidirectional transformations, which CFStringTokenizer doesn't do. So, for example, you can go back and forth between Latin script and Hiragana or between Hiragana and Katakana, and you can do things like convert non-ASCII characters to numeric escape codes.

And something that you can do that you could actually do in Tiger, but we're supporting it formally for the first time now, is you can pass any transformation ID that's supported by the ICU library to CFStringTransform and it will do it, in addition to the predefined ones that we've supplied in the past. So in this case, we're using the Halfwidth-Fullwidth transform from ICU, which converts halfwidth to fullwidth characters.

But we don't want to do that to the Latin characters in our string, we just want to do it to the Katakana, so we use a transform that has a Katakana filter and the result is, on the right, you can see that the Katakana characters become fullwidth, but the Latin characters are left alone. CFStringTransform can do lots more things and you can read about it in the documentation.

Well, in addition to looking at strings and transforming them, you also need to do basic operations like replace parts of them or move bytes around. In the past you would have used Munger or ReplaceText, but in the modern world, CFString has a variety of APIs you can call instead. You can replace a range of a string with another string. You can create a string with a format string, which is very similar to printf. You can trim a string from the beginning and end of another string.

So for example you can remove comment indicators or add comment indicators. And there's a more specialized version of that, which removes white space characters from the beginnings or ends of strings. And again, there are more APIs than we have time to go into here. You can read the documentation to find out about them. If you're just moving bytes around and they're not Unicode characters, then you can just use memmove and similar APIs from the standard C library.

Okay. Well now we start to get into some of the more complicated aspects of Unicode. You might think since Unicode is 16-bits wide, that it's just like a wide version of ASCII, and it isn't. It requires a little bit more sophistication in handling than that. So in the past, you had to use CharacterByteType to make sure that for a double byte character set like MacJapanese or Mac Simplified Chinese, you were not truncating the character in the middle. Well, Unicode is not a multi byte encoding, but you can still have more than one code point used to represent an end user character.

For example, the example I gave earlier about an E character followed by a combining acute accent, you don't want a break in the middle of that. So suppose you're inserting something like an ellipsis. If you don't pay attention to the boundaries of the composing character sequence, you could get something like the first example on the bottom where you insert the ellipsis in between the base character and the accent.

And that's not the result you want. Instead you can use CFStringGetRangeOfComposedCharactersAtIndex, and what that will do is it will take the index you give it and it will give you the range on either side of that index, of the composing character sequence. And you don't want a break in the middle of that.

So if you use that, in this example, you'd start with an arbitrary index and then you say, okay, I want to put the break at the end of that range. And then you get the second example we have here, where the E with its acute accent is okay and then it's followed by the ellipsis.
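The boundary adjustment just described can be approximated in a few lines of Python. This is a simplified sketch, not CFStringGetRangeOfComposedCharactersAtIndex: it only skips combining marks with a nonzero combining class, it indexes by code point rather than by UTF-16 unit, and the names `safe_cut_index` and `truncate_with_ellipsis` are hypothetical.

```python
import unicodedata


def safe_cut_index(text, index):
    """Advance index past combining marks so a base keeps its accents."""
    while index < len(text) and unicodedata.combining(text[index]):
        index += 1
    return index


def truncate_with_ellipsis(text, index):
    """Cut text at or just after index and append an ellipsis."""
    cut = safe_cut_index(text, index)
    if cut >= len(text):
        return text  # nothing left to truncate
    return text[:cut] + "\u2026"


word = "cafe\u0301 au lait"            # e followed by combining acute
naive = word[:4] + "\u2026"            # splits the e from its accent
safe = truncate_with_ellipsis(word, 4)

print(naive)
print(safe)
```

The naive cut separates the base letter from its accent, while the adjusted cut keeps the whole sequence together before the ellipsis. Full cluster rules (for example Hangul or joined emoji) need more than this check.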

Speaking of that particular example of inserting an ellipsis in order to truncate a string, you actually don't have to do that yourself. The script manager had APIs for that and we have modern equivalents also. If you're using core text, you don't even have to truncate the string in memory. You can pass a certain set of tags when you're rendering and core text will truncate the string for you while leaving your string in memory untouched.

You just tell it how wide you want it to be and it will do the right thing Unicode-wise and put the ellipsis in the right place when it's actually drawing the string. And if you're still using ATSUI, there are similar tags for (inaudible). If you actually do want to truncate the string in storage, you can call TruncateThemeText, which is available as part of the appearance manager.

Another thing that you need to do in an application, for example, is handle double clicks, or do whole word search, or handle word wrapping. And it turns out that word boundaries and line boundaries are not quite the same thing. That's why the script manager had two separate APIs. It had FindWordBreaks and StyledLineBreak. And we'll find out a little bit more about that later in Kida-san's presentation.

For Unicode text we've had UCFindTextBreak, but new in Leopard we have CFStringTokenizer, and CFStringTokenizer actually is now the preferred API for looking for word, line or sentence boundaries. If you're looking for cluster boundaries for things like handling arrow keys, again, you can use CFStringGetRangeOfComposedCharactersAtIndex, which we just discussed.

Okay, moving along to handling locale information. One thing you need to do in a localized way for the end user is format and parse dates and times. So the script manager had a long list of APIs for doing that, too long to list here, but those are all replaced by CFDateFormatter.

The model for CFDateFormatter is very simple. You create an object, a CFDateFormatter object that formats dates or parses dates the way you want and then you can use that object to convert a CFAbsoluteTime to a string or convert a string to a CFAbsoluteTime. You can also set properties on the object that lets you control the way that it does those operations.

So I have a quick example here. We'll create a CFDateFormatter object, and we use the current locale. Now if you've used core foundation before, you'll notice that I'm leaking the current locale here, but that's just for purposes of this example. You'd keep track of it and do a CFRelease on it later.

I'm asking for the LongDateStyle and the LongTimeStyle and then I set a property on the formatter object saying that I want lenient parsing. And then I call CFDateFormatterGetAbsoluteTimeFromString and the string I'm giving is January 1, 2001, 12 AM, GMT, and it so happens that that's the origin for CFAbsoluteTime so the answer I get is zero.
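The zero result follows from how CFAbsoluteTime is defined: seconds relative to 00:00:00 GMT on January 1, 2001. A small Python sketch of that arithmetic, using a hypothetical helper `absolute_time` rather than any Core Foundation call:

```python
from datetime import datetime, timezone

# Reference date for CFAbsoluteTime: January 1, 2001, 00:00:00 GMT.
REFERENCE_DATE = datetime(2001, 1, 1, tzinfo=timezone.utc)


def absolute_time(moment):
    """Seconds between moment and the 2001 reference date."""
    return (moment - REFERENCE_DATE).total_seconds()


print(absolute_time(REFERENCE_DATE))
print(absolute_time(datetime(2001, 1, 2, tzinfo=timezone.utc)))
print(absolute_time(datetime(1970, 1, 1, tzinfo=timezone.utc)))
```

Dates before the reference date come out negative, which is why the Unix epoch maps to a large negative number of seconds.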

Sometimes you need to do more with dates than just parse or format them. Sometimes you need to do calendar calculations, and since Tiger we've had CFCalendar, which replaces a long list of script manager APIs that are now deprecated, or actually have been deprecated for some time now. And again, we're going through this very quickly. CFCalendar lets you go between a list of calendar components, like the year, month, day, hour, minute, second, and compose those into an AbsoluteTime.

Or go in the other direction. You can also do arithmetic on dates. Say, for example, I want to take a CFCalendar that represents today and I want to say what's the date that corresponds to 10 months from now. I can add 10 months to that date and it will do all the calendrical calculations correctly. CFCalendar also supports non-Gregorian calendars, which is something that we've supported in the OS since Tiger, so we would like it if your application supported non-Gregorian calendars as well.
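Adding months is a good example of why calendar arithmetic should not be done with fixed second counts: months differ in length and the year can roll over. The following Python sketch handles the Gregorian calendar only. The function `add_months` is hypothetical, and clamping to the last day of a shorter month is an assumption made for this illustration, not a statement about how CFCalendar resolves that case.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Add a number of months to a Gregorian date, clamping the day."""
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp, for example, the 31st to the last day of a shorter month.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


print(add_months(date(2007, 6, 11), 10))
print(add_months(date(2007, 4, 30), 10))
```

A non-Gregorian calendar has different month counts and lengths, which is the case a calendar object is meant to handle for you.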

Very similar to dates and times, you also need to format and scan numbers in your application, and it's very analogous to the way dates and times work. You create a number formatter object. You can use it to convert to and from a string. And you can set properties on that formatter that control the way it does its work.

In this case, you can actually set the format pattern string if you want to use a particular number format as opposed to using the number format that you get from the user's locale. So what we're doing here is we're creating a number formatter, again with the current locale and again, leaking the current locale and we're asking for the currency style. CFNumberFormatter supports a variety of styles. It supports currencies, plain numbers, floating point numbers and percent.

In this case, we're taking the number 1,992 and we're formatting it as currency. And another thing that's different from CFDateFormatter is you'll notice we have to pass the type of the number, and in this case it's a double. CFNumberFormatter supports converting to and from various binary formats, not just doubles and floats, but also integers as well and also CFNumbers. In this case, we have the Russian locale selected, so we get an amount formatted in Rubles, which is shown at the bottom.

Okay. That ends our whirlwind tour of text management APIs, and I hope you're not too dizzy from that because we went very quickly. And now I'm going to turn things over to Yasuo Kida, who's the manager of our Tokyo OS Engineering Group. Kida-san.

[Yasuo Kida]

Thank you, Deborah. I'm Yasuo Kida, the manager of OS Engineering in Tokyo. I'm actually living here, managing the Tokyo engineering team. I'll talk about CFStringTokenizer. It is a text segmentation API that's new to Leopard. Let's see what it is.

It can tokenize a given string into words, sentences or paragraphs. Not only can it tokenize languages that have explicit word boundary marks, like this, it also tokenizes languages like Japanese and Chinese into natural word boundaries. It can do so by utilizing a technology called morphological analysis.

By utilizing morphological analysis, it can give Latin transcription not only for alphabetic languages like Arabic or Indic, it can also provide Latin transcription for ideographic languages like Japanese and Chinese. Look at those two kanji that are highlighted. Those are the same kanji, but have different pronunciations. And CFStringTokenizer can provide the correct Latin transcription based on the context of the text.

Sometimes a word consists of multiple words; these are called compounds. CFStringTokenizer can break compounds such as German noun compounds. And notice that the first component sub token is different from the original token. It can also break CamelCase and words with numbers into symbols. In the previous examples, you are interested in tokens that are surrounded by boundaries, but there are cases where you are interested in the boundaries themselves.

One of those cases is finding line breaking opportunities. Another case is finding word breaks, which are used for the double click range. For some languages, word breaking and line breaking opportunities are similar, as shown in the above example. For some other languages, those are completely different. Lastly, it has an API that does language identification. It identifies the language of a given text, like this.

Here is the relationship with other similar APIs. So first of all, do not use the Language Analysis framework, FindWordBreaks, StyledLineBreak, or UCFindTextBreak for tokenization or for finding text breaks. And do not use TransliterateText for Latin transcription. Instead use CFStringTokenizer for tokenization and Latin transcription. Use CFStringTransform for other transcriptions like Deborah mentioned. So let me show you how it works.

( Silence. )

Okay.

( Silence )

So here you have a bunch of text. This is the desktop application. So let's drag and drop this text into a text field. So what do you think this language is? Let's click on the identify language button. And this is Simplified Chinese. And let me do the tokenization. It can tokenize text into pieces. Now let's see. And this is Latin transcription. Another example is this one.

I cannot read this language. Let's try identifying the language. This is German. And notice that not only did I identify the language, I set the spelling checker to the language which is identified. And you see the spelling checker now works correctly on this text compared with this state. And it can tokenize the German compounds into pieces. Look at this word. It's a single word in the original text. How about this one? How about this one? What language is this? This is French. How about this one? This is the last one.

( Silence )

This is Dutch. And this is an example I like.

( Silence )

This is a mixture of some Latin language and some Kanji. This is in German and let's get the Latin transcription. You can get the Latin transcription correctly even for the Japanese part.

( Silence )

Let's get back to the presentation screen, please. Thank you. So let's see how you program it. So for most purposes other than language identification, you first create the tokenizer by calling CFStringTokenizerCreate. And for this word tokenization case, you specify kCFStringTokenizerUnitWord to it. And then you iterate by calling CFStringTokenizerAdvanceToNextToken for as long as it returns a nonzero value. And you can access the range of the current token. The CFStringTokenizer API is paying a lot of attention to memory allocation. It is trying to reduce the need for memory allocation as much as possible during the iteration process.

You can add one API to it to do the Latin transcription. You call CFStringTokenizerCopyCurrentTokenAttribute with kCFStringTokenizerAttributeLatinTranscription, which is pretty much the only attribute that is currently supported, to get the Latin transcription for the current token. And here is how you break compounds. It's a little bit complicated, but the basics are similar. First you create an array. Because, as I mentioned earlier, there are some cases where tokens are different from the original text. So you need to get those into arrays of ranges.

And if the type says, yeah, I have sub tokens, you call CFStringTokenizerGetCurrentSubTokens and get those sub tokens in the array. Here's an example of finding line breaking opportunities. You create a tokenizer by specifying kCFStringTokenizerUnitLineBreak. And instead of forming (inaudible), you call CFStringTokenizerGoToTokenAtIndex by providing a character index to it.

Usually you're not interested in getting a series of tokens for the entire text, but you are interested in knowing the text boundaries surrounding or near a particular character index. The last thing you do is language identification. Simple. One call. You give it a string and a range. If you give it a range with a length of zero, it reads an appropriate number of characters to recognize the language.

So that was CFStringTokenizer. Next I'll talk about the Dictionary Development Kit. What is it? In Tiger we released an application called Dictionary App. It had a US English dictionary and thesaurus and that was it. In Leopard we added the capability for you to add dictionaries to it. So this is what the Dictionary Development Kit is for.

You can develop dictionaries for Dictionary App. Like regular dictionaries and thesauruses, or bilingual dictionaries like Thai and English, or specialty dictionaries like (inaudible) or some technical area or medical, or you can develop reference libraries or glossaries, glossaries for in-house use. It is HTML based, so very simple. Each page is an HTML page. And you can think of a dictionary as a set of HTML pages tagged with terms.

It will use a rich set of terms you already know. And it is in UTF-8 throughout, so it is Unicode and it can support many languages. Let's look at the source. So Dictionary Development Kit defines a few XML tags that surround and encapsulate the concept of the dictionary and each entry. This is a document entity, dictionary, our root tag.

Inside the root tag you have entry tags. You have multiple entry tags in the dictionary. And this is an entry tag that corresponds to an entry called AirPort. The ID is a unique identifier for the entry. You would use it for linking between entries, for example.

You can add multiple indexes and here's your content. Typically it starts with an <h1> tag like this. And your content, and let me add a link between entries. In this case it links to AirPort Extreme. And we define the URI scheme, that is x-dictionary, and you can add images. It's HTML.

And you can actually add your own CSS, images, et cetera, and you provide an Info.plist, where you provide the dictionary name and bundle identifier, et cetera. And compile it into a dictionary. You can find the Dictionary Development Kit at the following directory: /Developer/Extras/Dictionary Development Kit/. We are planning an Xcode version of it, but that might become available on the ADC website. And let me compile a dictionary.

( Silence )

- It is taking a long time.

( Silence )

- Let me open it.

( Silence )

- Hello. Oh yeah. Phew.

( Silence )

So this is the source code for the dictionary we just created. It is rather small. We need the Xcode. ( Whispering ) Okay, anyway.

This dictionary has two entries. One is for AirPort, the other for AirPort Express. And like we created in the presentation, it has a link to the other entry and an image. So cd to the project directory, and make. And it will create a dictionary in the objects directory. And let me copy it into the Dictionaries folder. If your Library folder doesn't have a Dictionaries folder, you can create it.

And launch Dictionary, voilà. This is your dictionary. Let me type AirPort, it finds the entry, it should. Hello, yep. So this is the title as we said, and the content and the link and the picture. And you can click on the link, then go to the other entry. So it's that simple. And thank you very much. Let me introduce Mr. Michael Grady to talk about the Text Input API.

[Mike Grady]

Thank you, Kida-san, for that. Hello everybody. We'll be discussing the state of Text Input Sources in Leopard. You've heard several times in this talk that the Script Manager is deprecated, and some parts of it are really, really deprecated, like the KeyScript API. The reason is not just that those APIs are based on the Mac script code, which is a deprecated notion.

But really they have not, as an API, been able to keep up with the variety of text input sources available on Mac OS X. What are examples of text input sources? You see up here a text input menu. It has a number of keyboard layouts: the US keyboard, Arabic, Czech, and the Unicode Hex Input keyboard layout that we'll talk a little bit about later.

You also see some input methods. And here you see some input modes that belong to the Kotoeri input method.

( Speaking in foreign language )

Are three input modes that belong to the input method and the grayed out portion of the menu you see up here, Kotoeri. There are additional input sources as well.

You'll see the character palette available at the bottom of the menu, and the keyboard viewer palette. These are all examples of text input sources that the Script Manager could not really deal with. So we're introducing a new API in Leopard, a modern API, to help you control these input sources. And we'll tie it all together nicely afterwards with a demo.

What are the things that you would expect the text input source API to provide? Well, of course, accessing properties of an input source: icon, localized name, and so forth. Discovery or iteration, to find input sources that may or may not be enabled already by the user. The current "state" of the user's configuration. And something that simply could not be done with the Script Manager: manipulation of the input sources from the background, having them take effect in the foreground.

So let's get into that. What is a text input source? It's typed as a TISInputSourceRef. It is registered with the CF runtime as an instance of CFType, and its type is identified uniquely by TISInputSourceGetTypeID. That makes it a candidate to be used with the normal polymorphic functions you'd expect: CFRetain, CFRelease, and CFEqual.

Once you have a TISInputSourceRef, what can you do with it? Well, the workhorse of the API is accessing properties of an input source. TISGetInputSourceProperty is shown here being passed an input source, in this case the current keyboard input source, as you see at the top, and we're querying a property, in this case its BundleID. Very simple API, very powerful.

What are some of the properties that you might be interested in? We'll go over a set here. Identification: input sources have a unique InputSourceID. For input methods and input modes that's very straightforward, it's just the BundleID concatenated with the ModeID. For keyboard layouts it's important, if you develop a keyboard layout, that you provide your own InputSourceID. Keyboard layouts should be bundled .keylayouts, and the KLInfo dictionary that they usually contain is where you specify your unique InputSourceID. Input sources also belong to categories. That's a broad distinction between things like KeyboardInputSources or PaletteInputSources. There's a finer-grained distinction between them by the input source type.

For example, something may be a keyboard layout or a keyboard input method. In the case of palettes, that may be a character palette or a keyboard viewer. Other static properties of input sources would be whether it can be enabled. Most input sources can be enabled, some cannot. For example, an input method may provide a private keyboard layout in its own bundle. That is a text input source that cannot be enabled.

Things that can be enabled are not always selectable. For example, we know that input modes of an enabled input method can be selected, but the parent input method itself cannot be selected, although it can be enabled. Input sources are associated with the languages they support. For example, the US keyboard layout could be used to type not just English but French and Spanish. So an input source will typically be associated with several languages, the first of which is the intended language. The intended language is also another property that you should be specifying in your keyboard layout's bundle.

Keyboard layout input sources in particular have keyboard layout data. The TIS, or Text Input Source, API only makes available the Unicode keyboard layout data, the 'uchr'. The 'KCHR' format is completely deprecated; it's not accessible through the TIS API. And this is a reminder also, as you provide your keyboard layouts, to make them bundled .keylayouts. Some nonstatic state of input sources might be whether it's currently enabled or currently selected.

So you've seen what we can do with the TISInputSourceRef, what it is, but how do we get one in the first place? Well, we have a powerful iterator API called TISCreateInputSourceList. It's a very simple API. It takes a filter, really a dictionary of property keys and values, and it returns an array of text input sources that match your criteria. Another API is TISCopyInputSourceForLanguage. This takes a language in the BCP 47 format, a language code such as would be returned by CFLocaleCreateCanonicalLanguageIdentifierFromString.

And it'll return the most recently used text input source for that language, or one whose intended language is what you specify. And we'll discuss some convenience functions that also return TISInputSourceRefs. Here's an example of the iterator API in action. TISCreateInputSourceList is being passed a filter, a dictionary of one key-value pair. In this case we're interested in all input sources whose category is palette input source.

We're interested in all palettes. You'll note the second parameter to TISCreateInputSourceList is important. In this case we're only interested in text input sources that match the criteria and that happen to be enabled currently. There are times where you would want to find input sources that are not yet enabled by the user but are installed in the system.

So you pass true in that case, to look through all the installed input sources, but you'll want to exercise some care with the filter you pass in. If your filter is too loose, not restrictive enough, you may end up allocating memory for a whole bunch of objects in your app's memory that you did not intend to use. You see the second API we mentioned, TISCopyInputSourceForLanguage. Again, it takes a language code in BCP 47 format, in this case the code for Czech, for which it'll return the last used input source for Czech.
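The interaction between the filter dictionary and the second parameter can be illustrated with a toy model. This is plain Python, not the Carbon API: dictionaries stand in for TISInputSourceRef objects, and the key names, values, and the function name are invented for illustration rather than being the real property constants. It only shows the two rules described above: every key-value pair in the filter must match, and disabled sources are skipped unless all installed sources are requested.

```python
# Toy model only: plain dicts stand in for TISInputSourceRef objects, and the
# key names below are illustrative, not the real TIS property constants.
INSTALLED = [
    {"id": "US", "category": "keyboard", "type": "layout", "enabled": True},
    {"id": "Czech", "category": "keyboard", "type": "layout", "enabled": False},
    {"id": "CharacterPalette", "category": "palette",
     "type": "character palette", "enabled": True},
    {"id": "KeyboardViewer", "category": "palette",
     "type": "keyboard viewer", "enabled": False},
]


def create_input_source_list(installed, properties, include_all_installed):
    """Return the sources whose properties match every pair in the filter.

    When include_all_installed is False, only enabled sources are considered.
    """
    matches = []
    for source in installed:
        if not include_all_installed and not source["enabled"]:
            continue
        if all(source.get(key) == value for key, value in properties.items()):
            matches.append(source)
    return matches


if __name__ == "__main__":
    palette_filter = {"category": "palette"}
    enabled_only = create_input_source_list(INSTALLED, palette_filter, False)
    everything = create_input_source_list(INSTALLED, {}, True)
    print([s["id"] for s in enabled_only])  # enabled palettes only
    print(len(everything))                  # empty filter matches all installed
```

The last call shows why a loose filter combined with the all-installed flag is the case to be careful about: an empty filter returns an object for every installed input source.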

I mentioned some convenience APIs as well for the current state of the user's configuration. You may be interested in the current keyboard input source. Again, this may not necessarily be a keyboard layout. It could be an input method or an input mode. The current input mode may not be Roman. It may not be ASCII capable, and that may be something you want. What is the last used ASCII-capable keyboard input source? For that we have TISCopyCurrentASCIICapableKeyboardInputSource.

Neither of those needs to be a keyboard layout, and you may really want the keyboard layout for yourself; perhaps you want to retranslate key events, changing modifier bits and so forth. You can get that through TISCopyCurrentKeyboardLayoutInputSource. That input source may not be ASCII capable either, so we have another accessor, TISCopyCurrentASCIICapableKeyboardLayoutInputSource. And in the case where an ASCII-capable keyboard layout is not actually enabled by the user, this will usually return the one that was used in MacBuddy, in the setup assistant, at the time you installed the system.

Here are the things you can do in the manipulation space. You can enable input sources and disable them, and you can select usually any input source if it's selectable. But you cannot deselect a keyboard input source. There's only one keyboard input source active at a time, so you can only select another one, and the deselection is implicit in that case. But you can explicitly deselect palette input sources. And again, these are things that could not be done with the Script Manager: selection, enabling, disabling, and deselection from the background, which automatically takes effect in the foreground.

And so we have two notifications that are dispatched through the CF distributed notification center, not the local center. We have SelectedKeyboardInputSourceChanged and EnabledKeyboardInputSourcesChanged. These would be dispatched, for example, as a user enables or disables input sources in the International preferences panel. And so to show you how powerful these APIs are, we'll tie everything together with a nice demo that Peter Edberg will be helping us with.

[Peter Edberg]

Hi. Thanks. What I've got here is an app that's a bit of a cross between the International preferences Input Menu pane and the text input menu. And so, just to illustrate the discovery or iteration: basically it requested a list of all of the installed keyboard input sources in the system, and that's what the table is displaying here.

In terms of properties, and I should mention that everything that's being done here is being done with public TIS API, some of the properties are being displayed in the table view here: for example, whether it's ASCII capable, what the type is, the icon, and so on. And to get additional properties I can select a row, for example, and I get the InputSourceID, the BundleID if it's a bundled source, the InputModeID if it's an input mode, and the language codes.

And the language codes for key layouts we figure out automatically from the repertoire of the key layout and the ICU exemplar or the CLDR exemplar set. So, for example for Czech, it's intended for Czech but it can also be used to input a number of other languages including German, English and so on.

There are notifications going back and forth. So, for example, right now we haven't got Czech as an enabled key layout, but if I enable it here, the text input menu will reflect that, and so on. If I select it here in the text input menu, it shows up as selected here.

In terms of the current state, the bottom four fields show the state of the various convenience APIs that Mike was mentioning. So, for example, if I select US, then US is our current keyboard input source, it's the ASCII-capable keyboard input source, it's the current key layout, and it's the ASCII-capable key layout.

If I select say Russian, okay well Russian is not ASCII capable but it is the keyboard layout input source and it's also the current key layout. I can select, for example Romaji. Romaji is ASCII capable okay so it shows up as the ASCII capable input source and so on.

Let's see, and I think the only other thing I wanted to illustrate is that the APIs that enable manipulation of input sources can operate in the background. Let me just get rid of a few of these so we don't have to go through so many.

I'm going to turn off some of the input methods here and then go to this little facility that will cycle through the enabled input sources in the background. So you can see that the menu up there is changing and I'm just going to be hitting the same key on the keyboard here.

So now we've got TextEdit in the foreground, and my little test app here is working in the background, changing the input sources. So there's Arabic, and Czech, and so on, anyway, French, okay. And that's about it. So that's what you can do with the TIS API. Thank you.

( Applause )

[Mike Grady]

Thank you, Peter. The rest of the talk will briefly consider the state of the Text Services Manager. TSM, for short, is introducing a new model for restricting input sources. This was formerly done by using the super-deprecated KeyScript API. Restricting input sources means limiting the set of text input sources available to the user in the context of a particular text field, for example. So we'll be considering that, and a brief mention of some support in TSM Carbon events for resolution independence, modern event parameter types, and the state of TSM in 64-bit.

So, input source restriction. The KeyScript API had a limitation, or an inconvenience: first of all, it was global in scope, at least application-wide in scope. And whenever you restricted the set of input sources for a particular text field, you had to carefully undo that before switching to another text field so that the effects would not spill over.

But really the set of text input sources available for text field are a property of that text field and it should be encapsulated by the input context, in this case the TSMDocument itself. So, TSM is introducing two new TSMDocument properties, the EnabledInputSourcesPropertyTag and that allows you to specify an array of TISInputSourceRefs to which the user should be restricted in the context of the text field.

And you can also specify a preferred input source within that set to be switched to the next time that text field gets focus. And again, because they are properties of the input context, the TSMDocument, you can set them at any time, even at the creation of your text field. Set it and forget it; let TSM do the gymnastics of doing and undoing the restriction for you.

Now, as the user's login session goes on, the user may at various times enable different input sources or disable some. And you would want the set of enabled input sources to which you restrict a particular context to reflect that. So you'll from time to time want to listen for the EnabledKeyboardInputSourcesChanged notification that is dispatched through the CF distributed notification center. Here are some examples of the new document properties in use.

We have the EnabledInputSourcesPropertyTag being set for a particular TSMDocument by just passing in an array of TISInputSourceRefs in the first example, and then at a later time simply removing that restriction with the TSMRemoveDocumentProperty API. If you want to specify a preferred input source within a set of restricted input sources, you have an example of the override property tag being used for that.

Cocoa also has a similar notion. While it doesn't expose the notion of a TISInputSourceRef, it does allow you to restrict the set of input sources available for a text field using the setAllowedInputSourceLocales API. And again, instead of passing an array of TISInputSourceRefs, you pass an array of language codes or language identifiers in BCP 47 format; locale identifiers work as well. And if at a later time you want to remove the restriction, just pass nil.

By far the most common type of input source restriction that you'll see used is the need for Roman input or ASCII-only input. There's a convenience API, TISCreateASCIICapableInputSourceList, that will return to you the entire set of enabled ASCII-capable input sources, and you can just pass that directly as the value of the EnabledInputSourcesPropertyTag. And Cocoa has a similar notion with a constant, NSAllRomanInputSourcesLocaleIdentifier, that you can pass to setAllowedInputSourceLocales.

TSM has modified a number of event parameters in the Carbon events that it defines to be float types, where in the past there was a mixture of different numeric types, typeFixed, typeSInt16, and so forth. You can request any event parameter that's typed as a float using typeCGFloat if you want.

And in that case you will get data that's in 72 dpi by default. But you can also, as shown in the example below, request that event parameter using a coordinate space variant of the typeFloat. So you see here we're asking for the ReplyLineHeight, which is a unidimensional value, but you may want that in screen pixels instead of in 72 dpi.

And this is done automatically by the Carbon Event Manager; it does this coercion for you automatically, independent of what is actually in the Carbon event. That applies to typeFloat, and it applies to points, typeHIPoint. So you have screen pixel variants of typeHIPoint and also of typeHIRect.
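As a worked illustration of what the two coordinate spaces mean for a value like the line height, the arithmetic can be sketched as below. This assumes the coercion amounts to multiplying the 72 dpi value by a single uniform display scale factor, which is the usual resolution independence model but is not spelled out in the talk; the function names and sample numbers are invented for the example and are not part of TSM or the Carbon Event Manager.

```python
def to_screen_pixels(value_72dpi, scale_factor):
    """Convert a length in 72 dpi units to screen pixels.

    Assumes a single uniform scale factor for the display.
    """
    return value_72dpi * scale_factor


def point_to_screen_pixels(point_72dpi, scale_factor):
    """Convert an (x, y) point the same way, assuming a shared origin."""
    x, y = point_72dpi
    return (x * scale_factor, y * scale_factor)


if __name__ == "__main__":
    line_height = 14.0  # hypothetical line height in 72 dpi units
    print(to_screen_pixels(line_height, 1.0))   # 14.0
    print(to_screen_pixels(line_height, 1.5))   # 21.0
    print(point_to_screen_pixels((100.0, 40.0), 1.25))  # (125.0, 50.0)
```

At a scale factor of 1.0 the two spaces coincide, which is why code that only ever ran at 72 dpi never had to distinguish them.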

TSM has also introduced support for a number of modern data types. We see here typeCTFontRef, referring to the CTFontRef data type; CTGlyphInfoRef is also supported, in addition to CFAttributedStrings. Now, an application may modernize to use those types in the events it returns but the input methods may not, or vice versa.

So TSM solves this compatibility problem with a promise mechanism, where it will promise the modern parameters from the older parameters and vice versa. So it automatically coerces one set of parameters to or from the other based on how you request the data. In the example below we have the OffsetToPos event.

You may return an ATS font and point size and the input method may request that information using the kEventParamTextInputReplyCTFontRef. And TSM will coerce that from the older data. There's another example of coercing an attributed string from Sintax and GlyphInfo and font array in the event. And this works in both directions.

By now you've all heard that a part of Carbon, the tool box, the part of the Carbon that deals with user interface is not available in 64-bit and that includes the Text Services Manager. This would not normally be an issue for you because the direction, of course is to go with Cocoa.

And Cocoa encapsulates all of the interaction with the Text Services Manager anyway. Of course, if you're doing some heavy-duty customization of Cocoa, in particular of layout engines, perhaps you want to do vertical text layout, you will need to support interactions with input methods and so forth. And that is doable by using the NSInputContext. This was made public in Mac OS X 10.4.

Component-based input methods are also not supported in 64-bit, so you'll be very interested in deploying your input method for 64-bit using the Input Method Kit. There's a hands-on session on the Input Method Kit Friday morning, after the lab that we'll be showing later on. I believe that's session 577.

So in summary, for my part of the talk: it's a Unicode world. Let's get off of the Script Manager and stop using KeyScript. We really need all the keyboard layout developers or providers to switch to 'uchr' and bundled .keylayouts. We talked a little bit about the Cocoa support for restriction of input sources, the allowed input source locales API, and a similar notion in TSM through the document properties.

Really, you'll get the latest goodness by reading our headers: TextInputSources, TextServices, and CarbonEvents. And again, for input method deployment for 64-bit you'll really be interested in the Input Method Kit session on Friday morning. And a brief summary of everything we've talked about this morning: again, it's a Unicode world, be Unicode. We talked about the modern CF APIs available to replace the older deprecated functionality, in particular CFStringTokenizer and the TextInputSources API. Some contact information for your Evangelist, Deric Horn.