
WWDC02 • Session 200

Making Your Application Unicode-Savvy

Carbon • 54:36

What's larger than a breadbox and has an airplane, a snowman, and two smiley faces? It's Unicode, an international character set standard that can represent most of the world's languages. Unicode is central to Apple's operating system strategy and should be central to your application strategy as well. In this session, developers learn how to create a Unicode-savvy application, providing better support for such heavily used languages as Japanese.

Speaker: Deborah Goldsmith

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

Hi, everybody. Thank you for joining us today. Welcome to session 200, Making Your Application Unicode-Savvy. If you were here yesterday in one of the keynotes in the afternoon, you saw Scott Forstall discuss the new international features in the upcoming version of Mac OS X, codenamed Jaguar. A lot of people have been asking in the past years about tier-three support in Mac OS X, better support for Unicode fonts, better support for extended character sets, better support for keyboards. Well, I'm glad that today we're going to be actually telling you about all these things. One thing that was missing in the keynote is that support for new languages will be achieved only, and I say only, through Unicode.

So you will need to have a Unicode application, use Unicode technologies to display text, in order to be able to do, you know, Indic scripts, support BiDi, support Hebrew, Arabic, and all these new languages we will be supporting in Mac OS X. What was my title? To discuss all these Unicode features that we're going to be adding in the upcoming version of Mac OS X, I'd like to introduce Deborah Goldsmith, who is the manager, actually, of our font team, and she'll be discussing all these features today. Deborah.

Is this mic on? No. Okay. Alright, I'll speak over here. Is this mic on? Hello? Okay. Oh, is this mic on now? Yes, okay. Good morning, everyone. Today we'll be talking about how to make your application Unicode-savvy. And in fact, I'll be covering two sets of topics. First, I'll give an overview of the Unicode support in Mac OS X. And then I'll go into more detail about what you can do in your application in order to support Unicode.

So first, let's look in depth at the Unicode support in Mac OS X, both in the current version and in the new version, codenamed Jaguar. Before I do that, I'd like to talk a little bit about the past. In the past, on Mac OS 9, we used a technology called WorldScript rather than Unicode.

And in WorldScript, every script, every writing system had its own character encoding. In fact, some of them, like Roman and Arabic, had more than one. There were several different versions of the Roman encoding. And these character encodings were based originally on standards. Mac Roman is based on ASCII, and Mac Japanese is based on Shift-JIS, but they weren't standards themselves. They were unique to Apple.

How did you figure out which one of these character encodings to use? Well, that was implied by the ID of the font you were using. If the ID of a font fell within a specific range, that implied a specific script system and a specific character encoding. The problem with that is that if you chose the wrong font for some text, you got garbage. Would anybody like to try to guess what language that is at the bottom of the slide? No, it's not French, it's Korean, but you can't tell because it's the wrong font.

So, Unicode provides a much better solution to the problem of character encoding. First of all, Unicode is a worldwide standard. It's defined by the International Standards Organization and the Unicode Consortium, and it covers most of the writing systems in use in the world today. In Unicode, unlike the Mac OS character sets, every character has its own unique code point. So, a given code number means one and exactly one character. So, an A is always an A is always an A, with apologies to Gertrude Stein. And in Unicode 3.2, there are 95,156 graphic characters, meaning characters that produce something on the screen.

So what are the advantages for Mac OS X of using Unicode as a character encoding? First of all, there's one character set for all languages. You don't have to worry about which character set you're using for a specific language. There's only one. You get more characters for existing languages. So even if it's a language you're already supporting through Unicode, you can get access to more characters that your customers want to get at. And I'll talk more about that later. There's support for more languages.

Xavier mentioned that briefly in the introduction, and we'll see more of that later in the presentation. A big plus is no more garbled text. You don't see garbage like you saw in the previous slide in the Korean example because each character has a unique meaning. It doesn't depend on the font that's in use.

Because Unicode covers multiple languages, you can get multiple languages at once in your text stream without having to worry about which font is being used. So plain text can support multiple languages. A very important example of plain text is file names in the Finder. And finally, because Unicode is a worldwide standard, not just defined by Apple, it's cross-platform. You can send Unicode text to Windows, to Unix systems across the internet, and other applications and other systems will know how to deal with it.

Let's take a little look at the Unicode coverage in the currently shipping version of Mac OS X 10.1. In our Lucida Grande font, we support a wide variety of Roman characters as well as several other languages. So we have support for the Roman characters, not just that are in the Mac Roman character set, but also that you would find in Mac Central European and several other accented Latin letters. We also support the Cyrillic alphabet that's used in Russia and other Eastern European countries. And we support the set of accented letters that are needed for Vietnamese.

And finally, there's a complete set of modern Greek characters in Lucida. And that's all available today to Unicode applications. If you have a non-Unicode application, Lucida will just look like another Mac Roman font and you won't get access to these extra characters. Some of our other fonts, like Times, Helvetica, Geneva, and so on, also have an extended set of Roman characters. And again, these are only available to Unicode applications.

Next, I'd like to talk about the Japanese support in Mac OS X, and this is where our Unicode support really starts to shine. In Mac OS X 10.1, we shipped a family of six desktop-publishing-quality OpenType Type 1 fonts, the Hiragino family. And these fonts have more Japanese characters than anything we've seen on the Mac platform and I think lead the industry in terms of their character coverage. These fonts contain the complete repertoire of the new JIS X 0213 character set standard. They contain the complete set of Adobe-Japan1-4 characters.

They also contain the set of Shaken 78 kanji that are used for photo typesetting machines in Japan. And finally, they contain--they follow the shape recommendations from the National Language Committee for kanji that are not in the Jōyō kanji, the Hyōgai kanji. So it's a very complete character repertoire.

So the net result of all this is, compared to the approximately 7,000 characters that are available in the Mac Japanese character set, there are over 20,000 characters available for Japanese support in the Hiragino font family. And again, those are only available to Unicode applications. And because of this, the gaiji problem is greatly reduced. Now, what's the gaiji problem? For those of you not familiar with it, 7,000 characters just isn't really enough to support Japanese the way customers want to.

So if you're printing people's business cards or doing some other application where you need a kanji that's not in Mac Japanese, you have to use a font editor or a special dingbat font or something to get that kanji character that you need.

And because it's done that way, it's not a standard character, and it's not interchangeable with other applications or systems. That's the gaiji problem. Because the Hiragino fonts have so many characters in them, the gaiji problem is greatly reduced, and rather than being something that you have to deal with every single day, it's something you can deal with on an exception basis.

Because of that, your data is now reusable and can be sent to other platforms where it'll be understood, unlike the standard gaiji solutions. Because these are our standard Japanese fonts, you can use them in the user interface of your application and not just for documents. And there's more in our planned support for Jaguar. There are some characters in the Hiragino fonts that are not accessible through Unicode because they don't have Unicode code points.

There are slightly variant versions of kanji that exist in the font already. And for Jaguar, we plan to provide access to these additional glyphs through an extension to TSM. And that will also be discussed at the ATSUI session this afternoon, I believe. So now I'd like to demonstrate a few of the features of our Japanese support via Unicode.

Here I have a document, and this document contains some Japanese. And I've color-coded some of the characters in this document. The characters that are in red are characters that are not in JIS X 0208, which is equivalent to Shift-JIS and roughly equivalent to Mac Japanese. And the characters that are in blue are not in JIS X 0213, but are in Unicode.

So in the first sentence here, I have an example of a name written two ways. It's Yoshida. And the standard way of writing Yoshida is with this kanji. But some people with that name like to write their names this way, and you can see that the characters are slightly different. In the standard version, the middle stroke is longer, but in this different version, the middle stroke is shorter.

And if that's the way I like to write my name, I'm probably going to be a little irritated if my computer forces me to do it this way. And so I'd like to be able to use that character for my name, but in a Mac Japanese application, I can't do that because only this version is available. But in a Unicode application with the Hiragino fonts, many more kanji are available.

Similarly, here's some more family names. They're all Kusanagi. This one can be written using characters in standard Mac Japanese, but these two versions can't. You can see that there are characters. They're in red or orange, I guess. And so they're in JIS X 0213, but not in JIS X 0208, and again, not in Mac Japanese. Down here we have a person's name. This is Takata, except that the first kanji is again a different form. It's not available in Mac Japanese. It is available in Unicode. And there's also a character in the person's given name that's not available in Mac Japanese.

Here's the name of a department store in Japan, Takashimaya. And again, the first kanji is a non-standard form that's not available in Mac Japanese. So you can see that especially for people's names or names of companies, there's demand for these characters because people have used them traditionally for that purpose. And you can help your customers do that by supporting Unicode in your application.

And I have a lot more examples in here. I'm not going to go into them in detail. But I want to show you some of the differences, the subtle differences that you can find in kanji that are important to customers. So this is a standard Mac Japanese character.

And if I make it a little bit bigger, you can see that in this version of the character, there's two strokes. There's a break there. But in this alternate version, which is not in Mac Japanese and not in JIS X 0213, you can see it's done with a single stroke.

So these are some of the kinds of subtle differences that customers can start to express with an application that supports Unicode. Before I go back to my slides, I'd like to give a brief tour of the Japanese repertoire in the Hiragino font. So I brought up the character palette in our Japanese input method, and you can see all these characters here. But a lot of them, I don't know if it's visible on the projector, a lot of them are grayed out.

And the reason for that is if you click on them, you'll see at the bottom it will tell you that a Unicode application is required to access this character. And you can see that there are quite a number of them that fall into that category. In fact, most of them. That's because two-thirds of the characters in Hiragino are only accessible to Unicode applications. Let me go back to my slides now.

Okay, so that's what we have that's shipping today in Mac OS X 10.1. How about what do we have planned for Jaguar? Well, in Jaguar, we'll be turning our attention to Chinese, and we're planning to include some new fonts to support the new Chinese GB18030 standard, and these fonts contain over 32,000 characters, so they greatly increase our Unicode repertoire.

These fonts fully support the entire CJK unified ideograph region in Unicode, plus ideographic extension A, so this means all the ideographs in versions of Unicode up to 3.0. In addition, these fonts contain support for the Yi writing system, which is a minority writing system in China, and partial coverage of Tibetan and Mongolian. So we're adding new support all the time, but again, this is only available to Unicode applications.

Well, we're not just adding new support for Chinese. There are some old favorites that have been available on Mac OS 9 for a long time that we're finally adding support for in Mac OS X. These include Arabic, Hebrew, Thai, Devanagari, Gurmukhi, Gujarati, Icelandic, Turkish, Greek, Croatian, Romanian, Slovenian, and a new one, Hawaiian.

Which is very important in Hawaii, obviously. And as Xavier mentioned, all future scripts, these scripts plus any scripts we might add in the future, will only be available via Unicode. We are not planning to support WorldScript I or Roman variant scripts on Mac OS X. In addition, we are not planning to make any extensions to WorldScript II, so we will not be extending our existing double byte character sets, nor will we be adding new ones. Everything in the future will be via Unicode.

Now I'd like to talk a little bit about what's required to add support for a language to Mac OS X. Well, first of all, you need to have a font so that you can display the characters on the screen or on the printed page. You also need to have an input method or a keyboard so you can enter the characters.

Other things you might want to do are override the sort order. The default is to use the Unicode sort order that's defined in Unicode Technical Standard Number 10, but you might want to override that for language-specific sorting. And, of course, every country has its own date, time, and currency formats, so you want to be able to format dates, times, and monetary amounts using that country's standards.

Unfortunately, today, we don't allow you to override either the collation order or the date, time, and currency formatting, which is why those are shown in blue. It's always been possible to add fonts to Mac OS X, and new in our planned support for Jaguar, you can now add new keyboards.

But first, I'll talk a little bit about what you need to do with a font in order for it to work with Unicode applications. And this is aimed primarily at font developers. First of all, your font needs to have a valid, comprehensive Unicode 'cmap'. And by comprehensive, I mean that every character that your font contains should have a 'cmap' entry. Quite often, we'll see a font that has a glyph for a character, but there isn't an entry in the Unicode 'cmap'. And if that's the case, it's not accessible to a Unicode application. It should have a valid 'post' table in order for printing to work.

It needs to have a valid 'name' table containing the PostScript name, which is, again, necessary for printing and also to work with AppKit. It needs to have a correct unique name, the full name, the family name, the style, and a version string. It should have a valid OS/2 table, and the fields of primary interest there are ulUnicodeRange and ulCodePageRange.

The reason we would like to see those in your font is so that the system can tell what parts of Unicode or what code pages the font supports. And finally, if your font supports a language that has shaping behavior like Devanagari or Arabic, you want to put in a morph table in order to control that.

Now, if your font just supports standard Unicode composition like base characters and combining accents, it's planned for Jaguar that we will support that composition automatically because we will synthesize the necessary tables in the system. But still, even in that case, it's better to put an explicit morph table in your font because then you have complete control as the font designer.

Okay, next I'm going to talk about keyboards, and we have something new that we have planned for Jaguar, and that's drop-in keyboard support. So finally, thank you, thank you, finally you can add a new keyboard to Jaguar without having to patch the system frameworks. There's a new folder in the library folder called Keyboard Layouts. It can either be in the shared library folder or the library folder in your home directory.

And it supports all of our existing keyboard formats, the 'KCHR' format and the 'uchr' format. You can just drop a file in containing one of those as long as it's in the right format, and that keyboard will become available. But we also have a new keyboard format that's meant to make it a little bit easier to design your own keyboards. And that's a Unicode keyboard layout format defined via an XML text file.

Now, functionally, this is equivalent to our existing 'uchr' keyboard layouts, but it's a lot easier to edit. In order to edit a 'uchr', you had to read the documentation in the document Supporting Unicode Input. You had to decipher it. There were no tools available. It wasn't very easy to use. So this makes things a lot easier.

Because we now support drop-in keyboard layouts, you can't get at these keyboard layouts via the resource manager. So if you have an application that accesses keyboard resources directly, you need to use the new APIs, which have names like KLGetKeyboardLayout, et cetera, et cetera, to access keyboards, because the keyboards that are added via the keyboard layouts folder will not be accessible via the resource manager. This is very similar to what happened with fonts in earlier versions of Mac OS X.

So now I'd like to show you a little bit about what one of these XML keyboard layouts looks like. And I'll switch back to the demo machine. Let me hide WorldText away. Let's see. You need to have TextEdit. So here I have TextEdit. And just to show that these are part of the system, I'll switch the encoding to UTF-8. So, the keyboard layouts are in System, Library, Keyboard Layouts; look in the Unicode bundle. And there they are. Let's open the US Extended keyboard layout.

I'll make this a little bigger. Let's make it a 13-point font. There. Okay, so I opened this straight out of the keyboard layouts folder in the Jaguar system. Here's an example. This is the U.S. extended keyboard layout, which used to be called the extended Roman keyboard layout. And you can see it's a standard XML file, and it just uses standard XML syntax to describe how the keyboard layout works.

First, there's a section which defines which hardware keyboards are supported, and this is necessary for remapping keys on JIS keyboards. There's a section which defines how the modifier keys are mapped. And then down here is the meat of the keyboard, which is the key map set, which maps a virtual key code to a character, or in some cases, a state machine action.

And one of the nicest things about using XML for a keyboard layout is because XML uses Unicode, this is a UTF-8 text file, I can just put the characters that I need right in the file. So let's scroll down to a section where there's some more interesting characters, like the options section. You can see there are all sorts of interesting characters in here. I don't know if those are visible.

You can just type them in directly. Also, because it's XML, you can use a numeric character entity to define a character as well. You don't have to actually type it into the document. And then way down at the end, past all the key mapping sections... There are the state machine actions which control things like dead keys.

So this is a good example to look at. If you install the Jaguar CD that you received yesterday, you can just navigate to the same place and open this file with TextEdit and look through it and get an idea of how to create XML keyboard layouts yourself. Okay, I'll go back to the slides now.
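The numeric character entity mechanism mentioned above is easy to check with any XML parser. Here is a minimal sketch in modern Python (an editor's illustration, not something from the session; the `<key output="...">` element is a made-up stand-in, not the real keyboard layout schema):

```python
import xml.etree.ElementTree as ET

# A numeric character entity (&#x00E9;) and a directly typed character
# parse to exactly the same Unicode text, which is why either style
# works in a UTF-8 XML keyboard layout file.
doc_entity = ET.fromstring('<key output="&#x00E9;"/>')
doc_literal = ET.fromstring('<key output="\u00e9"/>')

assert doc_entity.get("output") == doc_literal.get("output") == "\u00e9"
```

The same equivalence holds for characters outside the first 64K of Unicode, which is handy when your text editor can't display them.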

Okay, so that concludes the section on Unicode support in the operating system, and now I'd like to spend the rest of the presentation talking about what you can do in your application to support all these great Unicode features that you just saw. Before I do that, I'd like to talk about some of the concepts that are important in understanding and supporting Unicode. And probably the most important concept is the character glyph model. It's a very central idea that's necessary in processing Unicode. And it's discussed in detail in Unicode Technical Report 17.

The best way to understand the distinction between characters and glyphs is you can think of characters as being like the spoken language. They represent strings of information the way that you would speak them aloud. Glyphs are the shapes that you see on the screen or on the printed page. You can think of that as being the written form of the language.

And quite often, there's a one-to-one relationship between characters and glyphs. One character goes to one glyph, but that's not always the case. That's true in complex writing systems like Arabic and Indic, but it can be true even in English and Japanese, and we'll see examples of that later on. And a Unicode rendering system, in order to fully support Unicode, has to be able to map characters to glyphs in a many-to-many fashion.

So here's an example of the character glyph model at its most daunting. This is a string of characters in the Devanagari script. It's Sanskrit. I'm going to use the mouse over here so I can point things out a little bit. Here is a string of Unicode characters in Devanagari in the order in which you might speak them. But the order in which those characters are presented on the screen is actually quite different. As part of the layout process, while the first character winds up in the first position, that's not too surprising, but the second character winds up down here.

The third and the fourth characters get combined into a ligature up here. The fifth character is down here, and the sixth character is back over here in the middle. So you can see there's quite a lot of rearrangement that goes on, and in no sense is there a one-to-one mapping between the characters and the glyphs. And this is just a normal feature of the Devanagri writing system. And this is not the only writing system in the world that has features like this. There's quite a few of them.

And just to demonstrate that this actually goes on live in the system and not just on a slide in a presentation, I'm going to bring up WorldText again. I'll close the Japanese document that I have open. And I'm now using the Devanagari keyboard. You can see some of the keyboards that are available in Jaguar. And I'm now going to type that example. And you can see that as I type, different characters wind up in different places. And the form that's shown on the screen changes shape as I type each character.

And again, this is only available to Unicode applications. Okay, let's switch back to the slides. So that's some of what you have to worry about when you're handling Unicode. But let's now get into the details of supporting Unicode in your application. So the first thing you have to worry about when you're handling Unicode is, how do I store it? And Unicode can be stored in several different forms.

Probably the purest form of Unicode is what's called the scalar value. And this is just a number from 0 through 10FFFF that defines a particular Unicode code point. The scalar values are used in HTML and XML. You saw an example in that XML keyboard that I brought up. Those numeric character entities were Unicode scalar values.
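To make the idea of a scalar value concrete, here is a small sketch in modern Python (an editor's illustration only; the session itself predates these tools):

```python
import sys

# A Unicode scalar value is just a number; chr() and ord() convert
# between scalar values and characters.
assert ord("A") == 0x41           # U+0041 LATIN CAPITAL LETTER A
assert ord("\u00e9") == 0xE9      # U+00E9, e with acute accent
supplementary = chr(0x20000)      # a scalar value above the first 64K

# Python's code point range matches Unicode's 0 through 0x10FFFF range.
max_scalar = sys.maxunicode
assert max_scalar == 0x10FFFF
```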

But we don't use that in Mac OS X, except in the context of HTML or XML. The format of Unicode that's most widely used in the industry and that we use in Mac OS X for Carbon and Cocoa and Java is the UTF-16 form, and that represents Unicode code points as either one or two 16-bit values. If a Unicode code point is in the first 64K, that is, it's from 0 to FFFF, then it's one 16-bit value. If it's above that, then you use two.

There's another form of Unicode called UTF-8 that's used in the BSD part of Mac OS X and is also used in Unix systems, wherever Unicode is used. The reason is it's compatible with ASCII. It looks like a multibyte encoding like Shift-JIS or any other multiple byte encoding. And so it works better with software that's expecting multibyte encodings. There are one to four bytes per Unicode scalar value in UTF-8. And finally, there's another form called UTF-32, where each Unicode scalar value takes 32 bits, and that form is not used in Mac OS X or indeed on most other operating systems.
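The storage sizes described above can be verified directly. A small Python sketch (an editor's illustration, not part of the session):

```python
# The same scalar value takes a different amount of space in each
# Unicode encoding form.
cjk = "\U00020000"   # a CJK ideograph above the first 64K of Unicode

utf16 = cjk.encode("utf-16-be")
utf8 = cjk.encode("utf-8")
utf32 = cjk.encode("utf-32-be")

assert len(utf16) == 4   # two 16-bit values: a surrogate pair
assert len(utf8) == 4    # UTF-8 uses one to four bytes per scalar value
assert len(utf32) == 4   # UTF-32 is always one 32-bit value

# A character in the first 64K needs only one 16-bit value:
assert len("\u00e9".encode("utf-16-be")) == 2
# ...and an ASCII character is a single byte in UTF-8:
assert len("A".encode("utf-8")) == 1
```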

Another issue to worry about in storing Unicode is the normalization form that you're going to use. Well, what does that mean? This is discussed in Unicode Standard Annex Number 15, if you want to read about it in more detail. But in Unicode, there's several ways to represent the same character. Not for every character, but for a good many of them.

A very good example is the E with an acute accent that you see up there. There's two ways to represent it. One is by a pre-composed character, that is the E with acute accent as one character, and another is as the character E followed by a combining accent, and that's the form you see on the left there. And the combining form is used for most Latin support, but there are a large number of pre-composed forms in Unicode as well, and those are there for compatibility.

So, which form do you use? That's where the normalization forms come in. In the fully decomposed form of Unicode, which is the version that we use for file names in HFS+, you always store things as a base character plus a combining mark. So you'll never see the pre-composed E with acute accent in the fully decomposed form.

That's one form. And that's what we use in HFS Plus. But on the Internet and on Windows, they prefer the other form, the canonical composed form. And in that form, you always use the precomposed form, if it's available (sometimes it's not), in preference to the base character followed by a combining mark.

And in the currently shipping version of Mac OS X, it's possible to convert from the canonical composed form to another form, but it's not possible to convert to the canonical composed form. We're planning to add support for that in Jaguar so you can convert to the canonical composed form in order to send it to Windows or to another application that expects it. And there are two more forms which I won't discuss today that have to do with compatibility decompositions. You can read about them in Unicode Standard Annex Number 15.
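The two normalization forms described here can be sketched with Python's standard unicodedata module (an editor's illustration; HFS+ itself uses a frozen variant of the decomposed form, so this is the general mechanism, not Apple's exact table):

```python
import unicodedata

precomposed = "\u00e9"     # é as one character (canonical composed style)
decomposed = "e\u0301"     # e plus combining acute accent (decomposed style)

# The two spellings are different code point sequences...
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# ...but normalization maps each spelling onto the other:
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```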

So you have these UTF-16 values which I discussed. How do you store them? Well, if you're using Cocoa, the answer is easy. It's NSString. NSString provides string storage. It will store UTF-16 for you. Carbon has a subset of NSString called CFString. And in addition, you also have the option of storing them as raw arrays.

We have a data type called UniChar, which represents a single UTF-16 value that you can use. On disk, either in your own documents or in a .strings or .plist file for your app's user interface, you can use either UTF-8 or big-endian UTF-16. The UTF-16 format is preferred, but UTF-8 will work just fine.
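Either on-disk form round-trips the same text losslessly, which can be sketched like so in Python (an editor's illustration; this is not the actual .strings loader, and the sample text is made up):

```python
import os
import tempfile

def roundtrip(text, encoding):
    """Write text to a temporary file in the given encoding, read it back."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)

sample = "r\u00e9sum\u00e9 \u65e5\u672c\u8a9e"
assert roundtrip(sample, "utf-8") == sample
assert roundtrip(sample, "utf-16") == sample   # writes a BOM, reads it back off
```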

Well, now that you've got some Unicode text, the first thing you're going to want to do is process it. And that's where you can start to run into trouble. The first set of issues I'm going to talk about have to do with handling characters. Now, you might think that characters are pretty simple once you have 16-bit characters, but as I've mentioned in the slide on normalization forms, there's actually several different ways to represent them.

So one issue in processing characters is dealing with clusters. An example of a cluster is the base character plus combining form, which I just showed you, such as the E with acute accent. But it can also occur in other languages. For example, the Hiragana character ga there, which is a base character ka plus a combining dakuten.

Another issue with representing characters is surrogates. Now, I mentioned that characters that are above the first 64K of Unicode require two UTF-16 values to represent them, and those are called a surrogate pair. So the character on the left is in the first 64K of Unicode, so it's represented by a single UTF-16 value.

But the character on the right is not, and so it requires two UTF-16 values to represent it. And finally, because there are different ways to represent the same character data in Unicode, there's different spellings, if you will, for the same character data. The pre-combined form on the left or the decomposed form on the right.

So what happens if you're not aware of these issues when you're processing data? Well, one example on the top there is if I'm not aware that I can have a base character plus a combining accent, if I were truncating that text there, "long résumé," I might think that the combining accent is a separate character from the base character. And so when I'm truncating, I might just lop it off. As a result, I've now got an unaccented "e," which is not the result that the user wants to see.

If I have a surrogate pair for a character that's not in the first 64K of Unicode, and I think that those are two separate characters because they're two separate 16-bit values, then I split the character in two and I wind up with garbage like that. And finally, if I'm doing sorting or searching and I don't recognize that the precomposed form of "e" with acute accent is the same as the decomposed form, then I might think that all of those words are different when, in fact, they're all exactly the same.
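Both failure modes can be reproduced directly. A Python sketch (an editor's illustration; Python strings index code points rather than UTF-16 units, but the cluster hazard is identical, and the résumé string here is an assumption matching the slide):

```python
# Naive truncation can separate a base character from its combining accent.
word = "long re\u0301sume\u0301"   # "long résumé" with combining accents
naive = word[:7]                    # chops right before the first accent
assert naive.endswith("e") and not naive.endswith("e\u0301")
# The user now sees a bare "e" where the accented letter used to be.

# The same goes for surrogate pairs in UTF-16 storage: splitting the
# two 16-bit values of one character in half leaves undecodable garbage.
utf16 = "\U00020000".encode("utf-16-be")
assert len(utf16) == 4
broken = utf16[:2]                  # half a surrogate pair
try:
    broken.decode("utf-16-be")
    raise AssertionError("half a surrogate pair should not decode")
except UnicodeDecodeError:
    pass
```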

Of course, the way to avoid all these problems is to use the APIs that Mac OS X provides to your application. So for Cocoa applications, NSString has a pair of APIs. rangeOfComposedCharacterSequenceAtIndex: will tell your application the boundaries of a character cluster. So the base character plus the combining accents, or in the case of a surrogate pair, it will tell you the boundaries of that surrogate pair.

So if you want to truncate a string or get a substring out of a string, you can use that API to find a safe place to break the string. The NSString compare: API will compare two strings in a Unicode-savvy and locale-sensitive way so that things like multiple spellings for the same data are taken into account.

On the Carbon side, the Unicode utilities have two APIs, UCFindTextBreak and UCCompareText, which will find boundaries in Unicode text and compare text in a Unicode-savvy way. And finally, you can truncate your text while you're drawing it, without actually having to change the underlying data, by using the ATSUI line truncation tag. And again, that's done in a Unicode-savvy way.

But wait, there's more. When you start to get into the really complex writing systems that Unicode can support, there are even more problems, particularly with bidirectional and cursive scripts and with rearrangement. So let's take a look at some of these more complex writing systems. Arabic is a right-to-left language. It's written right to left. And in addition, it's also a cursive script, which means that the letters flow together and form ligatures. So in this example, let me use the mouse pointer again.

The first character, which is on the left here, winds up on the right side of the final result. And the individual letters flow together in a cursive way. Hebrew is also bidirectional but is not cursive, so it's a fairly straightforward alphabetic language, except that it also goes right to left.

But even English can have some of these attributes. For example, the Zapfino font, which is part of Mac OS X and has been since the beginning, is a cursive font and turns English into a cursive writing system. So the reason to support these kinds of writing features is not just for languages that you may not care about. It's also to support nice typography in English.

And as we saw earlier when I showed an example of typing in Devanagari, there's also rearrangement where rather than left to right or right to left, you have something even more complicated where things just wind up in different positions altogether. So, for example, you can see that the second character in this Devanagari string winds up as the first glyph on the screen.

What happens if you don't take these issues into account? Well, probably the biggest problem in supporting bidirectional or rearranged text is style runs. Many, many applications will draw style runs one at a time. So you find the first range of characters that has a given font or a given style. You draw that. You take the next range. You draw that. That's the way QuickDraw works. The problem is if you do things that way and you don't take bidirectional issues into account, you get the wrong result.

So, for example, the first string, which is drawn correctly, has an English word, an Arabic word in a large font, and the same Arabic word in a small font, followed by another English word. And because Arabic goes right to left, it's important that the first word come on the right. But if you just draw style runs left to right, you'll wind up putting the word with the large font on the left, because it's the first one you encounter in the text stream, and the result is wrong.

For scripts with rearrangement like Devanagari, you get even worse results if you assume that things go left to right. The mapping between a character index and a glyph index is very complicated, and if you assume that they're the same, hit testing, highlighting, moving the cursor with the arrow keys, none of that is going to work correctly.

Of course, we provide APIs to work around all these problems. Because of bidirectional issues in Unicode, to do bidirectional layout correctly, you have to work on an entire paragraph at a time. You can't just do a little bit at a time. That's the way the Unicode bidirectional model works. And by a paragraph, I just mean the text between one hard line break and another hard line break.

So both Cocoa and Carbon provide APIs to do this for you. On the Cocoa side, there's NSAttributedString, which lets you store an entire paragraph with style runs and all, and NSTypesetter, which will lay that out. On the Carbon side, we have ATSUTextLayout and ATSUStyle, which will also let you represent an entire paragraph or more of styled text. That lets the system have all the text in the paragraph at once so that it can handle these issues for you.
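As a rough sketch of the very first step such a paragraph-at-a-time layout pass has to take, here is the split into paragraphs in Python (treating only LF and the Unicode paragraph separator as hard breaks; real code would also handle CR and other separators, and would then hand each paragraph to the layout engine whole):

```python
# Bidirectional layout needs to see a whole paragraph at once, where a
# paragraph is just the text between hard line breaks. A layout pass would
# therefore start with a split like this (LF and U+2029 PARAGRAPH SEPARATOR
# treated as hard breaks here; real code handles CR and friends too).
text = "First paragraph.\nSecond paragraph.\u2029Third."
paragraphs = text.replace("\u2029", "\n").split("\n")
assert paragraphs == ["First paragraph.", "Second paragraph.", "Third."]
```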

If you're mapping between character offsets and coordinates on the screen, there are APIs available that will do that for you: ATSUOffsetToPosition and ATSUPositionToOffset, plus the ATSUI APIs for moving the cursor with the arrow keys. And on the Cocoa side, NSLayoutManager will handle all of these issues for you.

So all is well and good. You have your Unicode application working perfectly. There's only one problem: not all the data out there is in Unicode, or a lot of it isn't, anyway. Your customers have lots of documents and information that are not stored in Unicode, and you need to get that into your Unicode application. So how do you do that? Well, there are many APIs in the system for doing encoding conversion, but one thing you need to know is which encoding the data is in. And quite often the data isn't labeled as to which encoding it's in.

There are some APIs you can use to make a good guess. Probably the one that you should use most often is GetApplicationTextEncoding. And what that does is return the character encoding that corresponds to the language that your application is currently running in. So if I have an application that's localized into several different languages, and I launch it and it's running in English, that encoding will be Mac Roman.

If I launch it and the Japanese version runs because of the user's language preference, then that encoding will be Mac Japanese. So that's the encoding, the legacy encoding, that corresponds to the language of the user interface. Usually that's the right choice. Sometimes it's not. Sometimes you want to use the encoding for the user's most preferred language, regardless of what language your application is running in.

For that, you can use CFStringGetSystemEncoding, and that will return the legacy encoding that corresponds to the language that the user put first in their language preferences. If neither of those is right, then it's pretty much application dependent. For many Internet applications, the character encoding information will come in as part of the Internet header or will be embedded in the document, and you can use that.

Once you know the right character encoding to use, there are several APIs available to convert to and from Unicode. CFString and NSString both have APIs for converting between text encodings. At a lower level, the Text Encoding Converter is available, and that supports converting between a wide variety of character encodings. And all of these handle not just legacy Mac OS encodings; they also support encodings that are used on the Internet and encodings that are used on Windows. It's just a wide variety of them.
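The shape of such a conversion can be sketched with Python's codec machinery, standing in for the Mac OS X converters (the byte string is a made-up MacRoman example):

```python
# Legacy bytes -> Unicode -> other encodings, sketched with Python codecs.
legacy = b"r\x8esum\x8e"           # "résumé" in MacRoman, where 0x8E is é
text = legacy.decode("mac_roman")  # legacy bytes into Unicode
assert text == "r\u00e9sum\u00e9"

# Once the data is Unicode, re-encoding for the Internet or Windows is easy.
assert text.encode("utf-8") == b"r\xc3\xa9sum\xc3\xa9"
```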

Now I'd like to go over the APIs we looked at today and summarize their capabilities. We looked at CFString and NSString. Those provide storage for UTF-16 characters and other encodings as well, but the basis for those string APIs is UTF-16 and Unicode. They provide APIs for string manipulation and encoding conversion. In addition, NSString provides more support for Unicode. It provides APIs to find cluster boundaries, which are very important if you're going to break a string in the middle, and for doing locale-sensitive collation.

At a lower level, the Unicode utilities provide many functions for doing Unicode operations in your application. You can find text boundaries, that is the boundaries between characters, between clusters, between words, and between lines. So it contains support for all of those, both going forwards and going backwards. They provide APIs for doing cursor movement. So if the user hits the arrow key, you can figure out where to move the cursor to in either direction. There are APIs for doing text comparison or collation for comparing two strings of Unicode text. And again, that's done in a locale-sensitive and Unicode-savvy way.

And the Unicode utilities also provide APIs for mapping between the locales that are used in our more modern APIs and the regions and script codes that were used in the Script Manager. So if you have some data that has a Script Manager script code or a region code, the Unicode utilities provide a way to convert that to a locale. There are also APIs in the Script Manager for converting a script code or region code into a text encoding that you can use for character set conversion.

Well, a lot of this has been very low level, but fortunately there are also higher-level APIs that will do even more for you, so that you have to worry about less in your application. In Carbon, there's MLTE, the Multilingual Text Engine, and that's a complete text editing solution for Unicode. If you use MLTE, you don't have to worry about cursor movement, you don't have to worry about breaking text between characters, words, or lines.

All that is taken care of for you. You just put the text in, and MLTE takes care of displaying it and supporting editing for you. It's a superset of the functionality that was available in the old TextEdit APIs that we've had in Mac OS since the beginning. On the Cocoa side, there's NSTextView, which similarly provides a complete solution for Unicode text editing and display.

Some other APIs that will be useful for your application are the Text Encoding Converter. Again, it supports a large number of encodings, not just for Mac OS, but for various systems. And it supports multiple forms of Unicode, including, new for Jaguar, support for converting to the canonical composed form of Unicode.
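That canonical composed form is Unicode Normalization Form C, and the conversion can be sketched in Python (the filename is hypothetical):

```python
import unicodedata

# Mac OS X tends to store text such as filenames in decomposed form; the
# Internet and Windows generally prefer the precomposed (NFC) form.
decomposed = "re\u0301sume\u0301.txt"  # hypothetical filename, NFD-style
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "r\u00e9sum\u00e9.txt"  # each e + accent fused into é
assert len(composed) < len(decomposed)
```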

If you're doing your own text input, if you're not using MLTE or if you're not using NSTextView and you have to receive Unicode input directly, then you'll want to use the Text Services Manager. And that's necessary to support input either for Unicode or for double-byte languages or for both. And again, if you use MLTE or NSTextView, you don't have to worry about that.

I'd like to recap what we're planning to do in terms of Unicode support for Jaguar. First of all, we'll be adding more Unicode coverage. We'll be adding fonts to support the GB18030 Chinese standard. And those fonts have over 32,000 characters in them. We'll be adding a wide variety of languages that we supported previously on Mac OS 9 via the language kits, as well as a couple more besides.

We're adding support so that you can drop in new keyboard layouts without having to modify the system. And among the keyboard layout formats that we support is a new XML text file format, which makes it really easy to edit a keyboard layout yourself and add new keyboard layouts to the system. And finally, as I mentioned, we'll be adding support for converting to the precomposed form of Unicode, which should make it a lot easier to send data out on the Internet or to Windows systems, where that's the preferred format.

Well, there is a lot of information in this presentation, but there's even more information out there, and here are some of the places that you can go to get that. The best source of information about Unicode is, not surprisingly, the Unicode Consortium, and the website is www.unicode.org. And there you'll find all sorts of information. You'll find all the technical reports I mentioned.

You'll find a complete set of code charts showing a representative glyph and the character code for every single character that's in Unicode. And you'll find sample code for processing Unicode data. If you want all that in an easy-to-reference format, you can buy a copy of the Unicode 3.0 book. There's the ISBN number. And that's available at technical bookstores everywhere. And you can order it from Amazon.com and all sorts of other places as well.

We have a special web page for users concerned with international issues, and that's developer.apple.com/intl, and that summarizes all of our international technologies. So it's kind of one-stop shopping for finding references to information that you need to make your application Unicode and international savvy. If you're a font developer or you're interested in fonts, we have a web page that's specifically for supporting font developers, and that's developer.apple.com/fonts. There you'll find complete specifications for the TrueType font format, information on how to add a morph table to your fonts, plus a wide variety of font tools that you can use to edit your fonts and add more features to them.

There are some other sessions that you'll probably find useful if you're interested in the kind of topics we covered today. This afternoon, there's session 202, Drawing Text with ATSUI, and this will discuss our Carbon APIs for drawing Unicode text. And that's at 3:30 in Room J. And if you are a Carbon developer, I strongly recommend attending this session. You'll find out a lot more about how to draw Unicode text in your application.

Tonight, at 7:30, there's the International Birds of a Feather session, which is in Room N downstairs. And that will be your opportunity to meet the engineers who work on these technologies and ask them all sorts of questions that you might have that weren't covered in the presentations. On Thursday, if you're looking for a higher-level solution for Carbon text, there's a session on the Multilingual Text Engine. That's Thursday at 9 a.m. in Room A2. And again, I highly recommend that if you're a Carbon developer.

And finally, all of this stuff works perfectly, but just in case you want to talk to us for another reason, there's the International Feedback Forum, which is Thursday at 5 p.m. in Room J1. And I'm sure you won't have any feedback to give us, but you can come anyway.