Unicode for Japanese, Chinese, and Everything Else! - WWDC 2003

Application Frameworks • 1:06:52

Learn how Unicode can give your application new international muscle. Customers and governments in Japan, China, Hong Kong, and Taiwan are demanding support for extended character sets--characters that can be accessed only via Unicode. We'll discuss how to meet those requirements, and cover other technologies that can give your app a boost in the Japanese, Chinese, and Korean markets. We'll also cover other planned improvements for Unicode support on Mac OS X.

Speakers: Deborah Goldsmith, Yasuo Kida, Mike Grady

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Hi, everybody. My name is Xavier Legros, and I'm the Mac OS X evangelist in developer relations. And I'd like to welcome you to session 404, Unicode for Japanese, Chinese, and Everything Else! I had to look at the title. It's kind of long. Before we start with this session, where we have great content, we're going to give you an update on what we've been doing with regards to Unicode in Panther. I'd like to take just a moment, because last night--I don't know if you went--but we had the Meet the Evangelist event downstairs.

And I represent a lot of technologies on Mac OS X to developers, but most of the time, I get a lot of questions on, "Well, why should I use Unicode? Why, as a developer, should I really focus on using ATSUI, or using the Cocoa layout engine, or using MLTE? Why do I need to use Unicode in my application?" And that was very interesting, because it seems a lot of developers still don't understand all the benefits of Unicode. We've been talking about it since the beginning of Mac OS X. We've had sessions for the last three years. But let me give you just a quick rundown on why, as a developer, you should really focus on Unicode and use it as much as possible in your application.

First, if you're a CJK developer and you're developing an application for China, Japan, or Korea, you just have to do it now. Apple has been investing a lot of money and effort in our Japanese support, for instance, and the only way you can take advantage of some of the features we're going to be talking about today, like, for instance, accessing the 32,000 glyphs that we have in Hiragino, is by using Unicode.

But then, second, it's important for you to understand that Unicode is where, actually, Apple is putting all its efforts. We're not doing any more things. We're not supporting any new languages in WorldScript. And for you, you have to see that as a way of evolving your application.

So please take good note of all the content of these sessions. We have brand-new features for our customers in the CJK countries. And for that, I'd like to invite on stage Deborah Goldsmith, who's the manager of fonts and is the Unicode liaison with the Unicode Consortium. Deborah.

Thank you, Xavier.

Good afternoon, everyone. Today we're going to talk about Unicode and Mac OS X. So here's a quick introduction to what we're going to be discussing today. The market in Japan and China has changed, and in order for your application to be competitive there, you need to support Unicode. Luckily for all of you, Mac OS X has great Unicode support, and we'll talk today about the tools that are available for your application.

Specifically, we'll be discussing how governments and customers in Japan and China are asking for new features and new characters in particular. We're going to discuss why only Unicode can meet those new requirements. We'll talk about some great new features in Panther that are only available through Unicode and to Unicode applications. We'll talk a little bit about how Unicode is different from WorldScript, from what you may have been doing before, and what you need to do in your application in order to work with Unicode.

So there are a lot of reasons to move to Unicode for the Japanese and Chinese markets, but here's the biggest problem by far, and that is that customers are demanding more characters. Well, why is that? In order to give you -- if you're not literate in Japanese or Chinese, in order to explain by analogy, suppose that your name is Smith, but you spell it S-M-Y-T-H-E.

Now, you may go to a website, maybe it's Amazon.com, and you want to enter your name to give an order, but when you do that, the website comes back and says, "I'm sorry, in order to use this website, you have to spell your name S-M-I-T-H." Well, that's pretty bogus, you think. Why should I have to change the way I spell my name in order to use this website? Well, that's exactly the situation that many customers in Japan and China find themselves in.

And the reason is that -- I think it's interesting that the number -- the variety of characters that people use to write their names is much larger than what has traditionally been in the Japanese and Chinese character sets on Mac OS. So I have an example here. This is five different ways -- let me see if I can use the laser pointer.

That's kind of tiny, but that third line there is five different ways of writing the Japanese family name, Watanabe. And you may not even think that some of them even look different, but all of them are different, and there's actually a lot more ways than that for writing that family name.

And so customers, not surprisingly, want to be able to write their name the way they write it on everything else. They don't want to have to fall back to some different standardized form of the character when they're using a computer. And this doesn't just affect people's names. For example, using the Mac traditional Chinese character set, it's not possible to write the names of all the subway stops in Hong Kong or even the name of the new international airport there. So it's a big problem. There's just not enough characters for what customers want to do with computers today.

Governments are also specifying more characters. All of the major character set standards in Japan and China have been revised over the last few years. There's new versions of all of these character sets, and they all specify many more characters than were in those character set standards before. Some of them are not just specifications. In particular, the GB18030 character set in China and the HKSCS character set in Hong Kong are government requirements. The government requires that software support these character sets.

So in order to meet all these new requirements, we have to support more characters. The problem is that the WorldScript system, which has been in Mac OS for a long, long time, can't support any more characters. It has limitations, and it just can't support the number of characters that are needed.

So the answer is Unicode. Unicode is an industry standard. It's one encoding that handles all the living languages in the world today and a large number of dead ones besides. Because it's a single encoding, a character is a character is a character. The meaning of the character doesn't change depending on what font you have. If you've used Japanese or Chinese or Korean or what have you on Mac OS 9, you might have been in the situation where you used the wrong font and you see something like that, garbage characters. Would anybody like to guess what that really is?

It's Korean, but you wouldn't know that unless you chose the right font. With Unicode, this doesn't happen, because the meaning of a character doesn't change depending on the font. Unicode solves the character problem because it's got plenty of room for all the characters that customers and governments need. The latest version of Unicode, 4.0, which was just released a couple of months ago, has over 96,000 graphic characters. So that easily covers all the needs that customers have, and it covers all of the new Asian character set standards.

Let's talk a little bit about what kind of Unicode support we have in Mac OS X. Our main human interface font is Lucida Grande, and in Panther, it now covers all of the Roman characters in Unicode, and all of the Greek characters. And it covers several other scripts besides. Our other core Roman fonts, like Times and Helvetica, also have a large Roman repertoire, although they don't cover the entire set of Roman characters in Unicode.

But beyond this, we've got lots of great font coverage in Mac OS X. Our Japanese support is outstanding. We have six beautiful Japanese desktop publishing fonts. The family name is Hiragino. They're in Type 1 format. And as you can see, they're really beautiful. There's an example up there on the screen.

And these fonts have greater character coverage than any other Japanese fonts on the market today, and they cover all of the major standards that you might be interested in for the Japanese market. Not just JIS X 0213, but also Adobe-Japan1-5, characters that are used for phototypesetting, and even the complete set of government shape recommendations from the National Language Committee.

But we don't support just Japanese. We also have great Chinese support. Mac OS X, since Jaguar, has had support for the Chinese GB18030 standard. These are also beautiful fonts; there's another example up there. The GB18030 fonts have over 32,000 glyphs and support all of the Chinese characters in plane 0 of Unicode, every single one, as well as minority languages like Yi. For Mac OS X Panther, we're adding support for HKSCS from Hong Kong and Big5-E from Taiwan for traditional Chinese, and these new fonts have over 22,000 characters.

But we support even more languages. Also new for Panther, we've extended our Arabic coverage. It doesn't cover all of the Arabic in Unicode, but we cover a bigger chunk than we did before, and support more languages. We support some of the scripts you see here that we also supported in Jaguar.

We've also added support for native North American languages, like Inuktitut and Cherokee. And we've also added a new font that we built ourselves to cover a lot of the symbol blocks in Unicode. And I won't read all these off, but you can see that we have a lot more symbol coverage than we had in earlier releases of Mac OS X.

Okay, this time it went. But we haven't just added fonts. We've also made other improvements to our international support. As you may have heard at the State of the Union session yesterday from Mac OS X, Unicode text drawing is much faster, over twice as fast as it was in Jaguar.

We've also improved our bidirectional support. You no longer have to specify to the system whether a paragraph is left to right or right to left. It will figure it out heuristically. So your users can just type, and it will determine whether it's a right to left or a left to right paragraph and put the punctuation in the right place.
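One common way to guess a paragraph's direction is the "first strong character" rule from the Unicode Bidirectional Algorithm: scan for the first character whose direction is unambiguous and use that. The Python sketch below illustrates only that idea. The talk does not say which heuristic Panther actually uses, and the function name `guess_paragraph_direction` is made up for this example.

```python
import unicodedata


def guess_paragraph_direction(text, default="ltr"):
    """Guess 'ltr' or 'rtl' from the first strongly directional character.

    Illustrative only: this is the first-strong rule, not necessarily
    the heuristic that Mac OS X applies.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right (Latin, CJK, ...)
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    # Only digits, spaces, or punctuation: fall back to the default.
    return default


english = "Hello, world!"
hebrew_first = "\u05e9\u05dc\u05d5\u05dd, world!"   # Hebrew letters, then Latin
arabic = "\u0645\u0631\u062d\u0628\u0627"           # Arabic letters
neutral_only = "1234 ..."

for sample in (english, hebrew_first, arabic, neutral_only):
    print(repr(sample), guess_paragraph_direction(sample))
```

Leading quotation marks, digits, and spaces are skipped because they carry no strong direction of their own, which is why the user can "just type" without declaring a direction up front.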

We've updated our bidi algorithm to the Unicode 4.0 standard, and we've also made several bug fixes. Now, I should mention that the seed that you folks have received doesn't have all the latest stuff that we're working on, so you may not see some of this until you get the GM release of Panther.

We've also made some fixes to our support for Indic languages. In 10.2.4, we introduced dictionary-based Thai word break. But with Panther, we're now adding the ability for users to specify their own dictionary and supplement the built-in one, so users can now put their own list of Thai words in, and that will affect word break in every application in the system.

And finally, Apple supports 16 languages for localization, and we haven't expanded that list in Panther. However, we've always had a longer list of languages than the ones we support ourselves, so that you, the developers, can localize your applications into languages we don't support. Just as an example, I noticed that somebody released a Serbian localization for Safari. We don't support Serbian ourselves, but because of the extra languages that are available, you can do that if you want to. And we've expanded that list for Panther.

Well, Arabic and Hebrew and Japanese and Chinese are fine, but you say, if I don't have Hanunoo support in my application, my company is toast. What are you going to do? Well, it's not a problem because Mac OS X allows you to add support for new languages yourself. There's two easy steps. All you need to do is add a font, and we have a font developer website that you can go to to find out how to build fonts and how to enable them to work with Mac OS X.

And then you need a way for users to input your new language, and you can do that either via keyboard or via an input method. And we have a tech note on how to do keyboards, and we have sample code for input methods. So, rest assured you can add Hanunoo support yourself. Hanunoo is a Philippine writing system, by the way, in case you were wondering.

We've made improvements at the API level. In Jaguar, we introduced the Variant Glyph Access Protocol. And the reason that we have that is that even though Unicode has over 96,000 characters, there are some ways of writing Chinese ideographs where you have different variants even though they're considered the same Unicode character. There are still slightly different ways of writing it. And that's actually very similar to what you see in Roman fonts.

You can have Roman fonts that have different ways of writing the same character. Some of the fonts that we include in the system, like Zapfino or Apple Chancery, have multiple versions of the same letter that are useful in different situations. So the Variant Glyph Access Protocol lets you specify exactly which variant of a character you want, or lets the user specify that.

In Panther, we're also introducing a new protocol that gives an input method access to the entire contents of a user's document. And that lets the input method do two things. It lets it give much better accuracy for conversion, and it also supports some new human interface features, and we'll actually see that a little later on. And our Japanese input method, Kotoeri, takes advantage of this.

We've also made extensions of the Japanese input method, Kotoeri, to allow access to a lot more of the capabilities that fonts have that have been hidden up until now. So you can now get at those through the font panel, and that information can be passed to your application.

We've also improved the input menu. For those of you who are not familiar with it, the input menu is the little menu that looks like a flag, or at least it did in Jaguar, that you get when you have more than one keyboard layout enabled or if you have input methods or so on and so forth.

We've greatly streamlined it and improved the human interface. There's no longer a pencil menu. The pencil menu is specific to each input method, but we've taken the contents of the pencil menu and merged it into the input menu, so now there's just one menu. And input methods that have been revised to take advantage of this new human interface can have a much more streamlined UI and have their modes appear individually so that instead of having to choose the input method and then choose the mode, you can just go through the menu and see what's going on. It goes straight to the mode you want. Of course, older input methods continue to work flawlessly and transparently. There's no need to revise an input method unless you want to.

We've also made improvements in our input methods themselves. As I mentioned, Kotoeri has much better accuracy and a better interface with lots of new features, which we'll see in a moment. In 10.2.4, we introduced a new traditional Chinese input method, Hanin, which has much easier input for traditional Chinese. For Panther, we've expanded our simplified Chinese input method to allow access to all of the characters in GB18030.

And that's a lot of them. As I mentioned, our GB18030 fonts have over 32,000 glyphs. And finally, we've added more plain keyboards for more language support. And now to show you some of these new features, I'd like to bring up Yasuo Kida and Michael Grady for a demonstration. Michael?

If we can switch to demo machine three, please. Hello, everybody. I'd like to give you a brief demo of the UI improvements we've been making to the text input menu for Panther. This is not a Panther machine. This is Jaguar. I just wanted to go over some of the problems we tried to solve with the input menu in Jaguar.

You have the US flag. One problem we found from a focused user group study was discoverability was a big problem. Non-Mac users had no idea that this menu contained anything to do with input. And many of them could not figure out how to switch from US or a Roman keyboard layout to the Japanese input. So it was clear in Panther that we had to improve the icons we used throughout the system.

[Transcript missing]

The first problem is that the presence of the menu itself in the menu bar can sometimes interfere with the app's menu list and sometimes even clip portions of it off. So it's clear that we had to bring down the amount of real estate that we use in the menu bar. So let's switch over to machine number two, please, and show what things look like on Panther.

Here is an icon. You'll notice there's only one of them. And, uh, are we there yet? No. Machine number two, please. There we are. There's the icon I was speaking about. It's much more obvious and intuitive to users that there might be access to additional input modes or input sources in there. And it's on the right side of the menu bar, so it will not follow along at the end of the app's menu list and cause distraction to the user. Let's see how this works. How can we turn what used to be implemented as two menus into one?

And here's the answer. You'll notice that we have a number of input sources here. And you wonder, what are those? They're not keyboard layouts. They're not input methods. They are the input modes implemented by a particular input method--in this case, Apple's Japanese input method, Kotoeri. They're all there. They all belong to the same input method. And what we referred to in the past as a pencil menu, the second menu in the Jaguar menu bar, is flattened into this menu right here.

Now, it's interesting that these input modes have become first-class input sources, the same level as what input methods used to be and keyboard layouts. They are the preferred input source that the user should see, and in any system UI, they will be shown side by side with those other input sources at that level. Another system-provided UI is a new palette reminiscent of the input mode palettes provided by the input methods themselves in the past.

And lastly, we can bring up the international prefs and have a look at the improvements there. Before we get into that, I'd like to note that for those applications that have particularly large menu lists, or if you just don't like the input menu around anymore, you can command-drag the menu out of the way. And that can simply be reinserted by this checkbox in the prefs panel. You'll notice the hierarchical nature of input methods, and how they advertise input modes, letting the user choose subsets of input modes they would like in the menu.

And you'll also notice the US layout, by the way, still has the old flag icon, but this is being changed actively. We didn't have it ready for this demo. It can be inserted in the menu and removed, which is not something we could do in the past. In the past, whether or not you were using input modes specific to a single input method, the US layout or the Roman default layout always showed up in the menu, and it can now be removed. The input method itself can be disabled, but the modes that would be enabled if you were to reactivate it also show up.

You might be wondering about the existing input methods and what compatibility we have with those. You'll notice that they are fully supported. It's completely transparent. Choose that input method. And you'll notice that the input method-specific pencil menu shows up here automatically. The input method did not have to change, but of course we want to encourage input method developers to adopt this new input mode protocol to give the users the benefit of a single user interface for choosing input modes. And that's what I have for the text input menu. Now I'd like to bring up Yasuo Kida, who will discuss improvements in our Japanese input method.

Thank you. Hello, everybody. I'm very glad to be here with you because I'm very excited about the great improvements we are making for Panther. One of the big changes is the text input menu Michael just mentioned. And I'll tell you about--I'll show you Kotoeri 4. There are three major new features in Kotoeri.

One of them is very high conversion accuracy. We've been continuously improving the conversion accuracy since Mac OS X 10.1, and we believe we achieved a milestone with this release. Not only did we improve the engine itself, we applied a new technology called Latent Semantic Mapping, or LSM, in order to resolve a class of ambiguity which no other input method can right now, which is to find out the topic of the document you have. Consider the word "hot." If you're talking about summer, "hot" is probably about temperature. But if you're talking about Thai food, for example, it's probably about spicy hot. It's like that. I'll show you how it works.

Say, the document on the left-hand side is talking about the Jazz Festival. It says, "Monterey Jazz Festival is the oldest jazz festival in the world," which is true. The document on the right-hand side says, "Boston Marathon is one of the oldest marathons in the world." And in Japanese, both player and runner are pronounced the same, "sousha." And when they are entered in different sentences like this, the traditional input method couldn't resolve those ambiguities. But in the case of this new Kotoeri 4, it can look at this context and find the correct conversion for each case.

"Sousha" here, and "sousha" here. And convert, convert. Please look at this first character. This means "play," and on the right-hand side, this one means "run." It converts the word correctly depending on the context of the document. The other improvement we are making for Kotoeri is UI to correct conversion errors and typical mistyping. The first one is reconversion, which is "saihenkan" in Japanese.

[Transcript missing]

Click and double-type the Kana key. Oops. Oh, yeah. It gets you the candidate window.

Say you confirmed the text before you meant to. By double-typing the Kana key, you can get back the conversion state.

[Transcript missing]

Say you start typing "konnichiwa" and you suddenly notice that, oh, this is the wrong mode. You type the Kana key two times, and you can continue typing. The third--thank you. The third feature is we put in an MS-IME compatible mode for those people who switch from Windows and are comfortable with using MS-IME compatible keystrokes, and also those who are using two environments back and forth, and so want the same keystrokes on Mac and Windows.

And please note that many of those features require your help. Many of those features use--fully utilize the Document Access Protocol, the new API. So in order to provide a consistent user interface for your customers, you need to--you need to adopt those APIs. Deborah will mention those in detail. I'll--yeah. I'll return the talk back to Deborah.

Thank you, Kida-san.

Okay, so those are great new features, and as Kida-san mentioned, we need your help in order to make them available in all applications. There are even more improvements at the API level that I'll go into now. One big one that people have been asking us for: people who've converted their Carbon applications to Unicode have found that there's a sticking point.

There has been no support up till now for formatting or parsing dates, times, and numbers in Unicode. If you wanted to do that in a Carbon Unicode application, you had to use the old Script Manager APIs and then convert the text to Unicode. Well, in Panther, we're introducing a new set of APIs in Core Foundation that lets you format and parse dates, times, and numbers: CFLocale, CFDateFormatter, and CFNumberFormatter. So now you can have a totally Unicode application.

Both for formatting and parsing and also for sorting, we're now supporting many more locales for Unicode. And the reason we're able to do that is that we're taking advantage of an open source library called ICU, or International Components for Unicode, which is now part of Panther. Now, we're not yet allowing applications to access this library directly.

The reason is we want to make sure that we're capable of supporting binary compatibility from release to release, but that is something we're looking at, so that may become available to your applications in future releases. Another side benefit of using ICU is that our collation is three times faster than it was in Mac OS X Jaguar. So sorting applications will get much faster.

[Transcript missing]

Another thing that you have to watch out for is when you save attributed text. A lot of the information that implements the new features that we've been showing, things like variant glyphs or font features and font capabilities, those are saved as attributes on text. So if you save attributed text yourself and you enumerate what you think is the complete set of attributes, you might lose this information when you save it to a document. So it's important when you save attributed text to save all of the attributes so that information that the user enters, like a particular variant glyph that they use to write their name, doesn't get lost when they save and then reopen the document.
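The pitfall described above can be sketched with a toy document model. Everything in this Python example is hypothetical: the run layout, the attribute names (`font`, `size`, `glyph_variant`), and the use of JSON as a file format are stand-ins chosen for illustration, not the real attributed-string APIs. It only shows why writing out a fixed list of known attributes loses data that writing out every attribute keeps.

```python
import json

# Attributes this (hypothetical) application knew about when it was written.
KNOWN_ATTRIBUTES = {"font", "size"}


def save_known_only(runs):
    """Lossy save: writes only the attributes the app enumerates itself."""
    filtered = [
        {
            "text": run["text"],
            "attrs": {k: v for k, v in run["attrs"].items() if k in KNOWN_ATTRIBUTES},
        }
        for run in runs
    ]
    return json.dumps(filtered)


def save_all(runs):
    """Lossless save: writes every attribute, including ones the app never heard of."""
    return json.dumps(runs)


# One styled run; 'glyph_variant' stands in for information attached by the
# system (for example a variant glyph chosen for a name).
runs = [
    {
        "text": "\u8fba",
        "attrs": {"font": "Hiragino Mincho Pro", "size": 12, "glyph_variant": 3},
    }
]

lossy = json.loads(save_known_only(runs))
lossless = json.loads(save_all(runs))
print("lossy attrs:   ", lossy[0]["attrs"])
print("lossless attrs:", lossless[0]["attrs"])
```

After the lossy round trip the variant choice is gone, so the user's name would reopen with the default glyph. The lossless version does not need to understand an attribute in order to preserve it.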

Things are pretty easy for Carbon applications too, especially if they're using MLTE, the Multilingual Text Engine, or the new HITextView, which is based on MLTE. That makes things pretty easy because all of this stuff is supported by MLTE, and you don't have to do very much. Use CFString for your Unicode text storage, support the font panel, and allow access to advanced font features, and you're set. It's pretty easy.

A lot of you for historical or performance or what have you reasons have your own custom text engine that's necessary for your specific application. And in those cases, things are a little bit harder but it's still possible to support all these features and we'll go through how you can do that in your application.

You still need to store your text as Unicode because many of these new features are only available to Unicode applications. You can use either CFString, the Core Foundation class for Unicode text, or you can just store an array of 16-bit UniChars. Either way works. If you have Unicode text, you need to draw your text using a Unicode text drawing API, and for Carbon that means ATSUI. Fortunately, as I mentioned, in Panther ATSUI is over two times faster, so there's really no reason not to use ATSUI for Unicode text drawing in your application.

For input of Unicode text, you need to use the Text Services Manager. And if you were already supporting Japanese or Chinese input methods, you're probably already using TSM. One thing that's new for Panther is that new features, like the Document Content Access Protocol, are only available via Carbon Events, not via the Apple Events that we also supported in the past.

So if your application is using Apple Events to interact with TSM, you'll have to move to Carbon Events in order to take advantage of the latest features. And once you support TSM, there's basically three categories of interaction that you need to worry about. One is supporting the active area, which has always been true for input methods. Another is the new Document Content Access Protocol, and we'll talk in more detail about that in a moment.

And the final one is supporting input and storage of variant glyphs, and we'll also talk about that. And finally, as with the easier approach for Carbon applications, you want to support the Font Panel so that users have access to all the capabilities that fonts have to offer.

So before I go into a little bit more detail, I want to give a quick review of what it is about Unicode that makes it a little bit more challenging to implement an application. It's quite different from the WorldScript approach that you might be used to. The most important concept for Unicode is what's called the character glyph model. And it makes a distinction between characters and glyphs. You can think of characters as the form of language that's spoken. It's the semantic content. It's the way you would speak the language.

Glyphs, on the other hand, are the shapes that show up on the printed page or that you see on a display monitor. And you can think of them as the written form of the language. Now, usually there's a very direct correspondence between the spoken form and the written form, but that's not always the case. It's certainly not the case for complicated writing systems like Arabic or Indic languages, but there are even cases in English and Japanese where there is not a direct one-to-one relationship between characters and glyphs.

And it's the job of a Unicode text rendering engine like ATSUI or Cocoa Text to map between characters and glyphs. And here are a few examples that show why that's a challenging problem. The first line is Hindi, and in Hindi, between the characters and the glyphs, things move around, and in fact some of the things that are independent characters, when they're rendered as glyphs, wind up as decorations on other glyphs. So there's both rearrangement and formation of clusters and ligatures.

The second line is Arabic, and as we all know Arabic is a right to left language and so the characters and the glyphs are in opposite orders. But beyond that, Arabic is also a cursive writing system and so the glyphs flow together to form ligatures. And you can't really map directly between characters and glyphs. There's ordering and ligature issues that you have to deal with.

But even for Roman text, here's an example where we have the word "résumé," and the "e" with an acute accent is stored in character space as an "e" with a combining acute accent. And when that's drawn, that has to become an accented "e." So there's an example of where, in a Roman language, there isn't a straightforward mapping between characters and glyphs.
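To make the "résumé" case concrete, here is a small Python sketch that stores the word with combining acute accents and then groups each base letter with the marks that follow it. The grouping function `simple_clusters` is a deliberately simplified stand-in written for this example: it only looks at the canonical combining class, whereas real text boundary APIs handle many more cases (Indic clusters, Hangul jamo, and so on).

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified on purpose: only characters with a nonzero canonical
    combining class are attached to the preceding character.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch       # attach the mark to its base letter
        else:
            clusters.append(ch)      # start a new user-perceived character
    return clusters


# "résumé" stored as plain 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "re\u0301sume\u0301"

print("stored characters:       ", len(decomposed))
print("user-perceived characters:", len(simple_clusters(decomposed)))
print(simple_clusters(decomposed))
```

Code that moves a cursor, deletes, or truncates one stored character at a time would split an "e" from its accent, which is exactly the kind of bug the character/glyph distinction is meant to prevent.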

So what are some of the problems that you can run into in an application if you don't keep the character glyph model in mind? Well, one thing that's particular to Unicode, we all think of Unicode as a 16-bit character set. Whoops. Okay. I didn't press the bad button, so.

There we go. OK. We think of Unicode as a 16-bit character set. But I mentioned earlier that there's over 96,000 characters in the latest version. And a little arithmetic shows that you can't fit that in 16 bits. So what we think of as the 16-bit version of Unicode is called Plane Zero, or the basic multilingual plane. And that's where all the commonly used characters go.

But Unicode also supports a lot of rare and less commonly used characters. And those are allocated in planes 1 through 16. And in order to represent those characters in your text, you need to use two 16-bit values. That's called a surrogate pair. And there's an example. That's from our Hiragino font. It looks like any other ideographic character. But it's stored as two 16-bit values, because it comes from plane two of Unicode.
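The arithmetic here is short: 16 bits give 2^16 = 65,536 values, which is fewer than the 96,000-plus characters mentioned earlier. The Python sketch below shows how UTF-16 splits a character outside plane 0 into a surrogate pair. The helper name `to_surrogate_pair` is made up for this example, and U+20000 is used simply because it is the first code point in plane 2, not because it is the character shown on the slide.

```python
def to_surrogate_pair(code_point):
    """Split a code point from planes 1-16 into two 16-bit UTF-16 values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside plane 0 need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low


print("values in 16 bits:", 2 ** 16)

code_point = 0x20000                       # first code point of plane 2
plane = code_point >> 16
high, low = to_surrogate_pair(code_point)
print(f"U+{code_point:X} is in plane {plane}: {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(code_point).encode("utf-16-be")
print("encoder output:", encoded.hex())
```

An application that stores text as an array of 16-bit values has to treat such a pair as one unit when it counts, indexes, or splits characters.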

So that's one issue you have to worry about. As we saw in the previous slide, you can have composing sequences where multiple characters in the Unicode sense form a single--what the user thinks of as a single character. So the base character E with a combining acute accent is one example. There's lots of other combining marks like that. There's clusters in Indic. There are ligatures in Arabic and in English.

For Korean, there are jamos that come together to form hangul, and so forth and so on. So there's really not a direct one-to-one relationship between characters and glyphs. In addition, Unicode also has multiple ways of doing the same thing. So on the last slide, we saw the E with combining acute accent, but Unicode also has a single character that's an E with an acute accent.

And that's mostly for historical reasons and for compatibility with earlier character set standards. And there are a lot of cases like that. So there are often multiple--you can think of them as spellings for the same string of text. It can be represented in Unicode in multiple ways. So here's one example. On the left, I have Korean hangul, and on the right, I have the three jamos that make up that hangul, and they're both equally valid ways of representing the text.
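The two "spellings" can be seen directly with Python's standard `unicodedata` module. This is only a sketch: the syllable U+D55C is an arbitrary hangul chosen for the example (not necessarily the one on the slide), and `code_points` is a helper invented here for display.

```python
import unicodedata


def code_points(text):
    """List the code points in text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# "e with acute" as one character, and its decomposed spelling.
precomposed_e = "\u00e9"
decomposed_e = unicodedata.normalize("NFD", precomposed_e)

# One precomposed hangul syllable, and the jamos it decomposes into.
hangul = "\ud55c"
jamos = unicodedata.normalize("NFD", hangul)

print(code_points(precomposed_e), code_points(decomposed_e))
print(code_points(hangul), code_points(jamos))

# Composing the decomposed forms (NFC) gives back the single characters.
recomposed = unicodedata.normalize("NFC", jamos)
```

Both spellings draw the same way, but they are different sequences in memory, which is why a plain byte-for-byte comparison treats them as different strings.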

So of course, that makes things like comparison and searching a little bit more challenging. And finally, for more complicated writing systems, you have issues of directionality. Languages like Arabic and Hebrew go right to left. You can have them in the same paragraph with text that goes left to right. Languages--the whole Indic family of languages has rearrangement where characters move around when you write them compared to when you speak them.

And so the glyphs and the characters are really--you can't count on them being in the same order at all. And that doesn't just affect the order of glyphs within a style run, it also affects the order of style runs within a paragraph. So if you have a paragraph of mixed English and Arabic or English and Hebrew text, whole style runs can move around, and you really need the system's help to figure out where everything belongs.
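To get a feel for why whole runs can move, here is a deliberately simplified Python sketch that groups a paragraph, stored in logical order, into left-to-right and right-to-left runs using the bidirectional class of each character. It is not the Unicode bidirectional algorithm and not what the system layout engines do; `strong_direction` and `directional_runs` are names invented for this example, and neutral characters are simply attached to the preceding run.

```python
import unicodedata


def strong_direction(ch):
    """Return 'LTR' or 'RTL' for strongly directional characters, else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):      # Hebrew-style and Arabic-style letters
        return "RTL"
    return None                        # spaces, digits, punctuation, ...


def directional_runs(text):
    """Group logical-order text into (direction, substring) runs (simplified)."""
    runs = []
    for ch in text:
        direction = strong_direction(ch)
        if direction is None:          # neutral: stay in the current run
            direction = runs[-1][0] if runs else "LTR"
        if runs and runs[-1][0] == direction:
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
    return [(direction, chunk) for direction, chunk in runs]


paragraph = "abc \u05d0\u05d1\u05d2 def"   # English, three Hebrew letters, English
for direction, chunk in directional_runs(paragraph):
    print(direction, [f"U+{ord(c):04X}" for c in chunk])
```

A real implementation also has to resolve numbers, punctuation, and nesting levels, which is the part best left to the system.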

So fortunately, so that you can avoid these problems in your application, we have lots of APIs in the system that you can use to make sure you do the right thing. In terms of figuring out where characters begin and end, there are lots of system APIs for finding text boundaries.

Not just characters, but also clusters, words, lines, and paragraphs. There are APIs-- and I'm not going to go into great detail on this-- all of the documentation for this is available online. But there's APIs in Cocoa for finding character and cluster boundaries, in Carbon for finding boundaries of all sorts.

And if the reason you're looking for a character boundary is in order to truncate text, you don't even have to do that yourself. You can actually ask ATSUI to truncate your text for you. You just pass it an option, tell it how wide you want the text to be, and it will find a linguistically correct place to truncate the text and add a truncation ("...") character.
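The following Python sketch shows the idea behind cluster boundaries and safe truncation, under heavy simplifying assumptions: a "cluster" here is just a base character plus any combining marks that follow it, and the limit is a cluster count rather than a drawn width as in ATSUI. The function names are invented for illustration; real boundary analysis (hangul jamos, Indic clusters, and so on) needs the system APIs.

```python
import unicodedata


def simple_clusters(text):
    """Split text into base characters with their trailing combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch         # attach the mark to the previous base
        else:
            clusters.append(ch)        # start a new cluster
    return clusters


def truncate_clusters(text, max_clusters, ellipsis="\u2026"):
    """Truncate to at most max_clusters clusters, never splitting a cluster."""
    if max_clusters < 1:
        raise ValueError("max_clusters must be at least 1")
    clusters = simple_clusters(text)
    if len(clusters) <= max_clusters:
        return text
    # Reserve one slot for the ellipsis character.
    return "".join(clusters[:max_clusters - 1]) + ellipsis


word = "re\u0301sume\u0301"            # "résumé" spelled with combining accents
print(len(word), len(simple_clusters(word)))
print([f"U+{ord(c):04X}" for c in truncate_clusters(word, 3)])
```

Cutting this string by raw character count could leave a dangling accent or strip one from its base letter; cutting at cluster boundaries avoids that.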

Because of the problems with multiple spellings that I talked about before, there are system APIs that can help you with that, that will do comparison or searching of text. As I mentioned, due to directional issues, text can move around within a paragraph. And so when you're drawing, you need to deal with an entire paragraph at a time. And there are APIs in Cocoa and Carbon that will help you do that.
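As a minimal sketch of comparison and searching that tolerates multiple spellings, the Python below normalizes both strings to the same form before comparing. This only illustrates the principle; it is not how the system comparison APIs are implemented, it ignores language-specific collation rules, and the function names are assumptions made for this example.

```python
import unicodedata


def canonical_key(text):
    """Bring text to one canonical spelling (fully decomposed, NFD)."""
    return unicodedata.normalize("NFD", text)


def canonically_equal(a, b):
    """True if a and b are the same text, whichever spelling each one uses."""
    return canonical_key(a) == canonical_key(b)


def canonical_contains(haystack, needle):
    """Substring search over the normalized spellings.

    Caveat: a plain substring test can match a base letter inside an
    accented one (searching for "e" finds the "e" of a decomposed accented e).
    """
    return canonical_key(needle) in canonical_key(haystack)


composed = "r\u00e9sum\u00e9"          # precomposed accented letters
decomposed = "re\u0301sume\u0301"      # base letters plus combining accents
print(composed == decomposed)                    # the raw strings differ
print(canonically_equal(composed, decomposed))   # the text is the same
```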

For Cocoa, you can use the text system directly or use attributed strings and typesetters. For Carbon, of course, there's ATSUI. And as long as you let the system know about an entire paragraph, it will figure out where everything belongs, and then you can figure out where the line breaks are and draw the lines individually. Of course, because there isn't one-to-one mapping between characters and glyphs, that's also an issue for moving the cursor with the arrow keys, or clicking with the mouse, or highlighting text, and there are APIs that can help you do that.

One issue that every Unicode application has to deal with, unless it's brand new, is how to handle legacy data that's not in Unicode. Now, we've had APIs in the system for a long, long time to convert between Unicode and other character sets, so I'm not going to go into that. One issue, though, is to figure out what character set should I use? What character set should I assume the text is in?

Well, if the character set is marked in the document somehow, then you're set. You know what the character set is. But very often, you're dealing with plain text or other text that doesn't have any information on what the old character set was that it's encoded in. So then you have to guess, and there are a couple of APIs that can help you do that. If you think it's going to match the language that your application is running in, then you can call GetApplicationTextEncoding. That will return an encoding that usually matches the language that's been selected--the localization that's been selected for your application. Okay.

It might be more appropriate to pick an encoding that's associated with the user's most preferred language, because maybe your application doesn't support that language, but the user's data is quite likely to be in an encoding that's associated with it. And CFStringGetSystemEncoding will return an encoding that usually matches that language. Now, why do I say usually?

Well, the reason is that there are languages that Mac OS X knows about that were never supported in WorldScript, that were never supported on OS 9, and they don't have legacy encodings associated with them. Some of them, like Vietnamese, do have a WorldScript encoding, but that doesn't mean that you can draw the data with QuickDraw text.

It's just something that you can convert using an encoding conversion API. Other languages, like Hawaiian, have no non-Unicode encoding associated with them at all. So if your application is running in Hawaiian, or the user's most preferred language is Hawaiian, you're not going to get a sensible answer from these two APIs.
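The decision order described here (use the marked character set if there is one, otherwise fall back to a guess) can be sketched in Python as follows. Everything in it is an assumption for illustration: `decode_legacy` is an invented helper, and the fallback tuple stands in for whatever encoding the application or system APIs would suggest.

```python
def decode_legacy(data, declared=None, fallbacks=("shift_jis", "mac_roman")):
    """Decode legacy bytes to Unicode text.

    declared  -- character set marked in the document, if any
    fallbacks -- guesses to try in order (assumed values for this sketch)
    Returns (text, encoding_used).
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue                   # wrong guess or unknown name: try the next
    # Last resort: keep going, marking undecodable bytes as replacements.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


japanese = "\u65e5\u672c\u8a9e".encode("shift_jis")
text, used = decode_legacy(japanese, declared="shift_jis")
print(used, [f"U+{ord(c):04X}" for c in text])
```

A guess that happens to decode without error can still be the wrong one, so a marked character set should always win over a fallback.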

If you're writing an internet application, then you shouldn't be using Mac OS encodings at all. You should be using the standard encodings that are defined by internet standards bodies. And you can go to IETF and IANA websites to find out about those. And there are APIs that will help you convert those names into a text encoding that you can use internally.
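For comparison, this is roughly what mapping an internet charset name to an internal encoding looks like in Python, using the standard `codecs` registry. It is an analogy rather than the Mac OS X API: `encoding_for_charset` is a name invented here, and Python's alias table is not guaranteed to cover every IANA-registered name.

```python
import codecs


def encoding_for_charset(label):
    """Map a charset label (as seen in mail or HTTP headers) to a codec name.

    Returns None when the label is not recognized.
    """
    try:
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None


for label in ("UTF-8", "Shift_JIS", "SHIFT_JIS", "x-no-such-charset"):
    print(label, "->", encoding_for_charset(label))
```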

I'll talk a little bit more about the new APIs for formatting and parsing dates, times, and numbers. It's in Core Foundation. You can either get the current locale, or you can get a locale from a standard ISO locale string, which has a language code followed by a country code.

You can also take information from the WorldScript world, like language, region, or script, and convert that to an ISO string, which you can then use to get a locale. The new classes in Core Foundation, as I mentioned, support both formatting and parsing. There's support for currencies, and you can go back and forth between internal representations, including Core Foundation types, but also standard C types, and a formatted CFString. And there's also lots of customization options you can take advantage of. And for more information, you can look at the seed release that you all received.

And the last topic I'd like to cover is to -- the last couple of topics I'd like to cover are TSM and variant glyph access. So for those of you who supported the Text Services Manager in the past, the thing that's different for Unicode support is that you need to create a TSM document of type UDOC. To take advantage of the latest features, you need to move to Carbon events instead of Apple events, but as I'm sure you've been hearing elsewhere at the conference, there's lots of good reasons to move your application to Carbon events.

Supporting the input method active area is something that's been around for a long time, but if your application is not a Unicode application yet, you'll also need to move to supporting Unicode input, and again, new in Panther is the protocol for accessing the entire contents of your document, and that's critical to provide some of the user interface and conversion accuracy features you saw demonstrated earlier. Kotoeri can't analyze the content of your document to give great conversion results unless it can find out what the content of your document is.

So I'm going to go through some of this rather quickly, because we don't really have time to dive into it in detail. For Unicode input, there's a single Carbon event that has Unicode text. It can also have glyph variant information, and we'll talk about that in a little bit. The input method active area support protocol is pretty much the same as it's always been. There's just a few Carbon events you have to handle, and there's nothing new here.

The big new thing is the document access protocol, and I don't have time to go into this in great detail. You might think from this long list of Carbon events that it's pretty complicated, but it's not. The model is really simple. The way this works is that it makes your document look like a CFString to the input method, so the Carbon events that you respond to are just the same things that CFString supports. It's really a very straightforward model. So if you implement support for these Carbon events, input methods can access the contents of your document, and you get the improved conversion accuracy and new UI features like easy reconversion.
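To make the "your document looks like a string" model concrete, here is a hypothetical Python sketch of the kind of adapter an application ends up writing: it answers "how long are you?", "give me this range of characters", and "what is selected?" in terms of 16-bit units. The class and method names are invented for illustration and are not the Carbon event names or the CFString API.

```python
class DocumentTextAccess:
    """Hypothetical read-only view of a document as a run of 16-bit units."""

    def __init__(self, text, selection=(0, 0)):
        data = text.encode("utf-16-le")
        self._units = [int.from_bytes(data[i:i + 2], "little")
                       for i in range(0, len(data), 2)]
        self._selection = selection    # (location, length) in 16-bit units

    def length(self):
        """Answer a 'get length' request."""
        return len(self._units)

    def characters(self, location, length):
        """Answer a 'get characters in range' request."""
        if location < 0 or length < 0 or location + length > len(self._units):
            raise IndexError("range is outside the document")
        return self._units[location:location + length]

    def selected_range(self):
        """Answer a 'get selected range' request."""
        return self._selection


doc = DocumentTextAccess("a\U00020000", selection=(1, 2))
print(doc.length(), [hex(u) for u in doc.characters(*doc.selected_range())])
```

Note that the ranges count 16-bit units, so the plane 2 character in this example occupies a range of length two.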

I'll talk a little bit now about variant glyph access. This is optional information that comes with a Unicode text input event. You get an array of glyph information records, and each record has this information in it. First of all, there's a range of text, and that can be more than one 16-bit UniChar. And the reason for that is, it could be a variant version of a surrogate pair.

So it could be more than one UniChar for that reason. Or it could be a variant version of something like a ligature. For example, the Zapfino font that comes with Mac OS X has different versions of the fi ligature. And to allow the user to pick which one they want, they can do that via the variant glyph access protocol. In that case, the range of text would be the F and the I. So it can be more than one character. You also have to specify the font that the variant is coming out of.

There's two ways to identify which particular glyph you want. One is via a font-specific glyph ID, and that's used, for example, with TrueType fonts. But it can also be a glyph ID from a published glyph collection, like Adobe Japan 1.5. And the record will identify which of those two approaches is being used.
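The record described above can be pictured as a small data structure. The Python sketch below is purely illustrative: the field names, the `GlyphVariant` class, and the glyph ID value are all made up for this example and are not the layout of the actual record delivered in the Carbon event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GlyphVariant:
    """Hypothetical stand-in for one glyph information record."""
    location: int                      # start of the text range, in 16-bit units
    length: int                        # may be more than 1 (surrogate pair, ligature)
    font_name: str                     # the font the variant comes from
    glyph_id: int                      # which glyph to draw
    collection: Optional[str] = None   # None means glyph_id is font-specific

    def is_font_specific(self):
        return self.collection is None


def covered_text(text, variant):
    """Return the characters a variant record applies to."""
    return text[variant.location:variant.location + variant.length]


# A variant "fi" ligature covers both the f and the i (glyph ID is made up).
ligature = GlyphVariant(location=0, length=2, font_name="Zapfino", glyph_id=123)
print(covered_text("fine", ligature), ligature.is_font_specific())
```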

You don't have to worry about that too much, because ATSUI provides a style tag, and all you have to do is take the information out of the Carbon event, stuff it in this style tag, and give it to ATSUI. And this is all covered in Tech Note 2079.

I won't talk about this very much at all. This will be covered in a session that's coming up right after this one across the hall--what you need to know about fonts in Mac OS X. This is how to support advanced font attributes via the new font panel in Panther. There's already a Carbon event for font selection via the font panel, and we've just added more information to it.

There's now a complete dictionary with all the information that's specified in the font panel, and all you need to do is extract the data from that and just pass it to ATSUI. You don't have to worry about what it means, you just basically have to funnel it through your application. And for more details on that, you can go to the font session, which is session 406, and coming up right after this one across the hall.

Okay, I'd like to bring Kida-san back up on stage one more time to talk about our Chinese input methods and the character palette. Kida-san? Here. Let me get rid of those windows. And let me add one more thing here. Hello again. I'll show you a few more features before wrap-up. The first one is simplified Chinese. No, traditional Chinese.

In Mac OS X 10.2.4 we added the Hanin traditional Chinese input method, which is a very popular input method, and we were providing this input method only for the localized Mac OS 9 system, and now we are offering it to everybody on Mac OS X. This is a word-based Pinyin input method, and it's much, much easier to use. I'll show you how even I can use it. I need a cheat sheet. One, e, e, da.

Ja, lai, ta. No.

[Transcript missing]

The next one for traditional Chinese is we added support for HKSCS and Big5-E. Those are additional character sets on top of what we have today.

[Transcript missing]

Those characters marked in red were missing in the previous standard, which is Big5. And you might be surprised how many characters are missing. And actually, this thing, this is the name of the new airport in Hong Kong, Hong Kong International Airport.

And these two at the bottom are names of subway stations in Hong Kong. You couldn't even write your airport name or subway station name without this extension. And if your application doesn't support

[Transcript missing]

Our simplified Chinese input method. We extended the input method--ABC input method--so that it covers all the characters in GB 18030.

Um, let me pick simplified, and let me pick the mode. And by the way, you don't need to do this if the simplified Chinese input method are revised and support a flattened mode. Yeah, this is one of the characters which is only in GB 18030. And the other example is Mu-Ti. Mu-Ti? Mu-Ti? Maybe? Oh, yes. This is too.

Thank you. The last one is our character palette. We introduced the character palette first in Jaguar and we found out many customers loved it and we also got many feedbacks. One of the feedback we got is some of you want to enter character, exactly what you see on the screen, I mean character palette.

Because character palette honors the font setting in application, usually you get different font between character palette and application. But we got a feedback that you want exactly the same character between the application and character palette. So here's a character palette, which looks like one we have in Jaguar. And you have this little disclosure triangle here. It says font variation.

If you open it, here's a list that lists the selected character in all fonts in the system. So you can browse this character A using all fonts in the system and pick one you like. And if the character you've selected in this list happen to have variant glyphs, it lists those variants in this variant field.

So, for example, this is Apple Chancery. And you want a different A for this, like, say, here. A long head. And here you have insert with font at the bottom right. And if you press this button--oh. You insert the character into the document. Let me try a different one.

Say this one. And you insert a different glyph. Let me try to find one. You can drag the character to this area to go to that character. Say I want Hiragino Mincho Pro W6. And here you have different . And say your name is, say, the one which have-- oh, I already did that. Say two dots up here. You can insert, and you now have two dots.

Also, you can track the character, rare character, and find which one has this character, like this. So now I bring Deborah back to the stage for the wrap up. Thank you. Thank you, Kida-san. I'd like to emphasize again how important it is to support Unicode and the document content access protocol on your applications so that your users have access to all these great features.

So I didn't have a prop budget for this talk, so I don't have a coffin to roll out on stage, but WorldScript is dead. QuickDraw Text is dead. They can't begin to cover some of the requirements that we're seeing in the Japanese and Chinese markets today. We're spending all of our efforts, all of our focus is on Unicode. We're not spending any time on WorldScript. We're not spending any time on making enhancements to QuickDraw Text.

So Unicode is it. Unicode will give your application great competitive advantages in the Japanese and Chinese markets, so you really should focus on adding that. And if you do that as a side benefit, you get the rest of the world besides, which is not a small thing. Thank you, everyone.

So I'd like to wrap up now. Here's a couple of other sessions, or more than a couple, that you might be interested in. Immediately after this session is what you need to know about fonts in Mac OS X. The whole name didn't fit on the slide. You can find out about the font panel, and the typography panel, and lots of other useful information about using fonts on Mac OS X. And that's in the Mission Room, starting at 3:30.

On Friday at 5 o'clock, in the Presidio Room, there's the Cocoa Text session. You can find out about new features for Panther, and all the other great things that are going on in the world of Cocoa Text. Unfortunately, at the exact same time, also at Friday at 5:00 PM, in Nob Hill, is a session on our new Ink APIs, which, if you're interested in enhancing your application support for handwriting, you can find out about how to use these new APIs to do that.

And finally, if you want to let us know what's bugging you or what you think is going great, our International Technologies Feedback Forum is Friday at 10:30 in the North Beach Room. And we'd love to have you come and give us feedback on what we could do better and what we're doing right.

So if you have further questions, the first person you should be talking to is Xavier, and his email address is easy to remember. It's [email protected]. If you have any questions when you're done talking to Xavier, you can also contact me. And my email address is goldsmith, without the H, at apple.com. You don't need to scribble a lot of stuff down because the URL that you see at the bottom of the screen-- developer.apple.com, WWDC2003urls.html-- will have all the contact information and all the URLs from all the talks at WWDC.

Here are some places you can go for more information. There's our documentation library, of course. We also have a nice summary page for international technologies. That's developer.apple.com/intl. If you want to develop fonts, we have a font developer web page. That's developer.apple.com/fonts. There's, of course, references for the AppKit and for ATSUI. Unicode Utilities, which is used for finding text boundaries and comparison and searching. There's a specialized set of topics on Cocoa text handling. There's documentation on CFString.

Here's a handful of useful Tech Notes and sample code. Tech Note 2056 is on how to do your own keyboard layouts. 2079, the Variant Glyph Access Protocol, it has much more detail than I was able to go into. There's a sample app for ATSUI and how to draw Unicode text, and how to do your own input method.

And some pointers outside of Apple: the Unicode Consortium has a website for more information about Unicode. That's the best place to start. There's a new version of the Unicode book coming out for the new 4.0 version of the standard, and much more readable than the standard itself is Unicode Demystified by Richard Gillam. I highly recommend that as an introduction to Unicode if you want to learn more about it. And finally, the open source International Components for Unicode (ICU) library has its own website that's hosted by IBM.