Moving Applications to Unicode - WWDC 2004

Application • 58:13

The Unicode-based APIs in Mac OS X are critical to producing a best-of-breed application and meeting customer expectations around the world. Find out how to move your Script Manager-based application to Unicode or create a brand-new Unicode application from the ground up. Learn about the importance of Unicode for the Japanese and Chinese markets, and discover what's new in Mac OS X international support.

Speakers: Deborah Goldsmith, John Jenkins

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Today we're going to talk about how to move your application to Unicode.

[Transcript missing]

So today we're going to cover in detail what's required to move your WorldScript-based application to Unicode. And we're going to do a lightning tour of all the aspects of an application: how to store text; your human interface and your localization; drawing, editing, and input; sorting; doing transformations on text and analyzing it; and also formatting and scanning of dates, times, and numbers, and calendar manipulations. So we're not going to go into incredible detail on any of those areas, but we'll give you a tour so you know which APIs to use to convert your application to Unicode.

And all of the APIs we're going to talk about today are in the Core Foundation and Carbon areas of Mac OS X. And we're going to show how to map from the older technologies, the Script Manager, text utilities, and so on, to these newer APIs. But before we do that, I'd like to just spend a couple of slides talking about what's new in Tiger in the international area. And we do have some new features. And the first one is something that people have been asking for for a long time. In Tiger, we have the first stage of our support for OpenType font layout tables.

So this allows OpenType fonts to work in Unicode applications. If you're using our standard Unicode APIs, you don't have to do anything special in your application. OpenType fonts will just work. So we're supporting features like ligatures and language shaping in certain cases. And you'll see the support for OpenType layout increase more and more as time goes on.

Something that was missing from our Unicode API suite was string transliteration. This is something you could do in the Script Manager. There was an API in the text utilities to do it, but there was no Unicode equivalent, and we've now got that in Tiger. We have even more locale data available, and as with Panther, much of that locale data is only available through Unicode APIs.

So it's very important to move your application to Unicode so you can take advantage of all the languages and all the locales that Mac OS X supports. Just as a side note, this is a Carbon session, but there has been a Carbon date control for a long time, and new in Tiger is an equivalent Cocoa date control, so that's available.

We had some support for non-Gregorian calendars in Panther, and we're improving that support in Tiger. So in addition to the Japanese and Thai Buddhist calendars, which we had in Panther, we're adding Islamic and Hebrew calendar support. And in addition, in Panther, you could only use one of the non-Gregorian calendars if you were using the date and time formats that went with it.

So for example, you could only use the Japanese calendar with the Japanese locale or the Thai Buddhist calendar with the Thai locale. In Tiger, you can select the calendar separately from the date/time formats. So here you see examples of the Islamic calendar and the Japanese calendar being used with the U.S. English locale.

We've also added more control over number and date formatting. So we introduced CFDateFormatter and CFNumberFormatter in Panther, but now you have more control over how they operate. There are more options. There's also a new option for number spell-out. So in addition to all the formatting options we had before, you can spell out numbers. There's an example there, 123.45. And this is not just for English. It works with any of the locales that Mac OS X supports. So that's a new feature.

Every release, we try to extend our Unicode coverage a little bit. And so this time, we're moving more Roman and Greek and Cyrillic support into our core fonts. So we used to have separate fonts, for example, for Cyrillic support. And we're extending our core fonts, and that's Times, Helvetica, and Courier, to support a wider variety of Roman and Greek characters and also adding Cyrillic.

In addition, we're covering some new Unicode blocks, and that's Tamil, Braille, Yijing hexagram symbols, and Tai Xuan Jing symbols. And not all of this is in the preview release that you have in your hands, but it will be showing up. And there are possibly more blocks we might be covering that aren't listed here that are still a little bit up in the air, but every release, we try to add a little bit more. So, oh, and one last thing. Okay.

So some of the language IDs that we've used in Mac OS X up till now have not exactly followed the standards in the area. So, for example, we use zh_TW for traditional Chinese. And in Tiger, we're adding support to move to canonical language IDs. There are some examples. And there are new APIs that have been added to CFLocale that can help you canonicalize language IDs, so that if you have a localization for your application that uses an old ID and you need to compare it against a new ID, these APIs can help you make sure you do the comparison in a canonical way.

So in order to show some of the new features in Tiger, I'd like to ask John Jenkins to come up on stage, and he'll give us a short demonstration of what's new in Tiger.

All right. As Deborah says, there's an awful lot that's new in Tiger, and we don't really have time to go over anywhere near all of it, so I'm just going to hit some of the highlights, some of the really exciting things. And we'll start out with an exciting thing, which is the transliteration APIs.

So here we have a sentence or a word, I guess, which is in Latin, and I want to know what this would look like, for example, in Greek, and I can change it. Or I want to know what it would look like in Katakana, and I can change it. Although I'm not quite sure that the accent works, or the upside-down exclamation point really works in Japanese, but that's okay.

Also useful is turning it to XML hex. This is very handy. This goes through and takes all of the non-ASCII letters and converts them into the numeric entities that you would use on a web page or in XML. Or if you really want to know what's going on, you can always get the Unicode name of the character that's not in ASCII, which is also handy. And of course, we can go the other way. We have a lot of things that will let us transform to Latin.
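The XML hex and Unicode-name transforms from the demo are simple enough to approximate by hand. The following Python sketch is not the Tiger transliteration API: `to_xml_hex` and `to_unicode_names` are hypothetical helper names, and the exact output format of the system transforms is an assumption here.

```python
import unicodedata


def to_xml_hex(text):
    """Replace every non-ASCII character with a hexadecimal numeric entity."""
    return "".join(
        ch if ord(ch) < 128 else "&#x%X;" % ord(ch)
        for ch in text
    )


def to_unicode_names(text):
    """Replace every non-ASCII character with its Unicode character name."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # Fall back to the code point if the character has no name.
            name = unicodedata.name(ch, "U+%04X" % ord(ch))
            out.append("\\N{" + name + "}")
    return "".join(out)


sample = "\u00a1Ol\u00e9!"  # inverted exclamation mark, "Ol", e with acute, "!"
print(to_xml_hex(sample))
print(to_unicode_names(sample))
```

ASCII characters pass through untouched, so the result is safe to paste into a page that is served in any ASCII-compatible encoding.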

If I get a sentence here, for example, this is the beginning, the first line of the Iliad for all the people who are fans of the movie Troy, I can turn it into Latin. I could also strip it of its combining marks if I wanted to, remove all of those. I can take other examples and turn them into Latin. So I have an Arabic sentence, I hope, and I can turn that into Latin, or I can take something that's in Japanese.

And again, I can turn it into Latin. So I can get a first approximation transliteration. One thing that's useful for Chinese, here I have an instance of a sentence which is partly in Cyrillic, partly in Chinese. I can turn the whole thing into Latin if I want, or I can just take the Chinese part and turn it into Latin. So a lot of useful transliteration APIs are available now. Now, Deborah also mentioned that we have a lot more locales.

Let's just bring up a list of our locales. These are the locales that are available on the system in Tiger, together with some of the information you can get about them. This is pretty useful information, too. The one thing which I don't think is wildly useful is metric, that is, whether or not this locale uses the metric system, because it amounts to whether or not you are in the United States, but that's okay.

Currently, it defaults to showing them in the global locale. But we can switch back and forth so I can see what each of these looks like. And you'll notice that some of these are showing up in the last resort font. These are locales that we don't have system support for, but that's okay.

On Mac OS X, it's very easy for third parties to add support. So if I want to take one of these and I want to see a date, say, in the Islamic calendar, let's not use an unsupported locale. Okay, German. So I want to see what today's date is. This is today. Today's date in the Islamic calendar as shown in Latin letters in the German locale. So we have a great deal of flexibility for date and time formatting that we didn't used to have. All right.

So that covers two. The third is the really exciting thing, I think. This is something, as Deborah says, that people have been desperately asking us for for a long time, and that's OpenType support. So it's not fully wired up yet, but it is enough there that we can show it to you. Okay.

So I'm going to take -- this is WorldText, which is a standard application that comes with the developer tools. And I'm going to switch to Adobe Caslon Pro. This is unaltered, nothing up my sleeves, straight out of the box Adobe Caslon. And I can start typing. Let's see. Let's come up with something here. Okay. And you'll notice that as I typed, the ligature formed automatically. FL formed automatically. FI formed automatically.

[Transcript missing]

I always get the code name. Intend 3, there we go. You can see more of what's available in the font. For example, I can turn on rare ligatures. And as the system does now with AAT fonts, it does with OpenType fonts in Tiger. It scans through. It sees what is in the font.

It gives us support for all of these things. I can turn on lining figures. I can turn on superiors. Let me see if that works. Yeah, well, that's kind of cool. And so on. So there's a lot of flexibility here that's available. The font has it built in. The system just picks it up. So look forward to it. Thank you. Thank you, John.

Okay, so now that we've seen what's new in Tiger, let's start moving into the detailed part of the presentation. We're going to go on our whirlwind tour of the WorldScript APIs and what to replace them with. So before we start talking about things that you can do with text, we have to talk about how you store your text in the first place. So that's our first topic.

Before we do that, let's have a quick refresher on what's different about Unicode compared to WorldScript. The "Uni" in Unicode means one, and the most important thing about Unicode is there's only one character set you have to worry about, unlike WorldScript, where there were many. Unicode stores characters in 16-bit units in the UTF-16 form, which is what we use in Mac OS X in Cocoa and Carbon, but since Unicode has more than 96,000 characters, how do you fit that in a 16-bit unit? Well, the answer is you can't, and some characters need more than one unit to be stored, and there's an example right there.

The Unicode character U+2000B, which is from the Plane 2 Han characters, is actually stored as two 16-bit units called a surrogate pair. Now, when I talk about a character here, a Unicode character, that's the programmer's concept of a character. What the user thinks of as a character can actually be larger than that. What the user thinks of as a character we call a grapheme or a cluster, and it can consist of one or more Unicode characters.

So here's a couple of examples. We have the word résumé, but the accented e's are represented by the base letter e plus what's called a combining acute accent. So you have two Unicode characters, the e and the accent, that represent one user character. In the next example, there's even more. This is the Vietnamese word for Vietnamese, and you can see that we have an e with two combining accents, a dot below and a circumflex above. So this is three Unicode characters that represent what the user thinks of as a single character.
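The three levels described above (16-bit units, Unicode characters, and user characters) can be checked with a short Python sketch. This is only an illustration of the counting, not Core Foundation code; `utf16_units` is a hypothetical helper, and note that Python strings index by Unicode character rather than by UTF-16 unit.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


han = "\U0002000B"             # one Unicode character outside the 16-bit range
resume = "re\u0301sume\u0301"  # "résumé" written with combining acute accents
viet_e = "e\u0323\u0302"       # e + combining dot below + combining circumflex

# One Unicode character, but two 16-bit units (a surrogate pair).
print(len(han), [hex(u) for u in utf16_units(han)])  # 1 ['0xd840', '0xdc0b']

# Six user characters, but eight Unicode characters.
print(len(resume))  # 8

# One user character, but three Unicode characters.
print(len(viet_e))  # 3
```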

To make life a little more interesting, there's actually multiple ways to do this in Unicode. In addition to the base letter and combining marks that we used in this example, there are also pre-composed versions of these characters. So there is an E with an acute accent that's a single Unicode character, and there is an E with a dot below, and a circumflex above that's a single Unicode character.

But you can't always represent every character in a totally pre-composed form, and conversely, you can't always represent a given character in a totally decomposed form. So even though there are versions of Unicode that we call pre-composed and decomposed, they really mean as pre-composed as possible and as decomposed as possible. So even in pre-composed Unicode, you do have to worry about things like combining marks, because they can be present.
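To make "as pre-composed as possible" and "as decomposed as possible" concrete, here is a small sketch using Python's standard `unicodedata` module rather than the Mac OS X APIs. NFC and NFD are the Unicode names for the two forms; the sample strings are my own choices.

```python
import unicodedata

# Base letter plus combining acute accent composes to a single character.
composed = unicodedata.normalize("NFC", "e\u0301")
print(len(composed), hex(ord(composed)))  # 1 0xe9

# The pre-composed Vietnamese letter decomposes to three characters:
# e, combining dot below, combining circumflex.
decomposed = unicodedata.normalize("NFD", "\u1ec7")
print([hex(ord(ch)) for ch in decomposed])  # ['0x65', '0x323', '0x302']

# There is no pre-composed "q with acute", so even the composed form
# still contains a combining mark.
q_acute = unicodedata.normalize("NFC", "q\u0301")
print(len(q_acute))  # 2

# Conversely, o with stroke has no decomposition, so the decomposed
# form leaves it as one character.
o_stroke = unicodedata.normalize("NFD", "\u00f8")
print(len(o_stroke))  # 1
```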

Okay, so in the WorldScript world, the way you stored your text was in a Pascal string or in a C string. Those don't support Unicode, or at least not in the form that we needed for Carbon and Cocoa, so what do you do in the new world?

Well, if you're using Core Foundation, you can use CFString or CFMutableString, or new in Tiger is CFAttributedString. If you need to work at a lower level, you can just store Unicode text as arrays of UniChar, which is a type defined in Carbon, or actually at a very low level. So there are a lot of APIs for CFString and friends, and I don't have time to go through them all, but just to give you a flavor of how the API works, here's a few examples.

You can create a CFString using an array of UniChars. In this example, we pass NULL, which indicates the default storage allocator for Core Foundation. We pass an array of UniChars, and we pass their number, and that will give us back a CFString object. You can also get characters back from a CFString.

In order to get the best performance, you can use an inline buffer. And the way you do that is you set up an inline buffer on a CFString, and then you can ask to get a character at any index, and the inline buffer will take care of batching access to the string so you get the most efficient access possible. New in Tiger is attributed string support, and an attributed string wraps an existing CFString as opposed to you putting the characters into the attributed string directly.

So you can create an attributed string by passing a CFString and a dictionary of attributes. They can be totally arbitrary attributes; it's not just a fixed set, although there is a predefined set. And you can also get the attributes at a particular point of an attributed string. So you pass the attributed string, the index where you want to get attributes, and also a pointer to a range. The range gets set to the run of the attributes, so you know how big a stretch of text has those particular attributes. And then the function call returns the dictionary of attributes that apply to that range.
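The idea of asking for the attributes at an index and getting back the run they cover can be modeled in a few lines. This is a toy Python model, not CFAttributedString: the class name, the run representation, and the attribute keys are all hypothetical.

```python
from bisect import bisect_right


class AttributedText:
    """Toy attributed string: plain text plus attribute runs.

    runs is a list of (start_index, attributes_dict) pairs sorted by
    start_index, with the first run starting at 0. Each run extends to
    the start of the next one (or to the end of the text).
    """

    def __init__(self, text, runs):
        self.text = text
        self.starts = [start for start, _ in runs]
        self.attrs = [attrs for _, attrs in runs]

    def attributes_at(self, index):
        """Return (attributes, (location, length)) for the run containing index."""
        if not 0 <= index < len(self.text):
            raise IndexError(index)
        run = bisect_right(self.starts, index) - 1
        start = self.starts[run]
        if run + 1 < len(self.starts):
            end = self.starts[run + 1]
        else:
            end = len(self.text)
        return self.attrs[run], (start, end - start)


styled = AttributedText(
    "Hello, world",
    [(0, {"bold": True}), (5, {}), (7, {"italic": True})],
)
attrs, (location, length) = styled.attributes_at(8)
print(attrs, location, length)  # {'italic': True} 7 5
```

Returning the effective range along with the attributes lets a caller skip ahead a whole run at a time instead of querying every index.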

Now, something that those of you who've programmed in WorldScript know is that when you're dealing with double-byte character sets, you can't just break a string at an arbitrary byte offset because it might be in the middle of a double-byte character. And you used CharacterByteType to determine if there was a safe place to break.

We can't use CharacterByteType in a Unicode application, but there's a similar issue to worry about, and that is the user characters that I talked about earlier, or what's called a cluster or a grapheme. You don't want to break in the middle of that because if you do, and then you only display the first part before the break, you'll actually mangle what the user thinks of as their character and display the wrong thing. So there are APIs available to help you find a safe place to break.

If you're using a CFString, you can use CFStringGetRangeOfComposedCharactersAtIndex, and that will find a safe place to break. If you're using a UniChar array, then you can use the Unicode Utilities UCFindTextBreak API and look for a cluster boundary, and that will also tell you a safe place to break. So here's an example. We have a string and an offset, and we want to figure out a safe place to break.

So we call CFStringGetRangeOfComposedCharactersAtIndex. We pass our string, and we pass the index, that is, the place where we would like to break. And what the API returns is a range, which is the beginning and the end of the user character or the cluster that corresponds to that offset.

So in this case, we take that range, and we go to the end of it. We take the beginning location and add the length in, and that's the place where we can break. We could also just use the beginning part of the range instead of the end. That depends on how you want your application to work.
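The range-then-break logic just described can be sketched in Python. This is a deliberately simplified stand-in for the system API: `composed_range_at` is a hypothetical helper that only treats a base character followed by combining marks as a cluster, which is an assumption that ignores the rest of the real cluster rules. It also indexes by Unicode character, not by UTF-16 unit.

```python
import unicodedata


def composed_range_at(text, index):
    """Return (location, length) of the base-plus-combining-marks
    sequence that contains text[index]."""
    start = index
    # Walk back from a combining mark to its base character.
    while start > 0 and unicodedata.combining(text[start]) != 0:
        start -= 1
    end = start + 1
    # Walk forward over every combining mark attached to the base.
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return start, end - start


text = "re\u0301sume\u0301"  # "résumé" with combining accents
wanted_break = 2             # lands between the e and its accent

location, length = composed_range_at(text, wanted_break)
safe_break = location + length  # break after the whole user character
print(location, length, safe_break)  # 1 2 3
```

Using `location` instead of `location + length` would break before the user character rather than after it, which is the other choice mentioned above.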

Another thing you need to do is to figure out what kind of character a given character is. Is it a letter? Is it a digit? So forth. In the WorldScript world, you used CharacterType for that, but that doesn't work with Unicode, so we can't use it anymore.

In the Unicode world, there are two ways to do this. There are the CFCharacterSet APIs in Core Foundation, and at a lower level, there's UCGetCharProperty. So here's an example to determine whether a character is a decimal digit. Now, you might think that to determine whether something is a decimal digit, you can just ask, well, is it ASCII 0 through 9? But it turns out in Unicode, there are lots more decimal digits than just those. There are decimal digits for Indic languages, for Arabic, and all of those are just as valid as decimal digits as the ASCII versions that we're used to.

So to test whether any character is a decimal digit, we can use CFCharacterSet. So first, we get a predefined CFCharacterSet, in this case, the set of decimal digits. And then we can call CFCharacterSetIsLongCharacterMember in order to determine whether a given Unicode character is a member of that set or not. And then we can branch one way or the other, depending on the answer.
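The point that decimal digits go well beyond ASCII 0 through 9 is easy to verify. The sketch below uses Python's `unicodedata` module in place of the character set APIs; treating the Unicode general category "Nd" as the set of decimal digits is the assumption, and `is_decimal_digit` is a hypothetical helper name.

```python
import unicodedata


def is_decimal_digit(ch):
    """True if ch is in the Unicode 'decimal digit number' category."""
    return unicodedata.category(ch) == "Nd"


samples = {
    "ASCII three": "3",
    "Arabic-Indic three": "\u0663",
    "Devanagari three": "\u0969",
    "Roman numeral three": "\u2162",  # a number, but not a decimal digit
    "Latin letter a": "a",
}

for label, ch in samples.items():
    if is_decimal_digit(ch):
        print(label, "is a decimal digit with value", unicodedata.decimal(ch))
    else:
        print(label, "is not a decimal digit")
```

A range check against ASCII 0 through 9 would accept only the first of these samples, even though the second and third are digits with the value 3.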

Well, it would be wonderful if your application could deal only with Unicode and never have to think about anything else, but there's still a lot of data out there that's not in Unicode. There's documents that users have. I have documents on my system that date back almost to the time the Mac was introduced, and those are definitely not in Unicode because it hadn't even been invented then.

There are protocols on the Internet that still require non-Unicode character sets. The Web is a big example. You can use Unicode on the Web, but many, many Web pages are not in Unicode. So you need to be able to move between the Unicode world and the non-Unicode world, and we've had support in Mac OS X for this for a long time in the form of the Text Encoding Converter, which is a fairly low-level API. That even actually dates back to Mac OS 9, but there's easier ways to do it using CFString. And again, there's a wide variety of APIs that you can use to do this, and we're only going to go through a couple of them.

In the first example, you can create a CFString using a C string, and all you need to do is pass the C string, which is null-terminated, and a text encoding to use. And that will give you back a CFString, which is in Unicode. If you want a little bit more control, for example, if your string isn't null-terminated or you want to control what happens if the data can't be converted completely, then you can use CFStringCreateWithBytes, which gives you finer control.
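The shape of that conversion (bytes plus a named encoding in, Unicode out, with a decision about what to do on bad data) looks like this in Python. It is a sketch with Python's built-in codecs standing in for the Mac OS X converters; `decode_legacy` and its `lossy` flag are hypothetical names.

```python
def decode_legacy(data, encoding, lossy=False):
    """Convert bytes in a legacy encoding to a Unicode string.

    With lossy=False, undecodable input raises UnicodeDecodeError.
    With lossy=True, each undecodable sequence becomes U+FFFD.
    """
    return data.decode(encoding, errors="replace" if lossy else "strict")


# Mac OS Roman bytes for "résumé": 0x8E is e with acute in that encoding.
print(decode_legacy(b"r\x8esum\x8e", "mac_roman"))

# Double-byte Japanese text cut off in the middle of a character.
truncated = "\u65e5\u672c".encode("shift_jis")[:-1]
try:
    decode_legacy(truncated, "shift_jis")
except UnicodeDecodeError as error:
    print("strict conversion failed:", error.reason)
print(decode_legacy(truncated, "shift_jis", lossy=True))
```

The same bytes decode to different text under different encodings, which is why picking the encoding is the hard part.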

So one question you might have is, what do I pass for that text encoding? And that's actually a non-trivial question. It depends a lot on where the data is coming from. If you're lucky and the data is coming, say, from an internet protocol and it's tagged with its character set, then you know what encoding to pass. But sometimes you have to guess.

And there are two good guesses. The first is the encoding that corresponds to your application's human interface. So, for example, if your application is running in Japanese and you call GetApplicationTextEncoding, you'll get MacJapanese back as the encoding. The second comes from CFStringGetSystemEncoding, and that's the text encoding that corresponds to the user's most preferred language.

Now, the user's most preferred language is not always the same as the language that your application is running in. And the reason for that is that the user's most preferred language may be one that your application is not localized into. So, for example, if the user's most preferred language is Inuktitut and you don't have an Inuktitut localization in your application, then you're not going to be running in Inuktitut. In that case, the application text encoding and the user's text encoding are not going to match. So which one of these you call depends on your application and where the data is coming from.

Another thing you have to worry about on the Internet or when sending Unicode to Windows is that other systems do not deal with the decomposed form of Unicode quite as well as Mac OS X does. And so it's better to convert Unicode to what's called Normalization Form C, which is the as-pre-composed-as-possible form, before you send it to those systems, and you can use CFStringNormalize to do that.

And a new feature in Tiger is that you can determine the text encoding used by MLTE. So if you're using the multilingual text engine, which is the Carbon-Unicode text engine, you can now specify the text encoding to use when opening or saving to plain text files. And that's a new feature in Tiger.

Okay, so we've covered the basics of how to store your text and how to get it in and out of your application, but there's more text to your application than just what's in the user's document. There's also the text that you create yourself for your human interface, and let's spend a little time talking about that.

Well, in the old world, you used the Resource Manager to store the localized pieces of your application. You used resources like 'DLOG' or 'MENU', or if you're using PowerPlant, maybe you're using 'PPob' resources. Well, those resources are all based on the old WorldScript world, and they can't support Unicode. So the modern equivalent for a Unicode application, or indeed any modern application, is the bundle, which I'm sure you've all heard about already, but I'll just give a very brief review.

An application bundle is a directory tree in the file system that's made to look to the end user as if it's a single file. You can store non-localized files, localized files, files of any type, actually. It's totally up to you. Movies, strings, what have you. Localized files are stored in an .lproj directory, and the .lproj directory is tagged with the ISO language code for the particular language that that localization corresponds to. So, for example, en for English, ja for Japanese. One of the most important kinds of things you can store in your application bundle are Interface Builder files, or nib files.

And those are the files that contain UI elements and replace the old resources that were used with the Control Manager and the Dialog Manager and so forth, the ones that didn't support Unicode. And there's a small set of APIs you can use for nibs with Carbon applications. You can create a nib reference from your application bundle, and once you have that, you can get your menu bar out, you can get menus out, you can get windows out with HIView hierarchies. It's very straightforward to use.

[Transcript missing]

Okay, well, a big part of any application that deals with text is drawing it, editing it, and inputting it. And there are several APIs available to do that. Now, when you talk about drawing text, you can sort of partition applications into two classes, or at least you can partition text drawing into two classes. First is drawing short strings.

And in the WorldScript world, namely QuickDraw Text, you did that with either DrawString or TextBox. The Unicode equivalents are DrawThemeTextBox, which is very straightforward. It just takes a CFString. And you can use that when you're happy to just use one of the standard theme fonts. If you need more control, you can call one of two MLTE APIs, either TXNDrawCFStringTextBox or TXNDrawUnicodeTextBox.

And the only difference between them is one takes a CFString and the other takes a UniChar *, so it depends on how your text is stored. And that gives you actually a lot more control, not just fonts, but also you can specify a CGContext. You can control things like rotation and so on.

Now, sometimes an application has to draw large amounts of text, and by that I mean drawing a document, implementing a text editing engine, implementing a web browser where you have to paint large amounts of text, and the APIs on the previous slide are not really appropriate for those kinds of tasks. Also, sometimes you need a lot more control over the way text is rendered, and again, the previous APIs are a little too simple.

Well, in the QuickDraw Text world, we used things like DrawText and MeasureText, and if you supported bidirectional text, you had to call GetFormatOrder. There are a whole bunch of APIs to call, and it's too complicated to go into in a talk like this. The equivalent set of APIs to use in the Unicode world for Carbon is ATSUI, Apple Type Services for Unicode Imaging, and again, that's a rather large API set. And rendering complex text is a sufficiently difficult problem that I'm not going to get into it in the two or three minutes I would have to cover it in this session.

So there's a great online reference, Rendering Unicode Text with ATSUI. I strongly recommend you start there if ATSUI is new to you. In addition, there's a session on Friday, Session 425, Modern Text Layout and Editing for Carbon Applications, where you can go to hear all about ATSUI and MLTE and to talk to the engineers who work on it.

Now, a much more ideal way of dealing with text is not to have to render large amounts of it yourself, but to use one of the built-in text editing engines. That's a lot easier than building your own. The text editing engine in the WorldScript world was called TextEdit, and there's also a control to go along with it, the EditText control, but unfortunately, they can't support Unicode, and they're now deprecated. So the modern Unicode equivalent is MLTE, the Multilingual Text Engine, and again, I'm not going to go into the details of the MLTE.

Whoops, oh, it's up there, but it's not down here. I'm not going to go into the details of the MLTE API, but there's a very nice online reference that you can read. And a new option that was introduced (I think it was in Panther) is HITextView, which makes it even easier to use MLTE. It wraps it up in an HIView object, so it can be part of an HIView hierarchy.

And could I... My monitor picture has disappeared, so it would be nice to get some support for that. In addition to HITextView, there's also a Unicode version of the EditText control, so that basically gives equivalent functionality but supports Unicode. And I have here a few API examples just to give you a flavor. You call HITextViewCreate, and that will create a new HITextView for you that wraps up an MLTE object.

And the nice thing about HITextView is it's not totally opaque. You can get at the underlying MLTE object so that you can do more advanced operations with it. You can save and open documents and so forth and so on, and you just call HITextViewGetTXNObject to get that out. And the Unicode text control is very easy to create. You just call CreateEditUnicodeTextControl.

You'll have to forgive me as my head swivels around for a while as I've lost my monitor here. Maybe I'll move over here so I can see the podium monitor while they're taking care of that. Okay, another problem: if you are implementing your own text editing engine or for some other reason you have to handle text input directly, then in the very, very, very old world, you might have called WaitNextEvent, or in the ancient world, even GetNextEvent. Hopefully nobody's calling that anymore. If you are supporting languages like Japanese or Chinese, hopefully your application is already using TSM and you are calling NewTSMDocument and specifying kTextServiceDocumentInterfaceType.

Well, unfortunately, that doesn't support Unicode, but there is a new document type, kUnicodeDocumentInterfaceType, that you can call NewTSMDocument with, and that will create a TSM document that supports Unicode. In the old version of the OS, text input was delivered with Apple events, but for the last several releases, it's been done with Carbon events.

And you want to avoid the keyboard class Carbon events, because those are raw keyboard events, and if you look at those, that will be before the input method has a chance to work on them. So you want to look at the text after the input method has processed it, and there are two Carbon events for that. The first is kEventTextInputUnicodeForKeyEvent, and that's what comes from input methods or keyboard layouts.

And then there's kEventTextInputUnicodeText, and that's what comes from non-keyboard entry methods such as the Character Palette or Ink. And you can basically handle those pretty much the same way. Now, if you're a TSM-aware application, there are several more Carbon events you have to deal with, but those are the same between Unicode and non-Unicode applications, and so we're not going to talk about them today.

Okay, so we know how we're storing our text, we know how we're getting it into and out of our application, we know how we're drawing and inputting it, but there's also operations on the text itself.

Something that's important in a lot of applications is sorting and searching. In the old WorldScript world, we only supported sorting, and you would call StringOrder or TextOrder in order to do a comparison of two strings. And of course, that depended on what the current script system was.

In the Unicode world, there are several APIs available. The easiest one to use is CFStringCompare, and you just give it two CFStrings and some options on how you want the strings compared, and it will tell you whether they're the same or one is less than or greater than the other. If you're working with arrays of UniChars, you can call the lower-level API UCCompareText.

Now, if you're going to be doing sorting, you're going to be doing a lot of key comparisons in your sort, and you may be comparing the same key multiple times. There is some overhead involved in doing a language and Unicode-sensitive comparison, so if you're going to be doing something like sorting a large amount of data, it's more efficient to get something that's called a collation key.

And a collation key is a string of bytes that does a binary compare the same way that the underlying string would do a language and Unicode-sensitive compare. So what you can do is call the Unicode Utilities UCGetCollationKey for a given text collator and a string of UniChars, and you'll get back a binary key that you can just compare using binary ordering, and that can make your sort go significantly faster.
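The collation key pattern (compute one binary key per string, then sort by plain byte comparison) can be shown with a toy key function. The key built here is an assumption made for illustration: it is much cruder than a real language-sensitive collator and only ranks letters first, ignoring accents and case, then uses the decomposed original as a tiebreaker.

```python
import unicodedata


def collation_key(text):
    """Build a toy binary sort key for text.

    Primary part: the text with combining marks removed and case folded.
    Secondary part: the decomposed original, used only to break ties.
    """
    decomposed = unicodedata.normalize("NFD", text)
    primary = "".join(
        ch for ch in decomposed if unicodedata.combining(ch) == 0
    ).casefold()
    return (primary + "\x00" + decomposed).encode("utf-8")


words = ["zebra", "\u00c9mile", "apple", "emile", "Eve"]

# sorted() calls the key function once per element and then compares
# the resulting byte strings, so no string is analyzed twice.
by_key = sorted(words, key=collation_key)
by_code_point = sorted(words)

print(by_key)
print(by_code_point)
```

Sorting by raw code point puts every capital letter before every lowercase one and pushes the accented name to the very end, which is the behavior a collation key is meant to avoid.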

Something that you couldn't do in the WorldScript world but you can do in the Unicode world is search for substrings. And again, CFString makes it very easy. There's CFStringFind. You give your target string and a substring that you want to look for in that target string and search options, and it will find the instances.

You can step through them. You can also look for more than just a substring. You can also find instances of characters in a CFCharacterSet. And again, this is just a sample of the APIs that are available. There are a lot more APIs available for sorting and searching, and I urge you to check out the documentation for CFString. It has a lot of capabilities.
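Stepping through the matches amounts to searching again from the end of the previous hit. The Python sketch below shows that loop and the character-set variant; both helper names are hypothetical, search options such as case insensitivity are left out, and the results are (location, length) pairs in the spirit of a range.

```python
def find_all(target, substring):
    """Return (location, length) for each non-overlapping match."""
    ranges = []
    if not substring:
        return ranges
    position = target.find(substring)
    while position != -1:
        ranges.append((position, len(substring)))
        # Resume the search just past the match we found.
        position = target.find(substring, position + len(substring))
    return ranges


def find_first_in_set(target, characters):
    """Return the index of the first character of target that is in
    the given set, or -1 if there is none."""
    for index, ch in enumerate(target):
        if ch in characters:
            return index
    return -1


print(find_all("banana", "an"))                # [(1, 2), (3, 2)]
print(find_first_in_set("width=42", set("0123456789")))  # 6
```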

Sometimes you need to change the case of something, and we have UppercaseText and LowercaseText available in the Text Utilities for doing that, but they don't work with Unicode. The modern equivalent for a Unicode application is in CFString, and there's CFStringUppercase, which converts everything to uppercase. You'll notice that it takes two parameters, a string and a locale. The reason for that is that the rules about how to convert uppercase to lowercase or lowercase to uppercase differ a little from language to language.

For example, in Turkish, the rules are different from English, and so you need to pass a locale if you want the case conversion to be done in a correct language-sensitive fashion. Something that you can do with CFString that you couldn't do in the Script Manager is caching.

[Transcript missing]

And again, this is new in Tiger, and this is in the WWDC preview release that you've received, so you can experiment with it. There are also other manipulations on strings, basically just moving parts of strings around. And in the WorldScript world, we had Munger. Munger just works on bytes.

In addition, it requires that your text be in a handle. There are several options available to replace Munger if you're working with Unicode. CFStringReplace is very easy to use. You take a mutable string, a range of that string that you want to replace, and what to replace it with.

Very straightforward. There's also CFStringCreateWithFormat and CFStringAppendFormat, which work a lot like printf. And again, those are fully Unicode compatible. There's also CFStringTrim, which will remove constant strings from the beginning or end of a CFString, a mutable string, that is. And also CFStringTrimWhitespace, which will remove whitespace characters. And if you really need to just move bytes around, then there's the standard C library routine memmove, which handles arbitrary byte moves and deals with issues like overlapping source and destination.
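
For the raw-bytes case, a Munger-style range replacement can be sketched with memmove. This is a minimal illustration and not an Apple API: replace_bytes and its parameters are hypothetical names, it works purely on bytes (so it knows nothing about Unicode characters), and it assumes the replacement data does not overlap the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Replace old_len bytes starting at offset in buf with rep_len bytes from rep.
   buf currently holds len bytes and has room for cap bytes.
   Returns the new length, or (size_t)-1 if the range is invalid or the
   result would not fit. Assumes rep does not point into buf. */
size_t replace_bytes(char *buf, size_t len, size_t cap,
                     size_t offset, size_t old_len,
                     const char *rep, size_t rep_len) {
    if (offset > len || old_len > len - offset) {
        return (size_t)-1;
    }
    size_t new_len = len - old_len + rep_len;
    if (new_len > cap) {
        return (size_t)-1;
    }
    /* Shift the tail into place. Source and destination can overlap,
       which is exactly the case memmove handles and memcpy does not. */
    memmove(buf + offset + rep_len,
            buf + offset + old_len,
            len - offset - old_len);
    /* Drop the replacement bytes into the gap. */
    memcpy(buf + offset, rep, rep_len);
    return new_len;
}
```

Passing a zero-length replacement deletes the range, and passing a zero-length range inserts at the offset.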

If you have an application that displays text in a list, or presents text in a fixed-size space, and you have a string that's too large for that space, or in a list if it's too large for the column, then you need to truncate the string. And that needs to be done in a Unicode- and language-sensitive way.

In the Script Manager world, we had TruncString and TruncText to do that. There are two ways to do that in the Unicode world. One very nice option if you're using ATSUI directly is to use ATSUI's line truncation tag. And what that will actually do is truncate the string while it's being drawn. So you don't actually have to modify the string itself in memory.

What you can do is tell ATSUI that you need to draw the string in a fixed width, and if you specify the line truncation tag, if it fits by itself, that's fine. If it's a little too big, ATSUI will try to squish it down a little bit first so it can draw the whole thing. And if it still doesn't fit, then ATSUI will truncate the string and insert an ellipsis. If you want to actually truncate the data itself, which is the way that TruncString and TruncText worked, then you can call TruncateThemeText, which is a Unicode equivalent.

Something that's very important for applications that deal with text is finding appropriate boundaries. So we already talked about a cluster boundary, which corresponds to what the user thinks of as a character. But there are other boundaries as well. So let's take a look at this slide. There's an example at the bottom that illustrates line and word break.

And you'll see that line break and word break are not the same thing, although they're often thought of as being the same thing. So, for example, if I'm doing line breaking, it's acceptable to break after the hyphen. But if I'm doing word breaking, that is determining what constitutes a word, either for double-clicking or for doing whole word searching, then breaking in the middle of that is not acceptable.
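
The difference can be made concrete with a deliberately simplified C sketch. The rules below are toy assumptions for plain ASCII text (break lines after a space or hyphen, break words only at spaces), not the real Unicode boundary rules and not the Unicode Utilities API, and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy rule: a line may be broken after a space or after a hyphen. */
bool toy_line_break_after(char c) {
    return c == ' ' || c == '-';
}

/* Toy rule: only a space separates words; a hyphen stays inside the word. */
bool toy_word_break_after(char c) {
    return c == ' ';
}

/* Count how many pieces the text splits into under the given rule.
   A break after the final character does not start a new piece. */
size_t toy_count_pieces(const char *text, bool (*break_after)(char)) {
    size_t len = strlen(text);
    size_t pieces = (len > 0) ? 1 : 0;
    for (size_t i = 0; i + 1 < len; i++) {
        if (break_after(text[i])) {
            pieces++;
        }
    }
    return pieces;
}
```

With the text "well-known fact", the line rule yields three pieces ("well-", "known ", "fact") while the word rule yields two ("well-known ", "fact"), which is why the two kinds of boundary need separate handling.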

So line breaking and word breaking are different. At the moment, the only APIs that are available for doing this kind of breaking operate at the UniChar array level, so that's the Unicode Utilities. The first step is to create a text break locator by calling UCCreateTextBreakLocator, and you specify when you create it which kinds of text boundaries you're interested in, whether it's a cluster boundary or a word boundary or a line boundary.

And then you can call UCFindTextBreak to iterate through the breaks in your text, either in a forward or backward direction. If you're interested in cluster boundaries, then as I mentioned earlier in the talk, there's CFStringGetRangeOfComposedCharactersAtIndex, which works at the CFString level. But if you need line or word breaks, then you need to call the Unicode Utilities.

Okay, the last topic that we're going to cover is dates, times, and numbers. So there are several things you need to be able to do with dates, times, and numbers. One is to convert a date or a time that's in a binary format into a string to display it to the end user. Or the user might have typed a date or a time into a text entry field, and you need to convert it back to a binary number so you can perform an operation on it. And in the old world, there were several APIs available for that. I'm not going to read them all off, but they're all deprecated now. In Panther, we introduced CFDateFormatter, which is a new set of APIs in Core Foundation that do this in the Unicode world.

And so we'll go through a small example here. Again, CFDateFormatter has a fair number of APIs, and we don't have time to go into detail on all of them, so I'll just go through a short example. You can call CFDateFormatterCreateStringWithAbsoluteTime to use a CFDateFormatter to convert a time, a binary number, into a string. If you're going in the other direction, you use CFDateFormatterGetAbsoluteTimeFromString. Again, you pass a CFDateFormatter and the string, and you'll get back a binary time.

Thirdly, CFDateFormatters have properties that you can set on them that control how the formatting is done. And you can use CFDateFormatterSetProperty to set a particular property on the date formatter. So here's a complete example we'll go through. First, we create our date formatter. Again, we pass NULL to indicate the standard storage allocator for Core Foundation.

We need to pass a locale to specify what kind of date formatting we're doing, because the date formatting for, say, U.S. English is very different from that for Japanese or German or Dutch or what have you. So we call CFLocaleCopyCurrent, which gives us back the user's current locale. Now, if you were doing this in a real application, you'd want to save the user's current locale so that you don't keep calling CFLocaleCopyCurrent over and over again. Because, first of all, you'd get a lot of copies.

And second of all, you want to take a snapshot of the user's current locale so that you get consistent results. The other thing we need to specify when we're creating our CFDateFormatter is what style of date and time we want. In this case, we're saying we want the long date style and the long time style.

And the next thing we're going to do is, since in this example we're going to convert a date entered by the user into a binary time, we're going to set the lenient property on the date formatter. And we do that by calling CFDateFormatterSetProperty, passing the formatter and the key for the lenient property and setting it to true.

Now, what that's used for is, if you don't set this property, when you try to convert a date or time string to a binary number, CFDateFormatter will try to match it exactly against the template that's used for formatting dates, for converting a date from a binary number to a string. And if it doesn't exactly match that template, the conversion will fail.

What the lenient property does is it sets the date formatter so that it will try as hard as possible to interpret the input string as a date or time, even if it doesn't match the template that it's expecting. So you pretty much always want to set this unless you're doing some kind of validation.

And the final call we make is CFDateFormatterGetAbsoluteTimeFromString. We pass our CFDateFormatter and the string, that's the input. You have the ability to pass some options, but we're passing NULL in this case. And finally, you pass a pointer to the CFAbsoluteTime to be filled in.

Now, sometimes you have to do operations on dates other than converting them to strings or converting them back from strings to a binary number. Sometimes you need to do calendar operations. An example might be, take this date and add one month, or take this date and add one year.

And so in the Script Manager world, there were APIs like ToggleDate and ValidDate and LongDateToSeconds and LongSecondsToDate that converted between the binary form of time and a structure which specified the year, month, day, et cetera, separately. The time type for new APIs is CFAbsoluteTime.

And for a while, there's been a set of APIs for CFAbsoluteTime for doing computations with the Gregorian calendar. And those were, I don't know what release they were introduced in, but they've been in for a couple of releases now. But those APIs can't handle non-Gregorian calendars, which we're adding more support for in Tiger. And so we're introducing a new type, CFCalendar. It's a new Core Foundation type.

And it's a set of APIs that will work with any kind of calendar. And so we're using the new type, CFCalendar, to do calendar computations such as toggling dates, validating dates, and getting components of dates. And this API did not make the preview release, the WWDC preview release, but it is something we're working on for Tiger. So I'm just going to tell you a little bit about it today since you can't work with it yet.

CFCalendar can do things like convert a set of calendar values to an absolute time. So, for example, if you give it a year, a month, and a day, you can convert that to an absolute time. It can also go in the other direction. It can take an absolute time and pick out the calendar components that correspond to it.

And finally, it can do toggling operations, such as taking an absolute time and adding a fixed quantity to it, such as a year, a month, or a day. So this is the multi-calendar replacement for the Gregorian calendar APIs that are in there right now, and look for it in a Tiger release coming soon.
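
Since the CFCalendar API was not yet available, the sketch below does not guess at it. It only illustrates the three kinds of operation described (components to time, time to components, and adding a unit) using the standard C library, which handles the Gregorian calendar only. The helper names make_local_time and add_months are hypothetical.

```c
#include <time.h>

/* Components to time: build a local time value from a Gregorian
   year, month (1-12), and day. Noon is used to stay clear of
   daylight-saving transitions. */
time_t make_local_time(int year, int month, int day) {
    struct tm parts = {0};
    parts.tm_year = year - 1900;
    parts.tm_mon = month - 1;
    parts.tm_mday = day;
    parts.tm_hour = 12;
    parts.tm_isdst = -1;   /* let the library decide about DST */
    return mktime(&parts);
}

/* Adding a unit: split the time into components, bump the month,
   and let mktime normalize any overflow (month 12 rolls into the
   next year, for example). */
time_t add_months(time_t when, int months) {
    struct tm parts = *localtime(&when);   /* time to components */
    parts.tm_mon += months;
    parts.tm_isdst = -1;
    return mktime(&parts);
}
```

Adding one month to 15 December 2004 this way yields 15 January 2005. A day that does not exist in the target month (such as the 31st) is carried over into the following month by mktime, which is one of the policy decisions a real calendar API has to make explicit.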

Well, very similar to dates and times, we also need to be able to convert numbers between a binary format and a string that a user can understand. And again, that needs to be done in a locale-sensitive way because different countries have different conventions for the way that numbers are formatted. In the WorldScript world, there were APIs available for doing that. In Panther, we introduced CFNumberFormatter, which is the Unicode equivalent.

And again, we'll go through a short example. CFNumberFormatter has several APIs that we don't have time to go into. You can create a string with a value using CFNumberFormatter, and you just pass the formatter. You have to specify the type of the value because it could be, say, a floating point number, a double, a long, what have you.

So you need to specify what type it is. You can also go in the other direction. You can take a string and interpret it as a number using CFNumberFormatterGetValueFromString. And again, you pass the formatter, the string, and some other options, and you'll get a number out.

Finally, you can also set the format that's used for a number formatter. If you create a number formatter with a locale, you'll get the default format for that locale. But number formatters use a formatting string that is very similar to the pattern string that you might see in a spreadsheet program such as Excel.

You can set your own format strings to format numbers in a particular way. You do that by calling CFNumberFormatterSetFormat and passing a string that represents the format pattern to use. Here's an example. We'll format a number. We create our number formatter using, again, the default storage allocator. Again, we pass a copy of the user's current locale. And again, you want to save that away as opposed to getting it every time you make this API call. And in this case, we're saying we want a number formatter that uses the currency style because we're going to be formatting currency.

We have a double, which stores the currency amount we want to format. It's a floating point number, 42. We call CFNumberFormatterCreateStringWithValue. Again, the default storage allocator. We pass the number formatter that we created two lines back. We specify that we're passing a double. And then we pass the address of the variable. And this API will then return a string with that number formatted as currency according to the conventions of the user's current locale.

So that has been our whirlwind tour of the Unicode APIs that are replacements for WorldScript. Again, we did not have time to go into detail on all of them because there are a lot of APIs out there. But the goal of this presentation was to help you to understand how to translate a particular piece of your existing WorldScript application to the Unicode world.

So hopefully this presentation gave you the pointers you need to know where to go in the documentation to do that. If you have further questions, the first person you should contact is Xavier Legault, who is the representative for these technologies in Worldwide Developer Relations. You can also contact me, but please do try Xavier first.

Rather than give you a long list of URLs to go to for information on Unicode APIs, there's a one-stop shopping page, and this is the URL. If you go to our Unicode Reference Library page, you'll find links to all the API sets and all the documentation you need to convert your application to Unicode.