Internationalization in Cocoa - WWDC 2006

Application Technologies • 45:59

Making your application ready for world-wide use can increase the size of your market. Learn to take advantage of Cocoa features that make internationalization easy, including support for international text via Unicode strings, and objects that abstract information about a user's locale.

Speakers: Doug Davidson, Aki Inoue

Unlisted on Apple Developer site

Check out Bezel, our iPhone mirroring app →

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Good morning, everybody. I'm Doug Davidson, and I'm here to talk to you about internationalization in Cocoa. Now, what is internationalization? Probably everybody who's ever built a Cocoa application has encountered our model for localization, single binary multiple localization, where you have a single code base and then multiple localized versions of your resources for each different localization.

And that's very important for the user experience. But internationalization is a somewhat different and perhaps more subtle matter. Internationalization covers making sure that your application handles your user's data in the manner they expect, no matter what their language, no matter which localization they may be using, even if your application is never localized at all. This is important. Why should you be interested in this? Because it's important for making sure that your application is suitable for a world-wide audience. If you have international problems, you could potentially cut yourself off from a significant portion of your audience.

Now we're going to talk about some of the areas that are most important for this, key event handling, we'll talk about strings, we'll talk about date and time formatters, and then we'll talk a bit about text. Now, the fortunate thing here is that the Cocoa frameworks try very hard to make sure that everything is fully internationalizable by default.

So in a certain sense, all you have to do is not break that. However, there are some assumptions that are very easy to make, particularly for someone like me. I'm a native English speaker. I came into this field a long time ago before Unicode, and ASCII was the hot new standard.

And so there are certain assumptions that it's very easy to fall into that cause real problems from an international perspective. And we've been dealing with these issues for a long time, and we've encountered quite a few of these. And so I decided to see how many of these sorts of problems I could fit into a single application. Here's the application that I came up with. If we could go over to the demo machine.

[Transcript missing]

I know possibly not all of you have had a chance to really work with some of these complex input methods, so I wanted to just give a really basic demonstration of one of them to show, I think this will help make clear some of the issues involved here. And the method I've chosen to use here is is used for Chinese text. And what the user does in this particular method is to type in a phonetic representation of the character that they're interested in.

And then they hit space, and they're presented with a list of characters that they can arrow through and choose from. And then when they find the character they really want, they can hit return and select it. And at that point, it finally gets accepted and entered into the text view.

The thing here is that there are lots of keystrokes that are coming in that are really intended for the input method and not for the text view proper. And if you intervene at the keydown level, you can interfere with this. At worst, at best, you're going to get a lot of spurious keystrokes that you didn't want.

At worst, you can completely break this input method. And that's what this application that I showed you does. By overriding keydown and swallowing these return characters, it swallows the return character that this input method wants to accept the character. And not all input methods are like this, but this is fairly typical. They can do almost arbitrary key handling. So let's go back to the slides. Here we go.

I'm going to explain something that you can do instead. I talked about keystrokes that are bound to particular methods. What are these methods that they're bound to? They're mostly defined in NSResponder. NSTextView is probably the class that implements the most of these various NSResponder methods. There's a list of which keys are bound to which methods. These are methods of things like insert new line, insert tab, move left, right, up, down, and so forth.

In particular, the method that's bound to the return key is insert new line. How could I improve this little demo app? Here's one example of something you can do. Instead of subclassing NSTextView, we're acting as a TextView's delegate and implementing TextView do command by selector. Then we look for this selector, insert new line, that indicates a return was pressed. If we see it, we do something special. In this case, we're returning no. Returning no means the TextView should go ahead and continue handling this event normally. If we return yes, then that would mean we've completely handled it and the TextView shouldn't do its normal thing.

Okay, so now that you've got the text into your application, you probably want to do something with it. And to talk about string handling, I brought in an expert, Mr. Anna String himself, Akki Inoue. Thank you, Doug. Hi, I'm Aki Inoue. I'd like to discuss string handling, mainly focusing on pitfalls you might stumble on.

These are the topics I'm covering today. Let's get started with basics. Cocoa provides rich string handling capabilities via the NSString class. NSString is the string object in Cocoa. It's accepted virtually anywhere from foundation class to Android's book framework. And it allows us to provide efficiency both in API design and performance point of view. So you don't have to use different string representation depending on the framework you are using. Also, Cocoa is designed to work exclusively with UniCode standards, and it also gives us coherent, consistent, clear API. So, we don't have to do something like this.

The UniCode standard is the industry standard character set. It's used by all the cool latest technologies. And the latest version, recently finalized version 5, covers more than 99,000 characters from over 60 scripts. A unique feature of UniCode standard is it's not merely assigning numeric values to each characters, but it also defines rules and behaviors dealing with those characters. For example, comparison or case mapping behaviors. One of those rules is a standard encoding form. UniCode has Three standard encoding intervals defined: UTF-8, UTF-16, and UTF-32.

The numbers represent the bit width of an atomic unit in the encoding. For example, UTF-16 is a two byte entity, just like Chopin. And a string is conceptually an array of UTF-16 characters, and it provides various string-hunting capabilities from comparison, searching, so on and so forth. And you could extend the capability of NSString using one of companion objects like NSCharacterSet, NSAttributeString, or NSCannons. Now we cover the basics. I'd like to discuss HP Folds. Now with the C-strings. C-strings. What is that? Simply, it's this. Now, what about it? Probably, this is one of the most familiar C data type for most of you. But I'd like to forget about it.

C-strings are mostly relying on some implicit string encoding definition. So, It could be Mac Roman, UTF-8, Japanese, or anything. So at some point, probably you are forced to guess the encoding you are using with some arbitrary C string pointers. Another issue with C-String is by using that, you are unnecessarily limiting the number of characters you can represent in your application.

And if you want to have multi-lingual capability in your application, it's much, much harder to keep track of each encoding used in portions of C string. Also, In order to make sure your application works with some arbitrary encodings like Cyrillic or Arabic, you have to probably test your application with the 22 Mac OS encoding we are shipping.

So I know you might be tempted, but stick with NSString whenever possible. It provides Higher-level operation so you don't have to access each of the individual characters by yourselves. And we consider the performance of NSString pretty seriously. And in fact, we know that NSString is one of the most efficient Unicode handling libraries in the industry.

And we keep track of the change in UniCode standard. So we're going to apply the change whenever possible to the major OS releases. There are cases you might still need to use C string. These are the recommendations we have. Always keep track of the encoding explicitly. So if you need to pass around C string pointer, we recommend to also pass around corresponding encoding explicitly. We have API to help you.

Another thing to note is if you need to use one of POSIX-like API, like Open System Code, You need to use specialized methods like file system representation to get file system compatible C string by the pointer. And no, we have already deprecated C string method that use implicit encoding.

Now we cover the strings. I'd like to mention case operations. For English, lower casing, upper casing is pretty simple, one-to-one mapping.

[Transcript missing]

Case mapping operation being worked on. So let me show some of the examples here. The #s character in German language only has lowercase representation. So if you want to uppercase the word like "Strauss" here, the character that-- becomes two capital S characters, thus changing the length of strength. Another example. It looks like two characters, but actually, it's a ligature, a single character looking like two characters. UniCode has those characters so that in addition to lower case, upper case, it needs to have title case. So UniCode has three cases, not just two.

Another example, you need to have the context in order to do the correct mapping. The Greek Sigma character, the Greek capital Sigma character here, requires a context in order to get the correct mapping. It could become lowercase sigma character, or it could be lowercase final sigma form, depending on its inward location. Even innocent-looking ASCII characters are not safe. If you're in a lowercase, the capital I character, it's going to be a special dotless I if you're writing in Turkish language.

So, casing operation is not a simple one-to-one mapping. It could change the length of string, and often requires the context of the whole string. And surprisingly, lower casing and upper casing is not symmetrical. So, You cannot expect a round-trip fidelity between these operations. And that caused interesting point that if you want to have case insensitivity, just by lower-casing and upper-casing character doesn't give you that.

You have to have specific case ins-- you need to use specific case insensitive functionalities in NSString. So if you want to do case operation, we provide these methods, uppercase string or lowercase string, and capitalized string, supporting the title case we mentioned earlier. So please don't look at individual characters when you want to change the case of string.

Next, I'd like to cover character clusters. Probably, it's one of the biggest mysteries in UniCode standards that multiple characters could comprise a semantic, grouped together character. Next, I'd like to cover character clusters. Probably, it's one of the biggest mysteries in UniCode standards that multiple characters could comprise a semantic, grouped together character. One of the common composition is combining marks. It's commonly used to represent accent marks. For example, this Gryff Alpha character has three accents. Thus, you can have six patterns to represent this character in UniCode. And the string handles all correctly and recognizes them as equal.

So you don't have to worry, but there are some exceptions. is equal and hash do not take composition difference into account. So when you want to put these strings into collection classes such as NSDictionary, you need to take extra steps beforehand. So when you want to put these strings into collection classes such as NSDictionary, you need to take extra steps beforehand. From arbitrary composition state string into free-composed or free-decomposed string. We recommend using the decomposed string with canonical mapping for compatibility with file system and also efficiency reasons.

There are other types of compositions. One is there are characters that can actually combine arbitrary characters, surrounding combining characters. So you cannot expect that combining only happens with the form of a base character followed by accent. Actually, you know, a character could bind any surrounding character. Combining Graphium Joiner is one of them. And another type of composition is Hangul characters used in Korean language. Each character could have at least two representations, combined or decomposed.

Yet another type of composition is solid pairs. Solid pairs are a combination of two UTF-16 characters, so it takes two slots inside a string. They are used to represent characters outside of 16-bit range by combining those two 16-bit data into a 32-bit data.

[Transcript missing]

Beyond 60-bit ranges used by customers in China and Japan. So if you're planning going to this market, this is pretty essential to support. Furthermore, the same combining behavior I mentioned earlier applies to the surrogate characters too. So there are already characters assigned in this area that can combine with other characters.

In Unicode, there are those types of composition, but You can forget all the details I mentioned today if you stick with NSString. NSString handles all of them transparently. If you need to process a character cluster all by yourself, you can use range of composed character sequence app index. The method maps character index into a range of character cluster containing the index. You can process each TurkTurk cluster from the beginning of the string, like the code up here.

I'd like to mention source file encodings here. Xcode supports any type of encoding, but we recommend keeping your source code asking only. For example, if you use Noasci byte in NSString literals, represented by a double-code format, the behavior is simply undefined. And often, what you get is garbled text like this in some other localization.

So if you need to embed some NoSQL characters into your source code, use one of those methods that takes one of the Unicode encoding forms. And if you have NoASCII characters, and that can be visible to users, the chances are you might want to try the localized string facility instead of trying to hard-code NoASCII characters into your source code.

Some examples. If you want to embed UTF characters into your source code, you can use simple C numeric values instead of using the actual bytes inside the source code and calling the UTF-8. Another example. This example, we're using the %C format directive so that you can include a UTF-16 character like ellipsis here into your NSString.

I'd like to mention NS-attributed string here a little. If you are looking for some string handling operation and you can't find it in NS-string, the chances are you can find it in NS-attributed string instead. For example, it can provide Grammatical analysis like word or line break information. And also, it's capable of reading and writing a number of plain and rich text formats.

So you might just skip the headache of dealing with plain text encoding issue just by using NSF attribute string. String Handling Wrap-up. Cocoa provides string-rich string handling. And it's multilingual. It makes your application world-ready, transparent. So by sticking to NSString, It makes your application world-ready, transparent. So by sticking to NSString, Now I'd like to bring back Doug Davidson to talk about locale and formatus. Thank you.

Thanks, Akki. So the next set of issues I want to talk about have to do with locale. What is locale? Locale is not the same thing as localization. Localization refers to the language that your application uses to communicate with the user. The locale has to do with the user's preferences for how they want their data organized and presented. And the object that we use to identify a locale is NSLocale.

And it has information about things like choice of calendar, date formats, time formats, sort orders, English versus metric measurement systems, and so forth. And you can use the class method CurrentLocale to get the user's CurrentLocale set of preferences. And this may be one of the predefined sets of formats, or the user might have customized it by choosing variations for some of these items. And in most cases, you're not going to be using NSLocale directly. You might probably use it indirectly through a number of locale-savvy classes like NSString, NSNumberFormatter, NSDateFormatter, and so forth.

And so let's talk about NSNumberFormatter. The most important thing for our number formatting is to use NSNumberFormatter. And NSNumberFormatter has been around for a long time, but it's only really been fully locale savvy since 10.4. So it has two different behaviors. There's an older compatibility behavior, and then there's the 10.4 behavior, which is the one you want to use. This is the default behavior for applications that are linked on Leopard, but if you're running back to Tiger, you need to explicitly specify this, and it's probably a good idea to do so.

And now, in the 10.4 behavior, it still supports format strings, but better than that, the better thing to do is to use one of the enumerated styles, like decimal style, currency style, and so forth, because these are things that will fully participate in the user's choice of preferences for the format that they want their numbers displayed in.

For NSDateFormatter, it's somewhat similar. Again, it's been around a long time, but it's fully locale-savvy, starting in 10.4, and you want to use the 10.4 behavior. And again, the best thing is to use enumerated styles, and for NSDateFormatter, there are two sets of enumerated styles. There's a date style and the time style, and these have values like full, long, medium, short, or none if you want to have a date, but not a time, or vice versa.

And these will fully participate in the user's choice of preferences for their display of dates. We recommend against using NSCalendarDate because, again, this is an older class and has the pre-10.4 behavior. Another thing that people sometimes do that we recommend against, and that is using format strings and making them localizable strings.

The problem there is that, as I said, the localization is not the same thing as locale. For example, many users around the world may use the English localization, not because it's not their first language, but it may be the second language they're most comfortable with. But they still will have their own preferences and requirements, indeed, for date format, time format. They'll have their own data, which is maybe in their own language that they'll need handled properly.

If you need to do calendrical calculations, we have the class NSCalendar, which handles a variety of different calendars in use around the world. And you can use the current calendar method to get the current user's preferred calendar. Now, if you're using NSCalendar for calendrical calculations, it handles all these different calendars in a sort of generalized way.

So you have to avoid making assumptions that might be valid only for, say, the Gregorian calendar. They don't all have the same number of days or months and so forth. Years don't necessarily start at the same point. And again, NSCalendarDate is an older class that handles only the Gregorian calendar. So I would like you to move forward from that to NSCalendar.

sorting of strings. Now, the basic NSString compare method is a default locale independent sort order. It's useful for programmatic sorting. If you are sorting data to present to the user, then you probably will want to use a fully locale savvy comparison method. We have some basic convenience ones, localized compare, localized case and sensitive compare. Then we have the full options, compare options range locale into which you can pass NSLocale current locales locale.

It's important to assume as little as possible about what these sort orderings are going to do. The sort orderings can vary quite a bit, not just with language, but even in a writing system, but from country to country within the same language or writing system. Different sort orders handle things like accents differently. Some of them sort them with the base character, some of them sort them separately. Some of them will combine multiple characters for purposes of sorting.

So you should assume as little as possible about what they're going to do and just let the comparison method handle comparison for sorting. So the little application I showed you just used the standard obvious compare method for sorting its strings. But these were going to be presented to the user, so what we really want is a locale-savvy sort, case-insensitive in this case.

and that's the thing we prefer. So the next thing that little sample application did was to measure the width of the words in the text. And right away it runs into a problem. So what does it do? It asks the text view for its font, and it goes through and finds each word, and then it calls font width of string string to get the width of that string. Again, simple, obvious, and wrong.

The NSString method, no, excuse me, the NS font method with the string is deprecated, and I want to explain why. Most obvious is that Given an arbitrary string, it can be an arbitrary UniCode string, and a single font is not going to be able to represent any character that might be in a string. And in general, the Cocoa text system is going to substitute in a suitable font whenever it finds a character that can't be represented in a specified font. So you can't necessarily just rely on the font to measure a particular string.

But even more important from an international perspective, even if you know that the string can be represented with a given font, just adding up the width of the glyphs in a string is not sufficient to measure its width. The glyphs will often have different contextual forms depending on where they appear and what the characters around them are. They may need to be reordered depending on the writing system. They may need to be repositioned. So in general, you need to have text layout in order to properly measure text or in order to properly display it.

Now, fortunately, Cocoa has some simple convenience methods that make it very easy to measure and draw strings and attributed strings. These are what we call the string drawing methods, things like NSString's size with attributes, where you specify an attribute dictionary and get back size. And there are various methods for these for NSStrings or NSTributed strings for sizing, drawing at points and rects and so forth. And you can look through NSStringDrawing.h for these.

If you have needs that are more complicated than these simple convenience methods will handle, then you can go to NSLayoutManager, which gives you full control over the layout process. I'm going to talk about that in a minute. So what should our little sample application have been doing? In this case, we can use NSString's size with attribute method as a drop-in replacement for what it was doing. In this case, we create an attribute dictionary with the font that we specify, and we go through, and for each string, we call sizeWithAttributes with that attribute dictionary and get the width.

Okay, so I mentioned the use of NSLayoutManager. And this gives you a great deal more control than these simple convenience methods. With NSLayoutManager, you can get complete information about the contents of the layout, about all the glyphs and all their positioning and so forth. It also has a lot more options.

You can get complete control over the process. Now, you've probably used NSLayoutManager already, just not known about it. NSLayoutManager normally is part of the hidden machinery that goes into NSTextView. But if you need it, if you need that sort of power, then you can use it on its own.

So how do you get a layout manager? Well, the simple and easiest method is if you have a text view, or if you create a text view, then you automatically get all the pieces of the Cocoa text system provided for you. And these are NSTextStorage, which is the contents of the text, the document, that is, the characters and their attributes, the fonts, and so forth. NSTextContainer, which describes the region within which the text is to be laid out. And then on the last, NSLayoutManager, which is a central object that manages everything and will give you all the information you need.

If you don't want to have an S-text view or if you want to completely customize this, then you can create all these objects yourself and hook them up. It's a few more lines of code. Fortunately, we have some good examples, such as the circle view example that shows exactly how to do that.

So what does NS Layout Manager do? The basic job of the Layout Manager is to take the text, that is, characters and their attributes such as fonts, as described by an attributed string, that is a text storage, and to convert them into a sequence of glyphs and glyph positions. What is a glyph?

A glyph is the fundamental displayable element in a particular font. It might be a letter, an accent, or a ligature combining multiple letters. So NS Layout Manager manages the process of taking the text, the characters and their attributes, and converting them into glyphs and their positions. And it provides all this information to you.

The most fundamental thing is it provides the mapping from the characters to the glyphs. And in general, this is not necessarily a one-to-one mapping. In fact, it can be many-to-many. A single character may require multiple glyphs to represent it. Or a set of characters may be represented by a single glyph.

And Aslant Manager handles all this for you. It makes sure that the mapping is always in order. But in general, it's many-to-many. And whenever you have a range in the text, you need to know whether you have a range of characters or a range of glyphs. And you use Aslant Manager to convert back and forth between these.

And as I said, NSLayoutManager also keeps track of the positions of these glyphs as they're laid out in the text. And it will tell you, for example, the size and position of any particular line in the text. And it will tell you the position of any particular glyph within that line.

And the thing to remember here from an international perspective is that the ordering of glyphs within a line does not necessarily bear any particular relationship to the ordering of the characters that generated them. So some writing systems write from left to right, some of them write from right to left. In some cases you will have text that combines both of these in a single line. So it may start out from one end and then turn around and go the other direction.

So when you inquire of NS Layout Manager about the positions of glyphs, you just need to make sure that you do not assume that the positions of these glyphs are in any particular order based on the ordering of the characters that generated them. So here's an example. Here at the top, we have the characters. We start off with some English, and then we include some Arabic.

And the glyphs down at the bottom, let me point out a couple of things about these. First of all, right here, we have an FI ligature glyph. That is two characters, F and I, represented by a single glyph. So right here, even in English, we have an example of a situation where the mapping between characters and glyphs is not going to be one-to-one.

Then when we go over to the Arabic, what happens is that these Arabic characters at the top are in logical order, in the order they appear in the text, but then when they're going to be displayed, their order will be from right to left within this line. So the positional ordering of the glyphs is different from the logical ordering of the characters. And another thing to notice is that these glyph forms, as they appear in context here, are quite a bit different from the context-independent appearance that I've shown up at the top. So these glyphs change shape depending on where they appear in a word.

NSLayoutManager also provides some higher level information. Things like you can go from ranges of characters and glyphs in the text to rects within the layout. And here again, the thing to remember is that even within a single line, within a single range of glyphs, because they can be reordered, you may end up with multiple rects that you need to represent them.

NSLayout Manager provides information about insertion points, potential insertion points in the text, and it can go the other direction. If you're given a location in the layout, it can go back and determine what glyph that corresponds to. Or if you're given a rect in the layout, it can go back and determine the glyph range that it encompasses.

So here is a brief example of how we might have used that if we wanted to measure the width of certain pieces of text. In this case, we have a text view, suppose, and from the text view, we get the layout manager and the text container. And then if we've determined a particular range of text that we want the width for, then we can go through and call this method rect array for character range and so forth and so forth.

And it returns a C array of rect. And then, as I said, there may be multiple rects even needed to represent one contiguous range in the text. So we go through and add up the widths of all these rects to get the total width of this portion of the text.

So, points I want you to take away. You should let the Cocoa framers do as much as possible for you, handling internationalization and international issues. And you need to avoid any of the problematic assumptions, things like assuming that a single keystroke corresponds to a single character, or that a single character corresponds to a single glyph, or that the locale is equivalent to the localization. or that all the glyphs in a line are ordered from left to right. Use UniCode. Use NSString as much as possible to handle the heavy lifting for you. Use the date and number formatters for user-visible dates and numbers, and use the 10.4 behavior for these.