Configure player

Close

WWDC Index does not host video files

If you have access to video files, you can configure a URL pattern to be used in a video player.

URL pattern

preview

Use any of these variables in your URL pattern, the pattern is stored in your browsers' local storage.

$id
ID of session: wwdc2006-150
$eventId
ID of event: wwdc2006
$eventContentId
ID of session without event part: 150
$eventShortId
Shortened ID of event: wwdc06
$year
Year of session: 2006
$extension
Extension of original filename: mov
$filenameAlmostEvery
Filename from "(Almost) Every..." gist: ...

WWDC06 • Session 150

Internationalization in Cocoa

Application Technologies • 45:59

Making your application ready for world-wide use can increase the size of your market. Learn to take advantage of Cocoa features that make internationalization easy, including support for international text via Unicode strings, and objects that abstract information about a user's locale.

Speakers: Doug Davidson, Aki Inoue

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it may have transcription errors.

Good morning, everybody. I'm Doug Davidson, and I'm here to talk to you about internationalization and Cocoa. Now, what is internationalization? Probably everybody who's ever built a Cocoa application has encountered our model for localization, single binary multiple localization, where you have a single code base and then multiple localized versions of your resources for each different localization. And that's very important for the user experience. But internationalization is a somewhat different and perhaps more subtle matter.

Internationalization covers making sure that your application handles your users' data in the manner they expect, no matter what their language, no matter which localization they may be using, even if your application is never localized at all. This is important. And why should you be interested in this? Because it's important for making sure that your application is suitable for a worldwide audience. If you have international problems, you could potentially cut yourself off from a significant portion of your audience.

Now, we're going to talk about some of the areas that are most important for this-- key event handling. We'll talk about strings. We'll talk about date and time formatters. And then we'll talk a bit about text. Now, the fortunate thing here is that the Cocoa frameworks try very hard to make sure that everything is fully internationalizable by default. So in a certain sense, all you have to do is not break that. However, there are some assumptions that are very easy to make, particularly for someone like me. I'm a native English speaker. I came into this field a long time ago before Unicode, and ASCII was the hot new standard. And so there are certain assumptions that it's very easy to fall into that cause real problems from an international perspective. And we've been dealing with these issues for a long time, and we've encountered quite a few of these. And so I decided to see how many of these sorts of problems I could fit into a single application. Here's the application that I came up with. If we could go over to the demo machine.

So this is a simple little application. It has a text view. You type some text in here. And then whenever you hit Return, it goes through the text and measures the width of every word in the text and puts it up in a table. So OK, this is a kind of contrived little application. But the problems that it exhibits are all very real. Now, how can a simple application like this go wrong? Let's go back to the slides. And I'll talk about how I can do that. And the first thing here is, how does this application detect that you hit Return in the TextView? So we have a TextView. The application does.

Subclasses in its TextView overrides key down, and it looks for a return character. Simple, obvious, and wrong. Now, to explain why this is wrong, I'm going to have to talk about the path that key events take when they come into the Cocoa Text system. Key down gets raw key events. The user hits a key, boom, the text view gets a key down method. And now there may be some views that will handle keys at this raw key down level, but a sophisticated text handling view like NSTextView is not going to do that. It's going to immediately pass them on to a second stage of handling, which is called key binding, and it does that by calling interpret key events. And what happens in key binding is a couple things. First of all, key binding is going to look at these keys and see whether they're ordinary keys like letter keys that are just input of text, or whether they're special keys like arrow keys, tabs, or turn, and so forth that are going to be bound to particular methods.

And the other thing more important from an international perspective is that at this point, complex input methods are going to have a chance to deal with the keystrokes. And these input methods are used mainly in languages like Chinese and Japanese that have large numbers of characters that the user needs to select from. A single keystroke isn't enough, and it's necessary to do some extra work to pick the character that's going to be entered. And so after the keybinding stage, after the input methods have had a chance at the keystrokes, then it comes back to the text view in one of two ways, either with a call to insert text in the case of ordinary typing like letters and so forth, or with a call to do command by selector for these special keys like arrows and return and tab and so forth that are bound to particular selectors. And this is the point at which you would usually want to intervene. You can subclass NSTextView in a variety of one of these methods. Fortunately with NSTextView, you usually don't have to do that because NSTextView also has delegate methods. For example, there's the delegate method text view do command by selector that gives you complete control over all these selector bound keystrokes. And there are other methods like should change text and will change selection and so forth, all sorts of things that you can control from the text view delegate level. There's one caveat I should mention with insert text that showed up in the labs just yesterday. insertText can receive either an NSString or an NSAttributedString. So if you are planning to override insertText, then you have to be prepared to handle either of those. The documentation does mention this. It's perhaps not entirely obvious. There are some cases, some input methods that are going to be passing in an attributed string because they need to specify, say, the font. Now let's go over to the demo machine.

I know possibly not all of you have had a chance to really work with some of these complex input methods, so I wanted to just give a really basic demonstration of one of them to show -- I think this will help make clear some of the issues involved here. And the method I've chosen to use here is -- is used for Chinese text. And what the user does in this particular method is to type in a phonetic representation of the character that they're interested in. And then they hit space, and they're presented with a list of characters that they can arrow through and choose from. And then when they find the character they really want, they can hit Return and select it. And at that point, it finally gets accepted and entered into the text view.

The thing here is that there are lots of keystrokes that are coming in that are really intended for the input method and not for the text view proper. And if you intervene at the key down level, you can interfere with this. At worst, at best, you're going to get a lot of spurious keystrokes that you didn't want. At worst, you can completely break this input method. And that's what this application that I showed you does by overriding key down and swallowing these return characters. It swallows the return character that this input method wants to accept the character. And not all input methods are like this, but this is fairly typical. They can do almost arbitrary key handling. So let's go back to the slides.

And so I'm going to explain something that you can do instead. But I want to talk about these, I talked about keystrokes that are bound to particular methods. And what are these methods that they're bound to? They're mostly defined in NSResponder. And NSTextView is probably the class that implements the most of these various NSResponder methods.

There's a list of which keys are bound to which methods. And these are methods of things like insert new line, insert tab, move left, right, up, down, and so forth. And in particular, the method that's bound to the return key is insert new line. So how could I improve this little demo app? So here's one example, something you can do. Instead of subclassing NSTextView, we're acting as a TextView's delegate and implementing TextView do command by selector.

And then we look for this select or insert new line that indicates a return was pressed. And if we see it, we do something special. And then in this case, we're returning no. Returning no means the TextView should go ahead and continue handling this event normally. If we return yes, then that would mean we've completely handled it and the TextView shouldn't do its normal thing.

OK, so now that you've got the text into your application, you probably want to do something with it. And to talk about string handling, I brought in an expert, Mr. Anna String himself, Aki Inoue. Thank you, Doug. Hi, I'm Aki Inaui. I'd like to discuss string handling, mainly focusing on pitfalls you might stumble on.

These are the topics I'm covering today. Let's get started with basics. Cocoa provides rich string handling capabilities by the NSString class. NSString is the string object in Cocoa. It's accepted virtually anywhere from foundation class to address book framework. And it allows us to provide efficiency, both in API design and performance point of view. So you don't have to use different string representation, depending on the framework you are using. Also, Cocoa is designed to work exclusively with Unicode standard. And it also gives us coherent, consistent, clear API. So we don't have to do something like this.

The Unico Standard, it's the industry standard character set. It's used by all the cool latest technologies. And the latest version, recently finalized version 5, covers more than 99,000 characters from over 60 scripts. A unique feature of Unicode standard is it's not merely assigning numeric values to each characters but it also defines rules and behaviors dealing with those characters. For example, comparison or case mapping behaviors. One of those rules is a standard encoding form. Unicode has a Three standard encoding intervals defined, UTF-8, UTF-16, and UTF-32.

The numbers represents the bit width of an atomic unit in the encoding. For example, UTF-16 is a two byte entity, just like Chopin. And a string is conceptually an array of UTF-16 characters. And it provides various string-hunting capabilities from comparison, searching, so on and so forth. And you could extend the capability of NSString using one of companion objects like NSCharacterSet, NSAttributedString, or NSCannons.

Now we cover the basics. I'd like to discuss HP falls. Now with the C strings. C strings. What is that? Simply, it's this. Now, what about it? Probably, this is one of the most familiar C data type for most of you. But I'd like to forget about it.

C-strings are mostly relying on some implicit string encoding definition. It could be Mac Roman, UTF-8, Japanese, or anything. So at some point, probably you are forced to guess the encoding you are using with some arbitrary C string pointers. Another issue with C string is by using that, you are unnecessarily limiting the number of characters you can represent in your application.

And if you want to have multi-liminal capability in your application, it's much, much harder to keep track of each encoding used in portion of C string. Also-- In order to make sure your application works with some arbitrary encodings like or Arabic, you have to probably test your application with the 22 Mac OS encoding we are shipping.

So I know you might be tempted, but stick with NSStream whenever possible. It provides-- higher level operation, so you don't have to access each of the individual characters by yourselves. We consider the performance of NSStream pretty seriously. In fact, we know that NSStream is one of the most efficient Unicode handling libraries in the industry.

And we keep track of the change in Unicode standard. So we're going to apply the change whenever possible to the major OS releases. There are cases you might still need to use C string. These are the recommendations we have. Always keep track of the encoding explicitly. So if you need to pass around C string pointer, we recommend to also pass around corresponding encoding explicitly. We have API to help you.

Another thing to note is if you need to use one of POSIX-like API, like Open System Core-- You need to use specialized methods, like file system representation, to get file system compatible C-String by the pointer. And no, we have already deprecated C string method that use implicit encoding.

Now we cover strings. I'd like to mention case operations. For English, lower casing, upper casing is pretty simple, one-to-one mapping. Not so for the rest of the world. There are cases-- there are string of-- casing operations is depending on the context, like the surrounding character for the-- case mapping operation being worked on. So let me show some of the examples here.

The sharp S character in German language only has lowercase representation. So if you want to uppercase the word like "Strauss" here, the becomes two capital S characters, thus changing the length of strength. Another example. It looks like two characters, but actually, it's a ligature, a single character looking like two characters. Unicode has those characters so that in addition to lowercase, uppercase, it needs to have title case. So Unicode has three cases, not just two.

Another example, you need to have the context in order to do the correct mapping. The Greek sigma character, the Greek capital sigma character here, requires a context in order to get the correct mapping, it could become lowercase sigma character, or it could be lowercase final sigma form, depending on its inward location. Even innocent-looking ASCII characters are not safe. If you're in the lower case, the capital I character, it's going to be a special dotless I if you're writing in Turkish language.

So casing operation is not a simple one-to-one mapping. It could change the length of string and often requires the context of the whole string. And surprisingly, lower casing and upper casing is not symmetrical. You cannot expect a round-trip fidelity between these operations. And that caused interesting point that if you want to have case insensitivity, just by lower casing and upper casing character doesn't give you that. You have to have specific case-- you need to use specific case insensitive functionalities in NSString. So if you want to do case operation, we provide these methods, uppercase string or lowercase string, and capitalized string supporting the title case we mentioned earlier. So please don't look at individual character when you want to change the case of string.

Next, I'd like to cover character clusters. Probably, it's one of the biggest mysteries in Unicode standards that multiple characters could comprise a semantic grouped together character. And Unicode has several types of composition. And surprisingly, you are allowed to have arbitrary state of composition or decomposition in any of strings. One of the common composition is combining marks. It's commonly used to represent accent marks. For example, this Grief Alpha character has three accents. Thus, you can have six patterns to represent this character in Unicode. And the string handles all correctly and recognizes it as equal.

So you don't have to worry. But there are some exceptions. is equal and hash do not take composition difference into account. So when you want to put these strings into collection classes such as NSDictionary, you need to take extra steps beforehand. In order to do that, we provide method. from arbitrary composition state string into free-compose or free-decomposed string. We recommend using the decomposed string with canonical mapping for compatibility with file system and also efficiency reasons.

There are other types of compositions. One is there are characters that can actually combine arbitrary characters, surrounding combining characters. So you cannot expect that combining only happens with the form of a base character followed by accent. Actually, a character could bind any surrounding character. Combining Graphium Joiner is one of them. And another type of composition is Hangul characters used in Korean language. each character could have at least two representations, combined or decomposed.

Yet another type of composition is solid pairs. Solid pairs are a combination of two UTF-16 characters. So it takes two slots inside a string. They are used to represent characters outside of 16-bit range by combining those two 16-bit data into a 32-bit data. And there are characters in these-- beyond 60-bit ranges used by customers in China and Japan. So if you're planning going to this market, this is pretty essential to support. Furthermore, the same combining behavior I mentioned earlier applies to the surrogate characters, too. So there are already characters assigned in this area that can combine with other characters.

In Unicode, there are those types of composition. You can forget all the details I mentioned today if you stick with NSString. NSString handles all of them transparently. If you need to process a character cluster all by yourself, you can use range of composed character at index. Range of composed character sequence at index. The method maps character index into a range of character cluster containing the index. You can process each character cluster from the beginning of the string, like the code up here.

I'd like to mention source file encodings here. Xcode supports any type of encoding, but we recommend keeping your source code asking only. For example, if you use Noasci byte in NSString literals represented by a double-quote format, the behavior is simply undefined. And often, what you get is garbled text like this in some other localization.

So if you need to embed some NoSQL characters into your source code, use one of those methods that takes one of the Unicode encoding forms. And if you have NoASCII characters, and that can be visible to users, the chances are you might want to try the localized string facility instead of trying to hard code NoASCII characters into your source code.

Some examples, if you want to embed UTF characters into your source code, you can use simple C numeric values instead of using the actual bytes inside the source code and calling the UTF-8. Another example, this example, we're using the %C format directive so that you can include a UTF-16 character, like ellipsis here, into your NSString.

I'd like to mention NS attribute string here a little. If you are looking for some string handling operation and you can't find it in NSString, the chances are you can find it in NSAttributedString instead. For example, it can provide-- Grammatical analysis like word or line break information. And also, it's capable of reading and writing a number of plain and rich text formats. So you might just skip the headache of dealing with plain text encoding issue just by using NSF attribute string.

String Handling Wrap-up. Cocoa provides string-rich string handling. And it's multilingual. It makes your application world-ready, transparently. So by sticking to NSStream, you can avoid all the pitfalls I mentioned here. So it's just us do the heavy lifting. Now I'd like to bring back Doug Davidson to talk about locale and formatus. Thank you.

Thanks, Aki. So the next set of issues I want to talk about have to do with locale. What is locale? Locale is not the same thing as localization. Localization refers to the language that your application uses to communicate with the user. The locale has to do with the user's preferences or how they want their data organized and presented. And the object that we use to identify a locale is NSLocale. And it has information about things like choice of calendar, date formats, time formats, sort orders, English versus metric measurement systems, and so forth. And you can use the class method current locale to get the user's current locale set of preferences. And this may be one of the predefined sets of formats, or the user might have customized it by choosing variations for for some of these items. And in most cases, you're not going to be using NSLocale directly. You might probably use it indirectly through a number of locale-savvy classes, like NSString, NSNumberFormatter, NSDateFormatter, and so forth.

And so let's talk about NSNumberFormatter. The most important thing for our number formatting is to use NSNumberFormatter. And NSNumberFormatter has been around for a long time, but it's only really been fully locale savvy since 10.4. So it has two different behaviors. There's an older compatibility behavior, and then there's the 10.4 behavior, which is the one you want to use. This is the default behavior for applications that are linked on Leopard, but if you're running back to Tiger, you need to explicitly specify this. And it's probably a good idea to do so. And now, in the 10.4 behavior, it still supports format strings, but better than that, the better thing to do is to use one of the enumerated styles, like decimal style, currency style, and so forth, because these are things that will fully participate in the user's choice of preferences for the format that they want their numbers displayed in.

For NSDateFormatter, it's somewhat similar. Again, it's been around a long time, but it's fully locale-savvy, starting in 10.4, and you want to use the 10.4 behavior. And again, the best thing is to use enumerated styles. And for NSDateFormatter, there are two sets of enumerated styles. There's a date style and the time style. And these have values like full, long, medium, short, or none if you want to have a date but not a time or vice versa.

And these will fully participate in the user's choice of preferences for their display of dates. We recommend against using NSCalendarDate because, again, this is an older class and has the pre-10.4 behavior. Another thing that people sometimes do that we recommend against, and that is using format strings and making them localizable strings. The problem there is that, as I said, the localization is not the same thing as locale. For example, many users around the world may use the English localization, not because it's not their first language, but it may be the second language they're most comfortable with. But they still will have their own preferences and requirements, indeed, for date format, time format. They'll have their own data, which is maybe in their own language, that they'll need handled properly.

If you need to do calendrical calculations, we have the class NSCalendar, which handles a variety of different calendars in use around the world. And you can use the current calendar method to get the current user's preferred calendar. Now, if you're using NSCalendar for calendrical calculations, it handles all these different calendars in a sort of generalized way. So you have to avoid making assumptions that might be valid only for, say, the Gregorian calendar. They don't all have the same number of days or months and so forth. The years don't necessarily start at the same point. And again, NSCalendarDate is an older class that handles only the Gregorian calendar. So I would like you to move forward from that to NSCalendar.

Sorting of strings. Now, the basic NSString compare method is a default locale independent sort order. So it's useful for programmatic sorting. If you are sorting data to present to the user, then you probably will want to use a fully locale savvy comparison method. We have some basic convenience ones, localized compare, localized case-insensitive compare. And then we have the full options, compare options, range locale, into which you can pass NSLocale, current locales, locale.

And it's important to, if you're doing this, to assume as little as possible about what these sort orderings are going to do. Because the sort orderings can vary quite a bit, not just with language, but even in a writing system. But from country to country, within the same language or writing system, different sort orders handle things like accents differently. Some of them sort them with the base character. Some of them sort them separately. Some of them will combine multiple characters for purposes of sorting. So you should assume as little as possible about what they're going to do and just let the comparison method handle comparison for sorting. So the little application I showed you just used the standard obvious compare method for sorting its strings. But these were going to be presented to the user. So what we really want is a locale-savvy sort, case insensitive in this case.

And that's the thing we prefer. So the next thing that little sample application did was to measure the width of the words in the text. And right away, it runs into a problem. So what does it do? It asks the TextView for its font. And it goes through and finds each word. And then it calls fontWidth of string string to get the width of that string. Again, simple, obvious, and wrong. The NSFont method with the string is deprecated. And I want to explain why. Most obvious is that-- given an arbitrary string, it can be an arbitrary Unicode string, and a single font is not going to be able to represent any character that might be in a string.

And in general, the Cocotext system is going to substitute in a suitable font whenever it finds a character that can't be represented in a specified font. So you can't necessarily just rely on the font to measure a particular string. But even more important from an international perspective, even if you know that the string can be represented with a given font, just adding up the width of the glyphs in a string is not sufficient to measure its width. The glyphs will often have different contextual forms depending on where they appear and what the characters around them are. They may need to be reordered depending on the writing system. They may need to be repositioned. So in general, you need to have text layout in order to properly measure text or in order to properly display it.

Now, fortunately, Cocoa has some simple convenience methods that make it very easy to measure and draw strings and attributed strings. These are what we call the string drawing methods, things like NSString's size with attributes, where you specify an attribute dictionary and get back size. And there are various methods for these for NSStrings or an attributed strings for sizing drawing at points and recs and so forth. And you can look through NSStringDrawing.h for these.

If you have needs that are more complicated than these simple convenience methods will handle, then you can go to NSLayoutManager, which gives you full control over the layout process, and I'm going to talk about that in a minute. So what should our little sample application have been doing? We can use, in this case, we can use NSString size with attribute method as a drop-in replacement for what it was doing. In this case, we create an attribute dictionary with the font that we specify. And we go through, and for each string, we call sizeWithAttributes with that attribute dictionary and get the width.

Okay, so I mentioned the use of NSLayoutManager. And this gives you a great deal more control than these simple convenience methods. With NSLayoutManager, you can get complete information about the contents of the layout, about all the glyphs and all their positioning and so forth. It also has a lot more options. You can get complete control over the process. Now, you've probably used NSLayoutManager already, just not known about it. NSLayoutManager normally is part of the hidden machinery that goes into NSTextView. But if you need it, if you need that sort of power, then you can use it on its own.

So how do you get a layout manager? Well, the simple and easiest method is if you have a text view or if you create a text view, then you automatically get all the pieces of the Cocoa Text System provided for you. And these are NSTextStorage, which is the contents of the text, the document, that is the characters and their attributes, the fonts, and so forth. NSTextContainer, which describes the region within which the text is to be laid out. And then NSLayoutManager, which is a central object manages everything and will give you all the information you need.

If you don't want to have an S text view or if you want to completely customize this, then you can create all these objects yourself and hook them up. It's a few more lines of code. Fortunately, we have some good examples, such as the circle view example, that shows exactly how to do that.

So what does NSLayoutManager do? The basic job of the Layout Manager is to take the text, that is, characters and their attributes, such as fonts, as described by an attributed string, that is, a text storage, and to convert them into a sequence of glyphs and glyph positions. What is a glyph? A glyph is the fundamental displayable element in a particular font. might be a letter or an accent or ligature combining multiple letters. So NSLayoutManager manages the process of taking the text, the characters and their attributes and converting them into glyphs and their positions. And it provides all this information to you. Now The most fundamental thing is it provides the mapping from the characters to the glyphs. And in general, this is not necessarily a one-to-one mapping.

In fact, it can be many-to-many. A single character may require multiple glyphs to represent it. Or a set of characters may be represented by a single glyph. And Aslant Manager handles all this for you. It makes sure that the mapping is always in order. but in general, it's many to many. And whenever you have a range in the text, you need to know whether you have a range of characters or a range of glyphs, and you use NSLayoutManager to convert back and forth between these.

And as I said, NSLayoutManager also keeps track of the positions of these glyphs as they're laid out in the text. And it will tell you, for example, the size and position of any particular line in the text. and it will tell you the position of any particular glyph within that line.

And the thing to remember here from an international perspective is that the ordering of glyphs within a line does not necessarily bear any particular relationship to the ordering of the characters that generated them. So some writing systems write from left to right. Some of them write from right to left. In some cases, you will have text that combines both of these in a single line. So they start out from one end and then turn around and go the other direction.

So when you inquire of NSLayoutManager about the positions of glyphs, you just need to make sure that you do not assume that the positions of these glyphs are in any particular order based on the order and the characters that generated them. So here's an example. At the top, we have the characters. We start off with some English, and then we include some Arabic. And the glyphs down at the bottom-- let me point out a couple things about these. First of all, right here, we have an FI ligature glyph. That is two characters, F and I, represented by a single glyph. So right here, even in English, we have an example of a situation where the mapping between characters and glyphs is not going to be one-to-one.

Then when we go over to the Arabic, what happens is that these Arabic characters at the top are in logical order. In the order they appear in the text, but then when they're going to be displayed, their order will be from right to left within this line. So the positional ordering of the glyphs is different from the logical ordering of the characters. And another thing to notice is that these glyph forms, as they appear in context here, are quite a bit different from the context-independent appearance that I've shown up at the top. So these glyphs change shape depending on where they appear in a word.

NSLayoutMeasure also provides some higher level information. Things like you can go from ranges of characters and glyphs in the text to rects within the layout. And here again, the thing to remember is that even within a single line, within a single range of glyphs, because they can be reordered, you may end up with multiple rects that you need to represent them.

NSLayout Manager provides information about insertion points, potential insertion points in the text, and it can go the other direction. If you're given a location in the layout, it can go back and determine what glyph that corresponds to, or if you're given a rect in the layout, it can go back and determine the glyph range that it encompasses. So here is a brief example of how we might have used that if we wanted to measure the width of certain pieces of text. In this case, we have a text view, suppose, and from the text view we get the layout manager and the text container. And then if we've determined a particular range of text that we want the width for, then we can go through and call this method rect_array for character range and so forth and so forth, and it returns a C array of rects. And then, as I said, there may be multiple rects even needed to represent one contiguous range in the text, so we go through and add up the widths of all these rects to get the total width of this portion of the text.

So points I want you to take away. You should let the Cocoa framers do as much as possible for you handling internationalization and international issues. And you need to avoid any of the problematic assumptions, things like assuming that a single keystroke corresponds to a single character, or that a single character corresponds to a single glyph, or that the locale is equivalent to the localization. or that all the glyphs in a line are ordered from left to right. Use Unicode. Use NSString as much as possible to handle the heavy lifting for you. Use the date and number formatters for user-visible dates and numbers, and use the 10.4 behavior for these.