Text Processing in Cocoa - WWDC 2009

Mac • 1:00:07

As native applications receive more and more content from the Internet, text processing gains more importance. Snow Leopard adds several new Cocoa APIs and features to process text, performing most of the heavy lifting for you. Learn how these new and existing APIs help you to manage URLs; format dates, times, and addresses; read and write rich and plain text documents; and detect spelling mistakes.

Speakers: Doug Davidson, Aki Inoue

Unlisted on Apple Developer site

Transcript

All right. Good morning, everyone. I hope you all had a good time at the bash last night, but not too good a time, because I want you all to be alert and awake for our presentation this morning. We're going to be talking about text processing in Cocoa. Now, in previous years, we spent quite a bit of time talking about display of text and the editing of text and various UI aspects of text.

This time around, we're going to be talking about lower level non-UI processing of text, and we're going to do this the Cocoa way, which means thoroughly Unicode-based with NSString as the fundamental object. And I mean a little bit more by that than just that you're going to be using NSString. I mean, we always say that conceptually, an NSString is a sequence of Unicode characters.

But what we want you to do is to deal with strings and substrings and with ranges of strings, and not with the individual characters. Because time after time in a Unicode context, it happens that where you might expect to be dealing with a single character, in certain circumstances you're going to have to deal potentially with more than one.

And we're going to talk about all the framework APIs that make it really easy and quick to do this. I've divided the talk into 3 parts. In the first part, I'll talk about getting the text into and out of your application. In the second part, we'll talk about processing it once it's in your application.

And in the third part, we're going to talk about some new functionality in Snow Leopard. Now, I should mention that except for that last part, a lot of this talk is applicable to both Mac OS X and the iPhone OS, except for a few methods that are maybe specific to Snow Leopard or specific to the AppKit, and we'll try to call those out from time to time as we run across them. So first part. Now, before your text gets in to your NSString, it's probably stored somewhere else, and after you deal with it and process it, you may want to store it off some place else more permanent.

So we'll talk about how to do that in Cocoa. Cocoa philosophy is to make simple things simple. So in keeping with that, in the very simplest situations, for example where you want to read an entire plain text file with a known encoding or write an entire plain text file with a known encoding, we make that very simple: just one line of code, one method on NSString to do either one or the other.

All right, that's very straightforward. But at this point, I suppose I should say something about encodings. Your string is conceptually a sequence of Unicode characters, and in order to store it in some persistent format, it needs to be converted into a very concrete sequence of bytes. The recipe for doing that is an encoding. Now, the situation with regard to encodings now is much better than it has been historically. We have the universal encoding, it's called UTF-8. You should use it wherever possible.

Occasionally, you may wish to use UTF-16 with the byte order mark to specify the endianness, maybe UTF-32, but probably not. But these are all Unicode encodings. They can all represent any string that you're going to run across. They're all good. Unfortunately, there is still quite a bit of legacy data in other encodings that you may need to import, and NSString will take care of all these for you: just specify the encoding and it will convert it. Of course, you should know the encoding. That should be part of your specification, and I'll talk a little bit later about what to do if you don't know the encoding.
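The idea of an encoding as a recipe from characters to bytes can be illustrated outside Cocoa. The following is a minimal Python sketch, not the NSString API from the slides; the helper name `can_encode` is made up for illustration. It shows one string producing different bytes under different Unicode encodings, and a legacy encoding failing on text it cannot represent.

```python
text = "né"  # two Unicode characters: "n" and "é"

# The same string becomes a different byte sequence under each encoding.
utf8_bytes = text.encode("utf-8")    # "é" takes two bytes in UTF-8
utf16_bytes = text.encode("utf-16")  # Python's "utf-16" codec prepends a byte order mark
utf32_bytes = text.encode("utf-32")  # so does its "utf-32" codec


def can_encode(string, encoding):
    """Return True if `encoding` can represent every character in `string`."""
    try:
        string.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Unicode encodings can represent any string; a legacy encoding cannot.
print(can_encode("Привет", "utf-8"))   # True
print(can_encode("Привет", "cp1252"))  # False
```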

But one thing the previous NSString writing method does as part of encoding handling is that when you write out a plain text file, it will add an extended attribute to it that specifies the encoding that you used to write it out. And so there's another method on NSString that you can use to read back one of those files, and it will take the encoding from that extended attribute that was written to it. Now, one thing we always ask in Cocoa text processing is: what is it that TextEdit does? Now, TextEdit doesn't actually use this method, although it does use this functionality.

What TextEdit uses is a similar method, actually on NSAttributedString, that you find in the AppKit. And this does what that other method does, but it does a whole lot more. So this attributed string method can read not only plain text files, but also rich text documents in a wide variety of formats. Again, very simply, a single line of code. All right, this is written in terms of attributed strings rather than strings, but it's really simple to convert back and forth between strings and attributed strings. An attributed string has an underlying string you can get quickly.

It's easy to create an attributed string from a string and so on. One possible complicating factor is that this method involves 2 dictionaries. There is one dictionary that you pass in, that's the options dictionary, and then there's one that you get back out, that's the attributes dictionary. So with the options dictionary you can pass in things like what document format you think the document is in and what encoding you think it is in. And then the attributes that you get back out tell you what format it was actually in and what encoding it was actually read as.

And similarly, you can write using attributed strings, and the most general way to do this is with this attributed string method that produces a file wrapper. Well, NSFileWrapper is a great class, and it's much improved in Snow Leopard. You probably heard about it at the What's New in Cocoa talk. But the reason we're using it here is because the file wrapper can represent not only a simple single file, but also a document that consists of more than one file, a file package for example. Also, a file wrapper can now represent that extended attribute that we mentioned before.

So writing it out like this, getting a fileWrapper and then writing out the fileWrapper can support the extended attribute that I mentioned before. Now, if you're not dealing with a single file or a single plain text document, or you get the text from somewhere else, there are ways to handle that too. We have parallel methods that do all the same sorts of things with the text data as data, or maybe it's just a sequence of bytes. We have similar methods for reading and also very similar methods for writing.

They do exactly the same thing. Now at this point, I want to say something about the case where you don't know the encoding. Now, this is an unfortunate situation, but we do run into it from time to time. You know, perhaps the specification for the format you're working with was written down on the back of a napkin in 1989, it says nothing whatsoever about encodings, your Russian users are getting angry, it happens. There are a couple of things I want to say about this. The first is that if you have some text data and you don't know what encoding it's in, you have a significant problem.

There are ways of trying to infer the encoding from the data. But they're not perfect, they can all fail. It may be necessary to provide a means for the user to go back and try again with a different encoding in case the first one produced some nonsense. It's difficult to do this automatically because in some cases, the difference between sense and nonsense is a little subtle.

Also, there are some documents that are nonsense to start with, no matter what encoding you use. The other thing I want to say is that this is very much domain dependent. Let me give you an example. Consider XHTML, that's a way of writing HTML as a dialect of XML. Because it's XML, it has a default encoding. The default encoding for XML, if none is specified, is UTF-8.

That's the right answer. Then consider HTML. It's very similar, but HTML is not XML. In fact, HTML has no default encoding. If you don't specify an encoding for HTML, a browser is likely to interpret it maybe as Windows code page 1252, unless maybe you're in Russia, or China, or Japan, maybe something else there. It's kind of a mess.

There's in fact a discussion about this going on right now with regard to HTML 5. People would like to try to improve this situation, but I think there's a general feeling that it may not be possible for compatibility reasons. So I think we had enough slides for now.

I'd like to go over to the demo machine and take a look at some code. So I wrote a little sample here. It's not yet available on the developer site. I hope it will be in the next few weeks. And what I tried to do in this was go to a situation where maybe things aren't so simple. In fact, maybe you've got about as little as possible.

So say you have nothing but an NSFileHandle. Maybe you know an encoding, maybe you don't, and you want to read in plain text from it. But you don't want to read in the whole document at once, you only want to read in little chunks at a time.

So I wrote a little class to do that, a file reader here. You create it with a file handle. You can pass an encoding if you have one, and zero if you don't, and then it passes back the text a paragraph at a time. A little note on terminology: traditionally in text processing, people often talk about lines of text.

Well, in the Cocoa text system, we're often dealing with text layout. In the context of layout, a line of text means something different. So we usually talk about paragraphs, where a paragraph is unambiguously a chunk of text separated by some paragraph terminator. And in Cocoa, we do this generally.

We don't deal just with UNIX terminators and Mac terminators and Windows terminators, but all of those, and also the Unicode paragraph separator as well. And so what this class does is just to return the text to you a paragraph at a time until it runs out of text or until it hits an error.

So how does that work? Let's take a look at the code. So we have, first of all, a little method here that reads the chunk of data, reads in, just reads in a chunk of data from the file handle 512 bytes at a time. And then, the next paragraph method just calls that repeatedly and then goes through the data looking for a suitable paragraph separator.

Fortunately, these paragraph separators are fairly easy to recognize in almost any encoding that you are likely to encounter. And they also make very natural points at which to split the data and try to convert it to a string. And so once we've encountered one, then we use one of those methods that I exhibited before to take the bytes of the paragraph and convert them into a string and then pass it back. It's pretty simple.
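The sample code shown on stage is not part of this transcript. As a rough illustration of the same approach, here is a minimal Python sketch (not the Cocoa sample): it reads 512-byte chunks, looks for a paragraph terminator in the raw bytes, and only then decodes the bytes of that paragraph. The function name `paragraphs` is hypothetical, and the terminator list assumes an ASCII-compatible encoding such as UTF-8.

```python
import io

CHUNK_SIZE = 512  # the talk's sample also reads 512 bytes at a time

# Terminators as they appear in ASCII-compatible encodings such as UTF-8.
# b"\r\n" comes first so that it wins over a bare b"\r" at the same position.
TERMINATORS = (b"\r\n", b"\n", b"\r", b"\xe2\x80\xa9")  # last one: U+2029 in UTF-8


def paragraphs(stream, encoding="utf-8"):
    """Yield decoded paragraphs (without terminators) from a binary stream."""
    buffer = b""
    at_eof = False
    while True:
        # Find the earliest terminator currently in the buffer.
        hits = [(buffer.find(t), t) for t in TERMINATORS if t in buffer]
        if hits:
            pos, term = min(hits, key=lambda hit: hit[0])
            # A trailing "\r" may be the first half of "\r\n": read more first.
            ambiguous = term == b"\r" and pos == len(buffer) - 1 and not at_eof
            if not ambiguous:
                yield buffer[:pos].decode(encoding)
                buffer = buffer[pos + len(term):]
                continue
        if at_eof:
            if buffer:
                yield buffer.decode(encoding)  # final paragraph, no terminator
            return
        chunk = stream.read(CHUNK_SIZE)
        if chunk:
            buffer += chunk
        else:
            at_eof = True


sample = io.BytesIO("one\r\ntwo\nthree\rfour\u2029five".encode("utf-8"))
print(list(paragraphs(sample)))  # ['one', 'two', 'three', 'four', 'five']
```

Splitting on the raw bytes before decoding works here because, in UTF-8, the terminator bytes never occur inside another character's multi-byte sequence.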

So then the question is what do we do if the encoding is not known? Well, what this will do is take the first chunk, the first 512 bytes, and pass it over to this function I wrote called guess encoding. Now, let's take a look at that. As I mentioned, determining the encoding is very much domain specific. So this is not intended to be a universal solution. I don't think there is a universal solution.

This is just exhibiting some techniques you can try. So the first technique is to look for a Unicode byte order mark of various kinds, which pretty much unambiguously identifies various Unicode encodings. OK, if we have that, we're good. Usually we don't, so here's another technique. That is to see if the start of the file looks like it's XML. If it does, we can scan through it looking for an XML-style encoding declaration and use that. That's one possibility.

Or maybe the file looks like it's HTML. There's so much different HTML, it's kind of hard to recognize. But if it is, then maybe we scan through it looking for an HTML style char set declaration and use that. Now if you don't have that, well, then what can we do? Well, we can try some encodings.

UTF-8 is a good one to try first because it's very distinctive. It's unlikely that text in some other encoding will be properly recognized as UTF-8. If not that, well, depending on what kind of document it is, we might fall back to something else, maybe a Windows encoding. And as a very final fallback, we'll try Mac Roman, because any sequence of bytes can be interpreted in Mac Roman, although the results may not be terribly intelligible. Now, we'll see this in action a little later on.
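The sequence of fallbacks described here can be sketched as follows. This is a Python analogue under stated assumptions, not the function from the sample: the name `guess_encoding`, the regular expressions, and the choice of `cp1252` as the intermediate fallback are all illustrative, and as the talk stresses, a real heuristic is domain specific.

```python
import codecs
import re

# Byte order marks, UTF-32 first so its LE mark is not mistaken for UTF-16 LE.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)
XML_DECL = re.compile(rb"""<\?xml[^>]*encoding\s*=\s*["']([A-Za-z0-9._-]+)["']""")
HTML_CHARSET = re.compile(rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9._-]+)""", re.I)


def guess_encoding(head, is_whole_file=False, fallback="cp1252"):
    """Guess a Python codec name from the first bytes of a text file."""
    # 1. A byte order mark identifies a Unicode encoding.
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    # 2. An XML-style or HTML-style declaration names the encoding.
    for pattern in (XML_DECL, HTML_CHARSET):
        match = pattern.search(head)
        if match:
            try:
                return codecs.lookup(match.group(1).decode("ascii")).name
            except LookupError:
                pass  # unknown name: keep trying the other techniques
    # 3. UTF-8 is distinctive; other encodings rarely validate as UTF-8.
    #    Unless this is the whole file, tolerate a sequence cut off at the end.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(head, final=is_whole_file)
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 4. A domain-dependent fallback, then Mac Roman, which accepts any bytes.
    try:
        head.decode(fallback)
        return fallback
    except UnicodeDecodeError:
        return "mac_roman"
```

Because the guess can be wrong, the caller would still want a way for the user to retry with a different encoding, as discussed earlier.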

But let's go back to the slides. And now, I'd like to turn the stage over to my colleague, Aki Inoue, who will talk about processing text once you have it.

[ Applause ]

Thank you, Doug. Good morning. Hi. I'm Aki Inoue, Cocoa Software Engineer. In this part of the session, I'd like to discuss string contents handling, especially focusing on tips and tricks that you may find useful. Also, after this session, I hope the information presented here will help you build confidence in manipulating NSStrings efficiently and properly.

Much of the information presented here is inspired by questions and comments I've seen on various developer forums and mailing lists. These are the classes I'm covering today, NSString and the two formatter classes. Let's review the basics of NSString first. NSString is the standard string object used everywhere, on both Mac OS X and iPhone. The APIs are full-fledged, covering everything from string encoding conversion, as you saw, to collation, and we tried very hard to keep the API simple.

Yet, we try to retain the full string handling power at your disposal. Finally, it encapsulates Unicode. Now Unicode, that's one of the keywords I've seen many times accompanying the questions about string handling. What is Unicode then? It is the universal character set standard. It is adopted by many technology leaders and standards.

It is right now at version 5.1 and covers more than 100,000 encoded characters. The printed document is about this thick. It contains 1,472 pages, with dozens of auxiliary specifications like the Unicode Bidirectional Algorithm or the Common Locale Data Repository, making it one of the largest technology standards today. But our design goal for NSString is to insulate applications from the growing details of this gigantic standard. So by following the few simple rules I'm going to be discussing, your application can let NSString do its own magic. I divided my presentation into 4 subtasks.

First, let's take a look at searching and comparing. These are the questions I've seen. How do you sort like Apple applications? Or how do you satisfy your customers on the other side of the world? Or simply, how do you sort strings numerically? When you try to answer these questions and consider searching or comparing string values manually, please take a look first at higher-level APIs: the Find panel, Spotlight, Core Data, and other data management frameworks in the system that have built-in facilities for sorting, searching, and filtering. They provide higher abstraction, excuse me, and possibly can protect you from any future OS changes. One of the questions concerning comparing and sorting is localized sorting.

NSString has 2 sets of APIs. The first group is compare: and its variants, taking more arguments for specifying custom behavior. The other set is methods taking one argument, the other string to compare, with preselected behavior that is pretty self-explanatory from the signatures. So you might think that when you want localized sorting, localizedCaseInsensitiveCompare: is probably the method you want to use. But we recommend using the most generic one, compare:options:range:locale:. Let's take a look.

This is a sample string compare function. It takes 2 string arguments and returns the result, pretty simple. The first step is to identify the locale you want to use. Without a locale specified, NSString does generic, non-localized collation. So obviously, when you're doing localized sorting, you want to specify a locale. And generally, if you want to follow the user setting, you want to specify the current locale like this.

After that, the next step is to specify a series of options. This is the set of options we would want you to follow. Let's take a look at each of the options. Case insensitive search: it maps a simple one-to-one case like this, or a complex case like ligatures that have different lengths, or it can support locale-sensitive mapping like this too.

Next one, numeric search. By default, NSString treats all the digits just as other letters. By specifying this option, the numbers are sorted according to their numeric values. Diacritic insensitivity: by specifying this option, NSString ignores the difference in accents like those. But it's smart enough to know not to ignore accents that have semantic significance, like this. And width insensitivity: in Unicode, there are many variant characters, some of those just simply wider or narrower versions.

For example, there is a full copy of the ASCII character set that's just wider. And by specifying this option, those variations of the base characters are treated the same. Finally, forced ordering. This option could be pretty important if you could have multiple instances of the same word. For example, with options like case insensitivity, those visually different strings are considered the same. Very useful, but the comparison results for those strings are always the same.

So the sorting result could be random, because A, B, C presented here are all the same to NSString. By specifying the NSForcedOrderingSearch option, after NSString determines that strings are equal due to one of the insensitivity options, it will try to get the forced ordering. So this comparison becomes ordered and the result is always stable.

So after specifying those options, you can use the compare:options:range:locale: method to return the result. And we provide convenience methods like sortedArrayUsingFunction, like this, so that you can have a nicely sorted array of strings. But new in Snow Leopard, you can forget all about the options. We now have localizedStandardCompare:, designed specifically for localized sorting, and it packages all the options. By adopting it, even when we provide new enhancements in the OS, your applications get those enhancements for free. Next, filtering. How do you do prefix matching?
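Two of the options above, numeric search and forced ordering, can be illustrated with a small Python sort key. This is only an analogue of the behavior described, not NSString's collation: it is not locale-aware, it ignores diacritic and width options, and the name `sort_key` is made up. Digit runs compare by numeric value, case is folded, and the original string is used as a tie-break so that strings that compare equal after folding still sort in a stable, repeatable order.

```python
import re

_DIGIT_RUNS = re.compile(r"(\d+)")


def sort_key(text):
    """Case-insensitive, numeric-aware key with a deterministic tie-break."""
    # re.split with a capture group alternates non-digit and digit runs, so
    # digit runs always land at odd positions and are compared as integers.
    parts = _DIGIT_RUNS.split(text.casefold())
    loose = tuple(int(p) if i % 2 else p for i, p in enumerate(parts))
    # Tie-break on the original string (the "forced ordering" idea): strings
    # that are equal after folding no longer depend on their input order.
    return (loose, text)


names = ["file10", "File2", "file2", "file1"]
print(sorted(names, key=sort_key))  # ['file1', 'File2', 'file2', 'file10']
```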

Pretty simple question. You tend to think it's a simple task, and that just limiting the number of characters compared should suffice in this case, but that's not recommended. Being equal doesn't necessarily mean the lengths are also the same, because case insensitivity or different compositions of accents could result in strings of different lengths being matched. So in fact, prefix matching is not a comparing operation.

It's a searching operation of a special kind. By specifying the NSAnchoredSearch option, the search can be pinned to the beginning of the string, and with that option you can use the search method, rangeOfString:options:range:locale:, to do the prefix matching. If there is a match, you know the prefix matched. Of course, NSString has a nice method for prefix matching. So whatever you need to do, you can use the methods provided by NSString. Next, parsing.
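To see why truncating to the prefix length is not enough, here is a minimal Python sketch of anchored, case-insensitive prefix matching. It is an illustration of the pitfall, not NSString's algorithm: the names `_fold` and `matched_prefix_length` are hypothetical, and the folding (Unicode case folding plus NFD normalization) is a simplification of what a locale-aware search does. The function returns how many characters of the text the prefix actually covers, which can differ from the length of the prefix.

```python
import unicodedata


def _fold(string):
    """Case-fold and decompose, so equivalent spellings compare equal."""
    return unicodedata.normalize("NFD", string.casefold())


def matched_prefix_length(text, prefix):
    """Return the length of the part of `text` matched by `prefix`, or None."""
    target = _fold(prefix)
    # Anchored search: only candidates that start at index 0 are considered.
    for end in range(len(text) + 1):
        if _fold(text[:end]) == target:
            return end
    return None


# "ß" folds to "ss": a 7-character prefix matches 6 characters of the text.
print(matched_prefix_length("Straße 12", "STRASSE"))  # 6
# A precomposed "é" in the prefix matches "e" + combining accent in the text.
print(matched_prefix_length("e\u0301cole", "\u00e9c"))  # 3
```

Comparing only the first `len(prefix)` characters would miss both matches. Note that this simplified version can also stop in the middle of a grapheme cluster, which a real implementation would avoid.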

NSString contents are a black box. Yeah. And you might feel, you know, system processing is more efficient, or you might have to parse some other data. In text data processing, there are roughly 2 kinds of data processing you might encounter, structured and unstructured. Structured text data processing is mainly processing data designed for the machine to process, and the other kind, unstructured, is essentially natural language documents from the user. These are probably the most commonly used structured text data today: HTML, XML, and so on and so forth. But we provide nice higher-level APIs such as WebKit to process HTML, or NSXMLParser to process XML.

But, you know, there is still other structured text data, like HTML/XML fragments or some legacy file formats. And you might have to actually parse user input. But if you want to try parsing user input by yourself, first consider controls such as NSTokenField, the date picker, or fields with formatters, so that you can have the system process it for you.

And if you still want to parse structured data, NSScanner and some of the NSString methods, such as componentsSeparatedByString:, can help you there. This is sample code that tokenizes search strings entered in the search field. Simply by splitting on whitespace, you can have tokens of search words.

Since a search field usually takes structured text data, that is, the search string, splitting on whitespace is usually safe for you. Now, you want to process natural language in your document. There is one golden rule. That is to identify the appropriate basic text processing unit for your task. Those are grapheme clusters, words, and paragraphs, and try to never, never use per-character processing.

Let's take a look at a grapheme cluster. This is the smallest unit of NSString you want to process. You can get the range of the grapheme cluster at a specified index using rangeOfComposedCharacterSequenceAtIndex:, and this is an example of a grapheme cluster, a composed character sequence. An accented e could be represented as a base e plus an accent, two NSString characters.

If you stick to grapheme-cluster-based processing, you get much better support. Characters outside the 16-bit range are usually represented by 2 NSString characters, like this. This is the standard idiom in NSString processing. Identify the range you want to process.

Process the range and go to the next range. In this case, we can use rangeOfComposedCharacterSequenceAtIndex to identify the current grapheme cluster range. Words, words are often the most appropriate unit you can use for tasks such as manipulations or transformation like case mapping.
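The idiom (find the range of the current unit, process it, move to the next range) can be sketched in Python. This is a deliberately simplified stand-in for rangeOfComposedCharacterSequenceAtIndex:, not an equivalent: it only keeps a base character together with the combining marks that follow it, whereas real grapheme cluster rules cover more cases. The names `cluster_ranges` and `space_out` are hypothetical. Python strings are indexed by code point, so the surrogate pair case does not arise here.

```python
import unicodedata


def cluster_ranges(text):
    """Yield (start, end) for each base character plus trailing combining marks."""
    start = 0
    while start < len(text):
        end = start + 1
        # Extend the range over any combining marks attached to the base.
        while end < len(text) and unicodedata.combining(text[end]):
            end += 1
        yield start, end
        start = end  # go to the next range


def space_out(text):
    """Insert a space between clusters rather than between characters."""
    return " ".join(text[start:end] for start, end in cluster_ranges(text))


word = "e\u0301te\u0301"  # "été" written with combining acute accents
print(list(cluster_ranges(word)))  # [(0, 2), (2, 3), (3, 5)]
```

Joining per character (`" ".join(word)`) would detach each accent from its base letter, while `space_out(word)` keeps them together.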

You want uppercase, lowercase, capitalized, or spell checking. And you can query the range of a word using NSAttributedString's doubleClickAtIndex:. Naturally, double-click is the user action that selects a word, so that's where the method name comes from. But be aware, whitespace is not the only way to mark a word boundary. For example, with Chinese and Japanese, there might not be whitespace at all.

And the notion of a word is pretty much locale and context sensitive. Again, the same idiom: use doubleClickAtIndex: and do the same range-based processing. Paragraphs: as Doug mentioned before, you should probably consider using a paragraph as the maximum unit instead of blasting through the entire document at once.

It is important especially for languages like Arabic and Hebrew that have multiple directionality. Each paragraph has its own direction. So if you try to process multiple paragraphs with mixed directions together, for example if you concatenate those paragraphs, you might lose the original semantics of the user's document.

[ Pause ]

This is the same idiom for paragraphs. New in Snow Leopard, we're introducing a block-based API for enumerating these basic text units, enumerateSubstringsInRange:options:usingBlock:. It can enumerate the 3 basic units, as well as lines and sentences. We provide a line-specific convenience method, enumerateLinesUsingBlock:, and we also added an NSAttributedString method for enumerating attribute runs. So this code can become this. Let's take a look at a sample in a demo.

[ Pause ]

OK, what is this application? It copies the contents of the upper text view to the text view in the bottom and inserts spaces after each character, and the result is pretty disastrous. Accents are detached, ligatures corrupt, and some characters are flat out garbled and unreadable. By processing based on grapheme clusters, the visual units are retained.

So take a look at, let's take a look at the code--

[ Pause ]

Processing the cluster. You get the range and specify the options. In this case, because we're mutating the contents of the text, we are processing from the back. So as you can see, the enumeration method can work with mutation operations. Let's go back to the slide please.
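Why processing from the back makes mutation safe can be shown with a short Python sketch. This is an illustration of the general technique, not the demo's code; the function name `insert_after_ranges` and the use of a list of characters as the mutable text are assumptions.

```python
def insert_after_ranges(chars, ranges, filler=" "):
    """Insert `filler` after each (start, end) range, editing `chars` in place."""
    # Walk the ranges from the back: an insertion only shifts positions that
    # come after it, so ranges still to be visited keep their original indexes.
    for _start, end in reversed(ranges):
        chars.insert(end, filler)


chars = list("abc")
insert_after_ranges(chars, [(0, 1), (1, 2), (2, 3)])
print("".join(chars))  # "a b c " (a space after each range)
```

Walking forward instead would require adjusting every later range by the number of characters already inserted.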

Slide, please. Next, I'm going to be talking about transforming. How do you sprintf with NSString? Or, NSCalendarDate is deprecated, so how do you format your dates? NSString has many transformation APIs, from case mapping to normalization. But today, I'd like to focus on formatting text using NSDateFormatter and NSNumberFormatter.

First, number formatting. With the dozens of locales we support and that users demand, it's totally impossible to support all of the variations using traditional printf-style formatting. This small sign could be different, or the digits may not be the ones you're familiar with. NSNumberFormatter provides convenient access to common formats like percentages, monetary units, and so on and so forth. New in Snow Leopard, it gets more convenient. We added a class method for formatting an NSNumber into a formatted string, localizedStringFromNumber:numberStyle:. You can get the formatted string without even creating an NSNumberFormatter instance.

Let's take a look. [Applause] Let's take a look at date formatting now. It gets more interesting. We deprecated NSCalendarDate, changed the default formatting behavior, and we've been recommending that you use the 4 predefined formats: short, medium, long, and full. That recommendation is there to protect applications from the sheer complexity of supporting gazillions of locale and calendar variations.

And we're adding more locales and calendars each release. And the user could have a non-Gregorian calendar now. For example, the year might not be 2009 for him or her, or the number of months might not even be 12 per year. So it's important to stick to the 4 basic formats.

Just like the number formatter, NSDateFormatter has a class method to get the formatted string easily, localizedStringFromDate:dateStyle:timeStyle:. And as another enhancement, we now support localized relative date formatting. For example, if today is June 12, 2009, instead of displaying the formatted date string, with this option the formatter can return today or yesterday, just like the received date column in Mail.

[Applause] Yet, you still want to do custom formatting, right? But we deprecated NSCalendarDate. So what do we do? Starting with Snow Leopard, we are adding custom date formatting. With dateFormatFromTemplate:options:locale:, you specify a generic formatting template and the method returns a customized format suitable for the locale, so that you can safely use custom formatting with "exotic" locale and calendar combinations.

And let's take a look at this code here. In this example, we're using MMM/Y as a template format.

This should show the abbreviated month name and the year. In the US format, it should display June 2009. You can pass the template format and a locale. In this example we're using the traditional Chinese locale, and it should return a transformed format suitable for traditional Chinese, and it should contain the right Chinese character suffixes for you. You pass the returned custom format to the NSDateFormatter and you're ready to go. Let's take a look at the power.

[ Pause ]

So this application shows the 4 basic formats, and the user can change the formatting like this and specify any locale and calendar combination. These are just a small portion of the locales and calendars we support in the OS. Now, you know, you want to stick to the 4 combinations here.

With Snow Leopard, by specifying the relative date option, it can display today, yesterday, tomorrow, and with other localizations it works properly. Next, you want to display a custom format. The title here shows the custom format used by the field next to it. You want to display the day of the week but use the shorter date formatting, or you just want to show the day of the week and the month and date. OK, but those custom formats are usually specific to one locale.

You might have to specify multiple formatting strings depending on the user's locale. For example, if you select the French locale, the formatting is not good anymore because the month should come after the day. But since we are specifying a locale-specific custom format, we cannot do it. Using the template, it transforms the original custom format template to the one suitable for the locale. For example, if you specify Chinese, it contains the right Chinese character suffixes. Or specify Japanese and the Japanese calendar, and it correctly adds the era required for the calendar. [Applause] Now, the power of the template can also be used to configure the user interface.

One example is you want to show your credit card's expiration date, and list the name of the month and year. But if you select French, that's fine. But for example, Japanese, usually you should display the year before the month and have the Chinese character suffix after year, but you cannot. By using a template, you can know the final configuration of the formats, so that you can modify your user interface accordingly.

Let's go back to that slide please. Finally, I'd like to touch on performance. I tried an NSString method and it has a performance problem. What should we do next? Or do we want to really, really touch the contents of an NSString? The first step is always to sample the hotspot. With Shark, Instruments, and other performance tools, you should be able to identify what's taking time.

You will be surprised that most of the performance problems with string operations come from memory allocation or deallocation. It may be spending time allocating autoreleased objects in a tight loop, or simply deallocating objects created inside an autorelease pool. So by using the NSMutableString API, you can try to avoid creating these temporary objects.

Or if you want to transform many strings, you can take a look at the CFMutableString API if you want to do in-place transformation. Or you can pre-allocate the initial buffer yourself and initialize the NSString at the end using the no-copy version of the initializer. Let's review what I've told you. Unicode is not a mystic monster, you can tame him.

And it is easy to do by following the simple rules. You can use the framework API, and by knowing the basic units, grapheme clusters, words, and paragraphs, you can be confident you are processing the text properly. And finally, by knowing the performance semantics of mutable and immutable methods, you should be able to get the best performance out of the NSString API. And now, I'm giving the stage back to Doug, who's going to be discussing brand new exciting features.

Thank you.

[ Applause ]

Thank you, Aki.

[ Applause ]

So what I'm going to talk about now is something that's new in Snow Leopard that we call text checking. What we've done here is taken some existing functionality, things like spell checking and smart quotes and smart links, and we added some new functionality.

Things like spelling correction, data detectors, automatic text replacement, smart dashes, and made them all available in one easy to use set of APIs. And when I say easy to use, I mean really easy to use. If you're just using a TextView or WebView, it's simply a matter of turning on whichever ones of these you want to have active.

And I'd like to demo that.

[ Pause ]

So I'm going to demo that in TextEdit. And what we did in TextEdit was to just add various preferences to turn all these things on or off, and on this machine I've turned everything on. So we can see it all in action.

So for example, if we have an address, data detectors will identify it and give us some interesting options. We can click on it to create a contact or show a map or what have you. When I'm typing, it will do things like simple spelling correction for me, and I've added some shortcut substitutions for phrases I frequently use, and you have smart dashes and smart links. And actually, I have also set this up so that it will automatically identify what language I'm typing in.

[Applause] So it's really very simple. Let's go back to the slides. Well, I'm not really here to talk about the UI level aspects of this.

What I want to do here primarily is to talk about the underlying APIs for this that you can use to get similar functionality and do whatever you want with it. So what we've done is to make these sorts of checking available with one very simple API on a spell checker.

Well, OK, maybe this doesn't look so simple. Let's go through it. So the first thing you do is you pass in a string, OK? The string you want to check, that's pretty obvious. And you pass in a range in the string that you want to look at, that's fairly obvious as well, and the types.

Well, that's just a bitwise OR of any of the types from that list of items that you want to have used. And then options, well, you can usually ignore that, just pass in nil. The options allow for advanced things, like if you want to override the user's default for which kind of smart quotes to use. SpellDocumentWithTag, this is something that the spell checker has used all along. It uses a tag to identify text that belongs to the same document for things like ignored words.
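The bitwise OR of checking types can be sketched in a few lines. This is a Python illustration only: the `CheckingType` flag names and the `wants` helper are hypothetical stand-ins, not the Cocoa constants.

```python
from enum import IntFlag, auto


class CheckingType(IntFlag):
    """Hypothetical checking-type flags; each one occupies its own bit."""
    SPELLING = auto()
    CORRECTION = auto()
    LINK = auto()
    DATE = auto()
    ADDRESS = auto()


def wants(requested: CheckingType, kind: CheckingType) -> bool:
    """Return True if `kind` was included in the requested mask."""
    return bool(requested & kind)


# Ask for dates and addresses only by OR-ing the two flags together.
requested = CheckingType.DATE | CheckingType.ADDRESS
```

Because each type is a separate bit, one integer argument can carry any combination of types, and the checker can test for each with a bitwise AND.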

If you don't have a spell checking document, just pass in 0. And then there are a couple of out-parameters. Orthography, which I'll discuss later on, allows it to identify the language of the text as a whole. And the wordCount, it will give back a count of the number of words in the text.

If you don't want either of those, just pass in nil. And then what do you get back? You get back an array of text checking results that you can go through and analyze and do whatever you want with. I'll discuss that in a minute. First, what I want to say is that this is the simple version, the synchronous version of this API. And we have an advanced version which does exactly the same thing but asynchronously. So it does the same thing, it just does it in the background.

And it doesn't return the results as a return value. Instead, you get them in a block that you supply; they're passed in as parameters of that block. Now, doing things in the background, you're probably concerned about thread safety. So you don't have to worry about the thread safety of the text checking operation itself. It will copy out whatever it needs before it does that. The block that you passed in, you should be aware, is being run using an operation queue.

So it's not going to be run in any particular context. If you need the work to be done on some particular thread or queue, then you need to pass it over to that thread or queue from within the block. Also, alternatively, if you need some particular form of locking for whatever it is you're doing in your block, then you have to provide that yourself, of course.

The other thing you might need to keep in mind is that if the text you're working on is mutable, it might have changed between the time that you sent it out for text checking and the time you get the results back. So the results might be out of date, and there is a sequence number to help you keep track of that. So what are these results that you get back? They are instances of a new class, NSTextCheckingResult. You get an array of these, and each one just represents some interesting item that text checking has found in your text.
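The background-checking pattern described above (results delivered to a callback on some worker thread, handed over to a queue you own, and stale results dropped by sequence number) can be sketched as follows. This is a Python model, not the Cocoa API: `BackgroundChecker`, `Document`, and `find_long_words` are hypothetical, and how the sequence number is issued here is an assumption of the sketch.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def find_long_words(text):
    """Placeholder 'check': report words longer than six characters."""
    return [w for w in text.split() if len(w) > 6]


class BackgroundChecker:
    """Hypothetical asynchronous checker that numbers each request."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._lock = threading.Lock()
        self._next_seq = 0

    def request_checking(self, text, completion):
        with self._lock:
            self._next_seq += 1
            seq = self._next_seq
        # The completion runs on a pool thread, not on the caller's thread.
        self._pool.submit(lambda: completion(seq, find_long_words(text)))
        return seq


class Document:
    """Owns mutable text; accepts results only for its newest request."""

    def __init__(self, checker):
        self.checker = checker
        self.text = ""
        self.latest_seq = 0
        self.results = None
        self.inbox = queue.Queue()  # stands in for "the queue you own"

    def edit(self, new_text):
        self.text = new_text
        # The callback only forwards the results to our own queue.
        self.latest_seq = self.checker.request_checking(
            new_text, lambda seq, res: self.inbox.put((seq, res)))

    def drain(self, expected):
        """Process `expected` deliveries on the caller's thread."""
        for _ in range(expected):
            seq, res = self.inbox.get(timeout=5)
            if seq == self.latest_seq:  # older requests are out of date
                self.results = res
```

Here the callback does nothing except forward its arguments, so all real work happens on the thread that calls `drain`, and results for text that has since been edited are simply ignored.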

That might be a misspelled word, it might be a URL, a date or an address from data detectors. It might be some piece of text for which we have a suggested replacement like a curly quote for a straight quote, or something of that sort. These objects always have, first of all, a type, the type of checking that they are a result for, and a range, the range in the text to which they apply. Depending on the type, they may have other information: a URL, a date, an address as a dictionary, or a string that's suggested as a replacement for that text. And then you just go through the list of these and do whatever you feel is appropriate.
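Going through such a list might look roughly like this. It is a Python model under stated assumptions: the `CheckingResult` fields, the kind names, and the hand-written sample results are all hypothetical and only mirror the shape described above (a type, a range, and optional extra data).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckingResult:
    """Hypothetical result: a kind, a range, and optional extra data."""
    kind: str                          # "spelling", "correction", "link"
    start: int                         # range location in the text
    length: int                        # range length in the text
    replacement: Optional[str] = None  # set for "correction" results
    url: Optional[str] = None          # set for "link" results


def describe(text, results):
    """Walk the results and produce one line of output per result."""
    lines = []
    for r in results:
        span = text[r.start:r.start + r.length]
        if r.kind == "spelling":
            lines.append(f"misspelled: {span}")
        elif r.kind == "correction":
            lines.append(f"replace {span} with {r.replacement}")
        elif r.kind == "link":
            lines.append(f"link: {span} -> {r.url}")
    return lines


# Hand-written sample data, not output from any real checker.
sample_text = "Learn moer at www.apple.com"
sample_results = [
    CheckingResult("spelling", 6, 4),
    CheckingResult("correction", 6, 4, replacement="more"),
    CheckingResult("link", 14, 13, url="http://www.apple.com"),
]
```

The point of the shape is that every result can be handled uniformly by range, with the type deciding which of the optional fields is worth looking at.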

So here's an example. If you passed in this sentence to text checking and you turned on all the text checking types, you'd get back quite a variety of results. You would get back one overall result saying that this whole string has been detected as being in English. Then you would get back some more specific results about specific pieces of the text.

So you might get back a spell checking result saying that the second word, M-O-E-R, is misspelled. And if you'd asked for correction, you might get back another result suggesting that it be corrected to M-O-R-E. And you might get back a link checking result for that www.apple.com, suggesting that this looks like a URL, with the fully formed URL, http: etc. And then for the phone number there, phone numbers are actually detected as address results that have a phone number.

So you get back an address result for this phone number specifying that it is a particular phone number. One other new class that I need to mention is NSOrthography, which is what we use for specifying the language that has been detected in the text. The complication here is that a particular piece of text might include more than one language.

In fact, it might include more than one script. So NSOrthography is a class that encapsulates this. It specifies what scripts were found in your text using standard 4-character script codes, things like Latin, Greek, something like Japanese, and so forth. And for each script, what languages were detected as being there, using again standard, in this case, 2-character codes, say EN, FR, and so forth. There are a couple of other things I want to mention. We've done a lot to spell checking in Snow Leopard.

One thing that we've done: we've always had files in which we saved the words that the user has learned. We never actually documented the format for those files. We have now documented it. It's a very simple format. Look at the release notes, they will tell you how you can interpret these. Also, we've added a feature for Snow Leopard where users can add new spell checking dictionaries in a standard format. This is actually a format that OpenOffice and a number of other things use.

It uses 2 different files, so you can add spell checking for a new language just by dropping these into one of your Library spelling directories and the system will recognize it. As always, as before, developers can write their own custom code for spell checking and have that added to the system, and a number of developers have done that. Now, let's go back over to the demo machine, and I want to show some of this text checking in action.

[ Pause ]

We can go back to our sample. I decided that I wanted to take this file reader and actually do something with it, and what I decided to do with it is to have it go through a file and take a look at each one, and use text checking to have data detectors detect any dates or phone numbers that occur in that file. So what we're going to do is, for each file specified, we go through and process it.

And to process it, we create one of these file reader objects and iterate through it by paragraphs. And for each paragraph, we call the text checking API with the types date and address, and allow the data detectors to check it, and then we iterate through the results looking for dates and addresses, and creating some output for those.

What we're going to do is print out any date or address that we find. I mean, sorry-- for addresses, we're actually only looking for phone numbers, so we'll print out any date or phone number that we find. And we're especially looking for paragraphs that might have both a date and a phone number in them. And for those, we do something special, we print out the whole thing.
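The shape of that demo loop can be sketched as follows. This Python version swaps in two deliberately narrow regular expressions for the data detectors, so it only recognizes `YYYY-MM-DD` dates and `NNN-NNN-NNNN` phone numbers; the function name, the patterns, and the sample text are assumptions made for illustration.

```python
import re

# Stand-ins for data detectors: far narrower than the real thing.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def scan_paragraphs(text):
    """Report every date and phone number, paragraph by paragraph.

    A paragraph that contains both kinds is also reported in full.
    """
    report = []
    for paragraph in text.split("\n"):
        dates = DATE_RE.findall(paragraph)
        phones = PHONE_RE.findall(paragraph)
        report.extend(f"date: {d}" for d in dates)
        report.extend(f"phone: {p}" for p in phones)
        if dates and phones:
            report.append(f"both: {paragraph}")
    return report


sample = (
    "Meeting on 2009-06-12.\n"
    "Call 408-555-0123 anytime.\n"
    "Give me a call on 2009-06-13 at 408-555-0199."
)
```

Working one paragraph at a time keeps each check small and makes the "both in the same paragraph" test a simple matter of looking at the two result lists together.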

So let's see this in action. And what I've done is, I have a file here that has all sorts of dates, addresses, phone numbers, etc., in many different languages, and I want to apply this to that file, and we'll see what kind of output we get. Boom! It has detected all the different things that look like phone numbers or dates in that file, printed them out, and finally, here's a single paragraph that contains both of them. Give me a call at this date and phone number.

So that is text checking [applause] in action. Let's go back to the slides. For more information about this, I want to urge you very much to take a close look at the AppKit and Foundation Release Notes for Snow Leopard which contains lots and lots of detail about these things.

There is also more information in various programming guides. So let me summarize. You should use the framework methods for dealing with NSStrings, dealing with strings and substrings and ranges of strings rather than individual characters. You should use Unicode, use Unicode encodings wherever possible, and take advantage of some of these new methods in Snow Leopard.