

WWDC11 • Session 128

Advanced Text Processing

App Frameworks • iOS, OS X • 56:31

Applications deal with large amounts of text in a wide variety of languages and formats. They can benefit from improving their understanding of textual data. Learn how to analyze the content of text using the wide range of features now available on iOS and Mac OS X, including regular expressions, data detectors, linguistic APIs, spellchecking, and more.

Speakers: Doug Davidson, Jennifer Moore

Unlisted on Apple Developer site

Downloads from Apple

HD Video (829.4 MB)

Transcript

This transcript was generated using Whisper; it may contain transcription errors.

Good afternoon, everyone. I'm Doug Davidson and I'm here to talk to you about text processing. We've worked hard to bring you an extensive toolkit for analyzing text because we know that all our devices now have lots of text on them and we think that your applications can give a better user experience if they understand a little something about the content of that text.

So what we're going to do is first go through how text is represented and its structure and some of the basic APIs for iterating through it. And then we're going to talk about the pieces of this toolkit, things like regular expressions, data detectors, linguistic APIs, text checking. And then we'll have some demos and sample code to put it all together.

Now, I remember when I started out in this business a long time ago, string handling looked a little something like this. Everything was ASCII C strings. Well, those days are long gone. Our text is Unicode now. And no matter what system and platform you're using, chances are that there is an API that deals with strings as sequences of UTF-16 units. And NSString is no exception. Conceptually, an NSString is a sequence of UTF-16 units, what we call unichars.

So there's a great temptation to think that these are really just like the old C-strings, except that the character is a little wider and that the right thing to do is to go and iterate through character by character. But really, that is not the way that Unicode works. Let me show you. Let's take a look at some text to start off with.

So here's a simple piece of text. And as we look at it, we notice that it has some structure. So this text is composed of a sequence of paragraphs. And each paragraph is composed of sentences. And then the sentences, in turn, are composed of words. And then the words, in turn, are composed of what I'm going to call user-visible characters. And none of these things, not even the user-visible characters, can be directly identified with individual unichars. They're all ranges of unichars in the text.

Let me go back to the slides and say why that's so. So these user-visible characters are actually in general character clusters. The official Unicode term for this is a grapheme cluster. Sometimes in API we refer to this as composed character sequences. This is a very characteristic feature of Unicode: multiple unichars can go to making up what is a single user-visible character. So I don't know if you noticed in that example text that I had there, but in the words San José, there's an accented E. And that is actually not one unichar, it's two: an E with a combining accent mark. And that is just the very simplest example of sequences of characters that combine to form a single user-visible character. There are much more complicated examples. I showed one in Korean; it happens all the time.

In Vietnamese, there are very complicated sequences where you have multiple unichars that combine to form a single user-visible character. And in addition to this, we have surrogate pairs. Anything in Unicode that goes beyond the basic multilingual plane, that is, beyond the first 64K characters, is going to be represented not by a single UTF-16 unit, but by two, in what's called a surrogate pair. And this is not just some abstract thing, this is a very practical issue. The characters beyond the first 64K include things like cuneiform and hieroglyphics, but it's not limited to that. If you're dealing with Chinese text, you'll often see characters that require surrogate pairs. We now have emoji available for everybody, and most of the emoji characters are also beyond the basic multilingual plane and are going to require surrogate pairs.

So in all these cases, we have more than one unichar that goes to make a single user-visible character. And you can't split these up when you're processing them. They have to stay together. You don't want to divide the E from its accent. You definitely can't divide the two halves of surrogate pairs. So when we go through these, we can't deal with the individual unichars. We have to deal in general with ranges of unichars. If we can't use that simple for loop, how can we iterate through the text? Well, we have a device that seems as if it was made for text processing, and that is blocks. So what we provide is a block iteration API on NSString: enumerate substrings in range, and we specify how it is that we want to enumerate. In this case, we're choosing to enumerate by composed character sequences, that is, by these user-visible characters.
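As a rough sketch of what that call looks like (the sample string here is illustrative, and the demo's coloring is replaced by a simple log):

```objc
#import <Foundation/Foundation.h>

// Enumerate by composed character sequences (user-visible characters),
// so an "e" plus a combining accent is delivered as one substring.
NSString *text = @"San Jose\u0301";  // "José" spelled with a combining acute accent
[text enumerateSubstringsInRange:NSMakeRange(0, [text length])
                         options:NSStringEnumerationByComposedCharacterSequences
                      usingBlock:^(NSString *substring, NSRange substringRange,
                                   NSRange enclosingRange, BOOL *stop) {
    // substringRange may span more than one unichar (e.g. the final "é")
    NSLog(@"%@ at %@", substring, NSStringFromRange(substringRange));
}];
```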

And what this API does is to call your block, the block that you provide here, once for each of these in the range of the string that you asked for. And then you can do whatever it is you want. The block -- your block gets passed in the range that it's supposed to deal with, and you can do whatever you want with it. This example code was taken directly from the app that I showed, and all it's doing is to add a color to that piece of the text. And I chose to do alternating colors to make it evident.

So, for example, in this string, your block, if you call this, would be called once for the first character, then the next, and so on and so forth, until you get to this one that is not a single unichar, but two, and it would be passed in the range for that one as well.

We can do the same thing with words. And words are an appropriate processing unit for many different tasks, things like uppercasing, recognition of words, spell checking and correction. One thing to keep in mind with words is that not all languages separate their words by white space. So you can't rely on white space as an indicator of word boundaries. So we have, again, an API that will deal with this. It's the same API with a different constant: enumeration by words. So if you call this enumeration with the enumeration-by-words constant, then your block, again, will be called once for each word in the range of the string that you asked for. For example, in this string, it would be called for the first word, the second, and so on, and in each case it's passed the range of that word in the text. And you can do with it whatever you want.
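The word case is the same call with a different option constant; a minimal sketch (again with an illustrative string):

```objc
#import <Foundation/Foundation.h>

NSString *text = @"Words are an appropriate processing unit.";
[text enumerateSubstringsInRange:NSMakeRange(0, [text length])
                         options:NSStringEnumerationByWords
                      usingBlock:^(NSString *word, NSRange wordRange,
                                   NSRange enclosingRange, BOOL *stop) {
    // wordRange covers the word itself; enclosingRange may also
    // include surrounding whitespace and punctuation.
    NSLog(@"word: %@", word);
}];
```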

Same thing for other units like sentences and paragraphs. Paragraphs are important because they're in some sense the maximal unit for Unicode processing. None of the Unicode algorithms require you to go beyond the bounds of a paragraph. In some cases they might have effect on the whole paragraph; for example, some features in bidi (bidirectional text) processing affect the whole paragraph.

So again, it will be just the same API that you can call to iterate through paragraph by paragraph. And your block in this case will be called once for each paragraph with the range of the paragraph. And one useful pattern is to nest these. So you could go through your text first by paragraphs and then within the paragraph by some smaller unit as necessary.
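That nesting pattern might be sketched like this (the string is a placeholder):

```objc
#import <Foundation/Foundation.h>

NSString *text = @"First paragraph.\nSecond paragraph here.";
[text enumerateSubstringsInRange:NSMakeRange(0, [text length])
                         options:NSStringEnumerationByParagraphs
                      usingBlock:^(NSString *paragraph, NSRange paragraphRange,
                                   NSRange enclosingRange, BOOL *stop) {
    // Within each paragraph, enumerate by a smaller unit as needed.
    [text enumerateSubstringsInRange:paragraphRange
                             options:NSStringEnumerationByWords
                          usingBlock:^(NSString *word, NSRange wordRange,
                                       NSRange wordEnclosingRange, BOOL *innerStop) {
        NSLog(@"word %@ in paragraph at %@", word,
              NSStringFromRange(paragraphRange));
    }];
}];
```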

Now, of course, block iteration is wonderful, but it's not the only thing that you want to do with text. So there are many other APIs on NSString. For example, the range of string APIs are the Unicode-savvy way of finding an instance of a string within a piece of text. In this case, I've chosen to look for the string "resume" -- or maybe the string that I'm matching is "résumé" -- because I've chosen as options to include not only case-insensitive search, but also diacritic-insensitive search. So this search will find that string whether it has diacritics on it or not. And this is a very powerful API that can be used for many kinds of locating of substrings within a string.

If I add one additional parameter to the options, I can, instead of looking for that string in the text, choose to decide whether I have matched it at a particular location. That's the anchored search option. And that doesn't go looking through the entire string. It just looks at the start of the range that I specify and says, is that string there or not? And this will match it in a fully Unicode-savvy way with all the options that I specified, so it will ignore differences of case, differences of diacritics. In this case, I've chosen also to ignore the difference between half-width and full-width characters.
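A sketch of both searches, with an illustrative string containing combining accents:

```objc
#import <Foundation/Foundation.h>

NSString *text = @"Attached is my re\u0301sume\u0301.";  // "résumé"
NSStringCompareOptions opts = NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;

// Search the whole string for "resume", ignoring case and diacritics.
NSRange found = [text rangeOfString:@"resume" options:opts];

// Anchored variant: don't search -- just test whether a match begins
// exactly at the start of the given range.
if (found.location != NSNotFound) {
    NSRange anchored =
        [text rangeOfString:@"resume"
                    options:opts | NSAnchoredSearch | NSWidthInsensitiveSearch
                      range:NSMakeRange(found.location,
                                        [text length] - found.location)];
    NSLog(@"anchored match: %@", NSStringFromRange(anchored));
}
```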

And if what I want is not just to see whether it matches, but to get a comparison result -- is it before or after in a sort order -- I use the compare API on NSString. And that returns a comparison result with all the same options.

And there are many other APIs on NSString for deciding whether a word has -- a string has a given prefix or suffix. If you want to look for characters from a character set instead of a specific string, you can do that. And none of these APIs, you'll notice, mention individual unichars. They're all Unicode savvy, so they deal with ranges of text in the string, and they find it in a Unicode-conformant way. One other set of methods I want to mention is our powerful search and replace methods, where you can find all instances of a specific string and replace them with something else. And we have two variants of that: one that takes an immutable string and returns a modified copy, and another that takes a mutable string and changes it in place.
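The two replace variants could be sketched like this (sample strings are illustrative):

```objc
#import <Foundation/Foundation.h>

NSString *text = @"The colour of the colourful sky.";

// Immutable variant: returns a modified copy, leaves the original alone.
NSString *fixed =
    [text stringByReplacingOccurrencesOfString:@"colour"
                                    withString:@"color"
                                       options:NSCaseInsensitiveSearch
                                         range:NSMakeRange(0, [text length])];

// Mutable variant: edits the string in place, returns the replacement count.
NSMutableString *mutableText = [NSMutableString stringWithString:text];
NSUInteger count =
    [mutableText replaceOccurrencesOfString:@"colour"
                                 withString:@"color"
                                    options:NSCaseInsensitiveSearch
                                      range:NSMakeRange(0, [mutableText length])];
NSLog(@"%@ (%lu replacements)", fixed, (unsigned long)count);
```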

Okay, so that's a very simple set of text processing operations. But maybe we wanted to do something a little more sophisticated. Maybe instead of finding all words in the text, I wanted to find, say, all words that begin with "th". Or maybe I wanted to find all the URLs in this piece of text, or perhaps what I wanted to do was to find all of the nouns in this piece of text, or maybe all the adjectives, or maybe what I wanted to do was to find all the names in this piece of text. How am I going to do that? So let me go through some of the tools that we have available. Finding the words that begin with "th" -- that sounds like a job for regular expressions. So I'm not going to tell you how regular expressions work. If you don't already know, there are many, many books and articles and websites that will tell you how to work with regular expressions.

What I want to talk about here is how we make regular expressions available as an API at foundation level. And our regular expression API is based on the ICU regular expression implementation. So it uses the ICU syntax, which is very standard, and fully Unicode compliant. We've added some Apple improvements to make it more efficient. And all of the usual options that you would expect to find in a regular expression API are there, and a few more. I'm not going to describe them all. You can go look at the headers, the documentation, the release notes that describe all these things.

We make this available via a class called NSRegularExpression. This is available in Mac OS X starting with Lion and on iOS starting with iOS 4. And NSRegularExpression represents a compiled regular expression. You create one, a regular expression object, by passing in a regular expression pattern. So here's a contrived little example of a regular expression pattern. First of all, there's a backslash-B, which finds a word boundary. Of course, the backslash has to be escaped with another backslash because it appears in a string constant. Then there's a parenthesized subexpression which locates an I or an O. Another parenthesized subexpression finds an F or an N. And then another backslash-B for a word boundary. So what this does is to locate two-letter words that start with I or O and end with F or N.

And the option I've chosen to use here is the case insensitive option. So this will find these two-letter words, whether they're lowercase or uppercase or any combination. Pretty simple. So what are we going to do with this regular expression object once we've created it? Well, naturally, the thing to do is to get a block iteration. And that is the basic API on this regular expression.

Enumerate matches in string goes through and calls the block that you supply for each match of that regular expression in the range of the string that you're targeting. And then, again, when in your block, you can do whatever it is you need to do with that match of the regular expression. So, for example, for this string, it would be called once for the first match, the second match, the third, and so on and so on.

And in this case, the thing that gets passed into your block is not just a range. We have a match object. And the reason for this is that for a regular expression match, it's not just the range of the match that's important. We also have what are called capture groups, these parenthesized sub-expressions, and we want to be able to find the ranges of those as well. So we have an object of class NSTextCheckingResult -- we'll be seeing this again in the future -- that expresses that match. And what does a text checking result have? It has a type, result type. I'll discuss that more later. It always has a range. That's the overall range of the match of the regular expression in the string. And then it also has a range at index method. So range at index 0 is the range of the overall match, range at index 1 is the range of the first capture group, range at index 2 the range of the second capture group, and so on.

So, for example, that particular regular expression I demonstrated has two capture groups. So what we could do is in our block when it gets called, we can find first of all the overall range of the match. And then the range of the first capture group, that's that first sub-expression, which is just the first letter in this case. And then the range of the second capture group, in this case just the second letter. And then once we have that, we have everything that there is to know about the match of that regular expression. And we can do whatever it is we want to with it.
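Putting those pieces together, a sketch of creating that pattern and reading the capture-group ranges might look like this (error handling abbreviated, sample string illustrative):

```objc
#import <Foundation/Foundation.h>

NSError *error = nil;
NSRegularExpression *regex =
    [NSRegularExpression regularExpressionWithPattern:@"\\b([io])([fn])\\b"
                                              options:NSRegularExpressionCaseInsensitive
                                                error:&error];

NSString *text = @"If you are in or on it";
[regex enumerateMatchesInString:text
                        options:0
                          range:NSMakeRange(0, [text length])
                     usingBlock:^(NSTextCheckingResult *match,
                                  NSMatchingFlags flags, BOOL *stop) {
    NSRange overall = [match range];           // same as rangeAtIndex:0
    NSRange first   = [match rangeAtIndex:1];  // first capture group: the I or O
    NSRange second  = [match rangeAtIndex:2];  // second capture group: the F or N
    NSLog(@"match %@ groups %@ %@", NSStringFromRange(overall),
          NSStringFromRange(first), NSStringFromRange(second));
}];
```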

Now, again, as I say, block iterations are wonderful, but they're not the only thing. We have convenience methods if what you want to do is, let us say, find all the matches of your regular expression as an array, or just count them, or just find the first one, or maybe all you need is the range of the first one. So if you just want to find the range of the first match of the regular expression in the string, one line of code, you get it, and you've found where the first match is.

We also have a very common operation with regular expressions, that is, find and replace. And again, just like with NSString, we have one method for immutable strings that will take it and return a modified copy, and one method for mutable strings that will modify it in place. Now, the thing that we're replacing with, in the regular expression case, could be a literal string, but in general it is a template that can include these meta variables like $0, $1, $2, where $0 represents the overall match of the regular expression, $1 is the match for the first capture group, $2 the second capture group. This is very typical for regular expression find and replace. So here I chose to do $2, $1, that is, the second capture group followed by the first. So if you're familiar with regular expressions, you probably know what this is going to do to this string. It's just going to change the order of the letters in each of these two-letter words.
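A sketch of the template-based replace, using the same contrived pattern (sample string illustrative):

```objc
#import <Foundation/Foundation.h>

NSError *error = nil;
NSRegularExpression *regex =
    [NSRegularExpression regularExpressionWithPattern:@"\\b([io])([fn])\\b"
                                              options:NSRegularExpressionCaseInsensitive
                                                error:&error];

NSString *text = @"if it is on or in";
// $2$1 swaps the two capture groups, so "if" becomes "fi", "on" becomes "no".
NSString *swapped =
    [regex stringByReplacingMatchesInString:text
                                    options:0
                                      range:NSMakeRange(0, [text length])
                               withTemplate:@"$2$1"];
NSLog(@"%@", swapped);
```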

Now, if you're going to use it more than once, it's more efficient to create an NSRegularExpression object and hold onto it. They're convenient, they're fast, they're thread safe. But if all you want to do is one operation, we have some convenience methods on NSString, so you don't have to create a regular expression object yourself. But this is not new API; it's just the same old range of string API with a new option, NSRegularExpressionSearch, that causes the thing you're looking for to be treated as a regular expression pattern instead of a literal string. And likewise, you can use the basic string search and replace, immutable or mutable version, with this NSRegularExpressionSearch option, which tells it to treat the thing you're looking for as a regular expression pattern and the thing you're replacing with as a regular expression template.
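For the one-shot case, a sketch of those NSString conveniences (pattern and string are illustrative):

```objc
#import <Foundation/Foundation.h>

NSString *text = @"these things there";

// Treat the search string as an ICU regular expression pattern.
NSRange firstTh = [text rangeOfString:@"\\bth\\w+"
                              options:NSRegularExpressionSearch];
NSLog(@"first th-word at %@", NSStringFromRange(firstTh));

// Same idea for one-shot find-and-replace; the replacement is a template.
NSString *replaced =
    [text stringByReplacingOccurrencesOfString:@"\\bth(\\w+)"
                                    withString:@"TH$1"
                                       options:NSRegularExpressionSearch
                                         range:NSMakeRange(0, [text length])];
NSLog(@"%@", replaced);
```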

All right. Suppose we want to go a little further, and suppose we're looking not for a regular expression, but for something like a URL or a phone number. Now people sometimes do try to use regular expressions for this sort of thing, which is maybe okay if you're just looking for a single pattern, say a single pattern of phone numbers. But once you try to look for multiple ones, it starts to get complicated and expensive quickly. Fortunately we have something better, and that is an Apple technology called data detectors, which locates interesting things like URLs, phone numbers, dates, addresses, in a large variety of international formats, and it does so very efficiently, much more efficiently than regular expressions could do it. And so you may be familiar with this from the UI context, where data detectors locates things in the text you're looking at and provides a way to do something useful quickly with them, like show a location on a map. But what we're talking about here is the API version of this, low-level API, foundation-level API, which we expose via a class that's just a subclass of NSRegularExpression called NSDataDetector, again, available on Mac OS X starting with Lion, and on iOS starting with iOS 4.

So NSDataDetector is a subclass, as I said, of NSRegularExpression. And you create an NSDataDetector, but not with a regular expression pattern. You create it instead by specifying which kinds of data detector types you want this detector to find. In this case, I've chosen the link type, which finds URLs, and the phone number type, which finds phone numbers. There are a bunch of different types. You can look for any or all of dates, addresses, URLs, phone numbers, et cetera. And since this is a subclass of NSRegularExpression, the basic API, again, is a block iterator. For example, on this text, your block would be called first with the first thing it finds, a phone number here, and second with the second thing it finds, a URL.

Now, this is the point at which we get very glad that we used a match object in our regular expression API because now this match object can carry all sorts of other interesting properties that are useful when you find a data detector type. For example, if what you found is a date, there's a date property that can get that date as a parsed date. If it's an address, there's a components property that gets all the different components of the address. If it's a URL, you can get it as a URL. Phone number -- there's a phone number property.

And here's where we use the result type from the NSTextCheckingResult object that expresses the match. And that will tell us what type it is that was found. So, for example, the data detector that I created here finds two different types of things, and so we need to distinguish. Was it a link type? That is, was it a URL that was found? If so, we can get the URL from that match object. Or maybe it was a phone number that was found, in which case we can get the phone number from the object and then do with it whatever we need to do with it.
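A sketch of creating that detector and branching on the result type (the sample text and its phone number are illustrative):

```objc
#import <Foundation/Foundation.h>

NSError *error = nil;
NSDataDetector *detector =
    [NSDataDetector dataDetectorWithTypes:(NSTextCheckingTypeLink |
                                           NSTextCheckingTypePhoneNumber)
                                    error:&error];

NSString *text = @"Call (408) 555-0100 or visit http://www.apple.com for details.";
[detector enumerateMatchesInString:text
                           options:0
                             range:NSMakeRange(0, [text length])
                        usingBlock:^(NSTextCheckingResult *match,
                                     NSMatchingFlags flags, BOOL *stop) {
    if ([match resultType] == NSTextCheckingTypeLink) {
        NSURL *url = [match URL];               // the parsed URL
        NSLog(@"link: %@", url);
    } else if ([match resultType] == NSTextCheckingTypePhoneNumber) {
        NSString *phone = [match phoneNumber];  // the parsed phone number
        NSLog(@"phone: %@", phone);
    }
}];
```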

Again, there are convenience methods if you don't want to iterate through all the matches, if you just want to get them all as an array or count them or find the first one or the range of the first one. If you just want to find the range of the first one, it's a single line of code like this.

Okay, so that's data detectors. Now, suppose we want to go a little deeper and -- we want to find out something linguistic about our text. So we have a new API for that, which is new in Mac OS X Lion and iOS 5, called linguistic tagging. And so the basic premise of linguistic tagging is that we take a piece of text and we divide it up into pieces that we call tokens, and then there are various layers of linguistic information that we might want to find out about this text.

So the most basic and trivial is what kind of token is this? Is this a word? Is this white space? Is it punctuation? Okay, that's pretty obvious. But there are many others. For example, we might want to find out what language is this text in? And we have a statistical model that will tell us, as best it can, what language this text is in, not currently at the individual word level, but definitely at the paragraph level, and we'll return that.

And going a little deeper, we might want to find out what part of speech these words are. And again, we have a statistical language model that will tell us, to the best of its ability, what the part of speech of each word is as it is used in this context.

So there are a number of these different layers. We refer to the layers as tag schemes, and the individual pieces of information they return are called tags. So the simplest ones are the basic type, that is, word or white space or punctuation, et cetera. The language is returned as a standard language tag, like EN or FR or DE and so forth. Script: is the text in Latin script, Cyrillic, Arabic, Chinese, what have you. All of these things are available for many different languages and scripts on both Mac OS X and iOS. The more sophisticated tag schemes are, first of all, parts of speech, what we call in technical terms lexical class: is this word, as used in context, a noun or verb or adjective or what have you?

Then we have named entity recognition, which finds out whether a piece of text looks like it's being used in context as a name and, moreover, does it look like it's a personal name or a place name or an organization name. And also we have what we call the lemma. That is, if we know this word, we can give you the root form of the word without any inflectional suffixes and so forth, if that's available. Now, these are more sophisticated, and they're not available for all languages. On Lion, we have lexical class and lemmas for English, French, and German. We have named entity recognition for English. On the version of iOS 5 you have currently, all these things are available only for English.

So how does this work? So we have the class NSLinguisticTagger that is the interface for all of this. And you create an NSLinguisticTagger object by specifying a set of schemes. And so in this case I've chosen that I want to be able to get out the basic token type scheme, I want the lexical class scheme that gives me parts of speech, and I want the name type scheme that gives me names.

Once you have your tagger, then you take the string that you want to analyze and you attach that string to the tagger. This is important because the linguistic tagger keeps a bunch of state about that string so that it only has to analyze a particular piece of it once, no matter how many times you ask.

And then, of course, I think you can guess what the basic API is for it. It's a block iteration: enumerate tags in range. You tell the tagger: go through this range of my string and tell me for each token in the string what the tag is for the particular scheme. So in this case I'm using the lexical class scheme, so it will tell me the part of speech for each of these tokens.

and it passes in that tag and the range of the token and also the range of the sentence within which it lives. And then I can take a look at it and do whatever it is I need to do with it. For example, if it's a noun, do something with it. This code is taken directly from that app that I showed.
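A sketch of that setup and enumeration (the schemes match the ones named above; the sample string is illustrative, and memory management is elided):

```objc
#import <Foundation/Foundation.h>

NSArray *schemes = [NSArray arrayWithObjects:NSLinguisticTagSchemeTokenType,
                                             NSLinguisticTagSchemeLexicalClass,
                                             NSLinguisticTagSchemeNameType, nil];
NSLinguisticTagger *tagger =
    [[NSLinguisticTagger alloc] initWithTagSchemes:schemes options:0];

NSString *text = @"The quick brown fox jumps over the lazy dog.";
[tagger setString:text];  // the tagger caches analysis state for this string

[tagger enumerateTagsInRange:NSMakeRange(0, [text length])
                      scheme:NSLinguisticTagSchemeLexicalClass
                     options:(NSLinguisticTaggerOmitWhitespace |
                              NSLinguisticTaggerOmitPunctuation)
                  usingBlock:^(NSString *tag, NSRange tokenRange,
                               NSRange sentenceRange, BOOL *stop) {
    if ([tag isEqualToString:NSLinguisticTagNoun]) {
        NSString *noun = [text substringWithRange:tokenRange];
        NSLog(@"noun: %@", noun);  // e.g. count it for a word cloud
    }
}];
```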

Again, block iteration is not the only thing you might want to do, so we have some additional methods if you just want to find the tag for a particular token. Or if you want to get the tags for a set of tokens as an array. We also have methods for that. So if you just want to find out what is the part of speech for this particular word here at this index, you call this method, and it gets back that tag.

Now, we use this for a number of different things. We use it to assist with auto-correction for dictionary lookup and things like that. If you're concerned about performance, one thing I might mention is currently on Lion, we're doing this essentially full-time for the text you type to help in correcting it. But I'm sure that there are plenty of other potential applications for this that we haven't thought up yet and that you may be able to come up with. But right now, I'd like to bring up my colleague Jennifer Moore to show one particular demo application. Hi.

Okay, so this demo is just a sample application of how we can use these linguistic APIs to provide a more visually compelling experience of the text in our application. So I'm going to start off with -- I have a set of biographies of some of the top-trending artists on iTunes right now. And what I want to do is use the linguistic tagger to extract the most frequent words out of this text and present them to the user in the form of a nice big word cloud.

And so we get a nice visual look at the content of our data. So to do that, it's very simple. We start by initializing our linguistic tagger, as Doug showed before, with a set of tag schemes. And for now we're going to start off with the lexical class tag scheme and also the name type tag scheme.

And then for each one of our biographies, it's very simple. This is the core of our work here. For each of our texts, we're going to associate that text with our NSLinguisticTagger, with set string. And then we're going to use our block API to iterate over all the words in our text. So we call enumerate tags in range with our text range. And for our purposes, we're going to use lexical class tag scheme because for our word cloud, we don't really want words like of and the. These are kind of boring words. What we really want are all the nouns of the text. These are the words that convey the most information in our text. So what we're going to do is if the word type is of the type noun, we're going to extract word from our text and add it to our set of words with the corresponding count. Keeping track of the count. And the rest of it is just getting the most frequent words out of our set and displaying it in the view, which I'm not going to show here, but it's pretty straightforward. So all we just do now is build and run.

And now we have a nice word cloud of these are all the most frequent words of our text. And you see the words that kind of pop out. Music, pop is kind of the genre of music that's most popular right now. But, you know, if we're looking at biographies, these are biographies of people. And actually it wouldn't be interesting if we could look at all the names across our text and maybe capture some of the relationships of artists who are influenced by other artists maybe. So we can do that.

And the change is very simple. We go back to our code here. Instead of NSLinguisticTagSchemeLexicalClass, let's just do name type. And if we're looking at names, sometimes names can have multiple tokens as part of the name. So we have an option for that. It would be nice to have one range for the entire name. So we can use this option here.

And that's the NSLinguisticTaggerJoinNames option. And then instead of nouns here, we're going to look for, let's say, all the personal names. And that's all the change that we have to make. So we run this again. And now these are all the names across all of our biographies. And you see Alanis Morissette was not one of the biographies we're looking at, but she influenced a lot of musicians. So that's our word cloud.
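The name-finding variant of the enumeration might be sketched like this (sample string illustrative, memory management elided):

```objc
#import <Foundation/Foundation.h>

NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc]
    initWithTagSchemes:[NSArray arrayWithObject:NSLinguisticTagSchemeNameType]
               options:0];

NSString *text = @"She was influenced by Alanis Morissette early on.";
[tagger setString:text];

// JoinNames merges multi-token names (like "Alanis Morissette") into one range.
[tagger enumerateTagsInRange:NSMakeRange(0, [text length])
                      scheme:NSLinguisticTagSchemeNameType
                     options:(NSLinguisticTaggerOmitWhitespace |
                              NSLinguisticTaggerOmitPunctuation |
                              NSLinguisticTaggerJoinNames)
                  usingBlock:^(NSString *tag, NSRange tokenRange,
                               NSRange sentenceRange, BOOL *stop) {
    if ([tag isEqualToString:NSLinguisticTagPersonalName]) {
        NSLog(@"name: %@", [text substringWithRange:tokenRange]);
    }
}];
```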

Now, I think we said earlier that these APIs are cross-platform, so I just want to show that to you now. Almost the exact same code. Although not quite as flashy, but... So we have all the words, all the nouns in our text, and we can also get all of the names in our text. And that's linguistic tagging on iOS and Mac OS X.

Thank you, Jennifer. So now, all the APIs that we've discussed so far are at the foundation level; they're all cross-platform, iOS and Mac OS X. Before we go on further, I want to mention one additional related API that is also cross-platform but is available at the Core Foundation level, and that is for hyphenation. What do I mean by that? I mean taking a word and finding the potential hyphenation points inside it based on linguistic analysis of it. And this is something that you may have seen used, for example, by Safari Reader within the WebKit use of hyphenation. It's also used by things like iBooks. Generally this is something that would be used directly primarily by a low-level text layout engine. But we do make this API available if you need it. It's available as an API on CFString, in CFString.h: CFStringGetHyphenationLocationBeforeIndex. And as I said before, this is available on both Mac OS X and iOS. We provide hyphenation data currently for eight different languages. If you were at the WebKit talk this morning, you would have seen how to enable hyphenation within WebKit. Currently, WebKit is limited only to using hyphenation in English. But as I say, we do make the data available for many different languages.
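A sketch of calling that C-level API (the word and locale are illustrative):

```objc
#import <CoreFoundation/CoreFoundation.h>

CFStringRef word = CFSTR("hyphenation");
CFLocaleRef locale = CFLocaleCreate(kCFAllocatorDefault, CFSTR("en_US"));

if (CFStringIsHyphenationAvailableForLocale(locale)) {
    // Find the last potential hyphenation point before the given index.
    CFIndex breakAt = CFStringGetHyphenationLocationBeforeIndex(
        word,                                    // the word to analyze
        CFStringGetLength(word),                 // look before this index
        CFRangeMake(0, CFStringGetLength(word)), // limit range within the string
        0,                                       // options (none currently defined)
        locale,                                  // language to hyphenate for
        NULL);                                   // optional out: break character
    if (breakAt != kCFNotFound) {
        // breakAt is an index at which a hyphen could be inserted
    }
}
CFRelease(locale);
```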

So now what I want to do is to go on to the piece of this that has similar APIs available on iOS and Mac OS X, but not identical ones, because these APIs are at the UI level in AppKit or UIKit. And this is text checking, which is represented by NSSpellChecker in AppKit on Mac OS X and UITextChecker in UIKit on iOS. And this does a number of different things, but the most significant and important thing is it does spell checking.

And the various UI text editing widgets on both platforms all use it already by default, but we also make it available to you as an API so you can use it wherever you need it. So NSSpellChecker in AppKit: the way this is organized is that there's a single shared spell checker object for the whole application, and you can just get it. And then you can ask it to do some checking of a piece of text. And actually NSSpellChecker will check for a number of different things. It can look for misspelled words. It can look for potential grammar mistakes. It can find corrections to the text. It can do smart quotes, smart dashes, and it also does text replacement, which is a sort of shortcut-type feature where you type a piece of text and it gets replaced by something else that the user has specified. And the way that it works -- unfortunately we don't yet have a neat block iteration API for this -- is that you call it and you get back an array of, again, NSTextCheckingResult objects. You specify what kinds of types of things you want it to look for.

And then you get this array of results back, and you can go through the results, and for each one, again, you can look at the result type, the result range, and see what kind of thing it is that it found and what the range is, and possibly other attributes that tell you what you might want to do with it. This is a synchronous API. There's also a separate asynchronous API that you can call to get results back at some later point, when they might be available.
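Put together, the synchronous call and the result loop look roughly like this (a sketch; the `string` variable is assumed to be the text you're checking):

```objc
#import <AppKit/AppKit.h>

NSSpellChecker *checker = [NSSpellChecker sharedSpellChecker];

// Ask for the kinds of things we want it to look for.
NSTextCheckingTypes types = NSTextCheckingTypeSpelling | NSTextCheckingTypeGrammar;
NSArray *results = [checker checkString:string
                                  range:NSMakeRange(0, [string length])
                                  types:types
                                options:nil
                 inSpellDocumentWithTag:0
                            orthography:NULL
                              wordCount:NULL];

// Go through the results and look at each one's type and range.
for (NSTextCheckingResult *result in results) {
    if ([result resultType] == NSTextCheckingTypeSpelling) {
        NSLog(@"Possible misspelling at %@", NSStringFromRange([result range]));
    }
}
```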

So, for example, if you were looking for misspelled words in this piece of text, you would get back two different results, and you would go through them one by one, and it would tell you what the range of each is in the text. As I say, there are different kinds of things that the spell checker will check for: spelling errors, possible grammar mistakes, straight quotes that might be smartly substituted by curly quotes, the same thing for dashes, the replacements, and potential corrections. That is not the only API on the spell checker.

There are also specific methods for doing things like spell checking. If you want to go through and find the first misspelled word in a range of text, this call will do it. If you have a misspelled word and you want to find out what the spell checker wants to suggest as possible alternatives for it, you can get several different alternatives by calling guessesForWordRange. If you want to find the single proposed correction, if any, you can call correctionForWordRange.
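Those three methods look roughly like this in use. Again a sketch: `string` is assumed, and the language is hard-coded to English here for illustration.

```objc
#import <AppKit/AppKit.h>

NSSpellChecker *checker = [NSSpellChecker sharedSpellChecker];

// Find the first misspelled word at or after a given offset.
NSRange misspelled = [checker checkSpellingOfString:string startingAt:0];
if (misspelled.location != NSNotFound) {
    // Several alternatives, the kind of list the spelling panel shows:
    NSArray *guesses = [checker guessesForWordRange:misspelled
                                           inString:string
                                           language:@"en"
                             inSpellDocumentWithTag:0];
    // Or the single proposed autocorrection, if any (may be nil):
    NSString *correction = [checker correctionForWordRange:misspelled
                                                  inString:string
                                                  language:@"en"
                                    inSpellDocumentWithTag:0];
    NSLog(@"guesses: %@, correction: %@", guesses, correction);
}
```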

And, as I mentioned, the UI text editing objects like NSTextView and WebView will automatically use NSSpellChecker to do spell checking and correction and so on. And they all have APIs that allow you to turn all these different facilities on and off on a specific view-by-view basis. The user can also turn them on and off, so there are user preferences available for these things as well. And in addition, at the NSTextView level, you can customize all this by overriding the methods NSTextView uses to call NSSpellChecker and get the results back. So you can customize it in any way you want.

Something that you may have seen on Mac OS X Lion is that we have a new UI for autocorrection that is very similar to the iOS autocorrection UI. This is something new that we provide, and NSTextView and WebView will automatically use it if autocorrection is turned on. But as I say, it can be turned on and off per view, or the user can decide whether or not to use it globally. But if you have your own text view, we also have APIs that allow you to bring up these correction indicators as well if you need to. So there's an API to show one of the correction indicators, also one to dismiss it, and there's also the record-response API, where you tell the spell checker: did the user accept this correction, or did they refuse it, or did they go back and edit it and revert it, so that the spell checker can learn from the user's behavior and customize its autocorrections and not do the ones the user doesn't want.
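The record-response call looks roughly like this. A sketch only: `word` and `correction` are hypothetical variables standing for the original word and the correction that was offered.

```objc
#import <AppKit/AppKit.h>

// After the user acts on an offered correction, report what they did so the
// spell checker can learn which autocorrections this user accepts.
// Other responses include NSCorrectionResponseRejected,
// NSCorrectionResponseEdited, and NSCorrectionResponseReverted.
[[NSSpellChecker sharedSpellChecker] recordResponse:NSCorrectionResponseAccepted
                                       toCorrection:correction
                                            forWord:word
                                           language:@"en"
                             inSpellDocumentWithTag:0];
```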

On the iOS side, there is UITextChecker, which provides spell checking. It doesn't yet do a number of the other things that NSSpellChecker does, and it uses a slightly different model. For UITextChecker, you don't use a single shared instance; you create a new instance for each document that you want to check.

You create one, and then UITextChecker has the basic sort of API where you go through a piece of text and find the first misspelled word, and you call it again to go through and find the next misspelled word. And again, once you have found a misspelled word, you can ask what the possible correction alternatives are that you might want to suggest to the user. These are the sorts of things you would see if you tap on a misspelled word in iOS -- you get some alternatives to correct it to, and those are the sorts of things this API would provide.
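That find-next loop looks roughly like this (a sketch; `string` is assumed to be the document's text):

```objc
#import <UIKit/UIKit.h>

// One checker per document, not a shared instance.
UITextChecker *checker = [[UITextChecker alloc] init];
NSRange searchRange = NSMakeRange(0, [string length]);

// Find the first misspelled word, then keep calling to find the next one.
NSRange misspelled = [checker rangeOfMisspelledWordInString:string
                                                      range:searchRange
                                                 startingAt:0
                                                       wrap:NO
                                                   language:@"en_US"];
while (misspelled.location != NSNotFound) {
    // Alternatives you might suggest to the user, as in the tap-to-correct UI.
    NSArray *guesses = [checker guessesForWordRange:misspelled
                                           inString:string
                                           language:@"en_US"];
    NSLog(@"%@ -> %@", [string substringWithRange:misspelled], guesses);
    misspelled = [checker rangeOfMisspelledWordInString:string
                                                  range:searchRange
                                             startingAt:NSMaxRange(misspelled)
                                                   wrap:NO
                                               language:@"en_US"];
}
```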

So now I'd like to spend a little time going into another demo. So... This demo is perhaps not quite as flashy as Jennifer's, but I chose it because it provides a good example of how to use all these different pieces in the toolkit together. What this application does is take a piece of text and apply a specific kind of capitalization to it, the sort of capitalization you would expect if you're going to show something as a title. Sometimes it's called headline style.

So I enter the text in the old capitalization, and it comes out with certain words capitalized. Now, the rules for headline style, if you look in the style manuals, are kind of complicated and fussy -- and that's even just for English; I'm doing this only for English here as a demo. They generally depend on not just what word it is or how long it is, but also what part of speech the word is, and even how it's being used -- what part of speech it's being used as in the text. So, for example, most of the long words get capitalized, and some of the short words don't, but usually it's just things like the articles and prepositions that don't. The verbs, like "is," will be capitalized even though they're short.

And there are a number of other twists and turns. So let's see -- this has a few other features. For example, this particular app knows how to capitalize things like iPhone and iCloud. And you notice that the first word of a title is always capitalized, even if it otherwise would not be. So "on" is capitalized as the first word, even though it's not when it appears in the middle of the text.

But if I were to do something like make this a parenthesized expression, you would want to capitalize the first word inside the parentheses. So we do that as well, and the same thing if it was in quotes. And you notice I have also done some nice smart quoting, so I convert the straight quotes into curly quotes.

And one last thing to show: I have a little quote from Shakespeare here, and you notice that the word "on" in this quote is capitalized. It was not capitalized in the previous examples because it was being used as a preposition, but here it's not being used as a preposition, so it does get capitalized. So these are some of the complications of the rules. What's important here is not the specific rules that are being used, but the way that we use the tools.

And you could combine them in other ways if you had different sets of rules. So this titleizer is going to make use of a linguistic tagger. The first thing I do when I create it is create myself a linguistic tagger. And the tag scheme I've chosen to use is one I haven't mentioned before; this is our top-level tag scheme, NSLinguisticTagSchemeNameTypeOrLexicalClass. So if a token is a name, this will tell you what kind of name it is; if it's not, it'll tell you what part of speech it is.

And I'm also going to use a data detector for finding URLs, because, you know, I don't want to go and change the capitalization of URLs. That doesn't really make sense. So I'll find the URLs in the text. And I'm going to use a mutable index set with that, and I'll show you that in a minute.

This is the external API for this, titleStringForString:. And so basically the first thing I'm going to do is associate my string with my tagger so I can get tagging results for it. And as I said, I'm going to do this just for English, so I'm going to tell the linguistic tagger to treat the text as English. If I don't do this, the tagger will try to identify the language of the text, but here I want to treat it all as English.
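That setup looks roughly like this. A sketch under one assumption: one way to force English is to set an NSOrthography on the whole range, as shown here; the exact mechanics in the demo app may differ.

```objc
#import <Foundation/Foundation.h>

// Create a tagger asking for the top-level name-type-or-lexical-class scheme.
NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc]
    initWithTagSchemes:[NSArray arrayWithObject:NSLinguisticTagSchemeNameTypeOrLexicalClass]
               options:0];

// Associate the string with the tagger so we can get tagging results for it.
[tagger setString:string];

// Tell the tagger to treat the entire string as English rather than
// auto-identifying the language.
NSOrthography *english = [NSOrthography
    orthographyWithDominantScript:@"Latn"
                      languageMap:[NSDictionary dictionaryWithObject:
                                      [NSArray arrayWithObject:@"en"]
                                                              forKey:@"Latn"]];
[tagger setOrthography:english range:NSMakeRange(0, [string length])];
```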

And then I'm going to find the URLs in the text, and what I'm going to do is determine where the URLs are and use my mutable index set to record all the indexes of URLs in the string, and just leave all of those alone. So I just use the enumerateMatchesInString: API on my data detector, and whenever I get a result, I add those indexes to my mutable index set. So now I know where all the URLs are in this text.
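The URL-recording step can be sketched like this (again, `string` is assumed):

```objc
#import <Foundation/Foundation.h>

// A data detector that finds URLs (links).
NSError *error = nil;
NSDataDetector *detector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink
                                                           error:&error];

// Record every character index covered by a URL, so those words can be
// skipped later when we change capitalization.
NSMutableIndexSet *urlIndexes = [NSMutableIndexSet indexSet];
[detector enumerateMatchesInString:string
                           options:0
                             range:NSMakeRange(0, [string length])
                        usingBlock:^(NSTextCheckingResult *result,
                                     NSMatchingFlags flags, BOOL *stop) {
    [urlIndexes addIndexesInRange:[result range]];
}];
```

Later, a simple `[urlIndexes containsIndexesInRange:wordRange]` (or a similar membership test) tells you whether a word lies inside a URL.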

Now I'm going to go through the text word by word. This is the enumerateSubstringsInRange: API that I mentioned before, going through by words. Now, as I mentioned, and as you saw in Jennifer's example, the linguistic tagger will also iterate through text, but one thing to keep in mind is that the linguistic tagger analyzes text on a linguistic basis.

So in particular, for words in English like "can't" and "won't" and "should've" that are actually two words smashed together, the linguistic tagger will separate those. But for casing operations, what I want to work with are the actual words, going word by word as they are in the text. So I'm going to enumerate by word, and then I'm going to ask the tagger individually what it thinks the part of speech is for each of those words. And another thing I can find out from the tagger is the range of sentences in the text, because in this case I want to capitalize the first and last words of any whole sentence.

Another thing to keep in mind is that in addition to the words, there's also stuff in between the words, which might be whitespace, punctuation, et cetera, and I want to do something with that, too. Now, one thing to know about this enumerateSubstringsInRange: API is that it actually passes in two ranges. One is the substring range -- that's the range of the thing you're looking at -- and there is also the enclosing range, which includes all the other parts of the text in between, in this case between the words. And it's designed so that the enclosing ranges completely partition the text with no overlap. So for the first word, the enclosing range will include all the text before the first word and the text after it up to the next word, and then for subsequent words, it will include all the text after those words as well.
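The two-range enumeration can be sketched like this; the `before` and `after` ranges computed inside the block are illustrative names for the non-word text on either side of the word.

```objc
#import <Foundation/Foundation.h>

[string enumerateSubstringsInRange:NSMakeRange(0, [string length])
                           options:NSStringEnumerationByWords
                        usingBlock:^(NSString *word, NSRange wordRange,
                                     NSRange enclosingRange, BOOL *stop) {
    // wordRange covers just the word; enclosingRange also takes in the
    // whitespace and punctuation around it.  The enclosing ranges partition
    // the string with no overlap, so every character is visited exactly once.
    NSRange before = NSMakeRange(enclosingRange.location,
                                 wordRange.location - enclosingRange.location);
    NSRange after = NSMakeRange(NSMaxRange(wordRange),
                                NSMaxRange(enclosingRange) - NSMaxRange(wordRange));
    // ... examine `before` and `after` for opening/closing quotes,
    //     parentheses, and so on, then decide how to case `word` ...
}];
```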

So first I'm going to take a look at and analyze the text before the word, if there is any, and the text after the word, and I'll go through that in just a moment. One thing I'm going to do there is the smart quotes. And then I take a look at the text around the word. So if this word is the first word of the text, or if it's the last word of the text, or if it's the last word of a sentence, or if the non-word text before it had something like an opening parenthesis or an opening quote, or if the text after it had something like a closing parenthesis or a closing quote, then I definitely want to capitalize it, no matter what part of speech it is. And then I call this stringForWordRange method that does the actual guts of it. So let's take a look at that.

So, the main thing I do here is ask the tagger: what part of speech is this word? I'm also going to ask the spell checker: do you have a correction for this word? Now, we're not going to try to correct all spelling errors here. We're just going to try to correct those that are case errors, like not capitalizing the P in iPhone.

So, I want to decide whether this word should be capitalized. If it was the first or last word of a sentence, et cetera, then, as determined before, yes, I want to capitalize it. Otherwise, if it's a conjunction like "and" or "or," or a preposition like "of" and "to," then my style manual says that those should be capitalized only if they're long -- only if they're greater than four characters.

If it's a determiner or particle, like an article -- a, an, or the -- it should be capitalized unless it is one of these specific words: "a," "an," "the," or "to." Those are my style rules. If your style rules are different, you could express them in similar ways with these tools. Otherwise, we want to capitalize it.

Now we know whether we want to capitalize it, but we still need to check: is it part of a URL? If it's part of a URL, just leave it alone. If we had a case correction for it from the spell checker -- that is, if the spell checker's correction is the same as our original word except for case; that's this case-insensitive compare -- then I use the spell checker's version. That handles things like iPhone and iPad. Otherwise, if I think it should be capitalized and it's not, capitalize it; if I don't think it should be capitalized and it is, lowercase it. And then one more thing. Remember I mentioned the stringByReplacingOccurrencesOfString: method. I'm going to take any apostrophes that happen to lie in the middle of this word and replace them with curly apostrophes. And that gives me my result for the word.

And now let me take a look at what happens with those inter-word ranges. For the text in between the words, here I'm going to use the tagger and the tagger's enumeration. You might have some whitespace and some punctuation and some more whitespace, say, and I'm going to go through these piece by piece. Remember that the tagger stores the information about the text, so no matter how many times I ask, it's only going to evaluate that piece of text once.

So I go through and find out what each token is. And if it is a single quote, the tagger will tell me whether it's an open quote or a close quote. If it's an open quote, I'm going to replace it with a curly open quote; otherwise, a curly close quote.

Likewise, for a double quote, it will tell me whether it thinks it's an open quote or a close quote, and I replace it with a curly open or close quote accordingly. If the tagger tells me this is whitespace, what I've chosen to do here is replace all the whitespace with just a single space -- I'm going to condense down all the whitespace. Otherwise, I just pass it through and leave it alone.
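The tagger enumeration over one of those inter-word ranges can be sketched like this. The `interWordRange` variable is a hypothetical name for the non-word range being examined, and checking the token text to tell single quotes from double quotes is one possible approach, not necessarily what the demo does.

```objc
#import <Foundation/Foundation.h>

[tagger enumerateTagsInRange:interWordRange
                      scheme:NSLinguisticTagSchemeLexicalClass
                     options:0
                  usingBlock:^(NSString *tag, NSRange tokenRange,
                               NSRange sentenceRange, BOOL *stop) {
    if ([tag isEqualToString:NSLinguisticTagOpenQuote]) {
        // Replace with a curly open quote; look at the token itself
        // (single vs. double character) to pick the right curly form.
    } else if ([tag isEqualToString:NSLinguisticTagCloseQuote]) {
        // Replace with the matching curly close quote.
    } else if ([tag isEqualToString:NSLinguisticTagWhitespace]) {
        // Condense any run of whitespace down to a single space.
    } else {
        // Pass everything else through unchanged.
    }
}];
```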

And so that puts together all of these different techniques for analyzing text and shows them working together in a single example. Now, this example is a bit complicated, so I think what we're going to try to do is make it available as a developer example. It's not up yet, but we'll try to make it available so you can review it. You should keep in mind, though, that this is not to be considered the be-all and end-all of rules for capitalization. It follows one specific set of style manuals; different style manuals have slightly different notions of what should be capitalized.

So let me summarize. The way we want to encourage you to analyze text is by looking for ranges within the text that are of interest and dealing with the text range by range. And the primary method that we make available for this is block iteration -- iterating with blocks over ranges of a specific type. And we have different types of block iteration for different kinds of analysis: the basic analysis provided by NSString, the more sophisticated types provided by regular expressions or data detectors, and the linguistic analysis provided by NSLinguisticTagger, plus text checking at the UI level for things like spell checking and so forth.

And that concludes our discussion of text analysis. For more information, you can contact our evangelist, Bill Dudney. There's documentation at developer.apple.com. Not all of this has documentation yet; specifically for NSLinguisticTagger, I know the documentation is under development and may not be there quite yet. I've been informed that the latest Foundation release notes have been released and should be available to you, and those include some more information on all of these APIs.

For related sessions, you may have gone to What's New in Cocoa and What's New in Cocoa Touch, where they reviewed some of the UI implications of some of these things. Tomorrow at 10:15 there's an internationalization talk on making your application ready for international markets that I encourage you to go to. And that concludes the presentation.