App Frameworks • iOS, OS X • 56:31
Applications deal with large amounts of text in a wide variety of languages and formats. They can benefit from improving their understanding of textual data. Learn how to analyze the content of text using the wide range of features now available on iOS and Mac OS X, including regular expressions, data detectors, linguistic APIs, spellchecking, and more.
Speakers: Doug Davidson, Jennifer Moore
Unlisted on Apple Developer site
Transcript
Good afternoon, everyone. I'm Doug Davidson, and I'm here to talk to you about text processing. We've worked hard to bring you an extensive toolkit for analyzing text because we know that all our devices now have lots of text on them, and we think that your applications can give a better user experience if they understand a little something about the content of that text.
So what we're going to do is first go through how text is represented and its structure and some of the basic APIs for iterating through it. And then we're going to talk about the pieces of this toolkit, things like regular expressions, data detectors, linguistic APIs, text checking. And then we'll have some demos and sample code to put it all together.
Now, I remember when I started out in this business a long time ago, string handling looked a little something like this. Everything was ASCII C strings. Well, those days are long gone. Our text is Unicode now. And no matter what system and platform you're using, chances are that there is an API that deals with strings as sequences of UTF-16 units. And NSString is no exception. Conceptually, an NSString is a sequence of UTF-16 units, what we call unichars.
So there's a great temptation to think that these are really just like the old C strings, except that each character is a little wider, and that the right thing to do is to use the same old techniques and iterate through the string character by character. But really, that is not the way that Unicode works. Let me show you. Let's take a look at some text to start off with.
So here's a simple piece of text. And as we look at it, we notice that it has some structure. So this text is composed of a sequence of paragraphs. And each paragraph is composed of sentences. And then the sentences in turn are composed of words. And then the words in turn are composed of what I'm going to call user-visible characters. And none of these things, not even the user-visible characters, can be directly identified with individual unichars. They're all ranges of unichars in the text.
Let me go back to the slides and say why that's so. So these user-visible characters are actually, in general, character clusters. The official Unicode term for this is a grapheme cluster. Sometimes in API we refer to these as composed character sequences. This is a very characteristic feature of Unicode: multiple unichars can go into making up what is a single user-visible character. So I don't know if you noticed in that example text that I had there, but in the words San José, there's an accented E.
And that is actually not one unichar, it's two: an E with a combining accent mark. And that is just the very simplest example of sequences of characters that combine to form a single user-visible character. There are much more complicated examples. I show one in Korean, and it happens all the time in Vietnamese, where there are very complicated sequences in which multiple unichars combine to form a single user-visible character. And in addition to this, we have surrogate pairs.
Anything in Unicode that goes beyond the Basic Multilingual Plane, that is, beyond the first 64K characters, is going to be represented not by a single UTF-16 unit, but by two. This is called a surrogate pair. And this is not just some abstract thing; it is a very practical issue. The characters beyond the first 64K include things like cuneiform and hieroglyphics, but it's not limited to that. If you're dealing with Chinese text, you'll often see characters that require surrogate pairs.
We now have emoji available for everybody. And most of the emoji characters are also beyond the Basic Multilingual Plane and are going to require surrogate pairs. So in all these cases, we have more than one unichar going to make up a single user-visible character. And you can't split these up when you're processing them.
They have to stay together. You don't want to divide the E from its accent. You definitely can't divide the two halves of a surrogate pair. So when we go through these, we can't deal with the individual unichars; we have to deal, in general, with ranges of unichars. So if we can't use that simple for loop, how do we iterate through the text? Well, we have a device that seems as if it was made for text processing, and that is blocks. So what we provide is a block iteration API on NSString: enumerateSubstringsInRange:. And we specify how it is that we want to enumerate. In this case, we're choosing to enumerate by composed character sequences, that is, by these user-visible characters.
And what this API does is to call your block, the block that you provide here, once for each of these in the range of the string that you asked for. Your block gets passed the range that it's supposed to deal with, and you can do whatever you want with it. This example code was taken directly from the app that I showed, and all it's doing is to add a color to that piece of the text; I chose alternating colors to make it evident.
So, for example, in this string, your block would be called once for the first character, then the next, and so on, until you get to this one that is not a single unichar, but two, and it would be passed the range for that one as well.
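A minimal sketch of this enumeration, using the NSStringEnumerationByComposedCharacterSequences option the session describes; the sample string is my own invention:

```objc
#import <Foundation/Foundation.h>

// Enumerate user-visible characters (composed character sequences),
// so surrogate pairs and combining marks are never split apart.
NSString *text = @"San José";
[text enumerateSubstringsInRange:NSMakeRange(0, [text length])
                         options:NSStringEnumerationByComposedCharacterSequences
                      usingBlock:^(NSString *substring, NSRange substringRange,
                                   NSRange enclosingRange, BOOL *stop) {
    // substringRange may cover more than one unichar, e.g. an "e"
    // followed by a combining accent, or both halves of a surrogate pair.
    NSLog(@"%@ at %@", substring, NSStringFromRange(substringRange));
}];
```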
We can do the same thing with words. And words are an appropriate processing unit for many different tasks, things like uppercasing, recognition of words, spellchecking, and correction. One thing to keep in mind with words is that not all languages separate their words by whitespace. So you can't rely on whitespace as an indicator of word boundaries. So we have, again, an API that will deal with this.
Again, it's the same API with a different constant: enumeration by words. So if you call this enumeration with the by-words constant, then your block, again, will be called once for each word in the range of the string that you asked for. For example, in this string, it would be called for the first word, the second word, and so on. And in each case, it will be passed the range of that word in the text. And you can do with it whatever you want.
Same thing for other units like sentences and paragraphs. Paragraphs are important because they're in some sense the maximal unit for Unicode processing: none of the Unicode algorithms require you to go beyond the bounds of a paragraph. In some cases they might have an effect on the whole paragraph; for example, some features in bidi (bidirectional text) processing affect the whole paragraph.
So again, it will be just the same API that you can call to iterate through paragraph by paragraph. And your block in this case will be called once for each paragraph with the range of the paragraph. And one useful pattern is to nest these. So you could go through your text first by paragraphs and then within the paragraph by some smaller unit as necessary.
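The nesting pattern described above might look something like this sketch; the sample string is mine:

```objc
#import <Foundation/Foundation.h>

// Nest the enumerations: paragraphs on the outside,
// then words within each paragraph.
NSString *text = @"First paragraph here.\nSecond paragraph here.";
[text enumerateSubstringsInRange:NSMakeRange(0, [text length])
                         options:NSStringEnumerationByParagraphs
                      usingBlock:^(NSString *paragraph, NSRange paragraphRange,
                                   NSRange paragraphEnclosingRange, BOOL *stop) {
    [text enumerateSubstringsInRange:paragraphRange
                             options:NSStringEnumerationByWords
                          usingBlock:^(NSString *word, NSRange wordRange,
                                       NSRange wordEnclosingRange, BOOL *innerStop) {
        NSLog(@"word: %@", word);
    }];
}];
```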
Now, of course, block iteration is wonderful, but it's not the only thing that you want to do with text. So there are many other APIs on NSString. For example, the range of string APIs are the Unicode-savvy way for finding an instance of a string within a piece of text.
In this case, I've chosen to look for the string "resume", or maybe what I'm matching is "résumé", because I've chosen as options not only case-insensitive search but also diacritic-insensitive search. So this search will find that string whether it has diacritics on it or not. And this is a very powerful API that can be used for many kinds of locating of substrings within a string.
If I add one additional option, then instead of looking for that string throughout the text, I can decide whether I have matched it at a particular location. That's the anchored search option. It doesn't go looking through the entire string; it just looks at the start of the range that I specify and says: is that string there or not? And it will match in a fully Unicode-savvy way with all the options that I specified, so it will ignore differences of case and differences of diacritics. In this case, I've also chosen to ignore the difference between half-width and full-width characters.
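A sketch of these searches with the options the session names; the sample string is my own:

```objc
#import <Foundation/Foundation.h>

NSString *text = @"Please submit your résumé today.";
NSStringCompareOptions opts =
    NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;

// Finds "résumé" even though we searched for plain "resume".
NSRange found = [text rangeOfString:@"resume" options:opts];

// Anchored variant: only reports a match if it occurs exactly
// at the start of the range we pass in.
NSRange anchored = [text rangeOfString:@"please"
                               options:opts | NSAnchoredSearch
                                 range:NSMakeRange(0, [text length])];
```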
And if what I want is not just to see whether it matches, but to get a comparison result, whether it sorts before or after, I use the compare API on NSString. And that returns a comparison result with all the same options.
And there are many other APIs on NSString, for deciding whether a string has a given prefix or suffix, for example. If you want to look for characters from a character set instead of a specific string, you can do that. And none of these APIs, you'll notice, mention individual unichars. They're all Unicode savvy, so they deal with ranges of text in the string.
And they find it in a Unicode-conformant way. One other set of methods I want to mention is our powerful search-and-replace methods, where you can find all instances of a specific string and replace them with something else. And we have two variants of that: one that takes an immutable string and returns a modified copy, and another that takes a mutable string and changes it in place.
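The two variants might be sketched like this, assuming a made-up string:

```objc
#import <Foundation/Foundation.h>

NSString *s = @"gray goose, gray sky";

// Immutable variant: returns a modified copy.
NSString *replaced =
    [s stringByReplacingOccurrencesOfString:@"gray"
                                 withString:@"grey"
                                    options:NSCaseInsensitiveSearch
                                      range:NSMakeRange(0, [s length])];

// Mutable variant: modifies the string in place and
// returns the number of replacements made.
NSMutableString *m = [NSMutableString stringWithString:s];
NSUInteger count = [m replaceOccurrencesOfString:@"gray"
                                      withString:@"grey"
                                         options:NSCaseInsensitiveSearch
                                           range:NSMakeRange(0, [m length])];
```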
Okay, so that's a very simple set of text processing operations. But maybe we wanted to do something a little more sophisticated. Maybe instead of finding all words in the text, I wanted to find, say, all words that begin with "th." Or maybe I wanted to find all the URLs in this piece of text. Or perhaps what I wanted to do was to find all of the nouns in this piece of text. Or maybe all the adjectives. Or maybe what I wanted to do was to find all the names in this piece of text.
[Transcript missing]
We make this available via a class called NSRegularExpression. This is available on Mac OS X starting with Lion, and on iOS starting with iOS 4. NSRegularExpression represents a compiled regular expression. You create a regular expression object by passing in a regular expression pattern. So here's a contrived little example of a pattern. First of all, there's a backslash-b, which matches a word boundary.
Of course, the backslash has to be escaped with another backslash because it appears in a string constant. Then there's a parenthesized subexpression, which matches an I or an O. Another parenthesized subexpression matches an F or an N, and then another backslash-b for a word boundary. So what this does is to locate two-letter words that start with I or O and end with F or N.
The option I've chosen to use here is the case-insensitive option. So this will find these two-letter words, whether they're lowercase or uppercase or any combination. Pretty simple. So what are we going to do with this regular expression object once we've created it? Well, naturally, the thing to do is a block iteration. And that is the basic API on this regular expression: enumerate matches in string. It goes through and calls the block that you supply for each match of that regular expression in the range of the string that you're targeting.
And then, again, in your block you can do whatever it is you need to do with that match of the regular expression. So, for example, for this string, it would be called once for the first match, then the second match, the third, and so on.
[Transcript missing]
So, for example, that particular regular expression I demonstrated has two capture groups. So what we could do is in our block when it gets called, we can find, first of all, the overall range of the match. And then the range of the first capture group, that's that first sub-expression, which is just the first letter in this case. And then the range of the second capture group, in this case just the second letter. And then once we have that, we have everything that there is to know about the match of that regular expression. And we can do whatever it is we want to with it.
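Putting the pattern and the capture groups together, a sketch might look like this; the sample string is my own:

```objc
#import <Foundation/Foundation.h>

NSError *error = nil;
// \b([io])([fn])\b : two-letter words starting with i/o, ending with f/n.
NSRegularExpression *regex =
    [NSRegularExpression regularExpressionWithPattern:@"\\b([io])([fn])\\b"
                                              options:NSRegularExpressionCaseInsensitive
                                                error:&error];
NSString *text = @"If it is on, then it is in.";
[regex enumerateMatchesInString:text
                        options:0
                          range:NSMakeRange(0, [text length])
                     usingBlock:^(NSTextCheckingResult *match,
                                  NSMatchingFlags flags, BOOL *stop) {
    NSRange whole  = [match range];          // the overall match
    NSRange first  = [match rangeAtIndex:1]; // first capture group: the i/o
    NSRange second = [match rangeAtIndex:2]; // second capture group: the f/n
    NSLog(@"matched %@", [text substringWithRange:whole]);
}];
```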
Now, again, as I say, block iterations are wonderful, but they're not the only thing. We have convenience methods if what you want to do is, let us say, find all the matches of your regular expression as an array, or just count them, or just find the first one, or maybe all you need is the range of the first one. So if you just want to find the range of the first match of the regular expression in the string, one line of code, you get it, and you've found where the first match is.
We also have a very common operation with regular expressions, that is find and replace. And again, just like with NSString, we have one method for immutable strings that will take it and return a modified copy, and one method for mutable strings that will modify it in place. Now, the thing that we're replacing with for a regular expression case, it could be a literal string, but in general, it is a template that can include these meta variables like $0, $1, $2, where $0 represents the overall match of the regular expression. $1 is match for the first capture group. $2, second capture group.
This is very typical for regular expression find and replace. So here I chose to do $2, $1, that is the second capture group, followed by the first. So if you're familiar with regular expressions, you probably know what this is going to do to this string. It's just going to change the order of the letters in each of these two-letter words.
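A sketch of that find-and-replace, reusing the contrived two-letter-word pattern; the sample string is mine:

```objc
#import <Foundation/Foundation.h>

NSError *error = nil;
NSRegularExpression *regex =
    [NSRegularExpression regularExpressionWithPattern:@"\\b([io])([fn])\\b"
                                              options:NSRegularExpressionCaseInsensitive
                                                error:&error];
NSString *text = @"If it is on, then it is in.";

// The template $2$1 emits the second capture group, then the first,
// swapping the two letters of each matched word.
NSString *swapped =
    [regex stringByReplacingMatchesInString:text
                                    options:0
                                      range:NSMakeRange(0, [text length])
                               withTemplate:@"$2$1"];
```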
Now, if you're going to use it more than once, it's more efficient to create an NSRegularExpression object and hold onto it. They're convenient, they're fast, they're thread safe. But if all you want to do is one operation, we have some convenience methods on NSString. You don't have to create a regular expression object yourself. But this is not new API.
It's just the same old API, rangeOfString:, with a new option, NSRegularExpressionSearch, that causes the thing you're looking for to be treated as a regular expression pattern instead of a literal string. And likewise, you can use the basic string search and replace with the NSRegularExpressionSearch option, which tells it to treat the thing you're looking for as a regular expression pattern and the thing you're replacing with as a regular expression template.
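These one-shot conveniences might be sketched like this; the pattern and sample string are my own:

```objc
#import <Foundation/Foundation.h>

NSString *text = @"thin thread, thick thumb";

// One-shot regex search: no NSRegularExpression object needed.
NSRange r = [text rangeOfString:@"\\bth\\w+"
                        options:NSRegularExpressionSearch];

// One-shot regex replace; the replacement string is treated
// as a template, so $1 refers to the first capture group.
NSString *result =
    [text stringByReplacingOccurrencesOfString:@"\\bth(\\w+)"
                                    withString:@"TH$1"
                                       options:NSRegularExpressionSearch
                                         range:NSMakeRange(0, [text length])];
```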
All right. Suppose we want to go a little further and look not for a regular expression, but suppose we're looking for something like a URL or a phone number. Now, people sometimes do try to use regular expressions for this sort of thing, which is maybe okay if you're just looking for a single pattern, say a single pattern of phone numbers. But once you try to look for multiple ones, it starts to get complicated and expensive quickly.
Fortunately, we have something better, and that is an Apple technology called data detectors, which locates interesting things like URLs, phone numbers, dates, and addresses in a large variety of international formats. And it does so very efficiently, much more efficiently than regular expressions could. You may be familiar with this from the UI context, where data detectors locate things in the text you're looking at.
And, you know, provides a way to do something useful with them quickly, like show a location on a map. But what we're talking about here is the API version of this, low-level, foundation-level API, which we expose via a class that's a subclass of NSRegularExpression called NSDataDetector, again available on Mac OS X starting with Lion, and on iOS starting with iOS 4.
So NSDataDetector is a subclass, as I said, of NSRegularExpression. And you create an NSDataDetector, but not with a regular expression pattern. You create it instead by specifying which kinds of data detector types you want this detector to find. In this case, I've chosen the link type, which finds URLs, and the phone number type, which finds phone numbers. There are a bunch of different types.
You can look for any or all of dates, addresses, URLs, phone numbers, et cetera. And since this is a subclass of NSRegularExpression, the basic API, again, is a block iterator. For example, on this text, your block would be called first with the first thing it finds, a phone number here, and second with the second thing it finds, a URL.
Now, this is the point at which we get very glad that we used a match object in our regular expression API because now this match object can carry all sorts of other interesting properties that are useful when you find a data detector type. For example, if what you found is a date, there's a date property that can get that date as a parsed date. If it's an address, there's a components property that gets all the different components of the address. If it's a URL, you can get it as a URL. Phone number, there's a phone number property.
And here's where we use the result type of the NSTextCheckingResult object that expresses the match. That will tell us what type of thing was found. So, for example, the data detector that I created here finds two different types of things, and so we need to distinguish: was it a link type, that is, a URL, that was found? If so, we can get the URL from that match object. Or maybe it was a phone number that was found, in which case we can get the phone number from the object, and then do with it whatever we need to do.
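A sketch of the detector and the type check, using the link and phone number types named above; the sample text is invented:

```objc
#import <Foundation/Foundation.h>

NSError *error = nil;
NSDataDetector *detector =
    [NSDataDetector dataDetectorWithTypes:(NSTextCheckingTypeLink |
                                           NSTextCheckingTypePhoneNumber)
                                    error:&error];
NSString *text = @"Call 1-800-555-0199 or visit http://www.apple.com";
[detector enumerateMatchesInString:text
                           options:0
                             range:NSMakeRange(0, [text length])
                        usingBlock:^(NSTextCheckingResult *match,
                                     NSMatchingFlags flags, BOOL *stop) {
    if ([match resultType] == NSTextCheckingTypeLink) {
        NSURL *url = [match URL];              // the parsed URL
        NSLog(@"found URL: %@", url);
    } else if ([match resultType] == NSTextCheckingTypePhoneNumber) {
        NSString *phone = [match phoneNumber]; // the phone number string
        NSLog(@"found phone number: %@", phone);
    }
}];
```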
Again, there are convenience methods if you don't want to iterate through all the matches, if you just want to get them all as an array or count them or find the first one or the range of the first one. If you just want to find the range of the first one, it's a single line of code like this.
Okay, so that's data detectors. Now, suppose we want to go a little deeper and find out something linguistic about our text. So we have a new API for that, new in Mac OS X Lion and iOS 5, called linguistic tagging. The basic premise of linguistic tagging is that we take a piece of text and divide it up into pieces that we call tokens. And then there are various layers of linguistic information that we might want to find out about this text.
So the most basic and trivial is what kind of token is this? Is this a word? Is this white space? Is it punctuation? Okay, that's pretty obvious. But there are many others. For example, we might want to find out what language is this text in? And we have a statistical model that will tell us as best it can what language this text is in, not currently at the individual word level, but definitely the paragraph level, and we'll return that.
And going a little deeper, we might want to find out what part of speech these words are. And again, we have a statistical language model that will tell us, to the best of its ability, what part of speech is of the word as it is used in this context.
So there are a number of these different layers. We refer to these -- the layers as tag schemes, and the individual pieces of information they return are called tags. So the simplest ones are the basic type, that is, word or white space or punctuation, et cetera. The language, that's returned as a standard language tag, like EN or FR or DE, so forth. What script is the text in? Latin script, Cyrillic, Arabic, Chinese, what have you. All of these things are available for many different languages and scripts on both Mac OS X and iOS.
The more sophisticated tag schemes are, first of all, parts of speech, what we call in technical terms lexical class: is this word, as used in context, a noun or verb or adjective or what have you? Then we have named entity recognition, which finds out whether a piece of text looks like it's being used in context as a name, and moreover, does it look like a personal name or a place name or an organization name? And also we have what we call the lemma. That is, if we know this word, we can give you the root form of the word without any inflectional suffixes and so forth, if that's available.
Now, these are more sophisticated, and they're not available for all languages. On Lion, we have lexical class and lemmas for English, French, and German, and we have named entity recognition for English. On the version of iOS 5 you have now, all of these are available only for English.
So how does this work? We have the class NSLinguisticTagger; that is the interface for all of this. And you create an NSLinguisticTagger object by specifying a set of schemes. In this case, I've chosen that I want to be able to get out the basic token type scheme, the lexical class scheme that gives me parts of speech, and the name type scheme that gives me names.
Once you have your tagger, then you take the string that you want to analyze and you attach that string to the tagger. This is important because the linguistic tagger keeps a bunch of state about that string so that it only has to analyze a particular piece of it once, no matter how many times you ask.
And then, of course, I think you can guess what the basic API is for it. It's a block iteration. Enumerate tags in range. You tell the tagger, Go through this range of my string and tell me for each token in the string what the tag is for in the particular scheme. So in this case, I'm using the lexical class scheme, so it will tell me the parts of speech for each of these tokens.
And it passes in that tag and the range of the token and also the range of the sentence within which it lives. And then I can take a look at it and do whatever it is I need to do with it. For example, if it's a noun, do something with it. This code is taken directly from that app that I showed.
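A sketch of that noun-finding enumeration, assuming a sample sentence of my own; the scheme, tag, and option constants are the ones named in the session:

```objc
#import <Foundation/Foundation.h>

NSLinguisticTagger *tagger =
    [[NSLinguisticTagger alloc]
        initWithTagSchemes:@[NSLinguisticTagSchemeLexicalClass]
                   options:0];
NSString *text = @"The quick brown fox jumped over the lazy dog.";
tagger.string = text;  // attach the string; the tagger caches its analysis

[tagger enumerateTagsInRange:NSMakeRange(0, [text length])
                      scheme:NSLinguisticTagSchemeLexicalClass
                     options:(NSLinguisticTaggerOmitWhitespace |
                              NSLinguisticTaggerOmitPunctuation)
                  usingBlock:^(NSString *tag, NSRange tokenRange,
                               NSRange sentenceRange, BOOL *stop) {
    if ([tag isEqualToString:NSLinguisticTagNoun]) {
        // If it's a noun, do something with it.
        NSLog(@"noun: %@", [text substringWithRange:tokenRange]);
    }
}];
```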
Again, block iteration is not the only thing you might want to do, so we have some additional methods if you just want to find the tag for a particular token, or if you want to get the tags for a set of tokens as an array. We also have methods for that. So if you just want to find out what is the part of speech for this particular word here at this index, you call this method and it gets back that tag.
Now, we use this for a number of different things. We use it to assist with auto-correction for dictionary lookup and things like that. If you're concerned about performance, one thing I might mention is currently on Lion, we're doing this essentially full-time for the text you type to help in correcting it. But I'm sure that there are plenty of other potential applications for this that we haven't thought up yet and that you may be able to come up with. But right now, I'd like to bring up my colleague Jennifer Moore to show one particular demo application. Hi.
This demo is a sample application showing how we can use these linguistic APIs to provide a more visually compelling experience of the text in our application. I have a set of biographies of some of the top trending artists on iTunes right now. And what I want to do is use the linguistic tagger to extract the most frequent words out of these texts and present them to the user in the form of a nice big word cloud, so we get a nice visual look at the content of our data. So to do that, it's very simple.
First, we start by initializing our linguistic tagger, as Doug showed before, with a set of tag schemes. For now, we're going to start off with the lexical class tag scheme and also the name type tag scheme. And then for each one of our biographies, it's very simple; this is the core of our work here. For each of our texts, we're going to associate that text with our NSLinguisticTagger, with setString:.
And then we're going to use our block API to iterate over all the words in our text. So we call enumerate tags in range with our text range. And for our purposes, we're going to use a lexical class tag scheme because for our word cloud, we don't really want words like of and the.
These are kind of boring words. What we really want are all the nouns of the text. These are the words that convey the most information in our text. So what we're going to do is, if the word is of the type noun, we're going to extract that word from our text and add it to our set of words with the corresponding count, keeping track of the count. And the rest of it is just getting the most frequent words out of our set and displaying them in the view, which I'm not going to show here, but it's pretty straightforward. So all we do now is build and run.
And now we have a nice word cloud; these are the most frequent words of our text. And you see the words that pop out. Music, pop: that's the genre of music that's most popular right now. But, you know, these are biographies of people, and wouldn't it be interesting if we could look at all the names across our text and maybe capture some of the relationships, artists who were influenced by other artists? So we can do that.
And the change is very simple. We go back to our code here. Instead of NSLinguisticTagSchemeLexicalClass, let's just do name type. And if we're looking at names, sometimes names can have multiple tokens as part of the name. So we have an option for that. It would be nice to have one range for the entire name. So we can use this option here.
the NSLinguisticTaggerJoinNames option. And then instead of nouns here, we're going to look for, let's say, all the personal names. And that's it. That's all the change that we have to make. So we run this again. And now these are all the names across all of our biographies. And you see Alanis Morissette was not one of the biographies we were looking at, but she influenced a lot of musicians. So that's our word cloud.
Now, I think we said earlier that these APIs are cross-platform, so I just want to show that to you now. Almost the exact same code. Although not quite as flashy, but... So we have all the words, all the nouns in our text, and we can also get all of the names in our text. and that's linguistic tagging on iOS and Mac OS X.
[Transcript missing]
So now what I want to do is to go on to the piece of this that has similar, but not identical, APIs on iOS and Mac OS X, because these APIs are at the UI level, in AppKit or UIKit. And this is text checking, which is represented by NSSpellChecker in AppKit on Mac OS X and UITextChecker in UIKit on iOS. And this does a number of different things, but the most significant and important thing is spellchecking.
And the various UI text editing widgets on both platforms all use it already by default, but we also make it available to you as an API so you can use it wherever you need it. So, NSSpellChecker in AppKit: the way this is organized is that there's a single shared spellchecker object for the whole application, and you can just get it.
And then you can ask it to do some checking of a piece of text. And actually NSSpellChecker will check for a number of different things. It can look for misspelled words. It can look for potential grammar mistakes. It can find corrections to the text. It can do smart quotes and smart dashes, and it also does text replacement, which is a sort of shortcut feature where you type a piece of text and it gets replaced by something else that the user has specified. The way that it works, unfortunately we don't yet have a neat block iteration API for this: you call it and you get back an array of, again, NSTextCheckingResult objects. You specify what kinds of things you want it to look for.
And then you get this array back of results, and you can go through the results and, for each one, look at the result type and the result range, and see what kind of thing it is that was found and what the range is, and possibly other attributes that tell you what you might want to do with it. This is a synchronous API; there's also a separate asynchronous API where you can call and get results back at some later point, when they might be available.
So, for example, if you were looking for misspelled words in this piece of text, you would get back two different results, and you would go through them one by one, and it would tell you what the range of each is in the text. As I say, there are different kinds of things that a spellchecker will check for: spelling errors, possible grammar mistakes, straight quotes that might be substituted smartly by curly quotes, same thing for dashes, the text replacements, and potential corrections. That is not the only API on a spellchecker. There are also specific methods for doing things like spellchecking. If you want to go through and find the first misspelled word in a range of text, this call will do it.
If you have a misspelled word and you want to find out what the spellchecker wants to suggest as possible alternatives for it, you can get several different alternatives by calling guessesForWordRange:. If you want to find the single proposed correction, if any, you can call correctionForWordRange:.
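A sketch of those calls on the shared spellchecker; the sample string, language code, and document tag value are my own choices:

```objc
#import <AppKit/AppKit.h>

NSSpellChecker *checker = [NSSpellChecker sharedSpellChecker];
NSString *text = @"This sentense has a mistake.";

// Find the first misspelled word at or after the given offset.
NSRange misspelled = [checker checkSpellingOfString:text startingAt:0];

if (misspelled.location != NSNotFound) {
    // Several suggested alternatives for the misspelled word...
    NSArray *guesses = [checker guessesForWordRange:misspelled
                                           inString:text
                                           language:@"en"
                             inSpellDocumentWithTag:0];
    // ...or the single proposed correction, if any.
    NSString *correction = [checker correctionForWordRange:misspelled
                                                  inString:text
                                                  language:@"en"
                                    inSpellDocumentWithTag:0];
}
```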
And there is also, as I mentioned, the UI text editing objects like NSTextView and WebView will automatically use NSSpellChecker to do spellchecking and correction and so on. And they all have some API that allow you to turn on and off all these different facilities on a specific view-by-view basis.
The user can also turn them on and off. So there are user preferences available for these things as well. And in addition, at NSTextView level, you can customize all this by overriding the methods NSTextView has to call NSSpellChecker and to get the results back. So you can customize it in any way you want.
Something that you may have seen on Mac OS X Lion is that we have a new UI for autocorrection that is very similar to the iOS autocorrection UI. This is something new that we provide, and NSTextView and WebView will automatically use it if autocorrection is turned on.
But as I say, it can be turned on and off per view, or the user can decide whether or not to use it globally. But if you have your own TextView, we also have APIs that allow you to bring up these correction indicators as well if you need to.
So there's an API to show one of the correction indicators, also one to dismiss it, and also there's the record response API where you tell the spellchecker, "Did the user accept this correction, or did they refuse it, or did they go back and edit it and revert it?" So that the spellchecker can learn from the user's behavior and customize its autocorrections and not do the ones the user doesn't want.
On the iOS side, there is UITextChecker, which provides spellchecking. It doesn't yet do a number of the other things that NSSpellChecker does, and it uses a slightly different model. For UITextChecker, you don't use a single shared instance; you create a new instance for each document that you want to check.
You create one. Then UITextChecker has the basic sort of API where you go through a piece of text and find the first misspelled word, and you call it again to find the next misspelled word. And again, once you have found a misspelled word, you can ask what the possible correction alternatives are that you might want to suggest to the user. These are the sorts of things you would see if you tap on a misspelled word in iOS: you get some alternatives to correct it to. And those are the sorts of things that this API would provide.
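The per-document model might be sketched like this; the sample string and language code are my own:

```objc
#import <UIKit/UIKit.h>

// One checker per document, rather than a shared instance.
UITextChecker *checker = [[UITextChecker alloc] init];
NSString *text = @"This sentense has a mistake.";
NSRange searchRange = NSMakeRange(0, [text length]);

// Find the first misspelled word; call again with a later
// starting offset to find the next one.
NSRange misspelled = [checker rangeOfMisspelledWordInString:text
                                                      range:searchRange
                                                 startingAt:0
                                                       wrap:NO
                                                   language:@"en_US"];
if (misspelled.location != NSNotFound) {
    // Alternatives to suggest to the user, as in the iOS tap-to-correct UI.
    NSArray *guesses = [checker guessesForWordRange:misspelled
                                           inString:text
                                           language:@"en_US"];
}
```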
[Transcript missing]
This demo is perhaps not quite as flashy as Jennifer's, but I chose it because it provides a good example of how to use all these different pieces in the toolkit together. And so what this application does is it takes a piece of text and it does a specific kind of capitalization on it, the sort of capitalization you would expect if you're going to show something as a title. Sometimes it's called headline style.
So I enter the text in its original capitalization, and it comes out with certain words capitalized. Now, the rules for headline style, if you look in the style manuals, are kind of complicated and fussy, even just for English. I'm doing this only for English here as a demo.
They generally depend not just on what word it is or how long it is, but also on what part of speech the word is, and even on what part of speech it's being used as in the text. So, for example, most of the long words get capitalized. Some of the short words don't get capitalized, but usually those are just things like the articles and prepositions; verbs like "is" will be capitalized even though they're short.
And there are a number of other twists and turns. So let's see, this has a few other features. For example, this particular app knows how to capitalize things like iPhone and iCloud. And you notice that the first word of a title is always capitalized, even if it otherwise would not be. So "on" is capitalized as the first word, even though it's not when it appears in the middle of the text.
And if I do something like make part of this parenthesized, the word after the opening parenthesis gets capitalized as well. And you notice I have also done some nice smart quoting, so I convert the straight quotes into curly quotes.
And one last thing to show: I have a little quote from Shakespeare here. You'll notice that the word "on" in this quote is capitalized; it was not capitalized in these previous
[Transcript missing]
So this Titleizer is going to make use of a linguistic tagger. The first thing I do when I create it is create myself a linguistic tagger. And the tag scheme I've chosen to use is one I haven't mentioned before: our top-level tag scheme, NSLinguisticTagSchemeNameTypeOrLexicalClass. If a token is a name, it will tell you what kind of name it is; if it's not, it will tell you what part of speech it is.
And I'm also going to use a data detector for finding URLs because, you know, I don't want to go and change the capitalization of URLs. That doesn't really make sense. So I'll find the URLs, and I'm going to use a mutable index set with that; I'll show you that in a minute.
This is the external API for this, titleStringForString:. Basically the first thing I'm going to do is associate my string with my tagger so I can get tagging results for it. And as I said, I'm going to do this just for English, so I'm going to tell the linguistic tagger to treat the text as English. If I don't do this, the tagger will try to identify the language of the text, but here I want to treat it all as English.
And then I'm going to find the URLs in the text. What I'm going to do is determine where the URLs are, use my mutable index set to record all the indexes of URLs in the string, and just leave all those alone. So I just use the data detector's enumerateMatchesInString: API.
Whenever I get a result, I add those indexes to my mutable index set, so now I know where all the URLs are in this text. Now I'm going to go through the text word by word, using the enumerateSubstringsInRange: API that I mentioned before, going through by words.
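The URL-tracking step can be sketched like this. The session uses NSDataDetector; since that class is Apple-platform specific, an NSRegularExpression with a deliberately simple, illustrative URL pattern stands in for it here, and Swift's IndexSet plays the role of NSMutableIndexSet.

```swift
import Foundation

// Record the character indexes covered by URL-like matches so later per-word
// processing can skip anything inside a URL. NSRegularExpression with a toy
// pattern stands in for NSDataDetector; the pattern is not exhaustive.
func urlIndexes(in text: String) -> IndexSet {
    var covered = IndexSet()
    guard let regex = try? NSRegularExpression(pattern: "https?://[^\\s]+") else {
        return covered
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    regex.enumerateMatches(in: text, options: [], range: fullRange) { match, _, _ in
        if let match = match {
            // Add every index the match covers, mirroring NSMutableIndexSet.
            covered.insert(integersIn: match.range.location ..< (match.range.location + match.range.length))
        }
    }
    return covered
}
```

Later, a word is skipped if its range intersects the recorded set, which is the "just leave those alone" check.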
Now, as I mentioned, and as you saw in Jennifer's example, the linguistic tagger will also iterate through text. But one thing to keep in mind is that the linguistic tagger analyzes text on a linguistic basis. So in particular, for English words like "can't" and "won't" and "should've" that are actually two words smashed together, the linguistic tagger will separate them.
But for casing operations, what I want to work with is the actual words as they appear in the text, going word by word. So I'm going to enumerate by word, and then ask the tagger individually what it thinks the part of speech is for each word. Another thing I can find out from the tagger is the range of sentences in the text, because I want to capitalize, in this case, the first and last words of any whole sentence.
Another thing to keep in mind is that in addition to the words, there's also the stuff in between the words, which might be white space, punctuation, et cetera, and I want to do something with that, too. Now, one thing to know about this enumerateSubstringsInRange: API is that it actually passes in two ranges.
One is the substring range; that's the range of the thing you're looking at. And there's also the enclosing range, which includes the other parts of the text, in this case between the words. It's designed so that the enclosing ranges completely partition the text with no overlap. For the first word, the enclosing range will include all the text before the first word plus the text after it, up to the next word; for each subsequent word, it will include the text after that word as well.
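The partition property just described can be made concrete with a toy model. This is not the real Foundation implementation: words here are simply maximal runs of letters, and indexes are unicode-scalar offsets. What it demonstrates is the contract, namely that the enclosing ranges tile the whole string with no gaps and no overlap.

```swift
import Foundation

// A toy model of the substringRange/enclosingRange contract: the first word
// absorbs any leading text, and every word absorbs the trailing text up to
// the start of the next word (or the end of the string).
func wordRanges(in text: String) -> [(substring: Range<Int>, enclosing: Range<Int>)] {
    let scalars = Array(text.unicodeScalars)
    var words: [Range<Int>] = []
    var start: Int? = nil
    for (i, s) in scalars.enumerated() {
        if CharacterSet.letters.contains(s) {
            if start == nil { start = i }
        } else if let begun = start {
            words.append(begun..<i)
            start = nil
        }
    }
    if let begun = start { words.append(begun..<scalars.count) }

    var result: [(substring: Range<Int>, enclosing: Range<Int>)] = []
    for (i, word) in words.enumerated() {
        let lower = (i == 0) ? 0 : word.lowerBound
        let upper = (i == words.count - 1) ? scalars.count : words[i + 1].lowerBound
        result.append((substring: word, enclosing: lower..<upper))
    }
    return result
}
```

Because the enclosing ranges tile the string, processing each word's enclosing range once touches every character exactly once, which is what makes the smart-quote pass over the inter-word text safe.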
So first I'm going to take a look at and analyze the text before the word, if there is any, and the text after the word. I'll go through that in just a moment. One thing I'm going to do there is the smart quotes, for example.
Then I take a look at the word itself. If this word is the first word of the text, or the last word of the text, or the first or last word of a sentence, or if the non-word text before it had something like an opening parenthesis or an opening quote, or if the text after it had something like a closing parenthesis or a closing quote, then I definitely want to capitalize it, no matter what part of speech it is. And then I call this stringForWordRange: method that does the actual guts of it. So let's take a look at that.
So the main thing I do here is ask the tagger, what part of speech is this word? I'm also going to ask the spellchecker, do you have a correction for this word? Now, we're not going to try to correct all spelling errors here. We're just going to try to correct those that are case errors, like not capitalizing the P in iPhone.
So I want to decide whether this word should be capitalized. If it was the first or last word of a sentence, et cetera, as determined before, yes, I want to capitalize it. Otherwise, if it's a conjunction, like "and" or "or," or a preposition, like "of" or "to," then my style manual says those should be capitalized only if they're long, that is, greater than four characters.
If it's a determiner or particle, like an article, it should be capitalized unless it is one of these specific words: "a," "an," "the," or "to." Those are my style rules; if your style rules are different, you could express them in similar ways with these tools. Otherwise, we want to capitalize it.
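The decision logic above can be sketched as a small pure function. The tag strings used here ("Conjunction," "Preposition," "Determiner," "Particle") are plain-string stand-ins for NSLinguisticTagger's tag constants, and the `forced` flag summarizes the earlier checks (first/last word of a sentence, adjacent quote or parenthesis); both simplifications are assumptions of this sketch.

```swift
// A sketch of the headline-capitalization rules described above.
// `forced` means an earlier check (sentence boundary, adjacent quote or
// parenthesis) already decided the word must be capitalized.
func shouldCapitalize(word: String, tag: String, forced: Bool) -> Bool {
    if forced { return true }
    switch tag {
    case "Conjunction", "Preposition":
        // Short conjunctions and prepositions stay lowercase.
        return word.count > 4
    case "Determiner", "Particle":
        // A few specific short words always stay lowercase.
        return !["a", "an", "the", "to"].contains(word.lowercased())
    default:
        // Everything else (nouns, verbs like "is", etc.) gets capitalized.
        return true
    }
}
```

Expressing the rules this way makes it easy to swap in a different style manual: only the switch body changes.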
Now we know whether we want to capitalize it, but we still need to check: is it part of a URL? If it's part of a URL, just leave it alone. If we had a case correction for it from the spellchecker, that is, if the spellchecker's correction is the same as our original word except for case (that's this case-insensitive compare), then I use the spellchecker's version for things like iPhone and iPad.
Otherwise, if I think it should be capitalized and it's not, I capitalize it; if I don't think it should be capitalized and it is, I lowercase it. And then one more thing: remember I mentioned the stringByReplacingOccurrencesOfString: method. I'm going to take any apostrophes that happen to lie in the middle of this word and replace them with curly apostrophes. And that gives me my result for the word.
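That last per-word step is a one-line replacement. In Swift the same Foundation method surfaces as replacingOccurrences(of:with:); the function name here is just for the sketch.

```swift
import Foundation

// Replace straight apostrophes inside a word with typographic (curly)
// apostrophes, using Foundation's replacing-occurrences method.
func smartApostrophes(in word: String) -> String {
    return word.replacingOccurrences(of: "'", with: "\u{2019}")
}
```

This only handles apostrophes inside a word; quotes between words are classified as open or close separately.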
And now let me take a look at what happens with those interword ranges. For the text in between the words, I'm going to use the tagger and the tagger's enumeration. You might have some white space and some punctuation and some more white space, say, and I'm going to go through these piece by piece. Remember that the tagger stores the information about the text, so no matter how many times I ask, it's only going to evaluate that piece of text once.
So I go through and find out what that token is. And if it is a single quote, the tagger will tell me whether it's an open quote or a closed quote. If it's an open quote, I'm going to replace it with a curly open quote, otherwise a curly closed quote.
Likewise for a double quote, it will tell me whether it thinks it's an open quote or a close quote, and I replace it with a curly open or close quote accordingly. If the tagger tells me this is white space, what I've chosen to do here is condense all the white space down to just a single space. Otherwise, I just pass it through and leave it alone.
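The session relies on NSLinguisticTagger to classify each quote as open or close. A portable sketch can approximate that classification with a simple positional heuristic, which is an assumption of this example, not what the tagger actually does: a quote counts as opening when it begins the string or follows whitespace or an opening bracket.

```swift
import Foundation

// Convert straight quotes to curly quotes. A positional heuristic stands in
// for the tagger's open/close classification: a quote is "open" when it
// starts the text or follows whitespace or an opening bracket.
func smartQuotes(in text: String) -> String {
    var result = ""
    var previous: Character? = nil
    for ch in text {
        switch ch {
        case "\"":
            result.append(isOpeningPosition(previous) ? "\u{201C}" : "\u{201D}")
        case "'":
            result.append(isOpeningPosition(previous) ? "\u{2018}" : "\u{2019}")
        default:
            result.append(ch)
        }
        previous = ch
    }
    return result
}

private func isOpeningPosition(_ previous: Character?) -> Bool {
    guard let p = previous else { return true }  // start of text
    return p.isWhitespace || p == "(" || p == "["
}
```

The heuristic misfires on edge cases (nested quotes, quotes after dashes), which is exactly why the real code asks the tagger instead.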
And so that puts together all of these different techniques for analyzing text and shows them working together in a single example. Now, this example is a bit complicated, so I think what we're going to try to do is make this available as a developer example. It's not up yet.
But we'll try to make this available as a developer example so you can review it. You should keep in mind, though, that this is not to be considered as a be-all and end-all of rules for capitalization. This is one specific set of style manuals. Different style manuals have slightly different notions of what should be capitalized.
So let me summarize. The way we want to encourage you to analyze text is by looking for ranges within the text that are of interest and to deal with the text range by range. And the primary method that we make available for this is block iteration, iterating over blocks that specify ranges of a specific type.
And we have different types of block iteration for different kinds of analysis. The basic analysis provided by NSString, more sophisticated types provided by regular expressions or data detectors, and the linguistic analysis provided by NSLinguisticTagger, plus text checking at the UI level for things like spell checking and so forth.
And that concludes our discussion of text analysis. For more information, you can contact our evangelist, Bill Dudney. There's documentation at developer.apple.com, though not all of this has documentation yet; specifically for NSLinguisticTagger, I know the documentation is under development and may not be there quite yet. I've been informed that the latest Foundation release notes have been released and should be available to you, and those include some more information on all of these APIs.
Related sessions: you may have gone to What's New in Cocoa and What's New in Cocoa Touch, where they review some of the UI implications of these things. And tomorrow at 10:15 there's an internationalization talk on making your application ready for international markets, which I encourage you to go to. And that concludes the presentation.