Text and Linguistic Analysis - WWDC 2012

Essentials • iOS, OS X • 51:23

Many applications need to analyze or process the text that they are working with. With so many languages in the world this can be a daunting task. Learn how to process and analyze text in world-ready ways. See how to use the linguistic APIs offered on both iOS and OS X to iterate through text, evaluate regular expressions, and do linguistic analysis of the text you operate on.

Speakers: Doug Davidson, Jennifer Moore

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

All right, good morning, everyone. My name is Doug Davidson, and I'm here to talk to you about analyzing text. Now, every app has to deal with text in one form or another, and there are other sessions at the conference about inputting text and displaying text. But here what we're going to talk about is what to do if you want to understand something about what is in a piece of text that you have. And we built a powerful and flexible toolkit for that into our frameworks, primarily at the Foundation level. So all of the APIs I'm going to discuss today are available on both OS X and iOS.

And we've made it really easy for you to use. I'm going to tell you about that today. I should say that proper handling of text, especially international text, is something that's very important to Apple. It's part of what allows us to help people all around the world. And it should be important to you, too, to get the same kind of global reach.

So we're going to start with some very basic things and then move quickly on to the more powerful ones. The most basic is the class NSString. Now, we all use NSStrings all the time. It's worth remembering what an NSString is. Conceptually, it is a sequence of UTF-16 units, which the API refers to as characters.

But when you're analyzing strings, generally speaking, we don't want you to be dealing with individual characters. I'll explain a little more about that in a minute. What you should be referring to as the pieces of analysis are ranges within a string. And all the APIs deal with text in that way.

In addition to NSString, the companion class is NSAttributedString. And there's a whole session this afternoon on attributed strings and another one tomorrow morning, especially for iOS developers. So I'm not going to say a great deal about attributed strings, but remember that an NSAttributedString has an underlying NSString that you can get at readily and cheaply with the string method.

And in addition, it has certain attributes that decorate it, things like fonts, colors, underlines, etc., that again apply to ranges of characters within the string. So anything you can do analyzing an NSString, you can also do to an attributed string by getting its underlying string. And then you can do various things to it, including manipulating the attributes on these ranges. So we'll use that in some of the examples.

So why don't you want to deal with individual characters mostly? Because even the smallest user-visible unit, the thing that appears as a character to a user, may in some cases be more than one character in the string. For example, many accented characters can be or must be represented as a base letter followed by a combining accent character. And in some scripts, like Korean especially, there are even more complex combinations where multiple characters go together to make one user-visible character, which we refer to as a composed character sequence or sometimes just a character cluster.

In addition, for any characters in Unicode beyond the basic multilingual plane, they will be represented in UTF-16 terms, that is in NSString, as a surrogate pair, two UTF-16 units. I should say that the characters that are beyond the basic multilingual plane are not just obscure things like hieroglyphics and cuneiform. This includes a lot of characters in Chinese. We all want to handle Chinese well. It includes most of the new emoji in Unicode. We certainly want to be able to handle emoji well. So what do you need to do?

Well, first of all, just be sure you don't split the parts of a character cluster when you're analyzing a string. If you use the APIs we're discussing, they'll all take care of that for you with the ranges they give you. If you're in doubt, you can call rangeOfComposedCharacterSequenceAtIndex: or rangeOfComposedCharacterSequencesForRange: to get a good range that doesn't split character clusters. If you do split them, you might end up with odd things, like accents being divorced from their base characters or half of a surrogate pair showing up.

And if you don't specifically test for it, you might not know it until you get a bad review mentioning it. But I want to describe, first of all, an API that we have for going through a string and looking at it, character cluster by character cluster. And I want to use this to illustrate a pattern that we'll see again and again here, and that is analyzing text using APIs that are block iterators. Blocks are pretty much perfectly designed for analyzing text, because what do they do?

They apply some chunk of code to, in this case, one piece of text, one range of characters in a string, then another, then another, and so on. So for this API, enumerate substrings in range, if we pass it the option enumeration by composed character sequences, then it will call your block over and over again with each of the character clusters in the string, and the substring range it passes you will be the range of the characters in that character cluster. And then you can do with those whatever you want. So, for example, here, if you have a corresponding attributed string, you can just add an attribute like a color to that range if you needed to.

So let me show you how that works. Here, if we call that on this string, our block will be called repeatedly with the ranges of each character cluster. And here, these are just single-character clusters so far. They'll be ranges of length one until we get to a character cluster that happens to be two underlying characters. Here, an E with an accent character. And then here, an emoji character that's a surrogate pair. And in these cases, you would be passed a range of length two corresponding to this character cluster.
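As an editorial illustration of the enumeration just described, here is a minimal sketch in Objective-C. The helper name `ClusterRanges` is hypothetical and not from the session; it only collects the ranges the block is called with.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns one "{location, length}" string per
// composed character sequence (character cluster) in the input.
static NSArray *ClusterRanges(NSString *string) {
    NSMutableArray *ranges = [NSMutableArray array];
    [string enumerateSubstringsInRange:NSMakeRange(0, string.length)
                               options:NSStringEnumerationByComposedCharacterSequences
                            usingBlock:^(NSString *substring, NSRange substringRange,
                                         NSRange enclosingRange, BOOL *stop) {
        // substringRange never splits a cluster, so it is safe to use
        // for attributes, deletion, truncation, and so on.
        [ranges addObject:NSStringFromRange(substringRange)];
    }];
    return ranges;
}
```

For a string such as `@"ae\u0301\U0001F600"` (a plain letter, an e followed by a combining acute accent, and an emoji stored as a surrogate pair), the block should be called three times, with ranges of length one, two, and two.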

Another unit of processing that's especially important for linguistic-related tasks is the word. And we should remember that in English we're used to thinking of words as being delimited by white space or punctuation, but that's not the case in every language. In languages like Chinese, Japanese, and Thai, words may not be delimited by punctuation or white space, and you need -- some analysis has to be done to determine where the word boundaries are.

So we have an API for going through strings by words. In fact, it's just the same API that I mentioned for character clusters but with a different constant. Use enumeration by words, and instead your block will be called with the ranges of the words in a piece of text.

In this case, let me pay a little closer attention to the arguments of the block. The first one is substring, that is, the actual word as a string. Next is the substring range, the range of it in the string. Then there is the enclosing range, which also includes, if there is any, other stuff around the word up to the next word, like punctuation or white space, if it happens to be there. And there's also in most of these block enumerators an out-pointer Boolean argument, stop, which you can set to YES to break out of the iteration prematurely if you need to.

Another unit of processing is the paragraph, that is, text that's delimited by some kind of paragraph break: carriage return, line feed, or both, or the Unicode paragraph break character. This is the largest unit you have to consider at once for Unicode-related processing. And again, we have the same API for enumerating through a string paragraph by paragraph if you pass it the right constant, enumeration by paragraphs. One common pattern is to nest these. You can go through a string first by paragraphs, then, say, by words, then maybe by character clusters if you need to.

There's also a nice convenience API if you're doing the Unix-y sort of thing of going through something line by line. We have an API, enumerateLinesUsingBlock:, that will take a string and pass you each line at a time for you to do something with it.
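The nesting pattern mentioned here can be sketched as follows. This is an editorial example, not code from the session; `WordCountsPerParagraph` is a hypothetical name.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: for each paragraph, counts its words by running
// a word enumeration restricted to that paragraph's range.
static NSArray *WordCountsPerParagraph(NSString *text) {
    NSMutableArray *counts = [NSMutableArray array];
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:NSStringEnumerationByParagraphs
                          usingBlock:^(NSString *paragraph, NSRange paragraphRange,
                                       NSRange enclosingRange, BOOL *stop) {
        __block NSUInteger wordCount = 0;
        // Inner enumeration: the same API with a different constant,
        // limited to the current paragraph.
        [text enumerateSubstringsInRange:paragraphRange
                                 options:NSStringEnumerationByWords
                              usingBlock:^(NSString *word, NSRange wordRange,
                                           NSRange wordEnclosingRange, BOOL *innerStop) {
            wordCount++;
        }];
        [counts addObject:@(wordCount)];
    }];
    return counts;
}
```

Enumerating the inner level on the original string with the paragraph's range, rather than on the `paragraph` substring, keeps every range expressed in the coordinates of the full text, which is what you want if you later apply attributes.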

Suppose what you want to do is to find the first occurrence of a particular substring within a string. Well, we have the range of string class of APIs for that. And they take a number of options. In this case I've chosen the case insensitive option, which means I'll find it in upper or lower case. Diacritic insensitive means that for this word it could be resume or résumé with some accents; I'll find either one. And width insensitive, which will find either single- or double-width versions of characters that may occur in some Japanese text.

If I don't want to find the next occurrence of a substring, if I just want to check and see if a substring is present at a given place in the string, I can add the anchored search option to the same API. And this means it will find a match only if the string we're looking for occurs starting at the start of the range that we pass in.

And this is useful in many cases and much cheaper, of course, than looking through the entire string. We also have search and replace APIs based on these. One version that applies to immutable strings and returns a modified copy. And another version that acts on mutable strings and mutates them in place.
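A minimal editorial sketch of these searches follows. The function names are hypothetical, and the use of `NSDiacriticInsensitiveSearch` is an assumption about which option produced the accent-insensitive behavior described; the other options are as named in the talk.

```objective-c
#import <Foundation/Foundation.h>

// Options assumed for illustration: ignore case, accents, and width.
static const NSStringCompareOptions kLooseOptions =
    NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch | NSWidthInsensitiveSearch;

// Hypothetical helper: is there any occurrence of "resume" in the text?
static BOOL ContainsResume(NSString *text) {
    NSRange found = [text rangeOfString:@"resume" options:kLooseOptions];
    return found.location != NSNotFound;
}

// Hypothetical helper: does "resume" occur starting exactly at index?
// NSAnchoredSearch only checks the start of the search range, so this
// does not scan the rest of the string.
static BOOL HasResumeAt(NSString *text, NSUInteger index) {
    if (index > text.length) return NO;
    NSRange searchRange = NSMakeRange(index, text.length - index);
    NSRange found = [text rangeOfString:@"resume"
                                options:kLooseOptions | NSAnchoredSearch
                                  range:searchRange];
    return found.location != NSNotFound;
}

// Search and replace on an immutable string returns a modified copy.
static NSString *ReplaceResume(NSString *text, NSString *replacement) {
    return [text stringByReplacingOccurrencesOfString:@"resume"
                                           withString:replacement
                                              options:kLooseOptions
                                                range:NSMakeRange(0, text.length)];
}
```

The mutable counterpart is `replaceOccurrencesOfString:withString:options:range:` on NSMutableString, which changes the receiver in place and returns the number of replacements.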

Sometimes the thing you're looking for is not a specific substring, but maybe characters from a certain class like punctuation or white space or digits or letters or uppercase letters. For that we have NSCharacterSet. Conceptually an NSCharacterSet is just a bitmap over all of Unicode. It can be any subset of Unicode characters.

We have a bunch of predefined examples in the API, white space, punctuation, etc., that I mentioned. But you can create it with any set of characters that you choose. You can pick a list. There's even a mutable variant of NSCharacterSet so you can perform any sorts of set operations on it you like to get the precise character set you want.

And then there's a very similar API, range of character from set, that finds the next occurrence of characters from this set in your string. And there is also a matching version with an anchored search option that checks whether there is a character from this set at the start of the range that I'm looking at. So let's see how some of this works in code.

And so I have a very simple sample project that has some text and a number of methods for going through it using various of these APIs. So here we're going to go through it by words. Enumerate substrings in range by words. Now, the string here is the underlying string of the text, which is my attributed string.

So I'm just going to add a color to each word in my text just to show you that the examples on the slides are really taken from running code. So let's see how that works. So I'll color all the words in the text red and leave the nonword text uncolored. Maybe it's easier to see if we do it in slow motion that shows how the enumeration actually works.

Okay. Next. If I want to find a particular substring, again, my resume or résumé, there is one common pattern. This range of string API just returns the first match, so if I want to find them all, I can go through it in a while loop: each time I look in a range that goes from the end of my previous match to the end of the string, and I continue until I no longer find any or reach the end of the string. When it doesn't find something, it will return {NSNotFound, 0} as the range, and I'll break out of my loop. So I'll try this.

And okay, I'll find all occurrences of that string. You notice that this substring search is not a by-word search. So it finds both resume and résumé as part of resumes. If I wanted to do it by word, I can write this a little differently. And I could just use my enumerate substrings in range, enumeration by words, and go through word by word. And then I can use the range of string API with the anchored search option to see if that range that I found as a word actually matches the thing that I'm looking for.
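The while loop described here might look like the following editorial sketch. `AllRangesOfString` is a hypothetical name, and the search options are assumed to match the earlier resume example.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: collects the range of every occurrence of target
// in text, as NSValue-wrapped NSRanges.
static NSArray *AllRangesOfString(NSString *text, NSString *target) {
    NSMutableArray *ranges = [NSMutableArray array];
    NSStringCompareOptions options =
        NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;
    NSRange searchRange = NSMakeRange(0, text.length);
    while (searchRange.length > 0) {
        NSRange found = [text rangeOfString:target
                                    options:options
                                      range:searchRange];
        // {NSNotFound, 0} means there are no further matches.
        if (found.location == NSNotFound) break;
        [ranges addObject:[NSValue valueWithRange:found]];
        // Continue from the end of this match to the end of the string.
        NSUInteger next = NSMaxRange(found);
        searchRange = NSMakeRange(next, text.length - next);
    }
    return ranges;
}
```

Because this is a plain substring search, a call such as `AllRangesOfString(@"resume, Resumes, résumé", @"resume")` should also report the match inside "Resumes".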

[Transcript missing]

I just find the cases that match a whole word. Now you'll notice that a number of things in this text are in quotes of one form or another. There are a lot of characters that are used as quote characters in various languages. So I want to find all the quotes that occur in this text. So I have a method here that creates a character set, and it creates a character set with certain characters that I've listed by putting them in a string here.

And it shows just all the characters that I know that are used as quotes in various languages. And so I'm going to use that and use the range of character from set API in the same way that I use the range of string API. I'm going to go through in a while loop until I no longer find characters from that set.

And what does that look like? It colors all of the characters that are used as quotes here, as red. Suppose I want to find the words that are surrounded by quotes. Well, I wrote a little helper method here to determine whether a particular range of the text is surrounded by quotes. So I use that same quote character set, and I look and see, using the anchored search option to range of character from set, to ask, is the previous character a quote, and is the following character a quote?

So to look at the previous character, I use backwards search, which is the option that starts from the end of the range rather than the beginning, and anchored search to check for a match right at the boundary of the range. And then to look at the following character, I pass the anchored search option, and I pass appropriate ranges. So in this case, one that starts at the end of the range I'm looking at and goes to the end of the string.

And that tells me whether a range is in quotes. So then I go through enumerate substrings in range by words to go through word by word and ask, is that word in quotes? If yes, color it. How does that work? Let's color all the words in quotes. So here are all the words that are in quotes colored.
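A helper of the kind described for the quote check could be sketched like this. It is an editorial reconstruction under assumptions: the name `RangeIsQuoted` is hypothetical, and the quote set in the usage note is only a small sample, not the list used in the demo.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if the characters immediately before and
// immediately after range are both members of the quotes set.
static BOOL RangeIsQuoted(NSString *text, NSRange range, NSCharacterSet *quotes) {
    // Backwards + anchored: only matches a quote that ends exactly at
    // range.location, that is, the previous character.
    NSRange before = [text rangeOfCharacterFromSet:quotes
                                           options:NSAnchoredSearch | NSBackwardsSearch
                                             range:NSMakeRange(0, range.location)];
    // Anchored: only matches a quote that starts exactly at the end of range.
    NSUInteger end = NSMaxRange(range);
    NSRange after = [text rangeOfCharacterFromSet:quotes
                                          options:NSAnchoredSearch
                                            range:NSMakeRange(end, text.length - end)];
    return before.location != NSNotFound && after.location != NSNotFound;
}
```

A sample set can be built with `[NSCharacterSet characterSetWithCharactersInString:@"\"'\u201C\u201D\u00AB\u00BB"]`, and the helper can then be called from inside a by-words enumeration with each word's substring range.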

Okay, let me go back to the slides. Let's move along to something a little more powerful. Regular expressions. So probably you're familiar with regular expressions. If not, let me just say that regular expressions are patterns that include things like literal strings or sets of characters or combinations of these or repetitions of them or alternatives or more sophisticated things.

The term regular expression by now is a bit of a misnomer because modern regular expression APIs include things that aren't formally regular like look ahead and look behind and back references and so forth. But the term has stuck. So we have -- here's a very simple example of a regular expression I'll use to demonstrate. It starts off with a word boundary, finds a word break.

It continues with the first parenthesized sub-expression, which matches either an I or an O. The second one matches either an F or an N. Now these parenthesized sub-expressions are sometimes referred to as capture groups for reasons that will become clear later. And it ends with another word boundary. So what this expression matches is whole words that have two letters, the first of which is either I or O, the second either F or N. Very simple. This is how the regular expression looks. If we are putting it in a literal string in code, we have to remember that the backslash character needs to be escaped. So it would look like this with the backslashes doubled. Just a little caution.

And we have a powerful regular expression facility in Foundation that is based on the ICU regular expression library. ICU is the standard library for dealing with Unicode algorithms, so it's fully Unicode-compliant, and has many different options. The things that are usual, like case insensitivity, dot matches all, the so-called multi-line mode, options for including comments in regular expressions, and so on and so forth. You can read the header for descriptions of these.

The simplest API for finding regular expressions is just the same API that we used before, the range of string API. If we add the regular expression search option to it, it means that the thing we're looking for is now treated like a regular expression pattern instead of a literal string, and we'll find a match. One interesting twist is that, remember I said all these APIs return a range of {NSNotFound, 0} if they don't find a match.

And the previous APIs will always return a non-zero-length range if they find something. But with certain regular expressions, if you use look ahead or look behind, you can create one that will match correctly with a zero-length match. So I've chosen to determine whether there's a match by comparing the location with NSNotFound instead.

For a little more flexibility, we have a class, NSRegularExpression. With this you can control when the NSRegularExpression object is created. Now when that happens, the pattern gets compiled into an internal form, so this is useful for performance reasons. And the class is fully thread-safe, so you can just create one, keep it around as long as you need it, and use it from all the threads you need.

You create it by passing in a pattern and some options. So here I've chosen the case insensitive option. So this will match either upper or lower case. And then the basic API on this regular expression is just a block enumeration. Enumerate matches in string with options and there are various options here.

Again you can look in the header for these, things like transparent bounds or anchoring bounds. There's an option to regularly report progress if it takes a long time. There's an option to tell you when it's about to finish and so on. And then your block will be called for each match of the regular expression in the string.

So it would be called once for each of these two-letter words that match my regular expression pattern. And what will it be called with? Here it's not just a range that we passed in. We pass in an object representing the match of class NSTextCheckingResult that has a number of properties. One property is a result type. That will be important later. The main property is the range property, which is the overall range of the match.

But with regular expressions you also get to find the ranges of the individual capture groups, that is these parenthesized sub-expressions, which can be very useful in many cases. So range at index, range at index 0 is the overall match range, range at index 1 the range of the first capture group, range at index 2 the range of the second capture group, and so on and so forth.

And so then you can go through and in your block you can get the overall range of the match, you can get the range of the first capture group. For this particular regular expression it's the first letter of the two-letter word. You can get the range of the second capture group. Again, for this one it's the second letter of our two-letter word. And with those ranges you can then do whatever you need to the match of your regular expression.
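As an editorial sketch of the enumeration and capture-group access described above: the pattern is reconstructed from the spoken description (word boundary, I or O, F or N, word boundary), and `TwoLetterMatches` is a hypothetical helper name.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: for every match, returns an array of
// [whole match, first capture group, second capture group].
static NSArray *TwoLetterMatches(NSString *text) {
    NSError *error = nil;
    // Backslashes are doubled because the pattern is in a string literal.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"\\b(i|o)(f|n)\\b"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) return nil;  // pattern failed to compile; see error

    NSMutableArray *matches = [NSMutableArray array];
    [regex enumerateMatchesInString:text
                            options:0
                              range:NSMakeRange(0, text.length)
                         usingBlock:^(NSTextCheckingResult *result,
                                      NSMatchingFlags flags, BOOL *stop) {
        NSRange whole = [result rangeAtIndex:0];   // same as result.range
        NSRange first = [result rangeAtIndex:1];   // (i|o)
        NSRange second = [result rangeAtIndex:2];  // (f|n)
        [matches addObject:@[[text substringWithRange:whole],
                             [text substringWithRange:first],
                             [text substringWithRange:second]]];
    }];
    return matches;
}
```

When only existence matters, `rangeOfFirstMatchInString:options:range:` or `numberOfMatchesInString:options:range:` avoids building result objects for every match.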

The basic API is a block enumerator, but there are plenty of other APIs if you don't need that. You can get all the matches of the regular expression as an array if that's what you want, or just count them, or just get the first one, or maybe just the range of the first one if that's all you need. It's cheaper to do that than to get them all. So you can just ask for the range of the first match, and that will tell you is there a match at all.

Remember I mentioned the convenience API to go through a piece of text line by line. If you were trying to do something grep-like using NSRegularExpression, you could use that to go through a string line by line. For each line, ask, "Is there a match of my regular expression in the line?" And then if so, do something like print it out. Very simple.

There's also a search and replace API on NSRegularExpression. Again, one method that takes an immutable string and returns a modified copy. And another method that operates on a mutable string and mutates it in place. But for these methods, the thing that you're replacing with is not just a literal string. It's a template.

This should be familiar for users of regular expressions. Where $0 represents the whole thing that was matched. $1 is the first capture group. $2 the second and so on and so forth. So here I picked $2, $1. That is, contents of the second capture group, contents of the first capture group will replace the match. So it should probably be obvious what that's going to do to these two-letter words. It's just going to reverse them.
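The template replacement can be sketched as follows. This is an editorial example; `ReverseTwoLetterWords` is a hypothetical name, and the pattern is the same one reconstructed from the spoken description.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: swaps the two letters of each matching word by
// replacing the match with "second group, then first group".
static NSString *ReverseTwoLetterWords(NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"\\b(i|o)(f|n)\\b"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];
    if (regex == nil) return text;
    // $2 and $1 refer to the capture groups; $0 would be the whole match.
    return [regex stringByReplacingMatchesInString:text
                                           options:0
                                             range:NSMakeRange(0, text.length)
                                      withTemplate:@"$2$1"];
}
```

The in-place variant is `replaceMatchesInString:options:range:withTemplate:`, which takes an NSMutableString and returns the number of replacements made.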

There is also for convenience a search and replace by regular expression API just on NSString. And it's the same API we saw before on NSString but with the NSRegularExpressionSearch option added which makes it treated as a regular expression search. So regular expressions are great for things that are simple patterns.

But for more complicated things like URLs or email addresses, phone numbers, dates and times, even street addresses, people sometimes try to find these using regular expressions, but it's very difficult and the regular expression is not likely to match the complexity, especially of the wide range of international formats you'll find for things like dates and times and phone numbers. So fortunately we have something that's better for these applications. Faster, more powerful, handles all sorts of international formats. And it's called Data Detectors. You're probably familiar with it from seeing it find things like dates and times and addresses in your mail at the UI level.

It's also available as an API, NSDataDetector. And NSDataDetector is just a subclass of NSRegularExpression. You create an NSDataDetector. You don't pass it a pattern. Instead, you just pass it a set of types of the sorts of things that you want to find from the list of things that data detectors can find. And there are a bunch of these. So it can find dates and times with a type date. It can find street addresses with a type address.

URLs and email addresses with a type link. Phone numbers with a type phone number. Because this is just a subclass of NSRegularExpression, it has the same API. The block enumerator gets called for each match of the sort of thing that your data detector is looking for in your string.

And again, it gets called with an object of class NSTextCheckingResult. Now here's where the result type property of that class comes into play because it tells you what the sort of thing is that was found. Is it a date or an address or a phone number or URL, etc.? And there are some additional properties that are specific to certain types.

So if it's a date, there's an NSDate property. If it's a street address, there's the components dictionary which gives you the street and city and state and so forth. If it's a URL or email address, there's an NSURL property URL. If it's a phone number, there's a phone number property.

So you can go through and for each match that you find, check the result type and see what sort of thing it is you found. Is it a link? Okay, get the URL. We can do something with that. Is it a phone number? Okay, get the phone number. We can do something with that.
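A minimal editorial sketch of that loop follows. `DetectedItems` is a hypothetical helper name; the detector types and result properties are the ones named in the talk.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the NSURLs and phone number strings that
// a data detector finds in the text, in order of appearance.
static NSArray *DetectedItems(NSString *text) {
    NSError *error = nil;
    NSDataDetector *detector =
        [NSDataDetector dataDetectorWithTypes:(NSTextCheckingTypeLink |
                                               NSTextCheckingTypePhoneNumber)
                                        error:&error];
    if (detector == nil) return nil;

    NSMutableArray *items = [NSMutableArray array];
    [detector enumerateMatchesInString:text
                               options:0
                                 range:NSMakeRange(0, text.length)
                            usingBlock:^(NSTextCheckingResult *result,
                                         NSMatchingFlags flags, BOOL *stop) {
        // resultType says what kind of thing was found; the type-specific
        // property is only meaningful for the matching type.
        if (result.resultType == NSTextCheckingTypeLink && result.URL != nil) {
            [items addObject:result.URL];
        } else if (result.resultType == NSTextCheckingTypePhoneNumber &&
                   result.phoneNumber != nil) {
            [items addObject:result.phoneNumber];
        }
    }];
    return items;
}
```

In both branches `result.range` is still available, so the same block could color the matched text as in the demo.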

And again, because this is just a subclass of NSRegularExpression, it has all the same additional methods to get all the matches in an array or just count them or find the first one or just the range of the first one, to see if there is a match at all or not.

Let's see how that works in code. Okay, so next I'm going to use the regular expression that I showed. I create an NSRegularExpression with it, make it case insensitive, and then just call enumerateMatches in string. And for each match, I'm going to color the range of the match.

Very simple. Let's try it out. And you see that all of the two-letter words that match are colored. Very simple. You can use it for any regular expression you like. Suppose I want to use data detectors. All right. I create a data detector with a type. In this case I'm just choosing the type link that will find URLs. I use the same -- this is the same code.

Enumerate matches in string. If I have a match, I color it. And it finds the URLs in this piece of text and colors them. I should say this example here happens to be an OS X application using NSTextView, but you could do the same sorts of things on iOS with an attributed string and the underlying string.

All right, so let's go back to the slides. And let me move on to the heavy artillery here, which is something a little special. We call it linguistic tagging. And what linguistic tagging -- what this functionality does is it takes a string and the first thing it does is to analyze that string to cut it up first into paragraphs and the paragraphs into sentences, and then the sentences into what we refer to as tokens, which are the words or the other things that show up in text -- punctuation, white space, et cetera. So it divides up the text, it tokenizes it, and you can use it just for that tokenization if what you need is word or sentence or paragraph boundaries.

And then it attempts to determine from the text what language a particular piece of text is in. So it's a language identification. And it uses the standard BCP 47 codes like en for English, fr for French, and so on and so forth. And also it determines what script a piece of text is in -- Latin script or Cyrillic or Greek or Arabic or Chinese, simplified Chinese or traditional, or Japanese. And for those, it uses, again, standard abbreviations like Latn and Cyrl. So that's the basic analysis that Linguistic Tagger does on a piece of text. And again, many pieces of code use it just for the language identification, the script identification.

Then it can go on to do some more sophisticated analysis. It can take a look at that text and try to determine what part of speech each word in that text is being used as. Is this a noun, a verb, adjective, and so on and so forth?

It can take a look at a word in the text and say, "All right, this is an inflected form, maybe a plural. What is the base form of this word?" We use a little technical terminology here. We call this a lemma. That's what would appear as the head word in the dictionary for this word. Maybe it's a verb form and you want the infinitive form of the verb.

It can also attempt to determine whether a piece of text looks like it's part of a name or not. Named entity recognition. Whether it's a person name or place name or organization name. Now, the basic analysis of tokenization and language and script identification, that we can do for arbitrary pieces of text.

The more sophisticated linguistic analyses are very language specific. They require sophisticated natural language models for each language. So, these are provided only for certain languages. Currently, on OS X, we do parts of speech and base word identification, lemmatization, for English, French, German, and, new in Mountain Lion, Italian and Spanish.

We do named entity recognition for English only. On iOS, for reasons of space, currently we provide all these capabilities but only for English. Now, precisely which languages are supported may change in the future if we add more, so there is a method to determine what capabilities are supported for a given language, availableTagSchemesForLanguage:.

The object that does the linguistic tagging is NSLinguisticTagger. To create one, you create an NSLinguisticTagger and you pass it the tag schemes. That is what we call the list of the various kinds of analysis that you want it to do. There are constants for this. Here I've chosen token type. That's the basic tokenization. Lexical class. As I say, we use a little technical terminology. Lexical class is our term for the parts of speech analysis.

Name type, that's named entity recognition. Name type or lexical class combines those two. It will give you whichever is most salient for a given piece of text. And lemma type, that gives you the base form of a word. Then you take your string and you set it on the tagger. The reason we do this is because a tagger does some fairly sophisticated analysis on a piece of text when it needs to and then it holds on to it. So subsequent requests for information about that text are very fast.

And then the basic API on NSLinguisticTagger, again, guess what? It's just a block enumerator. Enumerate tags in range. You pass it in a scheme, that is, what kind of analysis do you want? Do you want, in this case, I've chosen lexical class, I want to get parts of speech out. There are some options. You can choose to omit the white space or punctuation or what have you.

And then your block will get passed the range of a given token, a word let us say, and the tag. And for the lexical class scheme the tag is just what part of speech this word is. And I can take a look at that and say, well, if it's a noun, I want to do something with it. Maybe I'm looking just for nouns or adjectives or adverbs.
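Here is an editorial sketch of the tagger enumeration described above, using the API as presented in this session. `NounsInText` is a hypothetical helper name. Note that later SDKs mark NSLinguisticTagger as deprecated in favor of the NaturalLanguage framework, and which words are tagged as nouns depends on the language model installed on the system.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the tokens the tagger labels as nouns.
static NSArray *NounsInText(NSString *text) {
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeLexicalClass]
                                               options:0];
    // The tagger analyzes the string lazily and caches the results, so
    // later queries on the same tagger are cheap.
    tagger.string = text;

    NSMutableArray *nouns = [NSMutableArray array];
    NSLinguisticTaggerOptions options =
        NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation;
    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeLexicalClass
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        if ([tag isEqualToString:NSLinguisticTagNoun]) {
            [nouns addObject:[text substringWithRange:tokenRange]];
        }
    }];
    return nouns;
}
```

For a single token, `tagAtIndex:scheme:tokenRange:sentenceRange:` answers the same question without enumerating.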

How does this work? So if I were enumerating through this particular string, say I was doing it either by lexical class or by lemma. So for the first word, lexical class would say this is a pronoun. Lemma would say the basic form is "we." That's fairly simple. For the next one, lexical class would say this is a verb. Basic form lemma would say this is a form of the verb "say."

The "to" is a preposition. "Him" is a pronoun. Basic form is "he." It also identifies punctuation. In some cases it gives some more specific identification. For example, it determines this is an open quote. This is an interjection. It determines which punctuation is sentence ending. And it determines that this is a close quote.

The basic API on Linguistic Tagger is the block enumerator, but there are additional ones. You can get a specific tag if you just want one. Tag at index. You can get the set of tags in a range as an array if you choose to do so. So if I just wanted to look and say, is the word here a noun, I ask it tag at index with a lexical class scheme, and if the tag is noun, then it thinks it's a noun.

Now, remember I said the linguistic tagger tries to identify the language of a piece of text from the text. And in most cases that's accurate. If there are only a few words of text, it may not be possible to identify the language accurately. It may not do it. In some cases, if you know already the language of a piece of text, you can tell the tagger and it will use that.

And you do so by creating an NSOrthography object, which is basically a list of the languages that occur in a piece of text. And you create it here. I've said this text is English in Latin script. And I call setOrthography on the tagger with that orthography and it tells the tagger, okay, treat this as English. Even if you can't identify it, treat it as English.
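Telling the tagger the language up front might look like this editorial sketch. `EnglishTagger` is a hypothetical helper name; the script and language codes are the "English in Latin script" case mentioned above.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: creates a tagger for text and tells it to treat
// the whole string as English written in Latin script.
static NSLinguisticTagger *EnglishTagger(NSString *text) {
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeLexicalClass]
                                               options:0];
    tagger.string = text;

    // Maps each script code to the languages written in that script.
    NSOrthography *english =
        [NSOrthography orthographyWithDominantScript:@"Latn"
                                         languageMap:@{@"Latn": @[@"en"]}];
    [tagger setOrthography:english range:NSMakeRange(0, text.length)];
    return tagger;
}
```

This is mainly useful for very short strings, where automatic language identification has little to go on.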

In addition to the NSLinguisticTagger APIs, there's a convenience API on NSString. You can just take a string and say, "All right, for this string, go through it and enumerate linguistic tags in a range and pick your scheme." Lexical class, let us say, "Give me the parts of speech for the words in this string." And you can do all the same things that you can do with LinguisticTagger with this, but of course, you don't have the control over the lifetime of the tagger object and the information that it holds onto, but it's a convenience. So let's take a look at how this works in code.

Suppose I wanted to go through that same text and color only the nouns. Well, it's not really any more complex than any of the other things I've been doing. I create an NSLinguisticTagger. I give it some tag schemes to start off with. I say let's do basic tokenization, parts of speech analysis, and name recognition because I may want those later on.

And I set my string on it. Then I call enumerateTagsInRange. Let's choose lexical class. I want parts of speech out of this here. And so it goes through token by token. And if I find a noun, I'll color it. Very simple. See how it works. All right. Let's color all the nouns in this piece of text.
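The demo code is not shown in the transcript, so this is only a sketch of the steps described. `NounRangesInText` is a hypothetical name, and it collects ranges where the demo would apply a color.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: NSValue-wrapped NSRanges of every token tagged as a noun.
static NSArray<NSValue *> *NounRangesInText(NSString *text) {
    // Request tokenization, parts of speech, and names up front,
    // since later queries may want them.
    NSArray *schemes = @[NSLinguisticTagSchemeTokenType,
                         NSLinguisticTagSchemeLexicalClass,
                         NSLinguisticTagSchemeNameType];
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:schemes options:0];
    tagger.string = text;

    NSMutableArray<NSValue *> *ranges = [NSMutableArray array];
    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeLexicalClass
                         options:0
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        if ([tag isEqualToString:NSLinguisticTagNoun]) {
            // The demo would color this range; here we just record it.
            [ranges addObject:[NSValue valueWithRange:tokenRange]];
        }
    }];
    return ranges;
}
```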

Suppose I wanted to find just the nouns that appear in quotes in this text. I can take the same tagger, call the same setString and enumerateTagsInRange. I could use the method I had previously to find out if a word was in quotes, but I don't need to because, as I mentioned, the tagger keeps track of quotes. So I can just ask the tagger. Remember, I said additional queries to the tagger are cheap once it's done the analysis. I just ask using tagAtIndex, and I'll look at the index immediately before this word and ask, is that an open quote?

And I'll use tagAtIndex for the index immediately following my word and ask, is that a close quote? And if both of those are true, then my word is in quotes. So let's try that. So let's find all the nouns that are in quotes in this piece of text. All right, nouns in quotes. So you'll notice the verb "see" is in quotes, but it doesn't show up. The adjective "human" is in quotes, but it doesn't show up here. I just get fox, dog, time, and events.
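The neighbor check described above could be sketched as follows. `QuotedNouns` is a hypothetical name, and the bounds checks are an addition so that tagAtIndex is never asked about an index outside the string.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: nouns that sit directly between an open and a close quote.
static NSArray<NSString *> *QuotedNouns(NSString *text) {
    NSString *scheme = NSLinguisticTagSchemeLexicalClass;
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[scheme] options:0];
    tagger.string = text;

    NSMutableArray<NSString *> *nouns = [NSMutableArray array];
    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:scheme
                         options:0
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        if (![tag isEqualToString:NSLinguisticTagNoun]) return;

        NSUInteger after = NSMaxRange(tokenRange);
        if (tokenRange.location == 0 || after >= text.length) return;

        // Ask about the characters immediately before and after the word.
        NSString *before = [tagger tagAtIndex:tokenRange.location - 1
                                       scheme:scheme
                                   tokenRange:NULL
                                sentenceRange:NULL];
        NSString *next = [tagger tagAtIndex:after
                                     scheme:scheme
                                 tokenRange:NULL
                              sentenceRange:NULL];
        if ([before isEqualToString:NSLinguisticTagOpenQuote] &&
            [next isEqualToString:NSLinguisticTagCloseQuote]) {
            [nouns addObject:[text substringWithRange:tokenRange]];
        }
    }];
    return nouns;
}
```

This only matches a single quoted word, which is what the demo text contains; a quoted phrase would need a wider search.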

And finally, let's try name recognition, which is, again, same API. I create my linguistic tagger and enumerate tags in range, and I use the name type, and I ask, is this one of our types of names we know? Person, place, or organization? If so, I'll color it. Let's try coloring all of the names that appear in this piece of text. And there they are.
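Name recognition follows the same pattern with the name type scheme. In this sketch `NamesInText` is a hypothetical name, and the `NSLinguisticTaggerJoinNames` option (which reports a multi-word name as one token) is my choice rather than something stated in the talk.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: substrings tagged as a person, place, or organization.
static NSArray<NSString *> *NamesInText(NSString *text) {
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeNameType]
                                               options:0];
    tagger.string = text;

    NSSet<NSString *> *nameTags = [NSSet setWithObjects:NSLinguisticTagPersonalName,
                                                        NSLinguisticTagPlaceName,
                                                        NSLinguisticTagOrganizationName,
                                                        nil];
    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace |
                                        NSLinguisticTaggerOmitPunctuation |
                                        NSLinguisticTaggerJoinNames;

    NSMutableArray<NSString *> *names = [NSMutableArray array];
    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeNameType
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        if (tag != nil && [nameTags containsObject:tag]) {
            [names addObject:[text substringWithRange:tokenRange]];
        }
    }];
    return names;
}
```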

So what sorts of things do we use this for? At Apple we use these APIs and this technology to do things like improving our text checking and our auto correction. We use it for things like lookup where you tap on a word and get information on it. You look it up in a dictionary. There are many other applications for this potentially.

You can use this to improve your indexing. You can use it to identify names if you want to find out who is referred to in your text. What I want to do now is to bring up my colleague Jennifer Moore to give a somewhat more involved example of the use of these APIs.

Hi, I'm Jennifer. So one of the things that I love about something like the iPad is that it displays text so beautifully. It makes it such a pleasure to read my favorite books and magazines. But in an application there are so many more things you could do with text if you only knew something about the content of that text.

So we've been working really hard to provide these sophisticated tools for analyzing text so that we can leverage this content and add more functionality to our own applications and also make it really easy for developers to do the same thing. So one of the things that we would like to be able to do is maybe to be able to search through text. So that's what we're going to do today. So I have this application here. I'm just going to show you what it looks like.

So basically we have a collection of articles about different topics. And what we wanted to be able to do is allow the user to type in a search string and get back a list of all the articles that contain words from that search string in the body of the text. The easiest way to do this would be to create a dictionary and index all of our articles by all the words in the text, keeping track of all the articles that they occur in. But of course if our application is indexing something like Wikipedia, the dictionary might be huge and probably not really feasible, especially for something like the iPad.

If you think about it, what are the most important pieces of text? Those would be the nouns. The nouns are the words that really represent what an article is about. We could probably go a long way towards providing really useful search if we just go and index all the nouns in our text. With NSLinguisticTagger we can do that.

Today we're going to be looking at content-based searching using NSLinguisticTagger and the linguistic APIs. If we go and take a look at how this might be implemented, first of all, we have a class for managing all the articles in our collection.

Basically, there are two things we need to do. We need to be able to index our files and then we need to be able to return a set of filtered articles by the search string. To start off, we're just going to go and when we allocate our articles manager, we're going to go and index all of our files all at once.

Since we're going to be processing a lot of text all at once, the most efficient way to do this would be just to create a linguistic tagger object. We'll just do that to start. We need to specify all the tag schemes that we might be interested in looking at because we don't want it to be doing any processing that it doesn't need to do. Since we're going to go and we want all the nouns out of our text, we're going to choose a tag scheme for lexical class.

Later on, we might also be interested in keeping track of the languages that are being used in the text. For that, we'll want to keep track of the language and also the script, because we want to be able to use NSOrthography with the tagger. Then we just initialize our tagger with these tag schemes and also our search index for later. Then we go through and for each article that we have, we're going to go and index this article.

First thing we do is call set string on our tagger with the text of the article. Now, at this point, if we knew the language of the text, we could also call set orthography on the tagger. But since we're dealing with a good amount of text here, we can be pretty confident that it will automatically guess the right language and do the right thing. So we can just go on to our block API. We're going to call enumerate tags in range with the range of the article.

We use the tag scheme for lexical class. We're not really interested in whitespace or punctuation, so we can just omit those. And in our block we're going to get the tag back for the current word in the token range. So all we have to do is say, you know, if the tag is a noun, we're going to grab the substring corresponding to that word, lowercase it so that we normalize it with our index, and then we'll just add it to our dictionary.
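The demo source is not part of the transcript, so the following is only a sketch of the indexing steps as described. The names `MakeIndexingTagger` and `IndexArticle`, and the shape of the index (lowercased noun mapped to a set of article titles), are assumptions.

```objc
#import <Foundation/Foundation.h>

// Hypothetical factory: one tagger reused across all articles, with only the
// schemes the index needs (parts of speech now, language and script for later).
static NSLinguisticTagger *MakeIndexingTagger(void) {
    NSArray *schemes = @[NSLinguisticTagSchemeLexicalClass,
                         NSLinguisticTagSchemeLanguage,
                         NSLinguisticTagSchemeScript];
    return [[NSLinguisticTagger alloc] initWithTagSchemes:schemes options:0];
}

// Hypothetical helper: adds every noun in `text` to `index` under `title`.
static void IndexArticle(NSLinguisticTagger *tagger, NSString *title, NSString *text,
                         NSMutableDictionary<NSString *, NSMutableSet<NSString *> *> *index) {
    tagger.string = text;
    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace |
                                        NSLinguisticTaggerOmitPunctuation;
    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeLexicalClass
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        if (![tag isEqualToString:NSLinguisticTagNoun]) return;

        // Lowercase so that index keys and search terms are normalized alike.
        NSString *term = [[text substringWithRange:tokenRange] lowercaseString];
        NSMutableSet<NSString *> *titles = index[term];
        if (titles == nil) {
            titles = [NSMutableSet set];
            index[term] = titles;
        }
        [titles addObject:title];
    }];
}
```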

And that's all we have to do. For later purposes, as I said, we might want to keep track of the language that's being used. We don't really know whether there could be mixed languages in this document. We're not sure. So for each word we're just going to keep track of the languages that we see. So we can just call tagAtIndex to get the language out and also the script out.

And then we can use this to create an NSOrthography object and keep it around for later use. More on that later. That's all we have to do to index our files. Next thing we want to do is be able to return a list of all the articles that contain the search string.
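One way the language bookkeeping could look is sketched below. The talk does not say how the orthographies are stored, so the `OrthographiesInText` name and the dictionary keyed by "language-script" are assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: one NSOrthography per (language, script) pair seen in `text`.
static NSDictionary<NSString *, NSOrthography *> *OrthographiesInText(NSString *text) {
    NSArray *schemes = @[NSLinguisticTagSchemeTokenType,
                         NSLinguisticTagSchemeLanguage,
                         NSLinguisticTagSchemeScript];
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:schemes options:0];
    tagger.string = text;

    NSMutableDictionary<NSString *, NSOrthography *> *seen = [NSMutableDictionary dictionary];
    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace |
                                        NSLinguisticTaggerOmitPunctuation;
    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeTokenType
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        // Ask the tagger which language and script it assigned to this word.
        NSString *language = [tagger tagAtIndex:tokenRange.location
                                         scheme:NSLinguisticTagSchemeLanguage
                                     tokenRange:NULL
                                  sentenceRange:NULL];
        NSString *script = [tagger tagAtIndex:tokenRange.location
                                       scheme:NSLinguisticTagSchemeScript
                                   tokenRange:NULL
                                sentenceRange:NULL];
        if (language.length == 0 || script.length == 0) return;

        NSString *key = [NSString stringWithFormat:@"%@-%@", language, script];
        if (seen[key] == nil) {
            seen[key] = [NSOrthography orthographyWithDominantScript:script
                                                         languageMap:@{script: @[language]}];
        }
    }];
    return seen;
}
```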

Let me make this bigger. As Doug mentioned before, there are also convenience methods on NSString. I'm going to show you this here. Our search string is pretty short. Or probably pretty short. We have our search string and here we're going to call enumerate linguistic tags in range directly on NSString.

Basically, we're going to enumerate through all the words in the search string. There might be one or more terms. We don't know. For now, we're just going to use the tag scheme for token type. We're going to omit whitespace and punctuation again. We don't know the language of what the user is typing, so for now we'll leave the orthography nil.

We grab the substring, lowercase it, and then we just go check is that word in our index. If it is, grab all the titles that it occurs in and add it to our filtered set. And that's all we have to do. So now when we run this again, now we have this list of articles and I can type in something like

[Transcript missing]
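The search step described before the gap could be sketched like this, using the NSString convenience method with the token type scheme. `FilterArticles` and the index shape (lowercased term mapped to a set of titles) are assumptions.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: titles of all articles whose index entries match
// any word in `searchString`.
static NSSet<NSString *> *FilterArticles(NSString *searchString,
                                         NSDictionary<NSString *, NSSet<NSString *> *> *index) {
    NSMutableSet<NSString *> *filtered = [NSMutableSet set];
    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace |
                                        NSLinguisticTaggerOmitPunctuation;
    [searchString enumerateLinguisticTagsInRange:NSMakeRange(0, searchString.length)
                                          scheme:NSLinguisticTagSchemeTokenType
                                         options:options
                                     orthography:nil  // language of the query is unknown
                                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                                   NSRange sentenceRange, BOOL *stop) {
        NSString *term = [[searchString substringWithRange:tokenRange] lowercaseString];
        NSSet<NSString *> *titles = index[term];
        if (titles != nil) {
            [filtered unionSet:titles];
        }
    }];
    return filtered;
}
```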

Now, just to make sure that it's clear, we're only indexing all the nouns that are being used in our text.

I just want to show you, I don't know if you can see this, but there's a sentence here that says, can't wait to share the latest news, right? And here, the word share is being used as a verb, right? So we'd expect that if we typed in share, you know, we wouldn't get this article, but maybe we'd get an article on market share or, you know, stock shares, like this nice Apple financial report, right? So, right. Yes.

So let's go back to this example again. So here, I typed in developers, which is the plural form of developer. But if I typed in just the singular form, I don't get any results back, because that word didn't actually occur in my text. And that's kind of a bummer, because they mean the same thing. And especially for languages like Italian and Spanish or German, there might be multiple forms for the same kind of word. So either I have to index all those inflected forms, or the user just has to know exactly what word is being used in the text.

So this is where something like lemmatization would really come in handy, because instead of indexing what I call the surface term, surface form of a word, we could just go and index all the lemmas for each noun in our text. So let's go see how we would do that instead. So we go back to our methods here, two methods here. So up here in our index article, let's just comment this guy out for now. I have a little snippet of code here. Just a little bit updated. To do this is really easy.

So here instead, we say if the tag is a noun, instead of indexing just a substring, we're going to call tagAtIndex to get the lemma out. And if there's a lemma, we set our index term to be the lemma. If not, we'll default to the substring. And then we just add it to our index as usual. Right? And then for our filter articles, okay, I'm going to actually comment this whole block out.
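A sketch of the lemma-based variant of the indexing step. `IndexArticleByLemma` is a hypothetical name, the tagger is created inside the function only to keep the example self-contained, and lowercasing the lemma is my assumption so that keys stay normalized.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: indexes each noun under its lemma, falling back to
// the surface form when no lemma is available.
static void IndexArticleByLemma(NSString *title, NSString *text,
                                NSMutableDictionary<NSString *, NSMutableSet<NSString *> *> *index) {
    NSArray *schemes = @[NSLinguisticTagSchemeLexicalClass, NSLinguisticTagSchemeLemma];
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:schemes options:0];
    tagger.string = text;

    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace |
                                        NSLinguisticTaggerOmitPunctuation;
    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeLexicalClass
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        if (![tag isEqualToString:NSLinguisticTagNoun]) return;

        // Second, cheap query on the same token: what is its lemma?
        NSString *lemma = [tagger tagAtIndex:tokenRange.location
                                      scheme:NSLinguisticTagSchemeLemma
                                  tokenRange:NULL
                               sentenceRange:NULL];
        NSString *term = lemma.length > 0 ? lemma : [text substringWithRange:tokenRange];
        term = [term lowercaseString];

        NSMutableSet<NSString *> *titles = index[term];
        if (titles == nil) {
            titles = [NSMutableSet set];
            index[term] = titles;
        }
        [titles addObject:title];
    }];
}
```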

Grab this. Now, it's important to mention that if we're searching, the search string is probably going to be very short, so the tagger is probably going to have a difficult time guessing the language. It's not going to be very clear from one or two words.

So all we have to do, since we kept track of all the languages that were being used in our text, and because the search string is going to be very short anyway, is go through all the languages that we know occur in our text, get the lemma out for that word, and check if it occurs in our index. So that's all we're going to do here. So for each of the languages that occur in our collection of text, we will go and get all the lemmas out.

We do that using the NSLinguisticTagSchemeLemma tag scheme. If there's a lemma available, we'll set our search term to the lemma. If not, we will default to the substring. And that's all we have to do again. So we could just run this now. And now when we type in developer, we get our article. So here you can see all the instances where developers were found with the lemma for the search string.
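The per-language lookup might be sketched as follows: the query is lemmatized once for each orthography recorded at indexing time, and each result is checked against the index. `FilterArticlesByLemma` is a hypothetical name, and using the NSString convenience method with the lemma scheme here is my guess at the shape of the demo code.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: looks up each query word by lemma under every known
// orthography, falling back to the surface form when no lemma is available.
static NSSet<NSString *> *FilterArticlesByLemma(NSString *searchString,
                                                NSArray<NSOrthography *> *orthographies,
                                                NSDictionary<NSString *, NSSet<NSString *> *> *index) {
    NSMutableSet<NSString *> *filtered = [NSMutableSet set];
    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace |
                                        NSLinguisticTaggerOmitPunctuation;
    for (NSOrthography *orthography in orthographies) {
        [searchString enumerateLinguisticTagsInRange:NSMakeRange(0, searchString.length)
                                              scheme:NSLinguisticTagSchemeLemma
                                             options:options
                                         orthography:orthography
                                          usingBlock:^(NSString *lemma, NSRange tokenRange,
                                                       NSRange sentenceRange, BOOL *stop) {
            // With the lemma scheme, the tag passed to the block is the lemma itself.
            NSString *term = lemma.length > 0
                ? lemma
                : [searchString substringWithRange:tokenRange];
            NSSet<NSString *> *titles = index[[term lowercaseString]];
            if (titles != nil) {
                [filtered unionSet:titles];
            }
        }];
    }
    return filtered;
}
```

Trying every known language trades a little extra work for not having to guess the language of a one- or two-word query.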

And of course, this is not just for English. So if I type in something like "monaten", I will get this nice German article. Even though "monaten" doesn't occur in this text, it will still find the text. That is multilingual content-based searching using NSLinguisticTagger. Thank you very much.

Thanks, Jennifer. Okay. So let me summarize what we've discussed here. The way that we want to promote for you to analyze text is to find ranges of interest within your NSStrings. We use the Foundation APIs available on both OS X and iOS to iterate over the ranges of interest. In many cases, block enumeration APIs will do this for you. There are some other APIs that will find a particular range of interest.

And depending on the sort of thing you're looking for, there are different APIs you can use: the NSString enumeration or search APIs, the NSCharacterSet search APIs, regular expressions, data detectors, and finally, the linguistic tagger. And these all work together very well, as we've shown in these examples, to find various pieces in combination.

For more information, you can contact our evangelist. We have the String Programming Guide for Cocoa. For the particular details of the regular expression syntax we use, it's provided by the ICU regular expression library. I encourage you to take a look at the headers for these classes. I actually wrote many of them, so they contain detailed information on all of these things.

There are some other related sessions this afternoon. One on keyboard input in iOS, which has a lot of things related to text handling. Attributed strings for iOS this afternoon. An advanced session tomorrow. And internationalization on Friday. And that's what we have for you today. Thank you very much.