SearchKit: A Powerful Text Indexing and Searching Framework - WWDC 2003

Application Frameworks • 33:38

SearchKit is a powerful C language framework for searching, indexing, and summarizing text. Formerly known as AIAT (the Apple Information Access Toolkit) or by its code name V-Twin, SearchKit is the engine that provides fast content searching in Finder, Address Book, and Mail. View this session to learn how you can take advantage of this framework to provide similar functionality and powerful information access capabilities within your Mac OS X application.

Speakers: Wayne Loofbourrow, David Casseres, Eric Koebler, Alice Li

Unlisted on Apple Developer site

Check out Bezel, our iPhone mirroring app →

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Good afternoon. I had way too many brownies at lunch and if you're like me you'll probably just knock right off in about an hour here. But it's the last couple of sessions this week and the purpose of this session is to present to you some new technology that's in Panther called SearchKit. And the value of SearchKit I think is that it provides some functionality that's been missing in the system first of all and that wasn't available to you as a developer.

And secondly if you can leverage that functionality in your application you can really deliver a very consistent user experience for searching. Not to mention the fact that you'll have a search engine that's already written for you in your application. So to talk you through the specifics of this new technology in Panther I want to introduce Wayne Loofbro. To the stage. Thanks.

Thanks, John. So welcome to the SearchKit session. If you're having trouble finding a seat, there are some up front. So today we're going to talk a bit about SearchKit and how it's used in Mac OS X. It's actually been used for a while, even before it's been released as a public API. A bit about what it can do and how to add search to your application.

So specifically, we'll talk about what SearchKit is, some of the challenges of providing searching that SearchKit solves, and then how to go about doing indexing in three steps and how to do searching in three steps. And then we'll talk a bit about an application of SearchKit called Summarization, which provides a specific API that you may also be interested in using in your application.

So where does SearchKit fit into the technology framework? So you've all seen this diagram by now. In this diagram, we also show core services in the middle, the middle green one there. And that's precisely where SearchKit fits in. So this is a layer above Darwin and below the level of the frameworks.

So just a bit of history. Some of you actually may already be familiar with SearchKit if you looked at all at V-Twin or AIAT. This is a technology that was available in Mac OS 9. It was called the Apple Information Access Toolkit, if you can believe it. And it's a C++ API.

It used its own custom data types, and it was a rather complex API. In fact, I think the manual for it was about 300 pages. So we wanted to improve this in Mac OS X, and so what we did was we created a very streamlined C API that's based on the core foundation types. It has about maybe 40 or 50 calls, and it's much shorter documentation.

So as I mentioned, SearchKit is used in a number of the applications in the system. In particular, in Address Book, the sort of searches you type that lets you find addresses in your Address Book uses SearchKit. In Apple Help, you want to know how do I connect to the internet?

Apple Help will use SearchKit to find all the documents that are relevant to that query. In Apple Mail, it's also used to search mailboxes. Each mailbox is indexed using SearchKit, and you can search individual indexes or all of them at the same time, and you get relevance-ranked results.

and in the Finder in the content searching. So when you do content searching in the Finder, SearchKit does the underlying search and again you get relevance ranked results based on the contents of the documents on your disk. So now to demonstrate a few of the applications, I'd like to invite David Casseres, the Find by Content tech lead. Thank you very much. I'm going to do, if we can have demo one here. I'm going to do a couple of very brief demonstrations. First I'll show you Mail. Where we have a few mailboxes and I can put a little text right in here.

I type in iTunes, I find the one message in all those mailboxes that has it. I can type in Beatles.

[Transcript missing]

and several there. Got a Switch commercial. Got an article. Yes, it does have the word switch in it. And because this is mail, it's doing prefix searching. So I can type. There, and I get the ones that have Abbey Road. All I typed was A-B-B. That's mail.

Now in the Finder, I can search some files. I have a set of demo files here. There are 770 files in here, and they are previously indexed, so you're not going to see the initial indexing process. But the Finder does something much more complicated. It uses a couple of other frameworks to go through to do something complicated with SearchKit, and it will first search the existing index, then start validating the index to make sure that it's still up to date, and it will search repeatedly while it's doing that.

And you'll see that it's all very quick. There we are. Relevance-ranked results. We've got all kinds of files here. We've got PDFs. We've got Text files, and so forth and so on. This works in practically any language you can use with the Mac. and it's very much improved in Panther and that's fine by content.

Thank you, David. So there are a number of challenges to providing search in SearchKit. In particular, doing relevance ranked results. You don't just want the documents that match a particular query, you also want them ranked by how relevant they are to what it is you were asking for.

Being able to support multiple languages, as David mentioned, with SearchKit you can index a large number of different languages. These are human languages, not computer languages, and get correct results. We need to be flexible about what a document means. It could be a document on your disk, but it could also be a message in mail or an address in address book. And we have to be flexible about the kinds of queries. You saw an example of prefix searching, as well as you can do word-based searching and some other ones that I'll talk about in a bit.

So how does SearchKit deal with some of these challenges? Well, one example that comes up in Apple help is that you'd like to be able to type in a query like "How do I connect to the internet?" Now, do you really mean to find documents that contain the exact phrase "How do I connect to the internet?" Well, no, that's not what you're looking for. What you want to know is documents that are relevant to the query "How do I connect to the internet?" So you're not just doing a string matching search.

So what documents match this query? Well, it's really hard to know without breaking it first into words. So that's what SearchKit does. It breaks it into words. But which document is relevant to these words? Is a document that contains the words "do," "I," "to," and "the" very relevant to the query? Well, probably not.

It's probably not as relevant as a document that contains "how," "connect," and "internet." And so one of the things that SearchKit does is to try to determine which words in the query are most relevant to what it is that you're looking for. And it does this using some statistical analysis.

So another challenge in SearchKit is dealing with different languages. So, it's fairly easy to break things into words when you're dealing with English, say, or a world or a language that separates words by spaces. But what if you're dealing with Japanese? There are no spaces between words. And so what SearchKit does is uses a Japanese language analysis which was developed by the folks in Apple Japan. It's the language analysis framework. And it uses this to do a grammatical analysis of Japanese, and then it can break Japanese into words. And then use the same statistical techniques to then find the most relevant document.

But what is a document? Is a document just a file on your hard disk? Well, that would be a fairly limited capability if that's all you could look for. But, you know, so we do search text documents, but we also search documents of other file types such as Word and PDF and HTML. But in addition, we support searching items that are within the application, such as mail messages, address book items, or in your application, really any object or maybe an entry from a database. Really anything that contains text can be indexed and searched with SearchKit.

The other area of flexibility that we need is in what a query is. I already gave the example of a natural language query such as how do I connect to the internet. But we also want to be able to do prefix queries such as the one that David demonstrated so that you can type just the beginnings of words and even in the middle of your typing you can do a search and get relevant results back without having to complete a word.

You also want to be able to do a Boolean query. In some cases you want to have an exact match and so you'd be able to construct a Boolean query. And then there's a popular method used in some of the internet search engines of marking things with pluses and minuses to indicate inclusion or exclusion.

And then as well, you might also want to be able to find things. You've already found some items, and now you want to find other items that are most like the items you've already found. And so we also support a similarity search. So you can provide documents as the input and produce documents as the output.

So we talked a little bit about a typical usage scenario for using SearchKit. Really it's very simple. You index the documents and then you search the index. And then we'll go into more detail about that of course. Really the index's only purpose, the only purpose of creating an index is to make searching fast.

You could certainly look through all the text content of all your documents to find things if that's all you wanted to do, but you'd be missing a couple things. You'd be missing speed first of all, but you'd also be missing the relevance ranking and some of the statistical analysis that goes into the index. Once you've created an index, you don't need the original documents in order to do searching. The index contains references to those documents, not the content itself. But it also contains statistical information about the words in the documents.

And the index can be stored in memory, if you like, for performance. Or if you want it to be persistent, because you're going to use it again, then you could create it in a file. The basic process of indexing is pretty simple. You provide the documents to SearchKit. It analyzes the text and updates the index.

The process of searching is equally simple. You provide it with a query and the index to search, and it searches the index and returns results. So that's the basic outline, of course, and we'll go into more detail. Indexing can be done in three basic steps. You open the index that you'd like to search. You add the documents to the index and then flush it.

So opening the index, of course, could be opening an existing one or creating a new one. Opening an existing one is pretty basic. I won't bother talking about that. Creating a new one, there's a decision to be made about what kind of index you want to create. There are three different types. The first is an inverted index. This kind of an index maps the terms or words in each document to the document itself and provides statistical information about them. This kind of index is used for most of the query kinds of searches.

A vector index maps the other way. It maps documents to terms. This kind of index is actually seldom used. More common is the third kind of index, which is an inverted vector index, which does both. So you might ask, you know, why not just always create an inverted vector index? Because it obviously has the capabilities of the first two.

And really the answer is just space and performance. It takes a bit longer to create. It has more information to put in it. And it takes more space on the disk or in memory. So it turns out that for most purposes, an inverted index is really the kind that you're going to want to use, both for performance reasons and for space reasons.

But just breaking it down a bit more, if you're doing ranked prefix required or Boolean searching, the four kinds that I talked about, four of the kinds that I talked about, then you really want to be using an inverted index. It'll pretty much do, it'll give you the best performance and size ratio. If you're going to be doing similarity searching, it's okay to use an inverted index.

It will still work. It's going to be a bit lower performance in doing similarity searches than the inverted, sorry, the inverted vector index is going to be. But if you're only doing similarity searching pretty seldom, then that's probably the best choice. If you're going to do a lot of similarity searching, then you probably want an inverted vector index. Okay, so we've either opened an existing index or created one.

Now we gotta add the documents. So as I mentioned before, documents can be any of a number of different kinds. It could be something on your disk, something inside your application. For each of these documents, what you provide is a document reference. The index doesn't need to store a lot about the document itself other than a reference and the statistical information that it wants to gather. And when you get results back, you're going to get references back rather than the actual document itself. These references are just URLs or they're created from URLs, I should say, to be more precise.

The other thing you need to provide is, of course, the text of the document. And there are a number of ways to do this. If you have a file, such as the first example on the left, Then you use a file call and reference and that tells SearchKit that it's something that it knows how to read. And the text from the document will be read automatically.

In addition, if you'd like it to handle multiple file formats, we have built-in support for a number of them in SearchKit. And all you need to do is load the default extractors, which is a call in SearchKit. And that means whenever you read a file, it will do the text extraction for you.

But if you'd like to handle any kind of item, in particular you want to handle something that's inside your application, or you have a document format that we don't know about that you know how to get this text from, then you can provide the text as a CFString. That's the universal method. And it can be, the document reference at that point can be any kind of scheme and any kind of hierarchy that you'd like to create within your application.

And then the last step is simply flushing the index to disk or to memory. And at that point, you're done indexing. So it's actually pretty simple. Now you're ready to search. So searching has three basic steps as well, and I'll talk about each of those. First, you create the search group. Then you send the query to SearchKit and process the results.

I just hit the blink button there. OK. So searching. One thing that SearchKit supports is the ability to have more than one index to be searched at the same time. Now you might ask why we do this. One reason is to support multiple attributes. It might be that the objects in your application have not just one text attribute, the content, but they may have other text attributes. Maybe it's a description of a movie or maybe the additional attributes that you'd like to index as well.

One example of this is in the Finder. When FindMyContent does indexing in the Finder, it indexes not only the names of the files, but also the contents of the files. And you'll see in these results that the top few items there were found because of their names. The search was for quartz. So a couple of files had quartz in the name, and the bulk of the files there had quartz in the content. So there are two indexes and they get searched simultaneously.

So creating a search group allows you to search multiple indexes at the same time. Now you might ask why use a search group? Why not just search each index individually and then combine the results somehow? Well, the reason is-- Oh, I guess I gave an example. Okay, let me give another example here before I mention that.

Multiple containers is another reason why you might create multiple indexes. An example here is mail. So in mail, each mailbox is indexed distinctly. So when you want to do an entire content search for all the mailboxes in mail at the same time, it creates a search group in order to search multiple mailboxes.

Now we'll get to the point about normalizing ranking. If you were to do searching on individual indexes and try to combine the results yourself, the problem is that the relevance ranking would only be relative to the content of each individual index. You haven't normalized the rankings across all the indexes.

So by creating a search group, the statistics are normalized and then the documents you find will be the most relevant across all the different categories. And of course it's possible to create a search group with just one index and that may be a common case as well. Okay, so you've got your search group created and now it's time to send the query to SearchKit to do the search.

But we're not quite ready for that because we have to determine what kind of search we're going to do. And the kinds, as I mentioned before, are ranked, prefix, boolean, required, and similarity searching. Just to go into a little more detail about each one, Ranked supports kind of a natural language kind of query.

So if you're trying to do something like Apple Help does, where you've got a large number of documents, you're trying to find the ones that are most relevant from a sort of English or other language point of view, then this is the kind of search you're going to want to do. So the user can ask a question, they can name a topic, or they can provide just related words, and it'll do a reasonably good job of finding the documents that are most relevant.

Again, an example of this is in Apple Help. Prefix searching allows you to do the search as you type, so if you want one of those interfaces where as you type the results come down like you'll see in address book or in mail as well, then this is the kind of searching you'll want to do.

It's basically just like the previous one except that it supports prefixes, which has the advantage of supporting searches you type. One disadvantage it has is, of course, you'll get more matches than might be intended because even if you type what seems like a complete word, it may be the prefix of some other word, and so you'll probably get more results than you would otherwise.

An example, this is an address book. Boolean searching is just what you'd think it would be, and/or not. And you can combine things with parentheses, et cetera. And this one's maybe less often used because most users aren't really familiar with doing this except advanced users. But it is, in fact, supported in Mail.

So you'll notice if you type an ampersand or a vertical bar in Mail, it'll do Boolean searching. And the way it does this is it looks at the query that you typed in. And if it contains any of those symbols, then it sends it on as a Boolean search. And otherwise, it sends it on as a ranked-- or, sorry, a prefix search.

And then required searching is the last type, and that allows you to add pluses and minuses. We don't have any examples in any of the Apple apps of using this, but it's a popular search technique that's used on the internet. And then similarity searching is a different call. For that one you provide the sample documents. You don't have any sort of text query at all. And SearchKit returns documents that are similar to the ones that you provided it.

Okay, so now we're ready to send the query. We decided, let's say, to do a ranked search in this case. So you send the query to SearchKit. SearchKit efficiently searches the index and then returns the results. And then the next step is, of course, to process those results.

So the results contain a number of bits of information that are helpful in displaying the results to your user. One is the document reference, which is really the same thing you provided it before. And it's up to you what kind of a hierarchy you want to create to make it easy to find items in your application. In this case, as an example, I said mail message 21. That's one way to identify a document. You might have more of a hierarchy, like which mailbox it's in. And you could do it by subject or something.

So it's up to you to then map that back and say, OK, these are the documents I want to display based on that reference. The second bit of information it returns is the relevance ranking and this is of course used in displaying your results. And the third, in the case that you have multiple indexes that you were searching simultaneously, then it's going to return a reference to which index this particular result came from.

The reason for that in the mail example is you need to know, well, I'll just go through all of them. The document reference tells you which message. The relevance rank tells you, of course, the relevance column. And the third one I mentioned, the index reference, is to tell you which mailbox it came from because it has multiple indexes that it has to know where it came from.

So that's basically searching in three steps. And at that point, you've added searching to your application. So how easy is that? Now, one of the things that SearchKit is capable of doing if you use it in just the right way is summarization. But to make it easy, we've brought that API up at a higher level so that you can directly do summarization of documents.

And I'd like to invite David back up to talk about summarization. Hello again. Summarization is my favorite feature of OS X, and I'm wondering if I could see a show of hands how many of you have discovered the summary service in-- some of you, not all of you. It'll be my pleasure to show that off.

It's been, we've had summarization as the service. Since 10.1, and it became better in 10.2, it'll be essentially the same in Panther. Now we've provided this, you heard me mention Find My Content before, and it's a high level interface to SearchKit. And it lives in App Services, which you probably already linked to. And you want to find a header called findbycontent.h to see the syntax. There's SearchKit. There's FindByContent.

Find by Content called SearchKit. Now to do summarization of text, it uses a very simple technique and it's using the SearchKit. It's indexing the sentences. The document now is a sentence and it takes that index and it searches for the best sentences by using--it constructs an idea of what is the meaningful stuff in this document, in this thing that we're summarizing. Looks for that among the sentences and gives you back the best ones. So you see there it's indexing and it's searching. It's exactly like what Wayne described.

And here is the demo. They should play a little music while we walk back and forth here. I have a weblog document here. For an interesting thing that I found on the web, it's an article by a guy named David Stutz that he wrote as part of his farewell to Microsoft when he quit working there.

I love it because it has this great sentence at the end. I don't know if you can read it. He says, "Stop looking over your shoulder and invent something." But, you know, when I get this on the web, I don't really know if it's something that I want to read or not. It's kind of long.

So... Here in Safari, I'm going to select all, go to the Safari menu, Services, Summarize. And there's the summary. And it's giving me a default size summary. I can read that. Or I can take the slider here. I can find the best sentence. I can find the paragraph that contained the best sentence. I can zoom up a little bit to get more. I can just scrub back and forth and get more and less. I can get the entire text like that. Is that cool?

That's the demo. But because it's a service, a lot of people have never even found it at all. And this should be in your application. By the fact that you're here at all, I'm guessing that quite a few of you have applications that are text-oriented. So why would you use this?

Well, to begin with, it's useful, as I just described. You can make little thumbnails. You can make these dynamic summaries that a user can play around with. and generally improve the usefulness of your application and you can advertise it. And it's a lot of fun. Text is not usually much fun. Maybe you envy the graphics people who can put all this wonderful court stuff in their applications.

You can put summarization with dynamic size summaries and it's quite easy. If you're doing a fixed size summarization like to create a set of thumbnails of text, there's one function that you call. You give it a CFString, you tell it how many sentences you want. If you pass zero, it'll figure out a good number of sentences. Gives you back a CFString. It's that simple.

The resizable summarization like what you saw there is twice as hard. There are two functions you have to call. There's two others, but they're not interesting. First function does all the hard work, and you can do that before the user even asks for a summary. You can do that in the background on a thread.

It does all of the analysis of the text into sentences. It does the indexing. It gives back an object that you can keep around. And from that object, you can construct the different size summaries very, very quickly as you saw. One thing you should think of, though, is that that object takes about as much memory as the original text.

How does it work? It works by some very simple stuff. It's statistics on the words in the sentences. It's just SearchKit searching like you saw Wayne doing. And the key is that it's analyzing the text into sentences. And we do that probably as well as anything that any other code that I've seen.

We only keep complete sentences. We throw out everything else. And that makes the summary actually come out better because it throws away Page footers, it throws away little headings and things. So it's only keeping the real sentences and it does not care anything about meaning and grammar. And that's summarization.

Thank you very much. Thanks, David. So as you've seen, you can add searching to your application fairly easily. You can add summarization to your application fairly easily. And both of these APIs are available in Mac OS X Panther. Summarization is actually available in Jaguar. So SearchKit is a powerful text searching framework that's now available in Mac OS X. It indexes anything that has text in it, whether it's in your application or on disk. It provides powerful, fast searching. and Summerization. So now the cat's out of the bag and it's up to you guys.

So I'd like to point to some other sessions that may be of interest to you. If you use Unicode in your application, well, of course these sessions have already occurred, but for folks watching DVDs, if you'd like to look at these sessions. If you're using Unicode in your application, the session number 404, Unicode for Japanese, Chinese, and everything else. If you're interested in searching the address book, then you probably want to use the address book APIs rather than the search code APIs.

And similarly, if you're interested in providing Apple Help, Apple Help uses search code underneath, but you want to go look at the session for Apple Help number 408. And if you're interested in indexing things off the internet, then you'll probably want to look at the session, the Internet Technology Session on the Advanced Foundation URL APIs. And of course, our friend John is the person to contact.

And for more information, there is a SearchKit reference that's available on the ADC site for ADC members, which is a free membership. and as well you can look at the header files on the Panther CD that you have. And if you just go in System Library Frameworks Core Services, you'll see in the frameworks folder there's a SearchKit folder which has all the headers. And those are fairly well documented with comments as well. And similarly you can look at the Find My Content and Summarization APIs there.