App Frameworks • OS X • 45:11
Latent Semantic Mapping (LSM) is the powerful engine behind such Mac OS X features as the Junk Mail Filter, Parental Controls, Kanji Text Input, and in Lion, a more helpful Help. This session will explain how you can use LSM to make your own documents easier for your users to find, to sort, to filter, to classify, and to retrieve. Learn how to use this valuable framework and the LSM command-line tool. Gain an understanding of the major concepts behind the API, see the types of problems it can address in your application, and learn best practices for achieving robust performance.
Speakers: Jerome Bellegarda, Matthias Neeracher, Kim Silverman
Unlisted on Apple Developer site
Downloads from Apple
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Good morning, everybody. It's nice, and flattering for us, to see that you sacrificed your beauty sleep to come up here and catch our presentation. So, what do we have for you today? We're going to be talking about Latent Semantic Mapping, also known as LSM. First, we're going to explain what LSM is, how it works,
and how you can use the APIs. We're going to present some case studies on how we used LSM, and then, more interestingly for you, we're going to talk about how you can use LSM for yourself. There is going to be some time for Q&A at the end, of course. So what is latent semantic mapping? Latent semantic mapping is a technology for analyzing documents, especially text documents, according to their meaning, and classifying them by topic. Now, that's a little bit vague, so allow me to give you a brief demonstration of how this works.
So, the first thing you're going to see here is that no C code was written for this demo whatsoever. We're going to be using just a shell tool, /usr/bin/lsm. And what we're going to do is build a classifier to distinguish Perl files from Python files from Ruby files. We have a whole bunch of these on the system. You can see that in the library folders we have something like 14 megabytes of text files for these three languages, roughly equally split.
What we are going to do now is train a latent semantic mapping map from this. So you see here the command-line invocation, lsm create. We have a name for the map, which you can choose arbitrarily. We have our Perl files, we have our Python files, and we have our Ruby training files.
Each of these is a group of files, and the groups, which are our categories, are separated by semicolons. We're now going to build this map when I hit return. These 14 megabytes of data are going to be read in, and the map is going to be built using a very sophisticated process. So this takes some time, roughly two seconds. And we have our map.
We compiled this into a map of about one megabyte. This map is supposed to classify Perl, Python, and Ruby files. So let's pick some files we haven't seen in the training material. There are a bunch of scripts in /usr/bin, so we're going to see what this map can do on aclocal, on pydoc, and on generate-bridge-metadata.
Now remember, we have three categories. Number one is Perl, number two is Python, number three is Ruby. And our classifier tells us that aclocal most likely belongs to category 1, that is, Perl; pydoc most likely belongs to category 2, that is, Python; and generate-bridge-metadata most likely belongs to category 3, Ruby.
For those of you who have looked at the file names, which the tool didn't do, it's likely that pydoc was a correct guess. So let's see what languages these scripts actually are. And you see that indeed, aclocal is a Perl script, pydoc is a Python script, and generate-bridge-metadata is a Ruby script. So you see that with little effort you can train up a recognizer that distinguishes programming languages based on the different keywords they use.
Now, what other things can you do? You're probably not going to be telling Perl files apart from Python files all day. One of our first successful applications was the Junk Mail Filter in Mail.app, which assesses whether an incoming mail message is legitimate mail or junk mail. The same principle applies to Parental Controls, which does the same kind of classification for web pages.
We also use this for Japanese text input. Japanese users might be familiar with Kana-to-Kanji conversion, which converts Japanese text from a syllabic kana representation to an ideographic kanji representation, a context-sensitive operation. Internally, we are also using LSM, with some success, for localization, where again you have a word that can map to different words in another language, and you need some context to disambiguate what the word could possibly mean. So now you've seen what it looks like and what you could possibly use it for, but how does it work under the hood? To answer this question, I would like to bring up my colleague Jerome Bellegarda, who actually developed the algorithms underlying Latent Semantic Mapping.
Thank you, Matthias. Welcome. So actually, explaining latent semantic mapping is rather simple, because it's all in the name. Let's unravel those three words: latent, semantic, and mapping. First of all, we're talking about a mapping, specifically between words and documents on one side and points in some multidimensional space, which we'll talk about, on the other.
And the important thing about this mapping is that it is a way to go from discrete entities, words and documents in this particular case, to a continuous vector space, which has very nice mathematical properties. The second point is that this mapping is very simple. The third point is that this mapping is semantic, meaning that we are using the meaning behind the words and the documents in order to unravel the entire fabric of the language, so to speak. And finally, this mapping is latent, which in this particular case means that the meaning that we're talking about here is inferred directly from the data and not obtained explicitly.
For example, from a dictionary. And that inference is based on word co-occurrences. Let's drill down a little bit on those three aspects, starting with this word co-occurrence thing. In particular, what I would like to stress is that two words can co-occur in the same document, but also, and that's a more subtle way of getting to co-occurrence, they could co-occur with a common set of words while never occurring with each other.
In both cases, those words will have representations in the LSM space that lie in the same region of the space. So this is great for things like discovering synonyms, for example, car and automobile, which can be used interchangeably, but also for disambiguating between multiple senses of a given word. For example, the word bank.
If bank and rate happen to be close in the LSM space, then we can be pretty certain that that general region of the space represents finance. Whereas if bank and river are close in the same region of the space, then that region is more likely about fishing. So now let's talk about the semantic part of things. And here I'd like to take an example, which is Parental Controls, which Matthias was just mentioning. In Parental Controls, the aim of the feature is to assess whether a web page contains any explicit material.
And so a naive approach might be to say, well, we're going to look for words like sex. However, it's not that simple, because there are many legitimate documents that talk about sex education, for example. So we really have to get to the meaning behind the documents, and we're going to leverage the aspect that I was just discussing, the closeness in the LSM space.
Again, if sex occurs close to a word like toys in the LSM space, then that web page is probably objectionable. Whereas if sex occurs next to education in that space, then that page is probably okay. In the LSM implementation, we use, in this case, two categories: one for the legitimate material, one for the explicit material. And we define special points in the LSM space, which are called semantic anchors.
And then all we have to do is, for a new web page, evaluate whether the representation of the web page in the LSM space is closer to the legitimate anchor or to the explicit anchor. Let me just illustrate on a very simple 2D map. So let's assume we have an anchor which represents, in a sense, the centroid of all the legitimate pages that we've seen. And then we have somewhere else the explicit anchor, which would be the centroid of all the explicit web pages that we've seen.
Then what we do is we map an incoming web page into that space. And then simply compute the distance between that representation of the web page and the two anchors. And in this particular case, we see that the web page is closer to the anchor of the explicit material. So we would conclude that this is an explicit page.
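The session doesn't put the distance measure on a slide; the closeness measure typically used in LSM-style vector spaces is cosine similarity between the document vector and each anchor. A hedged sketch of the decision rule (the symbols d and a_k are added here, not taken from the talk):

```latex
% Similarity between the mapped web page d and anchor a_k:
\operatorname{sim}(\mathbf{d}, \mathbf{a}_k)
  = \frac{\mathbf{d}\cdot\mathbf{a}_k}{\lVert\mathbf{d}\rVert\,\lVert\mathbf{a}_k\rVert}

% Pick whichever anchor is closer:
\hat{k} = \arg\max_{k \in \{\text{legitimate},\,\text{explicit}\}}
          \operatorname{sim}(\mathbf{d}, \mathbf{a}_k)
```

Here d is the new web page mapped into the LSM space, and the a_k are the two semantic anchors, that is, the category centroids.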
Now, this was a very simple illustration. More generally, of course, we have more than two categories. We might have, for example, documents about finance, fishing, and so on and so forth. But the principle is the same: we're going to map these into a multidimensional space. In this particular illustration, I used three dimensions.
But in general, it's much more than that. It could be up to a couple of hundred. For example, we found that for a natural language like English, a dimensionality of about 200 to 300 is appropriate. And so what we do in a training phase is map all of those documents that we have into that space. This results in all those little red triangles there.
And what you see is that those red triangles collectively define regions in the LSM space which represent the categories we had to begin with. Then, when a new document comes in, we have a classification phase where we map that document into the space. And based on where it falls in that space, we conclude that, for example, in this case, it's a document about computer science.
Now, of course, this feels a little bit like magic, but it's not magic. The basic information that we use is how often each word appears in each document compared to how often it appears in the entire collection. And what this allows us to do is build up a matrix of words and documents.
So for example, in this case, let's assume we have M words in our vocabulary and N documents in our collection. We build up a matrix like this, and, of course, the first thing to note is that this matrix is going to be very sparse because, after all, most words don't occur in most documents. What this gives us is that each word can now be represented by a vector of dimension N, and each document by a vector of dimension M.
The problem is that those vectors are very high-dimensional and tend to be sparse, and, of course, the two representations, for words and for documents, are not directly comparable. So in order to address this, we use a mathematical process called the singular value decomposition, which takes this matrix and decomposes it into the three matrices shown here.
What this allows us to do is take an original document, say the one shown in yellow, and map it into a vector v_j on the other side, the red vector, which is no longer sparse. It is now dense, and it has dimension R, which is much smaller than either M or N.
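For reference, here is the decomposition being described, written out in standard LSA/LSM notation (the symbol names W, U, S, V are added for this writeup, not taken from the slides):

```latex
% Rank-R SVD of the M x N word-document matrix W of weighted counts:
W \approx U_R\, S_R\, V_R^{\top},
\qquad U_R \in \mathbb{R}^{M \times R},\;
       S_R \in \mathbb{R}^{R \times R},\;
       V_R \in \mathbb{R}^{N \times R}

% Word w_i     -> i-th row of U_R S_R   (dense, dimension R)
% Document d_j -> j-th row of V_R S_R   (dense, dimension R)

% A new document with raw word-count vector \tilde{d} is folded into the space as:
\tilde{v} = \tilde{d}^{\top}\, U_R\, S_R^{-1}
```

The last line is the standard fold-in step, which is how an incoming document gets its position in the space at classification time.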
So this kind of mathematical trick gives us a much more manageable representation to deal with, and it leads to the kind of LSM space that we saw before, into which we can now map new documents. One thing that's interesting to note is that those documents contain words, of course, and those words are also mapped into that space; in fact, they are those green squares here. And so what we can do is two things. For a blue triangle, which is a new document in the space, we can find the red triangle that's closest, which is essentially the existing document that's most similar to it. We can also find the word that's most closely aligned with that document, which could, for example, serve as a good title for it. So as you can see, there is a range of applications that could benefit from such a representation.
Having said that, of course, there are some caveats to the method. One is that we've been talking about semantics; however, it's a fairly shallow sense of semantics, since it's only based on co-occurrences. There is no actual deep parsing or anything like that. The second caveat is that word order is ignored, so there are no local constraints. These actually can be added explicitly to the framework, and we'll talk about that later on in the presentation. But that's just something to be cognizant of.
Also, this is a statistical method, which means that, like all statistical methods, it lives and dies by the training data that's fed to it. So, for example, let's assume we had a document which had both "river bank" and "Bank of Cupertino" in it. How could we possibly resolve the two senses of bank? In that particular case, we couldn't.
So it's important to have training data that is representative. The final thing I wanted to touch on before moving on to the API is clustering. I've talked about how this LSM space has nice mathematical properties, and clustering is an illustration of what we can do with this nice continuous space. I'd like to take the example of Kana-to-Kanji conversion to illustrate.
In Kana-to-Kanji conversion, we use topic information to disambiguate between the various characters. For those of us who are not native Japanese speakers, this is a little bit analogous to resolving "tale" versus "tail" in the tale of a princess versus the tail of a peacock.
We use the context to get to the right orthography in that case, and it's a little bit similar in Kana-to-Kanji conversion. Now, of course, the Japanese corpus is a large corpus; in this particular case, it contains over 300,000 documents. So how do we best extract that topic information? Well, one thing you could do is go through each document and assign it a category by hand. Not really feasible. Or you can take advantage of the fact that we have this nice continuous space and do the clustering right there in the LSM space.
In that case, the procedure is very simple. We map all the documents into the space, treating each document as a separate category. Then we do some data-driven clustering, right there in the space, to reduce the number of categories. And then, optionally, we can iterate on that process. As Matthias will describe a little later, we have two implementations of clustering.
There are many clustering implementations out there; we've implemented two as part of the framework. One is called k-means clustering, where you start with essentially k cluster centers, usually randomly assigned, compute distances to those cluster centers, and then iterate as you adjust the centroids. This is fairly straightforward and works quite well. The caveat is that it's sensitive to those initial cluster assignments: if you start with another set of k clusters, you're likely to get slightly different clusters.
And, of course, the problem is to find k in the first place. The second implementation, which is available in the LSM API, is agglomerative clustering, where you compute all pairwise distances between all the points in your set, merge the closest pair, and then iterate, adjusting the distances appropriately. This solves the problem of initial cluster assignments, of course.
The caveat here is that if you have a large data set with, let's say, thousands and thousands of points, then the cost might be prohibitive. So anyway, that's just a brief overview of what we can do with LSM. And to continue with the LSM API, I'm going to bring Matthias back on stage.
Thank you, Jerome. So, let's discuss how you can actually use the LSM API. As I said before, if at all possible, especially if your data is primarily text-based, try to prototype with the command-line tool as long as possible, because it's probably going to be easier for you to design your experiments. Once you do then move to the API, note that our API is Core Foundation based. All our data types can be used interchangeably with CF-based objects: you can put them into collections, retain and release them, and so on.
We have fundamentally three data types. Our fundamental type is the LSM map, which stores all our data. An LSM map can be in three states. In training state, it's ready to accept categories and to accept data in these categories. Once you have all the data you want for your training, the map goes into evaluation state, at which point it's ready to classify the data you ask it to. And finally, a map can be stored to disk and read back from disk without having to be recompiled, so you can archive it and unarchive it. The second important data type is an LSM text, which represents a document.
An LSM text is essentially a list of tokens which, as we will see, don't necessarily have to be words. And finally, when you do an evaluation, what you get back is an LSM result, from which you can extract the various pieces of information that you feel are important to know.
So, to look at this in detail: you create a map with LSMMapCreate, so this is very much terminology you should be familiar with from other Core Foundation APIs. You then add categories, and data to those categories, with LSMMapAddCategory and LSMMapAddText, respectively. Once you have all your training data together, you call LSMMapCompile, which puts the map into evaluation state.
If at some point you want to go back and train with some more data (let's say in junk mail filtering, every time the user hits the Junk button), what happens is that the map gets put back into training state, the new mail gets added to the junk or non-junk category, and then the map gets compiled again. So you call LSMMapStartTraining to go back to training, you add your data, and you call LSMMapCompile to go back to evaluation.
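As a rough sketch of the training lifecycle just described (not code from the session; the two categories, the sample strings, and the omission of error checking are illustrative choices):

```c
#include <CoreFoundation/CoreFoundation.h>
#include <LatentSemanticMapping/LatentSemanticMapping.h>

// Build a two-category map (e.g. "good" vs. "junk") and train it.
static LSMMapRef CreateTrainedMap(void)
{
    LSMMapRef map = LSMMapCreate(kCFAllocatorDefault, 0);   // starts in training state

    LSMCategory good = LSMMapAddCategory(map);
    LSMCategory junk = LSMMapAddCategory(map);

    // One LSMText per training document; LSMTextAddWords uses the built-in parser.
    LSMTextRef text = LSMTextCreate(kCFAllocatorDefault, map);
    LSMTextAddWords(text, CFSTR("Lunch at noon tomorrow?"),
                    CFLocaleGetSystem(), 0);
    LSMMapAddText(map, text, good);
    CFRelease(text);

    text = LSMTextCreate(kCFAllocatorDefault, map);
    LSMTextAddWords(text, CFSTR("Buy cheap watches now!!!"),
                    CFLocaleGetSystem(), 0);
    LSMMapAddText(map, text, junk);
    CFRelease(text);

    LSMMapCompile(map);                                      // now in evaluation state
    return map;
}

// Later, add one more sample to an existing map and recompile.
static void AddSampleAndRetrain(LSMMapRef map, CFStringRef sample, LSMCategory category)
{
    LSMMapStartTraining(map);                                // back to training state
    LSMTextRef text = LSMTextCreate(kCFAllocatorDefault, map);
    LSMTextAddWords(text, sample, CFLocaleGetSystem(), 0);
    LSMMapAddText(map, text, category);
    CFRelease(text);
    LSMMapCompile(map);                                      // back to evaluation state
}
```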
To evaluate a document, you call LSMResultCreate, which is a somewhat interesting name because it doesn't have "evaluate" anywhere in it, but what you get back from it is an LSM result, so that's, again, consistent terminology. To archive a map, you call LSMMapWriteToURL, and LSMMapCreateFromURL will load it back. And finally, to compute clusters on a map, you call LSMMapCreateClusters, and if you want to reorganize the map based on those clusters, you call LSMMapApplyClusters.
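A minimal sketch of archiving and clustering (not from the session; the file path, the cluster count of 20, and the kLSMClusterCategories flag choice are assumptions, so check the LatentSemanticMapping header for the full set of cluster options):

```c
#include <CoreFoundation/CoreFoundation.h>
#include <LatentSemanticMapping/LatentSemanticMapping.h>

static LSMMapRef ArchiveAndReload(LSMMapRef map)
{
    // Hypothetical location; in practice you'd put this inside your app's own container.
    CFURLRef url = CFURLCreateWithFileSystemPath(kCFAllocatorDefault,
                       CFSTR("/tmp/MyClassifier.lsm"), kCFURLPOSIXPathStyle, false);

    LSMMapWriteToURL(map, url, 0);                 // save the compiled map
    LSMMapRef reloaded = LSMMapCreateFromURL(kCFAllocatorDefault, url, 0);
    CFRelease(url);
    return reloaded;                               // ready to evaluate, no recompile needed
}

static void ClusterCategories(LSMMapRef map)
{
    // Group the map's categories into (here) 20 clusters; NULL means "use everything".
    CFArrayRef clusters = LSMMapCreateClusters(kCFAllocatorDefault, map,
                                               NULL /* all categories */, 20,
                                               kLSMClusterCategories);
    if (clusters != NULL) {
        LSMMapApplyClusters(map, clusters);        // reorganize the map around the clusters
        CFRelease(clusters);
    }
}
```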
So that's it for maps. Here we have assumed that you get your texts from somewhere, and we're going to discuss that now in a little more detail. There are three ways of putting data into a text. The simplest of them is to trust our built-in parser and call LSMTextAddWords with just a CFString of arbitrary length. We're going to chop up this string for you, extract the words,
and add those words to the text. Now, this is often sufficient. This is all the command-line tool can do, and as we've seen, the command-line tool can do a lot of experimenting and prototyping, and even handle actual applications. However, there are some limitations. First of all, you're at the mercy of our rules for what constitutes a word. Secondly, we throw away all words that consist purely of digits, because in many applications those don't really add a lot of information, and they clutter up the space because you have all those numbers among the words.
But in some applications, if you have ZIP codes or something like that, the numbers become very important. In these cases, and also for some languages where our parser may not do the right thing, you're welcome to write your own parser and add the words individually with LSMTextAddWord, in the singular, which also works.
Finally, the words don't even have to be text. As Jerome hinted at, LSM doesn't care what your words are. It doesn't look inside the words; all it uses the words for is distinguishing them, saying this one is different from that one, this one is equal to that one. So you can have arbitrary binary tokens instead of words. You just add a bunch of bytes in a CFData and call LSMTextAddToken, and it will be as good as a word as far as LSM is concerned.
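To make the three options concrete, here is a rough sketch (not from the session; the example string, the ZIP code, and the byte values are made up for illustration):

```c
#include <CoreFoundation/CoreFoundation.h>
#include <LatentSemanticMapping/LatentSemanticMapping.h>

static LSMTextRef MakeText(LSMMapRef map)
{
    LSMTextRef text = LSMTextCreate(kCFAllocatorDefault, map);

    // 1. Trust the built-in parser: hand it a whole string and let it extract the words.
    LSMTextAddWords(text, CFSTR("Please review the attached draft by Friday."),
                    CFLocaleGetSystem(), 0);

    // 2. Do your own parsing and add words one at a time, e.g. to keep digit-only
    //    tokens such as ZIP codes that the built-in parser would drop.
    LSMTextAddWord(text, CFSTR("95014"));

    // 3. Add arbitrary binary tokens; LSM only cares whether tokens are equal or not.
    const UInt8 bytes[] = { 0xCA, 0xFE, 0xBA, 0xBE };
    CFDataRef token = CFDataCreate(kCFAllocatorDefault, bytes, sizeof(bytes));
    LSMTextAddToken(text, token);
    CFRelease(token);

    return text;   // caller releases with CFRelease when done
}
```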
Finally, when you evaluate such a text, you get back an LSM result with LSMResultCreate, as we've said. Now, the most frequent result you want from that is LSMResultGetCategory, which tells you: is this a good guy or a bad guy? Is it junk mail, or is it legitimate mail? Sometimes you also want to know the scores associated with those categories. Our scores are normalized between 0 and 1, so for each category you get back a score, and if you add them up across all categories, you get 1.
In some applications, you don't actually want to know what the best category is for a document; you want to know what words best represent the document, in which case you would call LSMResultCopyWord, or, in the binary case, LSMResultCopyToken. So that's more or less all there is to the API. You will find more documentation in Xcode, and you can read our headers. It should be relatively simple to figure out.
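Putting the evaluation pieces together, a minimal sketch (not from the session; asking for one category, five words, and the kLSMResultBestWords option are illustrative choices):

```c
#include <stdio.h>
#include <CoreFoundation/CoreFoundation.h>
#include <LatentSemanticMapping/LatentSemanticMapping.h>

static void Evaluate(LSMMapRef map, LSMTextRef text)
{
    // Ask for the single best category match for this text.
    LSMResultRef result = LSMResultCreate(kCFAllocatorDefault, map, text,
                                          1 /* numResults */, 0);
    LSMCategory best  = LSMResultGetCategory(result, 0);   // best-matching category
    float       score = LSMResultGetScore(result, 0);      // normalized to [0, 1]
    printf("best category %u, score %.2f\n", (unsigned)best, score);
    CFRelease(result);

    // Alternatively, ask which words best characterize the text.
    LSMResultRef words = LSMResultCreate(kCFAllocatorDefault, map, text,
                                         5, kLSMResultBestWords);
    for (CFIndex i = 0; i < 5; i++) {
        CFStringRef word = LSMResultCopyWord(words, i);     // LSMResultCopyToken for binary tokens
        if (word != NULL) {
            CFShow(word);
            CFRelease(word);
        }
    }
    CFRelease(words);
}
```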
So what have we been using LSM for internally at Apple? As I said, one of our first applications was junk mail filtering. In many ways, this is a relatively simple application: we have two categories, legitimate mail and junk mail, and in a typical map you will find tens of thousands of words, if not more.
Now, there are some refinements that we made for junk mail filtering. First of all, we did not just call LSMResultGetCategory. We don't treat the risk of throwing away a legitimate mail as being just as bad as the risk of letting through a junk mail. Most people would rather see an occasional junk mail than have a legitimate mail discarded.
So, we introduce a slight bias. We want to err on the side of caution and let the occasional extra junk mail through. The way we do this is that we don't just call LSMResultGetCategory and compare the result to our junk mail category. We also make sure that the score associated with this category is not just greater than 50%, which it always will be if it's the top category, but greater than, say, 55% or 60%. So we want a threshold that is higher than just 50/50.
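In code, the bias might look something like this (a sketch only; the 0.6 cutoff and the junkCategory bookkeeping are illustrative assumptions, not Mail's actual values):

```c
#include <stdbool.h>
#include <CoreFoundation/CoreFoundation.h>
#include <LatentSemanticMapping/LatentSemanticMapping.h>

// Returns true only if the best category is the junk category AND its score
// clears a threshold stricter than 50%, so borderline messages are treated
// as legitimate.
static bool LooksLikeJunk(LSMMapRef map, LSMTextRef message, LSMCategory junkCategory)
{
    const float kJunkThreshold = 0.6f;   // illustrative; tune for your own data

    LSMResultRef result = LSMResultCreate(kCFAllocatorDefault, map, message, 1, 0);
    bool isJunk = (LSMResultGetCategory(result, 0) == junkCategory) &&
                  (LSMResultGetScore(result, 0) > kJunkThreshold);
    CFRelease(result);
    return isJunk;
}
```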
The second complication associated with junk mail is that the bad guys know that we're doing this kind of filtering. So what they did was start spelling their keywords in funny ways so they wouldn't be caught so easily by a simple filter. They added periods between the letters, or they started using accents where no accents are really needed; we call these heavy metal umlauts. And so we added a couple of heuristics to our parser.
Obviously, we're not going to describe in great detail what those heuristics are, but the basic idea is that we recognize such attempts at disguising keywords and we pluck those keywords out, and these attempts in fact make the keywords stand out even more, like sore thumbs. So we're using the bad guys' methods to help us, because their words are going to be even more obvious afterwards.
We enable this with the flag kLSMTextApplySpamHeuristics. The third problem is that the map will end up containing all sorts of offensive words, and we don't want third graders to look at the map in a text editor, if your average third grader knows how to fire up Emacs and open the junk mail map. We don't want them to point at the words and say, "Haha, I found a naughty word." So what we do here is add another flag, kLSMMapHashText. This is not cryptographic security or anything; it just means the map will no longer be readable as plain text.
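Both flags are option bits you pass when creating the map and when adding words; a minimal sketch (the surrounding function names and the incomingMessageBody parameter are made up for illustration):

```c
#include <CoreFoundation/CoreFoundation.h>
#include <LatentSemanticMapping/LatentSemanticMapping.h>

// kLSMMapHashText stores the vocabulary hashed, so offensive training words
// can't simply be read out of the map file with a text editor.
static LSMMapRef CreateJunkMap(void)
{
    return LSMMapCreate(kCFAllocatorDefault, kLSMMapHashText);
}

// kLSMTextApplySpamHeuristics asks the built-in parser to undo common
// obfuscations (stray periods, gratuitous accents) before tokenizing.
static LSMTextRef TokenizeMessage(LSMMapRef map, CFStringRef incomingMessageBody)
{
    LSMTextRef text = LSMTextCreate(kCFAllocatorDefault, map);
    LSMTextAddWords(text, incomingMessageBody, CFLocaleGetSystem(),
                    kLSMTextApplySpamHeuristics);
    return text;
}
```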
Another thing to emphasize is that we're employing LSM as a last line of defense. It is our final judge of junk mail versus good mail, but we apply a lot of rules before it. First of all, the mails are already going to be filtered by your Internet service provider.
Most of them throw away a lot of the incoming mail, and some of them are pretty good at this, so a lot of the junk mail is going to be filtered before it even hits your machine. Once it hits your machine, first we apply the rules you have added explicitly in Mail's rules, and anything that gets covered by one of those rules gets filed away as good mail. Another thing we do is compare it against your Address Book. Of course, the bad guys can forge the senders so they look legitimate, but we want to err on the side of caution.
Anything that comes from somebody you know, somebody who is in your Address Book, is considered to be legitimate mail. And only once none of these rules fire do we use Latent Semantic Mapping to decide, as a last line of defense, whether this is junk mail or good mail. So this was one of our first applications of LSM in the OS. For one of the latest applications of Latent Semantic Mapping in the OS, I would like to bring up my colleague Kim Silverman to talk about the application of LSM to Help. Thanks, Matthias.
Thank you. So this is new, and we wanted to tell you about it because it's a different way that you might think about applying Latent Semantic Mapping in your own applications. OS X contains about 10,000 text documents to help users. You get to these through the Help menu that's in the menu bar of every application: you type a query, and we bring back the best document. Well, we try to bring back the best document. The problem is finding which document is the most relevant. Let me show you an example. One of the most common search queries that people send to Help in OS X is the word "save", believe it or not.
And if you type that into Snow Leopard, these are the documents that are returned. Let's take a look at them. At the top is "Installing and uninstalling screen savers." Probably not very relevant to saving documents, which is what I bet this query is about. So, meh. What about the next one? "Turning off the screen saver in Front Row." Meh. What about the next one? "Converting Grab screenshots to other formats." Meh. And so on it goes.
Oh, wait a minute, here's one that might be a little bit relevant. Saving a document in PostScript format. Marginally relevant. How many of you save documents in PostScript formats? Yeah, see, okay, so we're getting about a 5% hit rate. Retrieving the right document is not trivial. And Snow Leopard does not take a trivial approach.
At its core, it uses Search Kit to find help documents that contain the words that are in the query. The more query words a document contains, the more relevant it might be. In addition, experts have added, by hand, synonyms for common typos and for known words that are related to each other.
The problem is that... well, there are two problems. One is that when people type a query, the words in that query might also occur in documents that are not really relevant to that query. The second problem is that the words in the query might have synonyms that are used in the actual document; the person might not have used the right words.
So both of these suggest that maybe latent semantic mapping could help things along and be fruitful here. So we take all of these 10,000 documents and map them into a multidimensional semantic space. As Jerome has mentioned, words exist in the same space as the documents. When a user types in a query, we map that query into the same space, find the nearest documents, and show those to the user. How well does it work? Well, to evaluate, we took a hundred of the most common queries that people type to help.
For each query, we retrieved the top documents using the keyword-based method and the top documents using Latent Semantic Mapping. About 20 people rated how relevant each document was to its query. For each query, therefore, we ended up with two numbers: one was the average relevance of the results returned by keyword search, and the other was the average relevance of the results returned by Latent Semantic Mapping.
The horizontal axis here shows the relevance of results returned by the keyword search. Queries to the left produced irrelevant results; the right-hand end means highly relevant results. The vertical axis is the relevance of results returned by Latent Semantic Mapping; higher is better. Every point in this space is a query.
Queries that land along this diagonal would mean that Latent Semantic Mapping performed about as well as a keyword search. Any queries that came out near the bottom right would have been answered better by a keyword search than by Latent Semantic Mapping. Queries that come out at the top left would have been answered better by Latent Semantic Mapping.
Here are the queries that were answered better by Latent Semantic Mapping. Here are those where both methods worked about the same. And here are the few where keyword search actually performed better, but not much better. This region is particularly important. These are the queries for which a keyword-based search totally failed, and LSM made the results significantly better. One advantage of doing an analysis like this is that you can identify those cases where LSM is not working so well. For example, this guy down here. Once you identify those cases, you can analyze them, figure out what went wrong, and do something to address the problem.
Which brings me to: what can you do to improve how well Latent Semantic Mapping works in your own application? We send documents into the Latent Semantic Mapping framework and we get out a result. You can improve the performance if you insert a pre-processor before you send the text from your documents into the engine, and improve it even more if you post-process the results on their way out. One way to pre-process those documents is to use n-grams. Jerome foreshadowed this when he mentioned syntactic relationships between adjacent words. N-grams are word pairs and word triplets, and they can capture different senses or meanings of words. For example, "click" in "double click" refers to a user action.
"Click" in "key click" refers to a system sound. N-grams can capture these kinds of differences. You can ask the Latent Semantic Mapping engine itself to calculate all bigrams and, if you wish, trigrams in your training data and use those in the classification. Or you can preprocess the text and explicitly add tokens to your documents to represent the n-grams that you care about. Another thing you can do, and should do, is remove from your documents any text that is not systematically related to the content, to the meaning, of those documents: for example, HTML tags, or text that occurs in every document, such as "return to the contents".
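If you let the framework compute the n-grams for you, the request is made with option flags when the map is created. A minimal sketch, assuming the kLSMMapPairs and kLSMMapTriplets options from the LSM header (the rest of the setup is as shown earlier):

```c
#include <CoreFoundation/CoreFoundation.h>
#include <LatentSemanticMapping/LatentSemanticMapping.h>

// Create a map that tokenizes into single words plus word pairs and triplets,
// so that "double click" and "key click" become distinct features.
static LSMMapRef CreateNgramMap(void)
{
    return LSMMapCreate(kCFAllocatorDefault, kLSMMapPairs | kLSMMapTriplets);
}
```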
Another thing you can do is stemming. For example, the words "save" and "saving" probably should be collapsed together for the purposes of semantics, but the word "saver" should not. The point is: help Latent Semantic Mapping, a statistical technique, by capitalizing on your application- and domain-specific knowledge and encoding it, implementing it as heuristics in a text preprocessor.
You can further improve the results by implementing a post-processor. Let me walk you through an example that we did. When you type a query to Help in Lion, we suggest some completions for your query. When we return the results, we also return some related search terms. All of these are n-grams generated by the way we use the Latent Semantic Mapping engine. And generally, they're quite helpful.
But not always. For example, there are documents that tell you how to clean up your workspace. There are other documents that tell you how to set up your workspace. The most common substring that these share is "up your workspace". So we filter the output n-grams to exclude any that begin or end with a preposition.
So how can you make Latent Semantic Mapping work better for you in your own application? First of all, think about whether there are aspects of your application where you could use it. We've shown you a few examples; here are some more to get you thinking. Bookmarks: if your users have bookmarks, you could classify them by analyzing the documents that those bookmarks point to.
Similarly, if your users have RSS feeds coming in, you could analyze their content and perhaps categorize them together, or use the query approach that we've just described to allow users to retrieve the RSS feeds or documents that they think are relevant via some query. If your users have media, you could go out on the web, retrieve reviews, and categorize the media by those reviews. Similarly, if you've got wines and cheeses, go out and find the reviews and descriptions of those and categorize them by what people say about them. People have also used Latent Semantic Mapping to analyze DNA sequences; in fact, it's a pretty common technique for those.
Some guidelines. In trying to decide whether Latent Semantic Mapping can help with a task in your application, you should ask yourself: is the problem I'm going to address syntactic or semantic in nature? If you're looking for syntactic patterns such as dates or email addresses, Latent Semantic Mapping is not the appropriate technology. But if you want to sort things by topic or retrieve documents by queries, then it will help.
If you are training it to distinguish between categories, then you have to make sure that those categories are sufficiently distinct. It will be difficult for Latent Semantic Mapping to learn the difference between economics and business, but it will be much easier for it to learn the difference between economics and entertainment.
If you are training it, you may already have lots of documents that you've categorized by topic. For example, Matthias showed you a case where we had lots of text files and we knew which programming language each was written in. If you're doing that, partition your training set randomly into 10 chunks. Train on 9 of them, hold out the last chunk for testing, and do that for all possible 90%/10% splits. That will give you the best prediction of how the classifier will work on documents it hasn't seen yet.
If, while you're testing, you find that some of the results look strange, there are some things you can do to improve them. Some of them I've already told you about: pre-processing and post-processing. Another way to help the pre-processing is to use stop words. The engine allows you to give it a list of words that you want it to ignore.
For example, words like "the" and "of" will occur in all documents regardless of their meaning. If your training set is large enough, the engine can automatically learn these, but you might need an infinite training set, so you can probably help it along by saying, "I know that the following words are not going to help for my topics." Also, experiment with the number of dimensions. By default, we set the number of dimensions to match the number of categories you're training on. But if you're just sending in all documents, as we did in the Help case, then choose somewhere between 100 and 300 dimensions. We used 300 for Help.
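A sketch of both knobs (the 300-dimension value mirrors the Help example; the kLSMDimensionKey property comes from the LSM header, while the caller-side stop-word filter shown here is just one illustrative way to apply a stop list before the words reach the engine):

```c
#include <CoreFoundation/CoreFoundation.h>
#include <LatentSemanticMapping/LatentSemanticMapping.h>

// Ask for ~300 dimensions instead of the default (one per category).
// Set this on the map before training and compiling.
static void UseHelpStyleDimensions(LSMMapRef map)
{
    int dimensions = 300;
    CFNumberRef value = CFNumberCreate(kCFAllocatorDefault, kCFNumberIntType, &dimensions);
    const void *keys[]   = { kLSMDimensionKey };
    const void *values[] = { value };
    CFDictionaryRef props = CFDictionaryCreate(kCFAllocatorDefault, keys, values, 1,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);
    LSMMapSetProperties(map, props);
    CFRelease(props);
    CFRelease(value);
}

// One way to apply stop words: skip them in your own parser before adding words.
static void AddWordUnlessStopWord(LSMTextRef text, CFStringRef word, CFSetRef stopWords)
{
    if (!CFSetContainsValue(stopWords, word))
        LSMTextAddWord(text, word);
}
```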
The training data should, as much as possible, represent the full breadth of the domain, the kind of variability that you expect to see in the next document that your user brings along. And if you have multiple categories, try to have roughly even numbers of documents in each category.
People ask, "So how much training data should I have?" Well, it's hard to say. One rule of thumb is that it should be large enough to cover the variability in your topic domain; another is to try to have at least 30,000 unique words. The more categories you need to distinguish between, the more instances of each category you need to let the engine see. And if the data is going to change over time, as news does, then you need to train on even more data.
Finally, we suggest, recommend in fact, that you integrate Latent Semantic Mapping with other sources of knowledge. It's not a replacement for other techniques; it's a complement to them, and it can often improve the robustness of an overall system. Matthias already walked you through in some detail what we did with the Junk Mail Filter. In the same way, when we were doing Kana-to-Kanji conversion, we did not replace everything with Latent Semantic Mapping, but rather used it as an additional source of information that was exploited in the final decision.
So, now it's over to all of you. Go forth and map some text. Thank you. We're supposed to show you some more information: there is documentation available, and there's a mailing list at lists.apple.com, which is a good place to talk with people about what worked for you and what didn't, and to get hints and best practices.