Integration • 58:30
The Latent Semantic Mapping (LSM) framework is the text-analysis technology that underpins the junk mail filtering of Mac OS X Mail and the Mac OS X Kanji text input method. Using this API and text samples with known characteristics, you create and train maps, which you can use to analyze and classify arbitrary text. Learn more about this technology and the major concepts behind the Latent Semantic Mapping API. Find out how this API lets you add innovative features to your applications, and learn development best practices to achieve the most robust performance.
Speakers: Kim Silverman, Matthias Neeracher, Jerome Bellegarda, Giovanni Donelli
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
My name is Kim Silverman. I manage the spoken language technologies. We do speech synthesis. We did the Alex voice in Leopard. Who has heard Alex? Thank you. We do other cool stuff. We do speech recognition. Most relevantly for us today, we do latent semantic analysis, figuring out what text documents mean. Have you ever looked at a document and said to yourself, "What the heck does this mean?" Well, latent semantic analysis figures that out, and that's quite useful. We use it quite successfully for the junk mail filter.
We released the API to you folk. And in Leopard, we have released new features that do clustering. You know, any sufficiently advanced technology is indistinguishable from magic. So to tell you about this magic, I'd like to introduce the senior software engineer who implemented the Latent Semantic Mapping Framework, Dr. Matthias Neeracher.
Good afternoon, everybody. And I probably have to correct Kim: it's not really magic. At best, it's a little protective hex or so. You can think of it that way. So, today we're talking about Latent Semantic Mapping: what it is, how it works, how you can use the API, and how you can avoid using the API. We're going to discuss how we used it in Mac OS X, and then move on to what's more important to you, how you can use it in your application.
So what is Latent Semantic Mapping? In its most general terms, Latent Semantic Mapping is a technology to analyze text, or in fact other data, and classify it into one of a number of categories. Now this is awfully abstract, so let me demonstrate. If you look at /usr/bin, there are not just binaries in there. There are more and more Perl scripts, Python scripts, and Ruby scripts in there. So let's try to use Latent Semantic Mapping to train up a classifier that can tell these types of scripts apart.
So first of all, we need training data. And it turns out we have plenty of that. We have lots of libraries installed. If you look at it, you have something like a good five megabytes of source code for these libraries in all three languages combined. That's more than enough to train up a recognizer or a classifier. So let's train up a Latent Semantic Mapping classifier.
And that we can do with a command line tool that is actually also in /usr/bin. It's called lsm for Latent Semantic Mapping.
[Transcript missing]
And we use these source code libraries as training data. Now, that's a quite complex task. You read in five megabytes of data. You do a complex amount of numerical processing that somebody's going to talk about later. So that's going to take some time, about a second.
Now we have our map. The five megabytes of training data is now represented by about a megabyte or so of a map. We can use that map to evaluate files to see what category they fall into. So we have three categories: category one is Perl, category two is Python, category three is Ruby. So we're going to look at three particular scripts in /usr/bin and see which door Latent Semantic Mapping thinks they are behind.
So we see that for aclocal, our guess is that this is going to be Perl, because category one has by far the highest score. For pydoc, Latent Semantic Mapping makes the astute guess that this might be Python. And for GenerateBridgeMetadata, our guess is that this is going to be Ruby. And now we're going to cheat and use a different method: we just look at the first line of these scripts, and we find out that, lo and behold, aclocal is a Perl script, pydoc is a Python script, and GenerateBridgeMetadata is a Ruby script. So in this case, we had 100% success, although you should not expect that to happen every time around. That would be truly magic.
Can we go back to the slides? So we use Latent Semantic Mapping. The best known example in Mac OS X is the junk mail filter, where we use it to assess whether a mail message is legitimate mail or junk mail. Now in Leopard, we also use it somewhat similarly in parental controls to filter out the evil web pages from the good web pages.
We also use it in the Japanese input method in our kana-to-kanji conversion, because if there are multiple characters with the same reading, we can use Latent Semantic Mapping as one of many methods to disambiguate between them. And we use it internally for localization, where the underlying topic of discourse helps disambiguate words that can be translated differently in different contexts.
So you've seen what we do with it. How does it work? For that, I would like to bring up a colleague. There's going to be some math, but don't worry, it's not going to be on the test. Please welcome the scientist, distinguished scientist who brought this technology to Apple, Dr. Jerome Bellegarda.
Thank you, Matthias. So I'm Jerome Bellegarda. And what I'd like to do is give you a little flavor of how LSM works. And in some sense, this is relatively easy, because it's all in the name. Latent semantic mapping has three aspects to it. Well, it's mapping. And it's semantic. And it's latent. So let's go through it. First of all, it's a mapping. The concept originated in information retrieval.
And it deals with entities such as words and documents. And what it does is it maps those entities, which are discrete in nature, into a continuous space, which we'll talk about. So second, it's semantic. That means it has something to do with the overall content and the meaning which is inside those documents.
And it's also latent. Latent means hidden in the general sense. In this particular context, it means that we're going to leverage the co-occurrences between words in order to infer some of their hidden meaning. And that will enable us to handle things like synonyms and, also, multiple senses of the words. So for example, when it's well-trained, the framework knows that car and automobile are synonyms.
Similarly, it knows that bank has to do with financial institution, most likely. But also, it's clever enough that if bank co-occurs with rate in the same document, it can infer that the topic of the document most likely is about finance. Whereas if the same bank co-occurs with a word like river, then it's most likely an article about fishing.
So what I'd like to do now is walk you through those three aspects of LSM in turn, starting with mapping. So as I mentioned before, what we have is we sort of gather a whole bunch of documents. That's called the training corpus. And what we're going to do with those documents is try to map them onto a space. And in this particular case, I use a three-dimensional example because it's easier to visualize. But in general, it could be fairly complex. It's a complex space with a lot of dimensions.
And so the point of the exercise is to take each of the documents that we have and map them onto points in that space. So in this particular case, the triangles, the red triangles that you see, are just representation of the original documents in that space. This is called the training phase.
And so we do that with all the documents that appear in the training, and pretty soon we see regions of the space that characterize certain topics, such as finance and computer science and so on. Then during a phase called classification, what happens is that we take a new document, it might be a web page, for example, something like that, and we're trying to see where it fits in the space.
And so what we do is we map it in the space using the framework, and lo and behold, it falls somewhere in there. And we sort of look at which cluster it falls within, and we can deduce that in this particular case, for example, the web page that we had is most likely about computer science. That's the mapping part. Let's go into the semantic aspect of it.
And I'd like to take the example of parental controls, because my colleague Giovanni is going to go through a case study of it later. Very briefly, the subject here is to assess whether a web page is free of objectionable material or whether it contains words that are not suitable for children.
And so in the LSM strategy, what we're going to do is try to encapsulate what exactly characterizes objectionable material versus what exactly characterizes legitimate material. And so for that, we have two natural categories, one for objectionable, or so-called explicit, material, and one for legitimate material.
And so the implementation is that we're going to construct an LSM space. In this particular case, since we have only two categories, it's going to be two-dimensional. And then we're going to define what are called semantic anchors in that space. Those will be the points that most characterize the objectionableness, if you will, of the material versus the legitimacy of the material. And then every time we have a web page, we're going to evaluate it against those two anchors and then figure out what it is. So let me illustrate. So here, again, we have the two-dimensional space.
And we have those two anchors. So the legitimate anchor might be this green square here. You can think of it as the centroid of all the legitimate web pages that we've seen in the training data. And similarly, we have the anchor of objectionable material, which might fall over there. And again, you can think of it as the centroid of all the pornographic web pages and so on that we've seen in the training data.
And then a new web page comes in. We map it in the space. We then compute the distance between that representation, that point, and the two anchors. And we see that in this particular case, it is closer to the anchor of explicit material. In which case, we deduce that it's probably a page that you wouldn't want your children to see.
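In the published LSM literature, that "distance" is usually expressed as a cosine-style closeness measure between the scaled document vectors; the session does not spell out the exact formula, so take this as the standard textbook version, with the page assigned to whichever anchor it is closest to:

    % Closeness between a new page p and an anchor a, both represented by
    % rows v_p and v_a of the document matrix V, scaled by the singular values S:
    K(p, a) = \cos(v_p S,\; v_a S) = \frac{v_p S^{2} v_a^{T}}{\lVert v_p S \rVert \, \lVert v_a S \rVert}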
So let's move on and now talk about the latent aspect of things. And for this I'd like to go through a little case study, a little artificial construction here. Imagine a four-feature calendar management system where you can have only four queries. You can ask about the time, about the day, or about the time of the meeting, and you can cancel the meeting; that's all you can do. A very contrived example, but I think it will illustrate some important points. So the LSM strategy here is we have four natural categories, which means that we're going to construct four semantic anchors in the space.
And then of course as we evaluate the input, any input to that system that may not necessarily be in the same words as what's expected, we're going to map this input into the space and figure out which query, which category, it is closest to.
So before we go through the actual latent space for that, why did I choose this admittedly contrived example? Well, those categories, those queries, have been chosen because they illustrate a couple of different things. First of all, the predictive power of the words that appear in those queries is very different. For example, "the", you'll notice, appears in all four queries, meaning that if you see "the", that gives you absolutely no information whatsoever about which query it is about, meaning its predictive power is essentially zero.
Whereas "day" or "cancel", for example, each appear in one particular query and nowhere else, meaning that if you see "day", you know that I'm talking about category number two in that case. So the predictive power is different. Also, we have all kinds of occurrences. We have overlaps.
We have "meeting" appearing in two of the queries. We have ambiguities and so on and so forth. So that's kind of an interesting case study. Now let's assume that a user has been using this system and was successful with it, and then he takes a little break, goes somewhere, comes back, and doesn't quite remember how to query the system.
So he says something like, "When is the meeting?" instead of, "What time is the meeting?" And you'll notice that in that case, "when" has never been seen by the system. And so the question is, what will LSM do with that? So let's go through the actual illustration. You can actually construct this example using the LSM framework and see what happens. This is the two-dimensional rendition of this example.
And here I've mapped the words. So those are all the words. "The", you recall, had no predictive power, so it just happens to be at the origin. "Day" and "cancel", you recall, each predicted one query particularly well. And so they tend to score high on one dimension and low on the other.
And the other words fall somewhere in the middle, and you notice that "what" and "is" happen to co-occur all the time, meaning that they fall on top of each other in this space. Now, about the queries themselves. Here they are. Not surprisingly, "what is the day?" falls very close to "day," which is the word that predicts it best. And similarly with "cancel the meeting." And the other two fall similarly fairly close to their constituent words.
The question is, what happened with the new input? Remember, it was "when is the meeting?" And so you do the computation, and lo and behold, that's where it falls. And you see that the query that is the closest to it is this one. So that would be the query that's predicted by the system.
And so what we've seen is that in a very sort of data-driven way, the system has essentially figured out that "when" is a synonym for "what time." Isn't that great? Okay, so that seems like magic. It's not all magic. And now I'm going to walk you through some of the tedious details of the math that's behind it.
I'm going to keep it short and simple, hopefully. The two pieces of information that we leverage here are how often each word appears in each document, compared to how often each word appears in the entire training data. And what we do is we form a matrix of words by documents. It is not a Christmas gift, despite the appearance. This is simply a two-dimensional array of numbers.
Each entry in this matrix is a weighted count, weighted appropriately, of essentially how many times that particular word, the green word, appears in this particular document, the red document. This defines, as it turns out, an appropriate vector representation for the words and the documents. But then it turns out they're not completely practical.
So we go through an additional step, which is called singular value decomposition, or SVD for short, that takes this matrix W and massages it a little bit, sort of decomposes it into three different matrices that are easier to deal with. And this gives us the appropriate representation, the points in the space that we were looking for. And the set of those points is essentially what the space looks like, the latent semantic space.
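To make that a little more concrete, here is the usual published form of these two steps; the exact weighting that ships in the framework isn't stated in the session, so treat the entropy weighting below as the textbook version:

    % Weighted count of word i in document j: c_{ij} is the raw count, n_j the
    % length of document j, and \varepsilon_i the normalized entropy of word i
    % across the corpus, so a word spread evenly over all documents gets a
    % weight close to zero.
    w_{ij} = (1 - \varepsilon_i) \, \frac{c_{ij}}{n_j}

    % Rank-R singular value decomposition of the M-by-N word-document matrix:
    W \approx U S V^{T}
    % U is M-by-R (one row per word), V is N-by-R (one row per document), and
    % S is the R-by-R diagonal matrix of singular values. Word i is represented
    % by the point u_i S and document j by the point v_j S in the latent space.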
So coming back to my example from the beginning, we then have a set of documents, a bag of documents, and we map it into the space, just as we did in the beginning. Then when we have a new document, we classify it and find out its position in the space. We also-- we've seen in the case study that I presented before that from the document, of course, we can infer the words, and we can map those words equally in the space.
So we are left with a framework that allows us to respond to two questions. Given this blue triangle, which is the new input, this web page that I want to classify, we can ask, what is the closest document in my training corpus that, you know, the document that's closest to this particular input? And that would be given by the red triangle. And this is the kind of stuff that we do, the kind of question we answer when we do junk mail filtering and also parental controls, as you will see.
But we can also ask, given this blue triangle, given this new input, what word best characterizes that document? And that would be the green square there. And this is the kind of question we answer when we do kana-to-kanji conversion, for example. Try to figure out which word best characterizes the document. And so we will either figure out which document is closest or which word is closest in that LSM space.
Now, a couple of caveats. You know, LSM is powerful, but it's no panacea. It only works within the constraints of the particular paradigm. And in particular, it has three limitations. One has to do with a fairly shallow sense of semantics. We're not talking about semantics as described in the typical artificial intelligence literature here.
The semantic aspect of LSM is completely tied to word co-occurrences as they appear in that matrix that I was talking about. So it's fairly shallow. Still powerful. The second aspect is that the word order is ignored. In other words, we take the document as a bag of words. Hence the little pictograph of a bag here.
[Transcript missing]
So how do we use the API? The first thing to notice is the demo I just gave you does not use the API at all. It just uses the command line tool. And in fact, in a large percentage of applications of Latent Semantic Mapping, the command line tool is perfectly adequate to do the prototyping, to possibly even pre-train a map, to do a lot of your testing.
So you can decide whether it's actually working out for your application before you write the first line of C code using the API. Once you do want to write that first line, the API is core foundation based. That means all the types in the Latent Semantic Mapping API can be retained, released. You can put them into NSArrays. You can do unspeakable things with them. So we have-- Latent Semantic Mapping framework is based around three types.
The LSM map, which is the data structure we use to store the transformed training data and to evaluate things against. The LSM text, which is where you would put in text or your other data that you want to use for training or for evaluation while you're building it up. And as a result of the evaluation, you get back an LSM result.
To dig a bit further into the LSMMap operations: there is LSMMapCreate to create a map. Then you start training it, by first creating categories with LSMMapAddCategory, and then you keep creating texts and calling LSMMapAddText to add data to those categories.
Once you're done with your training data, you call LSMMapCompile to put the map into a compiled state. Now, occasionally you may want to go back and add more data. For instance, in junk mail filtering, whenever the user hits the Junk or Not Junk button, we go back, put the map back into training state, add the mail as further training data, and compile it again. That putting back is done with LSMMapStartTraining. In evaluation state, to evaluate a text, you call LSMResultCreate, because according to Core Foundation naming rules, what you really get back is an LSMResult. So that's what the call is named after.
And to store a map to disk and to load it back in, you have LSMMapWriteToURL and LSMMapCreateFromURL. And finally, the calls that we didn't talk about last time and that are new for Leopard are LSMMapCreateClusters, to discover the clusters present in a dataset, and then, if you decide that they are good and you want to use them, LSMMapApplyClusters, which reorganizes the map around those clusters that you've discovered.
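In code, that lifecycle looks roughly like this; a minimal sketch using the calls named above, where the two-category setup and the toy training strings are placeholders rather than anything from the session:

    #include <CoreFoundation/CoreFoundation.h>
    #include <LatentSemanticMapping/LatentSemanticMapping.h>

    // Train a tiny two-category map, compile it, and save it to disk.
    static void TrainAndSave(CFURLRef mapURL)
    {
        LSMMapRef map = LSMMapCreate(kCFAllocatorDefault, 0);

        LSMCategory perl   = LSMMapAddCategory(map);   // category 1
        LSMCategory python = LSMMapAddCategory(map);   // category 2

        // Normally this would loop over many training files per category.
        LSMTextRef text = LSMTextCreate(kCFAllocatorDefault, map);
        LSMTextAddWords(text, CFSTR("use strict; my $count = shift;"), NULL, 0);
        LSMMapAddText(map, text, perl);
        CFRelease(text);

        text = LSMTextCreate(kCFAllocatorDefault, map);
        LSMTextAddWords(text, CFSTR("import sys\ndef main(argv): pass"), NULL, 0);
        LSMMapAddText(map, text, python);
        CFRelease(text);

        LSMMapCompile(map);                 // switch from training to evaluation state
        LSMMapWriteToURL(map, mapURL, 0);   // LSMMapCreateFromURL() loads it back later
        CFRelease(map);

        // To add more data later: LSMMapStartTraining(), more LSMMapAddText()
        // calls, then LSMMapCompile() again.
    }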
So moving on to an LSMText. The name might suggest that an LSMText is a sequence of words. And that's, in fact, what it often is. But it can be more than that or less than that, depending on how you view it. There are three ways of adding data to a text. The simplest way, and the way that is supported by the command line tool, is you just take a long string and let Latent Semantic Mapping break that up into words and add those to the text. That's done with LSMTextAddWords.
This default parser does not work in all circumstances. For instance, our parser discards numbers, because in many contexts, the numbers just clutter up the space and don't really add any information. But in some cases, it's crucially important to preserve numbers, in which case you want to write your own parser. Also, we might not parse all languages correctly. So for some languages, you might want to write your own parser, in which case you just write your parser, and whenever you've found a word, you call LSMTextAddWord, in the singular, to add that word to the text.
And the third option is you don't necessarily have to use words, because despite all the semantic stuff, Latent Semantic Mapping is fundamentally agnostic about what's in your tokens. It doesn't have to be a word. It can be in any language. And in fact, it can just be binary data. So if you want to use binary data, if you want to use a CFData instead of a CFString as your fundamental unit, you just call LSMTextAddToken.
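A quick sketch of those three entry points, assuming the same includes as the sketch above; the sample strings and bytes are placeholders, and `map` is an existing LSMMapRef:

    static void FillText(LSMMapRef map)
    {
        LSMTextRef text = LSMTextCreate(kCFAllocatorDefault, map);

        // 1. Let the built-in parser split a whole string into words.
        CFLocaleRef locale = CFLocaleCopyCurrent();
        LSMTextAddWords(text, CFSTR("What time is the meeting?"), locale, 0);
        CFRelease(locale);

        // 2. Use your own parser and hand over one word at a time,
        //    for instance to preserve a number the default parser would drop.
        LSMTextAddWord(text, CFSTR("2257"));

        // 3. Skip words entirely and add an arbitrary binary token.
        const UInt8 bytes[] = { 0xCA, 0xFE, 0xBA, 0xBE };
        CFDataRef token = CFDataCreate(kCFAllocatorDefault, bytes, sizeof(bytes));
        LSMTextAddToken(text, token);
        CFRelease(token);

        // (In practice a map usually sticks to one of these styles. Hand the
        // text to LSMMapAddText() or LSMResultCreate() afterwards.)
        CFRelease(text);
    }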
Now you have your text and you want to evaluate it. You do that by calling LSMResultCreate, specifying the number of categories that you're interested in. Sometimes you have a kind of winner-takes-all evaluation, in which case you only want one category, the top one. In some cases you want to do some kind of n-best thing, in which case you would use a number greater than one.
For this number of categories, you can inquire what the category number is. For instance, as we saw in the example, one for Perl, two for Python, three for Ruby. That would be LSMResultGetCategory. You may want to know what the score is. The scores of the categories conceptually add up to one, so it's like a probability. You get the score to find out what your confidence in the result is, and you do that with LSMResultGetScore.
If you do a word-based mapping, where you want to get back the best matching words, then the category number doesn't tell you a whole lot. That's just the index within the overall list of words. So you would rather use LSMResultCopyWord to get back the CFString of the words that you put in.
In a clustered map, you may need LSMResultCopyWordCluster, so you get a CFArray of CFStrings. And similarly, if you have a map using binary tokens instead of strings, you use LSMResultCopyToken or LSMResultCopyTokenCluster, which return a CFData or a CFArray of CFData.
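Putting those calls together, a minimal evaluation sketch might look like this; it assumes the includes from the earlier sketch plus <stdio.h>, a compiled `map`, and a `text` built as above:

    // Ask for the three best categories and print each with its score.
    static void ClassifyText(LSMMapRef map, LSMTextRef text)
    {
        LSMResultRef result = LSMResultCreate(kCFAllocatorDefault, map, text,
                                              3 /* n-best */, 0);
        for (CFIndex i = 0; i < LSMResultGetCount(result); ++i) {
            printf("category %d: score %.3f\n",
                   (int)LSMResultGetCategory(result, i),
                   LSMResultGetScore(result, i));
        }
        // For a word-based map, ask for the best matching words instead:
        //   CFStringRef word = LSMResultCopyWord(result, 0); ... CFRelease(word);
        CFRelease(result);
    }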
So that's how you use the API. Now, how did we use it? Our first usage, I said, was junk mail filtering. In a way, that's a pretty simple application. You have the good guys here, the bad guys there, no ambiguity whatsoever, but there are some refinements needed. First of all, we don't want to have equal error rates.
As aggravating as it is to have a junk mail delivered to your inbox, it's much worse to have a legitimate mail drop out of sight in your junk mailbox and not be discovered until several days later. So we err a little bit on the side of classifying a mail as legitimate.
We do that by not just testing whether the top category is the junk mail category, but by also testing that the score we got for that top category is greater than a threshold, which we defined to be somewhat greater than 50 percent. So you might want 55% certainty.
The second problem is that the people who send junk mail are evil, but they are not necessarily stupid. They know that we're parsing their mail, so they make their mail difficult to parse. And they do that, for instance, by inserting gratuitous punctuation, or doing the heavy metal thing and inserting gratuitous umlauts and other diacritics. So we need to counteract that, so we have a specialized parser that is just used for parsing junk, which we activate with a flag to LSMTextAddWords.
Our third problem is that if you look at the map generated by training up a junk mail filter, and you dump that out on the console with strings or with cat, you would see all sorts of offensive words, and we don't really want that. You don't want your children to use cat on that map and see a word like mortgage.
So to counteract that, we added a little bit of a hashing function. It's not military-grade cryptography by any means, but it's just enough that you would not really be able to read anything useful with a text-based tool. And that might be useful in other contexts as well. So you can specify this flag, kLSMMapHashText, when creating a map.
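A rough sketch of those three refinements together, assuming the flag constants declared in LatentSemanticMapping.h; the 0.55 threshold just illustrates "somewhat greater than 50 percent", and the map is assumed to have been trained with two categories (1 = legitimate, 2 = junk) and created with kLSMMapHashText:

    static Boolean MessageIsJunk(CFStringRef messageBody, CFURLRef mapURL)
    {
        // Load the pre-trained map; because it was created with kLSMMapHashText,
        // its stored words are not readable with cat or strings.
        LSMMapRef map = LSMMapCreateFromURL(kCFAllocatorDefault, mapURL, 0);

        // Parse the incoming message with the spam-hardened parser.
        LSMTextRef text = LSMTextCreate(kCFAllocatorDefault, map);
        LSMTextAddWords(text, messageBody, NULL, kLSMTextApplySpamHeuristics);

        // Err on the side of "legitimate": require both the top category and a
        // score comfortably above 50 percent. Category 2 is assumed to be junk.
        LSMResultRef result = LSMResultCreate(kCFAllocatorDefault, map, text, 1, 0);
        Boolean isJunk = LSMResultGetCategory(result, 0) == 2
                      && LSMResultGetScore(result, 0) > 0.55f;

        CFRelease(result);
        CFRelease(text);
        CFRelease(map);
        return isJunk;
    }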
And one final and important point: Latent Semantic Mapping is not used in isolation here. It's used as a last line of defense. If you think about the path that a mail message takes, it first goes to your Internet service provider, and your Internet service provider, if it does its job properly, is probably going to throw away about 85 to 95 percent of incoming mail, because much of it is so obviously spam that you never want it even delivered to your end-user machine. Once it is delivered to your inbox, you apply the handwritten rules that you have in Mail to sort mail early. Those are usually used to file a mail as good mail, not to throw it away, but you could also write rules to throw mail away. Then what you do next is look up the sender of the mail in your Address Book, because that's another case where we err on the side of caution: if you know the person who claims to have sent you this mail, we would rather present you the mail and let you decide whether it's desirable, so the Address Book is also used to preserve a mail. And only if all those filters don't give a conclusive response do we use Latent Semantic Mapping to decide whether a mail is to be discarded or whether it's to be kept. So this is kind of the simple case. For a more complex case, I would like to bring up my colleague, Giovanni Donelli, who implemented the web content filter.
I'm Giovanni Donelli, a software engineer and a member of the team that designed the Leopard web content filter. With my presentation today, I would like to show you what it is like to use Latent Semantic Mapping to solve quite a complex problem such as web filtering. But before I delve more into the details of how we use LSM, let me show you where you can find the web filter.
So as some of you may have noticed, Leopard introduces a brand new set of parental control functionality. It allows you to impose several restrictions on the use of the computer and of its applications. One main area of concern nowadays is obviously the Internet, especially for parents. And I think we all agree that the web can be considered a potentially harmful place for younger minds. So the objective of the web content filter is to validate the material coming from the Internet, making sure no kid gets exposed to explicit material.
So the filter is based on a two-category map, which has been trained to recognize explicit and legitimate material. Our training data is made up of approximately 7,000 pages, which have been collected primarily manually in an effort to approximate the variety and distribution of the web. This is just a text-based filter. We're not doing any fancy image analysis or skin recognition of any sort.
We're taking the HTML page source and converting that into simple text by removing all the layout information, all the tags that would add clutter and noise to the LSM map. In this process, we're obviously trying to retain as much information as possible, such as meta tags or image descriptions.
As you've heard from Matthias, LSM currently ignores numbers, but numbers can be very relevant to our classification. You see here a really good example: 18 USC 2257 identifies a law that regulates the handling of adult material on the Internet. So obviously the presence of this string can strongly indicate explicit material, and we convert numbers into tokens, as you see here in the example, so that LSM can learn to distinguish specific words or specific flags that are potentially explicit.
The structural features of a web page can also give us some hints about the type of material it displays. As you can probably imagine, explicit pages tend to have a lot of images and little text. So we are converting structural features, such as the number of images or the number of words, into tokens, just fake words that we insert in the text. And we let LSM learn which features are to be considered potentially explicit or legitimate.
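As a hypothetical illustration of that idea (the token names and thresholds below are made up for the example; only LSMTextAddWord itself is the real API call):

    // Turn a number the default parser would otherwise drop into a distinctive token.
    static void AddNumberToken(LSMTextRef text, int number)
    {
        CFStringRef tok = CFStringCreateWithFormat(kCFAllocatorDefault, NULL,
                                                   CFSTR("NUM_%d"), number);
        LSMTextAddWord(text, tok);        // e.g. 2257 becomes "NUM_2257"
        CFRelease(tok);
    }

    // Turn structural features of the page into fake words the map can learn from.
    static void AddStructureTokens(LSMTextRef text, CFIndex imageCount, CFIndex wordCount)
    {
        if (wordCount > 0 && imageCount * 10 > wordCount)
            LSMTextAddWord(text, CFSTR("FEATURE_MANY_IMAGES_FEW_WORDS"));
        if (wordCount < 50)
            LSMTextAddWord(text, CFSTR("FEATURE_VERY_SHORT_PAGE"));
    }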
To test how well the filter works at a larger scale, we need to verify the accuracy of the web filter over a set of pages which have never been seen in the training phase. For the purpose of collecting test data, we are using the Open Directory Project. This is an open source effort to classify the web.
Users of DMOZ have already classified thousands of legitimate and explicit pages. We're taking approximately 100,000 of these documents, and we're making the filter evaluate every single one. The filter associates a score between 0 and 1 with each single page. Then we aggregate this information in a graph.
On the horizontal axis, we have the page score; the higher the score, the higher the probability for a page to be explicit. On the vertical axis, we have the proportion of pages with that score. So let's see how our data is distributed over the score. These are the legitimate pages. These are the explicit ones. So the shape of these two curves tells us that the explicit pages are all very similar. In fact, they are primarily classified within a very narrow score range. Legitimate pages, instead, are much more diverse. They cover a wider variety of topics. That's why that curve is much more evenly distributed over the first half of the score spectrum.
Somewhere where those two lines overlap, we can identify the equal error rate. At this point, the filter makes an equal number of mistakes in classifying explicit and legitimate material. The equal error rate is a really good metric of the filter's accuracy, and as we evolve the filter, our goal is to minimize the equal error rate. By looking at this graph, we can also understand where we should put our filter threshold and the risks that are involved with this decision.
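A small sketch of how you might compute that equal error rate from two lists of page scores; this is plain C bookkeeping rather than part of the LSM API, and the 0.001 threshold step is arbitrary:

    #include <math.h>
    #include <stddef.h>

    // Sweep the threshold and return the error rate at the point where the two
    // failure modes (legitimate blocked, explicit let through) are nearly equal.
    static double EqualErrorRate(const float *legit, size_t nLegit,
                                 const float *expl,  size_t nExpl)
    {
        double eer = 1.0, bestGap = 1.0;
        for (double t = 0.0; t <= 1.0; t += 0.001) {
            size_t fp = 0, fn = 0;
            for (size_t i = 0; i < nLegit; ++i) if (legit[i] > t)  ++fp;  // legitimate blocked
            for (size_t i = 0; i < nExpl;  ++i) if (expl[i]  <= t) ++fn;  // explicit let through
            double fpr = (double)fp / nLegit;
            double fnr = (double)fn / nExpl;
            double gap = fabs(fpr - fnr);
            if (gap < bestGap) { bestGap = gap; eer = (fpr + fnr) / 2.0; }
        }
        return eer;
    }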
So automatic testing is really great for understanding how our filter works on the Internet. And it's extremely helpful to keep track of the filter's progress over time as we change it, and to see if there are any regressions. However, it can only be as accurate as the test data we're using is representative of the web.
There are also some disadvantages to automatic testing. Primarily, it has quite a high initial cost to set up the whole system and gather all this test data, so it might not really be suitable in a prototyping phase. And most importantly, automatic testing is not really telling you what's going wrong with your LSM map.
It's not allowing you to easily understand what you need to change in order to improve your classification system. So what we found extremely helpful, especially at the beginning of the development of the web filter, was to analyze what LSM learns in the training phase. You can do that from Terminal: you use the lsm dump command, and you pass it the map file as input.
This prints out all the data gathered during the training phase, so let's analyze it more closely, starting from the last column. Here we read all the words that LSM has seen in the training documents. These make up the dictionary of the LSM map. These are all the words that LSM is able to recognize and assign a meaning to.
Then we have as many columns as the number of categories that are in our map, which is, in the case of the web filter, just two. And here we read the number of times a given word was observed in the training documents provided for each specific category. So by looking at the values on these two columns, we can have an idea on what LSM thinks about specific words.
So for example, the word advertising was found five times more often in the training documents provided to train the legitimate category, and therefore LSM thinks that this word is legitimate. The word "storings" was found almost equally in both categories; therefore the semantic meaning of this word is very low, and LSM will probably ignore this word when classifying a document.
So the idea here is that we can go through this whole list and verify whether LSM has really learned to recognize which words are to be considered explicit or legitimate. The problem here is that the list is pretty big. In our web filter, we have over 100,000 unique words. So this is not something that you can do manually. You want to write a script that identifies the most significant words that are being used to classify a specific category.
So for the sake of the example here, let's say we want to identify the most representative, the most significant words that have been used to classify the explicit category. Well, we first need to select those words that have a very high word count. These are words that appear in almost every single document that was provided in the training phase.
And then out of these words, we need to select only those that appear significantly more in one category than in the other one. And we call a word with this feature an indexing word ("porn" is an indexing word for the explicit category), because its presence in an unknown document strongly shifts the classification of that document towards the explicit category.
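For example, a small filter over the dump output could look like this; the column layout is an assumption (per-category counts followed by the word), so check the actual lsm dump format on your system and adjust the parsing and thresholds:

    #include <stdio.h>

    // Read dump lines of the form "<legit count> <explicit count> <word>" from
    // stdin and print the words that are both frequent and heavily skewed
    // toward the explicit category, i.e. candidate indexing words.
    int main(void)
    {
        char line[1024], word[512];
        double legit, expl;
        while (fgets(line, sizeof line, stdin)) {
            if (sscanf(line, "%lf %lf %511s", &legit, &expl, word) != 3)
                continue;                       // skip headers and malformed lines
            if (legit + expl < 100.0)           // keep only frequent words
                continue;
            if (expl > 5.0 * (legit + 1.0))     // far more common in "explicit"
                printf("%-24s legit=%.0f explicit=%.0f\n", word, legit, expl);
        }
        return 0;
    }

You would then pipe the map dump through it, with something like lsm dump mymap | ./indexing_words (file and program names are placeholders).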
So one of the very first things that we wanted to verify when we were developing the web filter was whether the top indexing words of the explicit category were actually related to adult material. And this is what we obtained. So without doing a very complex analysis, we could see very well what the problems were here.
The filter was considering as explicit words that are just frequently used in English or that are just part of the Internet vocabulary. It was pretty obvious that in the first prototype, the training data that we were using was very imbalanced. It did not really approximate well the distribution of the content of the whole web. We also realized that finding a good representation of the web was quite a hard task to accomplish. So in addition to finding better training data, we should also exclude from the classification words that don't really have any semantic meaning associated with them.
And we called the words that we excluded from the classification "stop words." This graph shows you the distribution of the training pages over their score. The blue dots are pages that we know are legitimate. The red dots are pages that we know are explicit. On the vertical axis, we have the page score. And once again, the higher the score, the higher the probability for a page to be explicit. So if our filter were perfect, all the red dots would be at the top of the graph, and all the blue dots would be at the bottom.
Now, without going too much into the details here, we see that by increasing the number of stop words, the two groups of pages start to cluster at the edges of the graph. And this means we are making a better classification and fewer errors in classifying explicit and legitimate documents.
Another frequent problem that you're going to face while working with LSM is to understand the reason behind a document classification, or most importantly, a document misclassification. And you see here a really good example. In one of the first iterations of our web filter, the Wikipedia page about Barbie was being restricted.
So to figure out the reasons behind a misclassification, you can once again use the lsm dump command from Terminal. You pass as input the map file and the text document that is the subject of your investigation. LSM then prints out all the data, or part of the data, that has been used to evaluate this document. So let's analyze it once again.
In the last column, we read the words that have been used to evaluate this document. These are words that are part of the document and that are also part of the LSM map. Then once again, we have the word frequencies. These are the number of times a given word was observed in the training documents for each specific category. These are the exact same word counts that you will find in the simple map dump.
And then in the very first column, we read a percentage value relative to the number of times a given word is repeated in this document. So, for example, in this specific Wikipedia page, the word "Barbie" is repeated about once every ten words in the document.
So let's go back to our initial question: why is this Wikipedia page being blocked by our filter? What we can see here at the top is that the words "doll" and "dolls" were found significantly more often in the explicit category. Therefore, these two words were driving the classification towards the explicit category, and the filter was restricting the Wikipedia page. To fix this problem, we simply needed to add more training documents related to legitimate toys, and that was enough to fix this specific issue.
So it is now 5:50, and you can relax knowing that your children are safely surfing the web thanks to Latent Semantic Mapping. Thank you. So you've seen how we use latent semantic mapping for our purposes. How can you make it work for you, yourself, for your applications? You have to think about what you could categorize with latent semantic mapping. You could use it to categorize your bookmarks in your browser.
You could use it to categorize RSS feeds by content. You could categorize your books or other media library by going and fetching reviews or abstracts and using those to put those media into categories. You could categorize your wine cellar. You could go wild and do math science and categorize DNA sequences. It's really up to you. But there are a few things you should know before you consider using LSM.
First of all, LSM handles tasks that are semantic in nature. If a problem is syntactic in nature, like identifying dates, times, or addresses in text, that's not an area where LSM is going to do well. A problem needs to be semantic in nature. That is, for instance, categorizing documents by topic.
Secondly, the categories you pick need to be distinct enough. If you have something like the economy page of a newspaper versus the business page, then Latent Semantic Mapping is probably not going to be able to tell you which page an article belongs on. If it's the economy section versus the entertainment section, although there still are overlaps, Latent Semantic Mapping is likely to be able to tell those apart.
You also need good training data. The quality of your training data matters: it needs to be representative of the full breadth of your domain. For instance, as Giovanni said, if you don't train with toys, you might mischaracterize whether dolls are an adult or a child concept.
Your data needs to be reasonably clean for the purpose of classification. That is, if you're not interested in HTML tags, you should not present the HTML tags to the training. You should try to fix the typos in your training material. And very importantly, your training data needs to be as balanced as possible across categories. If you have vastly more material in one category than in another, that's also going to skew evaluation results.
That also leads us into the other point. You need enough training data. You need enough data to cover the variability in your material. As a rule of thumb, if you have a large-vocabulary application, that is, general English text rather than very domain-restricted text, you need more than 30,000 unique words.
And the more categories you have, the more training data you need. And if you categorize something like content that comes in from a news feed, where your data changes over time, you need more initial training data. Or, of course, you can also go back and retrain later, adding more material.
So how do you test the accuracy of your results? One method which is kind of tried and proven in research and in practical application is to partition your training data into a number of chunks. You use all but one of them to train up your map. And the last one that you held out, you use for testing.
It's important that you don't use testing data that is already present in the training of the map, because that's completely going to invalidate your results. Then you run your test. You rotate things around by holding out another part of your input data, and then average out the results.
So what if your results are not as good as you would like them to be? The first thing you could try, which is what Giovanni did, is to use a list of stop words. You saw in his diagrams that this was a simple and very effective means of improving results.
That is, holding out words that appear in all categories roughly equally. Secondly, you can try experimenting with the number of dimensions. By default, our number of dimensions is the number of categories. But if you have a problem where you use 1,000 categories, there is no point in doing your math with 1,000 dimensions.
For a natural language problem, you should restrict things to about 100 to 300 dimensions. And if that still doesn't work, you can also try adding some frequently used word pairs or word triplets via a custom parser. And if you have something like 18 USC 2257, that might be a useful token to add to your map.
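A sketch of wiring up those knobs, assuming the constants declared in LatentSemanticMapping.h: kLSMMapPairs and kLSMMapTriplets are flags that index word pairs or triplets for you (the session mentions a custom parser for this), and the 200 is just a number in the 100-to-300 range mentioned above:

    static LSMMapRef CreateTunedMap(void)
    {
        // Index word pairs in addition to single words (kLSMMapTriplets adds triplets).
        LSMMapRef map = LSMMapCreate(kCFAllocatorDefault, kLSMMapPairs);

        // Cap the number of dimensions for a large-vocabulary natural-language map.
        int dimensions = 200;
        CFNumberRef value = CFNumberCreate(kCFAllocatorDefault, kCFNumberIntType, &dimensions);
        CFDictionaryRef props = CFDictionaryCreate(kCFAllocatorDefault,
                                                   (const void **)&kLSMDimensionKey,
                                                   (const void **)&value, 1,
                                                   &kCFTypeDictionaryKeyCallBacks,
                                                   &kCFTypeDictionaryValueCallBacks);
        LSMMapSetProperties(map, props);
        CFRelease(props);
        CFRelease(value);

        // Stop words are simplest to handle in your own parsing step: just never
        // call LSMTextAddWord() for a word on your stop list.
        return map;
    }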
As we have seen, for instance, in junk mail filtering, it's often useful to combine Latent Semantic Mapping with other sources of information and knowledge, because Latent Semantic Mapping does well where other techniques don't do so well, and vice versa, other techniques complement Latent Semantic Mapping. We've used that in the junk mail filter, where we have handwritten rules and server-side rules. We use it in kana-to-kanji conversion, where we apply a whole bunch of heuristics and use Latent Semantic Mapping as one method among many.
That's all there is to know. Now go forth, map some text, map some other data, and tell us about your results. For more information, you can go to the WWDC website. You can also contact our evangelist, Matt Brands. We have a public mailing list that you're welcome to subscribe to. It's fairly low traffic, so it's not going to burden your inbox with more spam in the interest of spam reduction. And you can also go look up our documentation.