Integration • 58:30
The Latent Semantic Mapping (LSM) framework is the text-analysis technology that underpins the junk mail filtering of Mac OS X Mail and the Mac OS X Kanji text input method. Using this API and text samples with known characteristics, you create and train maps, which you can use to analyze and classify arbitrary text. Learn more about this technology and the major concepts behind the Latent Semantic Mapping API. Find out how this API lets you add innovative features to your applications, and learn development best practices to achieve the most robust performance.
Speakers: Kim Silverman, Matthias Neeracher, Jerome Bellegarda, Giovanni Donelli
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it may contain transcription errors.
My name is Kim Silverman. I manage the spoken language technologies. We do speech synthesis. We did the Alex voice in Leopard. Who has heard Alex? Thank you. We do other cool stuff. We do speech recognition. Most relevantly for us today, we do latent semantic analysis, figuring out what text documents mean. Have you ever looked at a document and said to yourself, "What the heck does this mean?" Well, latent semantic analysis figures that out, and that's quite useful. We use it quite successfully for the junk mail filter. We released the API to you folks, and in Leopard we have released new features that do clustering. You know, any sufficiently advanced technology is indistinguishable from magic. So to tell you about this magic, I'd like to introduce the senior software engineer who implemented the Latent Semantic Mapping framework, Dr. Matthias Neeracher.
Good afternoon, everybody. And I probably have to correct Kim. It's not really magic. At best, it's a little protective hex or so. You can think of that. So today we're talking about latent semantic mapping, what it is, how it works, how you can use the API and how you can avoid using the API. We're going to discuss how we used it in Mac OS X and then move on to what's more important to you, how you can use it in your application.
So what is latent semantic mapping? In its most general terms, latent semantic mapping is a technology to analyze text, or in fact other data, and classify it into one of a number of categories. Now this is awfully abstract, so let me demonstrate. If you look at /usr/bin, there are not just binaries in there. There are more and more Perl scripts, Python scripts, Ruby scripts in there. So let's try to use latent semantic mapping to train up a classifier that can tell these types of scripts apart.
So first of all, we need training data. And it turns out we have plenty of that. We have lots of libraries installed. If you look at it, you have something like a good five megabytes of source code for these libraries in all three languages combined. That's more than enough to train up a recognizer or a classifier. So let's train up a latent semantic mapping classifier.
And that we can do with a command line tool that is actually also in /usr/bin. It's called lsm, for latent semantic mapping. We create a classifier, a map named simply after Perl, Python, and Ruby, and we use these source code libraries as training data. Now, that's a quite complex task. You read in five megabytes of data and do a large amount of numerical processing that somebody is going to talk about later. So that's going to take some time: about a second.
Now we have our map. The five megabytes of training data is now represented by about a megabyte or so of a map. We can use that map to evaluate files to see what category they fall into. So we have three categories: category one is Perl, category two is Python, category three is Ruby. So we're going to look at three particular scripts in /usr/bin and see what latent semantic mapping thinks they are.
So we see that for aclocal, our guess is that it is going to be Perl, because category one has by far the highest score. For pydoc, latent semantic mapping makes the astute guess that this might be Python. And for generate bridge metadata, our guess is that this is going to be Ruby. Now we're going to cheat and use another method: we just look at the first line of these scripts, and we find out that, lo and behold, aclocal is a Perl script, pydoc is a Python script, and generate bridge metadata is a Ruby script. So in this case we had 100% success, although you should not expect that to happen every time around. That would be truly magic. Can we go back to the slides?
So we use latent semantic mapping. The best known example in Mac OS X is we use it for the junk mail filter to assess whether a mail message is legitimate mail or junk mail. Now in Leopard, we also use it somewhat similarly in parental controls to filter out the evil web pages from the good web pages.
We also use it in the Japanese input method in our kana to kanji conversion, because if there are multiple characters with the same reading, we can use latent semantic mapping as one of many methods to disambiguate between the ambiguous readings. And we use it internally for localization to use the underlying topic of discourse to disambiguate words that can be translated differently in different contexts.
So you've seen what we do with it. How does it work? For that, I would like to bring up a colleague. There's going to be some math, but don't worry, it's not going to be on the test. Please welcome the distinguished scientist who brought this technology to Apple, Dr. Jerome Bellegarda.
Thank you, Matthias. So I'm Jerome Bellegarda, and what I'd like to do is give you a little flavor of how LSM works. And in some sense, this is relatively easy because it's all in the name: latent semantic mapping has three aspects to it. It's a mapping, it's semantic, and it's latent, so let's go through them. First of all, it's a mapping. The concept originated in information retrieval.
And it deals with entities such as words and documents. And what it does is it maps those entities, which are discrete in nature, into a continuous space, which we'll talk about. So second, it's semantic. That means it has something to do with the overall content and the meaning which is inside those documents.
And it's also latent. Latent means hidden in the general sense. In this particular context, it means that we're going to leverage the co-occurrences between words in order to infer some of their hidden meaning. And that will enable us to handle things like synonyms and also multiple senses of the word.
So for example, when it's well-trained, the framework knows that car and automobile are synonyms. Similarly, it knows that bank has to do with financial institution, most likely. But also, it's clever enough that if bank co-occurs with rate in the same document, it can infer that the topic of the document most likely is about finance, whereas the same bank, if the same bank co-occurs with a word like river, then it's most likely an article about fishing. So what I'd like to do now is walk you through those three aspects of LSM in turn, starting with mapping. So as I mentioned before, what we have is we sort of gather a whole bunch of documents. That's called the training corpus. And what we're going to do with those documents is try to map them onto a space. And in this particular case, I use a three-dimensional example because it's easier to visualize. But in general, it could be a fairly complex space with a lot of dimensions.
And so the point of the exercise is to take each of the documents that we have and map them onto points in that space. So in this particular case, the red triangles that you see are just representations of the original documents in that space. This is called the training phase. We do that with all the documents that appear in the training, and pretty soon we see regions of the space that characterize certain topics, such as finance and computer science and so on. Then, during a phase called classification, we take a new document, which might be a web page, for example, and we try to see where it fits in the space. So we map it into the space using the framework, and lo and behold, it falls somewhere in there. We look at which cluster it falls within, and we can deduce that in this particular case, for example, the web page that we had is most likely about computer science. That's the mapping part.

Let's go into the semantic aspect of it. I'd like to take the example of parental controls, because my colleague Giovanni is going to go through a case study of that later. Very briefly, the objective here is to assess whether a web page is free of objectionable material or whether it contains words that are not suitable for children. So in the LSM strategy, what we're going to do is try to encapsulate what exactly characterizes objectionable material versus what exactly characterizes legitimate material. For that, we have two natural categories, one for objectionable, or so-called explicit, material, and one for legitimate material. And so the implementation is that we're going to construct an LSM space, which in this particular case, since we have only two categories, is going to be two-dimensional. Then we're going to define what are called semantic anchors in that space.
Those will be the points that most characterize the objectionableness, if you will, of the material versus the legitimacy of the material. And then every time we have a web page, we're going to evaluate it against those two anchors and figure out what it is. So let me illustrate. Here, again, we have the two-dimensional space, and we have those two anchors. The legitimate anchor might be this green square here. You can think of it as the centroid of all the legitimate web pages that we've seen in the training data. And similarly, we have the anchor of objectionable material, which might fall in there. Again, you can think of it as the centroid of all the pornographic web pages and so on seen in the training data. Then a new web page comes in. We map it into the space. We then compute the distance between that representation, that point, and the two anchors. And we see that in this particular case, it is closer to the anchor of explicit material, in which case we deduce that it's probably a page that you wouldn't want your children to see.
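The anchor-and-distance idea can be sketched in a few lines of Python. This is an illustration of the concept only, not the LSM implementation: the anchor coordinates are made up, and a real map would derive them from training data.

```python
# Conceptual sketch: classify a point against two "semantic anchors"
# by cosine similarity (closeness in direction, as in LSM-style spaces).
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(vec, anchors):
    """Return the label of the anchor most similar to vec."""
    return max(anchors, key=lambda label: cosine(vec, anchors[label]))

# Hypothetical anchors: centroids of training pages in a toy 2-D space.
anchors = {"legitimate": [0.9, 0.1], "explicit": [0.2, 0.8]}
print(classify([0.8, 0.3], anchors))   # → legitimate
```

A new page is mapped to a point and simply takes the label of whichever anchor it lands closest to.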
So let's move on and now talk about the latent aspect of things. And for this, I'd like to go through a little case study, a little artificial construction here. Imagine a four-feature calendar management system where you can have only four queries. You can ask about the time, about the day, about the time of the meeting. You can also cancel the meeting, and that's all you can do.
It's a contrived example, but it will, I think, illustrate some important points. So the LSM strategy here is: we have four natural categories, which means that we're going to construct four semantic anchors in the space. And then, as we evaluate input to that system, input that may not necessarily use the same words as expected, we're going to map this input into the space and figure out which query, which category, it is closest to. Now, before we go through the actual latent space, why did I choose this admittedly contrived example? Well, those queries have been chosen because they illustrate a couple of different things.
First of all, the predictive power of the words that appear in those queries is very different. For example, "the," you'll notice, appears in all four queries, meaning that if you see "the," it gives you absolutely no information whatsoever about which query it is, meaning its predictive power is essentially zero.
Whereas "day" or "cancel," for example, each appears in one particular query and nowhere else. If you see "day," you know that I'm talking about query number two in that case. So the predictive power is different. Also we have all kinds of occurrences. We have overlaps: we have "meeting" appearing in two of the queries. We have ambiguities and so on and so forth. So that's kind of an interesting case study. Now let's assume that a user has been using this system and was successful with it. He goes and takes a little break, goes somewhere, comes back, doesn't quite remember how to query the system.
So he says something like, "When is the meeting?" instead of, "What time is the meeting?" And you notice that in that case, "when" has never been seen by the system. So the question is, what will LSM do with that? We can actually construct this example using the LSM framework and see what happens. This is the two-dimensional rendition of this example, and here I've mapped the words. So those are all the words.
"The," you recall, had no predictive power, so it just happens to be at the origin. "Day" and "cancel," you recall, each predicted one query particularly well, and so they land high on one dimension and low on the other. The other words fall somewhere in the middle, and you notice that "what" and "is" happen to co-occur all the time, meaning that they fall on top of each other in this space. Now for the queries themselves. Here they are. Not surprisingly, "What is the day" falls very close to "day," which is the word that predicts it best.
And similarly with "Cancel the meeting." The other two fall similarly fairly close to their constituent words. The question is, what happens with the new input? Remember, it was "When is the meeting?" So we do the computation, and lo and behold, that's where it falls. And you see that the query that is closest to it is this one. So that would be the query predicted by the system. And so what we've seen is that, in a very data-driven way, the system has essentially figured out "when" is a synonym for "what time." Isn't that great?
Okay, so that seems like magic. It's not all magic, and now I'm going to walk you through some of the tedious details of the math behind it. I'm going to keep it short and simple, hopefully. The two pieces of information that we leverage here are how often each word appears in each document, compared to how often that word appears in the entire training data. And what we do is form a matrix of words by documents. It is not a Christmas gift, despite the appearance.
This is simply a two-dimensional array of numbers. Each entry in this matrix is a weighted count, weighted appropriately, of essentially how many times that particular word, the green word, appears in this particular document, the red document. So this defines, as it turns out, an appropriate vector representation for the words and the documents. But these raw vectors turn out not to be completely practical. So we go through an additional step, which is called singular value decomposition, or SVD for short, which takes this matrix W and massages it a little bit, decomposing it into three different matrices that are easier to deal with. And this gives us the appropriate representation, the points in the space that we were looking for. And the set of those points is essentially what the space looks like: the latent semantic space.
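As a sketch of the math only (assuming NumPy; the calendar queries from the earlier example serve as the "documents," and the weighting step is omitted, so the numbers are purely illustrative): build a word-by-document count matrix W, take its SVD, and keep a small number of latent dimensions to get low-dimensional points for every word and document.

```python
# Toy word-by-document matrix and rank-k SVD, W ≈ U S V^T.
# Rows of U*S are word points; rows of V*S are document points.
import numpy as np

words = ["what", "time", "is", "the", "day", "meeting", "cancel"]
docs = [
    "what is the time",
    "what is the day",
    "what time is the meeting",
    "cancel the meeting",
]
W = np.array([[d.split().count(w) for d in docs] for w in words], float)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2                                   # keep 2 latent dimensions
doc_vecs = Vt[:k].T * s[:k]             # one 2-D point per document
word_vecs = U[:, :k] * s[:k]            # one 2-D point per word
print(doc_vecs.shape)                   # → (4, 2)
```

Classifying a new query then amounts to projecting its word counts into the same space and finding the nearest document point, which is how "When is the meeting?" can land near "What time is the meeting?" even though "when" was never trained.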
So going back to my example from the beginning, we have a set of documents, a bag of documents, and we map them into the space, just as we did in the beginning. Then when we have a new document, we classify it and find its position in the space. We've seen in the case study I presented before that from the document we can, of course, infer the words, and we can map those words equally in the space. So we are left with a framework that allows us to answer two questions. Given this blue triangle, which is the new input, this web page that I want to classify, we can ask: what is the document in my training corpus that is closest to this particular input? That would be given by the red triangle. This is the kind of question we answer when we do junk mail filtering and also parental controls, as you will see. But we can also ask: given this blue triangle, given this new input, what word best characterizes that document? That would be the green square there. And this is the kind of question we answer when we do Kana-to-Kanji conversion, for example, where we try to figure out which word best characterizes a document. So we will figure out either which document is closest or which word is closest in that LSM space.
Now, a couple of caveats. LSM is powerful, but it's no panacea. It only works within the constraints of the particular paradigm. In particular, it has three limitations. One has to do with a fairly shallow sense of semantics. We're not talking about semantics as described in the typical artificial intelligence literature here; the semantic aspect of LSM is completely tied to word co-occurrences as they appear in the matrix I was talking about.
So it's very shallow, but still powerful. The second aspect is that word order is ignored. In other words, we treat the document as a bag of words, hence the little pictograph of a bag here. Of course, this is a limitation that can easily be overcome: instead of words, you can consider word pairs, for example, or even word triplets, and we do that occasionally in some applications. But that has implications for the size of the matrices and so on. The third limitation has to do with the fact that singular value decomposition is a method that falls within the realm of linear algebra, and as we know, the world is not always linear. So there are some situations where it wouldn't be appropriate, and Matthias will go through some guidelines at the end of the presentation: how do you know when LSM will help and when it won't?

Another point I would like to make is the critical importance of training data. For example, I was talking earlier about the bank of Bank of America versus the bank of the riverbank. That's all well and good if they appear in different documents. But what if they appear in the same document? Then we wouldn't know whether it's the financial bank or the physical bank, in which case there is some ambiguity that would need to be resolved. Another caveat is that LSM tends to be sensitive to the writing style implicit in the document. An article written in the Wall Street Journal would be very different in appearance, in the kind of vocabulary, the words that are chosen, even the syntax, from an article on the same topic written in the style of the Associated Press, and sometimes LSM is sensitive to that. And finally, the last caveat is that in some applications where we might have a large, sparse matrix, computing the SVD might be a little computationally intensive.
Of course, this is something that typically you do once, offline, for training purposes, so that's not so bad. All right, so now let me go into some new things that we've added to the LSM framework for Leopard in particular.
And that has to do with clustering. The problem that we're trying to address here is that sometimes the categories are ill-defined. So far, in all the examples that I've presented, the categories were pretty clear. But sometimes they aren't. Let's take the example of Kana-to-Kanji conversion.
The idea here is to leverage topic information to disambiguate between ambiguous characters in Japanese. This is a little bit analogous to the homophones in the tale of a princess versus the tail of a peacock: the words are pronounced the same way, but of course they don't mean the same thing. Well, in Japanese there are many similar kinds of problems that have to be solved in order to do proper Kana-to-Kanji conversion. So what our colleagues did when they were trying to solve this problem is they gathered a big corpus of Japanese text, over 300,000 documents. Well, how do you extract the topic? What would be the appropriate granularity at which to run LSM? Not so obvious. You have two solutions. The first one is, well, you take all those documents, the 300,000 of them, and you assign them to a small number of topics, maybe 50 or so. And of course, that's extremely tedious.
So another solution is to leverage the representation we already have in the LSM space and do data-driven clustering in that space, in order to reduce the number of categories in a data-driven way. And then, optionally, once this is done, rerun the method to get a new LSM space, and so on, in an iterative fashion. To make that easy, we've added some clustering APIs to the framework. Of course, there are many different methods of doing clustering; we offer two particular methods. One is k-means clustering and the other is agglomerative clustering. I just want to go briefly through what those two methods are. So imagine you have some data and you suspect that it clusters into, in this particular case, three clusters. What k-means clustering does is it starts with some random seeds.
Let's say we start with three blue seeds here. What we're going to do now is assign all the data points in the space to those seeds, and progressively the means of the clusters will converge toward the centers of the clusters. At the end of the process, we have three clusters in the space. The other method is called agglomerative clustering. Whereas k-means is more of a top-down process, this is more of a bottom-up process. So we first find the two points that are the closest in the data.
And then, iteratively, we just continue to gather more and more data in a bottom-up fashion until finally we get the clusters. So those two methods are offered in the LSM API. There are tradeoffs associated with each of them. Typically, agglomerative clustering doesn't work very well when you have a large data set, because it's compute-intensive to compute all those distances.
But k-means clustering tends to be sensitive to the random seeds that you start with. So sometimes you have to go through several rounds of initialization in order to get good clusters. So that's what I wanted to give you: a flavor of the theory and the principles behind the LSM framework.
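The assign-and-update loop of k-means can be sketched in a few lines of pure Python. This is illustrative only (the LSM framework does its clustering internally, in its own semantic space); the seeds here are fixed rather than random so the run is repeatable, sidestepping the seed-sensitivity just mentioned.

```python
# Minimal k-means sketch: alternate assigning points to the nearest
# center and moving each center to the mean of its assigned points.
def kmeans(points, seeds, iters=10):
    centers = [list(s) for s in seeds]
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: move each center to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points, seeds=[(0, 0), (10, 10)])
print(len(clusters[0]), len(clusters[1]))   # → 3 3
```

With random seeds, a bad initialization can merge or split these groups, which is why several rounds of initialization are sometimes needed.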
And now to walk you through the API, I would like to bring Matthias back on stage. Thank you, Jerome. How do we use the API? The first thing to notice is that the demo I just gave you does not use the API at all. It just uses the command line tool. And in fact, in a large percentage of applications of latent semantic mapping, the command line tool is perfectly adequate to do the prototyping, to possibly even pre-train a map, and to do a lot of your testing.
So you can decide whether it's actually working out for your application before you write the first line of C code using the API. Once you do want to write that first line: the API is Core Foundation based. That means all the types in the Latent Semantic Mapping API can be retained and released; you can put them into NSArrays; you can do unspeakable things with them. The Latent Semantic Mapping framework is based around three types. There is the LSMMap, which is the data structure we use to store the transformed training data and to evaluate things against; the LSMText, which is where you put the text or other data that you want to use for training or for evaluation while you're building it up; and, as the result of an evaluation, you get back an LSMResult.
To dig a bit further into an LSMMap: the operations there are LSMMapCreate to create a map. You start training it by first creating categories with LSMMapAddCategory, and then you keep creating texts and calling LSMMapAddText to add data to those categories. Once you're done with your training data, you call LSMMapCompile to put the map into a compiled state. Now, occasionally, you may want to go back and add more data.
For instance, in junk mail filtering, whenever the user hits the Junk or Not Junk button, we go back, put the map back into training state, add the mail as further training data, and compile it again. That putting back is done with LSMMapStartTraining. In the evaluation state, to evaluate a text, you call LSMResultCreate, because according to Core Foundation naming rules, what you really get back is an LSMResult. So that's what the call is named after.
And to store a map to disk and to load it back in, you have LSMMapWriteToURL and LSMMapCreateFromURL. And finally, the calls that we didn't talk about last time, which are new for Leopard, are LSMMapCreateClusters, to discover the clusters present in a data set, and then, if you decide that they are good and you want to use them, LSMMapApplyClusters, which reorganizes the map around those clusters.
So moving on to an LSMText. The name might suggest that an LSMText is a sequence of words, and that's, in fact, what it often is. But it can be more than that or less than that, depending on how you view it. There are three ways of adding data to a text. The simplest way, and the way that is supported by the command line tool, is you just take a long string and let latent semantic mapping break it up into words and add those to the text. That's done with LSMTextAddWords.
This default parser does not work in all circumstances. For instance, our parser discards numbers, because in many contexts numbers just clutter up the space and don't really add any information. But in some cases it's crucially important to preserve numbers, in which case you want to write your own parser. Also, we might not parse all languages correctly, so for some languages you might want to write your own parser. In that case, you just write your parser, and whenever you've found a word, you call LSMTextAddWord, in the singular, to add it to the text. And the third option is you don't necessarily have to use words, because despite all the semantic stuff, latent semantic mapping is fundamentally agnostic about what's in your tokens. A token doesn't have to be a word; it can be in any language, and in fact it can even be binary data. So if you want to use binary data, a CFData instead of a CFString, as your fundamental unit, you just call LSMTextAddToken.
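A custom parser can be as simple as a tokenizer that keeps digit runs. This Python sketch is purely illustrative of the parsing step (the regex and options are assumptions, not the framework's parser); with the real API, each resulting token would be fed to LSMTextAddWord.

```python
# Hypothetical custom parser: lowercase, split into alphabetic words
# and digit runs, optionally dropping the numbers (as the default
# LSM parser does).
import re

def tokenize(text, keep_numbers=True):
    tokens = re.findall(r"[A-Za-z]+|\d+", text.lower())
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

print(tokenize("Section 2257 compliance notice"))
# → ['section', '2257', 'compliance', 'notice']
```

The same loop structure works for binary tokens; only the add call changes, from LSMTextAddWord to LSMTextAddToken.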
Now you have your text and you want to evaluate it. You do that by calling LSMResultCreate, specifying the number of categories that you're interested in. Sometimes you have a winner-takes-all evaluation, in which case you only want one category, the top one. In some cases, you want to do some kind of n-best processing, in which case you would specify a number greater than one.
For each of these categories, you can inquire what the category number is; for instance, as we saw in the example, one for Perl, two for Python, three for Ruby. That's LSMResultGetCategory. You may want to know what the score is. Conceptually, the scores of the categories add up to one, so it's like a probability; you get the score to find out what your confidence in the result is, and you do that with LSMResultGetScore. If you do a word-based mapping, where you want to get back the best matching words, then the category number doesn't tell you a whole lot; it's just the index within the overall list of words. So you would rather use LSMResultCopyWord to get back the CFString of the word that you put in. In a clustered map, you may need LSMResultCopyWordCluster, which gets you a CFArray of CFStrings. And similarly, if you have a map using binary tokens instead of strings, there are LSMResultCopyToken and LSMResultCopyTokenCluster, which return a CFData or a CFArray of CFDatas.
That's how we use the API. So how did we use it? Our first usage, as said, was junk mail filtering. In a way, that's a pretty simple application. You have the good guys here, the bad guys there, no ambiguity whatsoever. But there are some refinements needed. First of all, we don't want to have equal error rates. As aggravating as it is to have a junk mail delivered to your inbox, it's much worse to have a legitimate mail drop out of sight into your junk mailbox and not be discovered until several days later. So we err a little bit on the side of classifying a mail as legitimate. We do that by not just testing whether the top category is the junk mail category, but by also testing that the score we got for that top category is greater than a threshold, which we defined to be somewhat greater than 50%. So you may want 55% certainty.
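The asymmetric-threshold logic can be sketched like this. The 55% figure comes from the talk's own example; the category names and exact decision rule are illustrative, not Mail's actual implementation.

```python
# Sketch: only call a message junk when the junk category both wins
# and clears a confidence threshold above 50%, so ties and near-ties
# err on the side of legitimate mail.
def is_junk(scores, threshold=0.55):
    """scores: dict of category -> score; scores sum to ~1.0."""
    top = max(scores, key=scores.get)
    return top == "junk" and scores[top] >= threshold

print(is_junk({"legitimate": 0.48, "junk": 0.52}))   # → False (below threshold)
print(is_junk({"legitimate": 0.30, "junk": 0.70}))   # → True
```

Raising the threshold trades more junk in the inbox for fewer legitimate mails lost to the junk mailbox.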
The second problem is that the people who send junk mail are evil, but they are not necessarily stupid. They know that we are parsing their mail, so they make their mail difficult to parse. They do that, for instance, by inserting gratuitous punctuation, or by doing the heavy metal thing and inserting gratuitous umlauts and other diacritics. So we need to counteract that. We have a specialized parser that is used just for parsing junk mail, which we activate with a flag to LSMTextAddWords.
Our third problem is that if you look at the map generated by training up a junk mail filter and you dump it out on the console with strings or with cat, you would see all sorts of offensive words, and we don't really want that. You don't want your children to use cat on that map and see a word like mortgage. So to counteract that, we added a little bit of a hashing function. It's not military-grade cryptography by any means, but it's just enough that you would not really be able to read anything useful with a text-based tool. And that might be useful in other contexts as well. You specify this with the flag kLSMMapHashText when creating a map.
And one final and important point is that latent semantic mapping is not used in isolation here. It's used as a last line of defense. If you think about the path that a mail message takes, it first goes to your internet service provider. And your internet service provider is probably going to throw away about 85% to 95% of incoming mail if it does its job properly, because much of it is so obviously spam that you never want to even deliver it to your end user's machine. Once it is delivered to your inbox, you apply the handwritten rules that you have in Mail to sort mail early. That's usually only used to file a mail early as good mail, not to throw it away, but you could also write rules to throw mail away. Then what you do next is look up the sender of the mail in your address book, because that's another case where we err on the side of caution: if you know the person who pretends to have sent you this mail, we'd rather present you the mail and let you decide whether it's desirable. So Address Book is also used to preserve a mail. And only if all those filters don't give a conclusive response do we use latent semantic mapping to decide whether a mail is to be discarded or kept. So this is kind of the simple case. For a more complex case, I would like to bring up my colleague, Giovanni Donelli, who implemented the web content filter.
I'm Giovanni Donelli, a software engineer and a member of the team that designed the Leopard web content filter. With my presentation today, I would like to show you what it is like to use latent semantic mapping to solve quite a complex problem, such as web filtering. But before I delve into the details of how we use LSM, let me show you where you can find the web filter. As some of you may have noticed, Leopard introduces a brand new set of parental control functionalities. It allows you to impose several restrictions on the use of the computer and of its applications.
One main area of concern nowadays is, obviously, the Internet, especially for parents. And I think we all agree that the web can be considered a potentially harmful place for younger minds. So the objective of the web content filter is to validate the material coming from the Internet, making sure no kid gets exposed to explicit material.
So the filter is based on a two-category map, which has been trained to recognize explicit and legitimate material. Our training data consists of approximately 7,000 pages, which have been collected primarily manually in an effort to approximate the variety and distribution of the web. This is a purely text-based filter. We're not doing any fancy image analysis or skin recognition of any sort. We are taking the HTML page source and converting it into simple text by removing all the layout information, all the tags that would add clutter and noise to the LSM map. In this process, we're obviously trying to retain as much information as possible, such as meta tags or image descriptions.
As you've heard from Matthias, LSM currently ignores numbers, but numbers can be very relevant to our classification. You see here a really good example. 18 USC 2257 identifies a law that regulates the handling of adult material on the Internet. So obviously the presence of this string can strongly identify explicit material, and we convert numbers into tokens, as you see here in the example, so that LSM can learn which specific numbers are potentially explicit flags.
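A minimal sketch of such number tokenization. The token format NUM_<digits> is an invented placeholder; the session does not show the actual tokens used by the filter:

```python
import re

def tokenize_numbers(text):
    # Replace each standalone number with a fake word, so a map that
    # ignores raw digits can still learn from significant numbers.
    return re.sub(r"\b(\d+)\b", lambda m: "NUM_" + m.group(1), text)

out = tokenize_numbers("18 USC 2257 compliance notice")
assert out == "NUM_18 USC NUM_2257 compliance notice"
```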
The structural features of a web page can also give us hints about the type of material it displays. As you can probably imagine, explicit pages tend to have a lot of images and little text. So we are converting structural features, such as the number of images or the number of words, into tokens, into just fake words that we insert into the text. And we just let LSM learn which features are to be considered potentially explicit or legitimate.
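A sketch of how structural features could be turned into fake words. The feature names and thresholds here are assumptions for illustration, not the filter's actual values:

```python
import re

def structure_tokens(html):
    # Count images and words, then emit fake tokens describing the
    # page's shape; LSM learns which shapes correlate with which category.
    n_images = len(re.findall(r"<img\b", html, re.IGNORECASE))
    text_only = re.sub(r"<[^>]*>", " ", html)   # strip tags
    n_words = len(re.findall(r"\b\w+\b", text_only))
    tokens = []
    if n_images > 10:
        tokens.append("FEATURE_MANY_IMAGES")
    if n_words < 50:
        tokens.append("FEATURE_LITTLE_TEXT")
    return tokens

page = "<img src=a>" * 12 + "<p>short caption</p>"
assert structure_tokens(page) == ["FEATURE_MANY_IMAGES", "FEATURE_LITTLE_TEXT"]
```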
But to test how well the filter works at scale, we need to verify the accuracy of the web filter over a set of pages that were never seen in the training phase. For the purpose of collecting test data, we are using the Open Directory Project. This is an open, community-driven effort to classify the web.
Users on DMOZ have already classified thousands of legitimate and explicit pages. We're taking approximately 100,000 of these documents, and we're making the filter evaluate every single one. The filter associates a score between 0 and 1 with each page. Then we aggregate this information in this graph. On the horizontal axis, we have the page score. On the vertical axis, we have the proportion of pages with that score. The higher the score, the higher the probability for a page to be explicit. So let's see how our data is distributed over its score. These are the legitimate pages. And these are the explicit ones. The shape of these two curves tells us that the explicit pages are all very similar. In fact, they are primarily classified within a very narrow score range. Legitimate pages instead are much more diverse. They cover a wider variety of topics. And that's why their curve is much more evenly distributed over the first half of the score spectrum.
Somewhere where those two lines overlap, we can identify the equal error rate. At this point, the filter makes an equal number of mistakes in classifying explicit and legitimate material. The equal error rate is a really good metric of the filter's accuracy, and as we evolve the filter, our goal is to minimize the equal error rate. By looking at this graph, we can also understand where we should put our filter threshold and the risks that are involved in this decision.
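The equal error rate can be found by sweeping the threshold until the two error rates meet. A small sketch with made-up scores:

```python
def error_rates(legit_scores, explicit_scores, threshold):
    # Pages scoring above the threshold are blocked.
    false_block = sum(s > threshold for s in legit_scores) / len(legit_scores)
    false_pass = sum(s <= threshold for s in explicit_scores) / len(explicit_scores)
    return false_block, false_pass

def equal_error_rate(legit_scores, explicit_scores, steps=1000):
    # Sweep the threshold and return the error where the two curves meet.
    best = min(
        (abs(fb - fp), (fb + fp) / 2)
        for t in range(steps + 1)
        for fb, fp in [error_rates(legit_scores, explicit_scores, t / steps)]
    )
    return best[1]

legit = [0.1, 0.2, 0.3, 0.4, 0.6]       # invented example scores
explicit = [0.5, 0.7, 0.8, 0.9, 0.95]
eer = equal_error_rate(legit, explicit)
assert 0.0 <= eer <= 0.5
```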
So automatic testing is really great for understanding how our filter works on the Internet. And it's extremely helpful to keep track of the filter's progress over time as we change it, and to see if there are any regressions. However, it is only accurate as long as the test data we're using is a good representation of the web.
There are also some disadvantages to automatic testing. Primarily, it has quite a high initial cost to set up the whole system and gather all this test data, so it might not really be suitable in a prototyping phase. And most importantly, automatic testing is not really telling you what's going wrong with your LSM map. It's not allowing you to easily understand what you need to change in order to improve your classification system. So what we found extremely helpful, especially at the beginning of the development of the web filter, was to analyze what LSM learns in the training phase. You can do that from the Terminal using the lsm dump command, passing it the map file as input.
lsm prints out all the data gathered during the training phase, so let's analyze it more closely, starting from the last column. Here we read all the words that LSM has seen in the training documents. These make up the dictionary of the LSM map: all the words that LSM is able to recognize and assign a meaning to.
Then we have as many columns as there are categories in our map, which, in the case of the web filter, is just two. Here we read the number of times a given word was observed in the training documents provided for each specific category. So by looking at the values in these two columns, we can get an idea of what LSM thinks about specific words. For example, the word advertising was found five times more often in the training documents provided to train the legitimate category, and therefore LSM considers this word legitimate. The word "storence," on the other hand, was found almost equally in both categories. Therefore, the semantic value of this word is very low, and LSM will probably ignore it when classifying a document.
So the idea here is that we can go through this whole list and verify whether LSM has really learned to recognize which words are to be considered explicit or legitimate. The problem is that the list is pretty big: in our web filter, we have over 100,000 unique words. So this is not something that you can do manually. You want to write a script that identifies the most significant words that have been used to classify a specific category. For the sake of the example here, let's say we want to identify the most representative, most significant words used to classify this specific category. Well, we first need to select those words that have a very high word count. These are words that appear in almost every single document provided in the training phase.
Then, out of these words, we need to select only those that appear significantly more in one category than in the other. We call a word with this property an indexing word (porn, for example, is an indexing word for this specific category), because its presence in an unknown document strongly shifts the classification of that document towards that category.
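A script along those lines might select indexing words like this. The counts, thresholds, and ratio below are invented for illustration, not values from the actual web filter:

```python
def indexing_words(counts, category, min_total=50, ratio=5.0):
    # counts: word -> (count_in_category_0, count_in_category_1).
    # Keep words that are frequent overall AND heavily skewed toward
    # the requested category.
    picked = []
    for word, c in counts.items():
        own, other = c[category], c[1 - category]
        if sum(c) >= min_total and own >= ratio * max(other, 1):
            picked.append(word)
    return sorted(picked)

counts = {
    "porn":        (400, 3),    # strongly skewed: explicit
    "advertising": (20, 100),   # strongly skewed: legitimate
    "sentence":    (60, 55),    # roughly equal: no indexing value
}
assert indexing_words(counts, category=0) == ["porn"]
assert indexing_words(counts, category=1) == ["advertising"]
```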
So one of the very first things that we wanted to verify when we were developing the web filter was whether the top indexing words of the explicit category were actually related to adult material. This is what we obtained. Without doing very complex analysis, we could see very well what the problems were.
The filter was considering explicit words that are just frequently used in English or that are just part of the Internet vocabulary. It was pretty obvious that in the first prototype, the training data that we were using was very imbalanced and did not really approximate well the distribution of the content of the whole web.
But we also realized that finding a good representation of the web was quite a hard task to accomplish. So in addition to finding better training data, we decided to also exclude from the classification words that don't really have any semantic meaning associated with them. We call the words excluded from the classification "stop words."
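Stop-word filtering is just a set lookup applied before training or evaluation. A sketch with an invented, tiny stop list (the filter's actual list is not shown in the session):

```python
# Illustrative stop list: common English and Internet-vocabulary words
# that carry little semantic meaning for this classification task.
STOP_WORDS = {"the", "and", "of", "www", "com", "click", "here"}

def strip_stop_words(tokens):
    # Drop stop words (case-insensitively) before they reach the map.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "Click here for the best dolls of the season".split()
assert strip_stop_words(tokens) == ["for", "best", "dolls", "season"]
```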
This graph shows you the distribution of the training pages over their score. The blue dots are pages that we know are legitimate. The red dots are pages that we know are explicit. On the vertical axis, we have the page score, and once again, the higher the score, the higher the probability for a page to be explicit. So if our filter were perfect, all the red dots would be at the top of the graph, and all the blue dots would be at the bottom. Now, without going too much into the details here, we see that by increasing the number of stop words, the two groups of pages start to be clustered at the edges of the graph. And this means we are making a better classification, with fewer errors in classifying explicit and legitimate documents.
Another frequent problem that you're going to face while working with LSM is understanding the reason behind a document classification, or more importantly, a document misclassification. You see here a really good example: in one of the first iterations of our web filter, the Wikipedia page about Barbie was being restricted.
So to figure out the reasons behind a misclassification, you can use, once again from the Terminal, the lsm dump command. You pass it as input the map file and the text document that is the subject of your investigation. lsm prints out all the data, or part of the data, that has been used to evaluate this document. So let's analyze it once again. In the last column, we read the words that have been used to evaluate this document. These are words that are part of the document and that are also part of the LSM map. Then, once again, we have the word frequencies: the number of times a given word was observed in the training documents for each specific category. These are the exact same word counts that you would find in a simple map dump.
Then in the very first column, we read a percentage value relative to the number of times a given word is repeated in this document. So, for example, in this specific Wikipedia page, the word Barbie appears almost once every 10 words.
So let's go back to our initial question: why is this Wikipedia page being blocked by our filter? What we can see here at the top is that the words doll and dolls were found significantly more often in this specific category. Therefore, these two words are driving the classification towards this category, and the filter is restricting the Wikipedia page. To fix this problem, we simply needed to add more training documents related to legitimate toys. And that was enough to fix this specific issue.
So it is now 5:50, and you can relax knowing that your children are safely surfing the web thanks to latent semantic mapping. Thank you. Thank you, Giovanni. So you've seen how we use latent semantic mapping for our purposes. How can you make it work for you, for your applications?
You have to think about what you could categorize with latent semantic mapping. You could use it to categorize the bookmarks in your browser. You could use it to categorize RSS feeds by content. You could categorize your books or other media library by fetching reviews or abstracts and using those to put the media into categories. You could categorize your wine cellar. You could go wild, do mad science, and categorize DNA sequences. It's really up to you. But there are a few things you should know before you consider using LSM.
First of all, LSM handles tasks that are semantic in nature. If a problem is syntactic in nature, like identifying dates, times, or addresses in text, that's not an area where LSM is going to do well. A problem needs to be semantic in nature. That is, for instance, categorizing documents by topic.
Secondly, the categories you pick need to be distinct enough. If you have something like the economy page of a newspaper versus the business page, then latent semantic mapping is probably not going to be able to tell you which page an article belongs on. If it's the economy section versus the entertainment section, although there still are overlaps, latent semantic mapping is likely to be able to tell those apart.
You also need good-quality training data. It needs to be representative of the full breadth of your domain. For instance, as Giovanni said, if you don't train with toys, you might mischaracterize whether dolls are an adult or a child concept. Your data needs to be reasonably clean for the purpose of classification. That is, if you're not interested in HTML tags, you should not present the HTML tags to the training, and you should try to fix the typos in your training material. And very importantly, your training data needs to be as balanced as possible across categories. If you have vastly more material in one category than in another, that's going to skew evaluation results.
That also leads us to the other point: you need enough training data, enough to cover the variability in your material. As a rule of thumb, for a large vocabulary application, that is, general English text rather than very domain-restricted text, you need more than 30,000 unique words. The more categories you have, the more training data you need. And if you categorize something like material coming in from a news feed, where your data changes over time, you need more initial training data. Or, of course, you can also go back and retrain later, adding more material.
So how do you test the accuracy of your results? One method which is tried and proven in research and in practical applications is to partition your training data into a number of chunks. You use all but one of them to train up your map, and the last one, which you held out, you use for testing. It's important that you don't use testing data that is already present in the training of the map, because that's completely going to invalidate your results. Then you run your test, rotate things around by holding out another part of your input data, and average out the results.
So what if your results are not as good as you would like them to be? First thing you could try is what Giovanni also did, is to use a list of stop words. You saw in his diagrams that this was a simple and very effective means of improving results.
That is, holding out words that appear roughly equally in all categories. Secondly, you can try experimenting with the number of dimensions. By default, our number of dimensions is the number of categories. But if you have a problem with, say, 1,000 categories, there is no point in doing your math with 1,000 dimensions. For a natural language problem, you should restrict things to about 100 to 300 dimensions. And if that still doesn't work, you can also try adding some frequently used word pairs or word triplets with a custom parser. If you have something like 18 USC 2257, that might be a useful triplet to add to your map.
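A custom parser that adds word pairs and triplets could look like this. Joining tokens with underscores is an invented convention for the sketch; the LSM framework also offers built-in pair and triplet options:

```python
def with_ngrams(tokens, max_n=3):
    # Emit the original tokens plus adjacent pairs and triplets joined
    # with '_', so the map treats multiword terms as single tokens.
    out = list(tokens)
    for n in (2, 3):
        if n <= max_n:
            out += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

tokens = ["18", "USC", "2257"]
grams = with_ngrams(tokens)
assert "18_USC_2257" in grams
assert "18_USC" in grams and "USC_2257" in grams
```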
And as we have seen, for instance, in junk mail filtering, it's often useful to combine latent semantic mapping with other sources of information and knowledge, because latent semantic mapping does well where other techniques don't, and vice versa. Other techniques complement latent semantic mapping. We've used that in the junk mail filter, where we have handwritten rules and server-side rules. We use it in Kana-to-Kanji conversion, where we apply a whole bunch of heuristics and use latent semantic mapping as one method among many.
That's all there is to know. Now go forth, map some text, map some other data, and tell us about your results. For more information, you can go to the WWDC website. You can also contact our evangelist, Matt Brands. We have a public mailing list that you're welcome to subscribe to. It's fairly low traffic, so it's not going to burden your inbox with more spam in the interest of spam reduction. And you can also go look up our documentation.