Application Technologies • 1:00:29
Latent Semantic Analysis is the text-analysis technology that underpins the junk mail filtering of Mac OS X Mail and the Mac OS X Kanji text input method. After years of refinement and pent-up demand from developers, this advanced technology is now available for your own applications. Latent Semantic Analysis enables you to analyze documents, classify them into categories, and find which words best characterize the meaning of those documents and categories. Orient yourself to the overall process of training, creating, using, and adapting this technology. Gain an understanding of the major concepts behind the API, see the types of problems it can address and the cool features it might allow you to begin to offer in your application, and learn the development best practices to achieve the most robust performance.
Speakers: Kim Silverman, Matthias Neeracher, Jerome Bellegarda, Yasuo Kida
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
You've heard that any sufficiently advanced technology is indistinguishable from magic. Well, there's been some technology lurking under the covers in Mac OS X for a while that some people have said seems pretty close to magic. It's called Latent Semantic Mapping. And today, we're going to release that magic so that you guys can build that magic into your applications.
We've applied Latent Semantic Mapping to quite a range of problems. We'll show you junk mail filtering. We'll show you how it enabled us to achieve the world's best Japanese text input method. Thank you, Kida-san. So let's get into it. I'm delighted to be here today, and I'm delighted to be able to introduce the senior engineer who is the architect of the code base, Dr. Matthias Neeracher.
Thank you, Kim. Good afternoon, everybody. I'm really thrilled to be here today and talk to you about Latent Semantic Mapping, or as we affectionately call it, LSM. We're unfortunately not quite important enough to be renamed to Core Semantics or something like that. So this afternoon I would like to talk to you about what LSM is, and I'm actually going to start with a toy example just to give you an idea of what it can do. Then we're going to bring up somebody who can explain to you how it actually works. I'm going to talk about how to use the LSM API or APIs that we have. We're going to present some internal case studies of how we've used it.
Then we're going to talk about what's really interesting for you, what it can do for you, how you can use it in your applications. And we're going to have extended time for questions and answers at the end. So if you have questions, you're going to get a chance to ask them at the end.
So what is Latent Semantic Mapping? In the most general and abstract terms, Latent Semantic Mapping is a technology to analyze data and classify it into categories. Now that's awfully abstract and general. So in fact, it's probably going to be easier if I just give you a brief example of this. So could I have the demo machine on screen? Thank you very much.
So one of the things we've done with the Latent Semantic Mapping framework is we've tried to make pretty much the whole API available as a shell command line tool as well. So you can prototype your applications very easily, especially if they are text-based. And you can quickly find out whether an idea is going to work out or not. And in fact, for certain parts of your application, if you do pre-training, you might never have to use the C API at all; you could just do it in a shell script.
So I'm going to demonstrate how to use the Latent Semantic Mapping command line tool to do a simple categorization task: telling apart Objective-C headers from C headers. If you see a .h file, is it Objective-C or is it C? So we're going to train up a little map to tell those apart.
[Transcript missing]
So the first thing we need to do is to create a map. The command line tool is just called lsm. We want to create a map, and we have two categories: Objective-C versus C. Each of those has a number of files; we separate the two groups of files with a group delimiter, a semicolon. And we're going to call the map Objective-C versus C. So let's pick a rich, juicy source of Objective-C headers to train on. Let's pick the Foundation headers. So we have the semicolon. Now we're going to bring in a source of C headers. Let's pick Audio Toolbox.
Now, it's going to parse a couple dozen headers for each side. It's going to put these into a map. It's going to compile the map, which is a complex mathematical process. So it's going to take some time, let's say a couple hundred milliseconds. Okay, we're done. We have a map that, in its 300 kilobytes, incorporates everything there is to know about telling apart an Objective-C header from a C header.
Now I'm going to prove that it works. For that, we are going to do LSM Evaluate. And now let's pick a framework where you might not know in advance whether it's Objective-C or C. So for that, let's pick Core Data. Yes? Sure. Thanks, Kim. That was a little bit too big.
Now let's evaluate all of these headers in Core Data and see whether Latent Semantic Mapping places them in category 1, Objective-C, or in category 2, C. Oops. And it would help to type this correctly. OK, so we have CoreData.h. As you see, category one wins quite decisively.
We're going to talk a little bit later about what these scores mean. What's important to know is category one wins. Now, CoreDataDefines.h: Latent Semantic Mapping unfortunately thinks that this is a plain C file. Category two wins. However, this is not unfortunate. This is, in fact, a plain C file. It just has a couple of defines in it.
And all the other headers you see, category one, category one, all of these are Objective-C. Now, you might be saying, I could have done that with a grep. Or I could have told that by looking at the file name and seeing that it starts with NS. And you would be absolutely right if you said that.
The point here is not that this is a particularly hard task. The point here is that we were able to train this up without any domain-specific knowledge. We just fed the Latent Semantic Mapping system a bunch of files from each category, and it was able to pick out which file belonged where.
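For reference, the two commands from the demo look roughly like this; the framework paths are illustrative and the exact option syntax may differ slightly (run lsm with no arguments to see its usage):

```sh
# Train a two-category map: Objective-C headers, a ';' separator, then C headers
lsm create ObjCvsC.map \
    /System/Library/Frameworks/Foundation.framework/Headers/*.h \; \
    /System/Library/Frameworks/AudioToolbox.framework/Headers/*.h

# Evaluate: which category does each Core Data header land in?
lsm evaluate ObjCvsC.map \
    /System/Library/Frameworks/CoreData.framework/Headers/*.h
```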
And as my colleague is going to explain later, this was not even a particularly well-chosen example. Could we go back to the slides, please? Thank you. So you've seen what it does. Now, that was the black box for you. We're going to pry that box open. And for that, I would like to bring up my colleague, Jerome Bellegarda, the Apple Distinguished Scientist who brought this technology to Apple.
Jerome? Thank you, Matthias. Good afternoon. Delighted to be here. Actually, my job is very simple, because how it works, it's all in the name. Latent semantic mapping. So first of all, it's a mapping. And what we're going to do is we're going to go from discrete words and documents into a mathematical space which is multidimensional in general. In the examples that I'm going to go through in this session, that's all going to be low dimensional, typically three dimensional so we can see.
But in general, it's going to be multidimensional. And this space is going to be-- the particularity of this space is that it's going to be continuous, which makes it easy to compare things in this space. LSM also has the word semantic in it, which means that this mapping is going to be based on the overall content and meaning buried inside the document. Sometimes not even in an obvious way, which is the part that is latent.
And the way we're going to uncover those relationships, which are sometimes hidden, is we're going to look at words that co-occur with each other: exactly what co-occurs with what. So for example, if you look at a document and you see the words, maybe, bank and interest, it doesn't tell you much about what the document is all about. If you further see rate and money, then you know that it probably has something to do with your finances.
If on the other hand, you see bank and interest, and then later on river and trout, then it's probably some sort of article about the outdoors and fishing in particular. And so that's really the basis of latent semantic mapping. We're going to try to uncover those hidden relationships between the words. So in the next few slides, I'm going to go through each of those three aspects of LSM, mapping, semantic, and latent. Let's start with the mapping.
So as I mentioned at the beginning, we're going to have some documents. Maybe we have the finance documents and some fishing documents, like I was talking just the slide before. We may have some other things, like computer science documents, maybe some hunting documents, and so on. And what we want to do is we want to map those documents onto a space, multidimensional space. In that case, it's going to be just three-dimensional.
And so imagine in that space we have corners, so to speak, regions in that space which correspond to what we're interested in. So there will be a finance corner and maybe a computer science corner. So what we want to do is take each of those documents and map them into the appropriate corner of the space. And that's a process called training. That's what happens when you create a map, just like Matthias did a few minutes ago. And so we're trying to do that. We're training.
So we are mapping all those documents onto points in the space. And hopefully, once we are done with that, we have all those little triangles here, the red triangles, which are representations of each of the documents that we started out with. Now, the problem is now we have a new document.
And we haven't seen it before. And what we want to do is we want to classify the document against the LSM space that we've just created. And that's what the lsm evaluate command that Matthias just presented to you does. So in this case, we are just mapping that document onto the space. And in this particular case, as you can see, it's aligned with the computer science cluster.
So we can reasonably conclude that this document has to do with computer science. That's the mapping part. Now, let's take a look at the semantic. And for this, I would like to just bring out a case study. And that's the junk mail filter, the one that you have on your machines right now.
And so when we started working on a junk mail filter, we decided to do something a little bit different from what was the state of the art at the time. And at the time, the state of the art was, well, you look for keywords in your message. So if you see the keyword "sex," for example, then you might possibly see some sort of junk message or whatever.
But in fact, we wanted to go beyond that and try to assess whether the whole topic, not just a few keywords, but the whole topic of the message was consistent with the user interest. Because after all, you could have sex education, which might be a legitimate topic. So to do that, we used an LSM strategy, which is to encapsulate what the user likes and what he dislikes inside of the LSM space that you saw before. And so in that particular case, we used two categories, one for junk and one for legitimate.
And the implementation in this particular case, of course, the space is very simple. It has just two dimensions. And so we defined within that space two points. We call those anchors semantic anchors: one that represents the junkness of the training collection, and one that represents the legitimacy of the collection. And then what we have to do now is, every time we have a message coming in, we just evaluate whether it's closer to one or the other.
So let's take a look at what happens in this simple case. So we have this space, which is two dimensional. And we have three kinds of entities in that space. We have the legitimate semantic anchor, which might be somewhere in here. And you can think of it as the centroid of all the messages that we've seen in the training collection that were marked as legitimate. Then we have the junk anchor, which might fall somewhere here. And again, you can think of it as the centroid of all those messages that were junk.
And then we have the incoming message, which falls somewhere in that space. And so now what LSM does is very simple: it looks at distances. It looks at which anchor the message is closest to. In this particular case, you can see that it's closer to the junk anchor. And so LSM would assign a higher score to the junk category. And that's what you saw in the simple example that Matthias went through.
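For the mathematically inclined: one standard way to measure this kind of closeness in latent semantic techniques is the cosine of the angle between the message vector d and an anchor vector a in the latent space (LSM's exact scoring and scaling are internal to the framework), roughly:

```latex
\operatorname{sim}(\mathbf{d}, \mathbf{a}) \;=\;
\frac{\mathbf{d}\cdot\mathbf{a}}{\lVert\mathbf{d}\rVert\,\lVert\mathbf{a}\rVert}
```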
So that's the semantic part. Now let's take a look. Oh, before we go on, I'd like to just mention, a little bit of bragging here, the performance of our junk mail filter. We did quite a number of experiments over the years to validate that this LSM approach, which again was a little bit different from the state of the art, is actually competitive with the state of the art.
And as you can see, we got on different databases, including some very standard databases that are used in the literature. We got performance that is equal or better than the competitive methods out there, which are typically based on what you may have seen referred to as Bayesian methods. And an advantage of the junk mail filter, in addition, is that it's adaptive.
It's very easy to reconfigure the LSM space in order to track the user's interest as they evolve. They may be interested in another topic, or they may lose interest in a topic, and so on and so forth. And we are able to model all of this. And of course, as you know, junk mail filter has been integrated into Mac OS X since Jaguar.
So now let me say a word about latent. And for this, I thought I would go through an actual example, the kind of example that Matthias went through, but perhaps a little bit richer, where we have four categories. So think of it as a four-feature calendar manager. It's a calendar where you can only do four things. It's elementary, but I think it will make the point nicely.
So in that particular calendar system, you can issue four commands, or queries, if you will. You can ask what time it is at the moment. You can ask what day it is. You can ask what time is a particular meeting, and you can cancel that meeting. And that's all you can say.
And so we're going to use an LSM strategy to try to approach that problem. And so we're going to use one category, one category per feature, which means in that space, we're going to have four semantic anchors. And again, because it's a very simple problem, it's going to be a two-dimensional LSM space.
But instead of having two semantic anchors, like in the junk mail filter, now we have four. And then again, we're going to evaluate whatever the user decides to issue as a query or a command. We're going to evaluate that input against the four anchors, and the closest anchor will tell us what we need to do. That's how LSM works.
So why did I choose this particular example, which is not very realistic? Well, as I mentioned, it's got four categories. Now let's assume the user wants to know what time the meeting is, but he messes up. You know, users always mess up. He messes up, and instead of inputting "what time is the meeting" as the application demands, he says something like "when is the meeting." So we have to classify that input against the four categories. So let me point out why this is a difficult task.
Well, first of all, different words in this example have very different predictive power. For example, "the" appears everywhere, which means it has zero predictive power. All you know is "the." You cannot tell which command it is or which query it is. However, "day" and "cancel" each predict a particular query or command well. And so those two words have high predictive power.
There are all manner of co-occurrences and overlaps. "What is the," for example, occurs in three of the four categories. And then, of course, we have ambiguities, because "meeting," for example, occurs in two of the categories. And then, of course, the big one is that "when" has never been seen. In fact, it's not even part of the underlying vocabulary.
And so we don't know what the system will do with that. Well, let's take a look. So in this particular example, again, it's so simple that we can see it in two dimensions. We have the first LSM dimension and the second LSM dimension. And one thing that I perhaps didn't make clear is that you can map not only the categories into that space.
You can also map the words. So let's take a look at how the words map in that space. Those are the words. And I mentioned "the." As you remember, "the" occurred in all the categories, which means it has zero predictive power. That's why it sits right there at the origin.
I also mentioned "day" and "cancel." I mentioned that those words had high predictive power. And in fact, that's why they have a high coordinate along one dimension and a low coordinate along the other; they align along the principal components, if you will. And then the rest of the words fall somewhere in the middle.
And I'd like to point out that "what" and "is" fall right on top of each other in that space, because, of course, they always co-occur. And here are the queries and commands, the four categories, if you will, what we would refer to as the semantic anchors in that case. So not surprisingly, "what is the day" falls very close to "day," because that's the command that is predicted by it the best.
And similarly with "cancel the meeting." The other ones fall somewhere in the middle; you'll see that they fall in the vicinity of their constituent words. Now the question is, what about this new wording, this variant that the user introduced? What can LSM do to classify it? Well, first of all, it maps it into that space, and that's what it's going to do.
It's going to map it, and this is where it falls. Again, that's an actual example. And so you can see now that when we look at the distances to the various semantic anchors, we see that the query that would be predicted on the basis of this mapping would be "what time is the meeting." That's where the latent part of latent semantic analysis comes in. Because in a sense, that means that in a very data-driven way, what the system has figured out is that "when" is a synonym for "what time." So that's pretty close to magic. But as-- Thank you.
But as Kim mentioned at the beginning, it's not all magic. There are some very sound mathematical principles behind all of this. And I'm going to try to give you a flavor of that. The basic information that we rely on when we do this mapping is we are trying to see how often each word appears in each category and contrast this with how often each word appears in the entire training data.
That's kind of the bread and butter of LSM. And we arrange that information. So we go through each of the words in the training data, and we try to keep track of this information. And then we arrange it in a matrix, which looks something like this. So here we have M, which is the size of the underlying vocabulary. That's how many unique words you have in your training collection. And then you arrange them in rows like this. Then N would be the number of categories you have. So in the previous example, I had four.
And so you arrange now your categories in columns like this. And so each category now is sort of a vector of numbers, each number representing roughly the number of times that this particular word occurred in this particular category, with some appropriate weighting. So this gives us a representation of each word and each category. Each word is sort of a row of numbers, and each category is a column of numbers.
However, you'll notice they don't live in the same space, because they don't even have the same dimensions. And furthermore, many of those numbers are zero, because, after all, most words don't appear in most documents, in most categories. And so we're going to go through a particular trick here from linear algebra, which is called singular value decomposition: we're going to take that matrix-- let's call it W-- and then we're going to decompose it into different sub-matrices.
And the important thing to realize is that each word, which was at the beginning represented by a long, sparse vector here, is now going to be associated with a vector of dense values. And similarly, each category here is now associated with a vector of dense values here.
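Written out in the usual LSA/LSM notation (a sketch, with R denoting the number of retained dimensions), the decomposition is:

```latex
W \;\approx\; U\,S\,V^{T},
\qquad
U \in \mathbb{R}^{M\times R},\quad
S \in \mathbb{R}^{R\times R}\ \text{(diagonal)},\quad
V \in \mathbb{R}^{N\times R}
```

In this notation, the rows of U give the dense word vectors and the rows of V the dense category vectors that Jerome describes here.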
That's all that matters. And the important thing is now r, which is the number of dimensions that are retained by this procedure, is now much smaller than either m, the size of a vocabulary, or n, the number of categories. That's really the power of this mapping, if you will. And so when we do this-- the LSM space that we had before-- remember, we had a bunch of documents, and we were mapping it into categories and so on.
And then we had a new document. We were classifying it and so on. Well, it turns out we can also extract all the words appearing in the document. And as part of this decomposition, we also have values for the words, which are embedded in that space. So now we can ask ourselves several questions.
Given this representation for the new document here, we can ask, well, what is the topic of this new document? And the answer would be the closest category, just like we saw for junk mail. But we could also ask ourselves, what is the single word that best describes that document? And the answer would be the closest word in the LSM space. That's it in a nutshell. Thank you. That's what LSM is all about-- computing distances in a mathematical space.
Now, of course, that sounds like pretty neat technology, and so on, so forth. However, there are some caveats. This is not going to solve all the problems of the world. For example, LSM has a somewhat limited descriptive power, in the sense that, for example, the semantic part of it is heavily tied to, well, co-occurrences between words.
So that's what we call shallow, in an artificial intelligence kind of way, as opposed to deep parsing and things of that nature. The second limiting factor, or characteristic, of LSM, which turns out could be a strength also in some ways, is that word order is ignored, meaning that it doesn't matter whether two words co-occur near each other, or one is at the beginning and the other one is at the end of a document. It doesn't matter as long as they co-occur. In fact, they could appear in reversed order, and so on, so forth. That wouldn't matter.
And of course, this decomposition that I've talked about, the singular value decomposition, that's a linear algebra operation. So it assumes that everything is nice and linear. And of course, the world is not linear. So you could also bump into some limitations there. Another thing to note is that LSM is a statistical method. And as such, it works only as well as the training data on which it is trained.
And so, for example, I mentioned the bank and interest and river at the beginning of the talk. Well, imagine now that we have a single document in which river bank appears and Bank of America appears in the same document. That would be very difficult to infer from that document which bank we are talking about here, because the two senses would co-occur. So again, it's very important to have good training data so that the categories are well separated.
And it turns out that this method is also somewhat sensitive to writing style. You can certainly appreciate that the style in which you write an email would be very different from the style in which Mac OS Help is written, for example. And this is also-- this influences the information that LSM finds.
And finally, I just want to mention that in some applications, there is some training cost in doing this singular value decomposition, especially if the underlying matrix is large. So that's also something to keep in mind. Of course, in an application like junk mail, that wasn't an issue because the dimension was so small.
So in the rest of my contribution here, I just would like to go through a couple of guidelines to let you know the do's and don'ts of being successful with LSM. The first, and perhaps most important, is that you have this cool technology that you've learned about at WWDC. You want to go back and try it on the application you have. Well, is it really the right tool for the job? That's the first question.
Can LSM handle the task? If the problem you're trying to solve is syntactic in nature, such as finding dates or times or email addresses, or whether it's Objective-C versus C, right, Matthias? Then that's probably not the right approach, because you can use grep. So you have to make sure that the problem is semantic in nature. So if your problem is, I have a bunch of documents. I'd like to sort them by topic. Now that's semantic. So that's a problem for which LSM would be well indicated.
So that's the first guideline. Ask yourself, is the task really suitable for LSM? Second is-- and I've already touched upon that-- are the categories distinct enough? After all, you're trying to classify, which means you're trying to have the categories as different as possible in order to do a good job. If they overlap, you can't do a good job. And so, for example, if you had a bunch of documents talking about economy and a bunch of documents talking about business, and you would like to separate them, yeah, probably can't do a very good job of it.
However, if you have a bunch of documents talking about economy versus a bunch of documents talking about entertainment, well, sure enough, you'll have the odd document that sort of overlaps. But by and large, you'll do a much better job of classifying. So here, that's the second guideline. Now, I'd like to say a few words about training.
First, in terms of the quality of the training data: as I've already mentioned a little bit, it has to be representative of the variety that you can find in the domain. So if you forget something, like in the example I had at the beginning, if you don't have a category for computer science, well, obviously you're not going to be able to classify documents on computer science.
The data has to be clean for the purpose of classification, meaning that you have to remove things that don't matter, for example, HTML tags. If you happen to have them, they are not germane to the classification, so why keep them? Fixing typos is sometimes important, sometimes not, depending. In the junk mail filter, we kept the typos, of course.
And also, the training data has to be balanced across the various categories. That's important because, remember, you're trying to compare the number of times that words appear in each category versus the number of times they appear in the entire training data. So the more balanced the training data is across all the categories, the better, the more unbiased you're going to be in the classification.
So that was the quality. It's also important to talk about quantity of training data. Of course, it has to be large enough to cover all the variability that you expect in your data. And for a junk mail filter localization, for example, our rule of thumb was that we wanted to have about 30,000 unique words in the underlying vocabulary across the entire training collection.
That's the M, the size of the vocabulary, the M value that I was talking about before. Now, of course, this value has to be larger as more categories are added, and larger still if the data changes over time, because again, like news, you have new information coming in all the time.
You want to be able to take advantage of it. Now in terms of testing, it's also important. Again, this is a statistical method, so you have to test the performance of the technique. And the way to do that is to use validation data. So the way we like to do things at Apple is we take the training data, we partition it into chunks, typically 10, and then we train on the first nine and test on the last and see what happens.
The last chunk is called the held-out data. It's important to have held-out data because if you test on the training data, you cannot predict the performance you're going to get on something you haven't seen before. And in fact, we even go further than that. We also repeat sequentially on all those chunks. It's called round-robin. So now you train on the last nine and you test on the first, and so on, and you permute. And then we average all the results, and that gives us a better predictor of the performance.
Now, what if you do all those things and the outcome still looks somewhat weird? Well, first of all, check the previous guidelines: was the task appropriate for LSM, and so on, so forth. Another thing is to try with a short stop word list. There are words like "the" or "in" or "of" that are likely to appear pretty much in the same way across all the categories. They tend not to be very informative, and furthermore, they tend to muddy the results, so sometimes we get better results by removing them. Also try experimenting with the number of dimensions that are retained. The default that we set in the framework is that the number R I was talking about before defaults to the number of categories. So in the junk mail filter, for example, that would be two. But for natural language problems, like the ones that Kida-san is going to talk about later, typically we want R to be between 100 and 300.
Finally, remember that LSM doesn't keep track of word order, so sometimes it's valuable to keep track of the most frequent word pairs. For example, you can appreciate that "blowtorch" may occur in a context which is quite different from just "blow" and just "torch." I had actually a better example with "blow," but it was forbidden to use it here. Same thing, if it's important to use "New York City" as a single entity, then by all means add it to the data. But make sure that you do this sparingly, because obviously this will impact the dimension of your matrix, and therefore it could impact the computational cost.
And that's pretty much all I wanted to say. One more thing. And that is that it's often to your advantage to integrate LSM with other sources of knowledge. Because of the semantic approach that LSM takes, it tends to complement other techniques that are not based on semantics. And so it can often improve the robustness of the overall system if you combine LSM with other things. And I'd like to offer two examples.
The junk mail filter complements, instead of replacing, things that are normally done in spam filtering, like whitelists, blacklists, and rules. The junk mail filter comes after all of those things, not instead of them. It's just an additional source of knowledge. And similarly, in the Kana-Kanji conversion, which we're going to talk to you about in a little bit, LSM is just one of several sources of information that are being exploited.
And so that concludes what I wanted to say about how LSM works, and I'd like to turn it over to Matthias once again, who's going to talk to us about how to use the API. Thank you. Thank you very much, Jerome. Now you've heard how it works. Now we are going to hear how to make it work for you.
Now, the first point I would like to reiterate, it's always cheaper not to use the API than to use it. It's often a lot easier to use the command line tool to prototype instead of using the API. However, at some point, if it works for you, you're going to want to use the actual API. Our API is core foundation-based, so our main data types are core foundation types. You can retain them, you can release them, you can put them into containers.
Our three types in the API are the LSM map, the LSM text, and the LSM result. As you'll see, Latent Semantic Mapping really revolves around the lifecycle of an LSM map: you create the map, you train it by adding data to it, you compile it, then you evaluate it with data; you can archive it to disk and load it back in.
Whenever you're dealing with data, whether it's for training or for evaluation, you're using an LSM text to represent the document that you're going to use for training or for evaluation. And finally, on evaluation, you're getting back an LSM result from the evaluation. So let's start at the beginning of the lifecycle.
You create an LSM map, and here we finally come to the all-important blue boxes on the slides. You create an LSM map by calling LSMMapCreate, and like any creation API in the Core Foundation space, you can pass in any allocator, but for most purposes it's good to pass in the default allocator, also affectionately known by its friends as null.
You can optionally pass in a number of flags. Here, as an example, we've passed in the flag kLSMMapPairs, which will allow you to map a pair of words like "blowtorch" as a single unit and keep track of that. In most cases, you don't actually need these flags, so it's okay to pass in zero.
Now you've created an empty map. Your next job is to fill it up. You add categories by calling LSMMapAddCategory. You're going to get back an integer which identifies the category. You then add data to the map by creating an LSM text, by adding a number of words to the text and having them parsed.
And then by adding the text to the map in one particular category. Now let's look at that in a little bit more detail. The text you add is passed in as a CFString. You also can pass in a locale identifier, but we would not necessarily advise you to do so. The default is going to work out quite well.
And furthermore, like most of the LSM calls, you can pass in a number of flags. Here we're passing in a flag, kLSMTextPreserveCase, which means instead of just mapping everything to lowercase as we normally would, in this case we would like to preserve the case. We're going to talk about some other flags later.
Once you're done adding words, you add the text to the map. Here we say, "This text is one of the good guys. This belongs in the good category." So we add it to the good category. Once we've done that, we have no more need for the text, we can release it.
Now, this is the case where LSM parses the text for you. There might be more subtle situations where you would like to do the parsing yourself, in which case, instead of calling LSMTextAddWords, you call LSMTextAddWord, which doesn't do any case mapping, doesn't do any transformation whatsoever. It just takes the CFString you pass into it as the gospel truth and adds it to the text.
You don't even have to use words. You can also, if necessary, use binary tokens, because deep down inside, LSM doesn't care what you give it as identifiers. All that matters is an identity comparison: does it compare equal to another token or not? For that, you call LSMTextAddToken.
Once you're done training, you compile the map. This is LSMMapCompile. In the example I've shown you, this takes a couple hundred milliseconds, or maybe a couple of milliseconds. In Mail, after each message gets evaluated, when you click the Junk button, the map gets recompiled. I very much doubt that any of you have ever sat in front of your computer and twiddled your thumbs while the map was recompiled. However, I should point out that for natural language problems, where there are large and sparse maps, that can actually take an appreciable amount of time.
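Put together, the training sequence described so far looks roughly like this (a minimal sketch; error handling omitted, and the training string is just a placeholder):

```c
#include <CoreFoundation/CoreFoundation.h>
#include <LatentSemanticMapping/LatentSemanticMapping.h>
// Link against CoreFoundation.framework and LatentSemanticMapping.framework.

int main(void) {
    LSMMapRef map = LSMMapCreate(kCFAllocatorDefault, 0);   // no special flags
    LSMCategory good = LSMMapAddCategory(map);              // category 1
    LSMCategory junk = LSMMapAddCategory(map);              // category 2 (trained the same way)

    LSMTextRef text = LSMTextCreate(NULL, map);
    LSMTextAddWords(text, CFSTR("a sample training document"),
                    NULL,                    // default locale
                    kLSMTextPreserveCase);   // or 0 for the default case mapping
    LSMMapAddText(map, text, good);          // this text is one of the good guys
    CFRelease(text);

    // ... add more texts to both categories, then:
    LSMMapCompile(map);                      // now the map is ready for evaluation
    CFRelease(map);
    return 0;
}
```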
Now you have a compiled map, and now you're in a position to evaluate. For evaluating, the first thing you do is create a text that looks exactly the same way as for training. You create the text, you add words to it, you add single words to it, you add tokens to it. And finally, the all-important evaluation result you get by calling LSMResultCreate.
You pass to it the map, you pass the text you want to evaluate, and you pass the maximum number of categories you would like in the results. If you have a winner-takes-all mentality, then passing in one would be the right thing to do. If you're always second-guessing yourself, you pass in two.
And if you want to know the total rankings, you're maybe passing in a thousand if you have a thousand categories. Again, there is a final argument for flags, which in most cases are not going to be used, but we are going to present one situation where they are going to be used.
At that point you no longer have a need for the text, but you're going to have to extract what the result actually says. So you call routines like LSMResultGetCategory, which takes the result and an index into the result. So LSMResultGetCategory with index 0 gives you back the index of the best category of that evaluation.
You can also get the floating-point score that we've seen in this example, with LSMResultGetScore, again with an index of 0 to get the best of those scores. Pass non-zero indexes to get the second best and so on. And finally, when you get tired of that, you throw away the result.
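Continuing the training sketch above, evaluation looks roughly like this (again a sketch; it assumes the compiled map from before, plus <stdio.h> for the printout):

```c
LSMTextRef input = LSMTextCreate(NULL, map);
LSMTextAddWords(input, CFSTR("some text to classify"), NULL, 0);

LSMResultRef result = LSMResultCreate(NULL, map, input, 1, 0);  // 1 = winner takes all
CFRelease(input);

LSMCategory winner = LSMResultGetCategory(result, 0);  // index 0 = best category
float       score  = LSMResultGetScore(result, 0);     // its floating-point score
printf("category %u wins with score %f\n", (unsigned)winner, score);
CFRelease(result);
```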
If you feel after evaluating that you want to add more data, for instance, in the junk mail example, you're all trained up and yet you see another spam or you see a false positive, you click the Junk or Not Junk button. We call LSMMapStartTraining to put the map back into training state so you can add more training data to it. And once you're done, you call compile again. Similarly, in many situations, you don't want to compute the map on the fly. Once it's computed, you want to store it.
So you write it out with LSMMapWriteToURL and load it back in with LSMMapCreateFromURL. In some situations, you know that the only reason you're loading the map is because you immediately want to add more data, in which case you can pass in the flag kLSMMapLoadMutable. Normally, when you load from a URL, you're getting back the map in evaluation state, where you can evaluate. If you say kLSMMapLoadMutable, you're getting it back in training state, which in some situations may save you time and memory.
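In code, saving and reloading might look roughly like this (the file path is purely illustrative):

```c
CFURLRef url = CFURLCreateWithFileSystemPath(NULL, CFSTR("/tmp/MyMap.lsm"),
                                             kCFURLPOSIXPathStyle, false);

LSMMapWriteToURL(map, url, 0);                  // archive the compiled map

LSMMapRef reloaded = LSMMapCreateFromURL(NULL, url, kLSMMapLoadMutable);
// kLSMMapLoadMutable hands it back in training state; add more texts, then:
LSMMapCompile(reloaded);

CFRelease(reloaded);
CFRelease(url);
```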
So that's all there is to the API. There are a number more calls, there are properties you can set, but all of that is pretty easily discoverable from the documentation you have received and should have with your seed of Xcode. So how have we used Latent Semantic Mapping internally? Our first application was using Latent Semantic Mapping for junk mail filtering, which was an application designed by my colleagues Devang Naik and Jerome Bellegarda. So here, in a way, it's a very simple application. You just have two categories: legitimate mail versus junk mail.
There are a couple of subtleties here. First of all, getting a false positive, having a mail that is terribly important to you classified as junk mail, is a worse situation than having the occasional junk mail slip through. Or at least for many people it is. So therefore we're not strictly going by the top result being junk. We're also going to test that the top result has a confidence score that is higher than just the 50% it would need to win. We want a little bit of extra confidence.
Second, as Jerome said, we're going to use multiple sources of knowledge. And one of these sources is if the sender of a mail appears in your address book, then we're always going to err on the side of caution and think it's a legitimate mail. Which, as we all know, is not always true. But we certainly don't want that.
We don't want to toss a mail and afterwards it turns out that you have actually won a prize, or you were supposed to come to a meeting, you have finally secured your mortgage. So we're going to use the address book as an indicator whether a mail is legitimate.
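A hedged sketch of that decision logic might look like this; the 0.75 threshold and the senderIsInAddressBook() helper are purely illustrative and not Mail's actual implementation:

```c
Boolean senderIsInAddressBook(CFStringRef sender);   // hypothetical helper

Boolean looksLikeJunk(LSMMapRef map, LSMTextRef message,
                      CFStringRef sender, LSMCategory junkCategory) {
    if (senderIsInAddressBook(sender))
        return false;                                 // err on the side of caution

    LSMResultRef result = LSMResultCreate(NULL, map, message, 1, 0);
    Boolean junk = (LSMResultGetCategory(result, 0) == junkCategory)
                && (LSMResultGetScore(result, 0) > 0.75f);  // demand extra confidence
    CFRelease(result);
    return junk;
}
```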
Now, junk mail filtering is also especially tricky in one regard. In many semantic applications, the semantics are just waiting to be discovered. This is not the case in junk mail filtering. The junk mail senders are going to great lengths to disguise the fact that they are sending unwanted mail. They're not going to proclaim, this mail here is unwanted. So, one of the things they do is they insert punctuation into their words.
You might have seen this, a dot after each character. Or they use accents; they use what we call in the trade the heavy metal band syndrome. You insert a couple of extra umlauts and accents in the hope that your users are still going to be able to read the text, but that the junk mail filter is going to be thrown off.
To counteract that, one of the flags we have for LSMTextAddWords is kLSMTextApplySpamHeuristics. It goes through the text with an eye to treating the text as hostile: it's going to guess that if it's seeing periods between characters, it's a longer word that is just broken up this way.
Or if it sees weird characters that kind of look like letters, like the copyright sign, but are not usually treated as letters, it treats them as letters anyway. The nice thing is, this makes the mail even more trainable, because these weird spellings are going to stand out like a sore thumb in training.
Another problem we're dealing with here is that the map potentially contains a lot of offensive words, and we don't want a third grader armed with cat or less to look at the map and discover all these words that aren't supposed to be in his vocabulary yet. So we have a simple flag, kLSMMapHashText, which is not industrial-grade cryptography, but it does the job of securing the map from the most casual prying eyes.
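In code, those two flags are used roughly like this (messageBody stands in for the CFString holding the message text):

```c
LSMMapRef junkMap = LSMMapCreate(NULL, kLSMMapHashText);   // obscure the stored words

LSMTextRef message = LSMTextCreate(NULL, junkMap);
LSMTextAddWords(message, messageBody, NULL,
                kLSMTextApplySpamHeuristics);              // treat the text as hostile
```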
Finally, you're not really restricted to using just the text of the mail. You could use meta information: if you have a tiny one-by-one picture pointed to by a URL, what is referred to as a web bug, I think, to track whether the message gets read, that's a piece of information you can just add to the map as a pseudo-word, "tiny picture," because that's something that is very trainable. Overall, this was still a relatively simple application of Latent Semantic Mapping. For a much more involved application, I would like to bring up my colleague Kida-san, the manager of OS Engineering Tokyo, to talk about Japanese input.
Kida-san? Good morning, no, good afternoon. My name is Yasuo Kida, and everybody calls me Kida-san. I'm the manager of OS Engineering Tokyo in the Frameworks Group. I'll first explain a little bit about how Japanese input methods work, then the challenge we faced, and how we took advantage of LSM to tackle the challenge.
You might know the Japanese language uses thousands of characters, and they don't fit on a keyboard. So how do we enter them? Well, a Japanese input method uses a process called kana-kanji conversion. You enter the pronunciation using phonetic characters called kana, and the input method converts the input into kanji. An issue is there are so many possible conversions, and you need to pick the correct one. For example, when you enter this pronunciation, those are actual possibilities, and the input method chooses the correct one. This is similar to voice recognition or handwriting recognition. It can never be 100% correct, but accuracy is critical for the usability of the system.
So, the Japanese input method called Kotoeri calculates the probability like this. It uses a model called class N-gram. It looks up each word in the possible conversion from the dictionary and gets its frequency. And it also looks at parts of speech, the probability of parts of speech connecting to each other in a conversion, and multiplies them together to get the probability of a conversion. In the actual calculation, we use a value called cost, which is the logarithm of the probability value. By doing that, we can add integers together instead of multiplying floating-point numbers.
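In the usual class-bigram formulation, that calculation looks roughly like this (a sketch; Kotoeri's exact model is its own), where the c_i are the part-of-speech classes:

```latex
P(w_1 \dots w_n) \;\approx\; \prod_{i=1}^{n} P(w_i \mid c_i)\, P(c_i \mid c_{i-1}),
\qquad
\text{cost} = -\log P
\;\Rightarrow\;
\text{cost}(w_1 \dots w_n) \;\approx\; \sum_{i=1}^{n}
\bigl[\, \text{cost}(w_i \mid c_i) + \text{cost}(c_i \mid c_{i-1}) \,\bigr]
```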
This model works pretty well. However, there are ambiguities that cannot be solved by the N-gram, because the N-gram only looks at neighboring words. For example, in English, if you say, "What a beautiful tail!", is it about the rear part of an animal or about the story? You don't know. This ambiguity can be resolved by knowing the topic of the discourse. This is where Latent Semantic Mapping comes in. For example, if the word "princess" appears in the text, most likely it is the tale, the story, instead of the tail of an animal.
So we built an LSM map that does this. We first prepared a Japanese corpus containing 300,000 documents, and we clustered them into 3,000 categories. The purpose is to capture the topic at an appropriate resolution. And then we removed words that have low frequency and removed words that appear equally in many categories. For example, the word "this" or "that" gives little information about the topic. And we set the dimension of the LSM map to 200, like this code: we called LSMMapSetProperties with the number 200.
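That call looks roughly like this (a sketch; it assumes the kLSMDimensionKey property key from the LSM framework header):

```c
int dimension = 200;
CFNumberRef dim = CFNumberCreate(NULL, kCFNumberIntType, &dimension);
CFStringRef keys[]   = { kLSMDimensionKey };
CFTypeRef   values[] = { dim };
CFDictionaryRef props = CFDictionaryCreate(NULL,
    (const void **)keys, (const void **)values, 1,
    &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

LSMMapSetProperties(map, props);   // set the map's dimension to 200
CFRelease(props);
CFRelease(dim);
```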
You might wonder how we came up with those magic numbers, like 3,000 categories or 200 dimensions and so on. We iterated measurements over the number of categories (2,000 to 5,000), the threshold for removing words, the number of dimensions, and some other parameters, in order to optimize the conversion accuracy as well as the size of the LSM map. The size is important because the size of the LSM map tends to become very big when you have a large corpus. So you want to make the map space-efficient.
At runtime, Kotoeri captures what you have typed so far, and that history is kept for each document. And when it converts an input stream, it uses LSM to obtain words that best match the word history. For example, when you have typed "performance," "violin," and so on so far, the LSM map might return words like "music," "competition," "soprano," "cello," and so on. This is a real example that was returned from the Kotoeri LSM map, of course translated to English.
And we adjusted the cost of probable words that appear in the possible conversions. By doing that, the probability of each word is now aligned with the topic of the discourse. And consequently, the probability of the whole conversion reflects the topic. So in a word, LSM boosts the probability of words that best align with the topic of the discourse. And this is exactly what humans do. And the result? Kotoeri version 4, which took advantage of LSM, was ranked number one in conversion accuracy in a benchmark conducted by an independent researcher.
So, thank you. At the end of the day, the improvement achieved by adding LSM is relatively small, but it was still enough to surpass the competition. I hope this gives you a hint of how you could take advantage of LSM. Thank you. Now, Matthias, welcome to the stage.
Thank you very much. Thank you, Kida-san. So what has LSM done for you lately? Maybe it has thrown away a junk mail for you. Maybe it has converted some kana into kanji for you. But more importantly, what can LSM do for you? What can it do for your application? Latent Semantic Mapping can categorize lots of things. Think of bookmarks. Right now, in Safari, you're hitting "Add this to my bookmarks folder" and you're going to get a pop-up menu that always starts at the same place.
How about if your web browser fetches all the web pages in your bookmarks menu, categorizes them, and uses that to propose the category to add your bookmark into? You could do the same thing for RSS feeds. RSS readers are, I think, a very fluid category right now.
If you had a library application, you could use it to categorize books or CDs or DVDs by fetching reviews from a site to categorize them. If you're a yuppie and would like to keep track of your wine and cheese database, be welcome. If you're a biologist and would like to categorize your DNA sequences: are you a man or a mouse? Latent Semantic Mapping might be able to tell you.
You're limited only by your imagination in what you can use Latent Semantic Mapping for, and people often ask us, "Is this a suitable field to apply Latent Semantic Mapping to?" Ultimately, it comes down to this: the tools are relatively easy to try. Try it for yourself; find out whether it works for you. For more information, talk to John Galainzi, our technology manager. There is a public mailing list as of today or tomorrow that you can subscribe to. And as usual, we have sample code. We have a brand new sample code application written by my colleague, Jia Pu.