Write a Spotlight Plug-in Today - WWDC 2005

Application Technologies • 1:03:14

Learn how to create a Spotlight plug-in to make your custom file format searchable by Spotlight. We'll show you how to write a Spotlight plug-in to export information from a basic file, create custom keys that the Finder can display, and debug your plug-in.

Speaker: Dominic Giampaolo

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Good morning. I'm Dominic Giampaolo. I'm a Spotlight weenie and member of the Spotlight team. And today we're here to talk about Spotlight importers. What are we going to talk about? First off, what is Spotlight? Well, of course you've all seen it. You've used it probably. You've seen the keynote presentation. But we're going to go into a little bit more depth about the architecture and how it's put together.

The next thing we're going to talk about is importer background, what you need to know to write an importer, the context in which importers run, and some other information about importers. Then, of course, we're going to write an importer, since this is a hands-on session. And then we're going to talk a little bit about some tips and tricks and debugging, performance, things you need to know about importers once you've written one. Then we're going to talk about advanced importers. Advanced importers are a little bit more sophisticated in the things that they do, and that will be the second importer that we write.

The Spotlight architecture. What is Spotlight? Well, at the first level, it's a system for storing, querying, and retrieving information about files. It's a kind of different lens or a different view onto the files that you have on your hard disk. It's an alternate way to look at them. What is it made up of? Well, at the center, you have a server. The server centrally orchestrates everything. There's some daemons that run on behalf of the server and do work for it and help it out.

The part that we're here to talk about today are the importers. Importers are probably the most crucial part of the Spotlight system, because that's how we get metadata into the system. Spotlight lives and dies by the quality of the metadata that we have. If we don't have good metadata, we don't have anything. We have searching by file names. So importers are a very crucial part of the system.

And of course, there's a client API that applications use to query the system and to find out information about files. But we're not going to be talking about that at this session today. That will be later on this afternoon in Javier and Vince DiMarco's talk at 2 o'clock.

So why do you care? Spotlight makes your documents easier to find. So when someone remembers something about their file, oh yeah, I entered this label or this track name, they can find it, because Spotlight can index that information. It makes your documents display more richly. So you don't have a display that's limited to just the name and size and last modified time, which really is irrelevant for a lot of documents. You can have things such as the title or the authors or the duration of a piece of music, a lot more rich information about it that will help the users to organize things and to remember things about the documents.

A big point is that it doesn't require any code changes in your application. This is a big point. If I got up here and said, well, Spotlight's going to take you six months to work into your application and to become a part of, you'd all take me out back and beat me up.

You all have schedules. Releases to meet. You don't have a lot of extra time. What I'm here to tell you is, in fact, that to be part of the Spotlight system doesn't require significant changes, or doesn't require any changes in your application, doesn't require significant investment of effort either.

It also makes it easier to share data with other applications. So what's good about this is that instead of having every app have to know about every other app's file format to get just some little piece of information, there's a clean and consistent way to get at that information about files, the metadata about files, through the Spotlight APIs.

So, let's get graphical. What does it look like? At the center, you have the Spotlight server. On the left, you have documents, which are created by applications. When those documents are modified, importers run to extract the information and pull it into the Spotlight server. The Spotlight server then puts that information into either the metadata store or the content index store as appropriate.

We have two separate stores because there's different ways of indexing information. The metadata attributes are stored in one way, and the content index is stored in a format that's appropriate for searching it. It's more sophisticated for that kind of search. And then on the right-hand side, you have applications that talk and communicate through the Spotlight API.

So, let's step through it. What happens when you write a file? Again, at the center, we have the Spotlight server, which is always running in the background and keeping an eye on things. An application, that's you guys, comes along, writes a document. That document gets modified, the file is closed, the kernel says, boom, hey, that file was modified to the Spotlight server. Spotlight server, in turn, runs an importer. That's you guys.

The importer takes the information out of the file, pulls it into the Spotlight server, which, of course, updates the data stores, and in turn, sends updates to applications that have live queries or are asking information about that file. So, that's kind of, you know, at a high level, how the Spotlight system works. What are the concepts behind Spotlight? What we're here to talk about today are importers.

Importers are plug-ins that are loaded by the Spotlight system and publish metadata from a particular file format. So an importer handles some set of file formats, knows how to take data that's in the file, metadata, pull it out, and pass it back to the Spotlight server. There's the MDItem. Items represent files.

All items in the Spotlight system are represented by files or correspond to a file. Items are made up of attributes, and attributes are a name, a type of data, and a value. The value is the data that's associated with the name. There's also the MDQuery API, which is how you find items. Now, we're not going to talk about the last two things. Again, that's the talk later this afternoon. We're going to focus on importers.

So, how can you integrate with Spotlight? The first level, if you work with an existing file format that already is supported by the Spotlight system, such as JPEG or MP3 or a variety of other formats, put useful metadata into your documents. This is a big thing. A lot of file formats, like I said, such as JPEG, already have support for metadata. And if your app allows the users to edit that information, you're already part of the Spotlight system. You're enabling people to work with metadata and to be a part of the Spotlight system. The next level, if you have a custom file format, then write an importer.

Other things that you can do to help: don't type files incorrectly. We've seen a few cases where people will go and write out this very nice 10, 20, 40, 50 megabyte file of binary data and say that it's text. That really hurts us because it can take quite a while to figure out that this isn't one of the 20 or 30 different character set encodings that are out there, and we can waste a lot of time.

So make sure that you type your files correctly. If it's actually binary data, don't type it as text, please. Put your temp files where Spotlight won't index them, such as in /tmp. If you don't already do that, that's a good place to put things. Or if you want to, put them into a directory whose name ends in .noindex or .build.
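The directory-naming rule just described can be modeled in a few lines of C. This is only a sketch of the rule as stated in the talk, not Spotlight's actual implementation, and the function name `path_is_skipped_by_name` is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if the first n bytes of s end with suffix. */
static bool ends_with(const char *s, size_t n, const char *suffix) {
    size_t m = strlen(suffix);
    return n >= m && memcmp(s + n - m, suffix, m) == 0;
}

/* Hypothetical helper: true if any directory component of path has a
 * name ending in ".noindex" or ".build". The last component is treated
 * as the file name and is not checked unless it is followed by '/'. */
bool path_is_skipped_by_name(const char *path) {
    const char *p = path;
    while (*p) {
        while (*p == '/') p++;            /* skip separators */
        const char *start = p;
        while (*p && *p != '/') p++;      /* scan one component */
        size_t len = (size_t)(p - start);
        if (*p == '/' && len > 0 &&       /* followed by '/': a directory */
            (ends_with(start, len, ".noindex") ||
             ends_with(start, len, ".build")))
            return true;
    }
    return false;
}
```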

This way Spotlight will see that and not index any of the files that are in that subdirectory. So this is a good place to put temporary files that you don't want Spotlight messing with. Now, I'm going to do a quick little demo. Can we switch to demo one?

Okay, so I'm going to do a search here for "doc." And you see this pulls up a whole wide variety of things, and it looks pretty nice, right? There's some system preferences, there's folders, there's images, and if I click on an image, there's some information stored for it. Oh, not for that one, for a TIFF file.

I can't do that actually. So here we can see the dimensions and so on. There's PDF documents which have lots of metadata title. And you might say, "Wow, that's really great. Look at all that. They probably hard-coded all that stuff." Well, what happens if I was to search for, let's see, 8, 6, 7, 5, 3, 0, 9. Okay. What do we get here? A few of you got that. All right. This is file badbug.dbg.

Well, what's that? What's a .dbg file? Aside from also being my log-in, it's a little file format that I invented for the purposes of this demo where I record some information about bugs that get assigned to me at Apple. And so if I click on the info icon, we can see, oh, debug session name. This is about Jenny's file system woes. And the radar number. Radar is the name of the bug system at Apple, 8, 6, 7, 5, 3, 0, 9. And there's some keywords associated with it.

And what's cool about this is that this file is seamlessly integrated as part of the Spotlight system. If I was to do a get info and pull it up into Finder, Finder doesn't know anything about .dbg files and nor does Spotlight. But because of the way I've written the importer for this file format, which is very simple file format, but the way I've written the importer gets my data integrated in a very clean way into the rest of the system. And this is what we're going to show you how to do later today when we go through an advanced importer. Okay. Back to the slides.

Slides? Thanks. So, Apple ships a whole bunch of importers for quite a variety of file formats, everything from JPEG and Photoshop, QuickTime, of course, PDF. If you work with these file formats, your job is done. You guys can get up and leave. No, stick around. There's not much that you have to do, actually, except for, again, putting useful metadata in the documents.

But if you don't work with those file formats or you have a custom file format in addition to some of those, then you have to write an importer, and that's what we're going to talk about now. So, here's how we're going to do it. Some rules of the game, then we'll go through a simple importer. We'll talk about performance, tips and tricks in debugging, and then we'll move on to an advanced importer that handles that .dbg file format I just showed you.

Some basics. Importers need to publish metadata that helps the user. That's what this is all about. A lot of times people hear about Spotlight and they go, "Man, this is just great. I am gonna publish so many attributes, you wouldn't believe it." That's a little bit too much excitement. What's better is to keep in mind you want to publish things that are going to help the user find their document. That doesn't mean 75 attributes.

You want to publish things that are going to allow the document to display more richly. So keep that in mind, keep that in the forefront when you're making your choices about what kind of metadata to publish. Importers should not use the Spotlight system to store large blobs of binary data or icon previews. That's really not what it's about. Again, Spotlight is about searching user metadata.

And what that comes down to is that too much noise can confuse the user. If your documents show up for pretty much anything that anyone types, because you have attributes that match everything, basically, that's not helpful. That doesn't help the user find their documents. That just basically spams them. So you want to keep it so that you have relevant metadata for the documents of your type.

Another aspect is that there's no user interface to an importer. In fact, we disallow it in the context in which an importer is run. So you can't pop up a dialog box that says, "Do you want me to publish this attribute or this attribute?" You have to keep everything running silently in the background.

UTI types. This is something that not a lot of people are familiar with. It's a system that was introduced in the Panther time frame, the Uniform Type Identifier system. And it's a pretty neat way to manage type information about files. The file name extension or type code determines the file type.

For any UTI type, there's only going to be one importer. One importer, however, can handle a variety of UTI types, such as our image importer, which can handle JPEG, TIFF, Photoshop, and so on. You have to get your UTI declarations right. This is probably the most difficult part of writing an importer, and that's something we're going to spend a fair bit of time on in the coming slides.

Attributes. Well, what are good attributes? Things that, again, the user will associate with the document. If the user actually bothered to type in some text for a document, then that's probably something that's a candidate for being published as an attribute: things like the copyright, the title, track names, layer names in documents.

Text content, however, is a special attribute. If you have a file format where, for example, you have a notes field or some other sort of large body of text, that's probably not appropriate to be published as an individual attribute, but it's more likely to be published as a kMDItemTextContent attribute, which is a larger body of text that gets indexed into the content index store.

Bad attributes are things like app-specific implementation details, window position. People don't search for that window that I put at x equals 572 and y equals 379. That's useful to you, but that doesn't belong in the Spotlight system. Here are some attributes that we have predefined. There's some general attributes such as the kMDItemTitle, the authors, and so on. We have type-specific attributes for images, movies, audio.

There's a full list of them in the MDItem.h header file, which is part of the Metadata framework, which lives inside of the CoreServices framework. If you Command-double-click on any one of those names in Xcode, it'll pull up the header file and there's a whole bunch of HeaderDoc in there that explains the types and what the attributes are about.

So, writing a basic importer, what do you have to do? First off, you start with the metadata importer template in Xcode. If you pull up a new project window in Xcode, you go down to the Apple Plug-in section and choose Metadata Importer. You edit the Info.plist file that's in the template file.

You have to write some code, of course, and that's done in the GetMetadataForFile stub function that we provide. And then you install it, and that's basically all there is to it. So let's go through the steps. In the Xcode Metadata Importer, you see there's a standard Apple Plug-in section, and there's Metadata Importer. You click Next, give it a name, and you're ready to go.

Then you have the Info.plist file. Here, if you need to define the UTI type for your file format, you want to edit the UTExportedTypeDeclarations key and the information that's in there. And this is a fairly complicated step, so we'll go through that when we write the importer.

You have to make sure that your UTTypeConformsTo section is correct. That's, again, very important to get your documents displayed properly in the rest of the system. Then, once the UTI is defined, you need to bind that UTI type to the importer that you're writing. And that's done in the LSItemContentTypes section. Here is where you identify which UTI types this importer will handle. And this is the binding that matches your importer with a particular file type. So when the system sees a file of that type modified, it knows which importer to call.

Of course, there's a bunch of other informational fields, name and version number and so on of the importer bundle. You should edit those. However, one final warning, don't just do a global search and replace of com.apple and just say com.mycompany. That will break the file format and will not work.

Implement the code. Well, I can't tell you how to do this for your file format, but we have a stub function called GetMetadataForFile, which should efficiently extract the attributes from your file format and put them into a dictionary, a CFMutableDictionary, that is used by the system to then publish those attributes.

The last thing you have to do is to install your importer. There are a couple of locations where you can install it. You can put it standalone into /Library/Spotlight, and if you're just releasing a separate importer by itself, that's probably the most logical location. You can put it into the user's home directory in ~/Library/Spotlight.

Or you can put it in with your application. You can have an app-bundled importer. If you have Foo.app, you put it into the Contents/Library/Spotlight folder, and you put the package or the bundle for your importer there, and then the system will recognize that when your app is installed. There's a precedence. As I mentioned, you can only have one importer for a particular UTI type. The precedence that we look at is the user's home directory, ~/Library/Spotlight, then /Library/Spotlight, then /System/Library/Spotlight, and then app-bundled importers. So, let's go on and write a simple importer. Demo 1, please.

As I said, in Xcode, if you come up to here and choose a new project, and don't worry, I'm not going to actually write all the code by hand here live. I scroll down, and if I see Metadata Importer Template, and choose Next, and give it a name, Fufu, what we get is a sort of standard template that has a bunch of pre-populated files for a metadata importer. But, like I said, we're not going to write the code completely from scratch here. We're going to start with the IFF importer demo.

And the first thing that we do is pull up the Info.plist file. So this is going to be an importer for the old Amiga IFF file format. I just happened to be browsing around the developer sample code and found a piece of code that Ali Ozer wrote that knows how to open up these image files. And I said, "Oh, that's great because I don't have to do anything." And so I wanted to just publish the little bit of metadata that I could out of these files.

And so the easiest thing was just the width and height for the purposes of this demo. So we're going to go through the process of what it takes to write an importer for this Amiga interchange file format picture. Now, the first thing that we do is the UTExportedTypeDeclarations section. We have to identify what the UTI type name is, and we're going to call it com.amiga.interchange-file-format. We give it a simple little description, IFF image file format. And now we have to say what it conforms to.

And I'm going to talk about this a little bit more later, but for now we can just say this is an image and so it conforms to public.image and that takes care of everything for us. We have the OS type, which is the four character code, which is unlikely to be assigned for this, so I just chose .IFF as a four character code. And then we have the file name extensions, which are the most likely thing to be assigned for this file. So we're going to list .iff, and we're going to list .ilbm.

We've also chosen to define a MIME type for this UTI, because UTI subsumes MIME types, four character codes, and everything, and we've called it image/iff. So that takes care of defining the UTI for the com.amiga.interchange-file-format type. Now, as I said, you have to bind, you have to connect that UTI type, which exists in the UTI space, with this importer.

And we've done that with the LSItemContentTypes. This is saying, this Info.plist is declaring that this importer's role is to handle the com.amiga.interchange-file-format type. I've also edited some of the other informational fields to give the name and so on, but that's not so interesting.
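Put together, the declarations described in this part of the demo might look roughly like the following Info.plist fragment. This is a hedged reconstruction from the spoken description, not the sample project's actual file: the key names are the standard UTI and document-type keys, the values are the ones mentioned in the talk (reading the spoken identifier as com.amiga.interchange-file-format), and the rest of the plist is omitted.

```xml
<!-- Fragment of Info.plist, inside the top-level <dict>.
     A sketch only; not the sample project's actual file. -->
<key>UTExportedTypeDeclarations</key>
<array>
    <dict>
        <key>UTTypeIdentifier</key>
        <string>com.amiga.interchange-file-format</string>
        <key>UTTypeDescription</key>
        <string>IFF image file format</string>
        <key>UTTypeConformsTo</key>
        <array>
            <string>public.image</string>
        </array>
        <key>UTTypeTagSpecification</key>
        <dict>
            <key>com.apple.ostype</key>
            <string>.IFF</string>
            <key>public.filename-extension</key>
            <array>
                <string>iff</string>
                <string>ilbm</string>
            </array>
            <key>public.mime-type</key>
            <string>image/iff</string>
        </dict>
    </dict>
</array>

<!-- Binds this importer to the UTI declared above. -->
<key>CFBundleDocumentTypes</key>
<array>
    <dict>
        <key>CFBundleTypeRole</key>
        <string>MDImporter</string>
        <key>LSItemContentTypes</key>
        <array>
            <string>com.amiga.interchange-file-format</string>
        </array>
    </dict>
</array>
```

The identifier string has to match exactly in both places; that is the binding between the type and the importer.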

Now, in the code, here we'll pull this up and make the window a little bit bigger, we have the stub function, GetMetadataForFile. And there's a couple of parameters that are passed to it, but I'm just going to call off to the Objective-C code that Ali had written and that I've wrapped up in a little stub function. So we have a CFString, which is the path to the file, and an NSMutableDictionary, which holds the attributes. The CF and NS types are toll-free bridged to each other. So we call off to the Objective-C code, and I'll pull this up.

We're not going to actually go through the process of reading the IFF file format. I'll just talk about how it's called. So the first thing we do is we create an NSData object: we say NSData dataWithContentsOfFile and we pass it the path that we were given. If we failed to get the file because it doesn't exist or there was some error, we just return false and we're done.

Once we have the file data, we have this IFF ILBM info structure named PIC. And we set up this parsing, parse the file into this PIC structure, and that has the metadata that we want to publish for this file, for the width and height. I've actually disabled the code that decodes the image.

So then the next thing that we have to do is publish the width and height. This variable, pic.bmhd.w, apologies for the wonderful naming conventions, we create an NSNumber for that width variable. And then we say, for the key kMDItemPixelWidth, we're going to set the value num into this attributes dictionary.

And then we're going to do the same thing for the pixel height. So for that key, we've created an NSNumber from pic.bmhd.h for the height, and we set the value and put that into the attributes dictionary. We're all done, and we return true. If you're actually looking at this code on the DVD, there's two extra releases in there that shouldn't be there, but I didn't make the change before the DVD cutoff date.

So we have that. Let's build our importer. Okay, build succeeded. These are very fast machines. And now I'm going to pull up a terminal. And now the last step that I have to do is to install the importer. I'm going to copy build IFF importer.

[Transcript missing]

Sure enough, if I scroll through this, we have Users, Apple Library, Spotlight, IFF Importer. So as a first pass, we've seen that the Spotlight system has found our importer, and that it was able to load it, and so we've got a first level of success. Now, what we're going to do is see if we can actually import some data from a file.

[Transcript missing]

-d2 -f sample image, we'll get a little bit of output here. And let's go through what we got. So this is the output from when I ran that command. It says import of sampleImage.IFF and the type com.amiga.interchange-file-format. And it says it's going to do it using the IFF importer that we just installed.

So this part looks great. And it says that it's sending these attributes for that file to the server. There's a bunch of other stuff such as the content creation date and the content type, which were done for us. And then, lo and behold, there we have our pixel width and pixel height. Great. So with very little effort, we've done a very simple importer and published two attributes for this file. Now you say, great, Dominic, nothing ever goes that smoothly. Demos are always nice and everything works fine the first time you do it.

Let's say someone comes in and says, "Dominic, Dominic, Dominic, what did you do? com.amiga.interchange-file-format? That's not the right way to declare it. You should just say com.amiga.iff for the UTI type of this file." Sure, no problem, Bob. I can make that change. It's just two strings that I have to edit.

What I'm going to do here is introduce a change into the Info.plist file and then show you what can happen or one of the problems that you can run into when you're making changes. I saved that, made my two little edits, and hit Build. It's fast. It's just the plist file. Go back and I'll copy this into place.

I've installed my importer again. I say, "Okay, no problem, Bob. Let's take a look here." We'll do an mdimport -d2 -f of that sample image again. Ooh, that's not good. Let's see. So, like I said, I've purposely introduced a problem here to demonstrate how you would go about and fix it. So what we can see happened is it says com.amiga.interchange-file-format. There's no importer for it.

But wait, I changed my importer. I changed the Info.plist file, right? So what's going on? Well, the problem is that because we just copied in the importer for the UTI, the system didn't notice that the plist file had changed. So what you really need to do before you install it is always remove the previous one.

I do an rm of the old importer, and then I copy things in again. Now the system will have properly noticed that the file was changed, the Info.plist file was updated, and sure enough, there we go, our type is now com.amiga.iff, and the pixel width and height got published properly.

So this is one of the things that you can run into when you make changes to the Info.plist file that you need to be aware of. You don't want to be too clever. You always want to remove the old copy and copy in the new one. And as we'll see later on, you can also run into issues if you have multiple copies, if you have backup versions of your importer sitting around, you need to be careful with that, because the old versions can sometimes mysteriously override things, and I'll talk about how to figure that out. So that hints... And that's the end of writing a simple importer. If we go back to the slides... Let's recap what we did.

The first thing was we edited the Info.plist file. We defined the UTI type for the com.amiga.interchange-file-format, and we bound that importer to the UTI type that we defined. We had to write the code, which was easy for us thanks to Ali. We just ripped it off and packaged it up, and then we installed it. We ran the mdimport command and we tested it out.

Now, let's start to talk about performance. This is a very crucial topic for importers. Importers, like we just ran, are very simple, straightforward. They simply, you know, you run it against a single file, everything looks great, no problem, I'm done, time to, you know, crack open a brewski and call it a day. Well, it's not quite that simple. You actually need to check your performance and you need to be careful about a few things.

When you run it against a single file, like I said, it's a very simple sort of scenario. However, importers run in different contexts. If someone comes along and plugs in a hard drive that has, I don't know, 5,000 or 10,000 instances of your file format on it, your importer is going to run in a very long-lived context.

It won't be run for every single one of those files, but it will be run for many of them at a time, repeatedly. So if you have a memory leak, or if you do something that really abuses or thrashes the system, the user is going to notice it. So you need to be careful about performance. You need to use caution when reading large files. You don't want to just blindly suck in 50 megabytes of data like that IFF importer did.

If we were passed a file which wasn't really an IFF file, and we did NSData dataWithContentsOfFile on a 50 megabyte file, we'd thrash the machine pretty thoroughly. So you need to be cautious when you read large files. Not only should you avoid reading all the data if you can, you should just seek around and use your knowledge of the file format to get the bits that you need, but you also have to avoid allocating lots of memory. I just used that because those Amiga IFF images typically are 320 by 200, and I was lazy, it was a simple demo, but that's not the kind of thing that you would do in a production one.
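As a hedged illustration of "seek around and get the bits that you need," here is a minimal C sketch that pulls the width and height out of an IFF ILBM file without reading pixel data or allocating a buffer for the file. It is not the demo's code. The chunk layout it assumes (a `FORM` container of type `ILBM` whose `BMHD` chunk starts with big-endian 16-bit width and height) comes from the published ILBM format rather than from the talk, and the name `ilbm_read_dimensions` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reads a big-endian 32-bit value; returns false on a short read. */
static bool read_be32(FILE *f, uint32_t *out) {
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4) return false;
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
    return true;
}

/* Hypothetical helper: locates the BMHD chunk and returns the image
 * width and height, skipping over every other chunk with fseek. */
bool ilbm_read_dimensions(FILE *f, unsigned *width, unsigned *height) {
    unsigned char id[4];
    uint32_t size;

    /* Container header: "FORM", 32-bit size, form type "ILBM". */
    if (fread(id, 1, 4, f) != 4 || memcmp(id, "FORM", 4) != 0) return false;
    if (!read_be32(f, &size)) return false;
    if (fread(id, 1, 4, f) != 4 || memcmp(id, "ILBM", 4) != 0) return false;

    /* Walk the chunk list: 4-byte ID, 32-bit size, then the data. */
    for (int chunks = 0; chunks < 64; chunks++) {   /* bound the scan */
        if (fread(id, 1, 4, f) != 4 || !read_be32(f, &size)) return false;
        if (memcmp(id, "BMHD", 4) == 0) {
            unsigned char wh[4];   /* w, then h: big-endian 16-bit each */
            if (size < 4 || fread(wh, 1, 4, f) != 4) return false;
            *width  = ((unsigned)wh[0] << 8) | wh[1];
            *height = ((unsigned)wh[2] << 8) | wh[3];
            return true;
        }
        /* Skip the data; chunks are padded to an even length. */
        if (fseek(f, (long)size + (long)(size & 1), SEEK_CUR) != 0)
            return false;
    }
    return false;
}
```

A file that is not really IFF fails on the first four bytes, so a mistyped 50 megabyte file costs one small read instead of a full load.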

Don't blindly reuse code. This is a frequent thing that's convenient to do to get a first pass up, where you just take your existing document reading code, pull that in to the importer template, and just inflate your full document structure, pull out the three integers that you need to publish or the few strings that you want to publish, and then call it a day.

That's probably not the right thing, because if you're re-inflating an entire document structure with all the fields and all the data, you're probably using a lot more system resources than you need to. It's much better if you, again, use your knowledge of the file format to find the pieces of data that you need and get those out efficiently and publish those as quickly as you can.

I've mentioned mdimport -p. This is a good thing to do: if you take a large directory that has a lot of copies of your files in it and run mdimport -p with that directory name, it will go through and iterate and import all of those files. It will print out some statistics about how long each file took, where you spent a lot of the time. This way you can start to diagnose if you're using too much memory, if you're taking too long to process files, and understand what's going on.

Some tips for importers: Use standardized attribute names where you can. Don't invent new things just to be gratuitous: it's not kMDItemMyCompanyTitle. No, it's probably kMDItemTitle. Use the existing ones that I mentioned that are in MDItem.h when you can. So don't invent new things gratuitously. If you have something like a headline, well, that can probably be classified as a title.

Don't forget text content when it's applicable. We don't all sit around writing word processors and text editors, but a lot of different applications actually have quite a bit of text in them. For example, this presentation here has a fair bit of text in it, yet it's not really a word processor, and so we publish things through the kMDItemTextContent attribute.

As I mentioned earlier, if you had a notes field or comments or some other large body of strings, you can publish that information as text content, and it will be associated with the document, and it'll be indexed efficiently, and the document will be found when a user searches for those words that show up in there.

Don't publish too much. Again, there's sometimes a lot of initial excitement, "Oh baby, I can publish all these attributes." Well, it's important to keep in mind what is going to help the user find the document. What are things that they might be likely to search for? If you need to remove attributes, for example, you see that the file has been modified and now there is no longer copyright information that's been taken out, or the title or the author has been deleted, you can remove it by simply not publishing that attribute. In the Template Importer project, there are two files that get included: schema.xml and schema.strings. We're going to talk about those in a minute, but if you don't use them, just delete them from the project.

Now, here's where it gets tricky. The find points of the Info.plist file. Or I should say, not tricky, but a little bit more advanced and subtle. There are a couple of things. The UT imported type declarations instead of UT exported type declarations. The difference here is that if you are defining a UTI type and you are the master of that definition, such as in the IFF importer example, you would use UT exported type declarations. That means that you're the master copy of the definition for that UTI type.

If you're shipping an importer for some other file format that you are not the controller or owner of, you probably want to use UTImportedTypeDeclarations, which says: if this type declaration does not exist, then I would like to define it, and here's the information for it.

This would also be the case if you're shipping your importer separately from your application, for the same file format, and you intend to update your application with the full UTI information later on. You can ship the importer separately, and the information will get overridden later on.

Keep all the values that you store for the keys lowercase when appropriate. In the description fields it's not as important, but in the other fields generally keep everything lowercase. Don't forget about the OSType attribute if you need it. That's the four character code. So if you use four character codes, define the OSType so that the connection is made, or that binding exists, for that way of identifying a file type with your importer.

The UTTypeConformsTo section is pretty important as well. There are two notions of conformance in the UTI system. There is the physical hierarchy, that is, how the file is stored, and then there's the semantic hierarchy. So how is a file stored? Well, either it's a plain file, you know, a stream of bytes like we're all used to, or it's a package. There are also directories, but that doesn't really apply here.

So you have public.data for a flat file, and you have com.apple.package for things that are package-based file formats. So that's the physical hierarchy, and you should say what your file conforms to on the physical side of things. The semantic hierarchy is what is stored in the file. So here you have something very generic, public.content, which just says I've got stuff in my file, or public.composite-content when you have mixed media, such as text and pictures, as you would in a PDF file.

And then you have additionally things for specific data types, such as public.image, if you have a new image format, or audio and so on, movies, etc. Again, Chris Lynn has a talk, I think it's at 5:00 today, about the UTI system, and I strongly encourage you to check it out to learn more about that aspect of things.

As you saw, I used a couple of command line tools to help me with diagnosing or figuring out whether my importer was written correctly and was working. The most important one for importers, strangely enough, is named mdimport, which imports files and prints out some global state about the Spotlight system.

There's mdls, which will list the attributes of a file, and mdfind, which will let you issue a query to see if your attributes are being published properly so that you can query against them and find them. Read the man pages for the full list of options for those programs. They're definitely very helpful.

Again, testing and debugging: what you saw that I did was start simple. I ran mdimport -L and saw that my importer was listed. That's job number one: did the Spotlight system even see it? Then I ran mdimport with -d1 or -d2 on a file and saw that it was imported.

Be careful if you have multiple copies. As I alluded to, if you start having multiple copies around that you've copied with the Finder or in the Shell, the system can start to find too many of them and the Info.plist files can conflict. So you need to be careful with that.

Always start from a fresh importer template. Sometimes you may have a couple of different file formats. You say, okay, boom, I got my first one done. Now I'm just going to make a copy of this project and start on the second file format and whip through that one and, you know, be done by 5:00. You don't want to do that because, as you saw at the end of the file, there are some UUIDs that are automatically generated as part of the template. If you just make a copy of it, those UUIDs will be the same, and that's not good because the plug-in system gets confused when two bundles have the same UUID.

You may need to actually run the debugger on your importer. It would be great if everybody wrote code that worked the first time, like it did in the demo, right? Well, it didn't work out that way the first time when I was writing it, so sometimes you have to run GDB on an importer. The way to do that, since it's a plug-in, is to run GDB on the mdimport program, and then you can set a future breakpoint, which says: when this function gets defined, set the breakpoint, and then give me control when it gets hit.

So if you just, you know, set a breakpoint on GetMetadataForFile, you'll get control of mdimport when your importer is called. You can also do this in Xcode if you set the project executable to be mdimport and then, you know, set the appropriate command line arguments as well.

When it still doesn't work. This is what happened to me late last week as I was trying to do the final touches on this presentation. I just couldn't for the life of me figure out what was going on when I went through the example of switching the interchange-file-format to IFF. Well, it turned out that there were some issues with lsregister knowing about too many copies of it in the plist file. So the way to diagnose this is with lsregister -dump.

What this will do is dump out the full state of the UTI system so that you can see who is defining what, and whether things are conflicting in a way that you, you know, don't want. So with lsregister -dump, you can see who it is that has defined all of the UTI types and if there are multiple definitions for the same UTI.

If there are, you can force re-importing a particular one with -f -v, or you can use -kill -seed to get it to fully recreate its UTI structure. The lsregister program is part of the Application Services framework, and it lives in that nice long path right there. You can use that to clear up any potential issues that you have. That's again a last resort, however. Now, on to advanced importers.

In the demo that I did, you saw how I had this .dbg file format, which is, you know, not known to the system, and yet those, the attributes that I published showed up very cleanly and looked like a native part of the system. That's what we're going to talk about now. So, how does that happen? First off, I defined some new attributes.

One of the things that I had to do was go and edit the attributes section of the schema.xml file. The schema.xml file is where you define new attributes. So, for example, I have two of them right here. You have an attribute whose name is com_foo_track_title, and I say that its type is CFString. Another one is, for example, a user ID, and I say that that's a CFNumber.

The next part that you have to modify is the types section. Here, you add each attribute that your importer publishes to the allattrs list. So even if it's a standard attribute, you would want to add it to the allattrs section, in addition to the new ones that you're defining.

If you want an attribute displayed in the system, so for example, I had com_apple_dbg_session_name, or radar number, then I add those to the displayattrs list, and then the system will know that these are attributes that should be displayed.

Why would you have an attribute that you don't actually display? Because you might do something that searches for it on behalf of the user to identify things, but it's not something that you would want shown in the Finder or in the Spotlight window.

Don't redefine existing attribute names. So you should never define something like kMDItemFoo that we already have in MDItem.h. You would never add that as a new attribute in your schema.xml. You might add it to the allattrs list or the displayattrs list, but you would not actually define it. You can run mdcheckschema on the schema.xml file to see if you got the syntax right. And mdimport -X and -A will also dump out some state so that you can see whether, once your importer is installed, the schema took effect.

Now, of course, if we showed our attributes as com_foo_track_title, with all those nice underbars, people wouldn't be very happy. So there's a way to map from the internal name of an attribute to the external name or the user visible name. You can define human readable strings in the schema.strings file, which is a UTF-16 encoded file, so you need to use an appropriate editor to edit it.

If you're in Xcode, you don't have to think about it, but you should know that it's a UTF-16 encoded file. So here you can define the mapping from com_foo_track_title to be equal to Track Title, which is what the user would see. You can also describe your attributes. Now, this is something that's used if you've ever brought up the Finder's advanced find window.

In there, you can bring up the searching for any kind of attribute, and there's a list of attribute names in the left column and a short description. This is that description. And in this case, it's "the title of a track," which isn't a whole lot more help, but that's the example. You can localize your attribute names as well. So in the individual language .lproj directory, in the schema.strings there, you can have, for example, com_foo_track_title equals traccia titolo if you had the Italian one, and the associated description.

If all we had were strings and numbers for attributes, that might not be bad, but there are some modifiers as well that you can apply to an attribute that are kind of useful. You can have a multi-valued attribute. So what's a multi-valued attribute? This is an example where you have an attribute name, such as kMDItemKeywords, and then you have an array of values that are stored for that attribute. Each entry in the array is independent of the others.

So, for example, if you had kMDItemAuthors, and you just had them all glommed together separated by spaces, you might run into a problem where you have Leon Van Bon and Bon Scott. Well, which part belongs to which name? But by putting them as separate entries in an array, then when you do any matching, there's no confusion as to whether or not you've got two different pieces just because they're separated by spaces.
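To make the ambiguity concrete, here is a small sketch in plain C. It is only an illustration of the point about separate array entries: the helper names `array_has_value` and `joined_has_value` are made up for this example, and exact string comparison stands in for whatever matching Spotlight really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if `query` equals one whole entry of a
 * multi-valued attribute (each author is its own array element). */
bool array_has_value(const char *const *values, size_t count, const char *query)
{
    size_t i;
    assert(values != NULL && query != NULL);
    for (i = 0; i < count; i++) {
        if (strcmp(values[i], query) == 0)
            return true;
    }
    return false;
}

/* Hypothetical helper: true if `query` appears anywhere in a single string
 * in which all the authors were joined with spaces. */
bool joined_has_value(const char *joined, const char *query)
{
    assert(joined != NULL && query != NULL);
    return strstr(joined, query) != NULL;
}
```

With the authors stored as the two entries "Leon Van Bon" and "Bon Scott", a lookup for "Van Bon Bon" fails, as it should. Against the joined string "Leon Van Bon Bon Scott", the same lookup succeeds, because the boundary between the two names has been lost.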

So, if you have something for which there are a set of values, like I said, layer names, track titles, a set of authors, or a set of keywords, these are all examples of things that would be multi-valued. The next modifier is called No Search. Now, why would you publish metadata that you don't want to search, you would ask? Well, that's a good question. But there are things that you would want to search as an application that you don't want to be searched by default by the user.

So, when we do a search in the Spotlight menu or in the Finder, we generally search over all the attributes that are associated with the file except for those that are marked No Search. This comes up for us, you know, with the kMDItemContentType attribute, which has strings like public or apple in it. And if you searched for apple and every single file matched because com.apple is somewhere in the hierarchy, you'd be a little bit disconcerted.

However, we do search for things like kMDItemContentType equals a specific value. So if you mark an attribute as No Search, what will happen is that if you name it explicitly, then you'll match against it. Otherwise, it won't match in the search-over-all-attributes type of query.
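The two kinds of query can be modeled in a few lines of plain C. This is a toy sketch of the behavior just described, not Spotlight's real data structures or query engine: `struct attr`, `matches_any`, and `matches_named` are invented names, and substring and exact comparisons are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of one published attribute (hypothetical, for illustration). */
struct attr {
    const char *name;
    const char *value;
    bool nosearch;   /* true: skipped by "search all attributes" queries */
};

/* Generic query: look for `text` in every attribute NOT marked nosearch. */
bool matches_any(const struct attr *attrs, size_t count, const char *text)
{
    size_t i;
    assert(attrs != NULL && text != NULL);
    for (i = 0; i < count; i++) {
        if (!attrs[i].nosearch && strstr(attrs[i].value, text) != NULL)
            return true;
    }
    return false;
}

/* Explicit query: the attribute is named, so the nosearch flag is ignored. */
bool matches_named(const struct attr *attrs, size_t count,
                   const char *name, const char *value)
{
    size_t i;
    assert(attrs != NULL && name != NULL && value != NULL);
    for (i = 0; i < count; i++) {
        if (strcmp(attrs[i].name, name) == 0 &&
            strcmp(attrs[i].value, value) == 0)
            return true;
    }
    return false;
}
```

In this model, a file whose content type is marked nosearch does not turn up when the user types "apple", but a query that names kMDItemContentType explicitly still finds it.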

The last modifier is the Uniqued modifier. The Uniqued modifier is useful when you have a lot of strings that have the same values. So if you had a set of team names, for example, you wouldn't want to store the San Francisco Giants 10,000 times if you could store it in a more compact representation.

So when you have something that has on the order of tens of values, 20, 30, less than, let's say, 100 distinct values for a particular attribute, then you might want to tag it with the Uniqued equals True modifier so that it will be stored more efficiently. Here we have some examples down below. So you have Track Title is multi-valued, and that's a very good example.

Or you have a private name, and we say No Search. Now that's a good example. This is something where you have a special UI that does searching for things, and one of the clauses might be "com_foo_private_name is equal to some value," but normal searches wouldn't trip over that attribute. Team names, again, is the example I gave for uniqued values. And you can combine these. You can have multi-valued and uniqued, if you wanted, and so on.
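The idea behind uniqued values, storing each distinct string once and referring to it by a small number, can be sketched in plain C. This is only an illustration of the general technique (string interning); the talk does not say how Spotlight stores uniqued attributes, and `struct string_table`, `intern`, and `MAX_UNIQUE` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_UNIQUE 128   /* arbitrary cap for this sketch */

/* Toy string table: each distinct value is stored once and referred to
 * by its index (hypothetical, for illustration). */
struct string_table {
    const char *values[MAX_UNIQUE];
    size_t count;
};

/* Returns the index of `s`, adding it if it is new; -1 if the table is full.
 * The table keeps the pointer, so `s` must outlive the table. */
int intern(struct string_table *t, const char *s)
{
    size_t i;
    assert(t != NULL && s != NULL);
    for (i = 0; i < t->count; i++) {
        if (strcmp(t->values[i], s) == 0)
            return (int)i;          /* already stored: reuse the index */
    }
    if (t->count == MAX_UNIQUE)
        return -1;
    t->values[t->count] = s;
    return (int)t->count++;
}
```

Ten thousand files that all carry the same team name would then hold ten thousand small indices and a single copy of the string. The linear scan is only reasonable because the number of distinct values is small, which matches the "less than 100 values" guidance.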

This slide is a fairly esoteric topic. We use it internally for things like the display name of an application. So we have localized display names for something like Address Book, which can be the Japanese characters for the name "Address Book," or in Italian, "Rubrica Indirizzi," or in English, "Address Book." When you do a search, if you type, for example, "bibl" in French, you're going to find "Bibliothèque." You won't find the "biblereader.app," because that's going to have a different name, a localized name.

So if you wanted to publish a localized value for an attribute so that it would only be matched in the user's preferred language, the way to do that is, for the attribute name, in this case kMDItemDisplayName, you publish a dictionary that has the mapping of this language equals this value. And then, when a user does a search, which by default is in their preferred language, it will only be matched against that language code. Now, let's move on to the actual code of the advanced importer. Demo 1, please.

All right, let's go back here. We'll pull up this dbg-importer Xcode definition.

[Transcript missing]

Here we have a very simple file format. On the left side there is a keyword, and on the right side is some value. I just invented this for the purposes of this demo. We have session: and what the name of the debug session was, some keywords about it, and a radar number. We want to import this file format into the Spotlight system and publish these things, these keys and values, as attributes, which kind of makes sense. We're going from a very simple flat file to something that can be searched very easily in the Spotlight system.

So the first thing I have to do, because this file format doesn't have a UTI type, is give it a UTI type. Again, since I'm going to be the master of this file format, I'm going to use UTExportedTypeDeclarations. I've chosen to call it com.apple.dbg.demo, since we use the reverse DNS naming convention for UTIs.

And then I've given a reference URL, which is sort of useless in this case, and a simple description, "DBG file." Now here we have the UTTypeConformsTo section. As I said, you have the physical conformance and the semantic conformance, or semantic hierarchy. And so it's public.data because it's just a simple flat file, and it's public.content for the semantic hierarchy because it's just some generic content.

Then the tag specification. Here we have the OSType, which is a four character code. I just chose dbgz. There's not really a four character code for this file that anybody's going to be setting, but if there were, that's what it would be. And the file name extension, which is going to be .dbg, and that's the typical way that the UTI would be associated with the file.

Then we have to bind our importer. Again, this is how we go from the UTI system and say that this file is of this type and that this is the importer that handles it. We have the LSItemContentTypes and I say that I handle com.apple.dbg.demo. I've set the name and some other informational fields.

Down here are the UUIDs that were automatically generated when I created this project. That's what I was talking about earlier. These are automatically created on your behalf, so if you were to copy a project, you need to be careful because you don't want to have duplicate UUIDs. So that's the Info.plist file.

Now let's take a look at the XML file, the schema.xml file that I defined. And here we have the attributes section, as I mentioned. And we have the name, com_apple_dbg_session_name, and the type is just simply a CFString. Then the attribute name com_apple_dbg_radar_number, and I've chosen just to make it a string because they can be quite long and I don't have to worry about whether it fits in an int or so on.

In the types section, we have the type name, which is com.apple.dbg.demo. And now I'm going to say which attributes I want associated with this type and which attributes I want displayed with this type. And here in the allattrs section, I say com_apple_dbg_session_name, com_apple_dbg_radar_number, and kMDItemKeywords, because these are all the attributes that I'm going to publish. And I would like that same set of attributes displayed for that type. So that takes care of the schema.xml file.

The schema.strings file is where I do the mapping from those kind of ugly names like com_apple_dbg_session_name to the pretty names that showed up in the UI when I did the Get Info on it. So we have Debug Session Name and Radar Number. I've also given a short description, and that's all we have to do in the strings file. Now let's take a look at the code. Get this so it's a little bit more visible.

Here we have our stub function GetMetadataForFile. And again, the arguments that are passed are the path (a CFStringRef to the path), the attributes dictionary that we're going to fill out, and a content type UTI. So if your importer handles multiple file formats, this would be one way to help you determine which type of file you were passed.

Now, in the last example, I used Objective-C and kind of went to the other extreme. In this example, I'm going to go the other way and just use straight C code. So I'm going to convert that path to the file to a regular C string using the function CFStringGetFileSystemRepresentation. And I'm going to do the same thing for the content type UTI, even though I'm not going to use it here.

And then I call this function GetDebugInfo on the path and I pass it the attributes dictionary. This function, GetDebugInfo, is pretty basic. It's just going to open the file and parse through each of the lines and read them in and split apart the key name on the left side and the string value on the right side.

So, parse the line. I'm not going to go into that. That's just ripping the text apart. And I create a CFString with the C string for the value. And I take a look at the key, and I say, okay, if I got the session keyword, then I'm going to do a CFDictionaryAddValue of com_apple_dbg_session_name, and I'm going to store the value that was in the file for that string.
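For readers who want to see the line-splitting step that is skipped over here, a minimal sketch in plain C follows. It is not the code from the demo: `parse_line` and `trim` are hypothetical helpers, and the sketch assumes each line has the form `keyword: value`, split at the first colon.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Strips leading and trailing whitespace in place and returns the new start. */
static char *trim(char *s)
{
    char *end;
    while (*s != '\0' && isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}

/* Hypothetical helper: splits "key: value" in place at the first colon.
 * On success *key and *value point into `line` and the function returns true.
 * Returns false if there is no colon or the key is empty. */
bool parse_line(char *line, char **key, char **value)
{
    char *colon;
    assert(line != NULL && key != NULL && value != NULL);
    colon = strchr(line, ':');
    if (colon == NULL)
        return false;
    *colon = '\0';              /* terminate the key at the colon */
    *key = trim(line);
    *value = trim(colon + 1);
    return **key != '\0';
}
```

An importer would call something like this once per line read from the file, then compare the key against "session", "radar", or "keywords" to decide which attribute the value belongs to.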

If I got the keyword radar, then again, I will do CFDictionaryAddValue on this attribute name, and I'll give it that value. Now, the last one that we're going to look at is the keywords attribute. And this one's a little bit more tricky, but it's an important thing. You have to get the types of the data that you publish correct. Even if you only had one keyword, you can't just publish a string, because the keywords attribute is a multi-valued attribute.

So, what we have to do is create an array and put our value into it, even though it's going to be the only entry in that array. The way we do that, create a mutable array called keywords, and we append the value to it, which is the value that we got out of the file.

We add the keywords array to the attributes dictionary under the kMDItemKeywords attribute name. So you see there's a couple of pieces. You have the attributes dictionary, which contains a key, kMDItemKeywords, and for that key is stored an array. And in that array, we in this case only have one value. If you had multiple values, you would append the values to the array, and then all of those values would be associated with that key.

Once we've done that and we've finished iterating over the file, we return zero for success. Then we return, in this case, true from GetMetadataForFile, and that is what we have to do to publish it. We build the importer again. It's already installed, actually, but let's install it anyway and follow the real protocol: go to Library/Spotlight, remove the old DBG importer, and copy the new one in.

Now, looking at the bad bug file, what we can see is that we have the attributes that we published: com_apple_dbg_radar_number, 8675309, the session name, and so on. So our importer published the correct metadata. We see our keywords are published correctly and everything's happy. So that's basically what you have to do for a more sophisticated importer to publish additional attributes. So if we go back to the slides.

So again, what did we do here with the advanced importer? We defined some new attributes. We had to go into the schema.xml file and define the attributes, which in this case were the DBG session name and radar number. We declared the human readable forms of those strings in the schema.strings file, and we also went through and showed how we would export an array-valued item.

So, what have we gone through today? Spotlight uses published metadata to make documents easier to find. They also display more richly. When you have nice metadata, the system can display that even if it didn't know anything about that type when Tiger shipped. So, nobody knows anything about the .dbg file format, but we were able to publish data and display those files a little bit more nicely. Any file can be part of Spotlight if it has an importer, and that's where you guys come in. Your file format can be just as seamlessly integrated as what I've shown here today with the .dbg file format.

Importers publish metadata from files, and they do this so that users can find that information. Importers are pretty easy to write. That Amiga IFF image file format, that literally took me, I'd say, about 90 minutes, maybe two hours at the most, to cobble together, and that's getting the code, cleaning it up a little bit, ripping out some stuff, and putting it together. Clearly, for a more sophisticated file format, it's going to take a little bit more time, but we're not talking about an enormous investment. It's on the order of a few hours or a couple days, not three weeks or two months.

Be cautious about what you publish. Keep in mind, you want to help users find things, not spam them with your files for everything that they can possibly type. Performance is critical. This is a very important thing. One of the biggest problems we had with shipping Tiger was that importers could cause a significant impact on the rest of the system, so you need to be careful of that.

What you need to do. What's the takeaway from this presentation? If you have a custom file format, write an importer. That's the biggest thing. We're going to be here today in a lab later on. We'll have a lab tomorrow morning. Come talk to us. We'll help you do it. I think in a lot of cases you'll find that it's a pretty simple thing to do, and then your files are part of the Spotlight system.

Put useful metadata in your documents for file formats that you're writing an importer for, and for file formats that we already support. Make sure to put metadata in there that users will associate with the document. And then, like I said, come visit us in the lab so that we can help you.

A couple of other final things. If you're looking for more information, of course, the developer website has a whole ton of stuff. And there are some related sessions that I would strongly suggest you check out. There's the session on using the Spotlight query APIs, that's the top half of things, later on today at 2 p.m. There's the Spotlight lab that starts at 3:30. The UTI talk, which is going to be at, yes, it is 5 o'clock today. And the lab that we have tomorrow morning. And with that, I guess we'll open it up to Q&A.