OS Foundation • 28:36
Learn how the various UNIX command-line utilities have been enhanced to deal with extended metadata (including but not limited to resource forks). Dealing with extended metadata both locally and remotely is discussed.
Speaker: Nicolai Krakowiak
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it may contain transcription errors.
We're here today to talk about metadata on the command line, something that hasn't really been a focus at Apple. When the transition from Mac OS 9 to Mac OS X occurred, some of the nice metadata features of 9 made it to the GUI layer on Mac OS X but were neglected on the command line. We're going to look at what we're doing about some of those issues for Tiger. My name is Nicolai Krakowiak. I'm a member of the BSD Technologies group at Apple, a wonderful group led by Jordan Hubbard.
We're going to start by talking a little bit about metadata in general, beginning with its history. Back in the day, someone created a file and needed a way to keep track of it, so they invented the file name. It seemed like a common thing to do. Take, for example, my presentation.key. We have .key at the end, so it's not actually a file; maybe it's a directory. But the .key tells us a little bit more about the file: in this case, it probably belongs to the Keynote application. Other interesting things to record about files are dates, like the creation time and the last time I modified it. On a multiuser system, you want to be able to give the file permissions to control access rights between different users. And then there are pieces of implicit metadata, things like the file size, that are intrinsic qualities of the file and are nice to know.
For the most part, this is it. This is the list of metadata that you've had with file systems for close to 30 years now. There are some notable exceptions. Apple, of course, innovating like it does, had this thing called a resource fork, where it could attach arbitrary key-value pairs to a particular file. And the nice thing is that it was persistent with that file. So you could do things like icon bitmaps, internationalized strings, comments, or labels.
The good news is the industry is finally catching up. A lot of file systems these days, everything from NTFS to NFSv4 to FreeBSD's UFS, are getting support for what's sometimes referred to as extended metadata, which usually falls into two classes. The first is access control lists, so we can do finer-grained permissions. If you attended the talk yesterday by Rusty Tucker and Mike Smith, they discussed a little bit about our plans for Tiger to support these access control lists. As an implementation detail, they also mentioned that in some file systems the ACLs will live in something referred to as extended attributes. These extended attributes are the second class, and they're what we're really going to be talking about when we talk about metadata today. They allow you to do lots of interesting things, which we'll get into more later. One example is keywords. I can say of my presentation.key: this is a WWDC-related thing, this is a presentation, this is about metadata. And that's useful for a couple of different reasons.
After today, you should walk out of this session knowing Apple's strategy for extended attributes and the command line; high-level details for various interoperability situations, such as a heterogeneous environment where you have to deal with Windows machines and Linux machines; an overview of the API; and a few examples. As far as developers are concerned, there are a couple of things that we're really going to try to emphasize. One is, if you have a copy engine, how to modify it to preserve this metadata. For code going through the traditional file access calls, we're going to try to ease the migration as much as possible, but it is much easier now to lose the sort of data that users will be associating with files.
Also, if you generate content yourself, we'll cover some of the restrictions that we're placing on EAs, at least for now, and how you should use them.
So what are EAs good for? I mentioned things like comments and labels. Basically, they're a generic method for storing and retrieving attributes on a particular file. You want these attributes to live in the file and follow the file around. So if a user tags a particular file on their desktop and then sorts it into their Documents folder, that information travels along with the file and continues to persist with it.
You can also associate collections of files by things like keywords or particular tags. I don't want to get too close to searching here. This underlying interface is not quite what we're using for Spotlight search. So if you're interested in keywords for searching, you should be using the Spotlight interface and not the interface we're going to discuss here. You can also store richer descriptions, like comments and icons, or things that are expensive to calculate: checksums to guarantee, or at least sort of guarantee, file integrity, or even small previews.
And there's an emphasis today especially on that word "small." Some details about the EA interface as it's planned: the attribute names are arbitrary UTF-8 strings up to a length of 255 characters. We're going to enforce a namespace policy so that third-party developers won't have conflicting attributes. We recommend using reverse DNS style naming, only instead of using dots, we'd like you to use underscores. I'm told there's a very good reason for doing this, and it manifests itself in Cocoa, but I don't want to go into that because I don't really understand it. Everything else is reserved. You'll see an example today where Apple has something that starts with "System.", for instance. We want you to stay out of that namespace.
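The naming guidance above can be sketched as a small check. This is only an illustration: `ea_name_ok` and `EA_NAME_MAX` are hypothetical names, not part of any Apple API, and the sketch assumes that the 255 limit counts bytes and that "reverse DNS with underscores" means no dots and at least one underscore.

```c
#include <stddef.h>
#include <string.h>

#define EA_NAME_MAX 255  /* limit quoted in the talk; assumed here to be bytes */

/* Hypothetical helper: returns 1 if `name` follows the recommended
 * third-party convention (reverse DNS written with underscores), else 0. */
int ea_name_ok(const char *name)
{
    size_t len;

    if (name == NULL)
        return 0;
    len = strlen(name);
    if (len == 0 || len > EA_NAME_MAX)
        return 0;
    if (strchr(name, '.') != NULL)   /* dotted names are treated as reserved */
        return 0;
    if (strchr(name, '_') == NULL)   /* need at least "prefix_key" */
        return 0;
    return 1;
}
```

With this check, a key such as `com_foo_mothership_url` passes, while anything containing a dot, including names under the reserved `System.` prefix, is rejected.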
Now, the attributes themselves are small, again emphasizing small, arbitrary data blobs. We'd like to keep the limit right now under four kilobytes. There are many reasons for that, both implementation and performance concerns. This may be increased in the future, so we don't want you to statically code the limit into your applications. But you should definitely stay under it, at least for Tiger. Attributes also carry permissions separate from the file's. Again, from the ACL talk, Mike Smith mentioned that there's a particular right that you can give not just to the file but also to the metadata associated with the file. So you can control that with very fine-grained access.
The extended attribute is treated entirely as a blob. You can't control access to individual elements within it. Where does it live? Apple has come up with a very ambitious goal of providing a file-system-agnostic implementation. If you're dealing with a really modern file system that supports these functions natively, it can be done there. But on much older file systems that don't have this concept of extended metadata, it will also work transparently, and you'll still be able to store and retrieve this extended metadata. Again, these aren't streams, though. You want to keep extended attributes down to key-value pairs. One of the key distinctions between the two is that extended attributes are items that you set atomically; you're not going to be able to seek around in one as you would in a large data blob or a separate file stream.
Currently, for Tiger, we're planning HFS Plus and AFP support. HFS Plus support is on your Tiger DVD, and it's a backwards-compatible change to the volume format, so it doesn't require reformatting. You can upgrade a Panther machine to Tiger and start setting these extended attributes on files. Now, there's a caveat associated with that. If you boot back into Panther and look at a file, the extended attribute information is going to encompass the resource fork, so you'll still be able to access the resource fork in a backwards-compatible manner, but any extended attributes that you set in Tiger, you won't be able to view in Panther. Moreover, if you try to move that file on Panther, you can potentially lose that metadata. So as far as metadata is concerned, we recommend a full one-way transition from Panther to Tiger.
On other file systems, it's going to be implemented in terms of an AppleDouble file, which is very similar to what we do for resource forks today. For those of you who aren't familiar with AppleDouble: if I have a file on HFS Plus, I can have two separate data containers within this file.
There's the data fork, which is what you consider the normal contents of the file; if you open it up in TextEdit, that's what you see. And there's a resource fork. Now, some file systems can't support a data fork and a resource fork within a single container object. So there's a file format called AppleDouble: if you have a file called myfile, you'll now also get a ._myfile, and that will contain all the resource information. This will allow us to keep this metadata on NFS and UFS and other file systems that don't support extended metadata directly.
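The naming relationship between a file and its AppleDouble companion can be sketched in a few lines. This is an illustration only: `appledouble_path` is a hypothetical helper, and it assumes `/` as the path separator and a path that does not end in a slash.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the "._" companion path for `path` into `out`.
 * Returns 0 on success, -1 if `out` is too small. */
int appledouble_path(const char *path, char *out, size_t outlen)
{
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;   /* last path component */
    int dirlen = (int)(base - path);               /* includes trailing '/' */
    int n = snprintf(out, outlen, "%.*s._%s", dirlen, path, base);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Given `myfile` this produces `._myfile`, and given `/tmp/dir/myfile` it produces `/tmp/dir/._myfile`, so the companion always sits next to the data file.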
When I mention copy engines, that's a very broad term. Anything that deals with files, that opens them up and saves them, potentially has a copy engine. Say you're at the POSIX layer, you have a file called foo that has an extended attribute blob, and you open it up in an editor. Your editor might make a copy, foo.temp, which is where all the modifications get done. Then when the editor goes to save that file, it doesn't have a copy of the extended attributes, because it's not aware of them, and you'll end up losing those extended attributes. That represents metadata loss. As time progresses, we expect this extended attribute information to become very valuable, so your users are not going to want to lose it. For this particular case, we're looking at a possible trick to help you: if you create a copy of a file that doesn't have any extended attributes and you rename it on top of another file, there might be some things that we can do to save you.
There are other things that we're doing, or at least investigating, to ease the transition from the metadata-less world to this world of extended metadata. If you have an application that currently supports metadata but through a different API set (both FreeBSD and Linux provide a POSIX-layer extended attribute API), we're investigating providing compatibility with that so your application would just work. We're not quite sure exactly what the usage coverage is there, and since there is so much work, again, I'm not promising anything, but we're trying to make your lives as easy as possible.
Another thing is a copy engine API. When we look through a lot of this code, there are probably a dozen copy engines in Mac OS X alone. And it's sort of silly not to have a particular call where I can say, copy this file. In particular, with both extended attributes and ACLs going into the system, it might be nice for developers to be able to just call this copy API. You'd be shielded from any changes that we add at the file system layer, and the copy will always copy everything that's there. We're also looking at supporting things like external serialization: being able to take the extended metadata from file foo, extract it, and then put it on file bar. There are lots of cases where that sort of functionality would be very useful, and encapsulating it in a library is something that we're looking at providing.
Also, we've gotten a lot of complaints about the ._ files on things like NFS mounts, and we're looking at having readdir filter out those ._ files because, again, that resource half of the AppleDouble pair isn't really a file. I mean, it sure looks like a file. It kind of smells like a file.
But it really shouldn't be treated like a file. If you iterate through a directory of files setting extended attributes, you wouldn't want to try to set extended attributes on the container that's holding your extended attributes. That's just not valid behavior.
So although they're there, we're looking at possibly providing a mechanism to mask them out. But again, in case that causes a problem, there would be an option so that they still show up. Conversely, in the other direction, if you've ever shared an HFS Plus volume over NFS, none of those dot-underbar files are actually there. That causes problems for some applications, depending on how you share things. Now, by far the best thing you can do is use AFP, which will have a complete layer of support for extended attributes. But in some situations, we know that's not possible. So we're looking at trying to ease that transition.
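The masking idea can be approximated in application code today. The sketch below is a name-based heuristic, not the mechanism Apple is considering: `is_appledouble_companion` and `visible_entries` are hypothetical helpers, and an ordinary file that happens to be named with a `._` prefix would be masked too.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if a directory entry name looks like the "._" half of an
 * AppleDouble pair, 0 otherwise. A name that is exactly "._" is not
 * treated as a companion because no base name follows the prefix. */
int is_appledouble_companion(const char *entry_name)
{
    return strncmp(entry_name, "._", 2) == 0 && entry_name[2] != '\0';
}

/* Counts the entries a listing would show. When show_companions is 0 the
 * "._" companions are masked out; when nonzero, everything is counted. */
size_t visible_entries(const char *const names[], size_t n, int show_companions)
{
    size_t i, shown = 0;

    for (i = 0; i < n; i++)
        if (show_companions || !is_appledouble_companion(names[i]))
            shown++;
    return shown;
}
```

The `show_companions` flag mirrors the opt-out described in the talk: tools that need to see the companion files can still ask for them.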
So, again, we're basically at a point now where we have AppleDouble revisited. Before, the command-line utilities weren't really aware of these files and didn't really respect them. But for Tiger, we're planning full command-line support in all of the utilities that access files. So if you have extended attributes associated with a particular file foo and you use cp or mv or ditto, or you use rsync, where we're looking at ways of either extending the protocol where possible or just sending the AppleDouble file over the wire, the attributes should come along. With archivers like tar and zip, you'll be able to make archives and not have to worry about whether you're preserving the metadata. And then there are the command-line editors, vi, emacs, pico. The goal is that you don't have to use things like rsyncx or tarhfs, these projects on the web that have evolved to fill the space where Apple neglected its command-line users, so that this data gets treated as first-class data.
So the short of it is, Tiger to Tiger should just work. Any place where that doesn't work, it's definitely going to be a bug. Now, Tiger to non-Tiger we absolutely want to support, but there are a lot of very tricky issues and situations. The compatibility matrix is actually very, very large if you lay it all out. We've looked at a lot of these issues and are trying to come up with the best way to deal with them, but it is going to be somewhat difficult to support this extended metadata while everyone transitions. Fortunately, judging from past history, everybody should be using this stuff within 10 or 15 years, so.
All right, here's an overview of the API. All of the functions that you need are declared in sys/xattr.h in /usr/include. There's one particularly important data structure, xattr_io, which contains an XA position, which is basically reserved for future use; for now, we want you to set that to zero. XA size should be the size of the attribute. And then there's an XA data pointer, which points to a block of data that you've allocated to store the attribute. Then there are basically three sets of functions, setxattr, getxattr, and listxattr, along with corresponding variants prefixed with the letter f that take file descriptors as arguments instead of file names.
So how do we write an extended attribute? We include the header, sys/xattr.h, and operate on a file called testfile. The key will be com_foo_mothership_url, and we'll point it to apple.com. There's an options field for some of these calls. For instance, setxattr can differentiate between replacing an attribute that's already there and creating a new one, so you can control whether or not you clobber a value. And then there's the primary structure, the xattr_io.
We set the XA data pointer to the value of the string. The size is going to be the length of the string plus the null terminator. Then we call setxattr, the arguments being the file name, the key com_foo_mothership_url, and then a pointer to the xattr_io struct. In this case, the options are zero. If the return value is negative one, that indicates an error, and errno will be set so you can check it. Otherwise the value should be set correctly.
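The way the structure is filled in for a string value can be sketched as follows. The struct here is a hypothetical stand-in reconstructed from the spoken description: the name `demo_xattr_io`, its field names, and its field types are all assumptions, and the talk describes a pre-release interface, so the real header may look different.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the structure described in the talk. */
struct demo_xattr_io {
    unsigned long xa_position;  /* reserved for future use: keep at 0 */
    size_t        xa_size;      /* number of bytes in the value */
    void         *xa_data;      /* caller-owned buffer holding the value */
};

/* Describe a C string as an attribute value, terminator included. */
struct demo_xattr_io demo_io_for_string(char *value)
{
    struct demo_xattr_io io;

    io.xa_position = 0;
    io.xa_size = strlen(value) + 1;  /* string length plus the null terminator */
    io.xa_data = value;
    return io;
}
```

Storing the terminator along with the string means a reader can treat the returned bytes as a C string without appending anything.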
To read that data back, we again have testfile and com_foo_mothership_url, and we make a call to getxattr. But in this particular case, you'll notice the third argument, which would normally be the pointer to the xattr_io struct, is 0. This tells the API to return the size of the value for that key, so you can then allocate the appropriate amount of storage. We store that size in XA size and then malloc sufficient space for the data. Then we call getxattr again, now with a pointer to the xattr_io struct. Again, negative one indicates an error. Otherwise the value should be returned correctly.
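The size-query-then-fetch pattern can be sketched independently of the real system call. In the sketch below, `ea_getter`, `ea_read_alloc`, and `fake_get` are hypothetical names, and the getter's calling convention (a null buffer means "return the size") is an assumption modeled on the description above; `fake_get` is an in-memory stand-in so the example runs anywhere.

```c
#include <stdlib.h>
#include <string.h>

/* Getter shape assumed for this sketch: with buf == NULL it returns the
 * value's size; otherwise it copies the value into buf (capacity cap) and
 * returns the number of bytes copied. A negative return means error. */
typedef long (*ea_getter)(const char *name, void *buf, size_t cap);

/* Two-call read: ask for the size, allocate, then fetch.
 * Returns a malloc'd buffer (caller frees) or NULL on error. */
void *ea_read_alloc(ea_getter get, const char *name, size_t *size_out)
{
    long size = get(name, NULL, 0);
    void *buf;

    if (size < 0)
        return NULL;
    buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf == NULL)
        return NULL;
    if (get(name, buf, (size_t)size) != size) {  /* error or size changed */
        free(buf);
        return NULL;
    }
    *size_out = (size_t)size;
    return buf;
}

/* In-memory fake standing in for the real call, for demonstration only. */
long fake_get(const char *name, void *buf, size_t cap)
{
    static const char value[] = "apple.com";

    if (strcmp(name, "com_foo_mothership_url") != 0)
        return -1;
    if (buf == NULL)
        return (long)sizeof value;
    if (cap < sizeof value)
        return -1;
    memcpy(buf, value, sizeof value);
    return (long)sizeof value;
}
```

Comparing the second call's result against the first guards against the value changing between the two calls, which the two-step pattern cannot rule out on its own.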
Now that you can set and get these attributes, how do you find out which attributes are set on a file? That's done with the listxattr API. It's very similar to the getxattr API: if the third argument is null, it returns the amount of space required to hold all of the keys.
The format returned by listxattr is a series of null-terminated strings concatenated together. So we malloc a block of sufficient size, call listxattr again, and then to iterate over that block, you keep a pointer to the first key, and while you're still within the range of that block, you advance by the length of the current key plus its null terminator. So now I have a couple of little demos. In the WWDC build, you do have support on HFS, but unfortunately the copying utilities like cp and mv haven't been updated yet. Some of that work is still to be done.
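Walking the packed name list described above needs no system calls, so it can be shown on its own. `ea_each_name` is a hypothetical helper; it uses `memchr` rather than `strlen` so that a block missing its final terminator cannot cause a read past the end.

```c
#include <stddef.h>
#include <string.h>

/* Calls visit(name, ctx) for each null-terminated name packed in buf
 * (visit may be NULL to just count). Returns the number of names found. */
size_t ea_each_name(const char *buf, size_t len,
                    void (*visit)(const char *name, void *ctx), void *ctx)
{
    const char *p = buf;
    const char *end = buf + len;
    size_t count = 0;

    while (p < end) {
        const char *nul = memchr(p, '\0', (size_t)(end - p));

        if (nul == NULL)
            break;              /* malformed tail: no terminator in range */
        if (visit != NULL)
            visit(p, ctx);
        count++;
        p = nul + 1;            /* skip the key and its null terminator */
    }
    return count;
}
```

A block holding `com_apple_demo`, `com_apple_comment`, and `com_apple_credits`, each followed by its terminator, yields three visits in that order.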
We have three parts to this demo. First, we have a couple of command-line utilities that we'll be using. These aren't going to ship in Tiger, but there will be some sort of interface very similar to them that allows you to both store and retrieve extended attributes directly from the command line. Let's start with an empty file called foo and set an attribute on it, com_apple_demo, with the value "metadata is cool."
We'll set another one, a comment, "the command line rocks." And we'll give credit to a person named Kevin. You can then list these attributes; the listing indicates the length of the attribute, the key name, and also the value. I have a modified cp here, so if I do cp foo bar and do a list on them, you can see the metadata is being preserved.
You can also do things like symlinks. Listing on the symlink lists the backing file. With options, you can control whether you read the extended attribute information off the backing file or off the symlink itself, since you can also put extended attributes on symlinks themselves. You can also put extended attributes on directories.
In this case, you'll see that a System.finder info attribute is set by default on symlinks. For hard links, the extended attribute information shows up just as you'd expect. If you delete an extended attribute, in this case removing com_apple_credits, you'll see that it's been removed from both the hard link and the actual file. Setting an attribute on a directory: no problem.
I also mentioned that you should be modifying your copy engines, and this is a pretty straightforward mechanism. I'm going to walk through a function. It's 80 lines and does all the work you need to take one file name, extract all of its extended attributes, and save them on another file.
We'll start with the arguments: the source file name, the destination file name, and the options. We'll have a set of local variables: a name pointer for iterating over the key space, as in that for loop that I showed on the listxattr slide; a name buffer to store the names returned from listxattr; and an attribute buffer to hold the attributes as we get them and set them. Again, the XA position field should be set to zero.
Much like we showed on the slide, listxattr with a null gives you the size, so you can malloc a block and call listxattr again with the appropriate information set. You'll then have all of the names stored in this name buffer. Now we'll iterate over the list of names.
Again, that's the for loop construct where we keep advancing by the length of the string. Then we need to determine how much space is necessary for the attribute, allocate that space, get the attribute, and then set the attribute. Again, that's relatively straightforward. Call getxattr with the third argument being null and save the result as the XA size. Call malloc to allocate the appropriate amount of space and save that in the data pointer field. Get the attribute by calling getxattr, and then, moving this to the top of the screen, call setxattr to set that attribute.
The reason I did this in the terminal and not on the slides is so that I can actually give a demo of it running. So we'll touch a file foo, touch a file bar, and set a series of attributes on foo. I'm now listing the attributes on foo and bar. Notice there are no attributes on bar. Running this copy xattr utility to copy them from foo to bar and doing the list again shows you the attributes have been preserved. So again, the amount of work necessary to preserve this metadata isn't that great, but adding it in the appropriate locations, not just for the files that you create but also for any directories that you create, is pretty critical.
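The shape of the copy loop walked through above can be sketched against an in-memory model, so that it runs without any particular file system or header. Everything here is hypothetical: `mem_file`, `mem_list`, `mem_get`, `mem_set`, and `copy_attrs` are stand-ins invented for this sketch, not the function shown in the session, and the "null buffer returns the size" convention is assumed from the API description.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ATTRS 8
#define MAX_NAME  64
#define MAX_VALUE 64

/* In-memory stand-in for a file's extended attributes (illustration only). */
struct mem_attr { char name[MAX_NAME]; unsigned char data[MAX_VALUE]; size_t size; };
struct mem_file { struct mem_attr attrs[MAX_ATTRS]; size_t count; };

/* With buf == NULL, returns the bytes needed for all names; else fills buf. */
long mem_list(const struct mem_file *f, char *buf, size_t cap)
{
    size_t need = 0, i;
    for (i = 0; i < f->count; i++)
        need += strlen(f->attrs[i].name) + 1;
    if (buf == NULL)
        return (long)need;
    if (cap < need)
        return -1;
    for (i = 0; i < f->count; i++) {
        size_t n = strlen(f->attrs[i].name) + 1;
        memcpy(buf, f->attrs[i].name, n);
        buf += n;
    }
    return (long)need;
}

/* With buf == NULL, returns the value size; else copies the value out. */
long mem_get(const struct mem_file *f, const char *name, void *buf, size_t cap)
{
    size_t i;
    for (i = 0; i < f->count; i++) {
        const struct mem_attr *a = &f->attrs[i];
        if (strcmp(a->name, name) != 0)
            continue;
        if (buf == NULL)
            return (long)a->size;
        if (cap < a->size)
            return -1;
        memcpy(buf, a->data, a->size);
        return (long)a->size;
    }
    return -1;
}

/* Creates or replaces an attribute; returns 0 on success, -1 on error. */
int mem_set(struct mem_file *f, const char *name, const void *data, size_t size)
{
    struct mem_attr *a = NULL;
    size_t i;
    if (size > MAX_VALUE || strlen(name) >= MAX_NAME)
        return -1;
    for (i = 0; i < f->count; i++)
        if (strcmp(f->attrs[i].name, name) == 0)
            a = &f->attrs[i];
    if (a == NULL) {
        if (f->count == MAX_ATTRS)
            return -1;
        a = &f->attrs[f->count++];
        strcpy(a->name, name);
    }
    memcpy(a->data, data, size);
    a->size = size;
    return 0;
}

/* The copy loop: list the names, then get and set each attribute in turn. */
int copy_attrs(const struct mem_file *src, struct mem_file *dst)
{
    long names_len = mem_list(src, NULL, 0);
    char *names, *name;
    int rc = 0;
    if (names_len <= 0)
        return names_len < 0 ? -1 : 0;
    names = malloc((size_t)names_len);
    if (names == NULL)
        return -1;
    if (mem_list(src, names, (size_t)names_len) != names_len)
        rc = -1;
    for (name = names; rc == 0 && name < names + names_len; name += strlen(name) + 1) {
        long size = mem_get(src, name, NULL, 0);
        void *value = size < 0 ? NULL : malloc(size > 0 ? (size_t)size : 1);
        if (value == NULL ||
            mem_get(src, name, value, (size_t)size) != size ||
            mem_set(dst, name, value, (size_t)size) != 0)
            rc = -1;
        free(value);
    }
    free(names);
    return rc;
}
```

In a real copy engine the three `mem_` calls would be replaced by the platform's list, get, and set calls; the loop structure, including the size query before each allocation, stays the same.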
For the second part, let's touch a file foo, touch a file bar, set a bunch of... oops, wrong one. Sorry. I did that already, didn't I? All right. Oh, first, actually, I want to look at the source. So what's an example of using extended attributes? I thought of something that you probably shouldn't be doing by default but that illustrates an interesting point. I'm going to do per-file encryption, using an extended attribute to store the key information. One of the reasons for doing it this way is that you're not going to be searching for a particular key through a list of files.
That's not a common operation. But you do want that information to persist across copies and moves and so on. So we'll open up a file. Let me scroll down to that point. We open up a file, and then we try to get a key. We have this little function called get attribute that takes a file pointer, the name of the attribute, and a small buffer to store the key, and then we call encrypt and we're done. In get attribute, again, the XA position should be zero. We set the attribute buffer as the data pointer value along with the size, and then call the f variant, fgetxattr, which as I mentioned takes a file descriptor as its first argument as opposed to a file name. And then there's a little poorly written encrypt loop, just doing XOR with the key, iterating over the data and writing it back out to the file. So how does this work?
We create a file foo, put a secret message in there, and then look at it in hexdump, because it'll get ugly when we encrypt it. As you can see, it still says, "this is a secret sentence." Listing the attributes on the file, there are currently none. Attempting to encrypt the file, we get an error from fgetxattr that the key has not been set. We'll call our little command-line utility, setting the key to the letter x. We do another list to make sure that it's been set. Encrypting it, it's now super secret, as you can see.
I'm calling it again, and since our super secret cipher is symmetric, we get back "this is a secret sentence." So again, it's not really difficult to deal with this extended attribute information, but it is very critical, because with things like this key, if you were to lose it by saving a file, a user might no longer have access to their data, and that's sort of horrible. Actually, I think we're done with the demo, so we can cut back to the slides.
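The symmetric XOR step from the demo can be sketched in isolation. `xor_with_key` is a hypothetical name and not the demo's source; the point is only that applying the same key twice restores the original bytes. As the speaker says, this is a toy and not something to use for real protection.

```c
#include <stddef.h>

/* XOR every byte of buf with the key, repeating the key as needed.
 * Applying it twice with the same key restores the original bytes.
 * This is a toy, not real encryption. */
void xor_with_key(unsigned char *buf, size_t len,
                  const unsigned char *key, size_t key_len)
{
    size_t i;

    if (key_len == 0)
        return;                 /* no key: leave the data untouched */
    for (i = 0; i < len; i++)
        buf[i] ^= key[i % key_len];
}
```

Because the key is the only thing that makes the data recoverable, losing the attribute that stores it, for example through a copy that drops extended attributes, loses the data as well.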
In summary: metadata is great. I know that you guys are going to come up with a lot of uses for manipulating it and doing interesting things. But by far the focus for Tiger is not just innovating around metadata but preserving it as we create it. We don't want to be destroying it at the same time, particularly accidentally. Prevent namespace collisions by using reverse DNS names, only instead of dots, use underscores. And stick to small amounts for now. The values should be under 4K. We'd really appreciate that. Also, don't go hog wild and implement a file system layer on top of extended attributes. That's not what they're for.
For more information, please contact Jason Yao. His email address is [email protected]. He's a lucky guy. For more information on encryption, you can look at Keychain Services. For extended attributes information in general, there's the Mac OS X file system information on the DVD, and we'll hopefully be putting up some material on the reference site.