Extended Meta-data for UNIX Users - WWDC 2004

OS Foundation • 28:36

Learn how the various UNIX command-line utilities have been enhanced in order to deal with extended metadata (including but not limited to resource forks). Dealing with extended meta-data both locally and remotely is discussed.

Speaker: Nicolai Krakowiak

Unlisted on Apple Developer site

Check out Bezel, our iPhone mirroring app →

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

So what are EAs good for? I mentioned things like comments and labels and things like that. But it's basically used as a generic method for being able to do storing and retrieving of attributes on a particular file. And you want these attributes to sort of live in the file and follow the file around. So if a user goes ahead and tags a particular file in their desktop and then sorts it into their documents folder, that information then transmits itself along with that file and continues to persist with the file.

[Transcript missing]

attributes themselves are small, again emphasizing small, arbitrary data blobs. We'd like to keep the limit right now under 4 kilobytes. There's many reasons for both implementation and performance concerns. This may be increased in the future so we don't want you to statically code this into your applications but it would definitely stay under the limit for at least Tiger. They also contain separate permissions for the files. Again from the AclTalk Mike Smith mentioned that there's a particular right that you can give not just to the file but also to the meta-data associated with the file so you can control that with very fine-grained access.

The extended attribute is treated as a blob entirely. You can't do extended attribute control over individual elements. Where does it live? Apple's come up with a very ambitious goal of providing a file system agnostic implementation. If you're dealing with a file system that's really modern and supports these sort of functions, it can be done there. But also file systems that are much older that don't have this concept of extended meta-data you can modify or it will be able to be used transparently and be able to store and retrieve this extended meta-data.

Again, these aren't streams though. The extended attributes, you want to keep them down as key value pairs. And one of the key distinctions between these two is the extended attributes are basically items that you want to set atomically, but you're not going to be able to seek around this like as a large data blob or separate file stream. Currently for Tiger we're planning HFS+ and AFP support.

HFS+ is on your Tiger DVD. And it's backwards compatible. Changed the volume format. So it doesn't require a new format. You can go ahead and have a Panther machine do an upgrade to Tiger and start to be able to set these extended attributes on a file. Now there's a caveat associated with that. If you now boot back into Panther and look at a file, the extended attribute information is going to encompass the resource fork. So you'll still be able to access the resource fork in a backwards compatible manner.

But any extended attributes that you set in Tiger you won't be able to view in Panther. Moreover if you go ahead and try to move that file on Panther, you can also potentially lose that metadata. So it's recommended that you do a full one-way transition when going from Tiger to Panther as far as metadata is concerned.

Other file systems is going to be implemented in terms of an Apple double file, which is very similar to what we do for resource forks today. So for those of you who aren't familiar with an Apple double file, if I have a file with HFS+ I can have two separate sort of data containers within this file.

The data fork, which is what you consider the normal contents of the file if you open it up in TextEdit, that's what you see. Also a resource fork. Now some file systems can't support the data and resource fork within a single container object. So there's a file format called Apple double, which if you have a file called my file, you'll now get a dot underscore my file and that will contain all the resource information. And this will allow us on NFS and UFS and other file systems that don't support extended metadata directly to be able to still contain this metadata.

When I mention copy engines, there's a very broad term for copy engines. Anything that deals with files, that opens them up and saves them, potentially has a copy engine. If you're at the POSIX layer and you have a file called foo and it has an extended attribute blob, and you open it up in an editor, your editor might make a copy foo.temp, which is where all the modifications get done. When the editor goes to save that file, it doesn't have a copy of the extended attributes because it's not aware of the extended attributes. Then you'll end up losing those extended attributes.

That will represent metadata loss. As time progresses, we expect this extended attribute information to become very valuable, so your users are not going to want to be able to lose it. In this particular case, we're looking at a possible trick to help you. If you create a copy of a file and it doesn't have any extended attributes and you try to rename it on top of another file, there might be some things that we can do to help save you.

Other things we're doing or at least investigating to sort of ease the transition between the sort of metadata-less sort of world and more of this extended meta-data. If you have an application that currently supports meta-data but through a different API set, like both FreeUSD and Linux support sort of POSIX layer, you know, extended attribute API, we're investigating providing compatibility with that so your application would just work.

We're not quite sure exactly what the usage coverage is in this and since there is so much work, again, I'm not promising anything but we're trying to make your lives as easy as possible. Other things like a copy engine API. Some of us when we're looking at a lot of these codes, I mean, there's probably a dozen copy engines in Mac OS X alone if you sort of look around and it's sort of silly, you know, not to have a particular call where I can say, you know, copy this file.

In particular with both extended attributes and ACLs going into the system. It might be nice. For developers to be able to say just call this copy API, any changes that we sort of add to the file system layer will be sort of shielded from and the copy will always copy everything that's there. Also support things like external serialization.

Being able to say take the extended meta-data from file foo, extract it out and then put it on the file bar. There are lots of cases where that sort of functionality would be very useful and to encapsulate it into a particular library is something that that we're looking at providing.

Also, we've gotten a lot of complaints about the ._files on things like NFS mounts and looking at Reader filtering out those ._files because, again, that resource half of the Apple double file isn't really a file. I mean, it sure looks like a file. It kind of smells like a file.

But it really shouldn't be treated like a file. So if you wanted to go ahead and iterate through a directory of files and studying the extended attributes, you wouldn't want to be trying to set the extended attributes on the container that's holding the file. You're just building your extended attributes. That's just not valid behavior. So although they're there, looking at possibly providing a mechanism to be able to mask those out, but, again, if that provides a problem, an option so that they still show up.

Conversely, in the other direction, if you've ever shared an HFS plus volume over NFS, you don't have any of those .underbar files that are actually there. And this provides some sort of problems for applications depending on how you share things. Now, by far the best thing you can do is go ahead and use AFP, which will be, you know, completely, have a complete layer of support for the extended attributes. But in some situations, you know, we know that that's not possible. So we're looking at trying to mitigate that sort of transition.

So again, we're basically at a point now where we have Apple double revisited. Before we had the command line utilities weren't really aware of these sort of files and they didn't really respect them. But for Tiger, we're planning full command line support for all of the utilities that access files.

So if you have extended attributes associated with a particular file and use copy or move or ditto or you use R sync, we're looking at ways of either extending the protocol where possible or just going and using the Apple double file over the wire. Archivers, tar, zip, be able to make archives of things and not have to worry about whether or not you're preserving the metadata or not. And then the command line editors and things like that.

Basically sort of making it so that you don't have to use the sort of R sync X, tar HFS, these projects on the web that have involved in the data. So we're going to sort of fill this space where Apple sort of, you know, sort of neglected their command line users being able to treat this data as first class data.

So, the short of it is, tiger to tiger, you know, should just work. And anything that, you know, any place where that doesn't work, it's definitely going to be a bug. Now, tiger to non-tiger, we absolutely want to support, but there are a lot of very tricky issues and a lot of very tricky situations. The matrix of compatibility is actually very, very large if you iron it out.

We've looked at a lot of these issues and trying to come up with the best scenario and the best situations to be able to deal with those situations. But it is going to be somewhat difficult while everyone transitions to be able to support this metadata. Fortunately, from past history and stuff like that, everybody should be using this stuff within 10 or 15 years.

Here's an overview of the API. All the functions that you need are stored in sysxaddr.h and userinclude. And there's one particular important data structure, xaddr.io, which contains an XA position, which is basically a reserve for future use. For now, we want you to set that to zero. XA size should be the size of an attribute.

And then an XADATAPointer, which is going to be a block of data that you've allocated to be able to store these attributes. And then there's going to be basically three sets of functions: setXaddr, getXaddr, and listXaddr. And corresponding components that are prefixed to the letter F, that instead of dealing with file names directly, deal with file descriptors as the arguments on files.

[Transcript missing]

To read that data back, we go ahead and have a test file, confu-mothership-url, and we'll make a call to get x-addr. But in this particular case, the third argument you'll notice is zero, which would normally be that pointer to the x-addr-io struct. This tells the API to return the size of the particular value of that key, so you can then go ahead and allocate the appropriate amount of storage. So we go ahead and set x-is to x-a size for x-a attribute, and then malloc sufficient data. Calling again, now with a pointer to the x-addr-io struct. Again, negative one indicates an error. Otherwise, it should be set correctly.

Now that you can set and get these attributes, how do you find out what particular attributes are signed on a file? That's using the listXAdder API. Very similar to the getXAdder API, if the third argument is null, it returns the amount of space required to hold all of the particular keys. The format returned by listXAdder is a series of null terminated strings that are concatenated together. So we go ahead and malloc sufficient size to be able to store the block, call listXAdder again.

And then to iterate over that block, you just go ahead and have like a little pointer to the first key, and then while you're still within that range of that block, you go ahead and increment by the length of that particular key plus its null terminator. So now I have a couple little demos and things like that. Unfortunately for the WWDC build, you do have support for HFS. But the copying utilities and things like that like CPMV haven't been included yet. Some of that work is still yet to be done.

We have three parts to this demo. First we'll go ahead and we have a couple of command-line utilities that we'll be using. These aren't going to be shipping in Tiger but they'll be some sort of interface very similar to it that allows you to both store and retrieve the extended attributes directly from the command-line. So let's start with an empty file called foo. Set an attribute on it, com, Apple, demo, and the value being metadata is cool.

I'll go ahead and set another thing, a comment, you know, the command-line rocks. And give credits to a person named Kevin. LSX, you can go ahead and list these attributes, indicates the length of the attribute, the key name and also the value. You can see I have modified CP here so if I do CP foo and bar and do a list on them you can see the metadata is being preserved.

You can also do things like symlinks. Again, listing on a symlink, you could list on a backing file. With those options, you can control whether or not you read the extended attribute information off the file itself or you can also put extended attributes on the symlinks themselves. You can also put extended attributes on directories.

in this case you'll see system.finder info is set by default on symlinks. For hard links, you have the extended attribute information show up just as you'd expect. If you delete an extended attribute value, in this case removing com apple credits, you'll see that it's been removed from the hard link end and the actual file.

[Transcript missing]

I also mentioned that you should be modifying your copy engines and show this is a pretty straightforward sort of mechanism. I'm going to walk through function. It's 80 lines and does all the work that you need to be able to do to take one file name and then extract all the extended attributes and save it on another file.

So we'll start with the options for the argument, the source file name, destination file name, and those options. We'll have a set of local variables, a name for iterating over the key space, like in that for loop that I showed on the list x adder slide, a name buff to be able to store the names returned from list x adder, and an attribute buff to be able to hold the attributes as we get them and set them. Again, the x position argument should be set to zero.

So much like we showed on the slide, listx adder with a null gives you the size and you can malloc a block and call it listx adder again with the appropriate information being set and you'll have all of the names stored in this name buffer. Now we'll iterate over the list of names.

Again, that's the four-loop construct where we keep advancing by the length of the string. And then we'll need to determine how much space is necessary for the attribute. You know, allocate that space, get the attribute, then set the attribute. And again, that's relatively straightforward. Get xAdder with the third argument being null. Save that as xA size. Call malloc, allocate the appropriate amount of space. Save that in the dataport argument. Get the attribute. :::: and then move us to the top of the screen.

and call set x adder setting that attribute. The reason I did this in the terminal and the slides is so now I can give a demo of that running. So we'll touch a file foo, touch a file bar, set a series of attributes on that file foo, and then call this ‑‑ sorry, I'm listing now the attributes on foo and bar, notice no attributes on bar.

Execute this copy x adder utility and copying them from foo to bar, doing the list again shows you the attributes have been preserved. So again, the amount of work necessary to go ahead and preserve this metadata isn't that hard, but adding it to the appropriate locations, not just again these files that you create but also any directories that you create is pretty critical. For the second part, you need to show, so let's touch a file foo, touch a file bar, set a bunch of testing, oops, wrong one. Sorry.

: I did that, didn't I? All right. First, I want to look at the source. What's an example of using extended attributes? I thought of something that's probably something that you shouldn't be doing by default, but illustrates an interesting sort of point. I'm going to be saying profile encryption using an extended attribute to store key information.

One of the reasons why I'm doing this is because you don't want to go ahead and be searching for the particular key, like through a list of files. That's not a common operation. But you do want that information to be persistent across, you know, copies and moves and things. So we'll open up a file.

I can actually scroll around with a pointer. Open up a file. And then we're going to try to get a key. We have this little function called get attribute. Takes a file pointer, the name of the attribute, and a small little buffer to store this key. And then we'll call encrypt on it and then be done. Getting the attribute.

Again, XA position should be null. Setting the attribute buff, data pointer value and the size. And then calling the variant X or F get X adder, which I mentioned takes the first argument as a file descriptor as opposed to a file name. and then a little poorly written encrypt loop just doing XOR on the key, entering back over and writing it out as a file. So how does this work?

So create a file foo, put a secret message in there, and then look at it in hexdump because it will get ugly when we encrypt it. As you can see it still says this is a secret sentence. Listing the attributes on the file, there are no current attributes.

Attempting to encrypt the file, we get an error from fgetx that that key has not been set. We'll call it a little command-line utility setting it to the letter x. Doing another list, making sure that it's been set. Encrypting it, it's now super secret. You can see. calling it again and since our super secret cipher is symmetric.

: This is a secret sentence. So, again, it's not really difficult to be dealing with this extended attribute information, but it is very critical because things like this key, if you were to lose it by saving a file, now a user might not be able to have access to their data, and that's sort of horrible. Actually, I'm thinking, done with the demo, we can cut back to the slides.

in summary, metadata is great. There are a lot of uses that I know you guys are going to come up with for being able to manipulate and do interesting things. But by far the focus for Tiger is not just innovating around the metadata but by far preserving it as we're going to be creating it hopefully we don't also want to be destroying it a lot at the same time particularly accidentally.

Prevent namespace collisions using the reverse DNS names only instead of dots use underscores and stick to small and ounce for now. The values should be under 4K and we'd really appreciate that. Also, don't go hog wild and implement a file system layer on top of the extended attributes. It's not what it's for.

for more information please contact Jason Yao. This is email address [email protected]. He's a lucky guy. For more information on encryption you can look at the key chain services. Then for the extended attributes information in general there's the Mac OS X file system information on the DVD and then we'll be actually on hopefully be putting up some stuff on the reference site.