Mac OS X Essentials • 46:41
Speaker: Chris Emura
Transcript
Hi. My name is Chris Emura. I'm the Filesystem Engineering Manager within Apple's Core OS organization, and today we'll be talking a little bit about Mac OS X filesystems and related technologies. This is obviously targeted at both halves of the audience, the developers and the IT professionals, but there's really a little something in here for everyone, even those of you who are just end users. Our agenda is pretty compact. We'll be doing a very quick overview.
( Applause )
I'll talk about some things that are specific to our platform as far as, you know, Mac OS X-isms; what's changed since Tiger; what's new in Leopard; a few words on the ZFS developer preview; and then some time at the very end, maybe 10, 15 minutes, for Q&A. All right. Let's get started. So, quick overview. In all the years we've done this session it kind of occurred to me that we've never had a master slide that just kind of dumped out all the filesystems on our platform, and so I put something together really quickly.
It's divided into three basic buckets: the local filesystems, the network file sharing protocols, both client and server, and then some third-party mentions. Xsan is not really a local filesystem, but I had nowhere else to put it. There are going to be a number of sessions during this week that talk about Xsan in much more detail, so if you are interested in that technology, you know, by all means please attend those sessions. On the local filesystem front, as most of you know, our primary filesystem is going to be Journaled HFS+.
We also have FAT32, FAT16, all the variants, for a lot of the consumer-level devices, whether they be your camera, your flash cards, things of that nature. UDF, the Universal Disk Format, has undergone some significant changes in Leopard; we have a dedicated slide for that. It's typically associated, though, with optical media.
NTFS is somewhat under the radar, but it's actually been completely rewritten in Leopard. It's read-only support still at this point, but watch this space; there are going to be some things happening in that area. ISO 9660 and CDDAFS kind of round things out on the optical media side. On the file sharing side, again, both client and server.
We've got the Windows file sharing as far as SMB and CIFS. For those of you familiar with the UNIX side, of course, NFS v2 and v3. And then for us, AFP, the Apple Filing Protocol, which you'll probably see as the default for Mac-to-Mac communication. On the third-party side there are obviously many more third-party products as far as filesystems and file sharing protocols; I just mentioned AFS and Amit Singh's MacFUSE because MacFUSE has been getting some great press recently. They're good examples of that kind of technology on our platform.
With all those filesystems you've got various capabilities, and some of them are pretty mundane as far as limits: maximum file size, maximum filesystem size, and so on. These are things you can look up on your own; we won't talk about them in any great detail here. When it comes to filesystem-specific fcntl()s or capabilities, things are pretty much documented in the fcntl() man page. These are things you should feel free to use, but at the same time, as developers, the deeper your dependencies are on these types of specific capabilities, in general, the less portable your code is going to be.
So just take care in those situations. Crash protection is not something you're going to code against necessarily, but just as an end user it's something to be aware of. In terms of our local filesystems, the only one with formal crash protection is going to be Journaled HFS+, of course.
That said, on the FAT side of things we've gone to great lengths to balance crash protection, as far as robustness, against performance. And so with our FAT implementation I think you'll see we've found quite a nice mix, or balance, between being able to survive plug pulls and so on versus just raw performance.
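As a quick aside of my own, not from the session: one example of those filesystem-specific fcntl()s is F_FULLFSYNC, which asks the device to flush its write cache when durability across plug pulls really matters. A minimal, hedged sketch:

/*
 * Hedged sketch, not from the session: F_FULLFSYNC is one of the
 * filesystem-specific fcntl()s documented in the fcntl() man page.
 * It asks the storage device to flush its write cache to permanent
 * storage, which matters when you care about surviving plug pulls.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int write_durably(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) < 0)
        return -1;

    /* Try the stronger, Mac OS X-specific flush first. */
    if (fcntl(fd, F_FULLFSYNC) == -1) {
        /* Not every filesystem supports it; fall back to fsync(). */
        if (fsync(fd) == -1)
            return -1;
    }
    return 0;
}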
Going on with the various capabilities, I wanted to call this one out specifically this year: sparse file support. We don't have native sparse file support on our primary filesystem, HFS+, so you will pay that zero-fill "cost". If that's going to be a problem for any of you as developers, come talk to us either after this session or during the lab tomorrow. There are definitely a number of tricks or workarounds we've used in the past that have made other vendors happy with this limitation.
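One hedged way to see that zero-fill cost for yourself (my sketch, not the session's): punch a large hole with lseek(), write one byte, and compare the logical size to the allocated blocks. On a filesystem with native sparse files, st_blocks stays small; on HFS+ you would expect the hole to be physically allocated.

/* Hedged sketch: probe whether a volume stores sparse files natively. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "sparse_probe.dat";
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Leave a 64 MB hole, then write a single byte at the end. */
    off_t hole = 64 * 1024 * 1024;
    if (lseek(fd, hole, SEEK_SET) < 0 || write(fd, "x", 1) != 1) {
        perror("seek/write");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) == 0) {
        /* st_blocks is in 512-byte units. A large value here means
         * the "hole" was zero-filled rather than stored sparsely. */
        printf("logical size: %lld bytes, allocated: %lld bytes\n",
               (long long)st.st_size, (long long)st.st_blocks * 512LL);
    }
    close(fd);
    return 0;
}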
Case sensitivity. This one almost never gets talked about, at least in any kind of detail. For starters, we offer case-sensitive HFS+ as a first-class filesystem on our platform; it's what we tend to steer people towards who are looking for case sensitivity. But as developers, a lot of times we see code -- internally too, we make the same mistakes -- where people hard-code incorrect case into their code.
They hard-code "info.plist" when, in fact, it's "Info.plist" with a capital I, for example. They get away with it a lot of the time on their primary filesystem because it's case-insensitive, but on the case-sensitive side these types of lookups, these types of operations, will fail.
So the general message here is: as developers, please test your code on a case-sensitive format as well. Another subtle point involves people that come from other platforms, oftentimes with tarballs of their home directories or whatever it may be, and they come from, let's say, UFS or something along those lines. They are basically going from a case-sensitive format to a case-insensitive one.
And with something like tar, it's very easy to kind of lose some data here: you have two files that basically collide because they only differ in case, and on the case-insensitive filesystem you'll overwrite one. That can be silent for a while, and you won't notice until you start poking around. So just something to be mindful of.
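If you want to check at runtime what kind of volume you're on, one hedged option (not something shown in the session) is pathconf() with the _PC_CASE_SENSITIVE selector, assuming your SDK exposes it:

/* Hedged sketch: ask the volume whether names are case-sensitive. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "/";

#ifdef _PC_CASE_SENSITIVE
    long cs = pathconf(path, _PC_CASE_SENSITIVE);
    if (cs == -1)
        perror("pathconf");
    else
        printf("%s is case-%s\n", path, cs ? "sensitive" : "insensitive");
#else
    /* Selector not available on this SDK; treat as unknown. */
    printf("_PC_CASE_SENSITIVE not available here\n");
#endif
    return 0;
}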
Access control lists. This is very prominent now on our platform. Whenever I talk about access control lists or ACLs I like to start off by comparing it with the old-school UNIX user ID and mode_t bits. I think everyone in this room is probably pretty familiar with this as far as the read-write execute bits for yourself, groups you belong to, and pretty much the rest of the world. There's other limitations, of course, with regards to 16 groups and no nesting. But, you know, this has served most of us pretty well as far as being a very simple, easy-to-understand model.
That said, there are a lot of times when you do want to express your permissions with a little bit finer granularity, and for that, again, there are access control lists. I've highlighted a couple here that I feel are pretty much difficult or impossible to express with the traditional mode_t:
appending to a file, or changing the security information of a file. And you'll note that for every one of these entries there's an allow or deny flag. The list itself is actually pretty long, so you can kind of see how much finer granularity you have, how much expressive power you have, with regard to permissions with access control lists.
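As a hedged sketch of my own (not from the session), here's how you might dump a file's ACL through the POSIX.1e-style acl(3) calls in libc; I'm assuming ACL_TYPE_EXTENDED is the right type constant for the platform's ACLs:

/* Hedged sketch: dump a file's access control list, if it has one. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/acl.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s path\n", argv[0]);
        return 1;
    }

    /* ACL_TYPE_EXTENDED is assumed to be the Mac OS X ACL type. */
    acl_t acl = acl_get_file(argv[1], ACL_TYPE_EXTENDED);
    if (acl == NULL) {
        perror("acl_get_file");   /* error, or simply no ACL present */
        return 1;
    }

    ssize_t len = 0;
    char *text = acl_to_text(acl, &len);
    if (text != NULL) {
        printf("%s\n", text);     /* one allow/deny entry per line */
        acl_free(text);
    }
    acl_free(acl);
    return 0;
}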
Extended attributes. I should kind of start off by describing what they are. As most of you know, extended attributes are simply arbitrary bits of data that don't live in the actual file itself; they don't live in the data fork, if you will. They can be quite useful in situations where you are working with a file format that has no allowance for tucking away a little bit of extended metadata; you can just go ahead and use EAs there.
On our platform, they are going to be supported on all filesystems. I think at first that sounds a bit odd, but I'll explain a little further. Obviously, for cases where the filesystem supports EAs natively, we will use the native method; that's pretty straightforward. In situations where the filesystem does not support EAs natively, we will synthesize support at the VFS level and store the actual EAs in what's called an AppleDouble file.
I'll talk about that a little bit more later. The convention we use for the actual EA tags is pretty simple: we just use the reverse-DNS naming convention; the FinderInfo attribute is listed here. We don't actually enforce this, but it's something that, you know, we encourage; using it keeps things nice and clean, very manageable. And there's good past precedent for this as well.
I also want to call out current limitations. The main one I'm talking about today is size. The current lowest common denominator of size per EA is roughly just under 4K, about 3,803 bytes; I can't remember the exact number, but it's around there. This is really something we should put in some kind of volume capability bit and advertise a little more strongly, but at this point just be aware that the current lowest common denominator is in this neighborhood.
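As a hedged illustration of the reverse-DNS convention and the BSD-level calls involved (the path and the attribute name com.example.notes are both made up):

/* Hedged sketch: set, list, and read an extended attribute. */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
    const char *path = "example.txt";              /* assumed to exist */
    const char *name = "com.example.notes";        /* reverse-DNS tag  */
    const char *value = "draft v2";

    /* Note the Mac OS X signature: a position argument (0 except for
     * the resource fork) plus an options argument (e.g. XATTR_NOFOLLOW). */
    if (setxattr(path, name, value, strlen(value), 0, 0) != 0) {
        perror("setxattr");
        return 1;
    }

    /* List all attribute names; they come back NUL-separated. */
    char names[1024];
    ssize_t nlen = listxattr(path, names, sizeof(names), 0);
    for (ssize_t i = 0; i < nlen; i += strlen(&names[i]) + 1)
        printf("xattr: %s\n", &names[i]);

    /* Read one back. Keep each value well under the ~3.8 KB floor
     * mentioned above if you want it to work everywhere. */
    char buf[256];
    ssize_t vlen = getxattr(path, name, buf, sizeof(buf), 0, 0);
    if (vlen >= 0)
        printf("%s = %.*s\n", name, (int)vlen, buf);
    return 0;
}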
I say "current" because -- I don't know if any of you have browsed the open source code -- we do have some work in place for extent-based EAs that hasn't been turned on just yet. So just so you know, this is a limit that could be lifted at some point in the future. Next, preservation when it comes to extended metadata, or EAs in general.
This is definitely something you want to be mindful of even if you don't use EAs directly. It's very, very easy -- just as an end user -- to strip off EAs by doing very basic operations with utilities or applications that may not be EA-aware. So, again, something for you to be mindful of.
I'll talk about some alternate methods to do copies, so you don't have to worry about this. But if for some reason you have to roll your own copy engine, your own copy functionality, do moves, or do some kind of archiving, when it comes to preservation we have an internal mantra that focuses on three things. Always the same three things.
It's the data, of course -- I mean the data itself -- the extended attributes, including the resource fork, which shows up in the EA namespace, and then the security information itself. Each one of these buckets, or categories, has its own corresponding API or family of APIs to manipulate that data. So if you have to roll your own, those are the three things you need to focus on.
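If you do end up rolling your own at the BSD layer, one hedged option -- my suggestion, not the session's -- is copyfile(3), which can be told explicitly to carry all three buckets:

/* Hedged sketch: copy a file and preserve all three "buckets". */
#include <copyfile.h>
#include <stdio.h>

int copy_with_metadata(const char *src, const char *dst)
{
    /* COPYFILE_DATA     -> the data fork itself
     * COPYFILE_XATTR    -> extended attributes (incl. resource fork)
     * COPYFILE_SECURITY -> ownership, mode bits, and ACLs           */
    copyfile_flags_t flags = COPYFILE_DATA | COPYFILE_XATTR | COPYFILE_SECURITY;

    if (copyfile(src, dst, NULL, flags) != 0) {
        perror("copyfile");
        return -1;
    }
    return 0;
}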
This is a slide we actually use every year. It's a good one. It kind of gives you the lay of the land of the various frameworks and how they fit in. Those of you coming from other flavors of UNIX you'll be very familiar with the BSD or the POSIX-level API. You know, it's again, it's pretty much the same thing.
The problem, or the limitation, with the BSD APIs is that they don't shield you from the differences between the various filesystems; everything is exposed at that level. So as a general rule, we ask people to kind of move up to the Carbon File Manager -- the Core Services File Manager -- and NSFileManager, which protect you from the differences between the various filesystems. Just a general suggestion, so you don't have to worry about the minutiae or the specific details behind each filesystem and its capabilities.
Okay, let's switch gears a little bit and talk about things that are more or less unique to our platform; this is especially useful for those of you who are new to our platform and kind of familiar with the Linux side, or the Solaris side, or whatever it may be. The first one is those AppleDouble files. This always gets an incredible amount of attention, and there are all kinds of different terms for it.
As I mentioned earlier, they are used to back these extended attributes in cases where native support isn't available. You would think that in the future, as you look at future filesystems and future protocols in general, most of those will come with native support, so the idea, hopefully, is that these AppleDouble files will become more and more rare -- legacy usage notwithstanding, of course. The convention is pretty simple: for any given file foo you have a corresponding ._foo file, which again backs all of those EAs and so on.
When you use the upper-level APIs, as I mentioned earlier, they tend to abstract things away. These ._ files, these AppleDouble files, will not show up at all in the namespace at that level, and so you won't be aware of them.
For better or for worse -- in my opinion these things should actually be taken out of the namespace completely -- at the BSD level they are exposed, not just at the API level but also on the command line; with things like find or ls you will see these ._ files.
So please resist the temptation to operate on these things directly. You should never have to do that. The kernel manages them completely as far as creation, renames, and unlinks, or deletes, and so there really isn't any need for you to manipulate these ._ files, these AppleDouble files, directly.
getattrlist(). What can I say? For those of you coming from the UNIX side, it sure doesn't smell like a UNIX function. But it is a full-fledged VNOP, which is a fancy way of saying it's here to stay; it's a first-class function on our platform. The best way to describe it for those of you on the UNIX side is that it's pretty much a superset of stat(): you get the attributes of a file, everything that stat() provides and much more. For those of you from the Carbon side of the house, this is pretty much how FSGetCatalogInfo() is implemented, so if you are familiar with that call, you'll be familiar with this mechanism.
It does have, in my opinion at least, one really nice quality, which is the selection bitmap. I think the best way to describe that is to compare it to something like stat(), which operates on a fixed struct stat buffer, if you will. With stat(), as you know, you get everything whether you need it or not; you're going to get everything and you're going to pay for everything. With the selection bitmap and getattrlist() you can specify only the attributes you want and pay only for the ones you want.
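A minimal, hedged sketch of that selection bitmap (my example, not the session's; the two attributes chosen are just illustrative):

/* Hedged sketch: fetch just the object type and modification time
 * with getattrlist(); the selection bitmap picks only those two. */
#include <stdio.h>
#include <string.h>
#include <sys/attr.h>
#include <unistd.h>

/* getattrlist() packs its results on 4-byte boundaries. */
#pragma pack(push, 4)
struct reply {
    u_int32_t       length;    /* total size of the returned data */
    fsobj_type_t    obj_type;  /* ATTR_CMN_OBJTYPE                */
    struct timespec mod_time;  /* ATTR_CMN_MODTIME                */
};
#pragma pack(pop)

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : ".";

    struct attrlist al;
    memset(&al, 0, sizeof(al));
    al.bitmapcount = ATTR_BIT_MAP_COUNT;
    al.commonattr  = ATTR_CMN_OBJTYPE | ATTR_CMN_MODTIME;  /* the bitmap */

    struct reply r;
    if (getattrlist(path, &al, &r, sizeof(r), 0) != 0) {
        perror("getattrlist");
        return 1;
    }
    printf("%s: type=%u, mtime=%ld\n",
           path, (unsigned)r.obj_type, (long)r.mod_time.tv_sec);
    return 0;
}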
So right off the bat there is a nice semantic there with regard to cost, especially in the future if some of these attributes end up costing much more than others in cases where you don't really need them or don't care. The other nice property of the selection bitmap and getattrlist() is extensibility.
I think it's pretty obvious: with a well-defined structure like stat, at some point you have to add new fields or maybe blow out the width of a field -- the inode going from 32-bit to 64-bit is a classic example. With getattrlist() you at least have the option, with the selection bitmap, to define a new attribute or a new type and not have to worry as much about extensibility and backwards compatibility. Next up: MNT_IGNORE_OWNERSHIP.
This is a much higher-level model or construct. I'll start with the motivation, which is actually pretty simple. I think a lot of you in this room can't relate, or don't relate as much, because you can figure out other ways to get around it. But the motivation is that you have two computers, or two users, with completely different UIDs -- different UID spaces, if you will, or directory services; they are not bound to the same directory service, they're completely different -- and you want to move or share files from one to the other with, say, an external FireWire drive. So we have Joe here, user ID 501, and Sam on another machine, UID 502.
All those files are laid out with Joe's UID. He brings the drive over to Sam's computer, and, again, for a lot of less technical users, on the read-only side you are typically okay. But even then there can be some problems, just because Sam owns none of those files under the traditional UNIX model for permissions, if you will. This can be a little bit overwhelming for a less technical user like Sam. So by default we go ahead and mount all those external drives with this non-durable state, this MNT_IGNORE_OWNERSHIP flag.
It's intended to be transparent, but the bottom line is pretty self-explanatory: any non-root process on Sam's machine will see its own UID when looking at the directory or file ownership of those files. So even though Joe's UID is what's on disk, whenever Sam does an ls or takes a look at those files, he will see those files as being owned by him. And so clearly on the read-only side you are really good; I mean, there are no problems at all. It's as if he owns those files.
On the write side, so long as you are mounted in this mode, regardless of what your UID is, anything you write out to disk will be UID 99. So -- there's probably a better way to phrase this -- you won't pollute things when you bring the drive back to Joe's computer. UID 99 is kind of a wildcard here: any time we see it on disk it will be interpreted as the calling process's UID. So that's how these things work out, if you will.
It's not a perfect solution, and for those of you thinking about the security implications -- how can I possibly tell the difference between UID 99 and the real UID, so to speak -- root processes will always see the truth, and that's kind of how we mitigate any exposure in that area.
Just log in as root and take a look; you should see the truth there. The Finder's Get Info panel, by the way, also kind of lets you know what state you are in: the "Ignore ownership on this volume" checkbox in the Sharing pane corresponds directly to whether that flag is set or clear.
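Programmatically, a hedged way to check the same thing (my sketch, not the session's) is statfs() and the mount flag, assuming the MNT_IGNORE_OWNERSHIP constant is exposed by your SDK's <sys/mount.h> (older headers may only have MNT_UNKNOWNPERMISSIONS); the volume path is made up:

/* Hedged sketch: check whether a volume is mounted with ownership
 * ignored (the MNT_IGNORE_OWNERSHIP state described above). */
#include <stdio.h>
#include <sys/param.h>
#include <sys/mount.h>

#ifndef MNT_IGNORE_OWNERSHIP
#define MNT_IGNORE_OWNERSHIP MNT_UNKNOWNPERMISSIONS  /* older header name */
#endif

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "/Volumes/Untitled";  /* made-up path */

    struct statfs sfs;
    if (statfs(path, &sfs) != 0) {
        perror("statfs");
        return 1;
    }

    if (sfs.f_flags & MNT_IGNORE_OWNERSHIP)
        printf("%s: ownership is being ignored (UIDs mapped)\n", path);
    else
        printf("%s: on-disk ownership is honored\n", path);
    return 0;
}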
volfs. Okay, it's got a pretty clear message. We have to give this message every year, but I wanted to go a little bit more into what it actually is. I'll explain what it is, and I am not going to debate the merits of lookup by path versus lookup by tuple or by some kind of inode number. The message is the same year after year: please do not depend on volfs directly.
Please do not depend on volfs directly. Let's get into what it is. It's pretty simple: volfs is a filesystem that implements an alternate lookup method. You are all familiar with regular paths, whether they be absolute or relative; I've got an example here where I just cat foo.text.
That's a very simple lookup method, if you will. volfs simply implements an alternate lookup method where you can pass in a unique tuple -- in this case some kind of filesystem ID and then some file ID, or inode number. It's an alternate way to get to that particular file. That's all it is; the mechanism just happens to be implemented as a filesystem. We just ask that you do not use it directly, that's all.
Okay, let's change gears again really quickly: what's changed or what's new in Leopard? We'll start off with a copy engine, FSCopyObject. This API has easily been one of the most requested features: very robust in terms of being able to preflight, high performance, and cross-volume-format aware, so you don't have to know what filesystem you are copying from and to. And it is the same engine that the Finder uses, so you know it's quite robust.
It's got quite a bit of soak time if you will and exposure. It's quite well tested, quite well used. And, of course, it takes care of all of those things about the preservation I mentioned earlier. The data, the EAs, and the security information. You don't have to familiarize yourself with the APIs that are required to copy the EAs and security information. This will do it for you.
So it's something we recommend you use. You see the synchronous/asynchronous versions of the call below. It's something that you can look up on the ADC site for sample code. We did talk about this in more detail last year. So go ahead and refresh yourselves by looking up that reference.
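A minimal, hedged sketch of the synchronous call (my example, not Apple's sample code; the paths are made up, and you'd link against the CoreServices framework):

/* Hedged sketch of the synchronous copy-engine call; error handling
 * is minimal, and the paths below are invented for illustration. */
#include <CoreServices/CoreServices.h>

static OSStatus copy_item(const char *srcPath, const char *dstDirPath)
{
    FSRef src, dstDir, newObj;
    OSStatus err;

    err = FSPathMakeRef((const UInt8 *)srcPath, &src, NULL);
    if (err != noErr) return err;
    err = FSPathMakeRef((const UInt8 *)dstDirPath, &dstDir, NULL);
    if (err != noErr) return err;

    /* NULL name keeps the source name; default options let the engine
     * handle data, EAs, and security information preservation for us. */
    return FSCopyObjectSync(&src, &dstDir, NULL /* destName */,
                            &newObj, kFSFileOperationDefaultOptions);
}

int main(void)
{
    OSStatus err = copy_item("/tmp/source.txt", "/tmp/dest_dir");
    return (err == noErr) ? 0 : 1;
}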
Safe saves. The mechanism itself certainly isn't new, but this year, given some feedback from last year, I wanted to talk about what safe saves are. I think all of you know what this is in general; the terminology may be different, but all a safe save is, is a way to guarantee a very nice semantic: you want to modify a document and have it end up in only one of two states -- either the original document is untouched or all of your changes have made it over. All or nothing, right? Never a situation where the document is half modified.
So with any safe save mechanism, we start off with the original document or file, we go ahead and make a complete copy, and you work on and modify only that copy. When you are all done you close, you sync, and you check for errors, of course.
And then you go ahead and do an atomic rename -- or exchangedata() in the HFS+ case -- and you get your semantic. You'll notice that during that entire flow there's no point where you can introduce a fault and be left in an inconsistent state. Either all the changes have made it over or none of them at all. And that's pretty much what the safe save mechanism is.
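Here's a bare-bones, hedged sketch of that flow at the POSIX level (mine, not the session's); it shows only the copy-modify-sync-rename skeleton and doesn't do the metadata preservation the new API handles for you:

/* Hedged sketch of the generic safe-save pattern described above:
 * write a complete new copy, fsync it, then atomically rename it
 * over the original. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int safe_save(const char *path, const void *buf, size_t len)
{
    char tmp[1024];
    snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);   /* temp in same directory */

    int fd = mkstemp(tmp);
    if (fd < 0)
        return -1;

    /* Write the complete new contents, flush, and check for errors. */
    int ok = (write(fd, buf, len) == (ssize_t)len) && (fsync(fd) == 0);
    ok = (close(fd) == 0) && ok;

    /* The atomic switch: either the old file or the new one, never half. */
    if (!ok || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}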
New in Leopard we have the FSReplaceObject API, which implements this safe save for you. And just like with the copy engine, there's no need for you to be aware of which calls should be used under the hood -- exchangedata() versus rename(). It's supported across all filesystems and it does the right thing with regard to preservation. You may have already seen it in use by NSDocument, as far as that class goes.
You'll see the temporary subdirectory there. If you need more information -- just like last year, this is talked about in a little more detail -- look at the AppKit release notes, or go ahead and grep around for the API on the ADC website. Take a look at that, download it, and play with it a bit.
Oh, there's that volfs again. So this year, in Leopard, no more volfs -- period. The filesystem itself has been completely removed. Of course, none of you were depending on it in the first place, so this should not be a problem. But seriously speaking, you won't see it in the filesystem listing. There was a pragmatic side as well, for those of you who use tools like fs_usage.
I get a lot of hate mail about how annoying it is to look at volfs paths all day. You'll finally see full-blown paths in traces like this, which is a little more readable, a little more helpful. So there's a pragmatic side as well. So, no more volfs -- at least the filesystem.
Deprecation of UFS is a phased deprecation. Starting with Leopard you won't be able to install on a UFS root and you won't be able to create a UFS filesystem at least not from the GUI. Eventually the end game here is to get to a point where we only support read-only UFS.
And seriously, you know, this is nothing personal: most people that use UFS are really after the case sensitivity, at least in my experience whenever I ask this question. So if case sensitivity is what you are after, please go ahead and switch over to case-sensitive HFS+ now.
UDF, the Universal Disk Format from OSTA; it's an open standard filesystem. Like I said earlier, this has been completely rewritten in Leopard, and we've got read/write support. Burning is always going to be there on the optical side. I think the nice change now with UDF is that we fully support block storage devices, block-addressable devices, whether they be your flash thumb drive, CF/SD card, or even just a portable USB drive. You'll see that Vista's got corresponding support as well; things are kind of moving in that direction.
It's not going to be a replacement for FAT overnight, that's for sure; I mean, FAT's pretty much the de facto standard for this kind of interoperability. But the device vendors -- the consumer guys -- are very much aware of this, and other vendors, you know, UNIX, ourselves, and Microsoft, are kind of bumping up their support of UDF. So it's something to be aware of and, in my opinion, something to be excited about. It doesn't have the same limitations as far as metadata and sizes that FAT does, so there's some potential here.
Enhancements to ls. For those of you who hang around on the command line or the shell, you may already be familiar with the -e option for ACLs, the access control lists. You'll see that little + sign next to the mode_t bits, which lets you know that this file has at least one access control entry.
When you use it in conjunction with the -l option, you get the full listing. Similarly with extended attributes -- ls is really the only mainstream command-line tool that gives you the same type of information for EAs -- you have a little @ sign next to the mode_t bits, which lets you know there's at least one extended attribute on this file. And with the -@ option, used in conjunction with -l, you see everything as far as the tags and the sizes in bytes. You can see that the tags conform to that reverse-DNS convention I mentioned earlier.
Access control lists are certainly not new in Leopard, but what's changed is that they are on by default, for one -- you've got no choice at this point -- and they are actually used in a very mainstream way. If you look at this listing, this is basically your default home directory, and you'll see there are a bunch of access control entries for the well-known folders: the "everyone deny delete" entry, I believe, yes.
This is basically the "do not shoot yourself in the foot" access control entry. Of course, you can go ahead and remove that explicitly and go ahead and remove the directory if you need to. But it's a nice level of protection, a little safety net there. That's the ls listing from the shell.
The Finder's Get Info panel has been modified and updated to give you the same information, in terms of viewing the access control lists and also modifying them. So it's something to check out. You'll start seeing a lot more usage here in terms of access control lists, and you'll be able to do the same thing with your applications in terms of sharing, or providing these kinds of safety nets.
FSEvents. This is our filesystem event notification, or change notification, framework. It's a really nice framework in terms of being very simple. It's the same thing -- well, not quite the same thing -- it's what underlies Spotlight, so it's quite powerful, yet it's all userland, unlike kauth, which is also quite powerful but much more difficult to program to.
So: a simple, userland file change notification mechanism. We actually talked about it a little bit last year, kind of in this same session. This year we have a dedicated session that goes over it in much more detail and provides that same watcher sample code. It's a good one to go to, tomorrow at 3:30.
A lot of times I get questions about how this is different from kqueues. For starters, it's certainly much more sophisticated than kqueues while still being easy to use. But the most succinct example I can give is this: if you are just going to monitor one file, or just a few files, or maybe a directory and just its immediate children, kqueue is usually pretty nice in that context. That's kind of what it's designed for.
But if you want to monitor an entire directory hierarchy all the way down to the leaves, changes anywhere in that hierarchy kqueue is not going to be adequate and you should really be using something like FSEvents. So if you want to learn more about that in much more detail, please attend that session tomorrow.
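A minimal, hedged FSEvents sketch of my own (not the session's watcher sample; the watched path is made up, and you'd link against the CoreServices framework):

/* Hedged sketch: watch an entire directory hierarchy with FSEvents. */
#include <CoreServices/CoreServices.h>
#include <stdio.h>

static void callback(ConstFSEventStreamRef stream, void *info,
                     size_t numEvents, void *eventPaths,
                     const FSEventStreamEventFlags flags[],
                     const FSEventStreamEventId ids[])
{
    char **paths = (char **)eventPaths;   /* default: raw C-string paths */
    for (size_t i = 0; i < numEvents; i++)
        printf("change somewhere under: %s\n", paths[i]);
}

int main(void)
{
    CFStringRef path = CFSTR("/Users/me/Documents");    /* made-up path */
    CFArrayRef paths = CFArrayCreate(NULL, (const void **)&path, 1,
                                     &kCFTypeArrayCallBacks);

    FSEventStreamRef stream = FSEventStreamCreate(
        NULL, &callback, NULL, paths,
        kFSEventStreamEventIdSinceNow,   /* only events from now on */
        1.0,                             /* latency in seconds      */
        kFSEventStreamCreateFlagNone);

    FSEventStreamScheduleWithRunLoop(stream, CFRunLoopGetCurrent(),
                                     kCFRunLoopDefaultMode);
    FSEventStreamStart(stream);
    CFRunLoopRun();                      /* callbacks arrive per directory */
    return 0;
}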
Filesystem corruption detection infrastructure, or framework. This is something we've introduced new in Leopard. It's very dynamic: there's nothing you have to do explicitly in terms of taking your disk offline, unmounting it, or downgrading the mount to read-only. Nothing like that. It's all dynamic; it happens at runtime.
There's very little performance penalty here at all; it's always running. It's basically triggered by physical disk problems -- any kind of physical disk I/O errors -- or by a number of data integrity checks that we're doing at runtime. So it's always running and keeping track of things. In the event that you do have a problem, an fsck is going to be triggered at reboot, and hopefully everything will be fixed -- hopefully it's not a physical problem -- and you will be on your way.
In the cases where we cannot fix the drive, it will be mounted read-only. We're not going to look the other way and let you read/write and do whatever you want. The motivation with the read-only mount is to not do any more damage.
We kind of let you know something is wrong. You can still salvage things from a read-only context, as far as copying things off, but we don't want it mounted read/write -- certainly not by default -- just in case something is seriously wrong and you'd make the situation worse.
Where possible, we are going to try to provide specific paths or file names so you can get a feel for what's been damaged or impacted. As a side note -- it's not really directly tied to this project -- we've also improved fsck by providing a -c, or cache size, option.
We kind of realized that these days there are a lot of you out there with huge, huge memory configurations, and we're not taking advantage of that by default in fsck. So we provide this explicit option; for those of you who hang out on the command line, you can start passing much larger values for the block cache and make performance jump quite a bit. So check that out. It's documented in the man page for fsck -- it should be documented, anyway.
Okay, where do we get started here? HFS directory hard links. HFS directory hard links were created specifically for Time Machine, and they're specific to Journaled HFS+. As you know, with Time Machine, or most backup infrastructures, you've got some kind of level zero where you back everything up, and then incrementals from that point on that keep track of the changes.
For the incrementals you want a complete view of the tree, but without incurring the cost of copying things that haven't changed. So it's pretty obvious how the hard links fit in: you basically create a hard link -- in this case for subdirectory 1, where things haven't changed -- and you get the same view, if you will.
There's a reason why directory hard links aren't very popular in the industry -- actually there are a number of reasons. This is probably one of the biggest ones: either maliciously or accidentally creating a cycle in the namespace. A lot of bad things happen because it's very difficult to detect this, at least in a timely manner. A cycle is kind of represented by this red line, where we created a hard link from one of the subtrees in level zero up to the root of incremental 1.
If you were to descend into one of these directories, you'd loop for a long time, if not forever. Bad things happen. So we put very conservative limits on link creation to prevent these cycles from occurring in the first place. With all that said, you kind of get the feeling that this is definitely not intended for general-purpose use. It's very specific, and future changes may certainly occur, not just with the on-disk format but also with the SPI, as far as how we do these in the first place.
With that said, I'm sure all of you are going to start poking around in the backing store anyway. Just be aware we've had some problems here already with regard to this very basic scenario. Again, with the level zero and the incremental, from the namespace point of view this is basically what you see: it looks like they are completely isolated.
You know better now, but from the point of view of a utility that isn't aware of directory hard links, this is what it sees: a level zero hierarchy that's completely separate from an incremental. And it seems natural that if you want to blow away the incremental version, you should just be able to point at its root and it will do the right thing.
Of course, because it's implemented the way it is, as soon as your naive rm command goes in there and, depth-first, descends into subdirectory 1, you will actually be on the level zero side; it will start unrolling there and deleting things from the originals. So this view isn't what it seems, so to speak, and a lot of utilities are not going to be aware of that. Just be careful; this is just one of many gotchas, I'm sure.
( Pause )
Okay. Let's completely switch gears. Talk a little bit about the ZFS on Mac OS X Server, the Developer Preview. Let's start really quickly describing what it is. I think a lot of you guys already know what it is but kind of bear with me here. ZFS is the next-generation filesystem from Sun Microsystems that's been ported to our platform.
In my opinion it has no equal in terms of data integrity for any production filesystem. And by data integrity I don't just mean the cases where you have a huge number of disks and a lot of resources; even if you look at this kind of slider, all the way down to the single-spindle case, ZFS can really do a good job there as well when it comes to data integrity.
It uses a pooled storage model, which is a great departure away from traditional LVMs -- logical volume managers -- and partitions; I'll talk a little more about that. It introduces RAID-Z, a great improvement over something like RAID-5, and if we have time I'll dig into that a little bit.
Something that I like a lot about it is the built-in live disk scrubbing. It goes well beyond anything like an fsck as far as checking everything -- all the blocks, metadata and data. And most importantly it can do this live on a system: you don't have to downgrade the mount, unmount, or boot into some other OS to do this; it does it live. And it's really got first-class snapshots and clones, where clones are essentially read/write snapshots.
By first class I mean not just in terms of functionality -- no limits -- but also implementation; it's not a bolt-on architecture, snapshots are very fundamental to the design of ZFS. And of all the other features, if you will, I'm going to end here with the built-in compression capability.
This is great not just for disk savings, which is important, but also for generating much higher I/O throughput for workloads that compress very well, just because the amount of work you have to do against the disk is diminished quite a bit; when it expands out you can synthesize these huge I/O rates and so on.
This just scratches the surface with regard to ZFS; there's much more out there that you can read up on. I'm only going to talk about two things today: a little bit more about the data integrity model, and then the pooled storage model.
Let's go ahead and start with the data integrity side of things. With ZFS everything is copy-on-write, so you never overwrite live data. This is a great property from an integrity point of view. It's very similar, in fact, to the whole safe save mechanism -- instead of documents, think of it as a safe save for blocks, if you will, with some kind of transactional switch. In this case we'll oversimplify and represent a ZFS filesystem as a tree of blocks, topped by an uberblock, or head pointer of sorts. You want to make a modification, so you go ahead and copy-on-write your changes.
Then you copy-on-write the indirect nodes of the tree, the indirect blocks, and when you are all done you atomically update the uberblock. You'll notice that, just like with the safe save, there is no point in time where you can inject a fault -- be it a power outage, plug pull, panic, whatever it is -- and be left in an inconsistent state. This is very simple, but it's also very powerful and very provable. It's a very nice departure from the traditional approach of overwriting things live and hoping for things to work out with some kind of replay.
Another aspect of ZFS data integrity involves checksums. Checksums, I think, are a very natural extension of any kind of data integrity goal, but I wanted to start off with conventional checksums -- checksums that are local, or very close, to the data; block-local checksums, for example. They are certainly better than nothing: at least you're doing something to help ensure data integrity, and they will catch bit rot, for example.
But anytime you have block-local checksums you are vulnerable to a number of problems. These are problems that are common enough that they have industry-standard terminology. I'm going to talk about phantom writes and misdirection because I've seen those myself personally; I can vouch for them, if you will.
Phantom writes are simply a case where you do a write, the driver or the device acknowledges the write, and it just basically drops it -- it kind of disappears. Checksums, at least local checksums, are not going to help you in that situation. Misdirection is just a fancy term for, let's say, the filesystem asking the driver, or the drive itself, for block 17; the drive returns block 71, the checksums check out, and you keep on going. So block-local checksums are not going to help you in that situation either. Again, better than nothing, but there are some vulnerabilities here.
Compare that to how ZFS uses checksums: it's kind of an indirect model. For starters, everything in ZFS -- at least by default -- is checksummed, not just the metadata but the data as well. And there's a kind of fault isolation, or indirection, between the actual data and the checksums. So in this simple layout -- this tree of blocks, if you will -- anytime you modify data, you calculate its checksum and store it away from the data, in the parent's block pointer; then you calculate the checksum for the parent, and so on, transitively rolling up all the way to the uberblock.
You get a nice end-to-end story; this is a very secure way to do things as far as data integrity goes. For those of you who want a more formal treatment, go look up Merkle trees, or hash trees; they've got some nice properties that are exploited here. All those previous problems I mentioned -- the misdirection, the phantom writes, and so on -- are all covered, or at least they'll be detected, by this particular checksum scheme. So it's quite an improvement over block-local checksums.
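To make the roll-up idea concrete, here's a toy sketch of my own -- nothing like the real ZFS code, which uses fletcher or SHA-256 checksums -- where each parent holds the checksums of its children and a root value summarizes the whole tree:

/* Toy sketch of parent-held checksums rolling up to a root.
 * The hash is a stand-in; ZFS itself uses fletcher or SHA-256. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t toy_checksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint64_t h = 1469598103934665603ULL;        /* FNV-1a style */
    for (size_t i = 0; i < len; i++)
        h = (h ^ p[i]) * 1099511628211ULL;
    return h;
}

int main(void)
{
    /* Leaf data blocks. */
    const char *leaf[2] = { "block A contents", "block B contents" };

    /* The parent holds a checksum for each child, stored away from the
     * child itself -- so a dropped or misdirected write to a leaf cannot
     * vouch for itself. */
    uint64_t parent[2];
    for (int i = 0; i < 2; i++)
        parent[i] = toy_checksum(leaf[i], strlen(leaf[i]));

    /* The "uberblock" checksums the parent, transitively covering
     * everything below it. */
    uint64_t root = toy_checksum(parent, sizeof(parent));
    printf("root checksum: %016llx\n", (unsigned long long)root);

    /* Verification walks down: recompute and compare at each level. */
    if (toy_checksum(parent, sizeof(parent)) != root ||
        toy_checksum(leaf[0], strlen(leaf[0])) != parent[0])
        printf("corruption detected\n");
    else
        printf("tree verifies\n");
    return 0;
}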
And finally -- there's actually a bit more, but this is the last leg I'm going to talk about regarding data integrity -- let's call it variable redundancy. If you look at any kind of block tree layout, again oversimplified here as the triangle, or the tree itself, you'll have to agree that it's a lot worse to lose blocks further up the tree.
Meaning, it's bad to lose blocks in general -- to have something bad happen to a block -- but if something bad has to happen, you want it to be at the leaf level, where that's all you'll lose: you've just lost the leaf. If you lose an interior node you start to lose much more data, and this logically includes the uberblock, which is probably the most catastrophic block to lose.
So with ZFS, the more important the metadata -- the further up the tree you go -- the more highly replicated it is. That's a nice semantic, or nice design principle, to have with regard to data integrity. Okay, let's talk about the pool model, at least a little bit, as far as how it impacts you as users.
All of you are familiar with this basic configuration: a drive and a filesystem. That's what most of us have. It works out pretty well; it's worked out for a long time. The problem is when you fill this drive up: you're going to have to buy more storage.
And for a lot of you the first step would probably be just to create another filesystem, and that's fine. It's something you're going to have to manage, though -- maybe new files go on the new filesystem, and you have to remember that the old files are on the old filesystem. A little bit of management overhead, a little bit more clumsy, but it's not the end of the world. It's certainly something that's served all of us fairly well for a long time.
A lot of you would bring up a logical volume manager, at least a traditional one, where you concatenate the two drives together, and that's fine. That's actually a lot better in terms of management than the previous situation. Of course, if I started out with the previous situation I'd have to back everything up, create my volume, and then restore everything onto it.
Again, not the end of the world. This is certainly a better model, but it's still pretty rigid as far as the association between my filesystem, the volume, and my initial choice to create this particular concatenated volume. In summary, there's always going to be this association between a partition or a volume -- some kind of dev node -- and every filesystem.
Sure, you can grow and shrink, most of the time anyway, by hand or explicitly, so there are ways you can get around this as far as growing into new storage as it becomes available. But I think the main point here is that you have these kind of stranded storage islands: with this layout, anything in filesystem 3 that's excess or extra, whether it be extra bandwidth or extra space, cannot be utilized by filesystems 1 and 2.
Once you've made up your mind, rebalancing becomes a very manual process -- or impossible, in some cases. Things are very rigid; there's a very rigid association between the layers here, and so things are pretty much set once you make up your mind.
Looking at the ZFS pool model, we start with the fact that filesystems are built on top of storage pools. Just like with LVMs, pools abstract away the physical storage; that's what they have in common. But unlike traditional logical volume managers, all available space and bandwidth in the pool is actually shared by all the filesystems on top of it.
And here's what's kind of interesting. The filesystems themselves are very lightweight in terms of being able to create and destroy as many as you want. They are very similar semantically to folders and directories. Not quite the same but they are very similar especially from an end-user point of view.
You guys don't give any second thought to creating folders or directories. You don't have to worry about how much space is available in general or associating it with a partition or a filesystem or anything along those lines. You just go ahead and create them. Same thing with the filesystems here.
Pools themselves can be configured with mirroring, concatenation, or this RAID-Z configuration, or a combination of the three. And to explain this a little better I'll go back to that same scenario from earlier: we have a single drive, we create a ZFS pool -- a zpool -- there, and put a filesystem on top of it. That looks just like the logical volume scenario.
But with the pool model you can create as many filesystems as you want, as many or as few. It's very dynamic; you are not stuck with your initial decision. Go ahead and create as many as you want; they all share the bandwidth and the space of that drive. In this case I kind of gave them folder-like names -- Movies and so on. A very dynamic model here.
When you run out of space, go ahead and add a new drive, the pool expands, you don't have to do anything with regard to all those files in those filesystems or those folders, the Movies, Pictures, Documents and so on. You don't have to do anything. There's no explicit action you have to take. They all immediately see all that extra space, can take advantage of that extra bandwidth there. So again a very dynamic model.
Growing and shrinking is automatic; there is no explicit anything here, and like I keep reiterating, all the space and bandwidth of all the underlying storage devices is shared. For those of you thinking this is only nice in multi-spindle configurations, these dynamic pool properties apply to single-spindle configurations as well. I know a lot of you as developers may have some big drives in your desktops or your laptops.
And, you know, you may have a couple of boot partitions and a big data partition, and once you've settled on that configuration choice, you are somewhat stuck. Meaning, if you misjudged or miscalculated and want to move things around, you can do it -- actually, most of you in this room could probably do it mechanically -- but it can be a bit of a pain. With ZFS you just have a pool on top of the drive and create as many filesystems as you want; you don't have to worry about preallocating or precalculating sizes, and things work out very nicely.
How are we doing on time? We have a lot of time. So, RAID-Z. I want to talk about this. There's also a mirror configuration -- it's actually pretty interesting -- but I don't have any slides for that at this point in time. RAID-Z is something that's specific to ZFS, and the best way I can introduce it is as an improvement over RAID-5. I think most of you are familiar with traditional RAID-5: it's an N plus 1 model, it's pretty simple, and it works well today.
The basic idea, of course, is to stripe your data across N data disks plus one parity disk, distributing the parity out there, but the key here is that the value of N is always fixed. Let's just say we're going to use some exclusive-or function for our parity. That N value, that fixed value, is what drives the error detection and also the reconstruction. In this example and the ones that follow I'm going to use a fixed value of 3: 3 data disks and 1 parity disk.
The problem is that with any RAID-5 configuration, any value of N, you are always going to have a situation where you have an outstanding write that does not fill the full width of that stripe. That's always going to be the case; there's never a situation where you can have only complete stripe writes. So in this case, representing that N plus 1, or 3 plus 1 -- 3 data blocks, 1 parity block -- let's just say my workload right now is just 1 data block.
I go ahead and write that data out, and at that exact moment in time you'll notice the parity is no longer valid. If I were to crash at this point, you'd be in a little bit of trouble just because things are out of sync. Of course, if nothing happens, you go ahead and update the new parity and things work out.
But, like I said earlier, because you cannot atomically update all those disks at the same time, if a fault occurs after the data is updated -- the new data -- but before the parity update, you'll have corruption. This is known in the industry as the RAID-5 write hole. It's a side effect of the fact that you are updating data in place. There are certainly workarounds for this on the enterprise side -- battery-backed NVRAM and so on -- but those are typically associated, again, with the enterprise space.
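To make the parity arithmetic concrete, here's a toy sketch of my own (not from the session) of the fixed-width, exclusive-or flavor of RAID-5 just described: parity is the XOR of the three data blocks, and any one lost block can be rebuilt from the surviving three.

/* Toy sketch of 3+1 RAID-5 parity: parity = d0 ^ d1 ^ d2, and any
 * single lost block is the XOR of the surviving three blocks. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 8   /* tiny blocks, just for illustration */

static void xor_into(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < BLOCK; i++)
        dst[i] ^= src[i];
}

int main(void)
{
    uint8_t d[3][BLOCK] = { "AAAAAAA", "BBBBBBB", "CCCCCCC" };
    uint8_t parity[BLOCK] = { 0 };

    /* Full-stripe write: compute parity over all three data blocks. */
    for (int i = 0; i < 3; i++)
        xor_into(parity, d[i]);

    /* "Lose" disk 1, then reconstruct it from parity and the others.
     * Note this only works if data and parity were updated together --
     * exactly the consistency the write hole breaks. */
    uint8_t rebuilt[BLOCK];
    memcpy(rebuilt, parity, BLOCK);
    xor_into(rebuilt, d[0]);
    xor_into(rebuilt, d[2]);

    printf("rebuilt disk 1: %.7s (expected %.7s)\n",
           (char *)rebuilt, (char *)d[1]);
    return 0;
}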
So, on to RAID-Z. How RAID-Z gets around this problem -- or solves it, if you will -- is variable stripe width, and that's something that's very well captured by this diagram. The colors basically represent the stripe width, and you can see that it's variable: you have some very wide stripes, some very small ones, and things in between.
The point is that because the stripe width is variable, every write becomes a full-stripe write. So right off the bat there's not going to be any read-modify-write on the performance side of things. And when you couple that with copy-on-write -- like I said earlier, you never overwrite live data -- the RAID-5 write hole is eliminated completely. So you solve the problem without any need for NVRAM, special hardware, or special expense. The ZFS guys have a saying about ZFS loving cheap disks.
Here's our current status as far as ZFS on Mac OS X Server. In Leopard -- I guess 10.5.0, the initial version of Leopard -- as somebody has already discovered, it's read-only by default. Not very interesting, not much you can do; it's the read-only version of the kext.
Over the next few weeks or so -- I'd say two weeks on the outside, hopefully -- we will be introducing a read/write beta download for all of you as ADC members. Anybody can download it for free, check it out, play with it, kick it around. At this point it's going to be mostly command line; it's not integrated with any of the higher-level GUI apps and so on yet.
We have not worked on performance at all at this point, so that's something we have to get going on, as far as spending some time there. And snapshots: snapshots actually do work, we just haven't ported something called GFS, or its equivalent -- some type of equivalent functionality -- so you won't be able to browse them as easily, as far as mounting them under .zfs. You can take snapshots, but browsing one is not going to be as easy until we get that component in place. That's basically where we're at.
For more information, there's a ZFS mailing list you can send feedback and comments to. There's also a ZFS Radar component for those of you who have Radar access and want to file bugs. For any questions in general -- ZFS, or filesystems in general -- please talk to our technology evangelist, Deric Horn; his e-mail address is up there.
And you can interact with us tomorrow at the Filesystem Lab, hopefully 10:30 in the morning is not too early for most of you. We'll be there for, I think, an hour and a half. And then there's that Bash on Thursday where most of us will be attending. You can interact with us there as well or even after this session here.