WWDC04 • Session 108

File System Best Practices for Application Developers

OS Foundation • 40:10

Learn how to structure your I/O to improve your application's performance. We discuss direct and async I/O, when using mmap is appropriate, as well as upcoming enhancements to the file system.

Speakers: Pat Dirks, Chris Emura

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

Hi, everyone. My name is Pat Dirks. I've been working on file systems and file servers on the Mac ever since the days when the most excellent Mac 512KE supplanted the 512 as the top of the heap. So it's been a while, and we've come a very, very long way. We put together a talk to give you some highlights of things that you should be aware of when you're using the file system, mostly from an application perspective.

We're going to go into a little bit of background, just enough to set the stage and give all of these things an anchor so they make sense. We'll talk a little bit about the BSD-level APIs, but also about the higher-level frameworks. So what we're going to focus on is how to make your applications go faster, basically how to make your I/O go faster and how to work better overall with the system, particularly in light of some new technologies that we're introducing: extended attributes, ACLs, that sort of thing.

We'll say a few words about monitoring what's going on in the system, and about common areas where applications often get bogged down doing things that are unnecessarily wasteful. So we'll try to hit all the highlights and then we'll try to leave lots of time for Q&A afterwards, where we'll have the whole team up and you can ask anything you like.

So, a bit of background on the kernel and the BSD parts of the system. There are actually a number of different frameworks that people access these APIs through. There's the object-oriented Cocoa set of frameworks that people use. There's the familiar Carbon APIs. And the low-level BSD APIs that are the foundation for all of them. And the thing to bear in mind is that everything that is done in the system is done at some point through some BSD API. So, if you don't mind getting down to the really nitty-gritty, the BSD API will let you do anything you want to do.

Inside the system, at the highest level, is the virtual file system layer. It provides a common set of system call interfaces to all other parts of the system, as well as some common services like path name translation, a certain amount of access checking, and NFS export service that are common to all file systems. Beyond that, it uses an extensible interface to the lower-level, file-system-specific implementations.

It also works with the buffer cache and VM, which I'll say a few more words about in a moment. Because the buffer cache, in I think it was Jaguar or maybe Panther, was changed from a separate buffer pool that sat alongside the VM system and competed with VM over who had the correct copy of something, to a unified buffer cache that coordinates with VM to make sure that there's only one copy of your data in the system at any one place.

And there's a set of code called UBC, the Unified Buffer Cache, that basically coordinates who owns a given page at any given time and swaps ownership between the two. So the important thing is, whatever there is of your data, there's only one copy of it floating around in the system.

Now, at the lowest level, there's a very simple set of I/O calls. There's open, close, read, write, truncate, seek, just the bare necessities. And there's mmap, which we'll talk about at some more length later on. We have some basic metadata operations. You can open a directory for reading.

You can read through it and get some information on the various items that are in a directory. We'll talk about that some more later too, because that's another area where you might want to be careful. stat is the basic call to get some metadata on an item. And then there's a few other calls.

We extended the basic BSD mechanism a bit to support some of the things that HFS had brought to Mac OS 9 when we first introduced Mac OS X. You couldn't quite port all of the existing Mac OS 9 functionality to the BSD APIs that I just mentioned. So we introduced a pair of calls, getattrlist and setattrlist, that let you do stat and chmod kinds of things in a slightly different, more extensible way.
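As a rough illustration of the shape of that API (a minimal sketch, not from the session; the attribute bitmap and the packed reply buffer are the parts to note), asking for just a file's modification time via getattrlist might look something like this:

```c
#include <sys/attr.h>
#include <sys/time.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

/* Minimal sketch: ask getattrlist for only the modification time of a path.
 * The reply buffer starts with a u_int32_t length, followed by the requested
 * attributes packed together in a fixed order. */
int print_mod_time(const char *path)
{
    struct attrlist attrs;
    struct {
        u_int32_t       length;
        struct timespec mod_time;
    } __attribute__((packed)) reply;

    memset(&attrs, 0, sizeof(attrs));
    attrs.bitmapcount = ATTR_BIT_MAP_COUNT;
    attrs.commonattr  = ATTR_CMN_MODTIME;      /* ask for only what we need */

    if (getattrlist(path, &attrs, &reply, sizeof(reply), 0) != 0) {
        perror("getattrlist");                  /* not every volume supports every attribute */
        return -1;
    }
    printf("%s modified at %ld\n", path, (long)reply.mod_time.tv_sec);
    return 0;
}
```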

And a new call, getdirentriesattr, that probably few people call directly, but is used by the frameworks in the system to do efficient enumeration of items in the same way that GetCatalogInfo did. In addition, we introduced SearchFS, which is an interesting API because it lets you search the entire catalog of a disk in more or less block order. So the performance of SearchFS is dramatically faster than doing a complete tree traversal of every single directory in the file system.

We added a few fcntl selectors to let you do things like pre-allocating space on a disk and other things like that. And new in the Tiger release, and we'll talk a bit more about that, are extended attributes and ACLs. In general, although I said you can do everything at the lowest level in the BSD system, you're probably best off using the higher-level frameworks, Cocoa or Carbon, in the system. Because the BSD level of APIs makes no attempt to hide all the differences

between the different file systems. So if some file system doesn't support a certain feature, it's just not going to work in the BSD call. And frameworks like Carbon will provide uniform access to things like resource forks and even emulate them if necessary on file systems that don't directly support them. So in general, you're best off using the highest level API that you can use.

So I'll say a bit more about the extended attributes. Extended attributes are new in Tiger. The release that you have has only a very limited implementation of the extended attributes. They're basically only supported on the HFS file system. But in Tiger, extended attributes will be supported across all file systems.

They are, however, going to be fragile in much the same way that the resource fork has always been. And one of the things I hope you'll take away from this talk is that if you're manipulating files in the file system, if you're making copies of things, if you're transferring things around, you should be mindful that, like a resource fork, they may have extended attributes associated with them, and you should be careful to preserve them.
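To make that concrete, here's a minimal sketch (not from the session) of copying every extended attribute from one file to another with the Tiger-era listxattr/getxattr/setxattr calls; it assumes fixed-size buffers, where real code would size the buffers by calling the functions with NULL first and handle errors more carefully:

```c
#include <sys/xattr.h>
#include <string.h>
#include <stdio.h>

/* Minimal sketch: copy every extended attribute from src to dst. */
int copy_xattrs(const char *src, const char *dst)
{
    char names[4096], value[65536];
    ssize_t names_len = listxattr(src, names, sizeof(names), 0);
    if (names_len < 0) { perror("listxattr"); return -1; }

    /* The name buffer is a packed list of NUL-terminated attribute names. */
    for (char *name = names; name < names + names_len; name += strlen(name) + 1) {
        ssize_t value_len = getxattr(src, name, value, sizeof(value), 0, 0);
        if (value_len < 0) { perror("getxattr"); continue; }
        if (setxattr(dst, name, value, (size_t)value_len, 0, 0) != 0)
            perror("setxattr");     /* the destination volume may not support xattrs */
    }
    return 0;
}
```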

While I'm on the subject of the resource fork, by the way, if your code still has any references to "filename/rsrc", now is definitely the release to get rid of that. We have supported that for a while, but it's not the way to refer to things. If you want to, you can use "filename/..namedfork/rsrc". The biggest trick is always doing a safe save.

Unfortunately, Exchange Files, which was invented for that purpose in Mac OS 9, is not supported on all file systems or all protocols. Besides which, you're going to have to be, in general, careful to preserve extended attributes and ACLs as well. If you were here at 2 o'clock today, you would have heard a talk about the ACLs and what to do about that. We have a few recommendations.

[Transcript missing]

Wherever possible, if Exchange Files is supported and your document format is a single straight data file, the best way to do a safe save is to do an Exchange Files. But you're always going to have to have a fallback option, and the best thing you can do, if you've created the file correctly, is to just do a rename in place.

If your document format is a bundle, if it's really a directory with files contained underneath it, you can do one of two things. Either create a complete parallel directory, then rename the old one out of the way and rename the new one into place. It's not atomic, but it'll work. The other option is to create the updated files somewhere inside the top-level directory, so they're invisible to the user, and then, using either Exchange Files or rename, switch the pieces that you're updating into place. And only you can decide, based on your application, which model makes the most sense.

Now, some things to be aware of. Locking support is unfortunately not uniform across all file systems. It's good practice to lock files that you think there may be contention for, but there will be file systems, either local or network file systems, that just don't support that. Rename does not preserve file IDs.

That was one of the big advantages that Exchange Files brought, is that if you had an alias to a file, Exchange Files would replace the data while leaving the file ID intact, and an alias would find it again. That's one of the appeals of using Exchange Files. Fortunately, alias data structures have some additional resolution mechanisms built in that will make it possible to find the file, even if the file ID is lost because you used rename instead of Exchange Files. But Exchange Files is still preferable. And as I mentioned, rename cannot atomically replace directories.

In addition, one thing to be careful about is that not all network file systems allow you to rename files that are open. So if you have a choice about it, it's best to close the file first, or at least flush it, before you try replacing it. And on the subject of that, be careful.

Because on some file systems, close may actually do a very significant amount of I/O. On a WebDAV file system, for instance, it's not until close time that the actual data is all transferred over to the server. There may also be systems that don't actually guarantee that your quota has been checked properly, for instance, until the file is actually closed and on disk. So you should try closing the file before you use it to replace data. And if you can't close it, at least call fsync on it to force a flush of the data, and check for errors on both of those before you allow the data to be replaced.
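Putting those recommendations together, a bare-bones safe save at the BSD level might look something like this sketch (illustrative only; the temporary-file naming is made up, and a real implementation would also preserve extended attributes and ACLs, and use Exchange Files where the volume supports it):

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/* Minimal safe-save sketch: write the new contents to a temporary file,
 * flush it all the way out, close it, and only then swap it into place. */
int safe_save(const char *path, const void *data, size_t len)
{
    char tmp[1024];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);   /* hypothetical naming scheme */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return -1; }

    if (write(fd, data, len) != (ssize_t)len ||   /* watch for short writes           */
        fsync(fd) != 0 ||                         /* errors may only show up here...  */
        close(fd) != 0) {                         /* ...or even at close time         */
        perror("write/fsync/close");
        unlink(tmp);
        return -1;
    }

    /* exchangedata(tmp, path, 0) would preserve the file ID where supported;
     * rename is the portable fallback. */
    if (rename(tmp, path) != 0) { perror("rename"); unlink(tmp); return -1; }
    return 0;
}
```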

So a word about file copying, and this is actually one of the shortest sections of the talk. We've had lots of requests over the years to allow people access to the Finder copy engine, and that's what we're introducing in Tiger. So you'll actually get access, not directly to the Finder's engine, but to the same engine that the Finder uses.

But you'll be able to do copies exactly the same way the Finder does. And if you can use this new FSFileOperation API to transfer files around, moving or copying, that will be by far the best way to do it, because we'll guarantee that it does the best job possible on the various file systems involved.

The API will let you do either synchronous or asynchronous copies, and you can set a callback routine to be called at a rate that you specify so that you can do whatever animation you want to do to show the copy in progress or anything like that. It's very nice, and that's what the Tiger Finder is going to be using. So that's an example where the highest possible level of API is just ideal.
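For the synchronous flavor, a call might look roughly like this (a sketch assuming the Tiger FSFileOperation entry points such as FSPathCopyObjectSync; check Files.h for the exact signature and the ownership of the returned path):

```c
#include <CoreServices/CoreServices.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch: copy a file the way the Finder would, synchronously,
 * assuming the Tiger FSFileOperation API. */
int copy_like_finder(const char *source, const char *dest_dir)
{
    char *target_path = NULL;
    OSStatus err = FSPathCopyObjectSync(source,
                                        dest_dir,
                                        NULL,                 /* keep the original name */
                                        &target_path,
                                        kFSFileOperationDefaultOptions);
    if (err != noErr) {
        fprintf(stderr, "copy failed: %d\n", (int)err);
        return -1;
    }
    printf("copied to %s\n", target_path);
    free(target_path);
    return 0;
}
```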

So, a few words about performance. You may have more choices about how to do I/O in Mac OS X and in Tiger than you may have otherwise. There are a few caveats that apply generally. Obviously, the fewer I/Os you do, the better. And the more you can aggregate I/Os into a few large I/Os, the better.

The kernel will, to some extent, coalesce operations that you do to make fewer larger operations out of a few single ones. It will do read-ahead caching and write-behind caching, so you may be picking up benefits that you weren't even aware of. There is something very specific, mostly on HFS, about zero fill. The Unix file semantics specify that data that has not been written should read back as zeros. And the system goes out of its way to make sure that that is true, even on file systems like HFS, which don't actually support sparse file semantics.

So what that means is, if you take a file, create it, skip ahead 100K into the file and start writing there, and you try to read something out of the middle, the system will actually supply zeros for all those areas and write them to disk if necessary. If you close the file, only having written that last little bit at 100K out there, it will fill the intermediate space on disk with zeros. So that's another example of a significant amount of I/O that you might incur when you do close.
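A toy program like this (a sketch, not from the session) exercises exactly that case; the read from the unwritten gap comes back as zeros, and on HFS the gap will also have to be zero-filled on disk by the time the file is closed:

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    int  fd = open("/tmp/zerofill-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
    char byte = 'x', middle = 0;

    pwrite(fd, &byte, 1, 100 * 1024);   /* write one byte 100K into the file    */
    pread(fd, &middle, 1, 50 * 1024);   /* the unwritten middle reads back as 0 */
    printf("byte at 50K is %d\n", middle);

    close(fd);                          /* on HFS, close may trigger zero-fill  */
    return 0;                           /* I/O for the whole 100K gap on disk   */
}
```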

I have a few words about zero fill later too. Now in general, there are a couple of different ways that you can do I/O, and we'll talk about them all in turn. You can do standard buffered reads and writes. You can directly memory map a file on disk, anywhere on disk, into your address space and access it that way.

You can do uncached, unbuffered I/O, which is great if you're ripping through very large data sets that you have no reason to believe will be touched again later. And we're now introducing support for true asynchronous I/O in the kernel as well, and that's another option you may want to consider.

Now in the buffered read/write case, it's the most general of all the mechanisms. You can transfer any amount of data to any place you want in your address space. There are some costs associated with that. The data is first transferred into the buffer cache, and then explicitly copied out of the buffer cache into your application space, dirtying a page in the process. So you're incurring an extra copy and you're dirtying the page.

But if you're reading or writing small files, and if you have no reason to think that they might not be touched shortly after, or if you really need the flexibility of writing a small amount of data in a very particular place in your address space, this is the perfect mechanism to use. And if you're just doing reads and writes and doing nothing else, this is the mechanism you're using.

By contrast, you can, if you want to give up control over the location of the data, use memory mapped I/O. You can call mmap. You have to call the low-level BSD system call; there's no framework-level call for this. But you can ask the system to map a file into your address space, and it will just appear.

You'll be given the address, and you can access the data right there, and it will be paged in directly off the file on disk. If you're reading in a file, this is a great way to read it. If you don't care where the data ends up, you can just use your favorite bcopy routine to read your file.

There's only a single copy made of the data because the system doesn't have to relocate it to the place where you requested it be put. Everything is left clean, so if the pages need to be reused later, they can simply be tossed. They don't need to be written out to swap space.

And although you can overwrite data that way, and it'll be paged out to the file, you can't actually extend the file that way. That's the only limitation. The big caveat on using memory mapped I/O is that if you encounter any I/O errors because, say, it was on some removable disk and the disk was removed, or it was on a network file and the network connection was broken or something, it's as if you hit a bad RAM chip.

I mean, you get a memory access error. So it's a little bit more severe than just getting an error back on some I/O transfer. You may want to keep that in mind and limit that type of operation to local, non-removable media. But with that caveat in mind, it may be an attractive way to do I/O, and it's very clean.
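Here's roughly what that looks like at the BSD level (a minimal read-only sketch; a real program would also guard against the I/O-error-as-memory-fault case just described):

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/* Minimal sketch: map a whole file read-only and walk it in place.
 * The pages are read straight from the file and stay clean, so the VM
 * system can simply toss them instead of writing them to swap. */
int checksum_file(const char *path, unsigned long *sum_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return -1; }

    const unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                  MAP_FILE | MAP_PRIVATE, fd, 0);
    close(fd);                               /* the mapping stays valid after close */
    if (p == MAP_FAILED) { perror("mmap"); return -1; }

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* faults pages in directly from disk */
        sum += p[i];

    munmap((void *)p, (size_t)st.st_size);
    *sum_out = sum;
    return 0;
}
```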

Some of the benefits of doing that kind of I/O can also be had by doing direct uncached I/O. You can call fcntl to request that I/O to a certain file descriptor be done uncached, and the system will take advantage of that as much as possible

[Transcript missing]

The system cache is filled with pages from your application, or pages from other files that you've touched, or pages from the disk structures that you're hitting. You don't end up forcing them all out of memory just because you're reading some multi-megabyte data file. So it's something to think about. And there are ways that you can do uncached I/O from the Carbon framework, for instance. We'll show that in a minute.
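At the BSD level, the switch is a single fcntl (a sketch; F_NOCACHE only advises the system, and large, page-aligned reads get the most benefit):

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/* Minimal sketch: read a large file without polluting the unified buffer cache. */
int read_uncached(const char *path, char *buf, size_t bufsize)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    if (fcntl(fd, F_NOCACHE, 1) != 0)        /* ask for uncached I/O on this descriptor */
        perror("fcntl(F_NOCACHE)");          /* harmless if the request is ignored      */

    ssize_t n;
    while ((n = read(fd, buf, bufsize)) > 0) {
        /* process the chunk; other files' pages stay resident in the cache */
    }
    close(fd);
    return (n < 0) ? -1 : 0;
}
```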

Async I/O is a standard API that we're introducing support for in Tiger. It's a complete implementation of the standard POSIX async I/O. You can initiate async I/O requests, and you can get notified by a variety of mechanisms. The I/Os are handed down to the kernel in parallel, so in theory they can proceed in parallel to some extent. And if you're calling the Carbon PBReadForkAsync or PBWriteForkAsync, you can specify explicitly that you want the system to do concurrent async I/O.

By default, the Carbon framework actually emulates async I/O by having a separate thread that is introduced by the system. The I/O is then sent to the kernel to take care of the I/Os while the rest of your process is allowed to proceed. And so things are essentially single-threaded on the async side. But if you want, you can now specify that the system should do true concurrent asynchronous I/O, and it will get done that way.
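In BSD terms, the POSIX calls look roughly like this (a sketch that issues one read and waits for it, where real code would typically keep several requests in flight):

```c
#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

/* Minimal sketch of POSIX async I/O: queue a read, do other work, then collect it. */
int async_read_example(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); close(fd); return -1; }

    /* ... the application can do other work here while the kernel does the I/O ... */

    const struct aiocb *const list[] = { &cb };
    aio_suspend(list, 1, NULL);              /* block until this request completes */

    int err = aio_error(&cb);                /* 0 on success, else an errno value  */
    ssize_t got = aio_return(&cb);           /* bytes transferred                  */
    close(fd);

    if (err != 0) { fprintf(stderr, "async read failed: %d\n", err); return -1; }
    printf("read %ld bytes asynchronously\n", (long)got);
    return 0;
}
```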

[Transcript missing]

The system goes out of its way to fill pages with zeros if necessary. It also tries to defer that as much as possible; it tracks in memory what has actually been written. But it's not as simple as that: skipping around and doing your I/O in a non-sequential way won't necessarily end up doing writes of zeros, but it does introduce that possibility, and if you can possibly write your files out sequentially, you'll be that much better off.

You can use Allocate instead of SetEOF to allocate space without forcing zero fill, because SetEOF actually moves the EOF of the file out on disk and forces zero fill of areas that you don't actually write as part of that write. Allocate simply reserves the space in advance. Allocate is not supported on all file systems, so you should just ignore any errors you get, because the write will allocate the space as it goes anyway. But calling Allocate is a better idea than calling SetEOF.
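The BSD-level analogue of Allocate is the F_PREALLOCATE fcntl (a sketch; the fstore_t fields are worth double-checking in fcntl.h, and as noted, failures can simply be ignored):

```c
#include <fcntl.h>
#include <string.h>
#include <stdio.h>

/* Minimal sketch: reserve space for a file up front without forcing zero fill. */
void preallocate(int fd, off_t length)
{
    fstore_t store;
    memset(&store, 0, sizeof(store));
    store.fst_flags   = F_ALLOCATEALL;       /* all or nothing                 */
    store.fst_posmode = F_PEOFPOSMODE;       /* allocate from the physical EOF */
    store.fst_offset  = 0;
    store.fst_length  = length;

    if (fcntl(fd, F_PREALLOCATE, &store) != 0)
        perror("F_PREALLOCATE");             /* not supported everywhere; the writes
                                                will allocate as they go anyway     */
}
```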

Finally, if the file you're writing is just a scratch file, and you're writing it randomly in different places, so you're going to be leaving lots of holes, truncate the file to zero before you close it. That will preclude the system from writing any zeros out to disk for data that you're going to just throw away anyway. So if you have a scratch file, and you're going to delete it as soon as you close it anyway, truncate it to zero first, so you skip any zero fill that the system might otherwise have to do for you.

A few words about directory enumeration, which is another area where lots of apps can spend lots of time. You can call the raw BSD readdir and stat calls on full paths. You can use GetCatalogInfo in the Carbon framework to get the equivalent of the information out of stat.

There are new calls that were introduced. The getdirentriesattr call I referred to earlier is the BSD-level version of this. GetCatalogInfoBulk is the Carbon equivalent of that call, and it will enumerate a directory and collect the metadata that you request as it goes.

The big way that you can save there is that the call actually has a whichInfo field, which specifies which fields of the metadata you're actually interested in. Some file systems would otherwise end up emulating some of the data, or going through contortions to generate or derive some version of it; if you don't actually care about some of the information, you can indicate that by setting whichInfo carefully. And if you set that to as small a set as possible, the system will only collect together as much metadata as is absolutely necessary to satisfy the request.
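In Carbon terms, a flat enumeration that asks for only the data sizes might look roughly like this (a sketch; consult Files.h for the exact FSGetCatalogInfoBulk signature):

```c
#include <CoreServices/CoreServices.h>
#include <stdio.h>

/* Minimal sketch: enumerate one directory, asking only for data-fork sizes,
 * so the file system doesn't have to derive metadata we don't care about. */
void list_sizes(const char *dir_path)
{
    FSRef dir;
    if (FSPathMakeRef((const UInt8 *)dir_path, &dir, NULL) != noErr) return;

    FSIterator iterator;
    if (FSOpenIterator(&dir, kFSIterateFlat, &iterator) != noErr) return;

    OSErr err = noErr;
    while (err == noErr) {
        FSCatalogInfo info[32];
        HFSUniStr255  names[32];
        ItemCount     fetched = 0;

        err = FSGetCatalogInfoBulk(iterator, 32, &fetched, NULL,
                                   kFSCatInfoDataSizes,     /* only what we need */
                                   info, NULL, NULL, names);
        for (ItemCount i = 0; i < fetched; i++)
            printf("item %lu: %llu bytes\n",
                   (unsigned long)i, (unsigned long long)info[i].dataLogicalSize);
    }
    FSCloseIterator(iterator);        /* err is errFSNoMoreItems when we're done */
}
```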

Finally, the SearchFS system call has a Carbon equivalent, CatalogSearch. It's an interesting mechanism that, if you're going to do a very large search of essentially the whole disk anyway, may be a far better way of searching the data. It's surprisingly fast. In about 3-4 seconds you can search an enormously large hard disk catalog in its entirety, and you can search for matches based on a whole slew of metadata parameters or file name substrings or things like that. So if you're scanning a volume, use SearchFS. Don't do a tree traversal of the whole thing starting from the root.

Think carefully about how you access data and make sure that you choose an appropriate mechanism for the type of access that you're doing, or perhaps change the way you access or write data to better suit the mechanism and avoid zero fill on file systems like HFS. There's a great quote to the effect that more sins are committed in the name of optimization, however illusory, than for almost anything else, and that's very true.

If you're going to set about optimizing your application, analyze it first and make sure you find the true hotspots in your application before you go wild introducing some new caching scheme or something because you're convinced that this part of the code is what's really slow, and instead end up creating a much larger memory footprint that costs you performance when it should have gained you some.

If you're porting an app from another system, this is a good time to consider how you're actually hitting the disk and the file system. Some things that you may think are cheap are in fact expensive, and some things that would have been expensive on other systems are in fact quite cheap.

So this is a good time to revisit some of the fundamental assumptions that you may have made. And speaking of assumptions, don't make assumptions about the speed of accessing some of these things, because innocent things like preferences could easily be located across the net on an NFS server that may have crashed a minute ago, for instance. What you think is a very quick access to refresh your preferences or something could potentially take a long time. So be careful about what you assume. Now, a few words about security and ACLs, which are both new in the Tiger release.

You're by now probably familiar with the standard UNIX permissions model that Mac OS X has supported all this time. There's a single owner, a single group, and three categories of access for each one: read, write, and execute. There is a slight wrinkle in that we allow the ignoring of who actually owns a file on removable media, and we do that by default. But this is the basic model.

Access Control Lists change all that. Instead of having one owner and one group and one others category, you can invent as many categories as you want. And you can build up a large list of ACEs, Access Control Entries, that specify for individual categories what they're allowed to do or what they're not allowed to do.

And the system takes the whole list of things into account before falling back to the standard POSIX permissions. So it's very powerful, and it lets you do some of the things that AFP servers, for instance, let you do, where owners and groups can each be associated with their own set of access rights,

as owner or as group. If you were at the 2 o'clock ACLs talk, you saw that the group mechanism has been greatly expanded. The system is no longer limited to a fixed 16 groups per user. It has given up on the idea of trying to pre-enumerate the groups that every user belongs to.

Instead, there is a very flexible user-level group membership daemon that the system consults to determine which users are members of which groups dynamically, and it caches those results. As a result, there are things possible that were not easily possible before, such as groups that contain other groups, and groups that are determined entirely over the network in a very flexible way.

That's great stuff. The ACL API is modeled on the POSIX draft proposal that was not adopted and has since been withdrawn, but it's very much based on that. And the actual permissions, rather than being strictly the POSIX read/write/execute model, are a much more flexible, much more fine-grained set of permissions that, for instance, separate control over metadata and extended attributes from control over the data of a file. So it would be conceivable, for instance, to have a set of images, photos, that are read-only documents whose metadata you can control, and set your own keywords on and things like that. So what you can and cannot do has become much more flexible.
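For a feel of the API shape, here's a small sketch (assuming the acl(3) routines that ship with Tiger) that just reads a file's extended ACL, if it has one, and prints it in text form:

```c
#include <sys/acl.h>
#include <stdio.h>

/* Minimal sketch: fetch a file's extended ACL and print it. */
void print_acl(const char *path)
{
    acl_t acl = acl_get_file(path, ACL_TYPE_EXTENDED);
    if (acl == NULL) {
        printf("%s has no ACL (or the volume doesn't support them)\n", path);
        return;
    }
    char *text = acl_to_text(acl, NULL);    /* human-readable list of ACEs */
    if (text != NULL) {
        printf("%s:\n%s\n", path, text);
        acl_free(text);
    }
    acl_free(acl);
}
```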

Besides more flexible control over a single file, there's also more flexible control over the metadata and the directory structures themselves. There are ways that you can inherit access control information from a parent onto a file that gets created. There are separate controls over the ability to control items within a directory without regard for the access controls on the item itself. There's a lot of new stuff there. There's a session tomorrow at 5:00, not sure where, about the extended metadata APIs. There was a session today at 2 o'clock. I'll have references to all those later.

So finally, notification is another area where an app can easily spend lots of cycles, perhaps unnecessarily. The obvious way you might go about seeing whether something changed, if you have a folder that people can drop plugins into or something like that, is to periodically poll and see if the mod date of that directory has changed. That will work, but you're spending time even when nothing's happening in the system. One improvement on that is the whole FNSubscribe/FNNotify mechanism that's been around. But a much nicer solution that we introduced support for is kqueues. And I'll say a few words about kqueues.

The basic model of kqueues was derived from the observation that applications spending lots of time in basically a select loop or poll, and the problems are not fundamentally different between the two, are very wasteful just in the system call overhead to do those things. So kqueues were introduced to give a model where, instead of specifying on the select call itself the set of file descriptors that you're interested in data being ready on, you create a kqueue and get back a file descriptor, just like an open file would give you.

And then you call kevent to specify the set of file descriptors that you're interested in I/O happening on. And then your process is put to sleep, waiting for anything to happen on any of those file descriptors, and when it wakes up, it is told something has happened on this file descriptor.

So just the basic select loop mechanism is much more efficient that way. We've implemented a source-compatible version of the FreeBSD kqueues, but we've also extended the mechanism a bit by introducing additional event filter classes. One of the classes is a Mach port filter class that will tell you whether some message has arrived on a Mach port. And another one tells you about things happening in the system, for instance, whether file systems get mounted or unmounted anywhere in the system; if one of those events happens, your kqueue will get an event delivered to it.

The one thing to keep in mind about kqueues, and part of the reason why they're so efficient in the kernel, is that it's not an actual queue of individual discrete events being notified to you, but rather a single notification that some condition has become true in the system. So you wouldn't get ten little events saying three bytes are available on your file descriptor, five more are available on this file descriptor, ten more. You would get a single event that says there's data available on this file descriptor.

That's something to be mindful of in the model. It's not supported on all volume formats, because the implementation does require support inside the file system itself. And if you are a client of a network file server, you should be mindful that you may get notifications, but you will only get them about the local events going out to that file server, not necessarily all the things that other clients of that server are doing on the server itself. Those are two things to keep in mind.

And when you open a file descriptor on a directory, for instance, to request notifications about that directory, there is a flag you should set, O_EVTONLY, which says that this file descriptor is going to be used strictly for event delivery. That will let the system correctly decide at unmount time whether a file system is truly busy or not.

So to keep file descriptors that are open solely for the purposes of notification from preventing you from unmounting a file system, if you set O_EVTONLY on the open, the file system will let the unmount go ahead anyway. So O_EVTONLY is one to be very mindful of.
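Putting those pieces together, watching a directory for changes looks roughly like this (a minimal sketch; a real app would keep the kqueue around and service it from its event loop):

```c
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/* Minimal sketch: watch a plug-ins folder for writes/renames/deletes with kqueue.
 * O_EVTONLY keeps this descriptor from holding the volume busy at unmount time. */
int watch_directory(const char *path)
{
    int dir_fd = open(path, O_EVTONLY);
    if (dir_fd < 0) { perror("open"); return -1; }

    int kq = kqueue();
    if (kq < 0) { perror("kqueue"); close(dir_fd); return -1; }

    struct kevent change;
    EV_SET(&change, dir_fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
           NOTE_WRITE | NOTE_RENAME | NOTE_DELETE, 0, NULL);

    for (;;) {
        struct kevent event;
        int n = kevent(kq, &change, 1, &event, 1, NULL);   /* sleeps until something happens */
        if (n <= 0) break;
        /* One notification means "something changed", not a queue of discrete edits. */
        printf("directory changed (flags 0x%x), rescan it\n", (unsigned)event.fflags);
    }
    close(kq);
    close(dir_fd);
    return 0;
}
```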

Overall, unfortunately, despite the best efforts of the higher-level frameworks, there are lots of capabilities that are very specific to individual file systems. And if you rely on some particular behavior in the system, you should probably check beforehand whether the particular file system you're operating on supports that. There's a variety of ways that can be done. There are getattrlist variants that will tell you something about the capabilities of a certain volume format. At the Carbon level, there are the GetVolParms calls.

But be mindful of it and either test ahead of time or even better, be prepared with a fallback option in case the behavior you need is not supported by the particular file system. But with that caveat, make use of all the file system special capabilities that you want.

As far as the tools that you all have available, one that I find myself using on at least a daily basis is fs_usage. It's a great tool that lets you monitor all the file system activity in the system. You can restrict it to a particular application, or you can monitor everything that's happening in the whole system. If you want to find out, for instance, who it is that's writing a certain file or creating a certain something, you can find out the file system call that's made, the process that made it, and what the results of the call were, so you can see errors happening.

If some application is throwing up its hands without any clear indication of what's going wrong, you can run fs_usage and see what kind of file system errors it's getting. It's a great tool. sc_usage does the same thing for other system calls. The features have been somewhat expanded in Tiger since the last release you may be familiar with, so it's worth checking out the -f option.

And if you run with the debug versions of the system frameworks, you'll get tracing at that level. So you'll find out not just that a stat call is being made, but that the application is calling GetCatalogInfo, which is causing the stat call to be made. So it's good stuff.

top is another great tool to get a quick look, at a very high level, at what's going on in the system. You see all the running processes. You'll see who's using CPU time, who's making system calls, because you can see the context switches happening. There's some summary information that'll tell you the paging rates in the system, the number of I/Os that are happening, how busy the system is. It's great for just getting a feel for, hey, the system's a little sluggish, what's going on.

If you find some app that's pegged at 99% of the CPU time, you've got your suspect to go look at. time is a great command line tool that will just count I/Os and tell you how much I/O is done, how much user time and how much system time is taken as part of executing a certain process.

If you're setting about optimizing your application, monitoring both of those is interesting to do. sample is great for looking at where a process is spending its time. There are more powerful profiling tools available, like Shark, which is part of the CHUD tools that you can install. But sample, especially when you think a process is stuck, is a great, quick command line tool to get a look at what the current stack frame is in another process.

And finally, lsof will tell you about all the open files in the system. So if you can't eject a volume because some file is busy, lsof can tell you who has it open. Or if you want to find out what files an application is keeping open, you can find out through lsof.

Those are just some of the tools. There are specific sessions that were devoted to doing performance analysis and profiling of your code that are very much worth looking into. But fs_usage is one of my favorites. Shark is definitely worth checking out; I think there's a session on Friday devoted just to Shark, for instance. So check it out. For more information in general, you can contact Jason Yeo. And there are also some resources on the DVD and online that will tell you more about the particular file systems or, in general, software design guidelines.