
WWDC01 • Session 116

Mac OS X File System

Mac OS • 1:00:45

Mac OS X provides support for a variety of file systems including HFS+, UFS, ISO-9660, UDF, NFS, MS-DOS, AFP, and WebDAV. Learn how the extensible file system architecture of Mac OS X provides support for multiple file systems and allows developers to extend or add more. This session provides essential information for any developer whose product requires close file system interaction.

Speakers: Clark Warner, Pat Dirks

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

Thank you and welcome to session 116, Mac OS file systems. The Mac OS has supported multiple file systems for a long time now, and now with the UNIX foundation in OS X, we support a whole plethora of file systems. And to tell you all about those different file systems and how you should be using them, I'd like to introduce the manager of the Core OS file systems group, Clark Warner.

I'm really glad to see all of you here. I know it's early and I appreciate your coming. I hope we're going to have some very useful information for you about the Mac OS X file system. We'll start with this. A welcome and a little idea of some of the stuff that we're going to cover today. We're not going to talk as much about futures, obviously.

You've probably heard that as a theme throughout the developers conference. The present is so much more exciting this year. And we're not going to talk as much about file system internals because we know that a lot of you are now in the process of bringing your apps over to Mac OS X. So we're going to concentrate on some of the things that your apps need to be prepared for when they're using the file system.

As Jason mentioned, we have a number of different file systems in Mac OS X and they do present some occasional wrinkles. We also are going to give you some tips on how to improve the performance of your app, especially with regard to its use of the file system. But we will talk a little bit about some of the key issues in building file systems yourself to add to Mac OS X.

If you went to the Darwin overview session or probably any of the Core OS sessions, you've seen this chart already. This is the key "You are here" graphic. If we were on an airplane now, you know, we'd be saying, "This is San Jose. If you're going to Hawaii, get off now." The file system is part of the BSD kernel.

Here's a blowup of what the file system looks like internally. At the Core OS level we support, basically, the BSD (Berkeley Software Distribution) system calls, with some extensions. Inside of the file system, there is a big switch we call the virtual file system layer, which separates the part of the file system that is dependent on the underlying volume format or network protocol from the part that is independent.

So the stuff above the VFS layer is independent. The stuff below is dependent. And that's why you see our list of file systems below underneath the virtual file system switch, UFS, HFS, NFS. And we'll talk to you about all of the various file systems that are supported in Mac OS X.

Here's the outline of today's talk. There'll be something of a status update, basically an indication of the file systems that we ship. We're going to do a demo of a couple of the new ones. We're going to talk only briefly about some of the different file system interfaces.

We've covered that in a number of years and there are a lot of sessions describing the application frameworks for use in Mac OS X. But we will spend a fair amount of time talking about the differences between the various file systems that Mac OS X supports and again how that might affect your application development.

We'll talk about security because that's been a large issue for a lot of folks. Mac OS X is a multi-user system with file-level permissions, and in that way it differs greatly from Mac OS 9, so you need to be ready for potential permissions errors in your applications. We're going to talk about performance considerations as I mentioned, and we'll talk a little bit about building new file systems.

First status slide, I think this is my third or fourth Worldwide Developers Conference session talking about the interesting things we were going to do in Mac OS X. And the recent things we had done in interim build A, B and so forth. This is the first time I get to say this.

[Transcript missing]

This is why I love this crowd. All right, so Mac OS X. We have three primary file systems. And when we say primary, what we mean is they're fully read and write file systems, and we can boot and root off of them. So we can boot and root off of the Mac OS Extended Format File System, best known to those who know and love it as HFS+. We can boot and root off of UFS, the Unix file system that we support based on the Berkeley Fast File System. And we can boot and root off of NFS. In fact, we do that internally all the time for installs.

We also have some read and write file systems that we don't boot and root off of, but are otherwise first-class citizens. And that includes the Mac OS Standard Format, otherwise known as HFS, for legacy data that you may have on your Mac. The Apple Filing Protocol, an AFP client, was delivered by the server team into Mac OS X.

We have support for MS-DOS file system, and specifically we mean FAT16 and FAT32 here, which we did largely for digital cameras and so forth, but also zip drives and removable media coming from your DOS machine. And WebDAV, and we'll talk about WebDAV in more detail a little bit later on in the talk.

There are also some read-only file systems supported in Mac OS X, including ISO 9660, which is used on lots of CD formats, especially to interchange between Mac and Windows. We support the Universal Disk Format. Actually, I don't think it's called that anymore. I think they just use the initials UDF, sort of like KFC for Kentucky Fried Chicken. And CDDAFS, a file system written by the CPU Software Group to support CD audio disks, music.

ISO 9660, I should mention, we can also boot and root from. I didn't mention it in the primary file systems because it's not read and write. So I'm going to do a demo. I'd like to bring up demo machine number one. And I'd also like to bring up my lovely and talented assistant, Scott Roberts.

You might have seen a demo similar to this in the keynote. Avi got his digital camera up and took a picture of the audience. I'm going to have Scott do the same thing here. Thank you, Scott. Now, it turns out I'm not a vice president. And as a result, I don't make as much money as Avi Tevanian.

And so I can't afford a really nice fancy digital camera that has USB hot plug and so forth and so on. I have this camera that was given to me as a gift by a friend of mine actually, a Kodak DC210+. But the nice thing is, although it doesn't have USB connectivity and all that, it does have this little CompactFlash card that it uses for memory storage. And we happen to have a SanDisk reader here attached to our demo machine. And so I'm going to put this little card in here. Let's pray a little bit.

and you'll see the image come up on the system. Now we're doing something differently than Avi did because we're not here to show you image capture or any of the fancy camera applications. I just want you to know that this little compact flash card is formatted as an MS-DOS disk. And so we're looking at an MS-DOS volume that's just loaded on our desktop. And if I bring up the preview application, And I go here to the image that Scott just took and open it up. There's a picture of you folks.

I must have said something wrong. I don't know what. Okay. For our other demo, I wanted to show you the WebDAV file system. I've got over here on my other demo machine, I pre-configured it as a web server. I actually changed the configuration file so that it would run the DAV module that actually ships in Mac OS X as part of our Apache installation. So, host 2.162 is running WebDAV and I'm going to bring up Internet Explorer here, which I should have pre-launched. Sorry to waste your folks' time. And I'm going to type in the URL.

And you can see we have our little Apache page. Now what I'm going to do is, since I'm bored with looking at things in web browsers, I'm going to mount this particular website as if it were a volume on my desktop. I do that by hitting Command-K, going to the Connect to Server dialog, and typing in the URL here instead.

And now you notice I now have the Apache website actually mounted as a volume on my desktop. Why is that interesting? Let me show you one thing that we can do. We took a movie, actually last year's file system session, and we edited it and made a small movie out of it.

And we put it on the web server. And now instead of going to the browser and trying to deal with the QuickTime plugins and so forth, I'm just going to double click that movie and we'll show it for you now. Let me back it up a little bit here.

Thank you. The stage police are watching me. And I've heard from my spies that when I turn around to jump on the stage, something bad is going to happen to me. So I have to be careful. I've noticed that the stage is higher this year than last, so I think I'm ready.

I'm getting older. I can't do this kind of stuff every year. Okay, let me show you one more thing that I think is pretty interesting. I'm going to go into BBEdit here. Now I haven't actually completely rehearsed this demo. I was fooling around with it before the show, but I thought it would be kind of an interesting thing to do. So I'm going to bring up BBEdit and I'm going to open up a file on the WebDAV server.

It's going to be the English HTML index file. And I see some text here that says, let's see, if you can see this, it means the installation of the Apache server went, OK, I don't like that text. The text I like is file systems rule. I like that much better.

Now I'm going to go to the web server here and refresh. And you can see file systems rule up here in the top. So an interesting way to edit your website is to enable DAV on it and then you can just use your favorite editing tools and do whatever you like inside your favorite editing tool. When you save, it's automatically populated right back up on the server.

You know, I had to have a plant for that applause. Okay, let me talk just briefly about some of the file system interfaces in Mac OS X. This is the little file system interface chart. You can see the three major application environments, Classic, Carbon, and Cocoa on the top. They all have their own file manager or file object interfaces, but they all come through the BSD layer, and we've extended the BSD layer to allow access to file system metadata that's not typically available in Unix, especially catalog information, Finder info, file type and creator code, and things that you find on HFS+. And underneath all of that is our virtual file system switch I mentioned before, and that's where additional file systems that you might develop for Mac OS X would layer into our system.

So I won't go again into detail here, but suffice it to say the Carbon File Manager is still available to you. It has the Mac OS file system interfaces carried forward. There's access by volume reference number, directory ID, and file ID still available. And Carbon does an interesting thing. Carbon provides HFS-style semantics on file systems that don't normally support it. So if you're using UFS, but your app is using Carbon, you'll still see resource forks and you'll still see file IDs.

The file IDs don't last across mounts. They're done in memory and kept for you only while the mount exists. But nonetheless, they're present if you do a call and you want to see them. So you're somewhat insulated if you're using Carbon from some of the differences across file systems. We also, of course, have the Cocoa environment with an object-oriented API and some file objects that you can use. And as I mentioned, the Berkeley Unix interfaces are available and can be used by your app as well, even in a mixed environment with the other application frameworks.

I want to just briefly mention a couple of the new calls that we added to the BSD layer for accessing HFS-style metadata and those sorts of things that aren't typically available under Unix. Two major ones, getattrlist and setattrlist, which get and set attribute lists. These are flexible calls designed to retrieve various types of metadata, in various different formats, back to your application.

Normally, you wouldn't call these. You'd probably call GetCatInfo or SetCatInfo in Carbon, but they're available to you if you need to. They are in our system. We also implemented a call called searchfs, whose job is to do fast catalog searching. It's supported in HFS+, although most file systems don't support it. But it does allow for fast catalog searching, and it was designed to support the PBCatSearch functionality in Carbon. We also have a call for exchanging data between files.

Those of you who have done Carbon apps before are probably familiar with ExchangeFiles. We created a BSD-level call called exchangedata to do that same atomic transfer of data between two files. Finally, we've added some options to fcntl, which is a standard Unix file system call, with some extensions to allow behavior like allocating storage in advance of the logical end-of-file (LEOF).

So data that's not part of your file per se, but is part of the file-allocated storage, which can be extended later without additional allocation. Now we're at the file system differences part of the talk. I'm going to grab a little cup of water here because this is going to be a little longer and this stuff is important.
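The atomic exchange described a moment ago has a portable cousin: the classic safe-save pattern, writing to a temporary file and renaming it into place. This is a hedged sketch in Python (the `safe_save` helper is my name, not an Apple API), approximating what exchangedata gives you in a single call:

```python
import os
import tempfile

def safe_save(path, data):
    """Replace path's contents so readers see either the old or the
    new file, never a half-written one (a portable approximation of
    the atomic swap exchangedata performs)."""
    dirname = os.path.dirname(path) or "."
    # Stage the new contents on the same volume, so the final rename
    # cannot cross a file system boundary and lose atomicity.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the data to disk first
        os.replace(tmp, path)      # atomic rename on POSIX file systems
    except BaseException:
        os.unlink(tmp)
        raise
```

Unlike exchangedata, a rename-based save gives the file a new inode, which is exactly why a call that swaps contents while preserving the file's identity is useful to Carbon apps.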

[Transcript missing]

Some of our file systems support hard links, which is the ability to create separate nodes in the file system pointing to the same data where the nodes are essentially equal, not an alias from one to the actual object. Some of our file systems have storage for catalog data, catalog information. Some of them don't. Some of our file systems are multi-fork, some of them aren't.

I wanted to talk to you about WebDAV in particular because WebDAV is a really good way of highlighting some of the odd differences you might see between file systems. It's interesting that we have this sort of file-system-agnostic architecture because it allows us to do things like mount a web server as a file system, but there are going to be some gotchas.

The WebDAV protocol does not have any sort of dates besides modification dates. So if you wish to get the access date of a file on the server, it's actually impossible. We have to sort of fake that information. So that's one classic difference. Inode numbers aren't a concept that is supported in the WebDAV protocol either. WebDAV, I should mention, stands for Web-based Distributed Authoring and Versioning.

The reason it came into existence was to allow collaborative authoring on the web. I think its original vision was for people doing development in a web-based authoring tool to be able to move things to and from the server and affect the files on the server, but it wasn't originally a file systems protocol. That's something Apple decided we could do, because the protocol added enough support in the way of a consistent hierarchical namespace, some synchronization with locking, and property management. But those properties do not include inode numbers.

And so, very much like Carbon and file IDs, when we see a file in WebDAV, we generate an inode number, and we remember that inode number for the life of your mount. But if you unmount the WebDAV volume and mount it again, the inode numbers for files will not be the same.
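That per-mount table can be sketched in a few lines; the class name and numbering scheme here are illustrative, not the actual WebDAV client's:

```python
import itertools

class MountInodeTable:
    """Hand out fake, stable-within-one-mount inode numbers for a
    file system (like WebDAV) whose protocol has no inode concept."""

    def __init__(self):
        self._numbers = itertools.count(2)   # 1 is often reserved for root
        self._by_path = {}

    def inode_for(self, path):
        # The same path maps to the same number for the life of the
        # mount; nothing is persisted, so a remount starts over.
        if path not in self._by_path:
            self._by_path[path] = next(self._numbers)
        return self._by_path[path]
```

An unmount simply drops the table; a second `MountInodeTable` may hand the same path a different number, which is why applications must not store these IDs.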

Also, we can't set live properties in WebDAV through the protocol. Live properties are the ones that are actually supported by the server. There's also a notion of dead properties in WebDAV, where you can make up a property and store a value in it. But real live properties like, for example, the modification time, those properties are actually on the server, and we want to respect those and return those to you, but the servers will not let us change them. So you can't do a set mod time on WebDAV and have it work.

We'll have that silently fail to keep apps from keeling over. But be advised that you can see a silent failure of a set mod time. The security model for WebDAV is entirely different from what you'd expect from a file system. It's HTTP security. It's basic authentication for us generally, which means that you try a request and if the server doesn't like you, it gives you back a message that says authorization denied, and it's your job to find out the user's username and password and try again, sending it across.

But there is no way except for testing an operation to determine if the user is going to be able to do that operation or not. If you send a put across to the server, which is the mechanism for taking a file and moving it up, you don't know if it's going to succeed or fail. There isn't a pre-flight call, if you will. You just have to do a put. So what happens in WebDAV is if we get an authorization error, the daemon that supports the file system puts up a dialog box that says, "Who are you?

Please type your username and password in." And then we'll send that across to the server. We'll keep doing that until the user either gets it right, it times out after about five minutes, or the user hits cancel. And if the user hits cancel, an EACCES error comes back from the file system.
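The loop the daemon runs looks roughly like this; `send_request` and `ask_user` are hypothetical stand-ins for the real HTTP request and the password dialog, and EACCES is the error the file system surfaces on cancel:

```python
import errno

def try_with_auth(send_request, ask_user, max_attempts=5):
    """Retry an HTTP operation until it succeeds, the user cancels,
    or we give up, mirroring the WebDAV daemon's behavior."""
    creds = None
    for _ in range(max_attempts):
        status = send_request(creds)
        if status != 401:        # anything but "authorization denied"
            return status
        creds = ask_user()       # put up the username/password dialog
        if creds is None:        # the user hit cancel
            break
    raise OSError(errno.EACCES, "authorization denied")
```

Feeding it a server stub that returns 401 once and then 200 shows the retry; a user callback returning None models hitting cancel.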

Likewise, unlike AFP or NFS, which are typically run over local area networks and are therefore usually reasonably fast, maybe not compared to local items but in the absolute sense, WebDAV can be quite slow. It may be running over a 28K modem link, you never know, because we're talking about an internet file system after all.

So we're not going to go through all of the items on this chart. I just wanted to scare you a little bit to give you an idea of how some of these file systems actually differ. Classic example, HFS+ supports privileges. It has storage in the volume format for privilege information. Supports the ability to get an error back when you delete an open file. Supports access by ID. MS-DOS supports none of these things. WebDAV supports none of these things either. NFS is sort of a mix.

Same story with naming differences. Some file systems are case sensitive like UFS, some are case insensitive like HFS+. Some support Unicode fully and some don't. We make a particular mention of the Unicode characteristic. HFS+ supports Unicode names in a canonical form where the characters are decomposed. It's possible in Unicode to have, say, an E with an accent represented as E with an accent as one character or E followed by an accent character.

There aren't that many decomposed characters, but there are some. We always store all our file names decomposed on the volume so that there will only be one element in a directory that looks the same to the user. And so that we'll easily be able to do name comparisons.

But not all file systems do this. UFS does not. UFS does not interpret the bits, and we wind up storing things on UFS as UTF-8 characters. And so on UFS, you actually could have two names that look identical to the user but are actually slightly different in their byte representation. What this means for your application, though, is if you have a composed character in a name that you send to a create call, when you look at it again through a directory listing on HFS+, it's going to be different. And on UFS, it'll be precisely the same.
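Python's unicodedata module makes the distinction concrete; HFS+'s canonical form corresponds roughly to Unicode's NFD (decomposed) normalization, so treat this as a general Unicode illustration rather than the exact HFS+ decomposition tables:

```python
import unicodedata

composed = "caf\u00e9"        # "café" with a single precomposed é
decomposed = "cafe\u0301"     # "café" as e + combining acute accent

# The two names look identical on screen but differ byte for byte,
# which is exactly the situation UFS permits within one directory.
assert composed != decomposed

# Normalizing to NFD (decomposed form) makes them compare equal, which
# is essentially what HFS+ does before storing a file name.
assert unicodedata.normalize("NFD", composed) == decomposed
```

So a name created in composed form comes back decomposed from an HFS+ directory listing, while UFS returns exactly the bytes you stored.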

So having said that, let me give you some tips as to how you can handle some of the specific differences we've talked about today in your application. Number one, be consistent in your use of case internally. We found with the Macintosh Application Environment (MAE), which was a Mac environment that ran on Unix operating systems, that lots of Mac apps would keel over when MAE was running on a case-sensitive file system and that characteristic bled through. The reason being, they would have files like preferences that they would open in one part of their app with a capital P and open in another part of their app with a small p, and they wouldn't be the same, and that would confuse the app.

On HFS+, they would be the same; on UFS, they wouldn't be. You'd either see a different file or you'd get an ENOENT error when you tried to access the file with a name differing only in case. So make sure that in your app, when you're referencing a hard-coded file, if you do that, you're using the same case in all instances.

Always use decomposed names in your application. That way you'll never be surprised by a name being slightly different on the way back out than it was on the way into the file system when you created it. Be prepared for access errors at random times. As I mentioned, the WebDAV file system has a very odd permissions model, and we're going to do things like put a file across, which may happen in an fsync or a FlushFiles operation or on a close operation, and we're going to discover at that time, and not in advance, that the operation isn't permitted.

If the user doesn't have a username and password that allows them to do it, or if the server administrator has cut their access, you can get back an access error on a close call or on a FlushFiles call, something which probably hasn't ever happened to you before.

So be prepared for access errors at strange times, and you may want to be able to put up a dialog that says access denied. You're going to probably see them if you're running a Carbon app as an AFP permissions error because that's how Carbon maps our access denied error codes.

Also, do not rely on inode numbers, and this is also true for file IDs. If you have a file system that's unmounted and remounted, all the inode storage on a WebDAV volume, all the file ID storage on UFS volume is kept in memory in big tables, and there's no effort made to make that persistent across different mounts.

We have one way to help you deal with some of these differences, and that is the pathconf system call. pathconf is designed to give you characteristic information about the file system you're running on. It takes a path and a selector that says what you'd like to know, whether the file system is case sensitive, or how long the names are that it supports. Those are the only two selectors that we support right now in HFS+ on pathconf.
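Python exposes the same call as os.pathconf, so you can try the portable part of this anywhere; the case-sensitivity selector is Darwin-specific, so this sketch probes for it rather than assuming it exists:

```python
import os

# The maximum file-name length is a selector POSIX defines everywhere.
name_max = os.pathconf("/", "PC_NAME_MAX")
print("longest name the root file system supports:", name_max)

# The case-sensitivity selector the talk mentions is Darwin-only,
# so check the platform's selector table before asking.
if "PC_CASE_SENSITIVE" in os.pathconf_names:
    print("case sensitive:", bool(os.pathconf("/", "PC_CASE_SENSITIVE")))
else:
    print("this platform has no case-sensitivity selector")
```
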

Perhaps someday we'll expand that list. But you can use this call, especially on HFS+ to know that you're on a case insensitive file system. So I'm going to now bring up Pat Dirks, the CoreOS File Systems Technical Lead, to talk to you about some security issues and some performance issues in Mac OS X.

Well, the first thing to realize is that in Mac OS X, it's a whole new world. It is multi-user to the core. There are permissions everywhere and the core kernel enforces those permissions. There's no path around it, no alternate access path that isn't affected by it.

The whole system is fundamentally multi-user. The permissions in the system are the standard Unix permissions. If you're familiar with those, the next few slides are just going to be review for you. Hold on, though, there are a few gotchas in the permission handling of HFS+, but if you are familiar with Unix, you should be very comfortable with the permissions model on our system.

And we'll see that diagram again in a moment. The permissions, for those of you who may not be familiar with Unix, are in some ways similar to AppleShare. AppleShare's permissions, when we were designing the AFP protocol, were based on the Unix permissions model, and we made a few changes from there to allow them to work on a folder-only basis.

But they're fundamentally inspired by the Unix model. So, instead of See Files, See Folders, and Make Changes, you have read, write, and execute. And you have that for files and you have that for directories. The catch is it applies to files as well as folders.

On an AppleShare file server, the only permissions you ever had to worry about were the permissions you had on a given folder. In Mac OS X, you have to worry about the permissions on individual files as well. And the other difference with AFP is that in AFP, you can have separate permissions for the world, the group that owns a particular folder, and the owner of the folder. And whichever category you fit in, you've got those rights.

So, everybody started out with the everyone permissions. And then if you were part of the group, you also got the group's permissions. And if you were the owner of the object, you also got the owner's permissions. In Mac OS X, only one category is matched. So, if you're the owner, you get exactly the owner's permissions. And if you're in the group, you get exactly the group's permissions. And if you're everyone else, then you get the other permissions.

So, those Unix permissions consist of an owner ID that is saved with the object, a group ID that is saved with the object, a set of permission bits which you see there, and a few extra bits that are not divided up in separate categories for owner, group, and other. And we'll cover in a moment what exactly those bits mean in different cases. But that's basically it. Read, write, execute in three groups and three special bits and some flags that we'll cover in a moment as well. So.

So every user is categorized in one of three possible groups: either the owner of the object, the group that is associated with the object, or everybody else. And whichever group is most specific determines the access that you get, as I said. Again, unlike AppleShare. So for each group, there is read, write, and execute.
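That "exactly one class applies" rule is easy to express as a pure function; the helper below is illustrative, not a kernel interface:

```python
def effective_bits(mode, uid, gid, st_uid, st_gid):
    """Return the rwx bits Unix actually applies to this user: the
    owner bits if they own the object, else the group bits if they
    are in its group, else the other bits. Exactly one class wins;
    classes are never unioned the way AppleShare accumulated them."""
    if uid == st_uid:
        return (mode >> 6) & 0o7
    if gid == st_gid:
        return (mode >> 3) & 0o7
    return mode & 0o7

# Mode 0o460: owner r--, group rw-, other ---.  The owner gets only
# read, even though the group class would have granted read+write.
assert effective_bits(0o460, uid=1, gid=2, st_uid=1, st_gid=2) == 0o4
```
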

Write translates very directly to Make Changes in AFP. Read is the right to read a file or list the contents of a directory. It's a bit like See Files, See Folders. That's where it gets a little weird. Execute applies most directly to files, obviously. If it's an executable and you have execute permissions, then you can execute it. It also applies to directories, as we'll see, in a way which is kind of a subtle case.

For files, if you set the set UID bit, then when you execute that binary, and it only makes sense on executables, the program will run with the ID of the owner of that file. So you'll commonly hear set UID root binaries. Those are files that are placed on the system that will run as a privileged user in the system when they're executed. There's also set GID, which is used less often, which runs with the group that is associated with the object.
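Python's stat module can render a mode word the way ls does, which makes the set-UID bit visible as an "s" in the owner's execute slot:

```python
import stat

# A regular file with mode 4755: set-UID plus rwxr-xr-x.
setuid_mode = stat.S_IFREG | stat.S_ISUID | 0o755
print(stat.filemode(setuid_mode))                 # -rwsr-xr-x

# The same permissions without set-UID, for comparison.
print(stat.filemode(stat.S_IFREG | 0o755))        # -rwxr-xr-x
```
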

Now in directories-- It's probably easier to make sense of these permissions if you think of directories as files listing the contents of the directory, because that's how they originally came about. Read is explicitly the permission to enumerate the contents of the directory, to type ls and list the contents.

Write is the right to make changes. It's very analogous to AFP's make changes. This is make changes to the directory. So these are operations that would require changes in the directory file. Creating a new file, renaming a file that's in there, deleting a file, that sort of thing.

You can set execute on a directory, and that limits access a little bit. Without read, normally you have read and execute together. If you give execute without read, you retain the ability to open files in there, provided you have permissions on the file itself. But you lose the ability to list the directory contents. So it's almost sort of a security-by-obscurity.

If you know what the file name is or your program has built in some file that it's referencing, execute is enough to get it open. But you need read in order to enumerate the contents and look at it. So that's sort of an edge case that you may run into. It's unusual.

Finally, setting the sticky bit, one of the special bits that is associated with an object, means you can give write, but the ability to actually make the changes is limited to the owner of the object itself. So it imposes one additional test before you can actually make use of the write permission that you would otherwise look to be granted.
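That additional test can be sketched as a pure function; it's illustrative only (the real check lives in the kernel, which also exempts the directory's owner and root, as modeled here):

```python
import stat

def sticky_allows_remove(dir_mode, dir_uid, file_uid, uid):
    """Given write permission on the directory, may this uid remove
    the entry?  With the sticky bit set, only the file's owner, the
    directory's owner, or root may."""
    if not (dir_mode & stat.S_ISVTX):
        return True              # no sticky bit: write permission suffices
    return uid in (file_uid, dir_uid, 0)

# A /tmp-style directory (mode 1777): user 7 cannot remove user 5's file.
assert not sticky_allows_remove(0o1777, dir_uid=0, file_uid=5, uid=7)
assert sticky_allows_remove(0o1777, dir_uid=0, file_uid=5, uid=5)
```
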

There are a special group of flags associated with every object as well, and these exist only in one form. There's not a separate group of flags for owner group and others. There's only one set of flags. And the most common one you'll see is the immutable flag that takes the place of the lock bit that HFS always had.

And that is in fact what SetFLock and RstFLock in the Carbon interfaces use to lock or unlock a particular file. You can see these flags if you use the -o option in ls if you find yourself in the shell. It will list uchg if the immutable bit is set. So that's a quick way that you can tell if things are locked.
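Programmatically, the flags come back from stat in an st_flags field on systems that have BSD file flags; this hedged sketch defaults to 0 on platforms (such as Linux) that lack the field, and demonstrates the check with stub stat results:

```python
import stat
from types import SimpleNamespace

def is_locked(st):
    """True if the user-immutable flag (the old HFS lock bit) is set.
    st_flags exists only where the OS supports BSD file flags, so
    treat a missing field as "no flags set"."""
    return bool(getattr(st, "st_flags", 0) & stat.UF_IMMUTABLE)

# Stubs standing in for os.stat() results on locked/unlocked files.
assert is_locked(SimpleNamespace(st_flags=stat.UF_IMMUTABLE))
assert not is_locked(SimpleNamespace(st_flags=0))
assert not is_locked(SimpleNamespace())   # platform without st_flags
```
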

And there's a chflags command that changes the flags, and there's a chflags system call that will manipulate them at the BSD level. All these things are accessible at the BSD level. There's nothing in Carbon or Cocoa or something that is special above and beyond this. Everything is enforced at the BSD level and everything is accessible at the BSD level. The other gotcha you may run into is that when something has been marked immutable, it can't be moved. That used to not be true on Mac OS. You could lock a file and still take it and move it somewhere.

On servers, that was actually sort of an awkward thing to do, because you could lock something down and somebody who had Make Changes could still make it disappear on you. That's no longer the case. When something is immutable, it doesn't go anywhere and it doesn't change. Now, the immutable and the append flags have special variants of them that can be set only if you are a specially privileged user, and they can't be unset in the normal running of the system.

So if you are trying to protect some particularly important file in the running of the system, you can set a special system-only immutable bit that is even stronger than the regular immutable bit, and you can't turn it off. So be careful if you try this on your machine at home.

You have to take the system down to single user before you can clear that bit. Now, all the Unix aficionados wake up. This is the part where things get different again. There is some special handling on permissions for HFS+ volumes. We had a problem in that we wanted people to be able to take disks and move them all around from system to system and retain the same ease of use that they had in Mac OS 9.

They could take a zip disk from your system, take it over to somebody else's system, and you wouldn't suddenly find that the permissions were all wacky just because the numbers that were assigned for the user ID and group ID on your system made no sense on the other system.

So, the system very carefully uses the permissions only on those disks that it knows are local, or for which it was specifically requested that they be used. By default, if you have an HFS+ disk that the system has never seen and you connect it, either by plugging in a Zip disk or by plugging in a FireWire drive even, the permissions will fall back to a scheme where the owner and the group are ignored. You can get that same behavior on request for any disk in the system through the Finder's ignore permissions bit. We'll talk about that.

So every disk is identified not by name but by a special 64-bit identifier that we write on there. When a disk is being mounted, the HFS code checks to see if this ID is one of a disk that it has seen before and for which permissions should be enabled. And if it finds that, then it will enable the permissions, and the disk will be used just like a UFS disk would.

You'll see owners, you'll see groups, you'll see everything. If there is no entry for that disk, if it's completely unknown, or if the entry in there says the user asked that the permissions not be used, then the handling switches over to ignore all the user and group IDs on there, make them unknown, and replace the owner with the logged-in user. And that's done completely dynamically.

If you have such a disk connected and you log out and somebody else logs in, they are now the owner of all the objects on that disk. So it's not a static mapping; whoever is currently logged in owns all of that. It's a very convenient way to not trip over user or group settings that make no sense on your system. So the Ignore Permissions checkbox in the Finder

lets you elect to ignore this and get that same sort of foreign-disk behavior on request. And it's the same underlying mechanism. What the Ignore Permissions bit does is basically turn off the recognition of that disk in the system, and it will be treated without regard for the users and groups. It's called Ignore Permissions, but really the best way to think of it is as ignore ownership.

We'll take questions on this later. I wanted to bring up a few points about performance in the system, and in particular the different ways that you can do I/O; I'll say a few general words first. We want to cover the differences between doing buffered file system I/O, doing direct memory-mapped I/O in the system, and using unbuffered file system I/O, and the implications of those differences. I'll say a few words about zero fill, which your application may have run into; it's something that you see on Mac OS X that you never saw on Mac OS 9.

In general, this shouldn't be news to anybody, the fewer I/Os you do, the better. The more you can aggregate your I/Os into a few large operations, the faster things will go. Even if you're doing small transfers, the system will try to aggregate these on your behalf. If you're sequentially reading through a file, the system will pick up on that and it will read larger and larger chunks even ahead of where you currently are, and as you're writing, it will save up writes to do single large writes out to the disk to maximize the efficiency of your I/O. So that is why sequential operations are so much better than random operations because even if your application is only ever asking for 4K at a time, you'll be doing very large transfers to the disk.
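To make the aggregation point concrete, here's a minimal sketch: the same copy issued as a few large transfers instead of many tiny ones. The buffer sizes and helper name are arbitrary illustrations, not anything from the session:

```python
import os
import tempfile

def copy_file(src, dst, bufsize):
    """Copy src to dst using reads and writes of at most bufsize bytes."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(bufsize)
            if not chunk:
                break
            fout.write(chunk)

# The same megabyte copied with a 64 KB buffer takes ~16x fewer read()
# calls than with a 4 KB buffer; the kernel's read-ahead helps both, but
# the per-call overhead is pure waste.
src = tempfile.mktemp()
dst = tempfile.mktemp()
with open(src, "wb") as f:
    f.write(os.urandom(1 << 20))          # 1 MB of test data
copy_file(src, dst, 64 * 1024)
with open(src, "rb") as a, open(dst, "rb") as b:
    assert a.read() == b.read()
os.unlink(src)
os.unlink(dst)
```

Sequential access like this is also exactly the pattern the kernel's read-ahead and write-clustering are built to recognize.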

The zero fill that we'll cover in a moment is triggered when you are leaving areas of the file unwritten, but you do become the owner of them. And it's best to avoid that because it's really just wasted effort. You might as well write the data sequentially. Don't skip ahead of the end of file, for instance.

This is basic buffered file system I/O. You see the device driver in the system, which is doing the actual data transfer, the buffer cache and the virtual memory system, which are part of the kernel that govern all the data in the system. And in Mac OS X, those are actually integrated and they coordinate with each other. So where there are cases where a particular piece of a file or a page in the system is the same, the buffer cache and the virtual memory system coordinate access to that page so there's only ever one copy of the page.

That means if you have something mapped and you write to it, you'll see those changes in the mapping right away, and vice versa: if something gets paged out, the read/write path will see that right away. It seems obvious in hindsight, but it's an awful lot of work to make that work correctly.

And finally, there's a user application drawn there, and you see the user application with a page of memory in there. That's really the appearance of a page that is managed on the user's behalf in the virtual memory system.

You can think of it either way. You can think of the page as owned by the user, or you can think of the page as managed by VM on the user's behalf. It's really all the same thing. So in basic buffered I/O, the data is first copied from the device driver into a buffer that is set aside by the file system to hold blocks for that vnode.

And those blocks are shared between VM and the buffer cache as necessary. So that's the first copy. Then the file system will copy whatever the user asked for, either read or written, into the user application, and there's a second copy made into the user pages that hold the user buffer.

So it's completely flexible. You can read at any offset in the file. You can read any amount of data. But you do end up making two copies. First, a large page aligned copy for the convenience of the system into the buffer cache and then a separate copy from there into your user address space where the data really lives. And by the way, the user page ends up being dirtied as a result, and we'll see in a moment why that's important.

But if you're going to be reading a file over and over, or you're reading back and forth through a file, it's a cost that's well worth making, in addition to the flexibility you gain from the ability to align the data or the size anywhere you want. The fact that the copy remains in the buffer cache means the next time you hit either in that same page or somewhere right around there, you'll probably find it in memory, and you'll only end up doing the last copy from the buffer cache into the user page. So there's an extra copy, but it may be worth it under certain circumstances. And this is probably what most of your I/O is like. This is ordinary, open, close, read, write I/O.

Instead of that, you can do memory-mapped I/O, and it's something that you should consider as an option. When you're just reading files, it's a very efficient way to get data in, and there are some advantages. Although it requires a BSD VM call to set it up, so it may be tricky to do from a CFM Carbon application, it's a very nice way to get the data in because you only end up doing a single copy, essentially. It goes straight from the device driver into the VM system, and from there it's visible to your user application. So there's only one copy made, and you save a copy.
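A minimal sketch of the memory-mapped read path, using Python's portable `mmap` wrapper rather than the raw BSD `mmap` call the talk refers to:

```python
import mmap
import os
import tempfile

# Write a small test file, then read it back through a read-only mapping.
path = tempfile.mktemp()
with open(path, "wb") as f:
    f.write(b"hello, memory-mapped world\n" * 100)

with open(path, "rb") as f:
    # length=0 maps the whole file. ACCESS_READ keeps the pages clean, so
    # under memory pressure the VM system can simply discard them and
    # fault them back in later, instead of writing them out to swap.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        assert m[:5] == b"hello"
        assert len(m) == os.path.getsize(path)

os.unlink(path)
```

The file's bytes appear directly in the process's address space with no intermediate buffer-cache copy, which is the single-copy path being described.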

In addition, The VM page is not marked dirty. All that's ever happened to that page is something was read into it, unlike the user page that was copied into a moment ago. So when the system needs more pages, it doesn't have to go copying that user page out to swap storage. It's all set. It can just throw this away, and it can read it in later if it should get page faulted in again.

There are some disadvantages. Every transfer is at least a whole page worth, so if you've got a file with 10 bytes of data in it, it's obviously not worth mapping it. But it's a nice way to get some data in and read through a lot of data. The VM system does the same clustering of IO operations that I mentioned earlier. As you're touching through pages, it will start paging in more and more in advance.

So it's a very good way to read in a sizable data file that you're just going to read sequentially. It's not good for write because you can't extend a file. But it's a very good way to read data in that you're only going to read and that you're reading sequentially. And it will save you a copy.

Finally, there's something that's almost a mixture of the two. You can choose to do unbuffered I/O. And it's actually very easy from Carbon, because it's exactly the noCache bit that you can set in ioPosMode on a read or write transfer in Carbon.

[Transcript missing]

Unlike memory mapping the page, the page is dirtied, just like ordinary I/O would be. The page in the user space is marked dirty because it's been copied into and it will be swapped out if necessary. But it's a good way to do I/O and not fill up the buffer cache.

If you're not likely to read or write the data again, it's a good thing to do. And you can write files this way. So if you're writing an output file that your application isn't just about to reread, this may be a good way to do your I/O.
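There's no portable user-level spelling of Carbon's noCache bit, but later systems grew analogous cache hints. This sketch probes for `posix_fadvise(POSIX_FADV_DONTNEED)` (POSIX, present on Linux) and `F_NOCACHE` (the Mac OS X fcntl); treating these as the modern analogues is my assumption, not something stated in the session:

```python
import os
import tempfile

# Write a 1 MB output file we don't intend to reread, then hint to the
# kernel that its cached pages can be dropped rather than crowding the
# buffer cache. Neither hint is universal, so probe before calling:
#   - os.posix_fadvise(..., POSIX_FADV_DONTNEED): POSIX, e.g. Linux
#   - fcntl F_NOCACHE: the Mac OS X equivalent
path = tempfile.mktemp()
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"\0" * (1 << 20))
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    else:
        import fcntl
        if hasattr(fcntl, "F_NOCACHE"):
            fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
finally:
    os.close(fd)

assert os.path.getsize(path) == 1 << 20
os.unlink(path)
```

Either way the data reaches the disk normally; the hint only affects whether the pages linger in the cache afterwards.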

The zero fill I mentioned. The Mac OS X kernel tries to be very careful not to let you read data that you haven't previously written. If you recall cases where some major word processing application would inadvertently ship pieces of your hard disk out along with your documents, you'll see why this is a really nice feature.

You have to be careful, though, because if you have a file that you're writing randomly and the first transfer is some distance into the file, you'll end up zero filling the whole intervening space. Basically, anything that you can potentially read you should either write yourself, or the file system will write it with zeros on your behalf as part of the write transfer.

So those cases are basically where you use SetEOF to make the file larger, or where you do a write that skips ahead past the end of file some distance and starts a transfer there, creating this gap. That gap will be zero filled. So for those reasons, sequentially writing a file, aside from all the benefits I mentioned earlier of clustering the I/O, is far preferable.
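A small sketch of the gap behavior: seek past end-of-file, write, and the intervening bytes read back as zeros. (On HFS+ at the time, those zeros were physically written on your behalf; on file systems with sparse files the hole is cheap, but the reader sees zeros either way.)

```python
import os
import tempfile

# Seek well past end-of-file and write. Everything between the old EOF and
# the new data reads back as zeros: the kernel never lets you see stale
# disk contents, but producing those zeros isn't free.
path = tempfile.mktemp()
with open(path, "wb") as f:
    f.write(b"start")
    f.seek(1 << 16)          # jump 64 KB past what we actually wrote
    f.write(b"end")

with open(path, "rb") as f:
    data = f.read()

assert data[:5] == b"start"
assert data[5:1 << 16] == b"\0" * ((1 << 16) - 5)   # the zero-filled gap
assert data[1 << 16:] == b"end"
os.unlink(path)
```

Writing the file front-to-back avoids ever creating the gap, which is the sequential-write advice above.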

A word about the cost of caching. You should be careful when you decide to cache data in your application, because in Mac OS X you are constantly running with virtual memory enabled and what you think of as setting aside some memory for this particular cache is really just that much more paged memory. In fact, you may end up doing a number of I/O operations just to read the data in.

You may have to page out some other dirty page in the system to free up a page for your cache. You end up incurring the cost of the actual transfer to read the page in and if this data turns out not to be referenced, you may end up having to page out a page that you dirtied by this cache.

In addition, you have to be very careful about how you structure the cache. This is not wired memory that's sitting there for your behalf. If you have a cache data structure that is just laid out very conveniently in memory but ends up skipping around from this page to that sort of randomly, you end up touching all these pages and you may end up doing page ins with every new page that you touch. So you have to be very careful that you structure your cache in a way that minimizes the number of potential page hits to get you to your data.

All together, it's very easy for an application cache to become much more expensive than simply reading the data right back in from disk, especially if the data is something that is mapped directly into memory, for instance. So think about it carefully and only use caches for things that are truly hard to reconstruct or where you are sure that the hit rate is actually very, very high.

Well finally, Mac OS X is a good time to rethink some of the fundamental assumptions behind your application. Think about the kind of data that you're reading and the pattern that you're reading or writing the data in. And think about what mechanism you might best use to get that I/O in and out of the system.

Look at your application as it's happening and figure out where the real bottlenecks are before you decide where to spend your time and effort and trickiness and what to optimize. If the bulk of your application is reading and writing files, it's obviously worth thinking about. If the bulk of the time is spent waiting for the user to click on some cell somewhere or something, then it may not be an issue at all.

Look at the underlying assumptions that went into your application, because some of them may well be changed in Mac OS X. Some system calls that used to be almost free on Mac OS 9, because they came straight out of memory all the time, may be reasonably expensive on Mac OS X all of a sudden. And again, that's a reason to go back and look at your application in action and see where the time is being spent, because you may be surprised to find that you're spending a lot of time doing things that you assumed would be almost free.

And finally, try to avoid making assumptions about how fast something will be to read because you might be surprised what's actually somewhere remote over on a network in somebody's home directory and the preference file you thought was cheap actually turns out to be a very lengthy operation that might involve automatically mounting some volume, getting access to the data, etc. So don't make assumptions about what's fast, what's local, what's remote. It could be on a WebDAV volume for all you know.

So finally, there are some tools that you should look at. There are some classic Unix tools. top is a very nice tool for seeing the size of your application, the amount of virtual memory that it has allocated to it, how much of that is shared, how much of that is private, and it gives you a little peek into the system: it will show you how fast paging I/O is being done, how busy the system is, what it's doing, what in your system is using the most CPU time, all kinds of things. I recommend it highly. You should run it often. There's a time command, which can be very interesting. It's limited to command-line things, but it will tell you how much system time and how much user time was spent executing this particular application, along with the number of I/Os that were done on behalf of your application.

So you can easily tell when your application suddenly starts doing fewer reads or fewer I/O transfers, or more large ones or small ones, or whether the percentage of system time versus user time is interesting. If the system is spending most of its time in system time, you should think about what system calls it's doing to cause that to happen. And similarly, don't worry too much about your application's algorithms if most of the time is spent in the system, because they may not be relevant. So time can be interesting. sample is, I gather, a long-standing NeXTSTEP tool. It dynamically probes your running application and takes a peek at where the system is currently running and the stack at that time. And you can tell it to take a number of samples over a certain period of time, and it will tell you what percentage of time was spent in what routines, and that may tell you where the hotspots in your application are.

It will tell you whether your application is constantly waiting for I/O to come off disk, or waiting for the user to do something, or all sorts of things. So sample is interesting. And finally, fs_usage, which you may have seen demoed in other sessions as well, is a wonderful tool for getting down to the real nitty-gritty of exactly what your application is doing and what the system is doing on behalf of your application.

You may be making Carbon calls and be unaware of the number of system calls that go on under the covers to make that Carbon API happen. So you can use this tool to get a better understanding of what your application is doing and what the system is doing on behalf of your application.

So. I'd like to bring up our resident expert in bad demo code, my manager, Clark Warner.

[Transcript missing]

I'm going to bring up a copy of TextEdit, which many of you, I imagine, have probably used by now. And let me bring up a copy of the Process Viewer.

I'll have to do this the hard way. All right. Okay. First I'm going to do my little Unix command here to find out the PID number of the process that is TextEdit. It looks like 278. Let me change the font here to make this a little bit more readable for you.

[Transcript missing]

What's that? Oh, thank you. Okay. Now I'm now monitoring all the behavior of TextEdit and when I go back into, whoops. Let me bring it back. You can see as I click around various things are happening. One of the things I'll do is open up a file that I put on our demo volume. You can see a lot of things are happening now.

Here's 116 demos. Data file. Okay. So here's my opening of the data file. You'll notice there were a few page ins, some opens, some fstats, some reads, but basically one read call of a fairly large size. So that's not too bad. I'm going to close up this file here. Let me just show you.

[Transcript missing]

I think I would have woken up by now. The man page for fs_usage. One of the most interesting things about the fs_usage program is the ability to see all of the actual Carbon file system calls that are happening while the read calls are happening. If you notice here, there's this file-tracing file in /tmp.

If you create this file-tracing file in /tmp, then you will actually see all the Carbon calls as well as all of the BSD calls that are coming through. And to show you that briefly, I'll turn that on. And now I'm going to launch an application that I call DumbText.

DumbText is the standard SimpleText text editor hacked up to do one-byte file reads. Just to give you an idea: if you guys wrote apps this way, this is how you can figure it out before your boss does. Let me talk to you a little bit about some of the key issues in building your own file system.

One is, it's really not recommended. And the reason we say that--

[Transcript missing]

Funnels are acquired when a thread enters a system call. They're released when the thread returns to user mode. They're also released when a system call reaches a voluntary yield point like I/O, allocating memory, and so forth.

But they're held across kernel preemption. So a BSD system call now can be preempted in the kernel and something else can run, a user thread or a Mach thread or an I/O Kit thread and so forth, but the BSD structures won't change out from underneath the BSD kernel system call. So it's happy.

We also split the funnel after we developed the first one so that now networking operations in the kernel are handled in the network funnel and all other operations including file system operations are handled in the kernel funnel. We found that we actually could separate network activity in the kernel from file system activity in the kernel.

What this means though is if you were writing a network file system, say, every time you went to use the networking infrastructure in the kernel, you'd have to change your funnel, switch from the kernel funnel to the network funnel. When you went back, you'd have to switch back.

Switching funnels is a blocking call. The entire world can change from under you when you switch from the network funnel to the kernel funnel and vice versa. All things you would have to know. The network funnel is for things like socket I/O, and bind and accept calls, and so forth. The kernel funnel is for everything else. There are some calls, of course, that can be called from either funnel or from no funnel: memory allocation and free, etc.

So, here are some need to knows if you wanted to build a kernel extension for Mac OS X. One, as we mentioned last year, we built this thing we call the unified buffer cache, which if you had built the file system prior to that would have had to change to support it.

Likewise, between public beta and now, we introduced the split funnel. And of course, we're going to be doing things to improve the performance of our kernel and the functionality of our kernel on into the future. And some of those things are going to require changes in the file system. And if you have one written, you're going to have to be inside of the loop.

You're going to have to contact Apple. There's other stuff that may be involved, but we can only tell you so much in an hour. So the primary message is talk to Jason. If you're thinking about building a file system, contact Apple. You're going to have to be in the loop. Now, we do a little demonstration here because we like to bring concepts home at the file system session.

And so, I am a rogue kernel file system extension. And my compatriots here, Pat Dirks, Scott Roberts, and Umesh Vaishampayan are the kernel. And this is me, an inappropriately versioned kernel file system extension, attempting to load . . .

I think you get the picture. Here's some other sessions you may be interested in at the show to help you with building applications that are file system centric or even building file system extensions. Open source at Apple is happening at 10:30 right after this session in Hall A2. There's a session on AFP Server and the Apple Share Client file system in Mac OS X that's happening tomorrow in this room at 3:30.

There's a Carbon performance tuning session happening in Hall 2 tomorrow at 2:00 and an Apple Performance Tools session happening in Room A2 Thursday at 5:00 where you may get to look at your third demo of FS Usage. We think FS Usage is so important that if you come to the Worldwide Developers Conference you should see it at least twice, possibly three times.

Likewise, leveraging BSD services will happen in the Civic Auditorium Friday at 2 o'clock. The Darwin Kernel presentation, which will give you an idea of how the kernel is structured internally, the Mach kernel and some of the BSD kernel services outside of file systems and networking, that's going to be at the Civic Center at 3:30. And the Darwin Feedback Forum will be Friday at 5:00.

That's all we have for you today. I'm going to ask Jason Yao to come up, and he's going to moderate our question and answer. I'm going to bring Pat Dirks, Scott Roberts, Umesh Vaishampayan, and Don Brady up on stage from the file systems and kernel team, and we'll take your questions.