What's New In The Filesystem - WWDC 2012

Core OS • OS X • 58:26

Learn about recent changes to the OS X filesystem and find out what you need to know about FileVault 2, CoreStorage, extended attributes, and ExFAT. Hear about new APIs as well as which APIs are deprecated and understand how to use the filesystem APIs to optimize your code.

Speakers: Deric Horn, Soren Spies

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Hi. Welcome to What's New In The File System. I'm Deric Horn. I manage the engineering team that works on the filesystem. So today we're going to cover a whole bunch of small topics around the file system. Starting with some of the file system updates with regards to frameworks, system architecture, talk about what some of the trends are in the industry, talk about what some of the recent changes are that we've made to the file system.

I'm not allowed to talk about all the changes that we've made to the file system because some of them haven't come out yet. Then we'll talk about a number of best practices that hopefully you can take advantage of in your apps or your tools or your file systems. And then we're going to spend the last part of the session talking about FileVault 2 or full disk encryption. We'll talk a little bit about what it is. And then I'll bring Soren on stage to talk about exactly how it works.

[Transcript missing]

Like I mentioned, it's best to use those highest level APIs, but as you can see, you're free to use any of those APIs all the way down the stack. So whether or not you're writing a Cocoa application or even a UIKit-based application, you're free to use those BSD-based APIs, the opens, closes, stats, and so forth. So let's talk a little bit about what the filesystems are or what a filesystem is. I think typically when you look at Mac OS X, people think we have one filesystem, and that's HFS+. But in reality, we support six different filesystems.

So back in Snow Leopard, we added support for ExFAT. If you buy a camera today, typically they have an SDXC card in them, and those cards are high density and very fast. So in order to support those, we added support in Snow Leopard for ExFAT. FAT is another interchange format that we've supported for a long time. So if you have friends that are working at a fast-paced company like the DMV and you want to interchange with them, you could always use a thumb drive formatted as FAT.

HFS. So we actually have two flavors of HFS. One that we call HFS standard, and this is the really ancient format. Back in Lion, we turned it to read only. So if you have any floppy disks and you want to use them, we'll still support them in Mountain Lion, but they're going to be read only, and now we spit a log message out that says that HFS standard has been deprecated.

HFS+ is our hallmark filesystem. It's the one that we use everywhere now. So we actually use journaled HFS+ everywhere. The journal means that you can pull the power at any time, and your filesystem is still intact. There shouldn't be any data corruption or any volume corruption on that disk.

Going right along, we also support NTFS as read-only. So that's primarily for Boot Camp. You can go ahead and have your Windows partition and your Mac partition, and be able to read that Boot Camp partition from your Mac partition. And UDF. We require UDF for DVD reads and writes.

So one of the things that's a big advantage to us is that we support the same code base on iOS and Mac OS X. And that really has been beneficial to us by just having a few really good engineers working on this code base. As we fix bugs on one platform, they're automatically fixed on the other platform.

Or as we add performance improvements on one platform, the other platform gets to take advantage of them. Like an example of that was Trim. Trim is a command that we can send down to the SSD. Basically, it's a call that lets the SSD know we're not going to use these blocks anymore. And it can go ahead and do its own garbage collection, its own cleaning, and actually improve the lifespan by doing that. And by adding that on one platform, the other one gets to take advantage of it.

I'd say the one minor difference between what we support on the desktop and the phone is that on the phone we use what's called a case-sensitive version of HFS+. The main reason we did that is so that we can reduce the overall footprint by not carrying around these case folding tables with us.

So the phone is case-sensitive. The things to keep in mind about that are, for instance, if you're doing iCloud synchronizations, you're working on the same documents on your phone or your iPad as you are on your Mac, it's perfectly fine to create two documents in the same directory on your iPad that are called "hello," one all lowercase, one all uppercase. But if you're going to try and sync those with your Mac on the other side, you're going to have a collision.
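The collision described above can be checked for ahead of a sync. This is a minimal sketch in Python, used here only for illustration. The function name `find_case_collisions` is hypothetical, and `str.casefold()` is an approximation, since it is not the exact case-folding table HFS+ uses.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Return groups of names that differ only by letter case.

    Such names can coexist in one directory on a case-sensitive volume,
    but would collide on a case-insensitive one.
    """
    groups = defaultdict(list)
    for name in names:
        # casefold() approximates case-insensitive comparison.
        groups[name.casefold()].append(name)
    # Only groups holding more than one spelling are collisions.
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(find_case_collisions(["hello", "HELLO", "notes.txt", "Report.pdf"]))
```

An app might run a check like this before creating a document in a synced folder and pick a different name if the result is non-empty.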

We use case-insensitive HFS+ on the Mac. The only other place where we really have case-sensitivity on the Mac is on your Time Machine backups. By default, those are also case-sensitive. You're free to use it. You're free to format a drive as case-sensitive, but typically coming out of the factory, they're case-insensitive on the desktops.

Okay, so that was a little bit about the filesystem and the frameworks updates, what they are, just to give you a brief overview. Now I'm going to talk about some of the more, I think, more interesting parts, some of the storage industry trends. So not necessarily what we're doing at Apple, but some of the trends that we see across the industry.

The first one that you've seen is that machines are moving away from your typical spinning hard drives and towards SSDs. And this is, I think, kind of a -- well, it's kind of a pun -- but kind of a revolutionary change that we're seeing in the industry. I mean, I think about every 12 to 14 months, the speed of these SSDs is doubling. So it's really incredible. Where the speed of your processor isn't doubling, we're really making huge improvements here. And that contributes to the overall performance of your machine, especially as you page in and out of these SSDs.

Something you see across the industry is these SSDs are slowly moving over towards a PCI-based controller rather than SATA bus for better performance. One of the two biggest advantages I see to SSDs is for the random I/O performance. Random I/O on SSDs is about 100 times faster than what you get on a hard drive.

But the thing that's hard for me to completely wrap my mind around, maybe it's because I come from software, is that although they look like they're doing a lot better, we're getting a lot faster, for some reason, the write cycles are going down. So that means as you write to the same area of this SSD, it will start to -- I don't know what the right word is -- disintegrate. It will start to go bad. So they're coming up to a place now where you can only write to this particular cell on an SSD 3,000 times before it goes bad. So behind the scenes we do a lot of shuffling around.

So if you think you're writing in the same area of the SSD, they're actually moving this around, waving their hands around so it's actually getting spread across the disk. But that's one interesting thing and a concern that we have and we're trying to improve on that as well.

There are also security concerns. Like I mentioned, if I write to one block on an SSD, behind the scenes it's actually moving that around. So if I think that I've written some critical data here and I go to do a clean wipe by writing zeros all over that block again, I may think I've erased that from my SSD, but you might have a grad student somewhere that hooks up their thesis project and they're able to read directly from these SSDs, they can actually find out that this has been moved to several other blocks in the SSD and pull the data out. So several interesting areas revolving around SSDs that we have to take into consideration. And lastly, they're also more expensive than hard drives.

So what does that mean to filesystems and what we're doing? I think first off, with that fast random I/O, that kind of opens up a lot more opportunity for us. It makes it more -- it adds a lot more flexibility in the design decisions that we're going to make going forward. If we don't have to worry about defragmentation so much or keeping those files and those data structures contiguous, it allows us some more opportunity. Also, because it's trending towards, as I mentioned, these fewer write cycles, we have to be cognizant of that.

One way that we do that is now we start to send those trim commands down. By having the filesystem send a trim command down to the SSD, that lets that SSD know that I'm not going to be using this area of your SSD anymore. You are free to go ahead and assume that it's empty. You don't have to move these blocks around. You don't have to shuffle anything around.

So it actually improves the lifespan of your SSD and actually makes it faster. And I think the one thing you can take away from this, whether you're working on an SSD, whatever your application is, or a hard drive, is that large sequential writes are always better than a lot of small random writes. You're always going to get better performance.
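One way to act on that advice is to coalesce many small records in memory and hand them to the filesystem in a few large writes. The following Python sketch is illustrative only. The names `write_all` and `write_batched` and the 1 MB default are assumptions, not anything from the talk.

```python
import os


def write_all(fd, data):
    """Write the whole buffer, looping in case os.write is partial."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def write_batched(path, records, flush_size=1 << 20):
    """Append small records to a buffer and flush it in large chunks.

    Returns the number of large writes issued by this function.
    """
    pending = bytearray()
    flushes = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for record in records:
            pending += record
            if len(pending) >= flush_size:
                write_all(fd, pending)
                flushes += 1
                pending = bytearray()  # start a fresh buffer
        if pending:
            # Flush whatever is left over at the end.
            write_all(fd, pending)
            flushes += 1
    finally:
        os.close(fd)
    return flushes
```

With fifty 100-byte records and a `flush_size` of 1024, this issues five writes instead of fifty, and the data lands on disk in order.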

Another trend that we see is in distributed storage. Kind of what we've seen over the last couple of years is that our data is getting bigger and bigger. We're taking more digital photos, have more music and so forth. But the devices that we use, they're still relatively small.

I mean, our phones are now at most around 64 gigabytes or so. So that's a small window into our data. So what we're going to see a lot more of is more distributed storage. And that means that I'm going to see a window of my data on my iPad or my iPhone or even my MacBook Air.

And as I need to reference those songs or those pictures, they'll start to come down from the cloud. Something else we see is that users are getting more devices more often. So we have to take into consideration backup, restore and migration. And when we store things in the cloud, we always have to recognize that we have to be cognizant of secure backups, efficient backups and also deduplication.

That's kind of a fancy term meaning that if we all own the same song, up in the cloud we only have to keep one copy of that song, and then when you reference it, you'll have access to that song. Okay, so that was a little bit about the storage industry trends as a whole. Now I'll talk about some of the recent changes that we've been making in the filesystem.
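The idea of deduplication can be sketched with a content-addressed store: identical bytes hash to the same key, so they are kept once no matter how many owners reference them. This is a toy in-memory Python illustration. The class `DedupStore` and the choice of SHA-256 are assumptions and do not describe how any particular cloud service works.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store that keeps each distinct blob once."""

    def __init__(self):
        self.blobs = {}   # content digest -> bytes, stored a single time
        self.owners = {}  # (user, name) -> content digest

    def put(self, user, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # setdefault stores the bytes only if this content is new.
        self.blobs.setdefault(digest, data)
        self.owners[(user, name)] = digest
        return digest

    def get(self, user, name):
        # Follow the owner's reference to the shared blob.
        return self.blobs[self.owners[(user, name)]]


store = DedupStore()
song = b"same song bytes"
store.put("alice", "song.m4a", song)
store.put("bob", "track01.m4a", song)
print(len(store.blobs))  # one stored copy for two owners
```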

So the first one, this was actually introduced back in Mac OS X 10.4, but I just wanted to let everybody know about it. This is kind of one of those little nuggets: if you have a certain problem, this might present a solution for you. And it's extended attributes. I don't think this is used nearly enough by developers.

Extended attributes are a way of leaving like little breadcrumbs of data behind on a file. So if you're sharing a file between different applications, maybe it's a PDF file. So you have many applications that can open this up. And you want to leave a little breadcrumb along that said, like, for instance, you're the author of it or my application touched this last.

But I don't want to interfere with the format of that file itself. You can attach an extended attribute to it. We have a pretty simple API set. It's just really four of them. That's all you need to be able to get an extended attribute, set an extended attribute, list them to see which ones are there or if you want to remove one.

In the terminal, if you type in ls -l with the at sign (ls -l@), that at sign specifies list out all my extended attributes. So I did this on some of my files. And you can see typically you're going to store 32 bytes of Finder info. That's going to contain some things like the label that you have attached to the file. On mine, I think I listed off my tax filing. So you can see Intuit attaches a little breadcrumb on there that talks about my filing status.

And in the past, before Lion, extended attributes were limited to only 4K. So I think that's one of the reasons why they didn't get a whole lot of adoption. It's hard to find what you want to store in under 4K. So starting in Lion, we increased the size of our extended attributes to be 128K.

I want to give a little bit of guidelines, some pros and cons of using the extended attributes, the guidelines when you should use them. I think that they're ideal for storing those little bits of metadata or what I call those breadcrumbs, things like author, the window position, the window rectangle, the history of this file. You can even attach extended attributes onto directories.

But kind of the one con of them, or the one thing to be aware of, is I wouldn't recommend you storing application critical data in extended attributes. I think application critical data should always be stored in the data fork of your file. The reason being is we live in a kind of a heterogeneous environment where we're copying files around across the web, uploading them into Safari, across different people's machines.

And as you do that, not all different file transfer protocols or volume formats support extended attributes. So as we move them around, we actually have to split them off into two files. And if at any point we lose one of those files, the file containing all that extra information, what I call these dot underbar files, you then lose all of your extended attributes.

You know, if another application opens up this file that you're working on and it decides to rip out all those extended attributes, you can lose them. So I think it's perfect for storing little bits of metadata around that make the user experience better, but I wouldn't store your critical data as an extended attribute.
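The "dot underbar" files mentioned above are sidecar files whose name is the original filename prefixed with `._`. As a small illustration of how the two halves can drift apart, this Python sketch lists sidecars whose main file is no longer next to them. The function name `orphaned_sidecars` is hypothetical, and the check is by filename only.

```python
import os


def orphaned_sidecars(directory):
    """Return '._name' files in a directory that have no matching 'name'."""
    names = set(os.listdir(directory))
    return sorted(
        name
        for name in names
        # A sidecar is named "._" + original filename.
        if name.startswith("._") and name[2:] not in names
    )
```

The reverse case, a main file that has lost its sidecar, cannot be detected this way, which is one more reason to keep critical data in the file itself.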

And kind of last note on here, just to take note, is that if you store a small bit of metadata under 4K, we actually pack those into the attributes B-tree. So we don't even allocate a full allocation block on disk. If you want to store a couple bytes, we only allocate a couple bytes to store that information.

So I want to talk about a couple of API sets that we've deprecated and what the replacements are. So these are some other recent changes. So starting in Mountain Lion, the way that you mount servers has been changed, and basically it's been modernized. We've replaced the option bits with more modern dictionaries. We've replaced the CF run loops with more modern GCD dispatch queues. So we have the FSMountServerVolumeSync, which used to take option bits, now takes a dictionary. What used to take a run loop now takes a dispatch queue.

Another API set, just to drill it home, is that the File Manager has been deprecated. And this is what I said is the old Carbon File Manager or old Core Services File Manager. So when you used to get metadata about files, you used to call something like FSGetCatalogInfo or FSGetCatalogInfoBulk.

And when you call this on a file, you're going to go ahead and get, for instance, the name of the file, the modification date, the creation dates, and a bunch of other metadata associated with it. Those have been replaced by more modern property-based APIs, so the CFURLCopyResourcePropertiesForKeys API. Also note that those newer APIs are available on iOS, where the old File Manager calls aren't available.

So those are some of the recent changes. Now I'd like to talk about some of the best practices and kind of some tips for you when you're trying to eke out the best performance from your application or out of the filesystem. Throttled IO. So sometimes in your application you want to do something in the background. You don't want it to interfere with what the user is currently doing.

You don't want to interfere with the user experience that's going on right now. So if I'm writing an application that has some sophisticated workflow to it and I'm manipulating some data right now and at the same time maybe I want to create backups of that document, I probably want to create those backups in the background. I don't want it to interfere with what the user is currently doing now.

So one way to do that is to say I want to do this in a throttled IO way. This means that when the user has their hands off the keyboard, I can create those, I can go ahead and compress those backups, move them off to the side and so forth at full speed, but when they're using the computer and they're using data, reading in new data, I want to kind of throttle that back.

And that's what we call throttled IO. Typically on the system today we use that in a number of places. You'll see it in, for instance, Spotlight, Time Machine backups, background encryption. So, for instance, if you turn on full disk encryption and you're not using your machine, we will go ahead and encrypt those blocks as fast as we possibly can. But as soon as you put your hands on the machine and start copying files around, we scale it way back and start moving very slowly.

That's what throttled IO is. So if you have a need to do something in the background or throttle your IO, the best way to do that, I would say, is to use the Grand Central Dispatch I/O background queue mechanism. Any IO that you do on the background queue will automatically be marked as throttled. The next topic I want to talk about is kind of what I call playing nicely. Playing nicely with others. Memory on your system is a shared resource.

We have a small amount of memory. Maybe we have four gigabytes of memory on your system. And we have a large drive. So every application, VM and so forth, is vying to use that memory on that system. You're paging in and out from memory. When you read blocks off the disk, they're being stored in memory. So you don't want to be a bad citizen and blow everybody else's cache.

Right? Kind of what I mean by that is if I have a number of applications, they're all reading and writing their files. But what really happens is they've got a large amount of reads and writes going on, but most of those are already held in the cache, in the unified buffer cache. So we actually have to do very little IO down to the disk. This becomes very important, especially if you have a slow disk. This is what it usually looks like.

Now, if I've just written my first application and I want to read a large file to play my own type of a movie, for instance, I come along and I say I want to read four gigabytes. Next thing you know, it reads four gigabytes and it fills up my whole cache and blows everybody out.

So that's kind of a bad citizen. That's what you don't want to do. Then when those applications go ahead and become active again, they want to go ahead and access their files again, and they have to do a lot of I/O down to the disk, and they blow you out of the cache now. So what you're doing is creating this pattern of everyone's just thrashing on memory, on the unified buffer cache.

So that's where the F_NOCACHE bit comes along. For I/O that you're only going to read once, especially large I/O, it's best to mark it as F_NOCACHE. This would be being a good citizen. That means I want to read this data, but don't put it in the cache. Don't use up all those extra buffers in the cache. I'm not going to read it again. So if you mark your I/O as F_NOCACHE, that same application reads it right from disk, and everybody's happy. So that's F_NOCACHE.
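F_NOCACHE is set per file descriptor with `fcntl`. Here is a minimal Python sketch of a read-once pass over a large file, assuming Python's `fcntl` module exposes the `F_NOCACHE` constant on OS X. On platforms where it is missing, the sketch just reads normally. The function name `checksum_read_once` and the 1 MB chunk size are illustrative choices.

```python
import fcntl
import hashlib


def checksum_read_once(path, chunk_size=1 << 20):
    """Hash a file in one pass, asking the OS not to cache the data."""
    digest = hashlib.sha256()
    with open(path, "rb", buffering=0) as f:
        # F_NOCACHE is OS X specific; skip the hint where it is absent.
        nocache = getattr(fcntl, "F_NOCACHE", None)
        if nocache is not None:
            fcntl.fcntl(f.fileno(), nocache, 1)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```

The flag is a hint about caching behavior only. The bytes returned are the same with or without it.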

Next performance tip, kind of like Costco, you're always going to get a better deal if you buy in bulk. And the better performance that you get, the better battery life you're also going to get. That's just a general rule of thumb. If you're going to go, for instance, buy six bottles of wine, you don't make six trips to the store, you make one trip, you buy the six bottles and you get your discount.

Here are a few APIs to use if you're going to go ahead and try to get a lot of information. Same thing, you don't want to iterate through a bunch of files in the directory getting information about them. You want to get it in bulk. So the first one from the application perspective is searching.

So this is kind of related to Spotlight. The NSSearchField is typically that field that you see at the top of windows that's kind of that oval with a magnifying glass in it. That's your NSSearchField. And the way it typically works is through the NSMetadataQuery APIs. Those link down into your Spotlight database. So the way that works is you build up what's called a bunch of predicates if you're writing a UIKit or AppKit-based application.

Basically, you would say, for instance, you build up your predicates. So you say, show me all presentations that I've created in the last week that are over five megabytes. So you create these three predicates. You hook them to your NSMetadataQuery, and it will just go ahead and return back everything in bulk for you for you to iterate through.

The other topic I want to talk about is filesystem notification changes. So this is a very powerful concept called FSEvents. We always get requests that say how do I tell if a file has changed somewhere in the filesystem? I want to see this and I want to know when this file has changed. Maybe it's in the directory, maybe it's anywhere on the whole filesystem. And the best way to do that is with FSEvents.

FSEvents works in two different ways. You can either poll for those events or else you can have them delivered to you. So the two ways that we use them in the system, for instance, one of them is in Time Machine. So every hour Time Machine runs and it says, I want to know about all the filesystem changes that have happened in the last hour. So it runs and we go ahead and return back this big long list.

Now, kind of to talk a little bit about the history of that, before Lion, that big long list was just a list of directories, and it was up to Time Machine to figure out which one of those files in the directory has changed. Now, with Lion and later, when it says, tell me all the filesystem changes that have happened in the last hour, we actually return back a long list of every file that's changed.

It then goes ahead and says, I want to back up just these files. It doesn't have to iterate through every single file in the source, every single file in your directory and figure out which ones have changed. It just asks for the changes. So that's the polling method. Give me all the things that have changed in the last hour.

The push method, for instance, would be where the Finder comes in. So when you have a Finder window open, the Finder has FSEvents on that directory. So if, for instance, someone else on a different computer copies something into that directory or creates a file, FSEvents then says, hey, look, I've created this new file, the Finder gets a notification, and it draws the icon. So it works in two different methods there, either poll or push.

And the last one also is kind of goes along with the first one with the searching, is if you want to get a bunch of metadata at one time about files. People typically do this. They look through directories worth of files to look for the metadata. You can use those URL-based APIs that I mentioned earlier. And those are all bulk-based. So once you get the results, you create an enumerator and just enumerate through all the results.

Okay, this is one of my favorite ones. If you've ever seen this before in your application, you know it's not a good sign. And there's only really one reason why you'll ever see what we call a SPOD on your application. And the only reason you ever see this is if you haven't serviced your event queue in the last two seconds. So that means basically you haven't checked to see if there's something new for you off the event queue.

This is almost maybe selfish, but my engineers get tons of these bugs all the time where it says, you know, the filesystem's broken, I was doing something, and I see the spinning cursor, and my engineers spend all sorts of time looking at it, and they go, well, it's not broken. You know, maybe we're just waiting for your disk to spin up. With some computers, it takes up to 10 seconds for these disks to spin up.

You're on your main thread trying to do UI, and we're waiting for this disk to spin up, and someone thinks it's a filesystem bug. It's not a filesystem bug. I think the crux of this is that you have to be able to expect large latencies from whenever you're getting files. Whether you're trying to get information from a file across a network or a hard drive that's spun down, this could happen.

So what you should always do is get your UI -- or excuse me -- get your I/O off the main thread. There's a number of different ways to do this. You can do it through Grand Central Dispatch to create another thread to do that. NSOperation is built on top of GCD, so that's another way.

There's also a method called, like, performSelector:onThread:. That's probably the easiest way to get your I/O off your main thread. Your main thread or your UI thread should really only be used for two things, and that's to pull events off the queue and to update your UI. Otherwise, you're susceptible to getting your SPODs.
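The same pattern can be sketched outside Cocoa. In this Python illustration a small thread pool stands in for GCD or NSOperation: the slow read runs on a worker thread and a callback receives the result. The names `io_pool`, `read_file`, and `load_in_background` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared pool of worker threads reserved for blocking file I/O.
io_pool = ThreadPoolExecutor(max_workers=2)


def read_file(path):
    """Blocking read; may stall on a spun-down disk or a slow network."""
    with open(path, "rb") as f:
        return f.read()


def load_in_background(path, on_done):
    """Start the read on a worker thread and return immediately."""
    future = io_pool.submit(read_file, path)
    # The callback fires once the read finishes. Note that it does not
    # run on the main thread, so a real UI app would hand the result
    # back to its main event loop before touching any UI state.
    future.add_done_callback(lambda done: on_done(done.result()))
    return future
```

The caller gets control back right away, so the event loop keeps being serviced even if the read takes ten seconds.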

I just want to talk a little bit about NSFileCoordinator. If you're working with iCloud, you're going to have some contact with this. NSFileCoordinator basically serializes access to a file. This becomes important because, for instance, if I'm writing an application like Pages, I'm updating a file.

And this file is going to get synchronized to the cloud. So while my process, Pages, is updating this file, I have another process that's called the iCloud daemon that's trying to read this file and upload it to the cloud. Or possibly it sees a newer version of that file and it's trying to download it from the cloud back down onto my disk. So we have to serialize the access to this file. And we do that through NSFileCoordinator.

So if you create a file presenter, that basically means let me know when there's been a change to that file. So to put this in context, if I'm Pages and I have this beautiful document up here and I'm working with someone else, and I have a picture up here in the corner, someone else may update that picture.

When they do that, I'm going to get called through my file presenter, and it says there's been an update. Now it's my responsibility in my application to go ahead and throw away, to deallocate everything I've had with that file, read it off disk again, and update all those data structures again so I get the new picture.

If you don't use this method, you're going to be fighting each other. You're both going to be saving changes over each other. So the coordinator helps you coordinate changes from multiple sources. Also, again, if you're using UIDocument or NSDocument, you get a lot of this for free. It automatically takes care of the file coordination for you.

Okay, so this one, now I'm slowly getting a little bit more technical. This is one of my favorite tools. There are two ways to analyze performance of your filesystem. You can do it through Instruments, and Instruments will call the tool fs_usage, or you can go right to the terminal and call fs_usage. I'm just curious, how many of you out there have ever run fs_usage? Great.

Okay. Great. This is great. I heard like 60% of the people were new to the conference. I thought nobody's probably heard of fs_usage before. It's a great tool. Especially if you're having problems with your -- if you want to analyze the file system performance or see where your bottlenecks are. You run something like fs_usage and you see a whole lot of text come through.

And all that text is actually -- it's pretty impressive to me because that means that we've got so many different processes all doing activity at the same time in the file system. So that's another place where it's really important for us to make sure that we keep things very tight inside the file system, keep our data structures small and allow as many threads to access our data structures as possible at the same time.

So you see a lot of activity come through. It's easy to kind of trim this down. If you specify the name of your application in the command line, in this case, TextEdit, it will weed out all that extra information. So I just see the file system activity that's responsible for what I'm looking at, and that's my application, TextEdit.

So in this case, all I did here at 2 o'clock in the afternoon, all I did was do a safe save of a document. I had a file open, I made some changes, and I hit Command-S, and I captured the fs_usage output. So I just kind of wanted to walk through it with you to show you a little bit how to use fs_usage for your own applications. So the first thing that you notice is we have this 2 over here. That's an error code. That's an ENOENT error. That means that this file does not exist.

So to take a step back, this is probably actually expected because the way the safe save works, if I have my existing document here, what NSDocument does for me is it creates a brand-new copy of the document over here. So I've got my old document here, brand-new copy of my document here, and then the last step, it does a rename to just replace the original one.

We do it this way because at any point during this process, if I were to pull the plug on the machine, I will have a consistent version of that document. So it's either all the old version of the document or all the new version. I never end up with something that's half-written when I pull that plug.
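The write-then-rename sequence described above can be sketched in a few lines. This Python version is an illustration of the general technique, not NSDocument's actual implementation. The function name `safe_save` and the `.tmp-` prefix are assumptions.

```python
import os
import tempfile


def safe_save(path, data):
    """Replace the file at path with data via a temp file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one volume. Note: mkstemp uses mode 0600, so this sketch
    # does not preserve the original file's permissions.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new bytes out before renaming
        # The rename swaps in the new version over the old one.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, remove the partial temp file and keep the original.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies before the rename, the original document is untouched. If it dies after, the new version is complete. A reader never sees a half-written file at `path`.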

So probably here when we get this ENOENT, it's probably saying, like, I'm going to create this temporary file. Does it exist yet? Because I want to create a file that doesn't exist yet. So now let's go ahead and examine some of the other, I think, more interesting calls in what we're looking at here just to kind of walk through what the output is.

The first field of the output, here we're going to open it, write the file, sync it to disk, make sure everything's out on the disk, and then we're going to rename it over the top of the file. So we opened it up at 2:06, that's the time stamp where we opened it up. The system call that we made was a write in this case.

And we know that it took 0.000018 seconds. So it's very quick. And then we have the application and thread ID. One thing to take away from all this is that I'm able to save in about 1/300 of a second. So the filesystem is something that we see a lot -- kind of like memory, where we see a lot of people beating on it at the same time, and it's very important and critical that we perform it as quickly as possible.

So that's a little bit about how to use fs_usage. Kind of the takeaway is if you're having filesystem performance problems or you suspect it, I recommend that you run fs_usage. It's enlightening to see all the things that are going on in the system when you run fs_usage, maybe not only in your application, but it's pretty impressive to see everything that happens, and you can easily identify where you're spending your time.

Now that I've talked about a lot of the filesystem proper, I'd like to talk about FileVault 2, or what we call full disk encryption. I'll talk about what it is and then we'll have Soren come up afterwards and talk about how it works. So before we started the FileVault 2 project, or what we call full-disk encryption, I've seen a lot of quotes like this.

We have entire websites dedicated to things like this. A thief recently stole a computer server belonging to a major U.S. insurance company, and company officials now fear that the personal data of nearly 1 million people could be at risk. California has a law, as well as 41 other states, that require you, if you lose someone's personal data, require you to let that person know and also let the state know.

So this becomes a very big PR issue for companies if they start to lose social security numbers or credit card issues. So we thought, well, this is a big problem. Let's go ahead and try and solve it. How hard could it be, right? Right? I mean, we saw some solutions out there.

They looked like they were okay, but we thought we could do better. We thought, you know, you take data, you encrypt data, you store it on disk. Once you get it, you read it off disk, you unencrypt it, and you return it back to the user. It seems like a simple problem. We can knock this one out.

So we started drawing it out on a whiteboard, and we got something like this. All of a sudden we noticed, like, okay, we're going to have to write code in EFI, we're going to have to write a lot of kernel code, we're going to have to write user space code, user space library code. We didn't have anything related to key management. We hadn't thought about that part yet. That's in there.

And then we needed a bunch of these applications on top to be modified to work with this too. So it quickly became much more complicated than what we originally thought. But one of the big benefits, I think, of working at Apple is that we're able to develop these integrated big solutions and integrate it right into the system.

So I kind of want to diverge a little bit from FDE for one slide and talk about something right at the middle, and that's this component right here. That's what we call CoreStorage. And that's the piece that I was talking about where we thought we can go ahead and encrypt all the data and do the decryption of it.

So to take a closer look at how that works and what CoreStorage is, basically, where it fits in the stack, if you think about it, HFS+ would be at the top over there. That's your logical volume, Mac HD. And then beneath that, we have this whole layer called CoreStorage. You could think of it like a disk driver or like a block storage device driver. This whole layer is going to go ahead and intercept all the data being written from HFS+ going down to your drive. And then at the bottom, we have the actual drive itself.

So if you look at the way it works, we have our top secret document over there. We shuffle it down into CoreStorage where we're going to encrypt it. And we actually go ahead and move it to a different place on disk. And that's what the address tree is about. We move it to a slightly different location.

But we're finding that we can take this layer, this CoreStorage layer, and it has a whole bunch of possibilities that we have here. Encryption is just one of them. This is a place we're hoping to develop feature after feature for our users and where we're spending a lot of time doing new development. So stay tuned to this layer, and hopefully we'll have a lot of new information for you.

Now back to FileVault 2. A little bit about what FileVault 2 is, if you haven't used it yet. It password protects your entire volume, like full disk encryption or whole disk encryption. It encrypts every block on your volume, so like you saw in that last picture, it doesn't just encrypt your payload data, it also encrypts your metadata. In fact, we go through every single block of your HFS+ volume, whether it's free space or used space, and we encrypt them all from one end to the other.

We also support encryption on external drives, so this becomes really cool if you have a thumb drive. You can go ahead and format it as encrypted, and we can store that key in your keychain, or your password in your keychain. That means whenever you put that thumb drive in your computer, it just mounts, but if someone steals it from you and they stick it in their own computer, they'll be prompted for a password and it's thoroughly encrypted. And we also support encrypted Time Machine backups.

So we use a method called AES-XTS for our encryption. It's basically AES-128 with some ciphertext stealing. What the ciphertext stealing means is that out of a 512-byte block, we encrypt the first 16 bytes, we get the result of that, and we mix it into the next 16 bytes, and we mix that into the next 16 bytes, and so forth, all the way along.

AES-XTS is FIPS compliant. It's also what NIST recommends. So as of January 2010, this was the most modern, best method that they recommended for encrypting data at rest. So I think what we've delivered sets the standard for encryption as well as performance.

So one question I get, though, a lot, though, is how can we use AES 128 and not this new thing called AES 256? So I ask you, how big is 2 to the 128th? So I kind of have a quote here that I want to read to kind of set it in perspective so you know the difference between 2 to the 128th and 2 to the 256th.

Imagine a computer the size of a grain of sand that could test a key in the amount of time it takes for light to cross that little grain of sand. If the whole planet were covered in a meter of these little computers, it would take on average 1,000 years to crack a 128-bit key. So 2 to the 128th is, I think, as close to infinity as you're going to get. When you're talking about cracking encrypted volumes, people never talk about cracking the keys. They talk about cracking passwords.
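To put rough numbers on that comparison, here is a small back-of-the-envelope calculation. The attacker speed of 10^18 keys per second is a made-up figure chosen only for illustration, not something from the quote, and the function name expected_years is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_years(key_bits, keys_per_second):
    """Average brute-force time: on average half the keyspace is searched."""
    return (2 ** key_bits / 2) / keys_per_second / SECONDS_PER_YEAR


# Hypothetical attacker speed, far beyond any single machine.
RATE = 1e18  # keys tested per second

print(f"128-bit key: {expected_years(128, RATE):.2e} years on average")
print(f"256-bit key: {expected_years(256, RATE):.2e} years on average")
# Going from 128 to 256 bits squares the keyspace rather than doubling it.
print(2 ** 256 == (2 ** 128) ** 2)
```

Even at that assumed rate, the 128-bit search averages trillions of years, which is why attacks go after passwords rather than keys.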

I heard a good quote before, and it's a lot easier just to go out and buy a $10 wrench to get someone's password than it is to crack their key. So what is FileVault 2? One of the things that I'm most proud of, I think, with the FileVault 2 project is the way it integrates the whole system and the fit and finish that we put onto it.

A couple of examples are we have this feature called Treat First Reboot Special. So when you enable full disk encryption, you're going to be prompted for your password at EFI time. It looks just like the real Cocoa login window, but we actually present that at EFI login time as soon as you reboot. So sometimes your keyboards don't always work from EFI. If I've got a third-party input device that runs through some different type of a router and so forth, it doesn't always match up.

Sometimes my Bluetooth keyboard isn't paired correctly. Or one thing that we get a lot, in fact, when I called up AppleCare, I think the number one call driver was people that owned iMacs or people that enabled password login and forgot their passwords. So we wanted to be very sure that when you enable full disk encryption that you're able to at least remember your password or help you along a little bit, and we're not just going to completely encrypt your data before you get a chance to type in your password.

So we treat the first reboot special. We turn on CoreStorage, we reboot, and we immediately ask you for your password. If you're having trouble typing in the password or if you've forgotten your password, what we do is go ahead and reboot into the recovery OS. You can disable FileVault 2 or you can change your password because at this point, no encryption has been done. As soon as you type your password in the first time, that's when we start encryption, and that's when all bets are off.

But actually not all bets are off, because we have this other feature that is called the recovery password. When you enable full disk encryption, the first thing it does is print out a long ASCII string of your recovery password. What that really is, is kind of your fail safe. If you have forgotten your own password, you can write this down and keep it in your safe.

Something else you can do, though, is we have an option where you can store it with Apple. It doesn't actually store your recovery password with Apple. The next step it does is ask you three banker-style questions, you know, what street did you grow up on, who is your favorite teacher, what's your pet's first name. It takes all of these answers and encrypts that recovery password and stores that encrypted blob up with Apple.

So the next time you can go up and call AppleCare and say I've forgotten my password, can you help me in the last three questions? If you know the answers to those three questions, they'll read off your recovery password. If you don't, though, there's no way to get back in your computer. It doesn't matter who you are. There's nothing Apple can do. There's no back door.

That machine is cryptographically wiped for all intents and purposes. Okay, so that was a little bit about what FileVault is. Now I'd like to bring Soren Spies on stage to talk a little bit more about how it works. So, FileVault 2, we got CoreStorage, we got HFS, we got encrypted data, we got unencrypted data. How do we get there? How do we set it up? How does it work?

My name is Soren Spies. I work in storage technologies with Deric and I had a lot of fun putting all this stuff together and now hopefully explain it to you. So, first of all, I want to talk a little bit more about CoreStorage, in particular how it does the in-place encryption. Then I want to talk about FileVault 2's key hierarchy, how we make it so that multiple people's passwords can unlock a single disk.

Also, we're going to talk about how we secure those keys to make it hard to attack. We also are going to talk about tools that you can manipulate CoreStorage FDE volumes with. And finally, I just have one little list of requests to try to guide. We get a lot of different requests for full disk encryption in FileVault 2 and I want to try to guide those in a way that's more user-friendly. So, first of all, at the low level, you run diskutil list. How many people know how to run diskutil list? Yes. Okay.

You're in the right session. So, in the old days, you just had your disk zero and you had a bunch of partitions. You probably noticed we added this Apple boot partition in Lion. We store the recovery OS in there. But we also store what we call the boot not equal to root pieces.

Those are the unencrypted bits that we need in order to boot something fancy like CoreStorage full disk encryption.

So the input is sort of this disk0s2 in the slide, and then the output is this new virtual disk, disk1, and that's the unencrypted data. So when it's locked, there's no disk1. But as soon as you do the unlock, then this disk1 comes out, and its partition type, you'll notice, is Apple_HFS.

So it is an unencrypted volume that just mounts like any other. So from an application perspective, there's no specialness for FDE, really. There's nothing FileVault 2 makes applications do any differently. The only thing is sometimes you attach a volume, and it doesn't come online 'cause it's not unlocked yet.

So basically, disk0s2 gets consumed, and disk1 gets produced, and there's a CoreStorage driver in the kernel that's doing that. So that's what we have. How do we get there? 'Cause we want it to be very smooth for users. We want them to click turn it on, and we do make them reboot and type their password. Deric did not mention that you can actually just power off on that very first reboot, and we'll go ahead and turn off FDE for you. After that, then you have to go to the recovery OS.

So we have your HFS volume, and we shrink it a little bit, and then we basically wrap around it-- there's a tiny bit of space at the beginning, fortunately, with HFS--we wrap around it CoreStorage metadata. So we take--your HFS now has some free space, and we go ahead and make it into a CoreStorage volume.

And when we reboot, the CoreStorage driver comes up, and once we've made sure that you typed in your password successfully at reboot, we start encrypting your data. And we encrypt your data 128 megabytes at a time. We basically take a lock out on that segment. We encrypt it, but we write it into a different part of the disk.

And so we basically slide this window throughout the disk so that at any given moment in time, if you want to access a logical block of your HFS, it's either encrypted or not encrypted, and we move that slowly over. If there is a crash, we haven't erased the old data yet, and so we can go ahead and start over on the most recent 128 megabyte segment if the computer happens to crash.
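The sliding window described here can be modeled in a few lines. This is a toy sketch under loud assumptions: segments are short byte strings rather than 128 MB extents, a single-byte XOR stands in for real encryption, and the names Volume, Crash, and toy_encrypt are invented for the example. It only shows why a crash mid-segment is harmless: the plaintext source is left intact until the progress marker is committed.

```python
class Crash(Exception):
    """Simulated power loss."""


def toy_encrypt(segment):
    # Stand-in for real encryption (NOT secure): XOR is its own inverse.
    return bytes(b ^ 0x5A for b in segment)


toy_decrypt = toy_encrypt


class Volume:
    def __init__(self, segments):
        # Slot 0 is the free space made by shrinking the filesystem;
        # logical segment i starts out as plaintext in physical slot i + 1.
        self.physical = [b""] + list(segments)
        self.converted = 0  # durable marker: segments already encrypted

    def read(self, i):
        # A logical segment is either encrypted (moved down one slot)
        # or still plaintext in its original slot.
        if i < self.converted:
            return toy_decrypt(self.physical[i])
        return self.physical[i + 1]

    def convert(self, crash_at=None):
        total = len(self.physical) - 1
        while self.converted < total:
            i = self.converted
            # Encrypt into a different slot; the source is not touched.
            self.physical[i] = toy_encrypt(self.physical[i + 1])
            if crash_at == i:
                raise Crash(f"power lost while converting segment {i}")
            self.converted = i + 1  # commit; slot i + 1 is now free


vol = Volume([b"alpha", b"bravo", b"charlie"])
try:
    vol.convert(crash_at=1)
except Crash:
    pass
# After the crash every segment is still readable.
print([vol.read(i) for i in range(3)])
vol.convert()  # restart on the most recent uncommitted segment
print([vol.read(i) for i in range(3)])
```

Both print calls show the same three segments, before and after the resumed conversion.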

Let's talk about the keys. How do we set up the keys so that you can do the unlock with multiple secrets? First of all, we generate a random volume key, so 128 bits of just random stuff that we get out of /dev/random, and that is what we start -- that's the low level. Each block is encrypted with that data. We also, for indirection purposes, we do a key encrypting key. You'll see how that works in a second.

And each user record, whether that's sort of a real user like me or Deric or sort of a role user like the recovery users, they each get a copy of the key encrypting key. So if you have -- you know a secret, you can unlock the key encrypting key. The key encrypting key can unlock the volume key. So let's take a little look at what that would be like. So Deric comes along. He types in his password. And we run it through a thing called PBKDF2 with a lot of iterations.

And then we get a passphrase-derived key. Now, obviously, we didn't store that anywhere, but we can use it to decode the key encrypting key. There's one copy that Deric has, and he can unlock it with his secret. And then we have a hash that makes sure that we got it correct. And then we use that to decrypt the volume key.

This allows us to do I/O to the disk, the encrypting and the decrypting. What's nice about this is that if we want to add another user, we just set up that hierarchy again for the second user. Nothing about Deric's setup is changed, and we have an indirection so that we can change passwords, and we can also change out the volume key from underneath. We're not currently doing that, but that is something that we've architected it to be able to do.
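The hierarchy just described (password, then PBKDF2, then key encrypting key, then volume key) can be sketched with the Python standard library. Everything specific here is an assumption for illustration: the class name ToyVolume, the iteration count, SHA-256 as the hash, and especially the XOR "wrapping", which stands in for a real key-wrap algorithm and is not what CoreStorage does on disk.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 10_000  # illustrative only; a real system uses far more


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def derive(password, salt):
    # Passphrase-derived key: never stored, recomputed at unlock time.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt,
                               ITERATIONS, dklen=16)


class ToyVolume:
    def __init__(self, user, password):
        volume_key = secrets.token_bytes(16)  # encrypts the blocks
        kek = secrets.token_bytes(16)         # key encrypting key
        self.kek_check = hashlib.sha256(kek).digest()
        self.wrapped_volume_key = xor(volume_key, kek)  # toy wrap, NOT secure
        self.records = {}
        self._store_record(user, password, kek)

    def _store_record(self, user, password, kek):
        salt = secrets.token_bytes(16)  # unique salt per user record
        self.records[user] = (salt, xor(kek, derive(password, salt)))

    def _unlock_kek(self, user, password):
        salt, wrapped_kek = self.records[user]
        kek = xor(wrapped_kek, derive(password, salt))
        # The stored hash tells us whether the password was right.
        if not hmac.compare_digest(hashlib.sha256(kek).digest(),
                                   self.kek_check):
            raise ValueError("wrong password")
        return kek

    def unlock(self, user, password):
        return xor(self.wrapped_volume_key, self._unlock_kek(user, password))

    def add_user(self, existing_user, existing_password, new_user, new_password):
        # Adding a user only wraps another copy of the key encrypting key.
        kek = self._unlock_kek(existing_user, existing_password)
        self._store_record(new_user, new_password, kek)


vol = ToyVolume("deric", "puppies")
vol.add_user("deric", "puppies", "soren", "a much longer passphrase")
print(vol.unlock("deric", "puppies") == vol.unlock("soren", "a much longer passphrase"))
```

Note that add_user never touches the first user's record or the wrapped volume key, which is the point of the indirection.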

So we have, I mentioned, these different types of user records. Most common is going to be your OS user. It stays in sync with the OS. So if your user has a passphrase or a password, even if it's a terrible one like puppies, then it's going to be enabled on the full disk encryption part of the disk.

So if you don't have a password, we will not copy that user down to the disk. But if you do have a password, we're going to go ahead and enable it. And in general, all users are enabled all the time. So that there's a synchronization between that so you don't add a user and then, oh, I forgot to add them to the disk. We just try to do that automatically for you. We also have what's called disk passphrase. So that's great for a disk that doesn't have an OS on it. How many people are using a Mac Pro every day?

Okay, so how many of you have more than one disk in your Mac Pro? Okay, you guys are pretty good. So you could have your secondary disk if you didn't have any OSs. If it's like on my -- I don't have Mac Pros, but if I did, they'd have a lot of OSs on them.

But if you have a data disk, then you can go ahead and encrypt that disk and have it used from, say, Time Machine through your USB stick, et cetera. So that's a different type of password record, and if you boot it up, you'll see, like, a big disk, and it'll say "disk password," and then you type in your disk password.

It's sort of a slightly different model than the OS password. If there's an OS on the volume, it's generally a better experience to use the OS-synchronized passwords. Then we also have this personal recovery key, which Deric talked about, but we also have an institutional recovery key, so you can install a public key on your system, and when you turn on FileVault 2, we'll go ahead and make a secret that can unlock the disk and we'll encrypt it with the public key, and we'll store that.

So if you have the private key back home in your super-safe system, if that--you know, if I were to leave Apple and somebody wanted to decrypt my disk after I left Apple, they could take my computer, and they could take Apple's private key, and they could bring them together with the appropriate diskutil command line and unlock that volume.

So that's kind of an institutional feature so that you can have thousands of machines. They're all unlockable by a single private key. That also disables the personal recovery key in the UI, which will prevent your customers or the people you support from archiving their personal recovery keys with Apple, which you probably don't want, depending on your security policies.

So we have all this great encryption, lots of keys. As a bonus for this, we get sort of cryptographic wipe, a really fast, you know, I just want a split-second, wipe-out access to this disk. So I mentioned--so how are we gonna do this? We have multiple keys, and they're variable size, and they move around the disk, as it turns out. So what we do is we actually take all of your metadata, things like your user pictures and your usernames and even their password hints that we display at EFI Login. EFI Login is basically a slideshow.

It's kind of scary if you look inside. But the--but it's very clever, and it looks very beautiful. So we have all these nice, rich UI elements, and they're actually stored in the metadata of CoreStorage, and we want those things to be gone if you want to quickly erase your disk. You don't even want to know whose disk it was.

And so we basically encrypt all of that data, including your secrets that are encrypted with your password. So if I know my password, then I need the metadata to get to the volume key, right? There's the multi-step unlock. So we take all that metadata, and we encrypt it with the wipe key. So this wipe key sits actually on the disk. It's clear text on the disk.

And the idea is you can eliminate the wipe key, and then all you're left with is this encrypted data. So even if I know my password, I can't unlock my disk because the wipe key has been destroyed. Now, if it's SSD, when I destroy the wipe key, eh, who knows whether it's really destroyed, so I still recommend multiple pass secure erase afterwards. But you get about 80% of the value very quickly by destroying the wipe key.

Let's talk about key security. How do we make the password secure and then how do we also make the crypto that we do secure so that it is harder to attack? First of all, there's a couple types of attacks. There's password attacks. You take a dictionary and just encrypt every single word in the dictionary, hash every key in the dictionary, and you try to unlock this volume. Can I unlock? Can I unlock? Can I unlock?

There's also algorithm attacks. AES-128, AES-256. There's been some cryptographic research. And there's, you know, a few little -- they're kind of nibbling away at the edges of is it really -- you know, what's -- how secure is it? And it's still pretty secure. But they're definitely nibbling away at the edges.

If you reduce it a little bit, I'll talk about you can -- you may be able to attack it. Our goal in full disk encryption is to increase the cost of any attack against a particular volume and in particular to make sure that we don't have one attack that can apply to multiple volumes.

So first of all, let's talk about password attacks. These are pretty straightforward to understand. If I have a simple password like puppies, then it's pretty easy for you to try that password and maybe a few other common passwords and unlock my disk. So the worst case is if all of the passwords were hashed the same way, you guys could go out and make a huge lookup table of all the hashes of all the passwords and then you'd be able to figure out, oh, hey, look, here's all the hashes of all the passwords. I am going to now -- if I happen to find any file vault encrypted computers, I will say, oh, look, this hash matches this giant table. We definitely do not want that.

That's the worst case. So we're going to talk about how to solve that problem. We also have the general problem of just -- there's a lot of computers out there. I can go on EC2 and get myself like a supercomputer for a day and do a lot of password hashing. So we've got to make that expensive.

You've got to assume the attacker has everything. They've got your metadata with all the password encrypted keys and they've got your encrypted keys. So they're holding on to the encrypted data and then attacking the little metadata trying to -- once they break the metadata, then they can go and read your data.

And, you know, especially with cloud computing, that whole time, money, computing resources, who's your adversary, they may have a lot of ability to crack your system. So number one thing we do is do a lot of iterations. So it takes a while for any particular password to be hashed. We also salt every single password. We try to calibrate even over time.

Every time you change your password, we say on this computer, how fast can we -- let's make it cost at least a tenth of a second of real time to try to hash through this password. So that even if you do have EC2, you've got to -- it's going to take you, you know, a while as you clunk through trying all these passwords. Then, as I mentioned, we store a hash of the key encrypting key to see whether or not you got the value right.

So that's a little bit of a challenge. So we're going to try to do that. We're going to try to make sure that we get the value right. So that's another way, because the key encrypting key is random, that also makes this unique. And you're not going to be able to have a giant lookup table and say, oh, I found this computer and I know that if his password was puppies, all the computers that have puppies in their passwords are going to have the same hash. No. That's not how it works. Every key encrypting key is unique. Every salt of every password is unique. And we do a lot of iterations.

So your password disappears into random numbers, and that is all you have on your disk. It's unique to your user record, and if somebody wants to attack it, they can, and if you have a bad password, they will crack it. But if your password is decent, it's going to take them a while. That is our goal. We cannot eliminate attacks, but we can make them more expensive.
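Two of the defenses in this part of the talk, a unique salt per password and an iteration count calibrated to cost about a tenth of a second on the local machine, can be sketched as follows. The function names, the probe size, and the use of SHA-256 are assumptions made for the example, not details of the shipping implementation.

```python
import hashlib
import secrets
import time


def calibrate_iterations(target_seconds=0.1, probe=20_000):
    """Estimate how many PBKDF2 iterations take roughly `target_seconds` here."""
    start = time.perf_counter()
    hashlib.pbkdf2_hmac("sha256", b"probe", b"salt-for-timing", probe)
    elapsed = time.perf_counter() - start
    # Scale the probe linearly; keep a floor so a slow timer reading
    # never yields an absurdly small count.
    return max(1_000, int(probe * target_seconds / elapsed))


def hash_password(password, iterations):
    """Return (salt, digest); a fresh random salt makes every hash unique."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest


iterations = calibrate_iterations()
salt_a, hash_a = hash_password("puppies", iterations)
salt_b, hash_b = hash_password("puppies", iterations)
# Same password, different salts: a precomputed lookup table is useless.
print(hash_a != hash_b)
```

Recalibrating whenever the password changes lets the cost keep pace with faster hardware, which is the "calibrate over time" idea mentioned above.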

Let's talk about algorithm attacks. I mentioned, first of all, if you can actually instrument a machine where the bits are in the registers, you can kind of see when the power rails go high and stuff. Those are pretty cool. But you can also, if you say AES-128 has 10 rounds at a certain state of mixing and you take that down to nine rounds and you're like, oh, look, at nine rounds, it doesn't mix it fully and I can kind of figure out what the key is without having to guess every single possible key.

There's also attacks where you have two keys that are related in a certain way and you can do some math and kind of -- you don't know one of the keys but you know the other one and you do some math and then you figure out what the other key is. These first three are sort of -- you know, these are known problems but they're not practical attacks. They aren't things that people were going to -- you're not going to have related keys in the real world, for example.

And unless your AES implementation is truly broken, it's going to be doing the correct number of rounds. But they're very interesting for cryptographic applications. And you can see that from a cryptographic standpoint because once you've broken a related key attack, then you might think, oh, interesting. There's like a -- you know, because it's trying to do all this mixing in a way that's really hard to go backwards.

But if you find some mistakes, some of these other attacks can help you maybe extend out to the full AES. Known plaintext and chosen plaintext are somewhat more practical. Your HFS volume structures are pretty standard. So that's a known plaintext that's on every CoreStorage volume. So we want to make sure that that doesn't look similar across multiple volumes.

There's also big blocks of zeros and popular pictures on the Internet. All that kind of stuff is going to be on people's hard drives. Chosen plaintext, if I send you an email with an attachment, I can probably put a plaintext into your system and get it encrypted with your key.

And if that plaintext is like super magic -- I think there was a crypto system once where if you put the right plaintext in there, it would just -- it would basically come out as like a block of zeros because it was XORing itself or something. But anyway, you can insert known tricky plaintext.

And then you try to find them wherever they end up on the disk. And you can go and attack them that way. So we want to make that difficult. And the way we make that difficult is we've got to make the ciphertexts different every place they appear on your volume. So if you have 50 copies of something, like a block of zeros, they're different at every point.

And also, if you store the same text on different volumes, even if it's the same offset, you want it to be unique. So the way we do that is for within one volume for every block, we feed in the logical block address. So that will kind of change the crypto algorithm. We tweak it, feed it in as the initialization vector. And we say, okay, kind of start over here.

And then that will make the output look different based on the location within a single volume. And then across multiple volumes -- I already mentioned that we have a random volume key. So every volume is going to have a different key. We also feed in -- Deric talked about the mixing. You know, you take each -- because AES is only defined for 16 bytes. It's like this tiny little thing. And that's why they have AES-XTS.

AES 16 bytes at a time. And then XTS kind of swirls, mixes it all together across those blocks. So that even if you change a bit at the end of your block, it will actually have an effect at the beginning. And we can tweak XTS also so that its mixing algorithm is kind of offset within its abilities.

And we do that by feeding in a hash of the volume key plus the volume UUID just to kind of make it, again, provide as much uniqueness as we can to the encryption engine. Here's a picture. Plain text on top, ciphertext on bottom. In particular, there's a per block tweak. And then there is the mixing tweak that applies to the whole volume as well -- excuse me -- as the volume key.
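The effect of those tweaks, the same plaintext producing different ciphertext at every block address and on every volume, can be demonstrated with a stand-in cipher. To be clear about the assumptions: this is not AES-XTS. It is an HMAC-SHA256 keystream keyed by the volume key and fed the volume UUID and logical block address, chosen only because it runs with the standard library. A keystream XOR like this would be unsafe for a real disk, since rewriting a block reuses the keystream. The names keystream and toy_crypt are hypothetical.

```python
import hashlib
import hmac
import secrets
import uuid

BLOCK_SIZE = 512


def keystream(volume_key, volume_uuid, lba, length):
    """Pseudo-random bytes that depend on the key, the volume, and the block address."""
    out = b""
    counter = 0
    while len(out) < length:
        message = (volume_uuid.bytes
                   + lba.to_bytes(8, "little")
                   + counter.to_bytes(4, "little"))
        out += hmac.new(volume_key, message, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def toy_crypt(volume_key, volume_uuid, lba, block):
    # XOR with the keystream; the same call both encrypts and decrypts.
    stream = keystream(volume_key, volume_uuid, lba, len(block))
    return bytes(a ^ b for a, b in zip(block, stream))


zeros = bytes(BLOCK_SIZE)
key_a, uuid_a = secrets.token_bytes(16), uuid.uuid4()
key_b, uuid_b = secrets.token_bytes(16), uuid.uuid4()

# Same block of zeros at two addresses on one volume: different ciphertext.
print(toy_crypt(key_a, uuid_a, 0, zeros) != toy_crypt(key_a, uuid_a, 1, zeros))
# Same block, same address, different volume: different ciphertext.
print(toy_crypt(key_a, uuid_a, 0, zeros) != toy_crypt(key_b, uuid_b, 0, zeros))
```

An attacker who plants a known block of data therefore cannot spot it by looking for repeated ciphertext.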

Let's talk a little bit about the tools that we have to manipulate FileVault 2 volumes. So in particular, diskutil, many of you are familiar with this. There's a subverb called coreStorage, which you can abbreviate as cs. And you can now encrypt and decrypt in Mountain Lion existing volumes. So if you already have a volume turned on for whatever reason you've been playing around with diskutil, you can encrypt and decrypt an existing volume. You can also delete a volume, which I don't think was there before. In Lion, if you're not as familiar with this, there were create and createVolume to set up sort of a logical volume group, and then you can have logical volumes in it.

And also, you can take an existing HFS volume and convert it over to CoreStorage, optionally with a passphrase. That's one way to turn on FDE. Now, this way of turning on FDE is great for data disks, but here's the big thing that everybody's been waiting for, which is full FileVault 2 on the command line.

So we have this tool called... I can't take direct credit for this, but I did provide lots of advice. So using all the same libraries as the UI, and it actually has a few extra features that the UI does not have. So number one thing is it's got the recovery key, and then it also can do nice things like give you your status of how encrypted you are. So basically you type in your pass -- you enable it, type in your pass phrase, and then you reboot. So that will take you through the whole process.

If you power off after the first reboot, it will not take effect, et cetera, et cetera. So this is great if you want to command line automate turning on FDE FileVault 2. So we have some other examples here. You can add existing users. You know, I can add Deric, and I can add our -- add his boss, Brad.

But then later maybe I don't want Brad to be on my computer anymore. So I can actually delete him from the OS first, and then I tell fdesetup, hey, go check the OS records, and if there's anything on FDE that's not supposed to be there, get rid of it.

fdesetup itself doesn't delete anything, but it will synchronize and it will notice deletions. And then finally you can also turn off from the command line, which you could do before, but it's nice to have it all in one command. Also, super bonus, if you're running remote desktop, you can, in fact, set up a giant pipe command line, and it will accept XML and it will spit out XML, and you can go out and FileVault 2 that person's computer. Magic fdesetup command.

One more little tool. I want to talk about a problem that people have and actually more of a solution. It's a really cool thing. So data at rest. FileVault 2 is designed for data at rest. But what is data at rest? I know that data at rest is when my computer, like take my hard drive out. That data is at rest.

And that's going to be expensive to attack. But it turns out that most of us don't take our computers apart that often. And most of us don't turn our computers off that often. And unfortunately, RAM is still sitting there with the volume key in it. And so very clever attackers have figured out numerous ways to basically pop your RAM out and then... Suck out the volume key.

So sleep is not maximally secure. And we have this great feature on some of the newest portables called standby. And standby actually writes your RAM to a file on disk. But for usability reasons, we want to be able to open the lid and just go back right away. We actually stash a key so that we can unlock that Hibernate image and get you back your volume key into RAM.

So standby doesn't change anything about making sleep more secure. Also, standby doesn't always kick in. Standby, if you've got plugged in AC power, if you've got USB devices plugged in, standby doesn't kick in. But standby is pretty close to what you want. Because what you want is you want everything on disk encrypted in a way that you with your secret can decrypt it, but an attacker cannot. And with no secret stashed off in the special hardware to bring it back.

So we have the pmset command line. This is in the pmset man page. Don't worry. You don't have to write this down right now. But destroyfvkeyonstandby. That is not the default, but it will make your machine more secure. It will make you type your password twice.

Once to unlock your kind of hibernate reanimation process. And then again at your screen lock, which we always force to be on. hibernatemode 25 is also there. That will basically force you to go into hibernation right away every time you go to sleep. There's different options that you can pass. But conceptually, destroyfvkeyonstandby and force hibernation on every sleep. With those, every time you close your lid, you'll go into a highly secure mode where basically you have to take apart your computer and start attacking the keys to really get at it.
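For reference, the two power-management settings named here can be combined into one pmset invocation. The sketch below only assembles and prints the command rather than running it, since changing these settings requires root on a Mac; the helper name is invented for the example, and the choice of the -a flag (apply to all power sources) is an assumption about what you would want.

```python
import shlex


def pmset_secure_sleep_command():
    """Build the pmset command that destroys the FileVault key on standby
    and forces hibernation (mode 25) on every sleep."""
    return ["sudo", "pmset", "-a",
            "destroyfvkeyonstandby", "1",
            "hibernatemode", "25"]


# Print the command for review; run it by hand on the target machine.
print(shlex.join(pmset_secure_sleep_command()))
```

Check the pmset man page on your own OS version before applying it, since the available settings vary by hardware.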

Final slides here, requests and investigations. First of all, no promises. I'm just trying to make your bug reports better. People want authenticated reboot. They want a whole bunch of stuff with recovery keys. They want to be able to rekey. They want more OS X Server integration, although I will pass along to the people that did all that work for fdesetup that you guys are very happy so far. Just, I mean, happier than you were before. And finally, mandatory FileVault 2. A lot of people want to kind of force it to be turned on.

Okay, brings us to our little wrap-up slides here. Paul Danbold is our evangelist. There's a filesystem programming guide where you can get some of the latest on which APIs to use. And the developer forums are always there to help you help each other. We have some excellent sessions on responsiveness, getting things off the main thread. Tomorrow morning at 9 a.m., power management and energy-efficient software. Learn how to really optimize for battery life.

Before you all leave, performance and energy consumption go hand in hand. The filesystem is fast, but if you use it less, it's better for everybody. Use our APIs. We are always going to be making them better. And beware of unexpected delays. Don't do I/O on the main thread. Thank you all for coming.