Darwin • 1:05:54
At the core of Mac OS X is the Darwin Kernel, which provides basic services such as threads, scheduling, real-time support, synchronization, address space management, timers, and virtual memory. This session covers kernel services for both application and kernel extension developers.
Speaker: Jim Magee
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Good morning. Let's see if I got everything working here. All right, so we're here to talk about the Darwin Kernel. I hope you guys are all excited about it. It's tough to get excited at 9 o'clock in the morning, but it's better than the last session on Friday, like we had last year.
So what are we going to talk about today? First of all, we're talking about the Darwin Kernel and how that defines the core set of services inside of Mac OS X. Basically, everything in the kernel affects your code, whether you're using it directly or not. So it's important for everybody to understand how the kernel is affecting their work.
And also that while the kernel is fairly stable, it's still under constant refinement. Most of the time we're refining the implementations of things. But we're also talking about refining the semantics and syntax of certain areas of the kernel. And we want to give you a heads up on what those things are.
So what you're going to learn in this session is, you know, which services may be affecting your code and how they're affecting your code. And how I'm going to do that is to give a brief overview of how the kernel services play into the higher-level services in the system, and therefore how that may be affecting you.
Talk about some of the changes we have planned for Jaguar. Some of those are already on the CD you have. Some of those are still to come. Some may not make it, so this is all just a tentative list of some of the things we're looking at. And also looking at some of the things that are coming beyond Jaguar. So give you a heads up on some of the things we're looking at. Again, this is really a tentative list.
We want you to help us drive that list of things that are coming. Some of the things that we need to look at, we need to do. But we also want to give you a heads up on some of those things that we're already thinking about doing. All right, and this is the classic picture you've seen time and time again, the multi-layered view of Mac OS X with Darwin as the core.
and the traditional picture of the Darwin Kernel. It is composed of three basic components: the BSD kernel, the Mach kernel, and I/O Kit. But we tend not to... Sorry. Where did that come from? Well, the Mach code came from Mach 3.0, basically the OSFMK 7.3 code, if you were here last year. Nothing different. We have not picked up any new version of Mach. There really aren't new whole versions of Mach to pick up. But that defines the virtual memory, the scheduling, and the inter-process communication that happens in the system, at least the fundamental versions of those things.
Our BSD code was originally from BSD 4.4 Lite 2. That defines the process model, the scheduling, the security model in the system. Mach has some security features, but it's mostly security policy agnostic. So the BSD part of the system is what defines the security policy for the rest of the system. And it also defines the file management system. And the networking code came from FreeBSD as well. While we're not really talking about the networking code, it's important to kind of see how that all came together.
For a large part of the system, including the networking code and the NFS system, those have been refreshed in the Jaguar timeframe to come up to FreeBSD 4.4. But for the rest of the kernel, while we look at FreeBSD 4.4, what it defines as services and semantics and syntax is what's really important to us.
Our versions of those parts of the system are radically different in implementation than the FreeBSD version. So you can't just take those sections of the system whole cloth and bring them forward to the current-day Jaguar system or the current-day Darwin kernel. So we've picked through and looked at certain aspects of that code and decided what parts we like and what parts we don't, and how we can bring those in. But we cannot possibly take new refreshes of those whole-cloth hunks of code.
[Transcript missing]
So, but we don't really like to look at the system as Mach, BSD, I/O Kit, especially the Mach/BSD differentiation. We have a unified kernel. We're not really that interested in experimenting and showing to the world that you can implement BSD as a user-level service. It's not at all interesting to us. What is interesting is the silos of service, the vertical services that are provided: the process management, the threading, right, and Mach and BSD provide services in each of those areas. Combined, they make up what we have in the kernel as services.
And when you look at that in the services area, what's new for Jaguar? Well, we've had a lot of performance enhancements in Jaguar. That's probably our primary goal, performance and stability. And so a lot of the implementations have been refined on a lot of things throughout the kernel.
But one of the things that we look at all the time for performance is not just to make this call run 10% faster or that call run 5% faster, but to look at the higher-level services and the higher-level applications and try to define new services or change the semantics of some of the existing services to let them run faster.
And so providing better matching of the services to the higher level is something that we're really always watching out for, and there's a lot of that in Jaguar.
[Transcript missing]
And just reams of new bug fixes and reams of new features. Just in the pure kernel part, not counting the file system implementations or the networking part, but the pure part that I showed earlier, there's been over 600 enhancements since Puma, since 10.1 released.
But where are we going? More and more performance enhancements. This is one of our primary goals, something you'll see us doing over and over and over again. Again, the same two areas. Implementation of existing services and working our way towards new and better services. Yet more standards adherence.
There are still some things in POSIX that we don't support, and we're constantly striving to have an essentially POSIX-compliant system. And the Single UNIX Specification. That's been moving along faster and a little bit further out in front of just standard POSIX. And we've been looking at tracking those things as well, and we're going to be implementing a lot of the things that are there.
But one of the things that is really new coming forward is this notion of rigorously defined KEXT services. In I/O Kit land, we've had a strong and fairly rigid set of things you can do when you develop an I/O Kit KEXT. We do not have those same kinds of systems in place when you're writing network KEXTs or file system KEXTs.
And basically what people have done is adopt the BSD model, which is you reach in and you grab whatever you need and write a KEXT that accesses all the internal data structures, knows about all the function calls and every little bit of the internals, and build KEXTs out of that.
And in essence, every release of the system, you've got to re-roll that KEXT and get a new KEXT, because the old one is not going to run on the new system. Well, that doesn't really match very well with Mac OS X as a commercial product, because obviously people don't like to have to go out and buy or get an upgrade version of their file system every time Apple decides to roll out an update release. And so we're looking at, now that we've gotten the higher-level services, the ones towards user-level applications, fairly well complete, we're now looking internally and trying to define those services so we can get better release-to-release compatibility for your KEXTs.
And of course, just constantly new features and other bug fixes. So when we look at the services provided by the kernel, We need to go through and kind of describe to you what it is that each of these services does and what we've changed in them. So in the process world, everything starts with a Mach task. A Mach task is the unit of ownership inside of Mach.
All the resources that are Mach-related are defined and held in a task. So in essence, it provides an environment in which threads run. So you've got an address space, a set of communication rights to communicate to other parts of the system, and exception handling to handle faults or whatever inside of that space.
And a Mach task is represented by a data structure called a task_t. It's available to you when you're programming at higher levels, but each BSD process actually contains a Mach task. So you have a BSD process. It's got a Mach task. It uses the Mach task in certain ways. When you try to access and manipulate the Mach task directly, you have to be careful not to step on or interfere with some of the things that BSD does.
We try to solve some of that for you, like we don't let you terminate a task out from underneath a process. You have to come through and actually, if it's a BSD process, you have to come through an exit call in the BSD level to get the process to go away or shoot it with a signal that's got a fatal handler. So we've done some of that work for you, but obviously there's some other things.
If you suspend a task out from underneath of a BSD process, well, you're going to get behavior that BSD is not expecting, and it may or may not provide the capability you're looking for. So each BSD process contains a Mach task. You're normally going to program at the BSD process level for most things.
And it adds a file management part to the process; each process has its own set of file descriptors. It adds signal handling on top of the Mach level. We had exception handling at the task level; BSD is an exception handling client of Mach and turns those exceptions around and makes them signals.
So there's an interesting mix when you interact with the signals at the BSD level and the exception handling at the Mach level, and it's something you always have to be careful of. And it provides process naming, which a Mach task doesn't have. You can have umpteen Mach tasks in the system. There is no namespace for the Mach tasks. They're just out there. You need to know them before you can even find them. And it provides a process group relationship.
And it's normally represented through the PID. Now, normally you're working along in a process and you're, you know, doing standard BSD things: fork, exec, all of those. But there's this one call that basically opens Pandora's box for you, and it's called task_for_pid. It's a call just in Darwin. It's not new, it's been there from the beginning, but it's only in Darwin.
And it basically allows you to get at the underlying task for certain PIDs and manipulate them. And while that's, you know, a really powerful feature, because Mach has all these third-party task manipulation facilities where most of the BSD stuff is within my process, it also has problems, and we'll talk about some of that later. Okay, but most of you are not even programming to BSD processes. You're programming to a Mac OS X application.
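As a rough illustration of what task_for_pid opens up, here is a minimal sketch in C that fetches another process's task port and reads some statistics from it without controlling it. The helper name print_task_suspend_count is illustrative, not part of any API, and the call is only permitted with appropriate privilege.

```c
#include <mach/mach.h>
#include <mach/mach_error.h>
#include <mach/mach_traps.h>
#include <unistd.h>
#include <stdio.h>

/* Illustrative helper: look at (not control) another process's Mach task.
 * task_for_pid() hands back the task port for a PID; with it you can query
 * the task or, with much more care, manipulate it. */
int print_task_suspend_count(pid_t pid)
{
    mach_port_t task;
    kern_return_t kr = task_for_pid(mach_task_self(), pid, &task);
    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "task_for_pid failed: %s\n", mach_error_string(kr));
        return -1;
    }

    struct task_basic_info info;
    mach_msg_type_number_t count = TASK_BASIC_INFO_COUNT;
    kr = task_info(task, TASK_BASIC_INFO, (task_info_t)&info, &count);
    if (kr == KERN_SUCCESS)
        printf("pid %d: suspend count %d, resident size %lu bytes\n",
               pid, (int)info.suspend_count, (unsigned long)info.resident_size);

    mach_port_deallocate(mach_task_self(), task);  /* drop our send right */
    return kr == KERN_SUCCESS ? 0 : -1;
}
```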
All right, and BSD processes have process groups, but there's an additional notion inside of a Mac OS X application, this notion of a session, and services that are provided by the core services of the system. You have pasteboard services, and you have the clipboard, and you have input services and Apple events and all of those things that add on to what you get in a standard BSD process.
And those are managed in a scope called a session, basically the login session, if you will. And that adds additional burdens and additional interactions to some of the things that you might do at the BSD level. And we'll talk about some of that in the security part of this. And it provides additional and different ways to get things launched.
In typical fashion, you would use Launch Services or just use the Dock or Finder or whatever to get applications launched at the Mac OS X level. They all go through the launch services of the system. And there is an interaction issue, right? Because most of these services that are in that session are represented by Mach ports and are per session.
And fork at the Unix level does not inherit any of the Mach ports that are associated with a task. It just basically creates a new task: you get the memory space, but not the ports. You can end up with an issue if you're in a Mac OS X application and you call fork.
You can end up in a situation where the address space of the new application that you got by forking thinks that there are certain services available to it that it had already established connections to, but they're not there because the ports that back those aren't there. So, again, this is, you know, one of those areas where you have to be careful when dealing at the higher levels. Don't dip down to the BSD or the Mach level too much. And if you do, be careful about some of these interaction issues.
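A minimal sketch of the safer pattern implied here: if a Mac OS X application uses fork at all, the child should exec another program immediately and avoid touching any framework that relies on per-session Mach ports. The helper name run_tool is illustrative.

```c
#include <unistd.h>
#include <stdio.h>
#include <sys/wait.h>

/* Sketch: fork, then go straight to exec.  The child must not call into
 * frameworks that depend on Mach ports inherited from the parent
 * (pasteboard, Apple events, window server connections); those ports do
 * not survive fork(), only the address space does. */
static int run_tool(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execv(path, argv);        /* child: exec immediately */
        _exit(127);               /* exec failed */
    }

    int status;
    waitpid(pid, &status, 0);     /* parent waits for the tool */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```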
So what's new in Jaguar in this space? All right, well, performance, performance, performance, that's our main goal. But process creation and process exit, there's been a lot of work to make those even faster than they were before. One of the big areas: now that in the BSD world signals are part of the process space, we now have much more adherent POSIX signal semantics. And so we have all the sigaction options that we were missing in 10.1.
But where are we going? Well, performance boost, performance boost, performance boost. That's basically where we're going. One of the things beyond just trying to make what we do better is trying to make our services match the higher-level services better. Launch Services is a spawn kind of facility. It allows applications to just be created. It's not make a copy of myself and then turn myself into something. It's just create. And so we're looking at things like POSIX spawn or other variants of the same thing to provide facilities for the higher levels that match their semantics better.
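For context, here is what the posix_spawn semantics look like in standard POSIX C; this interface was a future direction at the time of this talk, not something shipping in the system, and the helper name spawn_example is illustrative.

```c
#include <spawn.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

extern char **environ;

/* Sketch of the "just create" semantics: start a new process running a
 * program directly, rather than copying the caller with fork() and then
 * replacing the copy with exec(). */
int spawn_example(void)
{
    pid_t pid;
    char *argv[] = { "/bin/echo", "hello from a spawned process", NULL };

    int err = posix_spawn(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawn failed: error %d\n", err);
        return -1;
    }

    int status;
    waitpid(pid, &status, 0);   /* reap the child */
    return 0;
}
```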
And we're looking at, you know, even more of the POSIX signal stuff. POSIX now defines 128 different signal levels that you can send. There's definitely some interest from some of the people. I think if you have interest in that area, you ought to bring that forward. But POSIX also specified in their real-time versions a real-time signal delivery mechanism.
We've looked at that, and we've decided, well, you know, we like the 128. That sounds pretty good. But the real-time delivery kind of conflicts with some of the real-time behaviors that we have for other things, and in particular, conflict with some of the run loop support and things like that. And so we're probably not going to delve into that world.
Alright, so now we have processes. Processes are basically an area in which threads run. So let's talk about threads. Well, every thread in the system is a Mach thread, pretty much. There are some things that people call threads at some of the higher level that aren't fully Mach threads, but everything you're going to typically deal with is a Mach thread.
A Mach thread is the unit of execution in the system. It's what the scheduler schedules. It owns no resources of its own. It is simply a state. It describes an execution state within a set of resources that's a task or a process. It defines the how and the where of execution, but not what to execute with.
There's thread-specific exception handling at the Mach level. So you can grab exception handlers. You know, you can register exception handlers at the per-thread level and do certain things. A lot of you may be tool writers, and this has been something that people have been doing is grabbing the Mach exceptions basically before BSD gets them so that they don't become signals and kill the process out from underneath of your tools.
And thread_t is the basic data structure that's defined there. And again, you can create and terminate. Again, one of the generic differences between BSD services and Mach services is that Mach services are always third-party enabled. So any process can reach out and do all the Mach services to some other entity. They don't have to just do it to themselves.
And that provides a real carrot for people to delve into the Mach services because you can reach over and manipulate this other address space, or you can reach over and stop this other thread without the thread's consent and doing things like that if you have the privileged port that lets you do it.
So people have been striving to find ways to get those privileged ports so that they can do this. We urge you to be careful. When you manipulate something out from underneath of itself without its consent, it's not going to work. And often you're going to get in trouble with the higher-level services that are depending upon a known state in that thing. But one of the things you can do and is fairly well-defined is the exception handling.
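A sketch, under stated assumptions, of the exception handling hook tool writers use: registering an exception port on a task so that faults arrive as Mach exception messages before BSD converts them into signals. Allocating handler_port and running the message loop that services it are left out; install_exception_port is an illustrative name.

```c
#include <mach/mach.h>
#include <mach/exception_types.h>

/* Sketch: claim the common fault exceptions for a task so they are
 * delivered to our port (as Mach exception messages) instead of being
 * turned into signals that would kill the process under our tool. */
kern_return_t install_exception_port(task_t target, mach_port_t handler_port)
{
    return task_set_exception_ports(target,
                                    EXC_MASK_BAD_ACCESS |
                                    EXC_MASK_BAD_INSTRUCTION |
                                    EXC_MASK_ARITHMETIC,
                                    handler_port,
                                    EXCEPTION_DEFAULT,
                                    THREAD_STATE_NONE);
}
```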
All right, Pthreads. It's our portable layer of threading in the system. You can pretty much create a Pthread at any level of our system and any application space and get fairly well-defined behavior. Not only do they provide, you know, an easier way to create and set them up than the Mach threads, because Mach threads have no resources and Pthreads have resources. So you can just tell Pthreads, create one, and it creates the resources for you, rather than you having to do that manually.
It also provides synchronization; the Mach threads don't have any native form of synchronization inside a process. They can simply use global synchronization services, but they don't have anything inherent to the process. And Pthreads adds, you know, synchronizers in the form of mutexes and conditions. You've got thread-specific resources as well. You've got a stack, and you've got per-thread data.
And basically, you can do the standard Pthreads things with those. At the higher levels, I don't want to spend a whole lot of time on this, but each MP task or Carbon thread has essentially a Pthread underneath, and that has a Mach thread underneath of it. But there are some situations, like deferred tasks in the I/O space in Carbon and others, where certain behaviors are dependent upon those threads, and they may be multiplexed, so you may have deferred work. There may be one worker thread doing all the deferred work inside of a Carbon application, so you've got to be careful, especially in those environments, not to do something to the thread that will affect other things that depend upon using that same thread.
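A minimal Pthreads sketch of the services just described: creating threads whose resources are set up for you, and guarding shared state with a mutex.

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

/* Each worker increments shared state under the mutex. */
static void *worker(void *arg)
{
    pthread_mutex_lock(&counter_lock);
    counter++;
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);  /* stack etc. created for us */
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);

    printf("counter = %d\n", counter);  /* expect 4 */
    return 0;
}
```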
What's new in threading in Jaguar? Performance. Again, faster creation, faster termination of threads. So you can create almost two times as many threads per second in Jaguar as you could in 10.1. Pthread synchronizers: well, we now have read-write locks at the Pthread level in Jaguar. We have recursive mutexes, or we will be having recursive mutexes in the near future. And we have, or will be having, improved Pthread cancel semantics as well, more adherent to the standards. But a big thing we have that a lot of people have asked for is per-thread signals. And so that's in Jaguar, on the CD you've got. Thank you.
Where are we going? Again, better adherence to the standards for Pthreads. There are certain small APIs that we're not adherent to at this point. Most of the big things I think we've got covered at this point, but we've got some things more to do. Another thing we're doing is at the Mach level: there has never been, at the thread and task level, the notion of the ability to look at something without being able to manipulate it.
And because of that task_for_pid and the ability to reach out and grab third-party threads, a lot of you are using those things to just gather statistics or to look at their state, but not to control them. And so we've been looking at, and are probably going to be adding in the future, the ability to separate those semantics of access without control.
But all this just leads to the scheduler. Threads are uninteresting if they don't get scheduled. So a lot of you have struggled with this particular area as well. Well, what is all of the scheduling information inside of Darwin? How do you do it? How does it affect the higher-level things? Basically, we've got a banded priority space inside of Darwin, and in particular, Mac OS X, where most of the application threads run.
If you just create something that runs in the background off of a terminal or whatever, you're going to end up in this priority space, 0 to 31, where things just get time-shared in and out in that space, which means their priorities adjust up and down based on the amount of CPU they're using.
And so we can kind of be fair at that level. But we've added, you know, the ability to have these time-constrained threads, which are threads that run under very strict constraints for real time. You obviously can't allow a real-time thread to just kind of adjust and be fairly scheduled in the system. It has to respond to very real-time events in the outside world under time constraints that obviously that standard kind of adjusting time-share policy wouldn't allow. And so we've added the ability to create time-constrained threads.
You specify the constraints to the kernel, you know, the amount of time it takes for you to execute and the period at which you have to respond, and we'll compute a priority for you. Right now we do that very simply. We assign them all the same priority, but in the future we're actually going to be using that.
So be careful. But anybody can create a time-constrained thread. You can just say, create me one. And the reason you can do that is they're watched. We're not going to allow a time-constrained thread to take over the system and not give it back. It can take it over for a little while, but it'll eventually give it back because we watch it. There is no portable interface to that. There's no Pthreads or standard POSIX way to create one of these threads. That's an invention strictly in Darwin.
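A sketch of the Darwin-only interface for marking a thread time-constrained, using thread_policy_set with the time constraint policy. The 10 ms period, 2 ms computation, and 5 ms constraint below are placeholder numbers; as the speaker says, don't fake these values, since the kernel will eventually use them.

```c
#include <mach/mach.h>
#include <mach/thread_policy.h>
#include <mach/mach_time.h>
#include <stdint.h>

/* Sketch: mark the calling thread as time-constrained.  All times are in
 * Mach absolute time units; the numbers here are illustrative only. */
kern_return_t make_me_time_constrained(void)
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);
    double abs_per_ns = (double)tb.denom / (double)tb.numer;  /* ns -> abs units */

    thread_time_constraint_policy_data_t policy;
    policy.period      = (uint32_t)(10 * 1000 * 1000 * abs_per_ns); /* run every ~10 ms */
    policy.computation = (uint32_t)( 2 * 1000 * 1000 * abs_per_ns); /* need ~2 ms of CPU */
    policy.constraint  = (uint32_t)( 5 * 1000 * 1000 * abs_per_ns); /* done within ~5 ms */
    policy.preemptible = TRUE;

    mach_port_t self = mach_thread_self();
    kern_return_t kr = thread_policy_set(self,
                                         THREAD_TIME_CONSTRAINT_POLICY,
                                         (thread_policy_t)&policy,
                                         THREAD_TIME_CONSTRAINT_POLICY_COUNT);
    mach_port_deallocate(mach_task_self(), self);  /* release the self port */
    return kr;
}
```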
And we've also got a band of priorities for GUI application threads. The Windows Server and other things help assign a set of priorities to anything that responds to GUI events, such that, you know, you get quick response to clicking throughout the system. And those tend to fall in that range from 63 to 31. And then we have all these other magic priority ranges, basically where the kernel runs and where the core services run.
One of the points people struggle with in all of this is: how do I get in that space? And hopefully I can cover that in a little bit. But you can also create fixed-priority threads in the system. There's a problem with those, and that is that once you create a thread fixed, it can take over the system, or anything below it, and deny anything at a lower priority service throughout the system. And so that operation is privileged.
Basically, the Mach scheduler assigns threads, runnable threads, to processors, and we do that symmetrically on a multiprocessor system. So we will run threads on any available processor. But it's tuned for affinity to try and run the same thread on the same processor to keep the same cache set and other things.
It's a fully preemptive scheduler, which means that whenever a higher priority thread gets set running in the system, it will preempt immediately any user level thread that is available at a lower priority that's taking a CPU. And it'll also preempt threads that are running in the kernel, but only if the thread waking up is marked as a real-time thread.
BSD adds some things on top of that, and mostly in the realm of Pthreads, the ability to set via Pthreads the scheduling attributes. But one of the problems with that model is that it assumes, basically the POSIX model assumes a system-wide view of things, that you understand the priority of everything else in the system, and you know exactly where you want to go, right? And you can assign your priority.
Well, in our system, that's almost impossible to figure out, because there's applications written by many developers running on the system at the same time, so you can't possibly do that. So we allow the Pthread attributes to be used, but we restrict it to that range that's basically for timeshare threads, so you can adjust your priorities using Pthreads within that realm, but you can't use that facility to get yourself up into some of the higher levels of the system.
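A small sketch of adjusting priority through the portable Pthread attributes; as described above, on this system it only moves a thread within the timeshare band. The helper name and the +5 nudge are illustrative.

```c
#include <pthread.h>
#include <sched.h>

static void *quiet_worker(void *arg) { return NULL; }

/* Sketch: nudge a new thread's priority up within the timeshare band.
 * This will not get you into the higher, privileged priority ranges. */
int create_higher_priority_worker(void)
{
    pthread_attr_t attr;
    struct sched_param param;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_getschedparam(&attr, &param);
    param.sched_priority += 5;                 /* small adjustment only */
    pthread_attr_setschedparam(&attr, &param);

    int err = pthread_create(&tid, &attr, quiet_worker, NULL);
    pthread_attr_destroy(&attr);
    if (err == 0)
        pthread_detach(tid);
    return err;
}
```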
[Transcript missing]
And again, in each environment, we allow for these time-critical threads, but we have this problem with trying to
[Transcript missing]
So we have some new things in Jaguar in this area. We have better SMP support, so we will now immediately set a higher priority thread running on the other processor, preempting what was already running there. It used to be that would take a while. So for some of you that have real-time apps that need the power of both processors in real time, this actually provides now the ability to do that.
It also provides stronger affinity than what used to be in the scheduler, so that we will work harder to keep the same threads running on the same processor. But obviously we're not going to let that override the real-time nature of things.
Future directions. Again, enhanced priority inheritance. We really are looking for a way to make sure that we can drive threads through to completion, even though lower-priority threads have got something locked in the kernel that you need, or locked at user space that you need. And in the future, we're looking at making that the default in the system. We want real-time behavior out of the system without programmers having to program everything for real time.
Right now, almost every queuing mechanism in the system is FIFO, which means you can get a whole slew of low-priority threads queued up on something. A high-priority thread comes along and wants it, and it's going to have to wait in turn behind all those low-priority threads, except for the processors, to get that resource. And so while we have priority queuing on the processors, we're looking at making more widespread use of priority queuing throughout the system to, again, give much better real-time response.
And one of the things we're looking at, and this is really kind of up in the air, is a producer-consumer type queuing mechanism that will allow you to have threads that either feed or pull the data off of one of these time-constrained threads in such a way that you don't have to manually, black-art style, set its priority, and yet it will adjust its priority based on its ability to keep up with the real-time threads, to try and help us help you get over this problem without basically locking in the threads. And in the future, we will be, again, doing real-time critical priorities. So don't fake out the time constraints that you provide to the time-constraint setting mechanism right now, because eventually those will be used for something.
The kernel also provides a bunch of file mechanisms. Mach has no notion of what a file is. It has no concept of file systems. But it has a concept of memory objects, and it manages all the cached data in the system. It manages the physical pages in the system, and it manages the logical assignment of pages to things called objects.
It will fill all of memory with cache data, unlike a lot of systems that will only fill 10% of the memory with cache data from files, or 20% or 15%. It will just fill up all of memory with cache data. This actually is a very nice thing for a lot of your applications because you have very large data sets in some of these, and restricting you to only small parts of the memory for caching tends to be a performance problem on some systems.
But it also has its drawbacks, and so you have to be very careful how you manage your data such that you don't swamp the rest of the system and push everything else out. All the backing store for all these pages is managed by pagers. And the biggest pager in our system, or the most common pager, is actually the file system pager that's built into BSD. But on top of that, we also have a pager that swaps out, to swap files, pages that are just dynamically allocated in the system. And that's called the default pager.
And how all this works is that these memory objects then get mapped into parts of your address space. So it gives you access to these objects. If you're not actually mapping the file, but you're using the file system in BSD, each one of these objects can represent, and often does represent, a single VNode or a single file on the system.
And so we support traditional read-write semantics through files: essentially the file system inside of BSD uses the Mach objects and the Mach cache to provide the data into, and store the cached data for, each of the files, and we do that through a thing called the unified buffer cache. And that's why you get this ability, because they're all just memory objects in the system, the ability to just fill up memory with all of your caching.
But it also supports the mmap semantics inside of POSIX by essentially directly letting you map those objects into your address space. This is what dyld and the loader do in the system to just give you your task image. But it's also what you can do manually to get at files. And it has clustering support to do more efficient loading of that data, and writing of that data back to disk, with a smaller number of I/Os.
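A minimal sketch of that mmap path: the file's memory object is mapped directly into the address space, and pages fault in from the pager on demand. The helper name sum_bytes_of_file is illustrative.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: map a file and walk it as ordinary memory instead of read(). */
int sum_bytes_of_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    fstat(fd, &st);

    unsigned char *data = mmap(NULL, (size_t)st.st_size,
                               PROT_READ, MAP_FILE | MAP_PRIVATE, fd, 0);
    close(fd);                          /* mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += data[i];                 /* pages fault in from the pager on demand */

    munmap(data, (size_t)st.st_size);
    printf("%s: byte sum %ld\n", path, sum);
    return 0;
}
```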
And at the application level, even higher above, you have asynchronous support. Well, the BSD file system is all synchronous. All the semantics there are synchronous. So you end up having at the higher levels worker threads that help you do asynchronous semantics. They emulate the asynchronous behavior. There's also the ability to access additional data from traditional file systems. You can access the resource forks and things like that, which are all... Oh, I'm sorry. Wrong one.
Because at the BSD level, you don't have a notion of separate seek pointers on a file. You have a file that has a current seek pointer, but the higher-level application semantics, both in Carbon and in Cocoa, had the ability to give you access to ranges of files, and you can have multiple threads coming in and trying to simultaneously access a file.
You needed to lock things out at user space and arbitrate over the control of where the current seek pointer is for a file to try and keep threads from fighting over it and moving it out from underneath of each other. And that was all handled at the higher levels. And then you had the legacy emulation for things that some of our file systems just can't support, some of the resource forks and catalog search kind of things.
[Transcript missing]
And one of the things that comes along with that is, if you are a file system writer, we've noticed, at least in several of the file systems that people have written so far, that they were not expecting these larger page lists, these larger clusters, to be coming through, and it confused them a little bit in their implementation, so you might want to watch out for that.
But one of the things we've also added is the POSIX pread and pwrite semantics, or APIs, and that now gets us away from that locked file seek pointer problem that I talked about earlier. You no longer have a single current offset in a file. These calls take the offset as part of the parameters, right? And so now multiple threads can issue I/Os to the same file simultaneously, to different parts of the file, and the system will deal with that just fine.
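A small sketch of those pread semantics: each call carries its own offset, so multiple threads can read different ranges of one open file without arbitrating over a shared seek pointer. The chunk_job structure is illustrative.

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <pthread.h>

/* Each thread gets its own offset and length into the shared fd. */
struct chunk_job {
    int    fd;
    off_t  offset;
    size_t length;
    char  *buffer;
};

/* Thread start routine: read one chunk with no lseek() and no lock. */
static void *read_chunk(void *arg)
{
    struct chunk_job *job = arg;
    ssize_t n = pread(job->fd, job->buffer, job->length, job->offset);
    return (void *)(intptr_t)n;   /* bytes read, or -1 on error */
}
```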
Where are we going? Obviously, caching is important. Performance going to the disk is important, so our clustering and caching is something that we're gonna be working on for quite a while now. This is one of the biggest areas that you notice performance. And we're headed towards true async file I/O so that we can get rid of those worker threads at user space and do everything directly through the kernel async support.
[Transcript missing]
But again, the file system is only as good as the virtual memory system, because it's housed on the virtual memory system. All right, and Mach basically creates this very complex view of the world. You have objects mapped into the address spaces. The objects are backed by files which are implemented by pagers inside the BSD file system, which use clustering to get to the disk to do their work. And Mach basically is the arbiter of all of that.
All right, it allows every address space to be laid out uniquely and in a pattern of its own choosing, mapping objects and creating anonymous memory however it likes. It provides protected address spaces, so that what I have mapped into my address space is not available to everyone else. So I can map that file without worrying about someone who doesn't have access to that file getting at the data.
It has these copy-on-write optimizations that are well published inside of it for mapping. It's mostly good for mapping copies of text and copies of data from files into address spaces in such a way that as you-- for text, you typically don't touch it at all. But you may when a debugger comes along, and you would really hate to have a debugger write a break point into a mapped file and actually have that show up on the file.
That would be very, very bad. And so you essentially, whenever an executable is mapped into the system, it's mapped in copy-on-write just in case somebody's gonna come along and touch it. And so from an optimization point of view, that is dramatically better than going ahead and pulling in, you know, unique copies of those things for every instance that's running.
And it allows for controlled sharing at various levels, single-page level, sharing ranges of a mapped file, to sharing complex regions. And in fact, we do that in the system. As you noticed, we have a lot of frameworks inside of our system that most applications are using all of them, or a large portion of them. Just the overhead of mapping in 127 or 100 and something frameworks into every application is part of launching.
Right? Can be a fairly onerous task. And so we optimize that for you. We have a shared region in the system. This is optional use, but the linker currently at the higher level, DYLD, uses it fairly heavily. That allows you to have global regions that are mapped into everybody. And that's where he maps in all of the frameworks. So that everybody gets a shared mapping of that.
And the way that works is, you come along, and let's just look at the global part. Again, behind every mapping is an object, and behind the object is a pager memory object, and that may go to the disk, right? But you take a fault in one of those areas. Well, normally, every application, when it takes a fault, would query the VM system, which would handle the fault and say, okay, what file is mapped there? Here it realizes that it's a global region that's mapped there, and it sends the request on to the global region, which forwards it on to the object, which creates a mapping for a temporary file, basically a placeholder.
But unlike normal, this actually shows up in every address space at the same time because it's a true shared mapping. So everybody gets the placeholder. One guy's driving the fault, but when the data is finally supplied by the pager, it shows up everywhere and it's a very, very good example of how to do this. Typically, if I have a file and you have the same file mapped, when I take a fault, I would get a translation for my page that lets me access that, but you don't.
And so when you access it, you would have to go to the VM system. It says, oh, the page is in cache. That's no problem. Here's a translation. But you still have to take a fault, right, for it to find it and provide a translation unique to you. Because these global regions are shared, the translations are shared, so any application faulting in a page from Core Foundation or from the System framework actually makes that page immediately available to everybody, so they don't even have to take the faults.
So where are those areas located in the system? Well, in 10.1, there were two 256-megabyte regions, where they sat at 7 bazillion and 8 bazillion, respectively. All right. And these map the system-wide frameworks, right, and gave you that multiple level of efficiency. But they kind of got in your way, didn't they? If you needed to allocate large portions of virtually contiguous memory, we didn't leave you very many big open spaces with the way this was laid out.
So in Jaguar, we're moving them. Right now, they're moved up a little bit so that you now have just under 2 gigabytes of available space to you, maximum, you know, between your heaps and these spaces. We've also dramatically improved the ability of the system to determine what you're using in your space. So we have task working sets.
They were in 10.1, but they've been enhanced in Jaguar to do a better job of, if you touch one page, knowing what pages you touched around that same page last time, and getting them all in at the same time so that you don't have to take a fault on every page you access if you're needing to bring pages in. So when you get to the point where you launch the N plus 1 application, where N fits in memory and N plus 1 doesn't, this dramatically improves your ability to switch back and forth between those applications.
And also there's preheat, so that we know when you've done this once on an application, that when you come along and launch that application again, we actually don't have to build up the working set, the initial working set. It's already there and remembered from the last time, so we'll go ahead and use that to bring in. So second and third and fourth launches are dramatically faster than the first ever for an application.
[Transcript missing]
Okay, so now we've got this task. We're running. We're doing all kinds of fun stuff, but we're in a little silo. We can't interact with anything else in the system. So we've got to have inter-process communication in order for everything to work. A lot of the system depends upon Mach's basic inter-process communication primitive, which is the Mach port. Sorry. They're the endpoints of communication within Mach. They allow you to send messages back and forth on message queues, or semaphores or lock sets, basically synchronization at those levels.
A Mach port is actually-- you don't access a Mach port directly. That's the endpoint. What you have is a right to access that. And these are represented by send and receive rights and send-once rights inside of Mach that allow you to communicate back and forth to various services. And in Mach itself, almost every object in the system, a task, a thread, the-- you name it, VM objects, those kinds of things.
They're all represented by ports. So in order to manipulate them, you have to have a right to the port to send a message to the port to say, change yourself. And that's how you get that third-party, anybody can change anybody else's behavior. And these messages carry data, they carry port rights, and they also carry security information to identify who the sender is.
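A minimal, self-contained sketch of the Mach IPC primitives just described: allocate a receive right, make a send right for it, and round-trip a small message through the port's queue. Real services hand these rights to other tasks, which is what makes Mach operations third-party capable; the message layout here is illustrative.

```c
#include <mach/mach.h>
#include <stdio.h>
#include <string.h>

struct simple_msg {
    mach_msg_header_t header;
    char              payload[32];
};

struct simple_msg_rcv {
    struct simple_msg  msg;
    mach_msg_trailer_t trailer;   /* the kernel appends a trailer on receive */
};

int mach_port_roundtrip(void)
{
    mach_port_t port;
    kern_return_t kr;

    /* A receive right names the queue; a send right lets us enqueue to it. */
    kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);
    if (kr != KERN_SUCCESS)
        return -1;
    mach_port_insert_right(mach_task_self(), port, port, MACH_MSG_TYPE_MAKE_SEND);

    struct simple_msg send_msg;
    memset(&send_msg, 0, sizeof(send_msg));
    send_msg.header.msgh_bits        = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
    send_msg.header.msgh_size        = sizeof(send_msg);
    send_msg.header.msgh_remote_port = port;
    send_msg.header.msgh_local_port  = MACH_PORT_NULL;
    strcpy(send_msg.payload, "hello");

    kr = mach_msg(&send_msg.header, MACH_SEND_MSG, sizeof(send_msg), 0,
                  MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    if (kr != KERN_SUCCESS)
        return -1;

    struct simple_msg_rcv rcv;
    kr = mach_msg(&rcv.msg.header, MACH_RCV_MSG, 0, sizeof(rcv), port,
                  MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    if (kr == KERN_SUCCESS)
        printf("received: %s\n", rcv.msg.payload);
    return kr == KERN_SUCCESS ? 0 : -1;
}
```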
And BSD adds, you know, typical BSD IPC services on top of that: sockets and pipes, POSIX semaphores, both named and unnamed. And Pthread synchronizers, which have options to be inter-process communication as well. Right now, those are all implemented on top of Mach ports in such a way that it is very difficult for us to export them across. So right now, all the Pthread synchronizers are only intra-process, so you can only use them within a process. All the options to go cross-process are disabled.
And then you have these higher-level concepts of IPC, like the CFRunLoop, right? And the CFRunLoop is all built on the Mach side of things, right? So you have a port set, right, which is a collection of ports. Each port represents an object or an event source in the system.
And so you sit and wait for a Mach port event to occur. Well, a lot of the driving events, a lot of the things that drive these events, are not Mach port based. And so the CF code and CFNetwork and CFSocket and those things reflect events from those into the port set, right, via worker threads or via callouts or whatever it can work out to do that. Most often it's a thread, right, to drive it into the run loop so that you can see it in your run loop.
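A sketch of how a Mach port shows up as a CFRunLoop event source: a CFMachPort wraps the port, its run-loop source is added to the current run loop, and the callback fires when a message lands on the port. The callback body is illustrative.

```c
#include <CoreFoundation/CoreFoundation.h>

/* Called on the run loop's thread when a message arrives on the port. */
static void port_callback(CFMachPortRef port, void *msg, CFIndex size, void *info)
{
    /* 'msg' points at the raw Mach message that was received. */
    CFShow(CFSTR("got a Mach message"));
}

void add_port_source_to_runloop(void)
{
    CFMachPortRef port = CFMachPortCreate(kCFAllocatorDefault,
                                          port_callback, NULL, NULL);
    CFRunLoopSourceRef source =
        CFMachPortCreateRunLoopSource(kCFAllocatorDefault, port, 0);

    CFRunLoopAddSource(CFRunLoopGetCurrent(), source, kCFRunLoopDefaultMode);

    /* The run loop retains the source, and the source retains the port. */
    CFRelease(source);
    CFRelease(port);
}
```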
[Transcript missing]
We have, again, from System V, we have the System V IPC, thanks to the Darwin development organization, the Darwin community. Again, we'd like that to be for compatibility only, but it's there. Where are we going? Well, one of the things that we're doing from the BSD side is looking at adding the poll interface. There's lots of things in the way of doing that, and so we've got to work those issues out, but we'd really like to do it. In fact, you know, we'd also like to look at adding the BSD kqueue support.
One of the things that looks really nice about that is it looks like it could be interchangeable with run loops in such a way that you can choose to either use a run loop to do your event processing or use a kqueue to do it, depending upon the semantics of your app, and still have the same event sources providing data into both. That's really just something we're considering, but it actually looks pretty good. And more directly, you know, first, we're looking at ways to do the POSIX inter-process synchronizers. And again, real-time being a key factor in all of this, we're looking at inter-process communication.
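For comparison, a minimal sketch of the BSD kqueue model being considered: register interest in an event source (here, readability of a descriptor) and block in kevent until the kernel reports it. kqueue was not part of the system at the time of this talk; this follows the FreeBSD interface.

```c
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>
#include <stdio.h>

/* Sketch: block until fd has data to read, then report how much. */
int wait_until_readable(int fd)
{
    int kq = kqueue();
    if (kq < 0)
        return -1;

    struct kevent change;
    EV_SET(&change, fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);

    struct kevent event;
    int n = kevent(kq, &change, 1, &event, 1, NULL);  /* NULL timeout: block */
    close(kq);

    if (n > 0)
        printf("fd %d is readable, %ld bytes pending\n", fd, (long)event.data);
    return n > 0 ? 0 : -1;
}
```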
Security in the system. Now that you can share and you're isolated, now you have to be able to do it in a controlled way. Mach itself does no authentication checking. Everything it does is based on capabilities. If you have a port to manipulate something, you can do it.
[Transcript missing]
Right, and the vnodes inside of BSD provide the cache of all of the permission information from the file systems themselves. I won't be going into how file systems do all of that.
But again, Mac OS X applications are much more than just BSD processes. And so there's additional security semantics that are added on that, and they interact with the kernel services in certain ways. One of the things that happens is individual applications now have unique capabilities in Mac OS X, right? Not just users. So you can have an application that has a keychain item, right? And that keychain item allows it to decrypt a file or do a certain operation, right? And these have to be protected.
This is somewhat contrary to the typical BSD model where every process owned by a user is equal. You don't have to protect any one process from another process at that user level, right?
[Transcript missing]
What's new in the security world? Well, at the Mach level, we've got security tokens added, and the CFRunLoop provides those so that people can do authorization on anything that's Mach-based at the CF level.
Where are we going? Again, we're probably going to split some of the Mach capabilities up, especially in the thread and task name port world. And one of the big ones: we're going to be restricting task_for_pid in such a way that you won't be able to reach out and grab another task and manipulate it without some kind of authorization, some kind of privilege, in order to protect those higher-level keys from the application level. And we've got people that are looking at the system in a way to determine if it's evaluatable in a secure way, and we'll be looking at bringing some of those changes into the base system.
We've got some new things from the serviceability part of the system. We've got a new panic user experience. We don't want the user to ever experience a panic. But the old panic was really bad. It just scrolled over your screen and just spewed data all over.
Now you've got a nice little UI that pops up. The panic data is saved and captured on the reboot. And so you can just cut and paste a report and send it off. You don't have to type in or copy the data off the screen or take a digital picture of the screen and send it in.
Again, we don't want users to ever panic, but... All right. And for people who write kernel services, we've now got additional debugger support, including the ability to debug across routers and things like that, so the debugger can handle ARPing. And we've got ktrace support in the system as well, so now you can trace things using standard ktrace tools.
Which leads us to KEXT services. In Darwin, we've got these KEXTs, and I/O Kit KEXTs. We've always discouraged writing KEXTs. If you can do it at user level, please do, because we don't want to repeat the behavior and the instability of 9 by having everything everyone does written as a KEXT that can bring the whole system down. But we understood that most I/O Kit things needed to be done as a KEXT, right, or a large number of them anyway, right? And we expected most KEXTs to be I/O Kit KEXTs.
But we've done it in such a way that most of the hard work is done in abstracted families and things like that, so the actual KEXT is fairly small and can be rigorously tested. But then we've got all these other KEXTs, right? We've got BSD KEXTs that do file systems and do networking, right? And BSD is BSD. It's a traditional kernel, opened up to the world, so anyone can reach in and do anything they wanted. And it was based on that recompile model.
So what's new? Well, non-I/O Kit KEXTs are still highly discouraged, even more so than they used to be. And we will preserve the binary compatibility of those KEXTs for a little while, probably not much beyond this current release. We don't know exactly when, but at some point, those things will break.
And you'll either have to keep recompiling them with every release or not. But we need the ability, as some other companies often do, to innovate. And right now, we're locked out of innovating in the kernel because we have KEXTs that reach in and grab every little private data structure we've got.
And we don't want to break those binaries, but we're going to have to in order to go forward. And so we're marking all the non-sustainable APIs inside the kernel as such in the header, so that you'll have to turn on a flag basically saying, I understand. I'm no longer using sustainable interfaces, right? I'm going to have to figure out something else to do in the future, or I'm going to have to keep tracking the kernel from release to release to release to release.
So what we're doing in the future then: we're doing that now, and then in the future we're going to be defining sustainable interfaces for those file system and network KEXTs so that binaries will be able to go forward from release to release. We're defining controlled access to system services so people can stop patching the system. It's amazing what people do when they have the kernel sources available to them and they have the kernel fully symboled and everything available. They've been doing some really, really bizarre things, and we want to stop that.
And for many of our services, we're trying to figure out ways to move things to user space. KEXTs are bad. KEXTs are bad. We want to do everything at user space that we can do. So we're trying to define interfaces to move things up to user space. So whatever you're doing in the kernel, if any of you are KEXT writers, whatever you're doing and whatever you need access to, come talk to us.
Send us your input. We're really begging for it. To help us figure out how to, A, get you out of the kernel in the first place, and if not, define a set of controlled and sustainable interfaces for you to use in the kernel, so that we jointly don't have to keep going through this release after release.
All right, so that pretty much covers everything. The roadmap. Well, if you want to talk about that KEXT thing some more, right after this in this very room is Managing KEXTs. So you want to definitely hang out for that if you're a KEXT writer. More of the security stuff later this afternoon. The Darwin feedback forum later today. That's definitely something we're interested in, getting your feedback, and if you can start giving us some of that input on KEXTs, that would be great.
And to learn about some of the Pthread stuff that's new in this release, go to the Pthread session tomorrow and learn about those there. And on Friday, there's Developing for Performance. We give you a lot of rope in this system. We don't restrict you from very much, but we also give you a lot of rope to mess up the performance of everybody else. And so you really need to write your applications in a way that is not only performant for you, but performant for the rest of the system, and that session on Friday will cover a lot of that, a lot of how to take that into account.
For contacts, Jason Yao is the technology manager for Core OS, and you probably want to send him any requests. There are also the Darwin developer sites and the Darwin documentation. If you have questions, we're very responsive to the Darwin development list. We can talk things out on that list if you don't want to go through Developer Relations as well.