OS Foundations • 43:52
The kernel is the heart of Mac OS X, providing the memory management and processing primitives that support the rest of the system. Come learn about how the kernel is being supercharged for Leopard with support for new processors and smarter algorithms.
Speaker: John Wright
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
My name is John Wright. I run the Kernel Technologies Group at Apple. I'll be talking to you a little bit about what's new in the kernel today. So we're going to go through kind of just an overview of what the kernel does and kind of some background on it.
Then we're going to talk a little bit about some developer features for Leopard, particularly some things that relate to 64-bit, some of our Unix interfaces, and some things about system tracing. And then we're going to cover some infrastructure changes for Leopard, including things that cover scaling and working on larger machines and some implications to security. And then we'll have some time for QA.
So, target audience. The kernel is really kind of the lowest level in the software stack, and it runs right on the bare iron of the hardware. So most of you guys are actually implementing to much higher level frameworks. But it is still of interest to some of you to kind of understand the lower level mechanisms that are underlying those, and just in general the system behaviors of those.
Now, a lot of times when people think kernel, they'll think device drivers, and the I/O Kit interfaces. And I'm not covering those. Those are actually covered in some other talks. But I want to make sure you guys understand that if you guys are expecting I/O Kit interfaces to be covered, they're actually covered elsewhere.
So this is your basic block diagram of the kernel architecture. You'll see here on the right, you've got a few libraries. There's the standalone library that's actually used for the boot infrastructure. There's libkern, which is basically some generic routines that are used throughout the kernel. And there's the platform expert, which abstracts some of the lower level hardware details away from the kernel.
Down here at the Mach layer you have virtual memory, the physical memory map, the scheduler, the abstraction of tasks and threads, your timers, your low-level primitives for symmetric multiprocessing, and the all-important Mach IPC messaging. There's also some device driver infrastructure in I/O Kit, which really isn't in Mach, it's actually kind of separated.
And then on top of that, we built the BSD layer, which includes our POSIX interfaces, pthreads, processes, signals, our Unix security model, and VFS, which exports all the file system infrastructure out to the file systems, some of which are in the kernel and some of which are outside the kernel. And you also have some of the networking infrastructure, the sockets and the networking stack.
Now, other talks are actually going to cover some of the file system and the networking and I/O Kit stuff. So I'm really trying to focus on the core kernel infrastructure here in this talk. There are other talks that relate to these things: Writing Device Drivers for Mac OS X covers the basics of I/O Kit, and there's a talk dedicated to what's new in the file systems as well.
So I just want to say a little bit about target customers. Since we are kind of the lowest level of the software stack, most of our target customers actually end up being our frameworks. And then most of you guys are actually developing to those frameworks. But those frameworks actually percolate up directly to you guys and to our end users. So it is important for us to kind of see the stack as a whole.
As far as where we get influence on how we develop the kernel, what really drives us is the internal innovations that happen within the stack. At Apple, we're unique in that we control the hardware architecture and the entire software stack all the way out through the frameworks and all the way out to the applications. We do a lot of optimizations directly with our framework teams, but we also drive this really through looking at the whole widget. We interface with the hardware teams and the framework teams, and we take requests from apps to really figure out how this thing can be optimized.
We continue to be influenced by external sources. We're always looking at industry trends and standards and seeing how they'll affect the way our software works. We're always interested in technology our partners are working on and seeing whether we can take advantage of that. And we're always looking at the open source community and what they're driving.
And probably the most important here is actually requests from you guys. Our developer relations team actually takes requests from you guys and passes them on to the engineering team and helps make that relationship work. And we take those requests very seriously. So when you guys come up with an idea of how we can actually optimize things for you guys, I encourage you guys to send that in.
I want to talk a little bit about performance philosophy. Our focus is really improving the system level performance. And I have a quote up there: lies, damn lies, and microbenchmarks. I've seen a lot of things that show microbenchmarks between different operating systems. They're of less interest to us because of something that we've kind of learned here.
John Wright That if you look at like LM Bench, which was written for Linux, Linux runs really well with LM Bench. If you look at something like Lib Micro, which was written for Solaris, Solaris runs very well in Lib Micro. John Wright And we run another micro benchmark called Mach Bench, and you probably won't be surprised. John Wright That we do pretty well in Mach Bench.
So we use these microbenchmarks as a tool to kind of understand what our kernel does, but we're much more focused on the system level benchmarks. And Apple has several teams that run system level benchmarks, both on the client and on the server, to analyze our performance.
And we're much more focused on optimizing for those. There are a couple of things that we pay special attention to in our kernel. One is Mach messaging and IPC. Mach messaging is the underlying communications mechanism that's used throughout many of our frameworks. We want to make sure we optimize well for it.
Another thing we pay particular attention to is the scheduling latency and the interrupt latency. This is the underlying technology that allows us to do our real-time performance. It also directly affects the system interactivity. Just being able to switch between different tasks, get things done on scheduling time, is very important to our system because it directly affects our users.
I also want to say something about the monolithic versus microkernel debate. I think about every six months on Slashdot, I see something go by that riles everyone up, and everyone gets to put in their opinion on whether microkernels are good or monolithic kernels are good. Mach is a microkernel, its architecture is a microkernel architecture, but we use our kernel, XNU, as a monolithic kernel. So we build BSD and Mach together and the device drivers together to actually build a monolithic kernel. Now, we do this to minimize the transitions across the kernel boundary.
That's really what we do. And we find that what we want is that when the code path goes into the kernel, we want to get all of its work done before it comes back out of the kernel again. So that's where we kind of lie on that debate, I guess.
So here's our friend Darwin. There's been rumors out there that he was pining for the fjords or something like that. So we can kind of put those rumors to rest, because we finally have our open source release of the Darwin kernel for Intel. And I think that actually hit Mac OS Forge yesterday.
So we faced some kind of challenges in making sure this got out. We've written lots of new code, as you can imagine, over the last year or so to complete the internal transition. And we had kind of a management priority to make sure those products shipped to our customers.
And that was definitely the priority at the time. Now, when you write lots of new code and you're doing open source, it does take some time to review the code, engineer around code that is encumbered, and make sure that it's all clean before we can actually release it.
And that did actually take us some time. The other thing is that we really wanted to preserve the ability to build and run the sources on the system. We probably could have thrown an open source kernel over the wall much earlier that wouldn't build and run, but we find that to be not very useful to you, the developers.
We wanted to make sure that everything worked well before we got it out there. I did make a note that there are some caveats here, in that the first release does require you to pull over some kexts that aren't on your original systems, but in general we'll be able to take care of that as we go forward.
So I want to emphasize that Darwin is very important to us. And there's really two reasons I always quote. One is that it's really for you guys. It's transparency for you guys. You guys can look at the system behaviors in the source code. If you want to know how the fork system call works, you can actually go get the source and look through the code.
You can actually modify the code, build it, stick it on your system. Even if 90% of you guys never do this, it's good insurance just to have the ability to do it. And we really appreciate that. And we want to make sure you guys have that ability. And then the second reason is actually transparency to security review.
And we really believe, since this is kind of the core of the operating system, that it's important to have the sources out there for security experts in the community to be able to look through and give feedback on the kernel itself. And the last thing is thanks for your patience. We're glad to have it out there, and we're looking forward to our regularly scheduled releases of the open source kernel.
Let's wrap up the overview. There's a lot of detail to how the kernel works. A couple of months ago, there was a book released. Now, I get no kickbacks or anything, and Apple's not actually associated with the publishing of this book. But it actually is a great reference for Mac OS X in general.
It has several chapters on how the kernel works itself. If you're interested in the details of Mach IPC and things like that, I really recommend you pick this up. It's about 1,600 pages. I found it riveting reading. Some of you guys just might find it a good reference. Let's talk a little bit about developer features.
64-bit, it's a big theme here at the conference. If you haven't heard this before, I'm just going to kind of reiterate some points that were from the State of the Union. The new 64-bit Intel systems, the Mac Pros, inherit what we call Tiger parity. We're enabling 64-bit processes through basically our Unix interfaces.
And we're using the standard programming model that we used previously on PowerPC machines. So it's 64-bit longs and pointers. We're still writing the kernel as a 32-bit entity, but we enable the large address space on the 64-bit processes. And this allows us to have full compatibility with the drivers and the kexts.
So the big news for Leopard is that we're actually going to put in support for all of our frameworks, including Carbon and Cocoa. And that will enable all the applications that are using those frameworks, and not just the Unix interfaces, to get full 64-bit benefits. And there's a talk here about how to enable your drivers for 64-bit to address larger memory systems.
Now, I want to talk a little bit about how we got to 64-bit support for the Intel systems. And on the left here is our address space map. If you look below 4 gig, what you'll see is basically the address space map for our 32-bit machines. And you'll see your 32-bit address space sitting beside your 32-bit kernel space.
We usually refer to this as a 4/4 split for the kernel and user space. Some kernels use a 3/1 split or a 2/2 split. And in those cases, you'll actually see the kernel underneath or above, but not overlapping with, the user space. We actually made this split so we could get full utilization of the virtual memory, both for user and kernel, and allow our applications to be able to address the full 4 gig. It does mean that there is a little bit bigger boundary to move across from user to kernel, because now we have to actually swap out the address space.
But you'll notice that in 64-bit, you have your 64-bit address space above that. And you still have your 32-bit kernel space. So we don't actually take that hit when we go back and forth between user and kernel for 64-bit processes. Now, there is still a boundary there that actually takes some time. But we don't actually have to swap out the address space. So on our 64-bit machines, we run our 32-bit kernel in what's called compatibility mode on Intel machines. And that basically means that it's the 32-bit mode of the architecture.
The big joke here is that the 64-bit user space, the size of this right here, is actually 47 bits. So if you really thought you were getting a full 64 bits, no, we took some away from you and we're only giving you 47. That's actually an architectural limit of the current Intel chips. There is kind of a dead space above 47 bits that will actually give you a general protection fault if you try to address it on the current architecture. But note that 47 bits gives you 128 terabytes of addressable space. And if you want to do kind of the math there, 128 terabytes really is a lot of space. So before you go complaining to developer relations that you really need 48 bits of addressable space, make sure you do the math. It's a lot.
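To spell out that arithmetic: 2^47 bytes is 2^47 / 2^40 = 2^7 = 128 terabytes, or roughly 140 trillion individually addressable bytes.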
The other point is that in the Intel architecture, when you're actually in 64-bit mode, you get to use twice as many registers. So for the general registers, there are eight on IA-32, and when you're in 64-bit mode on these machines, you actually get to use 16. So there are some performance capabilities that you get beyond just the large address space.
Unix. So, standard Unix interfaces. We've supported the POSIX standard Unix interfaces for quite some time. In Leopard, one of the things that we want to do is move our default behavior to the UNIX 03 standard. But we'll still support all the legacy behaviors, it's just that the default will be UNIX 03. And you can kind of switch back and forth even today between these behaviors with _POSIX_C_SOURCE as a define. It's just that you will get the POSIX behavior by default in Leopard.
If you target your binaries to Tiger or earlier, you'll automatically get the legacy behavior. If you do this in Xcode, you'll automatically get the legacy behavior and things shouldn't change in general. You can check the man pages. I think we've updated most of them to actually explain what the default behavior is now and what the legacy behavior is.
There's a specific talk here about developing and porting Unix applications that will go more in depth on what these interface changes are. But I do want to talk a little bit about the kernel changes for this, because there are some catches and I want to make sure that people are aware.
The first one is that pthreads now support all the mandatory cancellation points. Cancellation points in pthreads are used to synchronize the proper termination of a thread. They give you a point to register a callback handler and release all the resources before the thread is terminated. The legacy behavior is that we only support cancellation points on exit and pthread_testcancel. And now we support all 55 mandatory cancellation points. So they're in very small font here, but the gist is there's a lot.
Most of you that are porting things on the POSIX layer and the pthreads layer are probably porting from other POSIX machines, and those programs probably already support these cancellation points and register callback handlers and won't have any problem. You really only have to be aware of this if you've written something natively and relied on the fact that, for some reason, there aren't many cancellation points, and aren't registering a cleanup handler. And then down here at the bottom, I just have the syntax to register the cleanup handler, which is pthread_cleanup_push. And you basically give it a cleanup routine and the set of arguments for it.
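To make that concrete, here's a minimal sketch of registering a cleanup handler; the routine and buffer names are just illustrative, not from the slide:

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Illustrative cleanup routine: frees a buffer if the thread is cancelled. */
    static void cleanup_buffer(void *arg)
    {
        free(arg);
    }

    static void *worker(void *unused)
    {
        char *buf = malloc(4096);
        pthread_cleanup_push(cleanup_buffer, buf);
        /* sleep() is one of the mandatory cancellation points, so a pending
           cancel can fire here and cleanup_buffer() runs before the thread
           terminates. */
        sleep(60);
        pthread_cleanup_pop(1);   /* pop and run the handler on the normal path too */
        return NULL;
    }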
The next one I want to make sure people are aware of is partial reads and writes. I've always considered this a bug in Mac OS X more than a feature. When performing a read of a size larger than what is left in the file, the legacy behavior is actually to send back what is left of the file and an error, and set errno to end of file.
So if you say we're trying to read 128 bytes and there were only five bytes left in the file, what it would do is give you those five bytes and then set the EOF. The conforming behavior is actually to give you the five bytes and send you back how many bytes were actually read, which would be five. And then on the next read, you'd get the error and the EOF.
So if you're using read and write and actually looking into files, you want to make sure that you have the correct semantics for the partial reads. Partial writes are very similar: if you're writing to a device and it's full, the semantics of writing only a partial write are pretty much the same. They'll change to where you do the partial write, you'll get the number of bytes written and no error, and then on the next one, you'll actually get the error.
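A minimal sketch of coping with the conforming semantics is a read loop that tolerates short reads; nothing here is Mac OS X-specific, it's just standard POSIX usage:

    #include <errno.h>
    #include <unistd.h>

    /* Read up to len bytes, tolerating partial reads.
       Returns the number of bytes read, or -1 on error. */
    ssize_t read_fully(int fd, void *buf, size_t len)
    {
        size_t total = 0;
        while (total < len) {
            ssize_t n = read(fd, (char *)buf + total, len - total);
            if (n == 0)
                break;              /* end of file */
            if (n < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted, just retry */
                return -1;          /* real error, reported on its own read */
            }
            total += n;
        }
        return (ssize_t)total;
    }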
A couple other minor ones. Controlling terminals. All processes now get a controlling terminal by default. And this wasn't the case in Tiger before. And you can actually opt out of getting a controlling terminal, but you have to specify it with the O_NOCTTY option on an open or one of the other calls.
The only thing to note here is if you're actually relying on the fact that there's no controlling TTY in one of your programs, you may have to nohup your processes, for example, now that the controlling TTY is there. So just be aware of that. And likewise, there's another minor nit with sending signals to process groups. For example, with the kill process group system call, if the sending process is in the process group that you're trying to kill, it will also get killed now. The default behavior before was that it would not get killed; it would kill everything else in that process group.
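Here's a minimal sketch of both points; the device path is just illustrative:

    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    int main(void)
    {
        /* Opt out of acquiring a controlling terminal when opening a tty device. */
        int fd = open("/dev/ttys000", O_RDWR | O_NOCTTY);
        if (fd >= 0)
            close(fd);

        /* Under the conforming behavior, this delivers SIGTERM to every process
           in our own process group -- including this sending process itself. */
        killpg(getpgrp(), SIGTERM);
        return 0;
    }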
So we're also adding a new system call, posix_spawn. The idea here is that fork and exec will actually copy the address space, even if you're going to exec right away and get a whole new copy of the address space. They will copy the address space of the parent. And the solution to this back in BSD and other Unixes was to move to vfork and exec, which would not copy it, and therefore was faster.
The problem with vfork and exec is that it wasn't just a faster version of fork. It left you kind of semantically out in the open between that vfork and exec to make modifications to the child that would reflect directly into the parent. So, for example, if you were trying to mark it as setuid, it would actually reflect that change back to the parent and mark the parent as setuid as well. So we actually prefer posix_spawn because semantically it's a lot safer than vfork.
So you can do a few things with it, like pass file descriptors or change user and group IDs or the signal mask. But you actually have to do it specifically through the arguments. And that's semantically a lot better than just relying on the point between your vfork and exec to do those things.
So you get all the speed benefits of going to vfork without being out there in the open and exposed to bugs and reliability problems in your system. So just as an example here, I put up some of the syntax. This is basically posix_spawn. You have to give it a PID to return. You have to give it what it's going to exec. In this case, it's just going to do a cat on /etc/passwd. These NULL arguments here are where you would actually put the arguments to pass the file descriptors or change the user and group IDs.
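Reconstructed as a compilable sketch (this is my reconstruction of that example, not the exact slide code):

    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void)
    {
        pid_t pid;
        char *child_argv[] = { "cat", "/etc/passwd", NULL };

        /* The two NULLs are the file actions and spawn attributes; that's where
           you would pass file descriptor operations, signal masks, and the other
           attributes mentioned above. */
        int err = posix_spawn(&pid, "/bin/cat", NULL, NULL, child_argv, environ);
        if (err != 0) {
            fprintf(stderr, "posix_spawn failed: %d\n", err);
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }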
And then you basically have the arguments to the command that you're execing and the environment that you'd want to pass into it. So we encourage people to look at posix_spawn if you're using fork and exec in your programs and take advantage of that. This last developer feature I'm really excited about is DTrace.
So if you went to the developer state of the union, you heard Ted talk a lot about Xray. And DTrace is really kind of the fundamental building block of Xray and being able to make custom Xray instruments. This is a port of OpenSolaris's dynamic tracing facility.
And it's used for profiling, isolating code paths, and exploring behaviors across the entire system. It goes through kernel and user code. The design is really great; it's designed to be near zero overhead on production systems. This is kind of new technology in the industry, in that most of the profiling tools that you had were either on while you were developing, or on test systems that were specially set up, and you had to use special software to use them. And this is compiled in.
It basically uses a set of static probe points that are around what we'll call contended routines. And these are routines that you have to carefully get the information out of because of locking primitives, etc. And then it also does dynamic probe points dynamically in the code. And it will actually go in and insert code in line to pull out the profiles. So this means in a production system, it has very little effect at all, near zero overhead.
Because the static trace points are no-ops, and the dynamic trace points aren't in the code at all. And when you actually turn on the tracing, it's relatively lightweight. It only traces the points that you have specified. And you can feel safe about turning this on in production systems and being able to see what behaviors are going on in the system.
So this is just a short example showing that this will basically return any processes in the system that are trying to do the kill syscall, so trying to send kill signals to other processes. And you basically can just run this on the system, and it will start giving you exactly which processes are running kill. And you can go explore from there. There are just thousands of uses that you can make scripts for. The D language is a scripting language, more akin to awk. It's a little esoteric, but for all the old Unix hands, it probably looks familiar in a way.
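A minimal sketch of what such a D script might look like (my reconstruction, not the exact slide code):

    #!/usr/sbin/dtrace -s
    /* Fire on every kill(2) system call; arg0 is the target pid, arg1 the signal. */
    syscall::kill:entry
    {
        printf("%s (pid %d) sent signal %d to pid %d\n",
            execname, pid, arg1, arg0);
    }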
And there are two talks that I'll refer you to. One is a dedicated talk on DTrace, and that will go over the D language in much more detail. The other is the performance analysis tools talk, which will talk about Xray and how they make custom instruments out of DTrace. Xray is a great UI on top of DTrace that allows you to do these system traces graphically and see, in real time, what your application is doing.
So I'll move on to some infrastructure features. We're starting to see larger and larger systems coming out of the Mac division. The newest Mac Pros that we just released are the largest yet. They have two dual-core processors, so they really look like a four core system. And they support 16 gigabytes of RAM.
And we're also starting to see that the apps are more demanding. We're definitely pushing the envelope on graphics performance and what I think Peter Graffagnino referred to as the wow factor. And the frameworks are making it easy to take advantage of this on the system. So we're seeing an ever increasing number of threads to take advantage of the multi-core architecture. And these applications are accessing a lot more data, and they're a lot more media intensive. And we don't really see these trends letting up.
So to keep up with those trends, we're going to do some enhancements to the kernel itself to make sure that it transparently still gives performance to the upper layers of the software stack. So a couple of these enhancements. The first one is the scheduler architecture itself. We've always used a flat scheduler architecture, where all the processes and threads actually go into a flat run queue. We're moving that model to a hierarchical model that will actually reflect the hardware that it's being run on. The load balancing across that hierarchy will be automatic and transparent to the user.
And we're also going to do more idle processing without actually using an idle thread. And that will actually reduce some of the context switch overhead that we see in our current system. But the important thing to take away from this is that through all the switching, we're going to maintain our interfaces and our current system behaviors. So all the real-time scheduling interfaces, you'll see the same interfaces being exported. All the priority interfaces, they'll be exported too. And this should just be transparent to the applications themselves.
Likewise, in the virtual memory space, we're actually doing some more advanced paging algorithms to support the larger memory and the greater use of that memory. We'll be changing our working set algorithms to be more efficient as well, so that the code that you're actually running in your application should be stored in memory more efficiently.
We're actually making changes to the dynamic pager so we can allocate swap files on demand, instead of just continually increasing the swap file space. So that will actually limit the number of swap files that are on disk, or the size of the swap files on disk.
Another area that we have to look at when we're looking at these larger machines is resource limits. And we've kind of looked all the way through all the resource limits in the system. What we're wanting to do in Leopard is actually target a number of them to become dynamic. An example of that is the number of open files, so that we can actually scale up the number of open files according to the size of the system and according to the application workload dynamically, so you don't have to set it on the system at all.
Likewise, we are actually adding support for cloning device drivers, and this actually means more terminal drivers, or TTYs, will be available to the system. And probably the thing to note there is if you have an application such as a VPN, or heavy use of consoles in your application, there should just be more TTYs available on the system for the application to use.
We're also looking at some of the default limits on the system, and we're having to bump them up because we're just supporting larger machines and larger workloads. So, for example, previous to Leopard we were only supporting 100 processes per user, and now we're bumping that up to 266.
[Transcript missing]
And the last thing is we're actually going to start enforcing some limits that were never enforced before.
In the mating of Mach and BSD, some of these limits actually don't span that boundary. And we're actually going through and re-architecting the accounting for some of the resources to make sure that we can actually enforce some of these limits. And most of these limits are actually per application.
So I've listed out some of them here. The maximum virtual address space, the maximum resident memory size, and the maximum amount of wired memory in the system. And we'll actually hook those up. They were completely unlimited before, but being completely unlimited will also allow a rogue app or an app that's not well behaved to go ahead and use everything on the system. And that's not the behavior that we want for our system.
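For reference, per-process limits like these are the kind you can already query and adjust through the standard getrlimit/setrlimit interfaces; which limit constants correspond to which entries in that list is my reading, not something stated in the talk. A minimal sketch of querying a couple of them:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* Maximum virtual address space for this process. */
        if (getrlimit(RLIMIT_AS, &rl) == 0)
            printf("address space: cur=%llu max=%llu\n",
                (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

        /* Maximum resident memory size. */
        if (getrlimit(RLIMIT_RSS, &rl) == 0)
            printf("resident size: cur=%llu max=%llu\n",
                (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

        return 0;
    }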
A little bit about security infrastructure. One of the things that we took a hard look at for the Intel transition was the use of task_for_pid. task_for_pid is actually part of the Mach messaging infrastructure, and it allows you to get a task port for any PID on the system.
And it allows you to inspect that task port. And starting with our Intel machines, our Tiger-based Intel machines, we actually kind of locked down the security of task_for_pid. task_for_pid is a very powerful tool. It also can be a relatively dangerous tool if it's not used correctly.
So we wanted to make sure that it had proper privileges around it. So the behavior on the Tiger-based Intel systems is that processes are allowed to inspect their own task port without a problem. But they must be privileged to inspect another process's ports, meaning that they'll have to run as root.
Now, we had to make an exception, because a lot of our debug tools use task_for_pid to do their job. So we actually made a separate set of groups called procview and procmod that will allow you to get task ports for other processes in the system. But you have to be setgid to those groups, and you also have to be a related process. Now, currently what we call a related process is basically that the effective UID is the real UID.
In Leopard, the thing that will change about that is we'll probably be a little bit stricter about how we define a related process. We'll probably use group sets and saved UIDs to make sure that we're a little bit tighter about which processes these debug tools can inspect. Examples of debug tools that use this mechanism are GDB, Sample, and MallocDebug. Most applications never use this mechanism.
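As a rough sketch of what those debug tools do, here's the call in isolation; it will fail unless the caller satisfies the privilege rules just described:

    #include <mach/mach.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* The prototype lives in the Mach headers; declared here so the sketch is
       self-contained. */
    extern kern_return_t task_for_pid(mach_port_name_t target, int pid,
                                      mach_port_name_t *task);

    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        mach_port_name_t task = MACH_PORT_NULL;
        /* Needs root, or procview/procmod group membership plus a "related
           process" relationship, per the rules above. */
        kern_return_t kr = task_for_pid(mach_task_self(), atoi(argv[1]), &task);
        if (kr != KERN_SUCCESS) {
            fprintf(stderr, "task_for_pid failed: %d\n", kr);
            return 1;
        }
        printf("got task port 0x%x for pid %s\n", task, argv[1]);
        return 0;
    }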
So another area of security that we're looking at is actually creating policies for more access controls in the lower levels of the system. So we're adopting the mandatory access control framework. This was originally developed for TrustedBSD. It was ported in the community to Darwin, and it was a great starting point for our architecture.
It's basically an architecture for authorization points in the kernel. The authorization actually works via a callout mechanism, so you can actually implement the policy out in user space. Examples of these policies are things like SEDarwin, which is, as you can imagine, similar to what SELinux does, if you're familiar with that project. But you can have other policies. There's a Biba policy, and many other different policies can be implemented this way.
It works through having labels on objects and subjects. And objects are things like sockets, file system objects, and subjects are things like processes. And these labels are actually stored persistently in our extended attribute architecture. And then there's a mechanism to load policies and arbitrate between multiple policies in the system.
You can actually layer these policies on top of each other. Note that we're not actually exporting the interfaces in Leopard to use this directly by developers. What we want to do is create this infrastructure and create a simpler mechanism to use it by the upper level applications in the system.
We are actually adding sandboxing support in Leopard as well. Now, there's actually a lot more about this from the security team in a separate session, but I just want to kind of introduce you to the concept. It's based on the MAC framework. It's basically a process-based sandbox. It's defined by a text profile.
It has a language that those profiles are written in that the sandboxing engine interprets. It doesn't require any special privileges to use it, but you cannot escalate privileges based on it. This is actually included in the developer seed, and there are some examples in /usr/share/sandbox of profiles being applied to, I think, three system daemons.
And it's designed to be very cooperative with applications. It's really designed with you guys in mind. We'd really like to have your feedback on this. The idea is that if you have an application that just needs one set of privileges, that you can write a profile that will sandbox it and only allow that one set of privileges out. And there's a lot more information in the security talk about this.
So just to summarize some of the points here. For data-intensive applications, welcome to a fully 64-bit enabled world. We kind of put a lot of this infrastructure in Tiger, and we're going to reap all the benefits of it in Leopard. So we're believers in standards, and we believe standards are good.
It helps in porting applications. It helps you guys to see a common set of interfaces. But be aware that we've made some changes here from our legacy behavior. And make sure you don't step in any of the potholes involved. We are still very committed to the open source Darwin kernel. We believe DTrace is a great tool throughout the system for you guys to actually explore the system behaviors and optimize your applications on the system.