DTrace: Kernel Logging Made Easy - WWDC 2006

OS Foundations • 46:18

Speaker: Terry Lambert

Unlisted on Apple Developer site

Check out Bezel, our iPhone mirroring app →

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Hello. Welcome to Kernel Tracing Made Easy with DTrace. Today we're going to do a short introduction to DTrace. We're going to learn what it is. We're going to learn a little bit about the deprogramming language. I'll show you a sample script, another sample script. We'll talk a little bit about what works now. We'll talk a little bit about what's not in the WWDC Leopard seed that you guys already have in your hands. And we'll talk a little bit about what's going to be different.

Anybody who attended Dave's talk earlier this week saw this slide. This is DTrace being used by X-Ray. What we're going to talk about is this part of things, just DTrace by itself. So what is DTrace? DTrace is an open source technology and it came out of OpenSolaris. It was the first thing OpenSolaris released. The D stands for dynamic, so it's dynamic trace.

It's usable for profiling, locating bugs, exploring the operating system, just figuring out what's going on. There's no recompilation of code necessary to use this tracing facility on the operating system or on your own programs. You can trace more stuff, less stuff, and you can drill down on problems iteratively, again, without changing a line of code. There's near zero overhead when it's not in use.

So if you don't use it, it doesn't cost you anything. For dynamic probes, there's no overhead at all, unless you're using it. For static probes, there's a tiny bit of overhead which is normally eaten by the CPU instruction pipeline because it turns static probes into no ops, and then replaces them with code when you turn them on and puts them back to no ops when you turn them off.

DTRACE is not here to replace your debugger. You can't modify memory. You can't modify registers. You can't interpose functions, so you can't change the arguments to a function or the results from a function. You can't call arbitrary functions. So if you had lock_debug in your kernel and you wanted to run it like you would from a and it would print out your lock list. You can't do that. D itself is not a Turing complete language.

And the simple way to explain that is you can't write general purpose programs in it. But what it means to you is there's no branching, so you can't jump backwards and get into an infinite loop. There's no iteration, so you can't blow a loop counter and get into an infinite loop.

And there's no user functions, so it's not possible to do recursion and either get into an infinite loop or crash. But you can change the timing a little bit. When you do probe functions, you actually do end up executing code that you would not have otherwise executed, and this can sometimes result in timing changes. And you can use that to your advantage if you want to trigger a timing change if something only happens when the timing is off.

The deprogramming language. The deprogramming language has a C-like syntax. It's an event declarative language. That means it's very much like awk. I don't know if everybody's familiar with awk. If you're a Unix head, you are. If you're not, it's the weirdest thing ever. You can indicate an interest in an event by creating a probe.

And what happens when that event goes is the probe fires. And then whether or not it actually does anything or collects any information is based on a predicate. So actually when you instrument a probe point, the probe is firing, but you don't necessarily have any collection of data unless the predicate is true. And it's basically like an if test. So you declare the probe, you say, "Here's the probe, but only do it if this thing happens." And of course the order of the probes is important.

We can decode C, C++ data types. Actually, Objective-C as well, with some of the work by the tools people. But they don't use the compact type format for it. Compact type format is basically massaged debugger information, which gets put into a separate section, and then is used to allow you to know whether or not something's a pointer or an int, and allow you to dereference structure members. For example, the contents of the proc structure. You want to look at p_com, which is the command that happens to be running in that process. You can do that.

It also supports speculative tracing. What speculative tracing does is it lets you collect data on everything, and then if you decide after you've collected the data, no, that's not what I was interested in. You can throw that away and only show the data that shows the things that you're interested in. This is a little bit more powerful than the windowed debugging in Shark, in that all you get back from it is the events of interest, not all the events surrounding it.

So, what's a D script? Well, if you're a Unix head, then you'll recognize this as a back-of-snow grammar for a D script. But what it basically means is that a D script is one or more clauses. A clause consists of a probe, a predicate, and an action. The predicate is optional, as you can see by the brackets.

A probe itself consists of a provider, a colon, a module name, which is optional, a colon, a function name, which is optional, colon, and then a name for the probe itself for the action that it's going to do. So, for example, entry or return on a function. And I have a couple of examples here. The first example right there, we're doing a system call probe. So, that matches system call.

It matches any module. In our case, the only one who implements system call probes is the kernel itself. That module name is mock_kernel. And then we get the close function. And when the close function returns, then if self -- self is a reference to the thread. And if that's true, then it sets it to false.

This is basically the probe we use to turn off tracing on the way out of a system call. This second example is function boundary tracing. Function boundary tracing traces entries into and out of functions. This particular one, we're saying match only the Mach kernel. Actually, you could put a KEXT name in there, although that won't work in the WWDC seed. And then it doesn't care what function it is.

All we care about is the entry to the function. And then, of course, just to be tidy, I'm saying I'm not interested in tracing things which have to do with Dtrace itself. So, if the executable name is actually Dtrace, then I'm not going to trace it. And then what I do in my action is -- and you'll get a little more information on this later -- is I have an aggregate variable like calls. This is similar to a Perl associative array. It's indexed by the executable name. And what it'll do is it'll count the number of times that you enter into this. So, this particular example counts the number of function call entries.

"That happen within any given executable over the entire time that the probe set is running." So let's start with the sample script. At the top, there's the familiar pound bang that says this is a script. The executable name is UserSbinDTrace. That's the same location as it is on OpenSolaris. The dash S says it's a script.

So what we're going to do is we're going to do the close entry instead of the exit. And when the touch program calls close, because we use the predicate touch, we're going to begin tracing this thread. And the guard is a global variable which, like basic, you declare it and start using it.

[Transcript missing]

When the close call returns, we unmark the thread as traced. What this is basically going to do is we're going to enter into the close call, we're going to see all of the functions which happen in the kernel as a result of the close call, and then we're going to see the close call exit. So, I have a demo. So, demo machine please. Demo please?

[Transcript missing]

Is this thing normally running Leopard or not? Do we know? We probably don't know. Well, that's interesting.

[Transcript missing]

The DTrace output is the trace output. Maybe I can get Steve to come up and boot this thing on the Leopard on the external hard drive, unless it's running Leopard already. What you would have seen is a column which indicated the CPU that things were running on, another column which is the probe ID, and then you would have seen function call entry and exit for all of the function calls in between. And it turns out that this is actually not an incredibly pretty output. So what you can do is you can actually control the output that you get. The output that you get is actually varying by provider.

So you can override the default output. In this particular case, we were getting the function boundary trace output, and it was pretty ugly. So I don't mind missing that demo. You can override the default output. You can either do this with your own output in addition to the default output, and there are a bunch of intrinsics for doing this. There's a trace intrinsic, which allows you to trace variables or data out. There's the trace mem, which outputs chunks of memory. It's controllable. There's our old friend printf.

There's printa, which is like printf for the aggregate variables I talked about earlier. There's stack, which will show you the kernel stack at the time that the probe fires. And there's ustack, which shows you the user stack at the time that the probe fires. And some providers let you select from alternate outputs that they themselves support.

For example, the FBT provider has a poundpregmedd option flow indent, which is really more complicated than it needs to be. A way of saying, "I don't want the ugly output. I want the pretty output." Is it back alive? Yes. Okay. I'll let Steve check to see if it works. So I'm actually going to give another demo, which shows actually the call graph nesting depth.

[Transcript missing]

We'll get the demos up in a second. Adding information to our sample script. So, in addition to getting the pretty call graph output, which would have included only the CPU ID and the function name, as well as little arrows left and right as to the probe depth, which I actually would have been able to sneakily show you if they hadn't got rid of my folded slides.

[Transcript missing]

If it's up and happy, which it may be, then I'll be able to give a couple of short demos. Does it panic when you do a function boundary trace? We'll see in a sec. Looks good, okay. So, I guess back to the demo machine. Oh, what is it? Shift Option +? Yeah. Oh, go back to demo, please. Go back to demo.

Yay! Okay. This is kind of meaningless over here. You can ignore it. But what I'm going to do is I'm going to trigger the sample over here. And this is my initial sample. So what we'll do is we'll touch, since that's the program that's waiting for some file in temp. All of my files tend to be named foo. And there. This is the initial output we were supposed to see. And as you can see, it's not very pretty. You can also see that it does a lot of work. So, let's make the output a little prettier.

This is the second demo. I'm lazy. I'll just up arrow and do foo again. Okay. So now here we see our call nesting depth. This is the same type of thing that you might get out of Cflow or something like that, some other tool. But it gives us a good idea of what's going on. DTrace, an open source project which provides insight into the interaction between your code and the OS kernel, is now available as part of Leopard. This session will explain how kernel developers can both take advantage of, and extend, DTrace's functionality to help debug and optimize low-level code.

This session will explain how kernel developers can both take advantage of, and extend, DTrace's functionality to help debug and optimize low-level code. how the nesting depth doesn't quite line up. The reason is because we don't see the tail call optimization on that particular function. Hopefully, we'll be able to deal with that. But it actually doesn't distort the output too much. You get useful output from it. So what we'll do is we'll go to the pretty example with our own output. Let's mix things up. Phi instead of fu.

So, what do we have here? On the right, we have timestamps in nanoseconds for the amount of time that it took in each one of these calls. These are actually one behind because of the way that I'm calculating the deltas. I could have tried to make it prettier, but it's not a Turing complete language. So, we can look through these and we can see that out of these numbers, the only thing that really sticks out as a high number is this guy here. That basically means that that function is potentially one that we're going to want to optimize. Okay, podium please.

So let's talk a little bit about probe order. Probe order is important. Probes are firing every time they're hit. You can think of these as events which occur. And event streams always occur in order. So what's going to happen in this, we've switched up the order of assist call and the FBT probes.

And we're going to see what happens. Because this probe comes first here, the assist call close probe, and we see that its action is to turn off tracing. This probe does not fire. And the reason this probe doesn't fire is because its predicate is no longer true. So you've got to be careful a little bit about probe order if you want to capture all of the information that you're trying to capture.

So, don't know if I want to go back there. So we have another demo coming up here. What we're going to do is we're going to watch the amount of data read and written by the program top. This shows instrumentation of the system call functions read and write, and it shows introspection of their arguments.

You can see I'm using another aggregate function down in the read entry and the write entry. They seem identical. What I'm really doing is by probe function, which in the read case is the string quote read quote, and in the write function is quote write quote, and we quantize it on the second argument. Arguments are option based zero, so don't know how many people directly use read and write.

Read and write take a file descriptor, they take a buffer pointer, and they take a byte count. So what we're going to quantize on here is argument two, which is the third parameter to the function, which is the byte count. So we're interested in the byte count that's going to happen. And then when we see the system call exit get entered, what we're going to do is we're going to print out our information, and we're going to exit DTrace. So this basically captures only the information that you're interested in capturing. Demo, please.

Okay, so what we're going to do is we're going to watch what the program top does. So I'll start up top over here. What it's doing really doesn't matter. I'll switch its mode, switch its mode back. I'll quit it out. And here's what we get back from it.

What we get back is power of two histograms on the number of bytes read and written, and the number of times each function is called. So for example, we've had three calls which were somewhere between 1025 and 2048 bytes. We've had seven calls down in the 165 to 128 bucket and so forth. So we basically get a histogram on that.

So actually it's kind of not necessarily interesting to do this because you don't necessarily capture what's going on at the time that you're having a problem. So what I did was with a little ANSI escape sequence foo on my computer, I was using my original script and then using what's called the profile provider, which fires at a certain time interval. It's user-specifiable. I said do it once a second. I'm going to capture just the writes.

And I'm doing this in real time. And it's updating the graph in real time. So you can use DTrace to write your own tools. And you notice when I move the pointer, it actually moves, so it's not eating the system either. And when I quit top, it stops. If I start it back up, it picks up where it left off. Okay. Uh, podium.

That was our demo that we just did. What works right now? Right now, we have a number of providers working. We have the DTrace provider itself that provides begin and an error. That lets you set up things before you start. It lets you clean up things or print out data when you're done. And it lets you handle error conditions. Right now, please don't generate error conditions.

Basically, what this is intended to do is it's intended to trap things like null pointer dereferences if you attempt to introspect a pointer to a string that happens to be null, or things like that. Function boundary tracing, we just demoed that. System call tracing, we demoed that. Timed events, we demoed that. There's actually another demo in the set of scripts, which makes the timed events a little more understandable. But I won't show that. You can get that at the website. PID providers. We have two PID providers.

We have the regular PID provider and the tools people have added in Objective-C PID provider. The Objective-C changes up the arguments a little bit for module name and function name, and replaces them with class and method. And we'll be documenting that. We've got some static user probes going, which basically means instrumenting your own code or instrumenting our code, or as you maybe saw yesterday, instrumenting Java 1.6. We support the full probe complement for detracing Java that Sun supports.

And then Mach calls. That's an example of a little bit of a demo. And then we have a little bit of a demo of the code. We're going to show you a little bit of a demo. That's an Apple edition. And it's basically the same as a system call provider, except for Mach calls, which Solaris didn't have. The detrace command line works, although apparently not the first time.

The detrace library works. So you can link your own programs that use it against that library, although I don't think many people actually do that. We don't even do that with X-Ray. X-Ray actually invokes the command line detrace and parses its output. And the detrace documentation already out there applies to everything that we've done in detrace. So anything you can Google on the web, that basically applies to us.

So, what's not in the leopard seed that you have? What's not in the leopard seed is some of the providers. They're all static probes. They're all in my bug queue. And they're basically the static kernel probes for gathering information from places where it's dangerous to gather information with a function boundary tracing. And so, when these events fire, we have a different way of collecting the information, and because of that, we instrument things that you can't.

And if you had something like that in your driver, you'd be able to use that as well. But, like I said, right now, for the seed, we're not supporting static probes. The compact type format and symbols for tracing texts is not there. That would include your texts. So, right now, you can't specify your text name as a module name, but you will be able to specify your text name as a module name. Static probes in your texts, you could use those as well if you had dangerous places. The PPC support is currently lagging behind the x86 support. That's a little explainable. We're porting x86 code.

The PPC support is pretty much there with the exception of the PID provider. And then there's a number of additional exit probes and entry probes that were kind of scary because there's actually a lot of assembly language in our PPC kernel. And the x86 support is currently disabled. You can actually boot your system with a boot args, your 64-bit system with a boot args of -legacy. And that'll re-enable it. But you won't be able to run 64-bit programs while you're doing that. We expect to address this in the near future.

So, what's going to be different from Solaris? I was going to say the other ports, but it looks like we're out of the gate first. Mostly, they're OpenSolera-specific things, and most of them have to do with impedance mismatches in kernel design. I could go into detail on each one of these, but effectively, they have to do with the differences in architecture. Where possible, we're going to map to Mac OS X equivalents. So you should be safe from them when running generic scripts.

You might get slightly different information, but we expect the probe points will work, maybe with the exception of the UFS probe points that they have in their code. And you should be able to use generic scripts which are written on Mac OS X on non-Mac OS X systems. "We'll be contributing to the organ bank out there and we'll be collecting."

So, what can I use it for now with what I get in the seed? Well, you can find the location of kernel memory leaks. I won't go into that in detail, but there is a sample script which instruments the kernel zone allocator and counts the number of allocations and deallocations. And everywhere there's a positive number, you either have a leak or you have a persistent allocation. And everywhere there's a negative number, I'm surprised you didn't crash.

You can find out where an error is coming from. That's kind of neat. So you could use speculative tracing to say, give me the results of this, and when this function kicks back this error, then I want that information. But until then, just keep doing it and throw away the stuff I don't want.

You can find out where to look for performance problems for a given program load or a given benchmark. My benchmark of touching the file foo in /temp indicated that the closef

[Transcript missing]

That's a real big thing for me because unlike texts, I can't necessarily unload my kernel and load a new one without rebooting.

Most places that you'd put a printf are places where you would do printing out of arguments or lock state or things like that. And obviously anything that's available on the argument stack is available at the time that the probe fires. Things which aren't available are local variables. So you aren't going to be able to, for example, introspect the contents of a stack variable.

So maybe once in a while you'll either have to put in a static probe, which will be deactive and not out of your shipping code, or you'll have to put in a printf, do it the old way. You can also be surprised at how much work the kernel really does.

I expected a lot of people who come from user space, expected it to go into close and come -- right back out without doing anything. Instead we had maybe 40 lines of function calls, maybe 20 functions being entered and 20 functions -- the same 20 functions being exited. And it's fun for just poking around.

I know Steve and a couple other people poked around and we found out that, for example, our Windows server was -- I think it was -- I think it was calling sigprocmask an ungodly number of times a second. And that's something we have to look into. So, what we're going to do is we're going to do a bug hunt. This is a bug hunt that I've done personally, and it was my first real use of DTrace, and so I'm going to show it to you guys. Demo, please.

[Transcript missing]

I worked on the POSIX conformance stuff, and as part of that, we had to deal with differences in the set UID, set GID calls, and so on. And we have this little thing called Apple Directory, which is not such a little thing, and it provides credential information that is not stored locally.

Well, it turns out that you pretty much have to be able to get out of with a particular group with set groups. And it turns out that even when you left the group, you were still in the group. Because set group would say no, you're not in the group. And then it would say, well, let me ask the directory server. Directory server says yes. Congratulations, you're in the group. So I had to do a bunch of code to deal with that. And my code to deal with that is-- let's see. What did I call it?

About 1900 lines of code. So I had to deal with it in about 1900 lines of code. And, like most 1900 lines of code changes, something didn't work.

[Transcript missing]

That should have been the transition event for the credentials were spread all over the place. There are a lot of files. I can say index of three files, three fairly large files.

And To do this with the traditional method, what I would have to do is I'd have to spread printfs in every place where potentially one of these things was happening, and I would have to deal with it. So instead, what I did was I wrote a DTrace script.

Let's see, where is it? I wrote a little DTrace script. My little exec name cheat here is so that I can use a different code path. Obviously, this kernel does not have the bug anymore, so I had to put it back in, but if I put it back in, it would have been bad, so I had to put it in a different place. So what I'm going to do is I'm basically going to check the function boundary crossings to see what code path is there in the bug. And I have a little file.

The DTrace: Kernel Logging Made Easy. DTrace, an open source project which provides insight into the interaction between your code and the OS kernel, is now available as part of Leopard. This session will explain how kernel developers can both take advantage of, and extend, DTrace's functionality to help debug and optimize low-level code.

[Transcript missing]

And here's the call path. And rather than getting into a huge amount of explanation on all the mandatory access control and all that stuff, what we'll do is we'll say, well, This is the only thing in the code path that I touched in the 16 places that I touched.

So that's the one that I care about. So what we'll do is-- What we'll do is we'll wait until we get into that function. That happens to be a kernel function. That's a kernel function you wouldn't otherwise be able to see into unless you had sources to the kernel, compiled up your own kernel, put a printf in.

And what I'm going to print out is I'm going to print out the user ID, the effective user ID, the saved user ID, and the group membership UID, which is the one that should have been changing. Those are all the parameters to the function. There are arguments 1, 2, 3, and 4 to Kauth, Cred, SetReal, Effective, and Saved user IDs. So we'll say ./script2. And we'll run my special buggy version of the code.

It wasn't an X. What was it called? It was called bug. Of course, why would a bug be called bug? Okay, and we're going to run it. And if I quit reaching for the clicker instead of the mouse. We'll see that what's happening is it's effectively opting us back in. It's saying, "For group membership checks, check for UID 501 in the Apple directory." Well, I told it not to do that with the set groups. What's going on? Obviously, the parameter is wrong.

So, how do I find that out? Well, I can... I can go through here and try and find something near or around or before or after, or maybe not even in the diffs that I did because I forgot something. That's a lot to go through. That's not interesting. I could go into the C file itself.

I could go in the C file and look around, but that's not a lot of work. But I know from the DTrace that it's that particular function. So what I'll do is I will go in the C file. Kernel.proc.c. And what we're looking for is we're looking for chaos.

DTrace, an open source project which provides insight into the interaction between your code and the OS kernel, is now available as part of Leopard. This session will explain how kernel developers can both take advantage of, and extend, DTrace's functionality to help debug and optimize low-level code. This session will explain how kernel developers can both take advantage of, and extend, DTrace's functionality to help debug and optimize low-level code.

Let's see, we know it's in the setuid function. So what we'll do is we'll go down to chaos-cred probably. Did I do that? Yeah. Yeah, CathCred. So here's the function call. And it turns out that what I did was I set a group membership UID I set it only in the case that I'm opting out of the directory services, in this case, memberd. And then, well, I'll be darned. I did all my work except passing the right parameter. I assume that I've recompiled and rebooted, but only once.

And I see from my DTrace output, It's passing Kauth no cred, Kauth cred no cred, which is what it should have been passing in the first place. And so I fixed my bug, and I've done it in about five minutes, and that's about what it took me. Podium, please.

So I pretty much, while I was talking, went through the moral of the story. Doing a lot of printfs and debug, I would have been at least four kernel recompiles and reboots out. I would have had to put the printfs in everywhere. I would have had to take them out of everywhere except the code path of interest, put more printfs in the code path of interest, and then I'd have to recompile and reboot again, try again.

And then I would have had to take all my printfs out and hope that I got them all. Instead, five minutes. Here are the DTrace resources on the internet. I'm sure that Garth will make this available somewhere. And so, what should you take away from this? DTrace is neat. I like it. And that's not just because I helped Steve, who did most of the work, work on it.

You want it. Leopard has it. It's not hard to use. As we saw, hopefully everybody's going to come away from this with a basic understanding of how to do a Descript, or at least putter around until they get something useful out of it. It's going to make your life easier, and it's going to help improve your code.