Development Tools • 1:01:10
The first step in optimization is understanding where your application is using resources. Learn to use Shark and other tools to understand application flow, analyze what your code is doing, and take the necessary measurements so you can make your application run lean and fast.
Speakers: Dave Payne, Rick Altherr, Ryan Du Bois
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
And we'd like to tell you today about the Performance Analysis and Debugging Tools in Mac OS X. We'll show you some case studies of using those tools to do optimization, and we'll talk about some exciting new directions that we're taking the tools this year. So first off, why are we here? Why worry about performance? It's real easy to forget about it in that last rush to get the product out the door, but it is a competitive selling point. Hopefully we all care about it. I'm on the Xcode team, I certainly hear about it.
It's also easy not to really notice performance problems; maybe your customers use the product in a slightly different way than you did, and they start seeing issues. It can be hard to fix major performance issues at the very end of your development cycle, so you want to be sure that you're using a development process that allows you to monitor and fix performance throughout your development cycle.
There's a lot of stuff going on in the system these days: file system event monitoring for Time Machine, Spotlight in the background. Users want a lot of capabilities, so we want to make sure that we're using the system to its best capacity. And it's not just speed; users want a responsive system.
They want scalability. If they toss ten times more data at it than you ever did, are your N squared algorithms going to be able to do that? Are they going to go out of control? Memory use is a big deal. If you are testing on a two gigabyte maxed out development system and then your users work with it on 512 megabytes, are you going to be paging a lot and thus having a lot of problems? And battery life. It's a huge issue now that we're selling more and more laptop computers.
You saw the sales numbers. We want those batteries to last through an entire plane flight and play all our movies and listen to iTunes. So, important things. So in this session we're going to look at Shark for profiling your applications. We'll look at how to analyze 64-bit binaries. And we'll talk some more about X-Ray, which you heard about on Monday, and specifically, in this session, how it makes use of DTrace under the covers.
So the flagship analysis applications that we've got at this point are Shark and X-Ray. Shark, hopefully you're all familiar with it. It's great for profiling your application or your entire system to show where the hotspots in the application are and help you fix those, including giving suggestions for fixing them. X-Ray is a new direction for us that really lets you go in and peek beneath the covers to understand what's going on in a way that we haven't been able to look at processes before.
These are part of the overall performance tools suite for Mac OS X that we've been shipping for a while. X-Ray is new this year. With these tools, we can monitor performance so we can see when there's an issue, and we can analyze it to understand why there's an issue. So they cover various areas like execution behavior, CPU speed, and I/O. They cover memory use, and there are tools for analyzing graphics.
So hopefully most of you are familiar with most of these: tools like top, BigTop, and Activity Monitor for looking at overall system performance; Quartz Debug to help you figure out whether you're drawing too many times to the screen, and really help visualize that. And then the best tools on the system now for analysis: Shark, and now X-Ray. For memory use, ObjectAlloc is still probably the best; MallocDebug is still there too.
X-Ray is getting some of these capabilities now, and we've got command-line tools as well. These ship with the Xcode Tools releases, providing full support for Mac OS X and all of our major environments (Cocoa, Carbon, Unix apps) and all the major languages: Objective-C, C, C++, and Java, and Shark supports Fortran as well. The graphical apps are integrated with Xcode, so you can make this part of your development process. So here to tell you more about Shark, I'd like to bring on Rick Altherr. Rick Altherr: Hi.
We're going to cover briefly what Shark is and what it can be used for, how you might actually go about using it in your workflow, and some new features for this year. So what is Shark? Well, it's a profiling tool. So what does that mean? Well, it's a tool to help you find the performance characteristics of your application as well as the performance bottlenecks within it, so you can actually find why you aren't running as fast as you could. Shark supports a wide variety of languages: almost any language that compiles down to native machine code, we can do analysis on. We actually support source line information for many of the compilers.
We also offer both a GUI and a command-line version. This is really helpful; the command-line version is especially useful for doing things like automated regression suites. You can have it collect Shark sessions on those suites and come back and look at them if you see problems.
So we make a big deal about Shark being very simple to use. Performance shouldn't be painful. To this end, we actually like to call it one-click profiling. You really have one button that you need to worry about, and that's a start button. Pushing this button, you instantly are collecting profile information. You don't even have to do anything else. 30 seconds later, a session will appear. It'll show you what was happening. And from there, you can go on your way.
We also realize that you don't always need to look at just one type of information, so we also offer many different configurations. This lets you change what you're looking at, what type of characteristics you're interested in. And we also offer a number of bundled presets so that you don't have to create your own configurations; you can go for the most common types of things right off the bat.
And we also let you narrow your scope. Sometimes you want to look at the whole system, sometimes you don't. We'll cover this in more detail in a bit. So it's really simple to use, but one of the important things is that it's actually really powerful, too. You can have an easy tool that doesn't help a whole lot.
In this case, Shark brings the performance bottlenecks right to the top of your session. It shows you: this is where you have something happening. We do this so you don't have to go hunting for it; it brings it right to you. We also find the code within each function that is the important part.
For example, you can see this line of source code has 45% of the samples, and we actually change the background color to bring this up and highlight it. In fact, when you open the function, it'll jump right to that line to get you looking at where you're spending a lot of your time. We also offer performance tips in the sessions; they encapsulate a lot of the common performance bottlenecks that you run into, and they offer suggestions as to how you might go about fixing them in your application.
And the really important thing with Shark is it's very, very, very low overhead. Most of the time, you won't even know that Shark was running during the session. It won't even show up. Because you don't really want to look at what Shark's doing. You want to look at what your application's doing. So we strive to keep our overhead as low as possible.
So let's talk a little bit about how you might use Shark in your application. The first thing you need to think about is what exactly do you want to look at? What type of information do you want to collect? So three of our most common profile types are time profile, system trace, and static analysis. So let's go ahead and take a look at these. Time profile is what you might consider a classic profile. It's a look at what was actually executing on the processor at given time intervals.
And we aggregate that and give you a session and show you exactly where your time was spent within your application. We show you not only what function it was in (that's what the Self column is; it shows how many samples were in that function), but we also aggregate it through call stacks. So you can actually look and see: yes, there were 79% of the samples in this function, but what different call paths did it come from, and which one contributes the most? Because some functions may only see a lot of samples from one particular invocation path.
So this is really helpful for finding CPU-bound problems, but sometimes you're actually running into performance bottlenecks interacting with the system. So we have a thing we call System Trace, and it looks at the boundary between the system and your application: system calls, VM faults, a lot of things that can be performance bottlenecks, and they're usually pretty tricky to find.
So we offer a couple views. The first one is the summary view, and it shows where are you actually spending time during this trace? Is it in the kernel? Is it in your code? Is it in VM faults? Where are we actually spending a lot of this time? But sometimes that's not sufficient, and you actually want to see what really happened.
So we have the timeline view, which actually shows what CPU was executing what thread in a graphical format, as well as what events occurred during this time sequence. So you can actually look at each individual event. And not only do we show you that an event happened, but for many of them, we show you a call stack of what in your application caused that event. Now, this doesn't work on everything-- for example, VM faults.
But for system calls and things like that, you can see how you got to it, because the system call name itself might not be what you actually called. For example, in this case, you see that our application, Flurry, actually made a call through Foundation, which ended up in LibSystem, which then made a system call.
And static analysis is yet another way of looking at things. It's a way to look at the binary without actually running it. So you may have an isolated section of your executable where you're not tracking down the runtime problems. You're just trying to figure out, how can I optimize this function to the maximum? And what this does is it runs our analysis engine and pulls up the same performance tips that you would see in the runtime behavior without actually running it. It just looks at the file on disk.
So you've picked what you want to look at, and now you have to pick what on the system you want to look at. You know the type of information. So we have three different ways of-- or three different scopes of sampling. Some of the configurations work better with certain types of sampling.
For example, System Trace is very oriented towards looking at the system as a whole. You don't particularly want to look at an individual process; you might after you collect the information, but it actually collects what happens with the OS as a whole. Time Profile can be used both on everything and on an individual process. You can pinpoint it on just your application and see what happens.
File, on the other hand, lets you look at just the file on disk. And this is really useful for static analysis, where you're looking at, potentially, what the compiler output is, or whether we can see any problems with custom-rolled assembly, various things like that.
[Transcript missing]
Thanks, Rick. So, as he said, my name is Ryan Du Bois and I'm here to tell you what's new with Shark. We spent the last year adding some scalability and robustness to Shark as well as a host of new features and I'm going to highlight five of them.
First feature I'd like to talk about is UTF-8 support. As some of you may know, the old Shark had some trouble with source files, application names, and framework names with UTF-8 characters. That is now a thing of the past. The new Shark supports all these things out of the box.
The next really exciting feature I'd like to talk about: 64-bit profiling. We've added completely universal support for profiling 64-bit binaries. It allows you to get above the 4-gigabyte line for symbols and code. This will be instrumental in performance profiling your Leopard applications because the frameworks and the code are going to be loaded up above 4 gigabytes.
We've also added EM64T disassembly support. You can see the syntax in Intel or AT&T, whichever is more comfortable for you. And there's even an integrated instruction set reference. As you'll notice from this screenshot of a time profile, the addresses here are above the 4 gigabyte line, and it's using 64-bit registers. So I'd like to give you a quick demo of this in action. So we can switch over to the demo machine.
So this is a little application called Fractality. And what it does is it draws fractals and allows you to animate them in motion. We contacted the developer of this application. He was gracious enough to let us port it to 64-bit on Intel hardware and push it above the 4-gigabyte line.
He even allowed us to change the name to Über Fractality to demonstrate the UTF-8 support. So I'm going to go ahead and launch Shark and take you through kind of the workflow. We're going to pick a time profile. We're going to just target Fractality. And what we're going to do is go ahead and start it animating.
So that's what Fractality does. Keep your eye on the frame rate while I'm profiling this. I'm just going to press Start. You'll notice it doesn't drop terribly much at all. I could have let that run all the way, but I decided to stop it early. And here we go. This is the Shark profile.
If I look at this and try to interpret it, I'll notice there's a lot of GL stuff going on, and that's because this particular set of drivers does not have hardware acceleration. However, I do have a saved session of Fractality from a machine with hardware acceleration. So if we go ahead and poke open this function, the one that comes to the top... make this a little bit bigger.
You'll notice we're spending most of our time calculating sine and cosine. It takes us right to the line number of the source code and highlights it, as Rick was saying, with the colors. If I click over on Assembly, you'll notice the addresses are above the 4 gigabyte line and it's using 64-bit registers.
Can we switch back to the slides? So the next feature I want to talk about is DWARF debugging support. We've added support for the DWARF debugging format, for both 32- and 64-bit binaries. You can mix and match with stabs, and we've even added support for dSYM files along with it. Fractality, as you just saw, was built using DWARF symbols, so all the source line information came from DWARF.
The next feature I want to talk about is something we like to call Symbolication. So let's say, for example, you write an application and it goes to your QA team, and that's where a performance bottleneck gets noticed. The problem here is the QA team has the build as it's going to be released, so they don't have any symbols or any of the debugging information.
But they go ahead and take a Shark session anyway, trying to be helpful, and they email it to you. With the old Shark, you would have had no idea what was going on. With Symbolication, it allows you to point Shark at a symbol-rich version of that same binary and discover the symbol names after the fact.
Next feature I want to talk about is called the windowed time facility. So for another example, let's say you're writing a video player and you're testing it. You go to play videos. Every once in a while under certain conditions you'll notice you're dropping frames. Not quite sure why. And the problem is it's hard to repeat.
With the old Shark, you would have had to press start, hope that the frame dropping was going to happen in your sample window, press stop, and hopefully you got it. Kind of a pain. That all is now a thing of the past. What windowed time facility allows you to do is press start and forget. Shark will continuously collect a certain limited sample window in the background and you press stop after your target event happens.
So with the new Shark, you can just press start, set your sample window, you're watching your movie, going along, boom, you notice you're dropping frames. What do you do? Wait till it's over, you press stop. Shark gives you that last window of samples as your session and you can pinpoint exactly what happened.
We've added support for this in Time Profile and System Trace. There is a windowed time facility checkbox that allows you to set the sample window for both of these configurations, so it should be readily apparent when you go to use it. Here to demo that, I have Rob Barris from Blizzard Entertainment. Rob? Okay. I'm just going to switch the demo machine here. Can everyone hear me okay? Fantastic. Hi, I'm Robert Barris, 18-time WWDC attendee, first-time speaker.
I run the Mac software group at Blizzard Entertainment. Our most recent Mac product is called World of Warcraft. This is a massively multiplayer online role-playing game. It is a large and complex product, and we ship updates to it regularly. In fact, over 40 updates since it first came out in the fall of 2004.
And part of that regular update process is identifying and correcting any performance regressions. As this is a product where new features and capabilities are being added with every release, it's always possible that some new behavior sneaks in that is unpleasant. So we like to catch those things and fix them as quickly as we can. And Shark has been extremely valuable to us in identifying those types of problems.
With the new windowed time facility, the task of finding transient hiccup-type events during gameplay has become a lot easier. I think of the windowed time facility as TiVo for profiling. As Ryan just explained, the ability to have a sliding window of the most recent events stored continuously, and to stop when you identify a problem, really frees you up to go tackle some of the more surprising issues that may occur in your application.
So let's have a look at World of Warcraft and see if we can spot one of the little hiccups that we'd like to eliminate. Due to time limitations today, we won't be able to actually dig into the code, make the fix, rebuild it, and so forth. But I think we'll demonstrate the strength of the windowed time facility and how it can help you zero in on something that may be elusive.
Okay, so I'm going to start up Shark. You are visible here, excellent. There it is. OK, I have it set to the windowed time profile. You'll notice that's at the bottom of the menu; it's distinct from the regular time profile and system trace at the top. And the one thing to look out for is the new sample window.
So what we have set up here is we've asked to maintain a rolling window of 10,000 recent samples, each one being a stack crawl done on your whole machine at a given moment. And they're spaced one millisecond apart. So this works out to a 10-second trailing window of activity, which basically means that if you see something strange, you have 10 seconds to hit the hot key, stop it, and have a look at what happened. That's not good.
Log in to our character here. Excellent. There's my teleporter character. She can go all over the world and see various interesting issues. Okay, here we are. This is the city of Ironforge. It's one of the central hub cities in our game of World of Warcraft. And we're just standing in front of the griffin trainer.
This is an in-game character who can let you hop on a griffin and fly to another city. So I'm just going to run down there. Oh, let's start up the Shark profile. This is the fun part. So Shark is now continuously sampling and keeping the last 10 seconds of activity in its buffer.
Now, I have seen several times, and I'm going to try to reproduce it right now, a hiccup when we go through the very first little portals. We fly up and out of this mining area and start to leave the city. There'll be a little hiccup, and I'm hoping to catch that and show it to you live.
Here's the trainer. I'm going to talk to him. Say, "I would like to fly to this other city." And he puts me on the bird and we start flying. And I think it will happen in the next few seconds. There was one. I'm expecting another one right about here. There it was.
Hey, it happened. I hit stop on Shark. I can actually just hide the game now because we're done there. Shark is analyzing the samples. To save time in the demo, I repeated this process earlier, and I have a snapshotted version of it right here. What I want to show is when it first comes up, you'll see something like this. You'll see the percentages spent in your various routines.
And a lot of these are very familiar to me. They have to do with rendering the terrain, routines inside OpenGL, internal data structure updating. This is like the bread and butter of what the game is doing frame after frame. And when we're running 40, 50 frames a second, that's somewhere around 20 milliseconds per frame, which corresponds to 20 samples that Shark took for an average frame. And we can go see that right now in the timeline view, one of my favorites.
If I click on one of these columns in the timeline view, what that's showing is that the spikes are the depth of the stack of your thread. You can actually view various threads, but we're watching the main thread here. Each one shows the depth of your thread at the moment in time when the sample was taken. And if you tap the arrow keys, you can go from sample to sample. Can you see that okay? Can everyone see that okay? And it will show you the stack crawl evolving on the right. Take notes quickly; you'll see the names of all our routines.
So basically one frame should be about 20 of these; that's like a normal frame. There's another normal frame. So what we're going to do is scroll to the right. We noticed that hiccup happen. And what I'd like to do is, if you think you see something that kind of breaks the pattern of the activity here, shout out the word bingo. The stack crawls are all pretty much the same height, pretty familiar.
So there it is. That's the event I was looking for. And what we discover here from reading the routine names is I'm going to highlight this one here. This is the post-initialization phase for a new player coming into view. And it's actually jumping into our Lua interpreter to run a script to extract the guild name. If you're familiar with the game, each character has their name and their guild name displayed over their heads.
And what happens when we fly through that little portal is you fly over a very crowded area of the game where there's maybe 100 people standing around. And when they come into range, the engine is-- oh, what did that have? The engine is-- oh, that's the real profile. It just finished processing.
The engine is pulling all of those characters and running a Lua script on each one to find out what is your guild name. It's doing it all in one frame. Whereas before we had this nice even flow and the frames were taking 20 samples, 20 ticks each, this goes on and on and on, and we're just spending a lot of time in the Lua interpreter.
So what I've done in the span of a single run, a single capture, is I've identified exactly what's going on, and just by reading the routine names, I can kind of get an idea for what I might be able to do to improve this. One idea would be to stagger the guild updates so we only do a certain number per frame and smooth out that load and you would avoid that hiccup. That basically concludes the demo. That's it.
That feature makes me really happy. We asked for that about a year ago and it makes me a happy programmer. Okay. Thank you, Rob. Thank you. I'd like to finish up the Shark portion of the presentation with a quick summary. As you've seen, Shark is easy enough for beginners, yet powerful enough for the seasoned professionals. It offers many customizable workflows. It even gives you complete profiling analysis down to hardware events, should you so choose to use it. We do have some exciting new features, including the windowed time facility, which you just saw demoed live.
There's also a new version of the CHUD tools, of which Shark is a part, on the ADC website, and it should be available today. So after the session, please go download the latest version, try it out, and feel free to give us feedback. And I'd like to turn it back over to Dave Payne for the rest of the session.
Oh wait, I lied. More info, there are links to performance documentation on the ADC webpage. And there's also a Debugging with Shark lab later tonight, downstairs in the Performance lab, 6 to 8. All of the Shark team will be there, so feel free to come on down. Use it live, ask us all your questions, and we'll be happy to answer them. Thank you.
All right, thank you. So I'm going to switch gears a little bit and talk about other ways that we can look at 64-bit processes on the system. If somebody who's not a developer gets hold of a system, they can run sample to help figure out what's going on.
VM Map is a tool that can look at the memory map of your app, and we'll actually look at that. And GuardMalloc will help find memory corruption, on 64-bit apps now. So you may have seen this if you've attended some of the 64-bit sessions: what the memory map of a 64-bit process looks like.
So the intent is, to help catch pointer issues as you're developing and transitioning from 32-bit to 64-bit, we'll by default have a 4-gigabyte page zero. So the entire memory space of a 32-bit process will not be used by 64-bit processes. User space starts at the 4-gigabyte line and goes up to 128 terabytes.
Please fill that up real soon. The shared libraries, DYLD, and the system comm pages will go at the high end of that. You may not see that on your systems today. Then we've got some invalid range there based on hardware limitations, and we'll be loading the kernel up in high memory space.
So I mentioned that the vmmap command-line tool can help visualize what the memory map actually looks like in your process. With this tool, you can look at where the binary image sections get loaded, and you can look at the names of files that are mapped into your process and where they're mapped.
You can look at what virtual memory regions compose the malloc heap, and look at VM-allocated regions that get created outside of malloc. Some subsystems, some libraries, actually call vm_allocate directly, and you may be doing that as well, possibly in special circumstances. And we can see where the stacks are.
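For reference, the tool is invoked from Terminal like this; the PID below is just an illustration, and the partial-name form is the one used in the demo later in the session:

    # by process ID (illustrative PID)
    vmmap 1234
    # or by (partial) process name
    vmmap Fractality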
Now I talked about GuardMalloc a couple of slides ago. So what does GuardMalloc do? This helps force a crash if you are using memory in unsafe ways in your application. It can be real easy to miss these things in standard operations. You've got small buffer overruns and maybe that's not catastrophic in the situations you've seen. Or maybe it's that elusive crash or it happens once in a blue moon and you don't know why. So with GuardMalloc you can help trigger some of these conditions. Buffer overruns, accessing memory that you've already freed, or possibly uninitialized memory.
So to use it (again, this is 64-bit aware now): because it forces a crash in your application, it's best to always run it in the debugger. So it's most convenient to do it from Xcode: just go down to the bottom of the Debug menu and turn on the "Enable Guard Malloc" menu item.
You can also do it from GDB in Terminal by setting an environment variable to insert the GuardMalloc library into your process as it launches; that's essentially what Xcode is doing too. You can read the man page for lots more information.
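Concretely, a GDB session with GuardMalloc inserted would look roughly like this; the application path is a placeholder, and the library path is the one the libgmalloc man page documents:

    $ gdb ./MyApp.app/Contents/MacOS/MyApp
    (gdb) set env DYLD_INSERT_LIBRARIES /usr/lib/libgmalloc.dylib
    (gdb) run
    # the first bad memory access now stops GDB at the faulting instruction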
How does it work? Well, first off, it puts every allocation on a separate virtual memory page. And then, to catch buffer overruns, it aligns the end of your allocation at the end of the virtual memory page. It's more common to overrun a buffer than to underrun, so by default it's set up to do that. So we have a 4K page: if you allocate 100 bytes, the 100 bytes is at the very end.
You walk one byte over, and we crash because we've got a protected, unallocated guard page after that, and you've got a memory violation trying to access that. If you think you might be getting buffer underruns, you can set another environment variable, which will then restructure it so that the memory allocation gets aligned at the start of the page, and the protected page is at the start.
When you free memory, it VM-deallocates the entire page, so the memory is gone. If you try to access that after you've freed it, boom, you crash immediately, so you can see right away why your bad memory access is happening. And to help catch uninitialized memory, there's another environment variable to set that fills memory with a pattern that's typically not a valid pointer; it'll end in 0x55.
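The variables being described appear to be the ones from the libgmalloc man page; assuming so, they're set the same way as above:

    (gdb) set env MALLOC_PROTECT_BEFORE 1   # guard page before each block,
                                            # to catch buffer underruns
    (gdb) set env MALLOC_FILL_SPACE 1       # fill new allocations with 0x55
                                            # bytes, to catch uninitialized reads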
So, in fact, when I first got this working for 64-bit, I thought it wasn't working, because the app was crashing. But that's because it was working: we had a size mismatch in one field in a library that was causing a buffer overrun. So let's take another look at the Fractality application, if I can switch to the demo machine. So what I want to do is... well, I think Fractality is still running here.
I can run VM Map. And again, that had the UTF-8 name in it, so I can just give a partial name with standard characters that I can type in on the terminal. I'm able to run this. So if we go up to the very top here-- let me clear this, try it again.
Up to the very top, you might ask, well, how do I know whether my process is running 64-bit or not? As a developer, you can run this tool, and it shows right here that this is a 64-bit process up the top. We also report the output format. We've got some large developers who actually use VM Map as a part of their standard development process to see when memory changes in major ways and then go hunt people down.
Why did you put that there? So then we see all the non-writable regions of the process. We can see that we start at the 4-gigabyte mark with the text of the application itself, then the link edit segment. And then in this Leopard preview, the libraries are actually loaded right after that. So we see OpenGL loaded right after the link edit segment of Fractality. And we see all the libraries that got loaded in. We have a few shared libraries on our system. We also see shared memory segments.
[Transcript missing]
Core Graphics areas, and we see dyld loaded up in the high memory space up here. Then by default, the writable regions are split out from the non-writable, but you can interleave them with a command-line argument if you would like. So the data segments are from the shared libraries, the data that comes in with them.
And we can see various malloc heap area, font support, the actual names of some map files that get brought in. So this really gives you some visibility into how memory is being used in your process. And down here we see where the stack space is. Okay, back to slides please.
So we haven't quite finished the job. There's some more memory analysis tools that we've got that are not 64-bit aware yet. So the ObjectAlloc application is a great way of looking at the dynamic memory use of an application. It can show the objects in the malloc heap, so core foundation and Cocoa objects, sorts them by object type. You can see which ones you've got a lot of, which ones are growing and shrinking.
You can look at specific ones and the retains and releases, so you can kind of see where memory leaks are occurring. There's also a set of command-line tools: leaks, heap, and malloc_history. When we ship Leopard GM, those will be available for 64-bit also.
So now let's switch gears and talk about the new tracing tools that we're bringing into Leopard: X-Ray and DTrace. X-Ray, as I mentioned before, is a way to let you peek beneath the covers of an application or your entire system to really get greater insights into what's going on.
We're adding some of the functionality of some of the other tools, like ObjectAlloc, into X-Ray. And this really leverages the power of an underlying facility called DTrace. So I'm sort of setting up two additional sessions that are coming up: we've got a full session on X-Ray tomorrow, and then a session on DTrace on Friday morning for kernel-level tracing. But there's so much to talk about here and with X-Ray that we need to get a start on it now.
So what is DTrace? It stands for dynamic tracing. This is open source technology that was ported from OpenSolaris; we've been working with the Sun engineers who created it. It lets you trace activity in either the whole system or a specific process. And I like to think of it as kind of a middle ground between a debugger and print statements.
So with a debugger, I can go in and put breakpoints in, I can put conditional breakpoints in, but it's a little bit laborious. I can compile print statements in, but then I need some way to turn them off for my shipping application. DTrace, though, was designed so you can dynamically turn it on on production systems.
With Solaris, they actually tell their server customers: if you're seeing slowdowns on your web server, turn on DTrace dynamically on the processes, see what's going on, and then turn it back off safely. This is a preview release, so we're still working a little bit on the "safe" part.
So, but if you're not using DTrace, there is no overhead. And the overhead when you are using it kind of depends on how many probe points you've enabled. A lot of the power of DTrace comes in from the D script language that filters and aggregates data at collection time. So you could be having a long running system that's collecting data for a month, and then you come back in and see aggregate statistics on that.
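To make that concrete, here is a minimal sketch of a D script that aggregates at collection time; this particular script is an illustration, not one from the session. Run it with "sudo dtrace -s syscalls.d" and press Ctrl-C to print the totals:

    /* syscalls.d: count every system call by process and call name */
    syscall:::entry
    {
        @calls[execname, probefunc] = count();
    }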
So how does X-Ray use DTrace? Well, X-Ray has a lot of instruments in it, and of what we're shipping in the Leopard preview, probably about two-thirds of them are based on DTrace. This lets us look at file system activity, and for all of these operations, we can get arguments and backtraces for where they're occurring.
So: where are my opens and closes, what are the file names and file descriptor numbers, what are the sizes of the reads and writes that are happening? Who's changing permissions on that particular file? Who's setting a lock on a file? When does this directory get created? Who's putting that file in /tmp? X-Ray uses DTrace to figure all this out.
It also has an instrument to look at shared memory regions. And we've added instruments for looking for messages to zombie objects in Objective-C, and to monitor when garbage collection happens with Objective-C. But DTrace, with that D script language, is extensible: there's a graphical instrument builder in X-Ray, which you can hear about tomorrow, that lets you build your own instruments using DTrace.
So let's look at the architecture. As I mentioned, X-Ray uses a set of instruments, or plug-ins. Some of them are resource-use plug-ins that look at things like CPU use and I/O activity on the system, kind of summarizing the information you would see in top, but also putting it in a graphical format like BigTop does. Then we've got some other plug-ins, like ObjectAlloc, that use a library.
So that's a library that gets inserted in. But then a good number of the instruments use DTrace. What they have is a specification of what the DTrace script should be. When you go to run with that instrument enabled, X-Ray dynamically generates the D script and passes it down to the /usr/sbin/dtrace command.
Now the dtrace command parses the script into a bytecode format, which it then passes into the DTrace engine in the kernel. The DTrace engine inserts all the probe points that you've specified and actually interprets the script in the context of the kernel, collects the data there, and passes it back to user space as it goes.
So X-Ray is using DTrace in the same way that Xcode uses GCC and GDB, leveraging the power of open source technology. But as with GCC and GDB, you can use DTrace standalone. So, for example, you could remotely log into a machine and run a DTrace script to see what's going on on the other side of the world. So that's kind of what I'll talk about here. So what is a D script? A D script is a set of clauses, each composed of up to three parts. There's a probe specifier that says: these are the points I'm interested in monitoring.
You can have an optional predicate that says under what conditions I want this probe to fire. That's part of the scalability of this: you don't have to have it fire and collect data every time a routine gets hit, but only the times you're interested in. And you can have actions, which provide much of the power here: commands for what to do when the probe fires. So there's a simple little script here on the slide, but I'll show you some more in detail as we do the demo.
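The script on the slide isn't legible in the transcript, but a comparable minimal example with all three parts might look like this (the TextEdit filter and the choice of read(2) are illustrative assumptions, not from the slide):

    /* reads.d: run with "sudo dtrace -s reads.d" */
    syscall::read:entry        /* probe specifier */
    /execname == "TextEdit"/   /* predicate: only fire for TextEdit */
    {
        /* action: histogram the requested read size (arg2 of read) */
        @sizes = quantize(arg2);
    }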
So I talked about how you specify a probe name; it's important to understand how probe names are specified. A probe name consists of four fields, which are, in the current language, separated by colons. The first field is a provider name that specifies what instrumentation mechanism this probe uses: basically a way of splitting the probes up into sets, kind of based on how they're implemented. Then a module name, which might be a library, like Core Foundation, or it could be an Objective-C class name, or a kernel module name.
The function name is pretty self-explanatory: this is the function I want to look at. And the name is kind of a semantic thing, like entry or return, for when the function gets hit or returns. So some examples here: if you specify dtrace:::BEGIN, or just BEGIN, that would be the first one that gets executed when your script starts up.
The next one, syscall:::entry, says "I want to probe on every system call." And that's actually quite doable. And for the PID provider, the third one here, you have to specify a process ID: say, for process number 759, I want to break on Core Foundation's CFRetain when we enter it.
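Written out, those three example probe specifiers look like this:

    dtrace:::BEGIN                         /* fires once, when the script starts */
    syscall:::entry                        /* entry to every system call */
    pid759:CoreFoundation:CFRetain:entry   /* CFRetain entry in process 759 */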
So what set of providers are there? If you want to look at a specific user-level process, there's the PID provider, where you specify a process ID. And because Objective-C and Cocoa are very important for us (you'll notice that the D language has colon-separated fields), we actually implemented a variant of the PID provider that we call the objc provider, which lets you specify a class name and method name rather than a library and function.
Then there are some system providers that we've got. So syscall: this works today, with both entry and return of system calls. Profile is timed events: you can specify "every 893 milliseconds I want to fire a probe and see what's going on." And then fbt is really interesting: it's function boundary tracing inside the kernel, so tell me every time I enter and return from a specific probe point in the kernel. Device driver writers, for example, might find that really interesting, to see how the kernel is interacting with their driver.
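As a sketch, one-liners against those last two providers might look like the following; the kernel function name is purely illustrative:

    # profile provider: a timed probe, here every 893 milliseconds
    sudo dtrace -n 'profile:::tick-893ms { printf("%s is on CPU", execname); }'

    # fbt provider: entry into one kernel function, with a kernel backtrace
    sudo dtrace -n 'fbt::vnode_open:entry { stack(); }'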
Now there's a lot to DTrace, and we don't have it all done yet. There are a number of providers that we don't have yet, for looking at kernel lock statistics or user-space locks; sysinfo, vminfo, scheduler information, and I/O. proc would be user space, for when fork and exec get called. But it's quite usable with what we've got today.
One additional thing that we have running in the labs, but don't have in the preview yet, is what's called user-land statically defined tracing, or USDT. This allows you to specify static probe points in your application code. There are various things out in existence that use these today; there will be a demo tomorrow morning in the Java 64-bit VM talk where they show the use of static probe points in Java. So you'll see it demoed, but it's not available on your preview DVDs right now.
So this is kind of that middle ground again. You can dynamically turn these probe points on. They're like the printfs that you put into your code, except you don't have to have #ifdefs around them. When you're not using them, they're just no-ops, so they're very lightweight.
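As a sketch of how USDT works on Solaris today (all names here are invented for illustration), you define the probes in a small provider file and generate a header from it:

    /* myapp_probes.d: a USDT provider definition; names are made up */
    provider myapp {
        probe request__start(int);
        probe request__done(int);
    };

    /* "dtrace -h -s myapp_probes.d" then generates a C header with
       MYAPP_REQUEST_START() macros to drop into your code; they compile
       to no-ops until a probe is enabled. */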
So let's see a demo. Alright, this will be the graphically flashy one. Okay, so the first basic thing we can do is list the probe points available to me on the system. I do that with the -l argument to the dtrace command: 18,000 probe points available. So if we look through, we see that I've already looked at PID 440, which happens to be the TextEdit that I'm running. We've got all the system calls here, pthread entry and exit, semaphores, etc., etc.
And then fbt is the functions within the kernel. So that's already a pretty big menu of choices. Let me clear the screen from that. And let's say that I wanted to look at what's available to me in Core Foundation. So again, I type in the process number I'm interested in, which is 440 for the TextEdit that I had already launched. So I can say: the PID provider, the Core Foundation library, all functions (so I'll just leave that field blank), and I want to see the entry points. Actually, let me restart TextEdit here.
I'll hide the Finder. And let's get the process number of TextEdit again: 477. So going back to the previous dtrace command, change that to 477. There. For some reason I'm not getting Core Foundation lines. Sorry, I need a -n; I'm specifying the name of a provider here.
[Transcript missing]
There we go, that's better. So these are all the entry points in Core Foundation. So there's a lot of flexibility here. But what can we do with that? Let's say I wanted to see all the graphics calls, all the draw calls, that are being made in Objective-C in that TextEdit run.
So that starts up and we see that this script has matched 350 different probe points. So when I bring text edit to the foreground, I'm starting to see probes dynamically firing the background. If I put it in background and bring it up again, more of them fire. So what we're seeing over here is the CPU number that the probe executed on.
The probe identifier, which isn't too useful to humans, really. And the probe name, then some information that we custom printed out. So a nice formatting of the probe point. If I hit Control-C, then I'm printing out a summary of what we had. So here we see that we had 80 calls to NSImage drawInRect, 42 calls to NSClipView drawsBackground. So how did we do that? So let's look at the script there.
So this script has two probe points, or two clauses. One, we see, is an Objective-C probe point, with $1 in the provider name. $1 is the first argument that I passed to the script, so that was the 477 process number. Then I'm saying: I don't care what class.
Tell me about all classes, and the draw method with any suffix on it; tell me when that gets entered. Then I'm doing a custom print statement. I'm also setting up an aggregation, which is basically an associative array that uses the probe module (so the class name) and the function name as keys, and keeps a count.
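A plausible reconstruction of that script, given this description, might be the following (477 being TextEdit's process number from the demo; the exact formatting is guesswork):

    /* draws.d: run with "sudo dtrace -q -s draws.d 477" */
    objc$1::draw*:entry
    {
        printf("-[%s %s]\n", probemod, probefunc);
        @draws[probemod, probefunc] = count();
    }

    dtrace:::END
    {
        printa("%@8u  -[%s %s]\n", @draws);
    }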
Then at the end, there's a default way that aggregations get printed, but I've done some formatting in an END probe so it shows up a little bit nicer. Okay, so let's look at another one: let's look at file opens. We only matched three probes on that one. So let's create a couple of files.
Do some things. And it's just collecting data. When I Control-C, we see that it's opened some files here. Again, this is a count of the files: we see that the process TextEdit has opened a preferences file ten times. So that's interesting. What we're doing here is breaking on the entry to the open system call and checking the executable name.
Then we get the file name, the pointer to it, and record that. When the open system call returns, if the file name has been set, we record the executable, the file name, and a count.
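A plausible reconstruction of this second script, under the same caveat:

    /* opens.d: run with "sudo dtrace -s opens.d" */
    syscall::open:entry
    /execname == "TextEdit"/
    {
        self->path = arg0;    /* remember the pathname pointer */
    }

    syscall::open:return
    /self->path/
    {
        /* copy the path in from user space and bump its count */
        @opens[execname, copyinstr(self->path)] = count();
        self->path = 0;
    }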
We can also get a little bit more information about why it's happening. Oops, I need the... In this case I've written this script to take an executable name and print a couple of things. What it gives me is the file descriptor that was returned, the file name, the time it took to open that file (191,000 nanoseconds here), and the full backtrace of where it happened. So this one's kind of interesting as to how it's working.
So again, breaking on open entry and return, but checking for a specific executable name, recording both the file name and the timestamp. And if the timestamp is non-zero, then we check: is it a file name that I'm interested in? So check whether the path name has the substring /Preferences/ in it. If so, then print out the information (arg1 is the return value from the open) and use the ustack function to print the backtrace.
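And a rough reconstruction of this third script; the substring test here uses D's strstr subroutine, though the original may have matched the path differently:

    /* prefopens.d: run with "sudo dtrace -s prefopens.d TextEdit";
       $$1 is the executable name passed on the command line */
    syscall::open:entry
    /execname == $$1/
    {
        self->path = arg0;
        self->ts   = timestamp;    /* nanoseconds */
    }

    syscall::open:return
    /self->ts && strstr(copyinstr(self->path), "/Preferences/") != NULL/
    {
        printf("fd %d  %s  %d ns\n", arg1, copyinstr(self->path),
            timestamp - self->ts);
        ustack();                  /* user-space backtrace */
    }

    syscall::open:return
    /self->ts/
    {
        self->path = 0;
        self->ts   = 0;
    }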
So, back to slides, please. What's our status with DTrace in the Leopard preview? The platform that we recommend you work with us on for now is a 32-bit Intel Macintosh. That's the one that currently supports the PID and Objective-C providers, based on the code that we've ported from OpenSolaris; that was significantly easier for us.
We'll have to do some real work to support PowerPC. And because of this, X-Ray works best on that platform also. And I need to tell you about ustack. You saw a symbolicated backtrace there on my 32-bit Intel machine. In the Leopard preview that you have, unfortunately, DTrace doesn't give you symbols, but X-Ray's use of it does give you symbols. So this is basically showing you what you can do with DTrace under the covers of X-Ray.
But be careful with it on your PowerBooks, because if you call ustack(), it will hang immediately on a G4. We've fixed that in the labs. There are a lot of Solaris-specific concepts that we haven't yet mapped over to Mac OS X, and as you saw, there are some providers that we don't support.
To get more information about DTrace, the best sources are Sun's BigAdmin DTrace website and the 400-page Solaris Dynamic Tracing Guide. What we've found easiest is to download the PDF and use Preview's search capability to search through that manual. Unfortunately, it's not all relevant for our platform at this point, so we'll be working with Sun on that.
So we've covered a lot in this session. To summarize, we've got some powerful tools for you on Mac OS X. Shark keeps getting better and better, more powerful all the time, to help profile the hot spots, and now with the windowed time facility you can really kind of start it and forget about it until your problem occurs. With X-Ray, you can understand your application and system behavior, and it uses DTrace, as you've seen.
Again, there's an X-Ray session tomorrow, the Java 64-bit VM session will discuss it, and the DTrace session is on Friday. We've added 64-bit support, and we're continuing to expand on that. We're trying to put a lot of effort in to help you create great applications. So we'd really like your feedback about what works for you, what doesn't work, what additional tools you'd like to see, etc.