
WWDC03 • Session 103

vImage

Core OS • 1:01:49

View this session to learn about Apple's Vector-accelerated Image Processing Library.

Speakers: Craig Keithley, Robert Murley, Ian Ollmann, Eric Miller, Ali Sazegari

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.

Good morning and welcome. As we were setting up for the conference, I had in the database that this was Keithley's secret session number one. I had a number of people ask me if I had a second secret session and, well, no, I wasn't so lucky. The reason why this was secret is because we really wanted to make a dramatic statement about some new features in our acceleration libraries. What we'll be talking about today will be techniques that you can use to accelerate image processing using vImage as part of the Accelerate library. So, to do that, let me bring up Robert Murley.

Thank you, Craig. I'm very happy to be here this morning to introduce to all of you a major new technology from Apple, our vector-accelerated image processing library called vImage. Before I jump into that demonstration, though, there's another message that I would like to make right here at the outset.

All of the vector libraries starting with Panther are going to be contained in a new, high-performance, state-of-the-art computing framework called Accelerate. Accelerate contains not only vImage, the major subject of this talk today, but also all of the vector libraries that have been available previously in OS versions in the framework called vecLib: the Digital Signal Processing Library, the Basic Linear Algebra Subroutines, LAPACK, the math libraries, and the BigNum library. So I wanted to make sure that that was clear before we went on.
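
To make that consolidation concrete, here is a minimal sketch, assuming a Panther-era C project: one umbrella header and one framework link give you vImage, vDSP, BLAS/LAPACK, and the rest. The file name and build line are illustrative assumptions, not from the session.

    /* A minimal sketch: the Accelerate umbrella header covers vImage, vDSP,
       BLAS/LAPACK, and the other vecLib pieces described above.
       Assumed build line: cc demo.c -framework Accelerate -o demo           */
    #include <Accelerate/Accelerate.h>

    int main(void) {
        float samples[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float mean = 0.0f;
        /* vDSP is one of the vecLib libraries rolled into Accelerate */
        vDSP_meanv(samples, 1, &mean, 8);
        return (int)mean;
    }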

Now, today we're getting into the main part of the presentation here. What you'll learn about vImage, well, that may be a little optimistic; what I'm going to try to convey about vImage, anyway: functionality, data structures and API, at least some examples of them. It's much too extensive to go over completely.

Some of the features that are not included in the first two subjects. Then I'm going to bring up one of my colleagues in the vector and numerics group, Ian Ollmann, who will talk about implementation techniques and performance. And then finally, although it's not on this slide, there will be a section at the end by Eric Miller of the Architecture and Performance Group with an overview of the CHUD tools, the performance tools.

So what is in the functionality of the vImage library? It can be broadly grouped into these main topics listed here. I'm going to talk a little bit about each one of them, so I won't belabor it at this point. But at this time, I'd like to jump into the first demo of the morning.

What I want to show you is an image processing function called inversion, which is a very simple technique where each pixel of an image, in fact each color component of each pixel, the complement of that value is taken. So what you see here is a picture of a bed and breakfast in Portugal that I happened to stay at earlier this year.

And what I'm going to do is perform an inverse operation on it. And what you see is essentially the photographic negative. Now the functions that are comprised in Vimage run quite a gamut in complexity from quite simple to quite complex. This one is way over on the simple side. It's very easy to do.
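
To make that arithmetic concrete, here is a hedged sketch of the inversion done by hand on an 8-bit planar buffer: every component is replaced by its complement, 255 minus the value. This is plain C over a caller-owned buffer, not a specific vImage call.

    #include <stddef.h>

    /* Invert one 8-bit plane in place: each component becomes 255 - value.
       rowBytes may be larger than width if rows are padded.                 */
    static void invert_planar8(unsigned char *data, size_t height,
                               size_t width, size_t rowBytes) {
        for (size_t y = 0; y < height; y++) {
            unsigned char *row = data + y * rowBytes;
            for (size_t x = 0; x < width; x++)
                row[x] = (unsigned char)(255 - row[x]);
        }
    }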

The first subject, first major category was convolution. I want to say a word about area processes. In image processing, area processes are processes that use both the source pixel and other pixels nearby to generate the destination pixel. Both convolution and morphology, the first two examples I'm going to give you, are examples of area processes.

Convolution, in particular, creates an output pixel by taking a weighted sum of pixels nearby the input or source pixel. And that weight, and therefore the effect of the process, is determined by a matrix called a convolution kernel.
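
As a concrete illustration of that weighted sum, here is a hedged scalar sketch of a 3x3 convolution for one interior pixel; this is not the vImage implementation, just the arithmetic it describes.

    /* Weighted sum of the 3x3 neighborhood around (x, y), divided by the
       kernel's divisor so brightness stays in range. Interior pixels only;
       edge handling is discussed later in the session.                      */
    static unsigned char convolve3x3_pixel(const unsigned char *src,
                                           long rowBytes, long x, long y,
                                           const int kernel[3][3],
                                           int divisor) {
        int sum = 0;
        for (int ky = -1; ky <= 1; ky++)
            for (int kx = -1; kx <= 1; kx++)
                sum += kernel[ky + 1][kx + 1] *
                       src[(y + ky) * rowBytes + (x + kx)];
        sum /= divisor;
        if (sum < 0)   sum = 0;    /* clamp to the 8-bit range */
        if (sum > 255) sum = 255;
        return (unsigned char)sum;
    }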

So, um, I want to give you an example of how convolution works. What I have here is a fairly blurry and kind of low intensity image of downtown Lisbon. And I want to emphasize that this was probably poor equipment and not the photographer's skill that was involved here. But anyway, I'm going to operate on this with a 5x5 sharpening kernel. And what the result of that is, is this.

A lot of the blurriness is gone. Features are much sharper than they were. I'll go back. That's the blurry part and the sharpening part. And you can see the kernel there is 5x5. It's not that complex. So all of convolution operates the same way. It's all matrix multiplication of areas of pixels.

And the effect simply depends on what the matrix is. Another example, this edifice. I'm going to use a rather extreme edge definition process or edge detection process and produce an embossed image. This is, as you can see, an extremely simple convolution kernel, 3 by 3, to get a pretty dramatic effect.

The second topic I'd like to talk about, the second major topic, is morphology. Morphology in general adjusts the shape of objects in the image to conform more closely to the shape of a probe. And the probe is defined also in a matrix called a morphology matrix. In practice it can be used to take objects, small or large objects, in an image and lighten them or darken them, make them larger or smaller. It can be used to alter their shape, to remove fine details while preserving larger objects, and so forth.

So to demonstrate this, I have here a couple of very simple images, a small circle and a large circle. I'm going to do a morphology process on these two images using a probe in the shape of a right triangle. And it's a fairly large matrix, about the size of the smaller circle.

So the result of that operation is this. As you can see, both the circles take on a more typical shape. They have a more triangular characteristic. And in fact, in the smaller circle, the center circle completely disappears because of the size of the kernel.

Secondly, I'm going to take this image, well that was called a dilate operation by the way. I'm going to now do what's called an erode operation. I'm going to take a circular filter about the same size as the kernel here, operate on these two images, and I get this result.

So you can see that in the case of the smaller circle, all the circularity is gone; it's turned completely into a triangle. And the larger circle has taken on some triangular structure and lost a lot of its circularity. So there are a couple of examples of morphology in action.

The class of functions in geometry is pretty much self-explanatory. They perform some sort of a geometric operation on the image, transform it, make it larger, smaller, reflect it, whatever. For an example of geometry, I'm going to take this picture of a jay, and I'm going to transpose it or translate it, make it bigger, but only in the longitudinal direction. And that results in this image.

Secondly, I'm going to go back to the original picture and do a shearing operation off to the right side. and that results in this image. And is it my imagination or is that bird getting more irritated with each picture? Maybe I've just been looking at them a little bit too long.

Histogram operations are those that use an intensity distribution histogram of the image to perform some function. The example I'm going to use is histogram equalization, a process whereby an image with a poor, non-uniform intensity distribution is modified so that intensities are distributed more evenly. So I'm going to go back to this bed and breakfast here in Portugal and perform this equalization operation on it, and it results in this. Now what you can see is there's a great deal more detail visible here in this image than in the original.

Notice in particular the weather stains below the windowsills on the second floor. They were virtually undetectable in the original image. So this, the equalization operation brings out a lot of detail that was absent in the original. This is probably a lot closer to the way that building really looked, would be my guess.

Here's an example, or rather the before and after histograms of the intensity distribution. I've added all three color channels into each bar to simplify it, although in practice the operation is done on each color component separately. But you can see in the before image there's a lot of white with some starkly contrasted black, and in the after image it's much more uniform, a lot of different grays.

So I could go on quite a long time actually about functionality, but we do have a limited amount of time, so I want to proceed on to some examples of data structures and API. First I want to talk about data types and layouts that we support in our initial incarnation of vImage. There are two different data types supported. One is the 8-bit integer per color component or per channel; I'll use those terms interchangeably. And the second is a 32-bit floating-point value per color component or channel. We also support two different data layouts.

One is the planar layout whereby each channel is in its own array. And if I'm using an RGB image as an example, that simply means that the reds, the greens, and the blues are all in their separate buffers. And if you're calling a convolution, or excuse me, an image process to perform some function on this image, you would need to call it three times for all three color components. The advantage though is if you don't want to do the process on all three channels, you can do them on only one or two as you wish.

The second layout is what we call the ARGB interleaved layout, where all the color channels are interleaved into a single buffer. We support at the current time a four-channel interleaved layout, which can be either four 8-bit integers or four 32-bit floating-point values.
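
A hedged sketch of what the two layouts look like in memory, using four made-up pixel values for illustration:

    /* Planar layout: each channel lives in its own array.                   */
    unsigned char red[4]   = { 10, 20, 30, 40 };
    unsigned char green[4] = { 11, 21, 31, 41 };
    unsigned char blue[4]  = { 12, 22, 32, 42 };

    /* ARGB interleaved layout: the A, R, G, B components of each pixel are
       packed together in one buffer (8-bit case shown; the float case uses
       one 32-bit float per component instead of one byte).                  */
    unsigned char argb[16] = {
        255, 10, 11, 12,   255, 20, 21, 22,
        255, 30, 31, 32,   255, 40, 41, 42,
    };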

[Transcript missing]

We do supply data conversion utilities to go between the different data layouts and different data types.

And now I'd like to go on to what is probably the single most important data structure, almost the only public data structure we have in vImage, the vImage buffer. As you can see, it's a very simple data structure, only four elements. There's a pointer to the start of the data, which would be the upper left-hand corner of the image, a height in number of pixels, a width in number of pixels, and then a rowBytes, which is the number of bytes from one row to another, or the stride from one row to the other.

Pictorially, if the name of the vImage buffer is image, then we have image.data at the upper left-hand corner. You have the height, the width, and if you imagine that that white space to the right of the image is extra memory that's not used in the image, but just sitting there at the end of the row, then you can see that the rowBytes parameter includes that length in the stride. That comes in handy if you want to do 16-byte alignment on each row, for example. That's not a requirement of vImage, but certainly may be helpful in your own work.
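
A hedged sketch of filling in that four-field structure by hand, padding each row out to a 16-byte multiple as suggested above; the exact field types in the Panther headers are an assumption here.

    #include <stdlib.h>
    #include <Accelerate/Accelerate.h>

    /* vImage_Buffer has just the four fields described above: data, height,
       width, rowBytes. Rows are rounded up to a 16-byte multiple, which is
       helpful but not required.                                             */
    static vImage_Buffer make_planar8_buffer(unsigned long width,
                                             unsigned long height) {
        vImage_Buffer buf;
        buf.width    = width;
        buf.height   = height;
        buf.rowBytes = (width + 15) & ~15UL;   /* round up to 16 bytes */
        buf.data     = malloc(buf.rowBytes * height);
        return buf;
    }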

[Transcript missing]

This allows you to do tiling, if you want to do that, to take advantage of caching, although we will also do that for you if you wish. It has quite a number of other advantages as well. So here's an example of equalization, an example of a very simple call to vImage: image equalization.

There are only three parameters; it couldn't get too much more simple. The vImage buffer for the source, the vImage buffer for the destination, and then the flags word, and the information in the flags word varies with each function. You notice that you don't have to specify what the data layout or the data type is because that's implicit in the name of the function, in this case Planar8. So every function has four different variants: planar 8-bit, planar float, interleaved 8-bit, and interleaved float.
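
A hedged sketch of that three-parameter call on 8-bit planar data; the function and flag names are my reading of the vImage headers, and error handling is trimmed for brevity.

    #include <Accelerate/Accelerate.h>

    /* Equalize one 8-bit plane: source buffer, destination buffer, flags.
       The layout and type are implicit in the function name, as described. */
    vImage_Error equalize(const vImage_Buffer *src, const vImage_Buffer *dst) {
        return vImageEqualization_Planar8(src, dst, kvImageNoFlags);
    }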

There are some functions that do require vImage to know both the full image buffer and the region of interest. And these are the functions that I mentioned earlier, the ones I referred to as area processes. The components are all shown here. You have the full image buffer, the source ROI, which may or may not be smaller than the full image buffer, a convolution kernel, a matrix shown by the yellow rectangle, and then the destination buffer, the result. I'd like to go into the relationship between these things a little bit more. So this is a further discussion of image buffers and regions of interest. Alright, I think we all know what the full image buffer is.

In a call to an area function, morphology or convolution, the region of interest is not specified by a second vImage buffer, but rather simply by X and Y offsets from the beginning of the full buffer. So, as you can see here, you would indicate the upper left-hand corner of the region of interest by an X and Y offset from the upper left-hand corner of the full buffer. The rowBytes is the same in both cases.

You also pass a vImage buffer indicating the destination, which has a height and a width and independent row bytes. And notice that we have not specified as yet the source region of interest's height and width. And that's for a simple reason: it has to be the same height and width as the destination, so we simply take it from there.

This is an example of one of these function calls, convolution. You have the source and destination image buffers, the offsets to the region of interest, and then some other information defining the kernel and a few other things that we need to know. So this is probably one of the more complicated calls that you're going to run into.
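
A hedged sketch of such a call on 8-bit planar data; the function signature and parameter order are my reading of the vImage headers. Note that the ROI is given only by its X/Y offsets, and its height and width are taken from the destination buffer.

    #include <stddef.h>
    #include <Accelerate/Accelerate.h>

    /* 3x3 sharpening convolution over a region of interest. The ROI's size
       comes from dst; only its top-left offset is passed here.              */
    vImage_Error sharpen_roi(const vImage_Buffer *srcFull,
                             const vImage_Buffer *dst,
                             vImagePixelCount roiX, vImagePixelCount roiY) {
        static const int16_t kernel[9] = {
             0, -1,  0,
            -1,  5, -1,
             0, -1,  0,
        };
        return vImageConvolve_Planar8(srcFull, dst, NULL /* temp buffer */,
                                      roiX, roiY,
                                      kernel, 3, 3, 1 /* divisor */,
                                      0 /* background color */,
                                      kvImageEdgeExtend);
    }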

We have three computational cases that we need to worry about when we're doing these calculations. The first one is fairly simple, and to explain this, just keep in mind the four different elements that I'm talking about here. The full image buffer, the region of interest, the convolution kernel, which is simply a matrix, and then in this image the source pixel shown by the tiny red rectangle there.

So if we are going to calculate a destination rectangle from that source pixel, we need to do a matrix multiplication of the pixels in the region of the source pixel as shown there. The first case is very simple because the entire matrix is contained within the region of interest, so there's no issue about where the data comes from.

The second case is a little bit more complicated. What happens if the computational matrix extends out beyond the region of interest? And this is exactly why we need to know what the full image buffer is, because if it still remains in the full image buffer, then we can use that data without further concern.

The third case is the more complex case. What if the computational matrix goes even beyond the full image buffer? In that case, we have to do something to substitute for the pixels that are missing.

So we have an edge case problem, and we supply you in this instance with three different options to deal with these edge cases: background color, edge extend, and copy in place. And to demonstrate these three, I'm going to start with this as an original image. All the lines between the different colors are clean and smooth, and the edges are clean.

And I'm going to do a blurring operation on it. And the first time I'm going to do this, I'm going to specify that for the edges, the color to use is black. If we don't have a pixel in the computation, we'll use a black pixel. So the result of that comes out like this.

You can see that the colors merge together on the edges, and on the outside of the image, it just fades off into black gradually. The other extreme of that is a background color of white, which ends up looking like this. With a black background, you can see quite a difference there.

The second case, so that was the background color, the first option that we give you. The second option we give you is edge extend, which means that we take the pixels at the outside border of the image and just extend them out, copy them out as far as we need to, to perform the operation. So the result of that blurring operation is this, and as you would expect, you really don't see any change when you get to the edge of the image, it just continues on as it does in the beginning, or in the middle.

The third case is copy in place. And what we are saying there is that if we don't have all the data we need to do the computation at any point, then we won't do it. We'll just copy the source pixel to the destination pixel and be done with it.

And this is what that looks like. You have to concentrate on the edges of the image and you can see that towards the edges there is no blurring effect. Once the computational matrix goes off the edge, we just do a copy from the source. So those are the various options that we give you to handle the edge cases.
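
In API terms, the three options correspond to flag choices on the same call; a hedged sketch follows, with the flag names being my reading of the vImage headers.

    #include <Accelerate/Accelerate.h>

    /* The three edge-handling options described above, expressed as flags.
       Only one of these is passed per call.                                 */
    enum EdgeOption { EDGE_BACKGROUND, EDGE_EXTEND, EDGE_COPY_IN_PLACE };

    vImage_Flags edge_flag(enum EdgeOption opt) {
        switch (opt) {
        case EDGE_BACKGROUND:
            return kvImageBackgroundColorFill; /* use the supplied color     */
        case EDGE_EXTEND:
            return kvImageEdgeExtend;          /* repeat border pixels out   */
        case EDGE_COPY_IN_PLACE:
            return kvImageCopyInPlace;         /* copy source where the
                                                  kernel hangs off the edge  */
        }
        return kvImageNoFlags;
    }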

A couple of features that I haven't yet mentioned, or maybe I have. All of the Apple libraries, the vector-accelerated libraries, are optimized for all Apple processors. So if you are, for example, running on a G3, if the host system is a G3, then a form of any given routine that is not vectorized but still highly optimized for scalar will be chosen. If you're running on a G4 or a G5, then an appropriately optimized vectorized version will be chosen. This is all done transparently to you, the caller.

Our library, vImage in particular here, is multiprocessor safe. I should also mention that it's interrupt safe if you take some precautions to make it interrupt safe. There are a lot of routines in vImage that do call malloc to allocate memory. However, if you don't want it to do that, we do give you the option to supply your own memory. The calls that need memory also have an auxiliary call that returns to you the minimum buffer size that we will need to do the operation. So you can call that, allocate your own memory, and then there will be no system calls during the course of the operation.
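
A hedged sketch of that pattern. In the vImage headers as I understand them, passing the kvImageGetTempBufferSize flag makes the call return the required scratch size instead of doing any work; whether Panther exposed exactly this flag or a separate size-query call is an assumption here.

    #include <stdlib.h>
    #include <Accelerate/Accelerate.h>

    /* Ask the convolution how much scratch memory it needs, allocate it
       once yourself, then pass it in so no allocation happens during the
       real call.                                                            */
    vImage_Error convolve_no_malloc(const vImage_Buffer *src,
                                    const vImage_Buffer *dst,
                                    const int16_t *kernel3x3) {
        vImage_Error size = vImageConvolve_Planar8(src, dst, NULL, 0, 0,
                                                   kernel3x3, 3, 3, 1, 0,
                                                   kvImageGetTempBufferSize);
        if (size < 0) return size;              /* a real error code         */

        void *temp = malloc((size_t)size);
        vImage_Error err = vImageConvolve_Planar8(src, dst, temp, 0, 0,
                                                  kernel3x3, 3, 3, 1, 0,
                                                  kvImageEdgeExtend);
        free(temp);
        return err;
    }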

vImage is a standard part of Panther. The data structures are unencapsulated, simple and flexible, and unlike a competitor or two I could name, but won't, there are no license fees. Okay, so that completes my portion of the talk. I'd like to bring up my colleague, Ian Ollmann, who will talk about implementation techniques and performance.

Thank you. I wanted to touch on two subjects: mostly what you can do to use vImage most effectively in your app to get the best possible performance out of it, and then, just for your own curiosity, some of the things we did to tune the functions that you can get through the vImage subframework under the Accelerate framework.

So, a couple of things that you can focus on that we'll touch on. There are some alignment, memory alignment things that you can do. We don't require that you do anything in particular, but some things help, so I'll mention them. I'll briefly talk about tiling and then also some multiprocessing and real-time considerations.

[Transcript missing]

Tiling, of course, is a commonly used technique in image processing. Basic approach is you divide up your image into smaller segments which are cache-sized, and this allows you to operate on segments and keep them in the caches while you're working on them. So, for example, if you had several chain filters you wanted to do in series, rather than apply one to the whole image, then do the next filter to the whole image, then the third filter to the whole image, you could pick a small subset of the image, do all three to that, and that means that for the second and third filters you'd be very likely to have the pixels already in the caches, so you'd be less likely to pay any penalty for going out to DRAM to get them.

So a few tips on how to do that. We've found that tiling is only helpful some of the time, not all the time. So don't waste your time if it isn't. And we found it's very easy to simply assay to find out whether or not tiling is going to work for you by just pushing a small image through your code as it is, unoptimized, and then push a big one through. Take a look at how many pixels per second you're able to calculate in each case. If there's a big difference, then maybe tiling will pay off for you, and it's worth the time to go through it.

In our experience, we found that tile sizes that roughly fit in the L1 cache, which are probably about 16K to 32K, work best. Wide is better than tall or square, and it can be very wide. We found cases where a tile only 16 pixels high, but 1,024 wide, is the optimal case. We also do some tiling in some of our functions.

If you're going to do your own tiling, in certain cases we imagine, although we haven't found any examples of it, that these two things could interact adversely, so we provided you with a flag you can pass, kvImageDoNotTile, which basically tells us not to tile; you're going to do it yourself.
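
A hedged sketch of that kind of hand tiling on an 8-bit planar image: walk the image in wide, short horizontal strips, run your whole filter chain on each strip while it is still in cache, and pass kvImageDoNotTile so the library does not tile underneath you. The callback type is a hypothetical stand-in for your own filter chain.

    #include <Accelerate/Accelerate.h>

    /* Process an image in 16-row strips so each strip stays in the L1
       cache while a chain of filters runs over it.                          */
    typedef void (*FilterChain)(const vImage_Buffer *tile, vImage_Flags flags);

    void process_in_strips(const vImage_Buffer *image, FilterChain apply_filters) {
        const vImagePixelCount stripRows = 16;      /* short and very wide   */
        for (vImagePixelCount y = 0; y < image->height; y += stripRows) {
            vImage_Buffer tile;
            tile.data     = (char *)image->data + y * image->rowBytes;
            tile.width    = image->width;
            tile.height   = (y + stripRows <= image->height)
                              ? stripRows : image->height - y;
            tile.rowBytes = image->rowBytes;        /* stride is unchanged   */
            apply_filters(&tile, kvImageDoNotTile);
        }
    }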

Another thing you can do is take advantage of our planar data format. Originally we were thinking of only providing planar, but we had so many requests for ARGB that it's a feature. However, there are many drawbacks to ARGB, and if you use planar data formats you can get around them.

First of all, for ARGB you may not wish to operate on the alpha channel, so it's 25% or 33% more work to use an ARGB format in that case compared to just operating on the three color channels. So going with planar would allow you to just do the work that you need to do and skip over the other stuff and touch less memory as well.

Another nice thing about planar is that it's kind of a limited form of tiling in the sense that you've now split up your image into three smaller or four smaller parts. So in certain cases this may allow you to exist entirely in the cache rather than half in and half out. So that would allow you to push through several filters with just red, for example, and then move on to just green and do pretty well.

One of the problems with geometric tiling, which is what I presented in the previous slide, is that if you've got something with a kernel matrix that needs to be applied, where for each pixel you need to look at all the pixels around it, that can make tiling a little bit tricky.

And then finally, a bit of an implementation detail, a lot of our ARGB code will take the ARGB interleave format, convert it into planar, do the work, convert it back, and then give you the result. All that happens in register, so it's pretty fast, but it's nicer not to have to do it at all. So if you use planar data, you probably will get somewhat better performance. we often see the difference is about 30%.
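
A hedged sketch of splitting an interleaved ARGB image into planes so the filters only touch the three color channels; the converter name is my assumption about the data-conversion utilities mentioned earlier in the talk.

    #include <Accelerate/Accelerate.h>

    /* Split ARGB8888 into four planes once, filter only R, G, and B, and
       leave the alpha plane untouched.                                      */
    vImage_Error split_and_filter(const vImage_Buffer *argb,
                                  const vImage_Buffer *a,
                                  const vImage_Buffer *r,
                                  const vImage_Buffer *g,
                                  const vImage_Buffer *b) {
        vImage_Error err = vImageConvert_ARGB8888toPlanar8(argb, a, r, g, b,
                                                           kvImageNoFlags);
        if (err != kvImageNoError) return err;

        /* ...run the planar filters on r, g, b here; alpha is skipped... */
        return kvImageNoError;
    }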

[Transcript missing]

So in practice, in order to achieve that, the simple things: you can unroll loops. We aren't doing that to get rid of the loop overhead; we're doing that to make sure that we have 8 or 12 or 50 or however many parallel calculations going on concurrently so we can keep the processor full. We identify and eliminate compiler aliasing. So if you have pointers pointing to buffers, the compiler might not know how these overlap, and it might decide to keep the loads and stores in strict order: load, do operation, store, load, do operation, store.

We want to eliminate LSU bottlenecks. A lot of code just spends all of its time loading data in and out of register, so we look for ways to merge many small operations into a few big ones. That way we can spend most of our time actually doing work. If you have certain instructions that take a lot of time, six, eight, ten cycles to get through, then we try to find enough work to keep us busy while we wait for that to happen. We avoid branching like the plague.

So we use a lot of selects and other kinds of things to make sure that our code flies in a straight line. As I mentioned earlier, we try to keep all the execution units busy at the same time. So if we're busy doing something in the floating point unit, this might be a good time to also be loading data for the next loop. So we schedule things pretty aggressively.
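
The same unrolling and aliasing ideas apply to your own inner loops. Here is a hedged sketch in plain C99, not vImage code: restrict promises the compiler the buffers never overlap, and the 4x unroll keeps several independent operations in flight. Saturation is ignored for brevity.

    #include <stddef.h>

    /* restrict lets the compiler reorder loads and stores freely; the
       unrolled body gives it four independent adds per iteration.           */
    void brighten(unsigned char *restrict dst, const unsigned char *restrict src,
                  size_t count, unsigned char bias) {
        size_t i = 0;
        for (; i + 4 <= count; i += 4) {
            dst[i + 0] = (unsigned char)(src[i + 0] + bias);
            dst[i + 1] = (unsigned char)(src[i + 1] + bias);
            dst[i + 2] = (unsigned char)(src[i + 2] + bias);
            dst[i + 3] = (unsigned char)(src[i + 3] + bias);
        }
        for (; i < count; i++)                  /* leftover pixels            */
            dst[i] = (unsigned char)(src[i] + bias);
    }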

And finally, we prefetch our data, just to make sure it's in the cache when we need it so we don't have to take a long stall waiting for data to appear out of DRAM. Insofar as our tiling goes, we only did it for some functions because we found only some functions benefited.

Generally, what we did was we first took a look at the experiment I suggested earlier: run a small image and a big one and see whether there is some improvement for doing smaller images. We also took a look at different tile shapes. So here you see a graph where I've taken a 3x3 kernel and a 21x21 kernel for the same function.

And looked at how much time it takes for different tile widths. The tiles are all the same size; it's just we widen them and shrink them vertically at the same time. So you can see that there is some advantage to a particular tile width in this case. 1,024 to 2,048 bytes is probably the optimal case. So that's what we choose.

And then of course we tune these things per processor. We actually end up running this experiment several times to make sure that the tile sizes we pick for G3 are optimal for G3 and the ones we pick for G4 and G5 are optimal there. Finally, just to stress, like everything else in the Accelerate framework, we vectorize.

So our intent is to use the Velocity Engine across the board, everywhere we can. So you'll see that in the final product, we're going to have AltiVec pretty much everywhere. The only exception is going to be histogram, which is a class of functions that just doesn't work very well with the vector unit. Typical speedups we see over scalar code are 4 to 10 times. If you haven't tried vectorization, I suggest you do.

That doesn't mean that our scalar code is any slouch. We make sure that runs as fast as possible, too. And in a couple of cases, such as our sampling filters, we use the extra speed to deliver a lot better image quality. So hopefully you'll like that. You've got this as a beta release, so I'm not quite finished with every bit of vectorization I'd like to do, but certainly we're working hard at it. Finally, experimentally driven optimization: we never guess.

If we find we are guessing, we try to figure out how to run the right experiment to find out what's actually going on. So obviously, always profile. I'm sure you've heard that before. You can use tools like gprof and sampler, but those only give you function-level information. It'll only tell you which function is performing slowly.

It won't tell you why, or what part of it, or what instruction in particular is getting a stall. So actually, most of our work is done using CHUD, or Shark, which they're going to talk about later on today.

We also use CPU simulators like SIMG4. And so these things can be used to actually narrow in and directly tell whether we're running into cache misses or paging or any of a number of other problems, which historically have been very hard to diagnose; you're just kind of guessing what's going on. But we don't. We zero in on a problem and solve that. And that lets us very efficiently get to the high-performance code.

So, finally, if you aren't already, I'll urge you to inspect the compiler output for functions that really make a difference, since we are almost always surprised by some of the mistakes we make. So with that, I'll introduce Eric Miller from the Architecture Performance Group to come up here and tell you a little bit about CHUD, which is the tool that we use to tune our code.

Good morning, I'm Eric Miller with the Architecture Performance Group. As Ian said, the CHUD tools are one of his favorite toys, and I'm glad he put it on the list, although I would like to see him reverse the order and put it above GPROF and Sampler, but that's just me.

So, what are CHUD tools? Well, they're a suite of performance analysis tools. There are several that are interesting; probably the most interesting we'll get to in a minute. But the idea behind them is that they give you low-level access to the performance monitor hardware counters in the processor and the memory controller, and then we have implemented some software versions in the operating system that behave exactly the same as the hardware performance monitor counters.

The idea is to help you find problems and improve your code performance. And the best part is they're freely available on the web. And they're also on the Developer Tools CD. One of the neat things about the CD this year is that you'll be able to install the tools, and then immediately we have a CHUD updater which is very similar to Software Update.

[Transcript missing]

So we generally will put out a release every week at least during the beta period and probably slightly reduce the frequency later once we have it gold master. So there are three main tools. The first tool is a profiling tool called Shark, which Ian alluded to. It is an instruction level profiler.

It can do many things that we'll get to in a minute. And not the least of which is, Ian mentioned that you can inspect your

[Transcript missing]

Saturn is a call-graph visualizer, as it says. The idea there is, it's kind of like using gprof. It goes through and actually instruments all your application code and then produces the results of how often the functions get called. But you can also have auxiliary information with regard to performance monitor counts.

We also have several tracing tools. Amber, which actually when you run Amber can collect every single instruction that is executed on behalf of your application on the processor and put that into a file. And then those files will be consumed by ACID, which is a tool that we wrote in our group, and by SIMG4, which is produced by Motorola, and SIMG5, which will be produced by IBM. Those are cycle-accurate CPU simulators. And of course Ian and his team use the SIMG4 product quite readily.

The other thing you can do with the CHUD tools is instrument your applications. And along with that, I'm running out of dots on the slide, but you can also create your own application performance analysis tools using the CHUD framework because that's the exact framework that we developed in order to create Shark, Monster, and Saturn.

So, I mentioned performance counters several times. What are they? Well, they're a series of dedicated special-purpose registers, actually in the processor and in the memory controllers that we create, in the G3, G4, and G5 systems. So what we can do with those is set them up to count and record what we call performance events.

Things like the number of L1 cache misses or L2 cache misses or L3 cache misses or instruction counts, instruction misses, execution stalls, page faults in the operating system. There are a plethora of events. In fact, on G4, you have something in the order of maybe 200 events you can measure. On G5, there are literally thousands of events that can be measured.

So we use the CHUD tools, and in particular the CHUD framework, to configure and control all the PMCs. So, I'm not going to do any demos this morning because we're pretty short on time, but I just wanted to mention Shark because all you do to use Shark is push the start button and it will profile the entire system.

It defaults to a time profile, and what that will give you, in your application, when you select it from the list of profiled threads or processes, is where in your application, in relation to your source code, you spent your time; it will highlight it for you and show you, this is where you spent your time.

If you do an event profile, suppose you selected CPU cycles, Shark can tell you exactly how many cycles were spent in your code for a particular line of code. And Shark captures every single thread on the system, the driver, any drivers or kernel extensions, the kernel itself, and all the applications that are running at any given time.

The best thing about Shark is it's very low overhead, as are all the CHUD tools. You can actually set the time profile down to a minimum of about 50 microseconds per time sample, which is a couple of orders of magnitude smaller than you can use with Sampler.

It also gives you an automated analysis, which will show up as this column of exclamation points beside your code. So we annotate your source code, you click on these annotations, and it will tell you things like: this loop has a non-changing variable and it's serialized, so you might want to move that variable out of the loop, or this loop is a good candidate for AltiVec or parallelization because there aren't any data dependencies.

We do static analysis, and this can lead to the surprises that Ian mentioned from the compiler. It actually will show you the disassembly that the compiler generated on your behalf and annotate that as to how many stalls you'll have, how many delays might be involved from other aspects.

New features this year from, well, let me say this, Shark was formerly called Shikari in the Chud Tools from last year. So it's been renamed Shark with a lot of new features. One of the features is that you can now save and review all the sessions that you collect for later analysis. And there's also a command line version that you can use to instrument with scripts and things.

So we use this command line version of Shark whenever we have an old Unix scientific application that just runs in the command line. And it has a launch script. We can just script Shark to begin and then run our command line application as normal and then script Shark to end.

Here's a little screenshot of Shark. And what you can see in the left-hand picture would be the result of actually a time profile. And in this particular picture, we were running a test and it turned out that the square root function was 42% of the time. At the bottom of that left-hand picture, you can see there's a little process menu, and that lists all the processes that were running on the system when you did the trace. You can choose from any of those, and normally you would choose your own. This is a screenshot from last year's demo.

[Transcript missing]

The next tool is Monster, which is the most direct way to configure and set up the performance monitor counters.

In the CHUD tools in general, there are timed intervals, so you can select a certain number of milliseconds or microseconds, or seconds for that matter, that you would like to collect per sample in the hardware. You can also collect data based on other events. You can set it up to collect a sample every so many cycles or every so many instructions completed or every so many cache misses. There's also a third way, actually related to both, which is called a hotkey. All the CHUD tools have a global hotkey. In the case of Shark it's Option-Escape; in the case of Monster it's Command-Escape.

And if you use those keys, you don't actually have to have the application in front of you. Monster, for example, can collect information from the memory controller about transactions, reads and writes, and you know the amount of time because they're sampled over time.

Then you can take those transactions and apply a calculation to them we call shortcuts. So you can say every read is 16 bytes, so I take the number of reads, multiply by 16 bytes, I have the number of bytes, divide by the time, I have the bandwidth. So you can set up these calculations in Monster and have additional columns in your spreadsheet.

And these calculations are just standard infix mathematical notation with parentheses; it's basically a four-function calculator. There's a table, and you can also draw charts, and Shark is also capable of drawing charts. Then you can also, new in this version of Monster, save and review the sessions, and the nice thing about this is you can review sessions on a system that you don't have in front of you.

So you could do collections if you had a G5 at your disposal. You could collect data with Shark or Monster on your G5, then take it back to your laptop or your desktop G4 or even your iMac and review those results and print off the charts and those sorts of things. And there's also a scriptable command line version of Monster which is new this year.

Here's a screenshot of Monster. On the left of the leftmost image, there's a column where, if you click on those entries, it will highlight those columns in the data. And when you highlight columns of data, you can then just press the draw chart button and it will result in a chart. And there are many options for charting: bar charts, various colorizations, line charts with markers, logarithmic scales, direct scales, samples over time, and just per-sample plots on a single x-axis.

You can see in this particular case that what's been highlighted are some of the shortcuts. So a load store session was done, so all the load instructions were collected. All the store instructions were collected. And all the regular instructions were collected. And then percentages of each were calculated along with that, you know, for every sample.

Each sample is listed horizontally in the table there. And so vertically is each of these shortcuts. So then you just highlight those columns of shortcuts and we plot the percentages, which is what you see in the second picture. There is quite an extensive set of sampling controls to configure the performance monitor counters in both Shark and Monster.

So the last thing is a new tool we call Saturn, where, like it says, you record your function call history, and the way we do this is by instrumenting the functions at entry and exit with GCC. There's a compiler flag that you throw and do a build, and it'll inject all the Saturn entry and exit prolog and epilog functions in every function in your application.
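
The GCC flag in question is presumably -finstrument-functions; that it is the exact flag Saturn relies on is an assumption, but the mechanism it provides matches the description. A hedged sketch of the hooks GCC then calls on every function entry and exit (Saturn would supply its own versions from its library; the bodies below are only an illustration):

    /* Assumed build line: gcc -finstrument-functions myapp.c -o myapp
       With that flag, GCC calls these two hooks around every function.      */
    #include <stdio.h>

    void __cyg_profile_func_enter(void *fn, void *call_site)
        __attribute__((no_instrument_function));
    void __cyg_profile_func_exit(void *fn, void *call_site)
        __attribute__((no_instrument_function));

    void __cyg_profile_func_enter(void *fn, void *call_site) {
        fprintf(stderr, "enter %p from %p\n", fn, call_site);
    }
    void __cyg_profile_func_exit(void *fn, void *call_site) {
        fprintf(stderr, "exit  %p\n", fn);
    }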

Now, to be completely thorough, you have to go through and recompile all of the frameworks and libraries, and that's similar to Gprof, which is really not that fun to do, so most of the time we like to focus just on actual application code. But the nice thing about Saturn is that once you have this function call history, you can visualize that call tree, and here in this image you can see that the call tree for CSE under main has been highlighted, and you see the red dashes in that stack of the call tree. And you can see the red bars there.

That's where that function is called and run. So what you would want to use Saturn for is, in particular with C++, you have a lot of call depth, so things are very skinny and tall. You're spending a lot of time calling functions and not doing any work, so you want to try to avoid that. You want to have a nice flat profile. You can also collect call counts, PMC event counts, and execution times by using the performance monitor counters with the instrumented functions that are injected at entry and exit of each of your functions.

So, as I mentioned on the first slide, we've got the instruction tracing and simulation. AMBER is the instruction tracing mechanism, and the resultant files are in a format that's called TT6. These TT6 files are consumed by the other programs mentioned on this slide: ACID is our internal trace analyzer, and actually, the ACID trace analyzer is the parent of the code coach and the parts in Shark that explain why you have bottlenecks and what you might do to change them. These come out of ACID. And it can also do a couple of things on its own, such as the memory footprint of your application.

It'll give you a little gnuplot file. You can find your instruction sequences that may be an issue and then try and remove those through the informational notes that it gives you. SIMG4 is a cycle-accurate simulator for the PPC7400, which is an old Max processor from the early G4 systems.

And SIMG5 will be available in the near future, and that'll be a cycle-accurate simulator for the new PPC970. These can be quite handy in tracing particularly complicated performance issues, although the output of SIMG4 and SIMG5 requires a terminal window; maybe a 50-inch monitor would work.

Lastly, the CHUD framework is available to, like I said, instrument your source code. One of the things you can do with instrumentation is do one function call to start and stop monster or shark sampling. So you can sort of put a caliper around your interesting code. Suppose you find a piece of code that Shark says is a hotspot and you want to get more detailed and just trace through that, you can add code.

It's CHUD start remote performance monitor and CHUD stop remote performance monitor. And what happens is you set it, you just click a key in Monster or Shark and it will be in remote mode and be waiting for messages from your application and your application only. So you can just collect the data for your interesting code. You can directly read and report on the PMCs by writing small pieces of code, either instrumented in your application, or write a separate stand-alone application.

As I mentioned, you can write your own performance tools and do all the things that need to be done in order to create a performance tool like Shark, which is control the performance monitor counters, collect the information about the system hardware, which can be handy in a lot of ways. You can know that you're on a G5, you can know that you're on a G3, you can know the bus speed of the system, the amount of memory in the system, the number of processors. You can also modify some of that information.

And there also is an HTML reference document online that describes all the various functions in the CHUD framework. Here's a small example of code with the CHUD framework. And this is, as I mentioned, how you instrument your code to start and stop Shark or Monster. So you just have to include the CHUD header file, initialize, and then acquire the remote access, start Remote Performance Monitor with a label that will show up in your output in Shark or Monster so you know which instrumentation it was. Then you run through your important code, stop the monitor, release the remote access.
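
Reconstructing that sequence as a hedged sketch: the header path and the exact CHUD function names below are my best reading of the description above and of the CHUD framework, so treat them as assumptions rather than confirmed API.

    /* A caliper around one interesting stretch of code: Shark or Monster
       sits in remote mode and samples only between the start/stop calls.
       Header path and function names are assumptions about CHUD.            */
    #include <CHUD/chud.h>

    void profile_interesting_code(void (*interesting_code)(void)) {
        chudInitialize();
        chudAcquireRemoteAccess();
        chudStartRemotePerfMonitor("vImage convolution pass");
        interesting_code();                     /* the code under test        */
        chudStopRemotePerfMonitor();
        chudReleaseRemoteAccess();
    }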

Secondly, a slightly more complex mode, I mentioned you can write your own performance monitoring tool. You initialize, acquire the sampling facility, you turn on some special filters, maybe mark your process as the only one to be counted, and then you set the events. In particular, you say both CPUs, process performance monitor counter number one, event number one, which happens to be cycles, and event number two, which happens to be instructions. Clear the counters, start the counters, your important function executes, stop the counters, then you collect these results and you can perform a calculation and get cycles per instruction in your own application.

For more information about that stuff, you can get your own download at this web address: developer.apple.com/tools/debuggers.html And then you can always contact myself and my colleagues on the Chud Tools Development Team at this email address. And we try to be pretty responsive, and that's probably the best way to get your feature requests and complaints into our queue.

Let's see what's next. Oh, I guess I'm done. So let me bring up Mr. Keithley. That'd be great. So the roadmap, a couple more sessions today, obviously one specializing in CHUD itself. We should move on to Q&A pretty quickly. We're into that time right now. Here's some contact info. Our reference library information.