Transcript
Good morning. I'm Phil Kerly and I work for Intel. I am a Principal Engineer, and I work in the Software and Services Group at Intel, and I've been doing performance analysis work for probably about a decade now. And when we had the opportunity to present a session here at WWDC, I thought about what would be something that we could add to your experience here that Apple doesn't already have all the expertise for, and one of them happens to be related to the processor. And so I thought, well that's something we should stick to. We should stick to what we know best. We'll talk about the architecture a little bit.
And then I thought about some of the experiences that I've had working with various third-party ISVs, both on Mac as well as on Windows. And one of the common threads that I saw was that a lot of people do performance analysis, but they tend to focus very much on kind of a time-based sampling hotspot analysis, and they never really quite get to using the processor events to help them analyze their application and their performance. And so I thought, well, that might be something I can talk about and maybe impart a little bit of knowledge about how to use those events, combined with Shark. So today, I'm going to talk about the processor performance events.
How you can use them to investigate the architecture, because a lot of times it's just interesting to write small little kernels to go fiddle with the hardware and see exactly what's happening under the covers. And the more that you know about the architecture, the better positioned you'll be to tune your particular applications. Now, using the performance events is a huge, probably multi-day discussion. And so, what I really want to focus on, since I think that most people understand time-based sampling, is really just a basic understanding of how the performance event counters work.
And then just maybe leave you with a little bit of strategy and some techniques to help improve your code and get started using the performance events. And as you get better accustomed to what they can do for you, then you can dig a little bit deeper on your own. What it's not.
It's not a tutorial on programming the performance events. I mean, that's really a tools, application-type domain, and for most of us, we're not interested in actually writing the tools; we just want to use the tools to our advantage and leave the programming of the events to people like the Shark engineers.
Now, I'm going to talk about performance events on the hardware, but I don't want to leave the impression that that's where you should start. Clearly, performance analysis has to be a top-down approach. You have to start with the system level. You know, you want to focus on your disk IO, your memory, processor utilization. If you're spending most of your time waiting on network IO, there's no sense to even bother worrying about the processor events if the processor is mostly idle.
And then of course we have multi-core with the new Mac Pro's 16-way virtual cores. You want to make sure that you're taking full advantage of those cores, and that you've implemented your parallelism. That you're getting good utilization across all of those cores, which means resolving any of the lock contention that you might be having and making sure that you're using the appropriate APIs that you need to get the best performance for threading, whether that be POSIX threads or Grand Central Dispatch.
And then, once you've got that level of tuning down, then you can really talk about the microarchitecture level, and we're really focusing on where the processor stalls are, what the branch prediction is doing, what the data and code alignment look like, and focusing on the architecture. So, that's what I'm going to talk about.
Not that I'm ignoring the upper part, because I'm assuming that that's already been done, but focus on the architecture itself. So, what is a processor event? There are counters in the processor that have been included to detect microarchitectural conditions. There are two types of conditions that you're really looking for. There are occurrences, which are the number of times the event has occurred.
The number of times can happen more than once per cycle. For example, the number of instructions retired, we can retire up to 4 micro OPS or 4 instructions per cycle. So there is the case where the number of occurrences can actually happen more than once in a cycle, and then we also focus on duration. So some things like, you know, when is the integer divide execution unit busy or just things like unhalted clocktick cycles.
Both of these can be constrained on conditions. So, for example, you can get the number of instructions retired in your application, but you can also, using the right conditions, find out when you're not retiring any instructions. And why that might be useful is because, if you're retiring a lot of instructions in a particular piece of code that you've got and it's doing well, you know, sometimes you want to focus on what's not doing so well and where you're not retiring instructions in those cycles.
So, conditions help constrain the events and expand, quite a bit, the number of possible events that you can actually collect beyond the couple of hundred events that are already provided. So it's a very powerful tool to peer into the microarchitecture. So here's an example of some of those event conditions we talked a little bit about: we can constrain the events to only be in ring 3, which is user mode.
You could constrain it to only be in ring 0, the OS mode, or you can constrain them so that you get both, so that you get events in both ring 0 and ring 3. You can also trigger on these events, so you can cause PMIs, or performance monitoring interrupts, to occur after so many instructions have been retired or so many unhalted clockticks have occurred. And then you can, as I gave the example before, trigger when there are no instructions retired, or you can trigger when the execution units are idle.
So if the execution units are idle, there either is no work for them to do or they are actually stalled. Now, the last two are ones that you can configure in the hardware, but currently today in Shark, the right bits to be able to actually collect those events aren't quite exposed to the end user. So there is some limitation on Shark, but certainly the full performance event counting capability that Intel provides in the hardware would support those 2 events.
So we have 2 types of events, just to give some basics. There are the model-specific events, and these are the vast majority of the performance event counters that we have. They're very much tied to the hardware and the architecture of the processor. The reason is because the underlying architecture changes, and we have the tick-tock approach to processor development.
And so every other year, we're actually modifying the architecture, and so for those types of performance events, the things that we think are of interest to general programmers and that they should be aware of, we add those to the processor. But we've also been aware that there are a number of events that are just common across all of our architectures.
So we've started implementing what we call architectural events, which are events that we basically say are going to be in every processor moving forward and that will be consistent across architectures. And the number of events is evolving, and today we're up to 3. It's not a whole lot, but they're probably the most important 3. We use them a lot. They happen to be unhalted clocktick cycles, unhalted reference cycles, and instructions retired.
And then we break down the counters into 2 types. There are the fixed counters, which are the architectural events, and the programmable counters. So the fixed counters are ones that you have no control over. They are fixed-count, fixed-function. They're providing the architectural events. The programmable counters, again, are the vast majority of what you can actually count on, and they're tied to the architecture. So, one thing about the fixed counters is that they're not supported in all of our platforms as you look backwards.
It's only a forward-looking architectural improvement going forward. And with the programmable counters, not all events can be counted on all of the counters. So if you go into Shark and you actually scroll through the list for a particular counter, be aware that if you're looking for a particular event that you know the processor should support and it's not in that particular counter list, you should go check the other ones, because they're not all supported in both sets of the programmable counters.
And because internally we have been using these performance events to help improve a lot of the software that we engage with, a lot of ISVs, we started increasing the number of counters. So today on Nehalem, you get 4 of these programmable counters versus 2 on previous generations. And then there are 2 basic modes of operation.
The counting mode is basically a query where, however you decide you want to count it, you just go and query the processor for the event counts. So you clear it out, and you just read the event every so often, or at the end of your application after it's finished running, and then that's what you get. Now, you can't correlate that to an instruction pointer because, you know, it's arbitrary in terms of when you're actually querying or polling it. There's nothing related to the actual instruction pointer.
The triggering mode uses an actual event, the PMI, a performance monitoring interrupt, that occurs and signals: hey, based on your conditions, I've reached that number of events, this has occurred, and now you, or the tool, have the opportunity to go and actually query the event.
And this, you can actually correlate to the instruction pointer because when the event occurs, it's usually somewhere near where that event actually happened in your code. In the interrupt, you can go look to see what process happened to be running and go look up the instruction pointer. So it's-- there's definitely a high correlation between the event occurring and the location in your code.
So this really enables 2 different sampling methodologies. We use both. The primary one that I use for client-type applications, which have different phases of execution, tends to be triggering mode. But if you have an extremely steady-state type application where you're doing basically the same processing over time and you can just query it on a periodic level, we've used that one as well. But the triggering mode tends to be the large majority of what I personally use.
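For illustration, here is a minimal C sketch of the counting-mode (query) idea. Shark programs the real performance counters for you; as a stand-in that needs no special setup, this brackets a workload with the time stamp counter (which comes up later in the session). The workload function and counts are hypothetical placeholders, not code from the session.

```c
#include <stdio.h>
#include <x86intrin.h>

static void workload(void) {
    volatile double x = 0.0;                 /* arbitrary work to measure */
    for (int i = 0; i < 10 * 1000 * 1000; i++)
        x += i * 0.5;
}

int main(void) {
    unsigned long long start = __rdtsc();    /* query the counter before */
    workload();
    unsigned long long end   = __rdtsc();    /* query the counter after  */
    printf("elapsed time stamp counter ticks: %llu\n", end - start);
    /* No instruction pointer is associated with these counts; you only learn
     * the total for the bracketed region, which is the counting-mode tradeoff. */
    return 0;
}
```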
So let's talk a little bit about statistical sampling first, correlating the triggered samples to code. Because you're actually getting the interrupt, you have to remember that if you're collecting and interrupting after so many instructions, it is a statistical sampling. So if you're interrupting every 300 million events, instructions retired or whatever it might be, then it depends on your application: if you have some kind of harmonic in the application such that, you know, you run nearly all of those events in one area of code and the one extra event that causes the interrupt happens to occur in some other location, you could reach a point where you're actually pointing at the wrong place in the code. So you have to be very careful about how you read the statistical sampling and make sure that you're getting the right picture, and we'll talk a little bit more about that.
And that really leads into the calibration, which is: you want to interrupt just the right amount. It's kind of the Goldilocks problem. You don't want too many interrupts, because all that interrupting causes your program to behave differently than the way that you originally intended. If you do it too little, then you don't get quite the resolution into your application that you need to know exactly where that particular instruction pointer is that correlates with that particular event.
I kind of liken this to a camera zoom. If you take a full big picture, the resolution and the detail that you see is not as good. But if you drill down too closely, things become fuzzy and out of focus and you start focusing in on one location but it's not necessarily the right location. So it's a matter of getting the right focus for calibration.
I'll talk a little bit more about that when I bring up Shark. And then there is the potential for a little bit of skid in your applications when you're doing performance events. There is a delay around the actual interrupt: by the time the event gets signaled over to the performance counters, the interrupt happens, and you go query the processor and find out what the instruction pointer was for that interrupt, there's a little bit of delay. Now fortunately, this is mostly hidden by the reorder buffer that's in the processor.
But there are some events for which it's particularly more hazardous than it is for others. For example, branch instructions. If you actually have a branch instruction for which you're collecting the branch miss event, and the event happens and then gets signaled, and then you go to read the IP, you're already into your target, so you don't always know exactly which branch that instruction pointer correlates back to, and that's what skid does.
The Intel processor does support what's called precise event-based sampling which has tried to improve on this process. And in fact, if you have precise event-based sampling capability, it will actually guarantee that your skid is no more than one instruction away from where the event actually occurs. The hardware actually captures the IP. Now, for branch prediction and targets, it's still not quite accurate because it's still one instruction later. But we're working on it.
We'll get to the point at some point where we'll actually capture the right IP address for the event. But anyway, the whole point of this is really that your result-- your results are driving your analysis. And statistical sampling and how you collect the data and how you interpret the data is very important to getting the right answer and bad results can drive wrong conclusions. So I'm going to talk a little bit about some of the pitfalls that you might run into to help steer you in the right direction when you start doing performance event-based sampling.
So let's just bring up Shark for a moment. Okay, so I actually have Shark here running. You've probably all seen this. So one of the first things that I did, this is actually under the config so I'm bringing up edit, and I'm actually going to go to the view for advanced. OK, and so here you can see very clearly that there are a number of options. One of them, the default, is the time-based sampling. So this is where you set how often you want it to sample.
Whether that just be a time-based sampling where Shark is actually getting the IP, or, if you were to collect some other event, how often you're actually going to go query that counter for that data. There's also this here, which is the CPU PMC program, or the performance monitoring counters, and this is what I tend to use, so I'll switch over to that. And then this gives you, and it actually picks, the right set of events based on the particular processor that you happen to be running on. So I talked about model-specific events.
This is where it comes into play. So Shark is smart enough to actually detect what processor you're running on, and it gives you the right set of events to go utilize. So in looking at this, I talked a little bit about the conditions, and one of those conditions was the privilege level, and so here we're talking about user versus supervisor. This is essentially ring 3 versus ring 0, or both. These are the actual events themselves, and I'll bring up one here.
Let me just find it. So a typical one is the CPU clock unhalted, and it gives you a set of qualifier events. Again, this is part of the conditions on which you can qualify the particular event. And this is the button that actually tells you whether you're going to do the querying approach or whether you're actually going to do the trigger approach.
This field right here, the sampling interval, is actually the number of events you want to occur before you actually trigger to get the location of the IP. The default in this case is 10,000, and I can tell you that if I was to do unhalted clockticks, I would be getting interrupts way too quickly on a 3 gigahertz machine.
So, typically we would set this at about 1 millisecond in general, so we would take-- we would just take the frequency of the processor and basically put that number in there divided by 10, just to get us in the right ballpark. So be careful when you take the defaults. Make sure that you understand.
A lot of times I'll actually run this, figure out how many events actually occurred, how long it actually ran, and figure out what the right interval or what the right sampling interval should be for the particular event that I'm looking at. Something like unhalted clocktick cycles is going to be a very large number.
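As a concrete illustration of that calibration arithmetic, here is a small C sketch: pick a sampling interval (events per interrupt) from a target interrupt period and a measured or estimated event rate. The specific rates, periods, and trial numbers below are illustrative assumptions, not values from the session.

```c
#include <stdio.h>

int main(void) {
    double event_rate_per_sec = 3.0e9;   /* e.g. unhalted clockticks on a ~3 GHz part */
    double target_period_sec  = 1.0e-3;  /* aim for roughly one interrupt per millisecond */

    /* events per interrupt = events per second * seconds per interrupt */
    double interval = event_rate_per_sec * target_period_sec;
    printf("sampling interval: %.0f events per interrupt\n", interval);

    /* For a rarer event (say, loop stream detector deliveries), run once in
     * counting mode, note the total count and run time, then scale the same way. */
    double measured_events = 5.0e6;      /* hypothetical total from a trial run */
    double measured_time   = 2.0;        /* seconds the trial took */
    double rare_interval   = (measured_events / measured_time) * target_period_sec;
    printf("rare-event interval: %.0f events per interrupt\n", rare_interval);
    return 0;
}
```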
Something like loop stream detection might be a much smaller type of event. Now, I will say, this is not, you know, an official release of Snow Leopard, and so Shark is still not quite correct when you look at Nehalem, so be careful. Some of the events aren't quite configured correctly. Just looking at unhalted clockticks, I can tell you that there really is no event called total cycles that has a bitmask of 2.
So that one I happened to know is not actually configured correctly, and hopefully we'll get that resolved with the Shark team before Snow Leopard is released.
And so here we have, in Nehalem, I told you there were 4 programmable counters, so here are up to 4. And then we have the 3 fixed counter events. So you see, you can't actually change those values or set any of the reserved fields, but you can still determine whether you want it to be user or supervisor.
The other thing to be careful of is that when you actually start looking at some of these events, OK, it might be very tempting to actually start clicking on every one of these and saying, OK, you know what, I really want to know when the L1 data cache store has an invalid, shared, exclusive, or modified state, and start clicking on all of them. Not all of the events are supported such that you can actually click on every single subevent in the event list.
For some of them, it's not clear whether it's going to be an AND of the events, or an OR of the events, or whether it's just going to be garbage. So, one of the things to do when you start looking at some of these subevents is to be very clear about what they collect.
And if you're unsure, write a little routine, in this case one that does an L1D cache store, and actually measure what these events are giving you. And then if you actually click on multiples of them, determine whether it's actually giving the total for all of the events or some variation thereof.
So anyway, that's how you configure these performance events in Shark. So, here is an example of time-based sampling. I showed you where that actually was in the Shark configuration. And so here is where you actually can configure it for time. And I told you that it's not necessarily a good idea to actually take a time-based sampling with a counter event.
So, for example, here I actually clicked it, I said I don't want triggering mode. You can't do triggering and time-based sampling at the same time, so you shouldn't be able to select trigger with the timer set that way; Shark will actually give you an error and tell you that you need to fix that.
But you can actually set up this particular configuration, but you really don't want to do that. And if you do, you have to be very careful about how you're interpreting that data. Now, you can actually set up time-based sampling using the performance events as well, using the unhalted clocktick cycles event. It's essentially doing the same thing. Instead of using the OS timer to interrupt and figure out where you are, you're actually using the processor event. And again, you can pick triggering or counting.
There's nothing that actually stops you from doing that. And again, you know, be careful because-- Let me get back here. We're going too fast. The mouse is a little bit-- OK. So what I did is I actually wrote an example showing statistical sampling with 2 functions, one that is completely compute bound, there are no load instructions whatsoever, and one that is mostly load bound.
So it's got L1 data cache misses. And if I were actually figuring out-- and I wrote this such that I expect 75 percent of my time to be in the compute bound function, and 25 percent of my time in my load bound function. And the reason why it's 75 and 25 is because that's the way I wrote it.
I wrote it and then measured it and said, "Yeah, it's going to be 75 and 25." But what I wanted to demonstrate is that I expect 100 percent of my load misses to actually be in the load bound function and 0 misses to be in the compute function, because I don't have any load instructions there.
So, if we look at the instructions, and this is a little bit of assembly code. I tend to work in assembly because I know exactly what I'm issuing to the machine. And in this case, I just have a loop with a bunch of increments. There are no load instructions whatsoever; it's very trivial. The reason why I have a bunch of increments is because I wanted to get to that 75 percent mark, and so I just kept adding a few increments until I got to that point.
And then here is my load bound function. It's a little bit more complex, but essentially it's just a bunch of loops that are doing the load, and there's only one load in the function. But I wrote it such that I'm actually striding through the cache so that I know that I'm missing the L1 data cache every single time. And if I run that with Shark, either time-based sampled or triggered on unhalted clockticks in counter mode, what I find is that in this particular case, the time really was 75 percent and 25 percent.
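For readers who want to reproduce the experiment, here is a minimal C sketch of the two kernels described above (the session used hand-written assembly; this is only an approximation under assumptions). The compute-bound function is a loop of register increments with no loads; the load-bound function strides through a large buffer so that nearly every load misses the L1 data cache. The buffer size, stride, and iteration counts are illustrative choices, and a real test would be compiled without optimization (or written in assembly, as in the session) so the increments are not folded away.

```c
#include <stdlib.h>

#define BUF_SIZE   (64 * 1024 * 1024)    /* much larger than the L1/L2 caches      */
#define STRIDE     4096                  /* jump a page at a time to defeat prefetch */

static volatile long sink;               /* keep the compiler from deleting the work */

void compute_bound(long iterations) {
    long a = 0, b = 0, c = 0, d = 0;
    for (long i = 0; i < iterations; i++) {
        a++; b++; c++; d++;              /* no loads, pure register increments */
    }
    sink = a + b + c + d;
}

void load_bound(const char *buf, long iterations) {
    long sum = 0;
    size_t idx = 0;
    for (long i = 0; i < iterations; i++) {
        sum += buf[idx];                 /* the single load in the function        */
        idx += STRIDE;                   /* stride through the buffer...           */
        if (idx >= BUF_SIZE) idx = 0;    /* ...wrapping so each access misses L1   */
    }
    sink = sum;
}

int main(void) {
    char *buf = calloc(BUF_SIZE, 1);
    compute_bound(1000L * 1000 * 1000);  /* tune these two counts until the        */
    load_bound(buf, 250L * 1000 * 1000); /* time split is roughly 75/25            */
    free(buf);
    return 0;
}
```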
But if I looked at the L1 data cache misses, if you actually looked at the percentage of the ratios of those 2 values, those 2 big values, you'd find that 75 percent of the load data cache misses were in the compute function, which is not the case, and then 25 percent.
And that's strictly because there is no correlation between when you're going to have an L1D cache miss versus when you're going to have a time interrupt or where you're spending most of your time. And so you end up just matching that. Now, you could still use some of this data if you're careful. What you can't do is associate it with these particular functions.
If you wanted to know, kind of get a time-based sampling of your-- where you're spending your time but you just want to know what the total count for the event was, you could add these 2 values together and that would give you kind of what your full workload ran and what the count was. But that's about the extent to which you can use those particular values. You can't associate them with the function. And in fact, here we just see that the-- we're really just matching the same ratio that we measured in our time-based sampling, which is not correct.
So, this is an example of actually running with the triggering where I collected on L1D replacements. That is, every time I have a miss, right, I'm going to have to replace something in the cache, so I happened to use that particular event. And it turns out that doing that, I find that I'm getting 99.3 percent of my L1D cache replacements in the 25 percent, or the load bound, function, which is correct. So here is a clear case of using statistical sampling to correctly get the right location in the code.
[ Pause ]
And essentially, that takes us-- So if you actually drill down using the trigger sampling and you actually drill down into the disassembly for this particular routine, it actually takes you right to where that load instruction is, and you'll notice that it shows like about 38 percent on the load instruction, and then 54 percent on the next instruction after it.
That's where the skid comes in, right? So, the fact that the event occurred on the load, by the time we actually detect and figure out what IP address we're on, sometimes we're on the load and sometimes we're an instruction behind. But it takes you pretty much right where you want to be. The skid isn't that bad. You will find that skid with branches tend to be a little bit harder to trace back to if you're not using the precise event-based sampling. So, some skid is not unusual.
In fact, if you have extremely compute-intensive code in which you're retiring a lot of instructions per cycle, you'll actually see banding, where you'll see like 3 or 4 instructions and then you'll have your percentage on like the third or fourth instruction, and then another set of 3 or 4 instructions with another one.
And that's because the reality is that the processor is retiring, you know, 3 or 4 instructions in one cycle, so what IP are you actually picking when that event occurred? You're really just picking the last one. So sometimes if you see that banding, that event can really be associated with maybe 2 or 3 instructions ahead of it, just because of the way the interrupts are happening and the way the instructions are retiring through the machine.
[ Pause ]
So that's kind of the basics on performance event counters, and now I'm going to talk a little bit about what information you can actually get from the processor and the events. Now, in Nehalem, there are hundreds of events. There is no way that I could talk about every single one of them.
So what I'm going to try to do is actually walk through the architecture and look at some of the major components of that architecture, and highlight some of the maybe more interesting events that can be associated with them. But really, the point is to show you that at just about every point in the architecture, in the pipeline, you're able to get some performance information. Now, interpreting this is a little bit trickier, and we'll talk more about that.
Now I'm focusing on Nehalem because that's our latest processor. It has more capability in terms of the number of performance events that are supported. But this is actually true for any other previous Intel processors that are supported on the Mac platform.
So all the way back to [inaudible], there is some level of events that are supported, but I'm not going to talk about those particular architectures. And so right here, we've actually divided up the architecture into 3 domains.
There is the front-end domain, which is actually fetching the instructions and doing the decode, and then providing that to the execution unit. So the execution unit is the other domain. This is where all of the register file renaming and allocation happens. This is where the retirement buffer lives.
So as instructions or micro OPS execute out of order, because they are out of order, this is where they're maintained so they can actually retire in order. Then there are the actual execution units, and then finally we have kind of the memory hierarchy, which has basically the caches, and out to memory itself. And I just highlighted a couple of items in here so you see that from the decoder, you're actually getting up to 4 instructions decoded into this back-end, or into the execution unit.
And then from the rename allocate station, you can again issue 4. And once you get into the reservation station, which is the one that actually reserves and makes sure that all of your resources are available to actually go execute that instruction, there are actually 6 ports.
So you can actually issue up to 6 micro OPS into the execution unit. There are these kinds of constraints throughout the architecture and so this is one of the things that we look at when we start using the performance events is to actually look at, you know, based on what is my application doing, what is my bandwidth support within the processor, and am I getting close to that theoretical maximum throughput at each of those stages. So, let's look at the front-end first. The front-end is what actually feeds the execution unit or the execution engine.
If you're not feeding enough instructions, and it is possible to not feed enough instructions to the back-end, you essentially starve the rest of the machine, but it's very nice because we have lots of-- Just about every point within this front-end, we have events to be able to show you where you might be actually stalling or running into resource issues.
For example, with the L1 instruction cache, you can actually see how often you're stalling trying to fetch from the L1 cache. Say you wrote your code very branchy, and when you're trying to fetch you're only getting maybe one instruction on a fetch line, where a fetch line is up to 16 bytes wide.
If you're only getting one instruction because you're branching to another fetch line, and you have enough of those that you're exceeding your L1 cache, you're going to be seeing L1 stall cycles. Now, it's also possible that you're fetching enough instructions but you're actually filling up the instruction queue, which is the queue right before the decoders, because you're not decoding fast enough.
And so you can actually determine: am I fetching OK? If in fact I'm fetching enough but I'm filling up the queue, so that I couldn't really do it any faster even if I wanted to, at least from a fetch standpoint, then you might start actually looking at the decoders. And there are events associated with the decoders. You can tell how many instructions have been decoded.
We have a 4-1-1-1 model, so we have complex instructions in decoder 0, and then up to single micro OP decoders in the other three. There are complex instructions which are decoded out of the microcode ROM, which bypass the decoders. So if they don't fit in the 4-1-1-1, if your instruction is larger than 4 micro OPS and the complex decoder can't resolve it, it has to go to the micro ROM, which can have a string of micro OPS. And then you can also tell when you're actually delivering your micro OPS out of the loop stream detector. So the loop stream detector on Nehalem supports up to 28 micro OPS.
So after instructions are decoded, OK, if the loop stream detector detects that, hey, I've seen this pattern consistently after a certain number of iterations, it will actually start issuing out of the loop stream detector, and you can actually shut down the rest of the front-end to save a little bit on power. It's not necessarily a performance saving, except if you have some decoder bandwidth issues. If you have some decoder bandwidth issues, the loop stream detector could give you a little bit of a performance boost by avoiding having to go through the decoders.
But there are lots of events associated with this front-end. And in fact we have a tremendous number of branch-related events. You can find out when you have branch address clear events for which the branch predictor doesn't know what the address actually is going to be and so it has to clear that address.
There are branch instructions retired events: you can tell when you've mispredicted, you can tell when you've predicted correctly, and whether the branch was taken or not taken. So you can qualify each of those events down to exactly what you're interested in. You can also find out when the branch prediction unit is full. The fetch line is 16 bytes, while the branch predictor looks at 32 bytes ahead, and it's actually possible to fill up that branch prediction unit such that its queue is full and you can't put any more branch predictions in there. It's then also stalling because you don't get the branch predictions you need.
So assuming that you're getting enough micro OPS out of the front-end, OK, enough instructions into the execution units, now you can start looking at, you know, what is going on from an execution standpoint. Am I stalling in the execution units? So, some of the resource stalls that are supported are things like the ROB being full. So, if I'm not retiring instructions but I've gotten enough instructions started and starting to execute but I'm not completing them fast enough, my ROB can actually fill up.
The reservation station, which has 36 entries on Nehalem, is another resource that you can fill up. So if you are actually getting enough micro OPS in but you're not getting them out of the execution units, because the results are stored in the ROB once they finish the execution unit while waiting to retire, if that's backing up, if your execution units are stalling, then you will start stalling and filling up your reservation station. And then of course, if you're doing loads and stores, your load and store buffers can start filling up, which will also stall, waiting on either cache or memory loads to occur.
And then there are some issues with the domains. So if you're mixing floating-point SIMD and integer SIMD, some instructions support multiple data types, so you can actually do a load using a floating-point load instruction and then go use it in an integer SIMD instruction, and that's what's called a bypass or domain bypass stall, because you're switching from one domain to another, which will also slow down your execution.
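As a hedged illustration of that domain-crossing pattern, the sketch below loads data with a floating-point load (MOVAPS via _mm_load_ps) and then consumes it with an integer SIMD instruction (PAND via _mm_and_si128). The cast intrinsic emits no instruction; the cost is the bypass delay between the floating-point and integer execution domains. The function name and values are illustrative, not from the session.

```c
#include <emmintrin.h>

/* p must point to 16-byte-aligned data */
__m128i mask_low_bits(const float *p) {
    __m128  v  = _mm_load_ps(p);                      /* FP-domain load       */
    __m128i iv = _mm_castps_si128(v);                 /* reinterpret, no uop  */
    return _mm_and_si128(iv, _mm_set1_epi32(0xFF));   /* integer-domain PAND  */
}
```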
And then you can actually find out exactly what port you're dispatching to. So if you have a routine that you're looking at the execution for, and you partition that out into the ports that are being used, it's possible to look at that and find out whether you're actually being limited by the port that you're executing on. And it's also possible to decide, hey, I'm port 0 limited, or port 0 and 5 limited, let me look for other instructions that let me spread some of this out.
Maybe instead of doing moves using registers, maybe I'll actually load it from cache and use the load port to do the load, and it will hide the latency, because you're executing a lot of instructions and you're spreading this out over the number of ports that you actually have. So again, there are lots of events supported in the execution unit to figure out where the bottleneck is. And what you tend to try to do is start looking at the back of the pipeline first. So you're looking at where the resources are being consumed.
You know, the first thing I wouldn't look at would be the front-end decoder. I would start looking at the end. How am I retiring micro OPS? What's my micro OP retirement rate? If I'm not retiring micro OPS, what's the bottleneck? Is it in the execution stalls? OK, are we having problems with the execution units getting the micro OPS through? And if that seems to be clear, I'm never filling up the reservation station and everything seems to be idle, then I start backing up and looking further up into the processor, into the front-end and the decoder side.
And then the cache and memory subsystem has, again, a number of events associated with cache behavior. So, you know, you're looking at the number of L2 cache lines in. You can do-- you can actually partition them into the ones that you've actually demanded because you've actually written a load instruction for it and the ones that are actually being prefetched for you.
So you can partition that. You can see the L2 lines out, same thing: was the L2 line out because it was a demand store, or was it because I did a prefetch and it evicted some other line? And then you can just look at the number of memory instructions retired, loads and stores.
And the interesting thing here is that we've actually tied some of these events back to the DRAM. So on Nehalem, because the architecture of the platform has actually changed from the front-side bus that we originally had to a NUMA-related architecture, if you actually have NUMA enabled, you can actually see where the memory is coming from and figure out how often it's coming from your local DRAM for your socket or whether you're actually loading that memory from a remote socket.
And the performance delta is pretty astounding. It's 30, 40 percent difference depending on where your application happens to be running and where it's actually loading that data. So, there's lots of events to investigate the architecture. And for a lot of these events, it's really good to just write kernels, you know, use the events, figure out how the processor is actually working.
Make sure you understand what the event is actually doing and what it's actually counting, because some of them can be fairly complex. Now, there's a little bit of a wrinkle when you start looking at this. I told you at the very beginning that performance events essentially count occurrences, which can happen more than once per cycle, or duration; so, occurrences and duration.
But as the processors have gotten more complex, a clocktick is not always a clocktick. There are actually various clock cycle domains, and this can actually impact your results. So when you start looking at some of these cycle events, you have to be aware of what cycle you're talking about, so that when you compare them to your expectations, you understand and can interpret the results correctly. So one thing is, make sure that you understand what that domain is. I'm going to talk about what the various cycle domains are and how you compare them to each other.
So one of the reasons why we have differences in clockticks is because we support a concept called turbo mode. In the old days, prior to Nehalem with turbo, when you ran cores, you ran them at a certain frequency. If you didn't use them, you could halt them and actually put them to sleep, but the other cores just stayed, you know, at the same frequency.
There was no additional boost. What we found is that if a couple of cores go idle, OK, they actually get halted, we have a larger thermal design point within which we can actually increase the frequency of the other cores. And so turbo mode will kick in, and your clock domain actually changes. So while a clocktick is still a clocktick in an unhalted cycle, it actually increments faster because the frequency is different.
And so I actually collected, and this is a little bit on my chart, essentially unhalted reference cycles. Unhalted reference cycles used to be tied to the front-side bus clock, OK, and they still are in some sense, in that we say the unhalted reference cycle counter increments in 133 megahertz increments.
So it's actually closer to wall clock time. And then we have what's called a thread cycle, which increments once for every actual core cycle, no matter what frequency the core happens to be running at. And then we have what's called wall clock time, or the time stamp counter.
And in this case I've collected both of these events and it turns out that the total clock cycles or the wall clock time is both the unhalted and halted clock cycles. It's the real time stamp counter, and in this case both numbers turned out to be the same.
But if we look at the reference cycles here, and that's the 43 million, OK, if we take that and multiply it by our frequency reference, or multiplier, then we can actually come up pretty close to the wall clock time. So essentially what this tells us is that if I take that 43 million, and in my case it was a 2.66 gigahertz machine, and multiply it by 20, I pretty much get the same 874 or 875 million clockticks. So what that tells me is that, because the total cycles are both unhalted and halted, that particular CPU, CPU-8, was almost exclusively unhalted; it was running consistently. And so there are really almost no halted cycles associated with that CPU.
So that's the unhalted reference cycles. Now if we look at all the unhalted cycles, it's really independent of the frequency. So here's the thread or the cycles for the thread of the core cycles, and here we're over a billion. So if you look at those 2 numbers, how does that actually, you know, translate? I've got more clockticks than I have wall clock time? So in this case it actually turns out that we actually got an upside from turbo mode and in this case it was really close to 15 percent improvement on this particular CPU. It turns out to be 3 bins up on the frequency. So instead of being a 2.66 part, it actually was running for this particular core at about 3.06 gigahertz.
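Here is a small worked example of that arithmetic in C, using rounded, illustrative numbers in the same ballpark as the chart just described (a 2.66 GHz nominal part, a 133 MHz reference clock, roughly 43 million reference cycles and a billion core cycles). It is a sketch of the reasoning, not measured data.

```c
#include <stdio.h>

int main(void) {
    double ref_clock_hz  = 133.0e6;   /* unhalted reference cycles tick at 133 MHz */
    double nominal_hz    = 2.66e9;    /* nominal core frequency (20x multiplier)   */
    double unhalted_ref  = 43.0e6;    /* unhalted reference cycles collected       */
    double unhalted_core = 1.0e9;     /* unhalted thread/core cycles collected     */

    double unhalted_time = unhalted_ref / ref_clock_hz;    /* seconds unhalted */
    double effective_hz  = unhalted_core / unhalted_time;  /* actual frequency */
    double turbo_ratio   = effective_hz / nominal_hz;

    printf("unhalted time      : %.3f s\n", unhalted_time);
    printf("effective frequency: %.2f GHz\n", effective_hz / 1e9);
    printf("turbo uplift       : %.0f%%\n", (turbo_ratio - 1.0) * 100.0);
    /* With these rounded inputs: ~0.32 s unhalted, ~3.1 GHz effective, ~16% uplift,
     * in line with the ~15% / 3.06 GHz described in the talk. */
    return 0;
}
```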
So that's the unhalted thread clock cycles. Now, if you look at some of the other CPUs that happened to be running, and I wrote this as a single-threaded application, you can see that those other CPUs are mostly halted, and so their numbers are extremely low. Those are their unhalted thread clock cycles, and we're actually seeing that when this was running, CPU-3 was less than 0.1 percent active, and the other ones are pretty much in that same ballpark. So you can see that there are a number of clock domains to be aware of. There's the wall clock time, there's the reference clock, and then there are the actual execution cycles associated with the core.
So be aware and know what the events are actually counting. And one of the reasons for that is because one of the things that I talked about on the back end is that we can retire up to 4 instructions per cycle. If you've collected, in this particular case, the instructions retired and the unhalted cycles, it would appear, at least if you just based it on the wall clock time, that you're actually retiring more than 4 micro OPS or instructions per cycle.
And so I certainly don't want that to kind of throw you off. So, how to interpret the results? You know, I get the question a lot: here's my data, what does it mean? Well, it depends on your application. It really depends on what your application is doing. You can't look at the performance event counters independent of your application or independent of the architecture.
Now, you can use some of your ratio events to give you a little bit better idea. Certainly if you came to me and said, "Hey, I've got my CPI, and my instructions retired per clock is 4," I'll tell you that, other than eliminating instructions, there's really nothing you can do to improve your performance with that kind of IPC. But if you came to me and said, "Hey, that's 0.25," I would say you have a lot of headroom. You need to be up in the 2 range if you're really more in the compute arena. So use the ratio subevents to help guide some of that.
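For clarity, here is a minimal sketch of that IPC/CPI ratio, computed from two counts you could collect in Shark (instructions retired and unhalted core cycles). The counter values are made up purely for illustration.

```c
#include <stdio.h>

int main(void) {
    double instructions_retired = 2.4e9;   /* hypothetical count */
    double unhalted_core_cycles = 1.2e9;   /* hypothetical count */

    double ipc = instructions_retired / unhalted_core_cycles;  /* retired per cycle  */
    double cpi = 1.0 / ipc;                                     /* cycles per retired */

    printf("IPC = %.2f, CPI = %.2f\n", ipc, cpi);
    /* Rough reading per the talk: IPC near 4 leaves little room beyond removing
     * instructions; IPC near 0.25 suggests a lot of headroom for compute code. */
    return 0;
}
```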
And then really understand where your performance bottlenecks are by using the performance event counters. So if you can base them on the architecture, start working back in the pipeline, you know, start with your instructions retired, working back up through the pipeline, then you can start to better understand whether your data or your results make sense based on your application. And once you understand where the bottleneck is, then you can start to think about what you can do to actually resolve that particular bottleneck.
So you do get very big numbers, very big counts, no doubt. And-- but you really need to correlate them to your expectations and that has to be based on the application as well as the hardware. So to kind of try to give you an idea of how to get started.
Basic characterization: a number of these events are fairly obvious, I think, to most of us that are doing performance analysis, simple things like cycles per instruction, or your branches per micro OPS retired. So, for example, what happens with a branch if you miss a lot, if you missed that prediction? The processor is still fetching all of those instructions, all of those micro OPS, and executing them.
And when it figures out, hey, I mispredicted, it has to throw all those micro OPS out. They never get retired. So if you find that the number of branches per micro OPS retired is getting, you know, out of whack, then you know that you're probably missing a lot, OK, and you're throwing a lot of stuff away. So really, the thing is to try to figure out how to eliminate those branches, how to improve the branch prediction for those branches, maybe using SIMD-type instructions and actually operating on both paths of execution and making a decision at the end about which computation to use.
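Here is a hedged sketch of that "compute both paths and select at the end" idea, using SSE intrinsics. Instead of branching per element, both results are computed and a comparison mask blends them, so there is no data-dependent branch to mispredict. The threshold and arithmetic are arbitrary illustrations, not the session's code.

```c
#include <xmmintrin.h>

/* out[i] = (in[i] > 0.0f) ? in[i] * 2.0f : in[i] * 0.5f, for n a multiple of 4 */
void scale_branchless(const float *in, float *out, int n) {
    const __m128 zero = _mm_setzero_ps();
    const __m128 two  = _mm_set1_ps(2.0f);
    const __m128 half = _mm_set1_ps(0.5f);
    for (int i = 0; i < n; i += 4) {
        __m128 v    = _mm_loadu_ps(in + i);
        __m128 hi   = _mm_mul_ps(v, two);             /* "taken" path           */
        __m128 lo   = _mm_mul_ps(v, half);            /* "not taken" path       */
        __m128 mask = _mm_cmpgt_ps(v, zero);          /* all-ones where v > 0   */
        __m128 sel  = _mm_or_ps(_mm_and_ps(mask, hi),
                                _mm_andnot_ps(mask, lo));  /* blend by mask     */
        _mm_storeu_ps(out + i, sel);
    }
}
```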
But there's a number of things you can do to try to improve your branch prediction. And then certainly L1 data cache miss rate and L2 data cache miss rate. The reality is that you know the biggest bottlenecks are going to be your memory and your cache latencies, right. So memory, you're in hundreds of cycles, you know, the L3 cache, the last level cache, you're in the tens of cycles of latency for those loads. You know you get into the L2, you're in you know 10, 12, 14 type cycles.
You get into the L1, and now you're in the 3 and 4 cycles. So, based on that, you can start to figure out where you're actually bound in terms of your memory accesses, and then you start to figure out how that correlates to your application: how much data are you actually loading? Are you loading more data repeatedly than what you really need to load from a processing standpoint? So you're really correlating that to the known constraints based on the L1, L2, and last level cache latencies, and understanding your branch prediction. And then you have to kind of compensate for this turbo and thermal impact.
One of the things I talked about was turbo mode. The converse is true as well. If you have a fully loaded processor across all cores and you're really doing an extreme amount of computation and get that processor really, really hot, it will start to thermally throttle. So instead of getting a 15 percent performance boost, you can actually see a reduction in frequency to compensate and try to keep that processor within its thermal design limits.
So, the other thing is to really start with the bigger bottlenecks first. Really focus on the memory and cache bandwidth, and then start from the execution units, because that's where you're going to be loading that memory into and executing on that data, and then start working back to the front-end decoder. Rarely is the front-end decoder your problem, but I have seen applications that have had that as the issue, where the execution units are mostly idle.
And then beware of the cause and effect. So, you know this is really focusing on the clog and not the symptom. So if your bathtub is overflowing, OK, the symptom is that the water is spilling out over the tub and fixing that isn't going to help anything. It's really the clog that's down on the floor in the drain some place that you have to unclog and that's really the same effect.
If you start clogging at the execution units or the retirement location, then you start filling up all those other queues all the way back into the decoder, and then, you know, the decoder stops decoding because there are no more places to put the decoded instructions. And really start with simple events, so you know what you're measuring: simple things, you know, instructions retired, L1 cache misses, L2 cache misses.
You can get into very esoteric events, but start with the simple ones first, and then when you start to hone in on where you think the bottleneck is, you can start looking at some of the subevents and more complex events to try to focus in a little bit more.
So, you know, really drill down using the performance events and really understand your workload's behavior. And in summary, event monitoring is a very powerful tool. Intel has invested a lot of time and a lot of effort and a lot of silicon to put these performance counters in there.
They're certainly not there because we don't think they have value; they have a tremendous amount of value. We use them all the time internally to investigate and understand the architecture and understand how applications are working. For me, I come from a position where a lot of times I have no idea what an application is really doing.
You know, it's some big monstrous thing that has, you know, 10, 20, 100 engineers that have been working on this thing for the last decade and so it's very hard to understand exactly what the application is doing. But by using the performance event counters I can actually do a fairly good characterization to get a better idea of how that application is working. And then once you understand where the bottlenecks are, what your constraints are, then you can start using that to help optimize your applications. It certainly supports very simple analysis but also supports extremely complex analysis.
And then the more skillful you are as a user with the tool, with Shark, with the events, the more you know about your application, the more you know about the tools, the better you are at actually becoming the expert. So, you know, use the tools, use what's available, and certainly performance event counters are a contributor to that. So become your organization's expert in using processor performance events. There are a number of references that you can actually use.
Intel's Architecture Software Developer's Manual has all of the events and what they do. The Shark User Guide has a list of the events as well, and their descriptions. And then if you happen to be cross-platform, the Intel VTune Performance Analyzer, both on Linux and Windows, has a lot of good material to go to.
In fact, the Intel Performance Analyzer actually has ratios of events that have been documented, so even though Shark doesn't necessarily give you ratios explicitly, you can go look at the VTune events and see what ratios they're collecting. They give you a lot of hints about what those ratios should look like in terms of what the counts should be and whether you have a problem or not, so it's definitely another good reference to get started on performance events.