WWDC02 • Session 906

Developing for Performance

Tools • 54:44

Don't miss this runtime-neutral discussion of the principles of performance and efficiency as they apply across the operating system. We will identify top performance obstacles, along with strategies for eliminating them. Developers will complete this session with a checklist of concepts to consider and items to look for which will enhance their products and the Mac OS X user experience.

Speaker: Joe Sokol

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.

Ladies and gentlemen, please welcome Technology Manager for Development Tools, Godfrey DiGiorgi. Good morning. It's good to see that on a Friday morning we get as many of you here for an early morning session. How's the show been going for you? Good. Great. We've talked to you over the course of the week on various technologies in our frameworks and applications. We've seen Jaguar and all the new things we're doing there.

Today we're going to be focusing on a lot more of the tools in terms of their detail aspects and how to use them and stuff like that. So for the first session today we're going to talk about developing for performance. And to do that I want to introduce Joe Sokol, the manager of the Core OS Performance Group.

Hi, good morning. Hope everyone had a fun time at the beer bash last night. Maybe be a little bit awake this morning for this. Anyway, I am Joe Sokol, and I'm going to... Man, that's very sensitive. Okay. So today we're going to actually talk a little bit about performance philosophy as opposed to the actual APIs and things that you might use to make your applications go.

So the philosophy that we've kind of chosen to follow here is that improving system performance is like a diet. It's a never-ending battle. Once you start down this road, you've just got to kind of stay with the fight. And just like a diet, it's a lot easier to do if you're participating in a group.

[Transcript missing]

To win this battle, we need to look at using our system resources efficiently, both when the app is quiescent and when it's actively running and doing something. Now, quiescent does not necessarily mean that it's in the background. That app could be in the foreground. If the user is not interacting with it, it has no work to do, it should do just that. It should take no system resources. When the app is active, we'd like you to use the system resources as efficiently as possible to implement whatever work that you have to do.

You also need to pay attention, basically what we're asking you to do is pay attention to the fundamentals. Okay, so these are things that everyone has learned as an engineer through the years, but sometimes we forget, and what we want to do is really focus on why this is important to system performance.

There are no magic APIs. You know, I cannot give you an API that basically puts you on a slim-fast diet and makes your application more efficient, right? It's up to you guys to actually go and take a look at what's going on and make changes in how you're utilizing these resources.

Okay, so basically we're going to go through this in three stages today. I'm going to do a brief description of the difference between performance and efficiency and why you care about both of these concepts. We're going to talk about the fundamental elements of system performance. So these are the three things when I'm referring to system resources. These are basically the three things that we're talking about: use of the CPU, use of memory, and use of file system. And at the end, we're going to go through some processes and some lists of tools on how to evaluate your use of these system resources.

Okay, so performance and efficiency. They're not synonymous. People get these confused all the time. Performance is a result of efficiency. So the more efficient you make your application, the better it's going to perform in a variety of environments, not just in a kind of stand-alone test bed that you might be looking at.

And performance does not guarantee efficiency. So you may go and really work hard to make your app perform and meet certain criteria that your customers are asking for, but that does not necessarily guarantee that it's doing it in an efficient way. And in fact, often that's what happens. You set a performance goal, you reach that goal, and then you stop.

You may be using -- consuming 100% or 90% of the system's available resources at that point, but you've met your performance goal. The processes for improving both are very similar. We will address that in the third portion of this talk. We will also talk about why you should care about each of these points as we go.

So what is performance? It's the speed at which operations complete, right? It's something that is measurable. It's something that is perceived by the user, which is why you care, because it's also something that the user is going to call in and complain about, right, if it doesn't meet his requirements.

Some common performance metrics that are used to measure this, to measure how well you're doing, would be things like throughput, you know, so bytes per second, so maybe how fast you can move that file across the network. Operations per second, so that would be maybe like the number of frames per second that you could draw on the screen for maybe like a game, something like that. And of course, latency, which is pretty much what the user always sees, right? It's the response time.

So he sees all of these things as a request that then takes some amount of time to actually complete, and, you know, that's really what interests him. So how long does he have to sit in front of the screen there and wait for that operation to complete? So what is efficiency? Efficiency in this context is really the cost to complete an operation.

For any one of those given operations that your app is making many thousands of, what was the cost in system resources? If you have an efficient algorithm, it tends to minimize the use of system resources. Generally we look at it as an average.

The memory footprint, which is probably the most critical of the three to look at, because that kind of dictates whether or not things are really going to operate efficiently, and IO frequency. So how frequently are you going to the disk and for what size operations? Okay, so this all kind of rolls up into system performance.

System performance, when we talk about that, what we're really talking about, it's a result of all the individual apps and OS efficiencies, right? So when you roll that all together and you take a look at a system and it's got 20 applications running on it, plus a bunch of stuff that the system is also doing, the efficiencies of what each of them is doing really dictate how well the system performs. So if your working set size exceeds the amount of physical memory you have, I mean, the system performance just tanks. I mean, it's not a slow drop-off. It just becomes, you know, pretty much unusable.

Why is this important to you? Because the system performance affects the user's perception of any individual application performance. So you guys can work really hard to make your app perform really well. But if you've tested that in isolation and you haven't mixed it in with other applications to see how they affect it and what the system is going to do to it, when the user actually uses it, they may see a very different thing, right? Your testers are telling you, "Hey, we meet all of the performance goals that we set.

Everything looks cool." All of your customers are telling you, "Hey, you know, this thing just doesn't rock," right? It's not performing the way you claim it's going to perform. So how do we go about improving this? Because that's really the nut here.

All right. We focus on the fundamentals. Okay. There are three fundamentals that really control how well the system is going to perform and how efficient your app is. It's your use of the CPU. It's how much memory you're actually tying up in a working set. And it's how much file system I/O that you're initiating. When you look at it from that perspective, I mean, that -- every system performance problem that I look at, right, it always boils down to one of these three things.

All right, in looking at the CPU... There are two models that are commonly used. And I know that we've been driving this point home, I think, throughout the conference. I was at the Carbon event session. And we're asking you to be event-driven. This is really a good way to control your appetite for the CPU.

The important thing here is that only real work is done. So when you get an event, you respond to the event, right? You do the work that's necessary. You do it in an efficient way, and then you go back to sleep. The app can become totally idle, which is a really good thing. We're going to talk about a couple of implications of that here in a second.

The other way to go is more of a polling model. And the thing that's really bad about this is that often, in fact, the majority of the time when you poll, there's actually no real work for you to do. You run around your polling loop looking for things to do, and the majority of the time, there's no real work to do.

So, it's a pretty inefficient use of the resource. Now, the other thing that happens with the CPU is every time you use it, right, you're causing memory and possibly the disk to be accessed. So, it's a, you know, it's a VM system. You don't know whether or not that piece of memory you're touching is really physically present or not, or it's going to have to come in off the disk.
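To make the contrast concrete, here is a minimal, hypothetical sketch in plain C with POSIX calls; the file descriptor and handler are illustrative stand-ins, not anything from the session.

```c
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical handler standing in for whatever the app actually does. */
static void handle_input(int fd) {
    char buf[512];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0)
        printf("handled %zd bytes\n", n);
}

/* Polling model: wake up on a timer and check for work, even when there
 * is none. Most wakeups heat up pages and burn CPU for nothing. */
static void run_polling(int fd) {
    for (;;) {
        fd_set fds;
        struct timeval zero = {0, 0};
        FD_ZERO(&fds);
        FD_SET(fd, &fds);
        if (select(fd + 1, &fds, NULL, NULL, &zero) > 0)  /* just a peek */
            handle_input(fd);
        usleep(100000);  /* wake ten times a second regardless */
    }
}

/* Event-driven model: block in the kernel until there really is work.
 * The process stays idle in between, so the system can manage power. */
static void run_event_driven(int fd) {
    for (;;) {
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(fd, &fds);
        if (select(fd + 1, &fds, NULL, NULL, NULL) > 0)   /* sleep until ready */
            handle_input(fd);
    }
}
```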

So why event-driven? So really, fundamentally, what's really good about this? In terms of the fundamentals, anyway, it minimizes the working set. It just, if you're only responding to events and only doing the work that's absolutely necessary, you're going to touch a lot less pages. Touching less pages means that there's less paging going on in the system. It just means the system is going to feel that much more responsive. Another point that you might not think about is the ability of the system to aggressively manage its power.

In Mac OS X, we do all of our power management basically in the idle thread. So the system has to go idle before we can start to make use of any of the little hardware tricks for napping or dozing the processor or possibly doing frequency shifting if that processor is capable of doing that or it needs to be done.

And the more aggressively we can manage power, the longer a battery lasts in a portable-- and the fan just comes on less often, too, like in your Ti. So, you know, that's a good thing, too. People complain about that. They don't realize there's a fan in there until, you know, the system has been banging around for a while. And there are more CPU cycles available to do real work. So if you're not using the CPU to just go poll for things to do, that leaves CPU cycles available for other applications.

And why not to poll? So it's just the converse. When you're polling, you're basically keeping a larger working set heated. And that means there's going to be more contention for memory. When it's a single app doing this, it's easy to say, you know, it doesn't really matter that I'm going around this loop and making a few API calls and touching a little bit of memory.

You know, what can that hurt? But when you think about it in the context of there are 20 or 30 applications running on the system that people have started up, forgotten about, and they're sitting in the background, it starts to add up and it starts to make for quite a bit of contention for memory.

More contention for memory, more paging. If the system is paging, it's a sluggish system. It's just not going to perform very well. And then back to our power management point. The system, if we kind of take this to an extreme, the system may never become idle, or it may be idle for only short periods of time, not enough for us to really do anything with the power management, so the power management may be compromised.

And, you know, the battery life shrinks. The fan is on all the time. And people are unhappy about that. But that's a part of system performance that, you know, people pay attention to, right? How long that battery lasts. And it's an inefficient use of system resources. You're going to be using the CPU, touching memory, possibly causing pages to come in off the disk to find out that you don't have anything to do. There isn't really anything more inefficient than that. I mean, that's just about as bad as it gets.

All right, so touching memory. So we've got this CPU. He's executing some instructions. What happens? All right. So if everything is like really operating nice and tight in your cache, you're just sitting in this L1 cache and you're just banging away at it. The TLB there is the translation look-aside buffer for doing the virtual to physical address translations at the CPU level.

That's a good place to be. As memory use expands, then you get into maybe the L2 cache, the L3 cache. Now you're down to memory. We've color-coded this so you can see that as we move further down the chain here, things get slower and slower. So at this point, if we're actually executing out of memory, so if our working set, if the set of memory that we're touching has exceeded those first level caches, if we're executing out of memory, we're running ten times slower than the L1. And it's actually probably a little worse than that on a faster processor.

If we're actually hitting the disk, right, to go get data or to go get code that has to be executed in that polling loop, you're now running 10,000 times slower than the L1. So you can see why you want to stay out of that case. The other thing to note is that if you just touch one byte of memory, it has a lot bigger impact than you think.

Caches are organized around 32-byte cache lines, right? So to get that one byte in, you're evicting another cache line all through the L1, L2, and L3. The TLB table is built around pages, right? So if you have to address a byte in a virtual page that it currently doesn't have an entry for, it's going to evict an entry in order to make that translation.

And that basically represents a 4K set of addresses that are being tossed in terms of translation. Memory. Memory, we allocate everything in 4K pages. So the minimal size physical page of memory that you're going to get when you touch a byte is going to be 4K. So that means, again, your byte basically is causing 4K worth of data to be evicted. And the disk is managed the same way in terms of 4K pages to back that.
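As a rough back-of-the-envelope illustration of that point, consider touching one byte per page versus the same bytes packed together; this is a hypothetical C sketch, not anything shown in the session.

```c
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 10000   /* ~40 MB spread across 10,000 pages */

/* Both loops read 10,000 bytes, but the sparse walk drags in a cache
 * line, a TLB entry, and a resident 4K page for every single byte,
 * while the packed walk touches only a handful of pages. */
int main(void) {
    char *sparse = calloc(NUM_PAGES, PAGE_SIZE);
    char *packed = calloc(NUM_PAGES, 1);
    if (sparse == NULL || packed == NULL)
        return 1;

    long sum = 0;
    for (long i = 0; i < NUM_PAGES; i++)
        sum += sparse[i * PAGE_SIZE];   /* one byte per 4K page */
    for (long i = 0; i < NUM_PAGES; i++)
        sum += packed[i];               /* same bytes, densely packed */

    printf("%ld\n", sum);
    free(sparse);
    free(packed);
    return 0;
}
```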

But, again, up around the processor, if you can keep your memory appetite small and you can keep everything in the caches, or worst case, in the memory, the system performs a lot better. Once it goes to the disk, things get bad. All right, so now let's move on to, memory.

All right, so again, in terms of the fundamental thing that we're looking at here, it's really the working set. This is really the thing that controls system performance. And basically, a working set are the pages that are being actively touched by the application. So these could be in RAM-- hopefully they're in RAM-- but could be out on the disk also.

And let me tell you, when your working set-- if you have a set of pages, a set of memory that you have to go pounding through in order to complete an operation, and you can't get it all to fit into your physical RAM, that's a bad place to be.

You have to be aware of that fact that if you're banging on memory, it may actually be coming in off of a disk. What does the working set consist of in this case? Well, it would be all of your heap or heaps and any code that you're executing.

Every API that you use increases the size of your working set, right. Sometimes it has long-term effects. So sometimes it's short-term like it's -- you're heating up stack pages, running some code, other times, though, it's actually leaving memory allocations behind, right, so like window buffers, which can be quite large. So, again, be aware of the fact that each API you use has a cost. Leaks. Leaks increase size via heap fragmentation.

So -- Interestingly, as we've moved along in the OS development and the OS has become more stable, applications have become more stable, we've implemented things like sleep capabilities for the system. People are not rebooting their systems nearly as often as they used to. People are not quitting their apps either. So what ends up happening is you now have applications that are very long-lived.

They may be actually running for weeks at a time before they either quit or the system is rebooted. Time is your enemy when you have a leak. Even if the leak is slow, eventually it will start to build up. What will happen is, depending upon your usage patterns of how you're allocating memory and how those leaks are showing up, it may actually be increasing the size of your working set. So if the leak doesn't all happen at once and just get paged out, then you're going to have a lot of problems. You can look at your application, and even if it's a slow leak, go after it and get rid of it.

Another thing that you can assume is that long-term caches are always paged out. So if you're creating a cache and it's building up over time, when you go back to access it, be very careful in how you do this. Try to avoid walking through the whole cache. If you can just do cache lookups into it, that's fine.

But when you're doing maintenance, be aware that when you go running through the whole cache, Most likely, a really high degree of probability, it's been paged out, which means you're going to page it all back in to do some little bit of maintenance. You need to really be aware of that and make sure that if you do have to do it, do it in a way that's efficient and that you're not doing it too frequently. Best performance when all the pages are in RAM. Again, can't stress that enough. The system performs a lot better if it's not paging.

So how does your memory appetite and the VM, how do they interact? Well, here's the problem. Typically, 40% of the pages of the system are dirty. So these are valid pages that some application wants to keep a hold on, but they're dirty. They've been written to, which means that we're going to have to write those pages to disk whenever unused pages are not available.

New memory is accessed, so if you go and allocate a new big chunk of memory and then go and access it, the very first time you access it, we're going to have to provide you some physical pages. Well, those physical pages have to come from someplace, and the system is going to go through, and there's an LRU to keep track of working sets and all that.

But basically, when the system is running kind of saturated, we're going to have to page pages out. So when you initiate file I/O, that's also going to possibly trigger this. And of course, when you're paging in, you're paging -- this is the worst case. If you're paging -- you're paging pages back in that you need, and we're having to page pages out in order to make room for them, right? So you're getting a dual kind of I/O going on there that you don't want to have happen. Clean pages in the system are just stolen, but again, 40% of the time, when we go to steal a page, we're going to have to write it out when the system is kind of in a steady state, you know, fully used or utilized in terms of memory.

All right, so let's talk a little bit about working set impact. So here's a way to basically organize a cache that is probably fairly common, where you have a header attached to some data, and you just kind of allocate the whole thing, chain them together, and hopefully, you know, get some use out of the cache.

But the thing to remember here is that each time you touch any portion of that item--so this element--you're bringing the whole page in. So if you just need to run the list to update timestamps or, you know, do some other maintenance on it, and you're not actually even--you're not actively looking at the data at that point, you're still causing the whole page to be brought in.

So if you had a cache with, say, you know, 10,000 elements and it organized this way, and you chose to run through and do some maintenance on it, you're touching, you know, basically 10,000 pages of memory. Unlikely that that will be sitting in the cache or sitting in memory at this point.

So here's another way to organize it that reduces that in the case where you're having to do these maintenance functions. Break out the headers into their own separate allocation, right? Point them off to the data. So at least in the case where you're having to go through the headers to do something with the timestamps or whatever, it minimizes the impact on the working set. Now, with that said, there is still a downside to going this route.

There's a downside to it, and that's one of the things that you have to be aware of, which is if you're allocating these things into your cache over time, so if there's a lot of temporal spacing between the allocations and between the insertions into the cache, you're not really going to help if you do each of these headers independently because, again, they'll spread out through your working set, and we'll end up with each one of those little headers being in its own page, right, but mixed in with other things, not with your data, and it won't have accomplished anything.

So in this case, if you absolutely have to have a cache, when you break it into this kind of organization, make sure you cluster allocate those headers. So, you know, ask the system for enough memory so that you can have 100 headers, right? And then you keep track of that and dole them out, right, as the cache grows.

And when you run out, you ask the system for enough memory for another 100 headers, so at least you keep the headers compressed, right, in terms of the amount of physical memory that they're sitting on. So that's a lot better way to do it. Okay, so now we're done with memory. Moving on to file system I/O.
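A rough sketch of the two cache layouts just described, in C; the struct names, field sizes, and pool size are illustrative assumptions, not from the talk.

```c
#include <stdlib.h>

#define DATA_SIZE  4096
#define POOL_COUNT 100    /* grab headers from the system 100 at a time */

/* Layout 1: header and data in one allocation. A maintenance walk over
 * the list touches every data page even though only headers matter. */
struct fat_entry {
    struct fat_entry *next;
    long              timestamp;
    char              data[DATA_SIZE];
};

/* Layout 2: small headers kept separate and cluster-allocated, pointing
 * at the data. A maintenance walk touches only densely packed header pages. */
struct entry_header {
    struct entry_header *next;
    long                 timestamp;
    void                *data;     /* separate allocation */
};

static struct entry_header *free_headers = NULL;

/* Ask the system for another cluster of headers only when the pool runs
 * dry, so the headers stay compressed onto a few physical pages. */
static struct entry_header *alloc_header(void) {
    if (free_headers == NULL) {
        struct entry_header *block = calloc(POOL_COUNT, sizeof(*block));
        if (block == NULL)
            return NULL;
        for (int i = 0; i < POOL_COUNT - 1; i++)
            block[i].next = &block[i + 1];
        free_headers = block;
    }
    struct entry_header *h = free_headers;
    free_headers = h->next;
    h->next = NULL;
    return h;
}
```

With 10,000 entries, a maintenance pass over the first layout touches on the order of 10,000 pages; over the second it touches only the few hundred kilobytes of packed headers.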

All right, so sequential versus random file I/O. This is an interesting thing. Again, we've kind of come to expect that disks are fast. The kind of modern disks that we ship now are easily able to do, you know, 40 megabytes per second sequential, a single disk. But that's going purely sequential, which we often don't do in the system. And in fact, when we're paging, we never do. When we're paging, we're doing much more random accesses on the disk.

And disks really have not gotten any faster in the last six, seven years, eight years. They really don't randomly seek much faster. And basically, the throughput that you can get out of a disk that's purely random is only around 0.4 megabytes per second. So it's about 1% of the sequential throughput.

It's about 100 IOs per second. That's really kind of pushing it. That's like a 7200 RPM drive, right? Some of the slower drives in a Mac OS X system are really pushing it. But if you're using a portable, those drives don't spin that fast because of power considerations, so they're not even going to yield this.

So that's just something to keep in mind when you're accessing files, too, that, you know, sequential access patterns are a lot better than random access patterns. Now, larger I/Os are more efficient up to a point. We really like you to not do one-byte reads. That's not a very efficient way to get your data out of the file system cache.

So larger I/O requests are more efficient, but the thing to remember about this is that your I/O buffer that you've created and that you're sucking data into, that is eventually-- I mean, that basically consists of dirty pages after the I/O is completed, or while the I/O is completing. So a reasonable size is 128K.

I mean, that's--you know, it could be smaller. It could be a little bit larger. That's a pretty reasonable size that will pay back good dividends in terms of the efficiencies making the requests, but doesn't get too much in the way of the system in terms of a lot of dirty pages.
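As a sketch of that middle ground, here is a hypothetical C routine that reads a file sequentially in 128 KB requests; the per-chunk callback is a placeholder, not anything named in the session.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK_SIZE (128 * 1024)   /* big enough to be efficient, small
                                     enough not to pin many dirty pages */

/* Stream a file through a single reusable 128 KB buffer. */
static int process_file(const char *path,
                        void (*consume)(const char *chunk, size_t len)) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(CHUNK_SIZE);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    ssize_t n;
    while ((n = read(fd, buf, CHUNK_SIZE)) > 0)
        consume(buf, (size_t)n);   /* hypothetical per-chunk processing */

    free(buf);
    close(fd);
    return (n < 0) ? -1 : 0;
}
```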

Again, taking--you know, taking this kind of thought to an extreme, if you were to create an I/O buffer that was, say, I don't know, 32 megabytes in size, and you issue a single read of 32 megabytes to fill it, because you've got files that are gigabytes in size, and you think, "Well, Joe said, you know, "bigger is better when it comes to I/O, "so let me go to 32 megabytes." Well, what would happen there potentially is that as the I/O is completing-- because it doesn't all complete at once, right? Early pages are being filled with data.

As the I/Os are completing, we're releasing those pages back to the system. By the time that you actually get control back that your read is completed, if the system is busy doing other things, we may have paged those pages out for you, right? So your I/O buffer is now sitting out in the swap file. And when you come back to touch it, right, you're gonna pay the cost of sucking it back in off the paging device again.

So you'll actually end up with, you know, doing at least two I/Os there, probably more, because we'll probably also be paging stuff out again to make room for it. So that's why I say, you know, it's up to a point. You don't want to take this to an extreme. Another way to go would be to investigate doing some file mapping for reading large files. So in the Cocoa world, I know this is, you know, this is kind of the norm.

In the Carbon world, you know, this may not be something that you're used to doing, but it's something that you should investigate using the, you know, the basic system primitives for actually mapping files and touching them in that way. The interesting thing about this, of course, is that you don't create any dirty pages.

If you're just reading, the pages, you know, are brought into the system, but they're not dirty. They can just be tossed if they need to be tossed in order to make room. We don't have to page them out. And if you go back and retouch the memory, we just go back to the backing file and, you know, retrieve it from there. There's no swapping involved at all.
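For the read-only case, a minimal sketch of mapping the file instead of copying it into a private buffer (plain POSIX mmap; error handling trimmed, names are illustrative):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only. The touched pages stay clean, so under memory
 * pressure the system can simply drop them and refetch from the file;
 * nothing ever has to be written to the swap file. */
static const char *map_file(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping keeps the contents reachable */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return (const char *)p;   /* munmap(p, len) when finished */
}
```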

[Transcript missing]

in 10, we're actually keeping track of the access times in those cases. And those access times are part of the metadata. Now, we don't write those out right away, but every 30 seconds, the update process is gonna run, and it's gonna issue a sync on the system to try to sync out anything interesting to the disk.

And it's gonna go ahead and sweep all those up and sync them out to the disk. So if you're sitting there polling, looking for some work to do by enumerating the directory, looking for a file-- you know, say, waiting for a file to appear, basically what you're causing to happen is every 30 seconds, we're gonna write the disk. Okay, that's gonna keep the disks from going to sleep.

So that's not such a good thing. A better way to do that would be to stat the file or do a GetCatInfo and look at its modification date, look at the directory's modification date to see whether it's changed or not. Because the stats, just retrieving the information about that directory doesn't cause the access date to change. So-- in that case, you could sit there and keep, you know, accessing, looking to see whether anything new has come into the directory, but you're not having any impact on the file system in terms of IOs going to the disk.
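A tiny sketch of that stat-based check in C (the watched path and remembered timestamp are hypothetical details):

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Ask whether a watched directory has changed since we last looked by
 * comparing its modification time. stat() only reads metadata, so it
 * neither dirties anything nor forces periodic writes to the disk. */
static bool directory_changed(const char *dir_path, time_t *last_mtime) {
    struct stat st;
    if (stat(dir_path, &st) != 0)
        return false;              /* treat errors as "no change" */

    if (st.st_mtime != *last_mtime) {
        *last_mtime = st.st_mtime; /* remember for the next check */
        return true;               /* something was added or removed */
    }
    return false;
}
```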

[Transcript missing]

you have to be aware as developers that your application may actually be living out on a server. Or if it's not living on the server, maybe the preference files or your history files or your cache files that you're creating actually are ending up on a networked server someplace, not on a local disk.

So you need to take that into consideration. You need to probably make-- well, you do. It's not probably. You need to test with that in mind. So when you're doing your final testing to see how things perform, make sure you do some testing with things targeted to networked file systems, not just off the local disk, 'cause that's the way your customer is gonna use it.

And basically, the thing to remember-- network performance is unpredictable, right? So again, if you're designing something that is predicated on your ability to predict how long the I/O is gonna take to complete, then you need to really make your customer aware, then, that he needs those files to live on a local disk, that he cannot have them sitting out on a network store someplace.

And then the third point here is polling models tend to overwhelm networks. Again, when it's one or two or three machines that are doing it, it's not too bad. When it's 2,000, 3,000, 4,000 machines in a K-12, you know, institute or higher ed institute or, you know, a larger corporation, the network administrators really dislike that, right? If you're sitting there constantly, banging on the net, looking to see whether a file has come into existence or not, you know, that's just not a very good thing to do.

All right, so... The big message here, the fundamentals are important. It's the most important thing that you can be looking at, right, is that your app does not run in isolation. Your app affects and is affected by other apps. That's basically what we're saying here is that when you're testing, don't assume that just because you've reached a certain level of performance or efficiency that that is adequate, right, testing in isolation. You need to really look at it, you know, with some other typical applications that your customers might be running when they're running your application. And app efficiency, this directly drives system performance.

We can provide you very efficient APIs, which we really don't do yet, but we're working on it. But even if we did, if the app is intent on using a lot of memory, there's not a lot we can do about it. There's only so fast that we can move pages in and off of a disk, right? Most of our systems ship with one disk.

You know, there's contention for just normal file system mixed in with all the paging that's going on. Really, the only way to keep the system performance, you know, adequate is to everyone, you know, stay on this diet, right, and really control your appetite for memory. Minimizing the working size. I mean, that's, this is really the thing that we're driving, or I'm driving at today. I mean, this is the most critical element.

And to reinforce this, so let's take a look at what's going on in terms of working sets. Say you've got, you know, so you've got a system working set which basically consists of like the Finder, the Dock, all of the little daemons and things that run in the background to make the system work the way it does. Wired memory in the kernel.

There's a whole bunch of stuff that just kind of sits there. Some other set of applications, and I threw in, you know, IE and mail as some typical things that might be running on the system all the time. Stuff that you probably would never, you know, you never really quit. It doesn't have to be our mail program, but any mail program.

So say at this point, the sum of all the working sets, so the things that everyone, the working sets that the apps are keeping hot, either through their idle loops or through work that's being requested or event processing, whatever. Say that that represents 80% of the physical memory that's available.

At this point, this is really good. You've got 20% headroom here. The system is going to feel very snappy. You're going to be able to do things with the applications that you haven't been doing, and they're going to basically be able to happen without any page outs occurring. The user is going to be pretty happy in this case.

But now let's say the user decides he wants to listen to some music, so he brings iTunes up. So now iTunes is put in the background and he kind of forgets about it and it's playing elevator music or whatever. It's just providing some entertainment for him. And say that that, again, this is not to scale, not to say that iTunes utilizes 15% of any physical memory footprint, but let's say that and maybe some other apps that were launched pushed this up to 95%. Well, at this point, as long as you don't sneeze, the system is still going to be fine. So basically, don't touch it, don't ask it to do anything that it isn't already doing, and it's probably going to be okay.

But there's very little headroom here. At this point, almost anything that you do that you haven't been doing, so if you launch another application, if you print and you haven't printed in a long time, almost anything major that you do is probably going to cause some paging to happen. But it's still not, you know, it's not fatal.

Basically what will end up happening is there will be a little blip and then things will settle back down and, you know, okay, fine. You know, it didn't feel quite as good as it could have, but it wasn't horrible. All right. Now let's throw in, let's say, your app, for instance.

And again, not to say that any one app is going to take this much space, but, you know, just for purposes of keeping this short. And we push the working set up to 125%. So the sum of all the working sets. So this is, you know, every second, basically, you're trying to touch memory that represents 125% of your physical footprint. You are now in, like, no man's land.

This is, your system is thrashing. You don't want to be here. Anything that you do, anything that you try to do is just going to be painfully slow. And there's not a whole lot we can do about it. I mean, if the memory is being touched, we have to shuffle it in and out. And we do our best, but we can only go so fast.

All right, so how do we avoid that situation? We go on the system resource diet. Okay, so again, we've been talking about this. You need to minimize your working set, right? And a couple of good ways to do this. Lazy initialization, right? So when your app launches, don't initialize everything that it could possibly do and touch a bunch of memory. Wait until the user actually requests a particular service.

I know a lot of apps have a lot of really cool features, but they're not always used. Event-driven, again, we talked about that and we've talked about that in other sessions, I'm sure. That will tend to really help control the working set.
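A tiny sketch of the lazy-initialization idea in C; the spell-check service and its sizes are invented for illustration only.

```c
#include <stdlib.h>

/* Hypothetical expensive feature: a spell-checker with a big dictionary. */
struct spellcheck_state {
    char  *dictionary;      /* large buffer, expensive to set up */
    size_t dictionary_len;
};

static struct spellcheck_state *g_spellcheck = NULL;

/* Build the service the first time the user actually asks for it,
 * instead of at launch. Until then it costs no memory at all. */
static struct spellcheck_state *get_spellcheck(void) {
    if (g_spellcheck == NULL) {
        g_spellcheck = calloc(1, sizeof(*g_spellcheck));
        if (g_spellcheck != NULL) {
            g_spellcheck->dictionary_len = 4 * 1024 * 1024;
            g_spellcheck->dictionary = malloc(g_spellcheck->dictionary_len);
        }
    }
    return g_spellcheck;
}
```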

Don't optimize at the expense of efficiency. So again, if you look at it in isolation and you're trying to improve a particular aspect of performance in your application, and you choose to do this by, say, creating a very large cache because you say, hey, the disks are too slow, I don't want to deal with it. I'm going to pull all my data that I need, I'm going to slam it into a cache, and that's what I'm going to use.

Well, your app may actually end up running slower, depending upon what the other needs of the system are, other applications that are running. You can't assume that you can have all the physical memory, right? There are other apps running, they need to share that. And what will happen is your cache will just end up being paged in and out. And it will be paged in and out at our discretion and the patterns that we choose.

And that may end up making your app, you know, just not run as well as it could. Caches are not necessarily evil, but they should be used, you know, as a cache. If you're going to be accessing a particular cluster of data, you know, over a very short period of time, sure, pull it in, keep it in the cache, but basically understand that it might be better for you to release it when you're done and then recreate it later when you need it again as opposed to letting it go out, depending upon how long it, you know, takes you to create it.

You have to monitor efficiency during development, right? This is not something that you can do after the fact, or it's not done easily after the fact. You want to watch your CPU usage, memory footprint, file system IOs. I'm going to talk about the tools here in a couple minutes and, you know, how you would apply the tools to do this, but this is really critical. If you're not monitoring, you don't know where you're at. Don't test in isolation.

So once you've decided that your app is performing okay, come up with some kind of a test scenario that includes half a dozen or, you know, seven, eight other applications that you think that your customer might have actively running on the system. And remember that, again, we're encouraging them to leave the apps running, right? I mean, that's what the dock is for.

People launch applications. A week later, they're still sitting in the dock, right? They've forgotten that they even launched it. If that app is continuously polling, you know, it's putting a load on the system. It's not doing anything for the user at that point other than chewing up resources.

All right, so here's a recipe. This is a basic recipe for improving efficiency. And they're really, you know, they're probably different ways that you can cut this a little bit, but basically it all boils down to having some process that you can repeat. And something that is measurable. So you want to set a goal, right? You don't want to just go into it thinking, well, I'll just start, you know, making my code more efficient and we'll see where we, you know, kind of get to.

You want to set a goal. You want to look at how much memory you're using currently and say, you know, let's see if I can cut that down by a third or cut that down by half, right? Set a very aggressive goal. Identify a metric that you can measure, right? And then measure it. Get yourself a baseline. Really understand what your app is doing at this point.

Once you've done that, use the tools that we're going to talk about to analyze where that memory usage is coming from. So where in your application are you actually allocating all that heap space? And then go through, analyze, apply the changes that will bring that usage down, and then iterate until the goal is met. It's not a one-step process. Often you're going to have to go through this six, seven, eight times in order to actually hit the goal that you've set for yourself. it's unlikely that you're going to catch it the very first time.
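One way to get a repeatable number for the measure-and-baseline steps, sketched with plain gettimeofday; the operation being timed is a placeholder, and a real test would pick whatever metric matches the goal.

```c
#include <sys/time.h>

/* Time one run of an operation in milliseconds. */
static double time_operation_ms(void (*operation)(void)) {
    struct timeval start, end;
    gettimeofday(&start, NULL);
    operation();
    gettimeofday(&end, NULL);
    return (end.tv_sec  - start.tv_sec)  * 1000.0 +
           (end.tv_usec - start.tv_usec) / 1000.0;
}

/* Average several runs so the baseline is stable enough to compare
 * against after each round of changes. */
static double baseline_ms(void (*operation)(void), int runs) {
    double total = 0.0;
    for (int i = 0; i < runs; i++)
        total += time_operation_ms(operation);
    return total / runs;
}
```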

Okay, improving performance. So, it's the same recipe. So, I'm asking you to improve efficiency. Your customers are asking you to improve performance, right, or specific performance aspects of your product. You can use the same recipe for doing this, right? You set a goal, select a metric which will be different than one for efficiency, right? Measure it, apply, and iterate.

I mean, this will work, right? This works every time. Now, the thing to remember is you should be monitoring all those efficiency metrics while you're doing this so that while you're improving your performance in one area, you don't want to become less efficient in terms of your fundamental usage.

All right, so here are some tools that we can use to actually go after this stuff. So to look at what you're doing in terms of CPU usage, I've broken these things up into two categories, basically, the tools. There are tools for monitoring, which basically just give you an idea of where you stand, right, what--how your--or how much of these system resources that you're consuming.

So a good one for CPU usage, an easy one to use is top with either the minus D or the minus A option. The D option just gives you a snapshot, basically, you know, a delta, so what happened in the last period, one second period, five second period, whatever period that you select.

The minus A option is kind of interesting because it basically is an--it accumulates all of the data from the point that you launched top, and it gives you a snapshot then of what happened from that point through to the point where you decide to stop it. So it's really useful for kind of getting an idea of what impact you have on the system, like when you launch an app or you do a particular operation like maybe print or save a file.

You could start top with the minus A option up right before you start the operation and then shut it down right as the operation finishes, and it'll kind of tell you, okay, for that period of time, you know, this was my impact on the system. So it's a really useful tool.

[Transcript missing]

There's Thread Viewer, which will, if you've got a multi-threaded application, is very useful in terms of determining which threads are, you know, actually consuming the CPU. And then Shikari and Monster are two new apps, or two new utilities, that have a lot of power for really drilling down and finding out what you're doing to the hardware, right? So this basically gives you full access to all the performance registers, you know, either on an individual application basis or on a system-wide basis. But if you want to find out, or if you care what your cache hit rate is or how many branches per second that you're executing or, you know, on and on and on, those tools can, you know, really give you that view.

Okay, for memory footprint, again, top, just the basic default top will give you a pretty good view of what's going on in terms of how much memory you're utilizing. Top-minus-W gives you a lot more detail and also will provide you with a kind of running delta of what's happened since you launched top. So if you run it that way and you watch it over time, you can see whether or not your memory footprint is actually growing.

And it will detail that for you. So that's a pretty good one to run longer term. Like if you had an automated test suite, for instance, and you fired that up, I would fire up top-minus-W along with it, let the automated test suite run for however many hours or days that you run it, and then take a look at kind of, you know, how much, you know, did you grow in terms of your memory consumption. Leaks, okay, so again, after you've been running for a while, run leaks, right? Sometimes the leaks are small enough that you may not really notice them, but if they're, like, as we talked about earlier, if they're there, they do add up over time.

So you want to keep, you know, kind of on top of that. Don't let it go to the last minute. And then for analysis, we have heap, which will tell you where things are, I guess, organized, where your mallocs are sitting. MallocDebug and ObjectAlloc, which is more for our Objective-C friends, I think.

File system I/O monitoring. Top-minus-D is good again. That just gives you a good indication whether or not there's any file I/O going on at all, right? If you see file I/O going on to the disks or through the network, top will show you either. Then FS usage is a good tool for monitoring and getting a better view as to what's going on.

If you've used FS usage before, we've added a couple of new things to it for Jaguar. It now actually reports any physical I/Os that are actually occurring and also the physical block address on the disk and how much I/O actually occurred. So you can really get a feel for how random the performance of the disk really is in terms of the I/Os that you're issuing, where they're actually being pulled from the disk. It also breaks that up into plain data versus metadata versus VM operations. So it gives you a pretty clear picture of what you're doing to the disk.

Sampler has a nice feature in it that you can basically tell it to sit on any I/O operation. So you just kind of click a box, say, "I want to know any time an I/O operation is called, please record the call stack for me so I can find out how I got to that point." That's a very powerful way to be able to track where all the I/O is coming from because sometimes you don't issue it directly, right? It's some API that you're calling that ends up being a fairly expensive operation in terms of file system I/O. So this would allow you to see that that API, maybe you don't want to do that API in your polling loop.

So, in closing, applications that perform efficiently should be your goal. We want you, you know, we want all those great features. We want your apps to perform well, but we want them to perform efficiently because that makes our customers happy, right? That makes the whole system feel better. And it makes, quite frankly, you know,

[Transcript missing]

First, I want to thank Joe for an excellent presentation. I hope you got a lot out of that.

Some references. The Mac OS X System Overview book is always a useful reference for understanding the system architecture and how and where to get things done. Mac OS X Performance talks a lot about the same principles that Joe's been talking about, and they're available both on your disk in PDF form and as print-on-demand through the developer website. We have our Tools page on Apple Developer Connection where we talk about all of our performance tools and performance tools available from the third-party world.

A roadmap for today. Okay, in this session we have drilled rather deeply into principles of developing for performance. In the next session in the tools track we are going to talk about the compilers. And a good way of getting good performance out of your applications is to understand your compilation environment, understand exactly what the compiler is doing and where we are going with compilers. That is in room J immediately following this session.

This afternoon we have both the development tools feedback forum in room J1 and session 908, which is not listed on this slide for reasons beyond my control. Session 908 talks about getting the most out of project builder to utilize the best of the build system. And again the last session of the day, debugging in Mac OS X right in this room, 5:00 this evening. And with that I would like to invite the Q&A panel up to the stage.

For contact, discussion of performance, performance tools, if you have any need to get in touch with the engineering team, our development tools engineering feedback address, macos10-tools-feedback at group.apple.com (that's beginning to roll off my tongue lightly now), is a useful way of contacting the entire development tools engineering group, and I am always available as a primary contact.