WWDC06 • Session 406

Maximizing the Performance of Resource-Hungry Applications

OS Foundations • 1:05:21

All software developers want great performance, but users of resource-hungry products in the scientific, content-creation, and enterprise realms demand it. To optimize your application for the best possible performance on Mac OS X, you need to understand the low-level details of how the system manages memory, threads, and I/O. Come find out how to use Apple's tools, methodologies, and APIs to track, analyze, and optimize the performance of your most demanding applications.

Speaker: Mike Smith

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.

So, performance. Along with user interface and documentation improvements, it's probably the longest-standing feature on most of your to-do lists, certainly on mine. But before we talk a little bit about performance and the black art of performance tuning, it's probably a good idea to think a little bit about why we care about performance in the first place. I mean, performance tuning is a lot of work.

It can sometimes lead to some fairly uncertain results. We're all pretty busy anyway. I mean, if it's not a sure thing, why bother at all? So first and foremost, if you're competing in the marketplace against other applications, superior application performance is a first-order competitive advantage. Your customers that are dutiful at any rate will be comparing your app against others. Magazine reviews will also run competitive comparisons in which a 1% or 2% performance margin may mean the difference between the top and the bottom of the list of their recommended applications.

In addition to that, your customers themselves are probably your most powerful advertising medium. If they like your application because it performs well, they'll say nice things about you to their coworkers, to other people in the industry at conferences like this, and they'll also be more favorably inclined towards your other products.

You can impress your customers a couple of different ways. With the push towards more mobile systems in particular, a theme you'll be hearing from me quite a bit today is efficiency versus performance. If your application not only performs well, but is efficient on a portable system, that means more battery life and less heat. Both of those things are things that customers are very sensitive to.

In addition to that, if your application performs its basic tasks well, you have effectively surplus performance left. You can add animations, other user interface niceties that will again impress and please your customers. You can also potentially do more complex, more sophisticated things in your application's regular workflow. And if you're trying to add a new feature to your application, the performance of that feature is often one of the gating criteria. If you can't make it fast enough, it'll never make it in in the first place. And so performance doesn't just adjust how your customers feel about existing features. It literally determines whether or not you'll get a feature in the first place.

We're talking today particularly about resource-hungry applications, and so it's relevant and important to think about what hunger actually is in this context and how that affects the distribution of resources. Mac OS X is fairly free with its resources. The overall philosophy is that if an application wants something, the system will give it to that application.

So if we're talking about resources, what sort of resources? We have CPU time, time on I/O peripherals. We have access to and allocation of memory, storage space, mass storage in particular, power. And as I said, X is very free and giving with those resources. And so you need to be aware of the fact that when you're using a resource, there's a good chance that you're actually taking it away from someone else.

That actually brings up something of a critical point, which is that hunger itself, along with performance, is really a subjective thing. If there are abundant resources, the fact that you're using a lot of them doesn't necessarily matter a great deal. But if you're competing either with other applications, with the system, or with yourself, despite the fact that your workload hasn't changed, you are, in that context, a hungry application.

It's fair to say that when your application is doing work, everything takes time. Your code takes time to run, your I/Os take time to perform, the system takes time to respond to requests that you've made to it. And most of these resources, these time-bounded resources, are single-use. If you're running code on a particular core in the processor, no one else is running it at the same time. If the disk subsystem is doing work directly related to an I/O that you've issued, no one else is getting that same work done at the same time.

With regards to the actual application of these resources, CPU scheduling is managed in a fairly conventional priority-based round-robin scheme. We do have a number of tweaks in the system that attempt to maintain responsiveness. From the application perspective, you should mostly ignore those because we're trying to second-guess you. And if you second-guess us, bad things happen.

I/O scheduling varies greatly with the device. If your application is particularly sensitive to that, you need to actually spend some time to understand how individual peripherals respond to multiple requests, particularly in a heavily loaded situation. That's not really something I can offer you any general guidance on here. It's very much situationally dependent.

Memory is quite possibly the most abused resource in our system. Virtual space, most people think of as free, and to be sure, allocating virtual space doesn't cost you a great deal. It's a relatively efficient operation. It does, however, have some overheads. Every allocation that you make consumes wired kernel memory, and that wired kernel memory in turn is a consumption of physical memory. It also consumes kernel virtual space. So whilst it's mostly free, it's not entirely free. You find that you're competing for virtual space against other portions of your own application, as well as frameworks that you're linked against that are doing work on your behalf.

Of course, if you want to do anything with our virtual space, you need physical memory to back it. And physical memory is most certainly not free. Finding physical pages requires substantial work, typically because we are trying to satisfy the requests of everyone else in the system, including the disk caching subsystem. Physical memory is almost always spoken for.

If you have customers who have noticed, or if you yourself have noticed, that applications such as Top or the Activity Monitor report no free memory once a system has been up for any appreciable period of time: well, that's good. We're using that memory for something. But it means that if your application needs it, you're taking it away from whoever currently has it.

When thinking about the use of memory, it's important to understand that memory is managed in page-sized quanta. On all of our currently shipping systems, this is 4 kilobytes. It means that any time you touch memory or allocate memory and use it, you're using a 4 kilobyte quantum. The fact that you've allocated an 8-byte object notwithstanding, that's 4 kilobytes that you've consumed that are currently being denied both the system and potentially other applications.
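
A quick illustration of the page-quantum point (not from the session): you can ask the system for its page size and see how even a tiny allocation that gets touched pins a whole page.

```c
/* Minimal sketch (not from the session): page-sized accounting for a tiny object. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);    /* 4096 on the systems discussed here */
    size_t request = 8;                   /* an 8-byte object...                */
    size_t resident = ((request + page - 1) / page) * page;  /* ...still costs a page */
    printf("page size: %ld bytes; an 8-byte object touches %zu bytes of physical memory\n",
           page, resident);
    return 0;
}
```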

I mentioned that memory is taken away. So we use a modified LRU algorithm. Anyone who's taken a kindergarten-level operating systems course will be familiar with the two-handed clock algorithm. We've stuck with it for a long time for the simple reason that it works. We have a few modifications that help us prioritize things; for example, disk cache pages are reclaimed slightly more readily than other pages. But fundamentally, we stick with this because it's simple, it's fast, and for the vast majority of applications, it does a very good job.

Unfortunately, if you actually completed that kindergarten-level OS course, you'll be well aware that in situations of heavy page use, when the working set of the system actually outstrips the available physical memory in the system, the algorithm breaks down almost completely. And so there's a very sharp knee in performance when you manage to push the working set beyond the available physical memory. Virtually everything in the system will come to an apparent standstill. We'll talk a little bit about degenerate conditions later and why it's important for you to be able to recognize these and perhaps moderate your behavior.

Mass storage poses its own set of interesting challenges when it comes to optimization. If you are using large amounts of storage, you will have faced the challenge of actually organizing the data in that storage. It's very popular to use the file system namespace as a low-rent database, to partition your data into a collection of files and directories. Now, this can be fairly efficient, particularly if you don't feel like writing a database or if you want to take advantage of the fact that the file system will cache both the directory metadata and the file data in order to speed up your application.

But you need to understand that there are costs associated with all of this. Caching that file system metadata consumes, again, kernel virtual and physical space. It also imposes overheads in actually accessing these files. Each directory lookup may potentially involve a disk access. It certainly involves several different page references inside the kernel.

As far as allocation strategies are concerned, disk space is shared. You can certainly impose quotas, but they're typically used in an administrative fashion on a per-user basis rather than a per-application basis. So between applications, it's fair to say that disk space is shared on a first-come, first-served basis.

It is possible for an application to make disk space reservations if it's particularly critical that disk space actually be available. This is a feature that's only available on some file systems, but the vast majority of our systems are installed using HFS+ as the local root file system. So you can be fairly certain that if you need to make a reservation, you can do that on the root volume.
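
The talk doesn't show the call, but on HFS+ the reservation described here can be made with fcntl's F_PREALLOCATE; a minimal sketch, with error handling kept short:

```c
/* Sketch (not from the session): reserving disk space up front via F_PREALLOCATE. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int reserve_space(int fd, off_t bytes) {
    fstore_t store = {
        .fst_flags   = F_ALLOCATEALL,   /* all-or-nothing reservation     */
        .fst_posmode = F_PEOFPOSMODE,   /* allocate from the end of file  */
        .fst_offset  = 0,
        .fst_length  = bytes,
    };
    if (fcntl(fd, F_PREALLOCATE, &store) == -1) {
        perror("F_PREALLOCATE");
        return -1;
    }
    /* Extending the file to cover the reservation is a common follow-up. */
    return ftruncate(fd, bytes);
}
```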

As I pointed out to begin with, performance tuning is considered something of a black art, which is really kind of unfair because it's not super difficult. But to get folks started and on the right foot, I'm proposing a five-step program for better performance. You can come up with seven more steps if you feel that's important, but I'm going to stick to five. The first is to avoid unnecessary work. I'm going to expand on these more, so I'm not going to talk too much about them right now. The second, which seems like it completely contradicts the first, is to keep busy.

Third is to be a good citizen. As I've already alluded, your actions affect the behavior of the system, and likewise, the actions of other applications will affect you. It's important to get along. When it comes to actually attempting to assess your performance, don't guess. Don't say, "Hey, that looks snappy.

That was a bit quicker than last time, I'm sure." Measure it. And finally, when you're actually looking to find things that you can improve, take advantage of other people's work. Apple in particular provides you with an enormous collection of ready-optimized algorithms, code, and services. I'll talk very briefly about those a little bit later on.

Too many steps involved: modern software architecture emphasizes modularity and code reuse. There's great pressure on time to market and time to feature. These are great techniques for achieving those goals, but they introduce some unfortunate overheads, and quite often that can lead to significant performance problems. As I mentioned before, pretty much everything that you do takes time. The more time that you spend doing something, the longer it's ultimately going to take, and thus the lower the perceived performance.

But this actually brings up an interesting side point that I'm going to get back to a couple times again later on, which is the difference between performance and efficiency. Performance is typically measured in the time it takes to perform some task or the number of tasks that you can perform in some unit time. Efficiency you can measure slightly differently, which is the amount of total actual work done by the system in order to perform this task.

In the simplest situation, you have a chain of directly dependent steps, and so performance and efficiency are ultimately the same thing. But with multi-core, multi-threaded systems with asynchronous work, with peripherals doing work for you, the dependency chain between individual work units is not necessarily quite so straightforward. And so you can complete a task in the same elapsed time, but with very different efficiency. Efficiency is particularly critical when we talk about power, which I'll get to a little bit later on as well.

So doing less work seems obvious, but after a few years actually analyzing the performance of various applications, you realize that maybe this isn't necessarily quite the case. How much work an application does, the amount of work that it ends up doing in order to perform some sort of task, depends on a whole variety of things.

The architecture chosen for the application, the specific algorithms used in its implementation, the tools, the language in which it was written, the libraries, all that sort of stuff plays into the overall performance of an application, before we start talking about bugs, misunderstandings of components, and so on and so forth.

Talking about implementation decisions, and as I mentioned on the previous slide, there's a great trend towards component reuse. And when you combine that with a great deal of pressure on many of you to get things out the door as quickly as possible, you often find components that aren't really well suited to the task being pressed into service because they get you past the feature point.

There's also the fact that many of these components are opaque. You're encouraged to consider them as black boxes, and so you don't have a great deal of visibility into their overall behavior. And sometimes the behavior of those blocks is actually misrepresented, perhaps because someone wants you to pay for them, perhaps because you simply misunderstood the way that it's being described.

So when you're putting together a piece of software and you are using other people's components, you're using system-provided services, maybe you're just doing some stuff yourself, it's important to understand how much it actually costs you to do these things. There are some particularly relevant ones, given that I'm beating up on modular software design right now. Cross-module costs.

There's the obvious cost of calling another function: the fact that it potentially perturbs the code flow, it has impacts on the cache, the calling convention, the setting up of arguments and tearing them down. If this is a function that you call very frequently, you may spend a lot of time on meta-work simply calling it in the first place.

On the larger scale, you often find yourself performing format conversions, packing structures, building things up into XML plists just to tear them down again on the other side of the call interface. All of these things are costs that are directly associated with the choice of modular architecture, and these things need to be considered when you're actually designing your application. Or alternatively, when you're reconsidering the design of your application in order to meet a performance target.

There are a couple of special costs that are associated with calling system services. A lot of services on Mac OS X are implemented as separate processes, and so you'll find that there is overhead involved in inter-process communication. If you have the luxury of specifying or deciding on an interface, you may find yourself making trade-offs between convenience and performance.

Calling the kernel directly also imposes some overhead. There's a protection boundary crossing, obviously. There's argument marshalling. There are potentially credential operations because you may be performing something that's actually an authorized operation. As a general rule, it's good to assume that any time you invoke anything else anywhere that it costs, factor that in over and above whatever it may cost that other module to actually do the work for you in the first place.

Some specific examples of things that are good to avoid, as I mentioned: system calls. But particularly when you're making system calls, often you're talking about memory allocation, file system I/O. Frequent memory allocations. Some of you will certainly have seen a number of articles in the popular hacker press about the terrible performance that Mac OS X has, which usually devolves into, "My goodness, I'm calling malloc far too often." And typically what that devolves into subsequently is that malloc is making memory allocations in your process address space, which involves calling into the kernel.

That's something that is very good to avoid. There's a degree of locking and overhead involved in setting up and tearing down address space mappings. And if you're not doing very much work with the virtual space that you've set up, you're basically wasting your time. Likewise, small I/Os, particularly if these I/Os are actually going to hit a physical device, there's a good deal of set-up and tear-down involved in almost any I/O operation. So a small I/O operation's overall elapsed runtime is dominated by set-up and tear-down costs. You want to amortize those much as you amortize memory allocation costs.
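
A small sketch of the amortization idea for I/O (my example, not the speaker's): batch many small writes into one larger write so the setup and teardown are paid once.

```c
/* Sketch: amortizing many small writes into one larger I/O. */
#include <string.h>
#include <unistd.h>

#define BATCH_SIZE (64 * 1024)

static char   batch[BATCH_SIZE];
static size_t batch_used = 0;

void flush_batch(int fd) {
    if (batch_used > 0) {
        write(fd, batch, batch_used);     /* one system call for many records */
        batch_used = 0;
    }
}

void buffered_write(int fd, const void *data, size_t len) {
    if (len >= BATCH_SIZE) {              /* oversized records go straight out */
        flush_batch(fd);
        write(fd, data, len);
        return;
    }
    if (batch_used + len > BATCH_SIZE)
        flush_batch(fd);
    memcpy(batch + batch_used, data, len);
    batch_used += len;
}
```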

Inter-process communication: once again, we're talking about system call overhead costs here. If you're not communicating very much information between the two processes, you're paying a lot of overhead in order to get that information moved around. Shared memory operations, particularly if you're working on a segment that you have previously shared with another process, and you have lock-based or lock-free algorithms for moving that data around, can be much more efficient.
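
A sketch of that shared-memory alternative (the region name is invented, synchronization omitted): map one region with shm_open and mmap, then move data through it rather than through per-message system calls.

```c
/* Sketch: mapping a shared region two cooperating processes can both open. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void *map_shared_region(const char *name, size_t size, int create) {
    int flags = create ? (O_CREAT | O_RDWR) : O_RDWR;
    int fd = shm_open(name, flags, 0600);       /* e.g. "/my-shared-region" */
    if (fd == -1)
        return NULL;
    if (create && ftruncate(fd, size) == -1) {  /* size the region once */
        close(fd);
        return NULL;
    }
    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                  /* the mapping stays valid */
    return base == MAP_FAILED ? NULL : base;
}
```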

And obviously, try to only ever do anything once. Sometimes this isn't possible. Caching the results of an operation is just too expensive, consumes too much space, so you may have to trade off the cost of caching your results against simply re-performing the operation at a later time. But it's something to consider.

This one gets a slide by itself because this really sort of reaches into the practical implementation of the machine itself and how your code runs on it. Touching memory. Anything that you do that touches memory costs. If you're touching stuff in the L1 cache, that's cheaper than touching data that's in the L2 cache.

And if it's out in main memory and you have to take a full-on cache miss for it, that's going to cost you a great deal more. So anything that you can do to avoid touching memory helps, and this goes all the way from algorithm design, through data structure design, through the way that you actually lay out the tasks that your application performs. Avoid copying things.

I mean, aside from the fact that a straight-up copy wastes virtual and physical memory, the act of actually reading and writing that data costs time. It also costs power. So think about allocating memory for your data structures with the patterns that you actually access those data structures in mind. Allocate memory, and put things in that memory that you're going to hit multiple times with relatively good temporal locality of reference. Also consider physical locality of reference.

If you do that, if you cluster things appropriately, you can reduce the number of cache lines that your algorithms use. You can also reduce the number of pages that you touch, and thus the size of your working set, which incidentally also reduces the number of TLB entries that you'll actually require to get the work done. TLB entries themselves tend to end up in the cache, and so they compete with your data for cache space. So once again, using fewer pages saves you cache lines.

If you're trying to keep the number of pages that you use down, as I'm encouraging you to do, think about the way that you actually access your data. Linked lists are super convenient. There are great macros for them. There are marvelous classes. They're nice and easy to understand and debug. Unfortunately, once you put a few things in them, they perform really badly.

Typically, if you're looking for something in a linked list, and you have no other indexing for it, you're going to touch maybe half, depending on what data you're actually looking at, maybe half of the entries in the list. If the list has been built dynamically, they're probably scattered all over the place, and so you're going to be touching an enormous number of pages.

Now, it's fair to say that you've allocated those pages, and so you're not paying the cost for allocating them, but you are using those pages. And that means that if those pages have been reclaimed and they're currently being used by someone else, they have to be brought back. Any time that you touch a page that isn't currently resident, the thread that is touching that page will block synchronously until the page can be brought back. It's terrible for performance.

If you have the luxury of allocating your data structures up front, think about the patterns in which they're accessed and try and group members, assuming your data structure members are smaller than a page, within pages so that you have good physical locality of reference. You can also consider breaking the data structures down so that the portions of the data structure that you actually access are kept separately, and once again, cluster those within pages with good physical locality of reference. Most of you would have been exposed to the concept of indexing your data. It's a great idea.

Here's a really simple example of how to reduce the cache line utilization if you are stuck with or if you're considering a linked list implementation. In the first example, traversing the list, where you're comparing the tag against some key that you're searching for, you're going to hit two cache lines for every entry in the list. One for the tag, another one for the list entry.

In the second example, provided that the structure is allocated such that the first two members don't cross a cache line, and this is typically the case -- a structure this size is likely to be allocated on a size-aligned boundary or something fairly close to it, and if not, you can ensure that yourself -- checking the tag will have brought the pointer to the next list entry into the cache, and so when you go looking for the next entry, you're not going to stall waiting for a cache fill.
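
The slides themselves aren't captured in this transcript, so here is my reconstruction of the two layouts being contrasted; the field names and sizes are invented, but the idea is the one described above.

```c
#include <stdint.h>
#include <stddef.h>

/* Reconstruction, first layout: the next pointer and the tag sit far apart in
 * a large node, so a search touches two cache lines per entry. */
struct node_v1 {
    struct node_v1 *next;
    char            other_data[120];   /* pushes the tag onto another cache line */
    uint32_t        tag;
};

/* Second layout: the tag and the next pointer are adjacent at the start of the
 * node. If they don't straddle a cache line, checking the tag has already
 * pulled the next pointer in, so the walk doesn't stall. */
struct node_v2 {
    uint32_t        tag;
    struct node_v2 *next;
    char            other_data[120];
};

struct node_v2 *find(struct node_v2 *head, uint32_t key) {
    for (struct node_v2 *n = head; n != NULL; n = n->next)
        if (n->tag == key)             /* tag and next share a cache line */
            return n;
    return NULL;
}
```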

So the second step that I brought up, which does indeed sound like it's somewhat contrary to the first, is to keep busy. I should perhaps qualify that. When you have something to do, you should always be working on getting it done. All the modern Macintosh systems you can buy are at least dual-core. Some of them have more. It's an industry trend that I think we can see continuing.

And so this means that if you actually want to maximize the performance of your application, if you want to utilize all of the performance that's available to you in the system, you need to start thinking about how to get concurrent work performed on your application's behalf, whether your application itself does it, whether you are finding enough work for other services in the system to do, that you can keep all of those cores busy all the time. If you are working on a single-threaded application, you're going to find that the overall throughput of your application, and thus its potential performance, is nowhere near up to the maximum potential of the system that you're on.

There are a number of system services. In fact, the vast majority of them can be used in asynchronous fashions. The POSIX asynchronous I/O. There are an enormous number of callback-oriented operations in the Carbon APIs and so on and so forth. These are all great ways to have things done on your behalf.
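
The session only names the facility, so here's a minimal sketch of POSIX asynchronous I/O (aio_read) as one of the asynchronous routes mentioned; the wait loop at the end stands in for the useful work you'd actually be doing while the I/O completes.

```c
/* Sketch: issue a read asynchronously and collect it later. */
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

int start_async_read(int fd, void *buf, size_t len, struct aiocb *cb) {
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = 0;
    return aio_read(cb);            /* returns immediately; the I/O proceeds */
}

ssize_t finish_async_read(struct aiocb *cb) {
    while (aio_error(cb) == EINPROGRESS) {
        /* Real code does other useful work here, or uses aio_suspend(). */
    }
    return aio_return(cb);          /* bytes read, or -1 on error */
}
```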

If you are likely to actually block, then you need to arrange one way or another for there to be someone else in your application, another thread, another process somewhere else, that's willing to pick up the slack, that is likely to be ready to be able to run to get something done so that your task is completed.

There are a couple of overheads associated with this. We'll talk about a few of them in a little bit. These are really things that you need to consider in the design process as you partition your work. A lot of applications-- this is particularly the case with games-- tend to have been implemented in a single threaded fashion because that's the easy way to think about things.

It's the way that we tend to do things. We'll do one task, and then another, and then another. So there's a real body of design work that needs to be done at the architectural level for most-- that's probably unfair-- many applications. And so, thinking about these pitfalls is fairly germane at this point.

There's a lot of folklore out there about threading and locking. Some of it is hard-won from long experience. Some of it is perhaps not so valuable. I'm going to bring up a couple of quick starter points to think about that combat a couple of these common misconceptions. But overall, the point that I raised earlier about how important it is to measure and understand the behavior of the things that you're doing in your application is paramount here.

There's a very common conception that locks are basically free unless they're contested, and unfortunately that's just not true. Locking and unlocking imposes a fairly significant overhead on the system. And that's exacerbated even more by the fact that, particularly if you're pursuing a very aggressive lock/unlock strategy, the party most likely to take a lock that you're just about to release or you're considering holding for some time is likely to be yourself again.

So there is considerable value in considering exactly how likely contention is to be, and then, once you've actually implemented, in measuring to ensure that your assumptions about contention are actually correct. If you're not contending on a lock, but you are taking and releasing that lock a lot, it suggests that maybe you could afford to hold it for longer and spend less time taking it and releasing it.
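
As an illustration of that amortization point (my example, not the speaker's): taking a mutex once around a batch of updates instead of once per item. Even uncontended, N lock/unlock round trips cost more than one.

```c
/* Sketch: coarsening lock acquisition around a batch of work. */
#include <pthread.h>

struct item;                                  /* opaque payload type */

pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

void push_items_batched(struct item **items, int count,
                        void (*push_locked)(struct item *)) {
    pthread_mutex_lock(&queue_lock);          /* one acquisition...         */
    for (int i = 0; i < count; i++)
        push_locked(items[i]);                /* ...covers the whole batch  */
    pthread_mutex_unlock(&queue_lock);
}
```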

It's particularly important to bear in mind that locks are serializing operations in order to actually provide the guarantees that they do about access to data under and not under the lock. They take one of the major advantages of modern microprocessors, that is the ability to perform operations out of order, and they toss it all out the window.

It's actually worse than that because in order to maintain these guarantees in multiprocessor systems, every other processor in the system has to be aware of the lock in some fashion. And so there are system-wide overheads. Anytime you take a lock, you're affecting every other processor in the system.

And so again, there is considerable value to be had in amortizing your use of those locks. The cache is another resource that's very heavily impacted by concurrency. There are many algorithms out there that either attempt to automatically determine the size of the cache available to you or query the system for that size, and then make assumptions about their likely performance based on their usage of that cache.

If you're running on one of our current Intel-based systems, and if you've actually looked at the architecture of these systems, you will have noticed that cores share cache. This means that not only code that you're actually aware of running, but code that you are completely unaware of is competing with you for cache space. Given that you have no idea what this code is doing, it may be making assumptions about cache usage as well. You have to be much more careful in algorithmic assumptions that you make about the usage of the cache.

And you need to be aware of what's going on while your algorithm is running, because you may find yourself needing to adapt your behavior based on what you discern about other usage of the cache. It's also worth bearing in mind that if you call out, going back to the modular software issue again, if you call out from your algorithm to some other section of code, that call and what that code does will also affect your cache usage.

Mac OS X is a multi-user, multi-application, multi-processor, multi-just-about-everything operating system. And this is particularly relevant to your applications because your application will never own the system. Even if you're the only app running, well, even if you think you're the only app running, you're not. There's a whole bunch of other ones. They're all doing stuff.

They all have expectations about what the system may or may not do, at least at some level. Anything that you do on the system, regardless of what it is, has some impact on other applications in the system. If you're consuming CPU cycles, they're CPU cycles they might not have had. If you're doing disk I/O, that's disk space and disk bandwidth that they don't have.

And in fact, what you do affects you as well. If your working set is sufficiently large, you may find yourself cannibalizing yourself. If you have a busy thread, you may be taking processor time away from other threads. And of course, what other applications do is going to affect you as well.

There are a few applications that have very consistent runtime environments. If you're building a large scientific application, you have 500 machines, a nice air-conditioned room, you can make some fairly steady-state assumptions about what the system environment is going to look like. You've got a pretty good idea of how much physical memory you've got. You've got a pretty good idea about how likely you are to be competing with someone else for processor time, so on and so forth.

But this really isn't the general case. Most of your applications are going to be run in a fairly uncertain environment. Even if your code starts out in a controlled research environment, as we've seen over the years, that code tends to migrate into general usage.

If what you're doing is actually even vaguely useful, someone is going to say, "I can build a great app around that," and they're either going to buy your code, use it if it's open source, or they're going to reimplement something based on whatever you've published. So it's very important for you to understand the runtime environment for your applications and what your expectations about that runtime environment are.

It's also really important for you to make good use of your resources. Any use of a resource implies that you're doing some sort of work. As we've already considered, doing work is something to be avoided at all costs. So make sure that when you're actually using a resource, you have a need for it.

You're cannibalizing those resources from someone else, in all likelihood. And so you want to consider the impact that that has on them, and what impact may actually come back from your doing that. And again, the memory point, and a term that you may hear us bandy around a little bit, memory pressure.

Because pages are reclaimed on an as-needed basis, and that need is driven by the size of the working sets of the applications that are currently active, we describe the desire for physical memory as pressure. It tends to affect the rate at which pages flow through the LRU queues, which is where the pressure term comes from.

The greater the pressure, the faster the recycle rate for pages, the greater the chance that your application is going to hit a page that it needs that, due to this pressure, has been reclaimed for use elsewhere. If you consume fewer pages, you reduce that pressure. We've already talked about some of the ways that you can reduce page consumption. Those things have a direct flow on to the overall performance of your application and every application in the system.

It's also good to consider holding resources for the shortest possible period of time. There's obviously a trade-off here, as I already mentioned. Allocating these resources takes time. Freeing them also takes time. But there's a cost to your holding them idle, in that they are quite often denied someone else.

Given that other applications in the system can consume resources and that we will give those resources to them on an as-needed basis, you can't count, at any particular point in time, on having the resources that you actually need. I'm sorry if this sounds kind of unfair, but if someone else needs them and we think that maybe they need them more than you or they're just asking more frequently, we're going to give them those resources, because that means they'll-- at least we believe that means they'll get their job done.

You can generally assume that you get a few CPU cycles and that you have some pages for your program text to be in. I mean, it's not realistic to expect an application to adapt to its environment if it can't run at all. So, perhaps that makes you feel a little bit better with the situation.

But the availability of pretty much any other resource in the system can change on a more or less instantaneous basis. Even while you're running, even if you think that interrupts happen to be off because you're in a device driver or something, someone else somewhere else in the system can be running on another core and get their hands on a resource that you think you ought to be able to have.

Power management will also play into this. Again, in order to keep the power consumption on portable systems down, we will moderate the processor and performance of various other components in the system. And so assumptions that you've made about the overall throughput of the system, even just given straight-up resource availability, aren't necessarily valid. All of this uncertainty means that you need to think about your application's ability to deal with shortages.

I don't want to make this the normal case. Obviously, from a performance perspective, you want to encourage your users to run your application in an environment where the resources that the application is going to need will be available. That isn't always the case, and there are quite often situations where they'll want to do something else. They are willing to deal with some degradation in the performance of your application.

But if your application is destroying the system by consuming too many physical pages, or it's just performing impossibly badly, that's not necessarily a great user experience. So it's important to think about how you go about mitigating the effect of resource shortages on your application. This starts with understanding what your application actually needs in order to get work done so that you can detect when those resources aren't available.

If you have critical performance criteria with regards to resources, try reserving those resources. In certain very specific situations, and I want to layer this particular point with as many caveats as I can, you can wire down physical memory. That is, you can guarantee that it will not be reclaimed, that those pages will always be present.

Obviously, that's fairly uncivilized, but if you can't afford to take a blocking fault for an absent page, wiring down is your only alternative. You can use the thread scheduling facilities we provide to ensure that you will get adequate CPU time, although bear in mind that other people can use this technique as well. You can also obviously pre-allocate disk files, as I mentioned earlier.
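
The talk doesn't name the APIs; mlock is one user-level route to the wiring described here, sketched below with the same caveat about how uncivilized it is.

```c
/* Sketch: wiring a critical buffer so touching it can never take a blocking
 * page fault. Use sparingly -- this memory is denied to everyone else until
 * it is unwired. */
#include <stdlib.h>
#include <sys/mman.h>

void *alloc_wired(size_t size) {
    void *buf = malloc(size);
    if (buf == NULL)
        return NULL;
    if (mlock(buf, size) != 0) {     /* pin the pages in physical memory */
        free(buf);
        return NULL;
    }
    return buf;
}

void free_wired(void *buf, size_t size) {
    munlock(buf, size);              /* let the pages be reclaimed again */
    free(buf);
}
```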

If you can't meet your acceptable performance target, it's good to have a strategy to fall back on. Users don't much like applications that spin the beach ball. They don't like applications that take inexplicably long to do things. Having a fallback strategy that perhaps performs half as well but uses a quarter of the resources is one possible way to deal with the situation. Having a dialog that says, "Look, you're just doing too much. I can't actually get this done. You're not going to be happy. Do something to the system. Deal with this," is maybe another way of going about it. At the very least, you're being honest.

I mentioned before that power is a resource. Traditionally, power has been considered a resource that the operating system is really responsible for, but this goes all the way up and down the stack. The hardware itself strives to be as power-efficient as possible. The operating system strives to be as power-efficient as possible. But the biggest consumers of power in the system, by far, are applications: applications doing work. So you need to, once again, consider efficiency.

The system's power management both helps and to a degree hinders your application's performance. As I've already mentioned, the system may elect to reduce the performance of certain components in order to meet power and thermal guarantees that are made by the system. That being said, these algorithms do understand that your application needs to get work done.

Generally, again, not a good idea to try and second-guess them, but one key point that I can make is that if your application can continue to do work, if it can avoid performing work in a bursty fashion, it will interact better with the system because that gives the system a chance to recognize that it's trying to get work done.

Change is a constant, what can I say? If you're building your application to be aware of changes in its environment, to actually detect its performance, to understand the availability of resources in the system, you're building a good deal of change-proofness into your application, because you're no longer making assumptions about the fact that you're running on a 1.8 gigahertz MacBook Pro. You are looking at your throughput and the availability of resources to you, and so you can say, whatever the system that I'm running on happens to be, this is how I behave. This is another thing that will come back to making your customers happy.

The key to improving performance in your applications is to measure them. Measurement isn't glamorous: it's boring, it's a whole bunch of numbers, it's occasionally some interesting-looking graphs, but it's really effective. And it's effective because in order to actually get anything done when it comes to tuning performance in your application, you're going to need to focus your efforts. You can't afford to spend however many years it would take you to go over every line in your application, line by line, and contemplate its overall impact on the grand scheme of things. It's not a practical way to go about optimizing performance.

That being said, if you do understand the interplay of components in your application, that's a great place to start. But be careful about guessing. Guessing is a great way to make mistakes and spend an awful lot of time doing something that isn't actually going to help you at all. Modern software tends to be extremely complicated, and the interplay between components is often very subtle.

So if you're going to measure things, try and measure in concrete terms. Elapsed time for a task, total resources consumed for a particular task, that sort of stuff. If you do this, and if you do this in a repeatable fashion, you can actually integrate it into your development workflow.

It means that from build to build you can monitor your performance, you can actually justify the time that you spend on performance improvements if you happen to be in a management situation that requires that sort of thing. And you can also say, "Look, here is our benchmark here, here is our benchmark here, we've improved performance 15%, that's a marketing bullet." What you actually choose to measure is pretty much open to debate.
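
None of this is from the session, but as a sketch of the kind of concrete, repeatable measurement being described: a tiny harness around a placeholder task() using mach_absolute_time, converting ticks to nanoseconds via mach_timebase_info. Something this small can run from a build script and log a number per build.

```c
/* Sketch: a small, repeatable timing harness around a key task. */
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

uint64_t elapsed_ns(void (*task)(void)) {
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);                     /* ticks-to-nanoseconds ratio */
    uint64_t start = mach_absolute_time();
    task();
    uint64_t end = mach_absolute_time();
    return (end - start) * tb.numer / tb.denom;
}

static void task(void) { /* the operation your customers care about */ }

int main(void) {
    printf("task took %llu ns\n", (unsigned long long)elapsed_ns(task));
    return 0;
}
```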

Obviously, you need to measure the things that your customers care about. If there are key tasks that are performed by your application, those are great things to measure. When it comes to efficiency, consider the things that we've been talking about already: CPU cycles, pages used, that sort of thing.

Measurement sounds like an awful lot of work. And it is, I guess. But we give you a bunch of tools to make life much, much easier. I'm not going to talk too much about Shark. There have been other presentations here. There is an enormous amount of documentation on the web. Shark is a great tool for understanding what your code is doing in the system.

It's also fairly helpful in suggesting what you might do about things at the lower level. Obviously, it doesn't know enough about what your application actually does at the user level to give you high-level architectural suggestions. But when you're looking for visibility into the system from a tool that understands the system, Shark is a marvelous tool.

This week we've been introducing a new thing called DTrace. This comes from the good folks at Sun. DTrace is not, in and of itself, a performance tool. It's an architecture and infrastructure for building tools that can be used for both debugging and for performance measurement. If you're developing-- if you're interested in developing a specific performance evaluation harness for your software that considers its use of resources and its interaction with the system, DTrace gives you visibility from the top of your application all the way through to the very lowest levels of the system.

And there are a whole bunch of more traditional tools, and it's in fact the traditional tools I'm going to address because if we're talking about focusing effort, the first thing that you want to be able to do is to start with a more or less clean slate and narrow in on the one or two worst offenders in any particular situation.

Anyone who's worked on a Unix system for any length of time will be familiar with Top. I'm not going to ask you to read those numbers, although they're actually pleasantly large. Top gives you a good view of instantaneous overall system activity and which processes are doing things. This is the one I like the most because it gives me a rapid summary of page-in and page-out activity, which is very relevant to the overall system working set, total I/Os, and system calls, and it sorts the worst offenders to the top.

Understanding the use of the file system by an application is actually fairly difficult. If you haven't instrumented the application to tell you exactly what it's doing with the file system, it can be quite difficult to actually-- well, without this tool, I should say-- it can be quite difficult to understand what's going on. fs_usage will tell you what your application is actually doing with regard to the file system.

Paging and page-in activity, how long the I/Os actually took, how large they were. If you are beating on the file system very hard and you don't understand why something is taking too long, or you just simply want to understand what's going on, fs_usage is where you start. Likewise, sc_usage tells you the same sort of thing about system calls.

It's a little more specific than Top, in that it actually breaks down the system calls by frequency, gives you some other useful statistics. If you're not quite sure what your application is doing because you don't necessarily have a great deal of visibility into it, this is another good one to look at.

I talked about virtual memory usage. vmmap will give you a detailed breakdown of all of the mapped regions in your process's address space. If you're trying to work out where your virtual space is going, start here. The heap tool goes a little bit further and actually breaks down the internal data structures used by the system default malloc,

and it can also break down Objective-C object allocations. If you've decided from your use of vmmap that malloc is the primary offender in consuming virtual space, this is where you go next. There are a number of other malloc-related debugging tools, malloc_history in particular, that can be used to track that sort of thing even further.

But typically I would say that once you've ascertained that you have a malloc-related issue, what you actually have is an object management issue inside your application. And so you're going to need to look at the way that you manage objects. And at that point, you might want to consider either building a custom harness or using DTrace in order to better understand how you're actually allocating objects. You may also simply be leaking.

Okay, so you've taken a bunch of measurements. Now what? Oop, come back here. There's a whole bunch of numbers. What do they actually mean? How do you use them? The first and most important thing is to actually understand how these numbers relate to what your application is doing.

Once you've done that, you can start looking at particular portions of your application that are responsible for those numbers and understanding how the code is being used in order to in order to reach those numbers. Once you've achieved that sort of understanding-- and you'll note that this is an understanding of potentially a relatively small portion of the application, rather than understanding the whole thing, which is the whole objective of this focusing part-- you can start thinking about how to improve it.

Unfortunately, it's at this point that performance tuning becomes very application-specific. We can talk about general rules about resources and so forth, but the actual flow of data inside your application, the flow of work inside the application is what largely determines how all of that is consumed. And so it's at this point that you, the owner of the application, really have to come up to the front.

But there are some things that you can think about. The numbers will typically tell you that you are doing something a great deal. What are you doing? Are you doing it too often? Could you do it less? Are you doing it in an efficient fashion? These are specific things that you can ask about those actions.

You can also use the results of your performance measurements to understand, if you're not necessarily suffering from performance problems, what resource expectations you actually have. In order to perform a particular task, how much of whatever this particular thing is measuring are you consuming? That leads, of course, to the ability to deprive the application of those resources and to understand its behavior in a degenerate case.

I'm going to pull together a really brief example here. This actually occurred while I was writing the slides for this. I thought that I could use some time off, and so I went out and bought a spiffy new game and installed it. I'm not going to finger the developer of this game because they were very responsive without any action on my part in producing a patch that addresses this issue.

Having installed the game and, of course, wanting to minimize it so that I could get back to working on my slides, I noticed that the system performance was terrible. And so the obvious conclusion is that this application is consuming resources, it's not necessarily being a good citizen. So I start with top.

top -dux are the arguments that I like to use. And right there at the top is my new game. It's using a great deal of CPU. And it's making an inordinate number of BSD system calls. In this particular sample, there are 474,000 BSD system calls. This is a one-second sample, by the way.

So, BSD system calls, lots of them, huh? So, let's have a look at what sort of system calls it's making. Someone's taken away my screen resource. There we go. Wrong button. I always do that. And so we can see there, about halfway down the screen, that in this particular one-second sample, we've made nearly 680,000 kevent system calls. This is really not good for performance.

Fortunately for me, the developer of this application left all the symbols in it. I broke out another one of my favorite "what on Earth is going on" tools, which I didn't mention before, called sample. Some of you will be familiar with the sample application functionality in Activity Monitor. This is the command-line foundation for that tool.

It basically takes a call-graph sample of all of the threads in an application for a period of time. In this case, I asked it for 10 seconds' worth of sampling. There's an enormous amount of other completely irrelevant rubbish in the backtrace. Like I said, all the symbols are in there, so I can see their Direct3D emulation on top of OpenGL and all sorts of other stuff going on.

But all that looks like real work. It looks like stuff that's actually germane to the game running. But there's also this particular one. We're looking for kevent system calls, right? So of the 930 samples, yeah, 930 samples for this particular thread, about 751 of them were inside kevent, and about 150 of those were downstream of kevent in a function called findChangeHandle.

So something is going on that's causing events to be sent to this application, and, a bit more to the point, to this particular handler. At this point, my ability to actually work out what's going on has pretty much run out, because I'm going to need to get the source to the application, or I guess I could disassemble it. But I'm going to need to know what this findChangeHandle function actually does in the larger context of the application.

But in just a few commands, I've narrowed it down from "my system performs really badly because I'm running this game" to "what is this function doing?" And this is the beginning of performance analysis. Incidentally, the OTAtomic-- was it? There we are. Yes. OTAtomicTestBit, a function that's called out there.

Not really a great example of performant coding. It's a function, a regular function. It's not inlined or anything like that. It masks a single bit in a byte. But in order to do this, despite the fact that the access to the byte is atomic on all the systems that this code is likely to run on, it takes and releases a lock.

Plus, it also calls out to another function just in case that lock hasn't actually been initialized in the first place. So this is a good example of the kind of thing to hunt for in your code when you're looking for low-level performance wins. If you see functions that are implemented like this, with very short amounts of work bracketed by locks, these are great low-hanging fruit to attack.
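
As a contrast (my own sketch, not the game's code): the OSAtomic family in libkern/OSAtomic.h can test and set a bit with a single lock-free primitive, which is the kind of replacement this low-hanging fruit usually wants.

```c
/* Sketch: one atomic primitive instead of a lock bracketing a one-instruction mask. */
#include <stdbool.h>
#include <stdint.h>
#include <libkern/OSAtomic.h>

static volatile uint32_t flags = 0;

bool mark_dirty(uint32_t bit) {
    /* Returns the previous value of the bit; no lock is taken or released. */
    return OSAtomicTestAndSet(bit, &flags);
}
```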

So, on to what I believe is the fifth step: taking advantage of work that other people have done, or in the case of system-provided services, work that people will continue to do in the future. Before you can actually take advantage of this stuff, obviously, you have to actually know what it is. We do a pretty good job, I think, of advertising the functionality that the system exposes to make life easier for you.

It's not too hard for you to browse the documentation that the good folks in DTS provide. And they're always willing to answer your questions about how to actually improve the performance of your application, leveraging system services. These are things that we expect you to be using, so we're prepared to deal with you asking questions about how to actually take advantage of them.

The critical point I brought up talking about the previous slide is that these services will continue to improve. Because we ship them, because we use these services ourselves, we constantly optimize them for new platforms. We constantly improve them on existing platforms. They're basically a free performance upgrade for your application if you're using them.

In adopting a system service, there is always a cost. I talked before about intermodule costs. Here I am advocating modular software design and code reuse after just bashing it earlier on. You've got to understand there's a trade-off involved in all of this. Typically, though, leveraging a system service that does something even moderately complex is going to be easier than implementing it from scratch yourself.

On top of that, these services typically maintain stable interfaces for long periods of time. So your adoption cost is usually only paid once, provided of course your application's internal architecture remains likewise relatively stable. If you duplicate some particular piece of performance-sensitive functionality and the system changes, your implementation will need to change as well. And that obviously imposes an ongoing maintenance burden that you probably don't want to adopt. It is worth bearing in mind though that system services with nice, stable, generalized interfaces often impose some penalty in order to achieve that sort of generality.

There's also potentially a loss of critical differentiation. If there's a system service that does something that you honestly believe and can measure that you do better, if you adopt the system service, you are potentially losing that critical differentiation. That being said, if what you're doing is something that is heavily used by other applications, we're going to be optimizing that as well. And so the cost that you're paying in order to maintain that critical differentiation may ultimately come to nothing.

I started off with about eight slides worth of examples of system services, which would make for a really boring portion of this presentation. So I culled this down to a list of just a few high-point examples. I've seen people re-implement far too often. I really wish they wouldn't.

memcpy is right at the very top of the list. I talked about copies being evil earlier on. There are times when it is simply inevitable that you must copy information. You're about to make a change to it. You're going to throw away the old version, and the old version is going to change, whatever. There are any number of reasons why copying something is actually a necessary thing to do.

It's also, from a performance perspective, incredibly sensitive to the architecture of the system that you're running on. And in order to address this particular concern, we go to some fairly extensive lengths. There are custom implementations of memcpy for virtually every system that we've shipped. It's becoming a little more stable with some of the Intel systems, but by and large, those implementations of memcpy have specific knowledge of the architecture of the individual machine that they're running on. And they're implemented in such a fashion that there is little or no overhead in determining which version to actually use.

And keeping up with this-- I mean, this is a great example of why you just shouldn't re-implement memcpy. Keeping up with all of the work that we've done to optimize it is just a burden you don't need to pay. There's very little to be gained in terms of critical differentiation by having a different memcpy.

XML parsers, all sorts of mathematical functions, image processing; string manipulation really belongs up there with memcpy. Once again, these are very cache- and processor-architecture-sensitive algorithms. We've taken the time to optimize them; it's much, much better for you just to take advantage of that. At a slightly higher level, you find things like Xgrid, which allows you to build distributed applications. The list is, as I said, really quite expansive.
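
One concrete example of leaning on a system-tuned routine (my choice of function, not one named on the slide): vDSP_vadd from the Accelerate framework adds two float vectors using whatever the current hardware does best.

```c
/* Sketch: let the system's tuned routine do the loop.
 * Build with: -framework Accelerate */
#include <stddef.h>
#include <Accelerate/Accelerate.h>

void add_vectors(const float *a, const float *b, float *out, size_t n) {
    vDSP_vadd(a, 1, b, 1, out, 1, n);   /* out[i] = a[i] + b[i], stride 1 */
}
```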

We've made a little bit of hoo-ha about 64-bit support in new hardware, and not only that, but in the upcoming Leopard release of Mac OS X. There's a lot to be said both to encourage you and to perhaps actually discourage you from considering moving your applications to a 64-bit environment. Ultimately, you and your customers are going to be the ones that will decide what the right criteria are, when and how to actually make the switch.

That being said, there are some advantages to pursuing a 64-bit implementation at the earliest possible convenience. The large virtual address space that 64-bit applications have available to them does let you do some fairly interesting things. You do need to bear in mind the point that I raised earlier about virtual allocations costing.

But that being said, you can do some fairly neat stuff if you're no longer constrained to a total of four gigs worth of virtual address space. You can consolidate things that you may have previously had to do in multiple address spaces or by partitioning your data. It also means that with more heavily configured systems, you can take advantage of more physical memory in a single process.

For the Intel platform, code performance does change. The 64-bit code generation is quite a bit different. Just to begin with, there are more registers. This has an enormous impact on the use of memory. Algorithms that previously used to spill into the L1 cache in particular can now be implemented entirely in registers. There's also considerable improvement in the calling convention between functions, making intermodule calls potentially quite a lot cheaper.

[Transcript missing]

As I just mentioned, maintaining a very large virtual space has a cost that is not necessarily paid directly by your application. A large number of those address spaces will simply exacerbate the situation. There's also efficiency to be considered if you're spreading your data out a great deal more. That effectively decreases the multiplier effect of the cache size. If your data is more sparsely spread, particularly if you're using a stride that is incompatible with the way that the cache is organized on a particular machine, you may find that you have very much negative caching implications.

It also becomes very difficult to, well, very difficult is probably the wrong way of putting it. Leaky applications in a 32-bit virtual space tend to bring themselves to your attention by running out of virtual address space fairly quickly and crashing. This doesn't happen with 64-bit applications. You can leak until the heat death of the universe and you're not going to run out of virtual space. The kernel will probably crash because it'll run out of virtual space to store those mappings, but your app won't.

Some of the code performance changes that I mentioned before are not so great. CodeGen for 64-bit Intel processors is still maturing, and our understanding of that code generation is still maturing. This is something where time will certainly improve the situation. By the time Leopard is actually released, we expect that we will understand this a lot better and that much of this will be ironed out. It's something that you need to measure, however.

Every pointer in a 64-bit application doubles in size. If you have large, complex, interlinked data structures, this means that those data structures will grow. Any work that you've put in on a 32-bit application for optimizing the structure layout for cache line size will have to be reconsidered, because the association of members inside the structure will change. On top of that, some opcodes also grow. This means that if you've optimized code for cache residency or for timing related to memory fill, that may also be perturbed.
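
A quick invented illustration of the pointer-doubling point: the same struct declaration grows and re-pads when built 64-bit.

```c
/* Sketch (struct is made up): size and padding shift between 32- and 64-bit builds. */
#include <stdio.h>

struct record {
    void *next;        /* 4 bytes in a 32-bit build, 8 bytes in 64-bit          */
    int   tag;
    void *payload;     /* alignment padding appears before this in 64-bit       */
    short flags;
};

int main(void) {
    printf("sizeof(struct record) = %zu\n", sizeof(struct record));
    /* Typically 16 bytes when built 32-bit, 32 bytes when built 64-bit. */
    return 0;
}
```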

And there is the unfortunate fact that some of our machines will simply never run 64-bit code. If you want to run your application, or if you want your application to be able to run on one of those systems, you need to consider building both a 32-bit and a 64-bit version of the application. We make this pretty straightforward with Xcode, but there are algorithmic and performance-related implications for this. In particular, if you are considering re-optimizing your application for a 64-bit environment, some of those changes are likely to perturb the way that it performs in a 32-bit environment.

So going back over all of this, I can really sum it all up with, there is no magic bullet. Making your application faster involves you understanding it and doing real work. Sorry. But the five-step plan will help. Because doing work takes time, you should try and do less of it. When you have something to do, get on with it.

Treat other applications in the system like you'd like them to treat you. Don't just blow them off. Performance tuning is not a black art. It's a science. It's a science that you can master. We provide a bunch of tools to help you with it. We provide some great folks who will help you if you need help.

Don't ever despair of understanding what your application is doing, why it's performing the way it is. If you can't understand it based on the data you have, look for more data. Consider in particular the enormous advantage that DTrace gives you in understanding specifically what your application and what the environment around your application is doing.

And take advantage of the stuff that we're offering you. The real value that you give to your applications is the neat stuff that you guys do. Reimplementing code, algorithms, functionality that Apple provides to you is just wasting the time that could be much better spent making your customers happy.