Mac OS • 1:13:07
Want to dramatically increase the performance of your Carbon application? This session demonstrates techniques for optimizing file I/O, memory management, and event handling. You'll also learn how threading can improve the responsiveness of your application.
Speakers: John Iarocci, Nitin Ganatra
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Welcome to session 121. You know, performance is an important consideration for all of us. And at Apple, we're certainly doing our part, working hard to improve the performance characteristics of Mac OS X. Many developers have told me that they've seen measurable performance boosts in their Carbon apps running on 10, but we know there's always room for improvement. And so it's my pleasure to introduce the manager of the Advanced Mac Toolbox team, John Iarocci, to tell you about how you can improve your performance running on Mac OS X. Welcome, John.
Good afternoon. Perhaps you've noticed there's been a little bit of an undercurrent, a theme here throughout the conference on performance. And basically, a lot of that revolves around a realization: at some point, you've carbonized your app, you've taken advantage of some of these latest features, and you're comparing your app on 9 and 10. And you realize, gee, this part of my app is slow. Why is it slow?
And there's been other talks that have kind of gone over the high-level description of why it's slow. But basically, we've seen time and time again, as more and more people bring Carbon apps to 10, that there's this basic assumption that certain code, certain code paths, certain calls, will have the same performance on OS X as they do on 9.
We looked at all of the APIs going into Carbon. We studied them carefully. We knew right up front that there were several APIs that just would not be--
[Transcript missing]
Before I actually get into some of the specifics in this session, I just want to mention that this session is coming ahead of the tools talk, session 705, which is later on, at 5:00 this afternoon. And this session is going to refer to tools that are described in depth in that talk.
[Transcript missing]
And finally, there's no one answer for performance, and there's going to be all sorts of different performance problems with your apps. So look to other sessions as well, some of which have already happened, for tips on performance in your app.
Okay, so these are the topics that we're going to go through today. And they're pretty much ranked in terms of the way I would prioritize them for you. If you're only going to do one thing with your app regarding performance, I would definitely look into application launch. I'd highly recommend the first three: going through launching, file system usage, and CPU usage.
That's where you're really going to get a lot of payback in terms of the time you put in. But all of them have interesting performance benefits, and I really would encourage you to really try to get performance into your planning, into your scheduling, and into the way you develop your app. So let's start with application launch.
Perhaps some of you are familiar with the bouncing icon. The interesting thing is that sometimes it bounces quite a lot. Sometimes it doesn't stop bouncing. There are pretty good reasons for that. First of all, the bouncing is there as visual feedback to the user that something is actually happening, that a launch is occurring, right?
Some very legitimate reasons why the app may take a long time to launch is perhaps the app is actually off on some network volume and the network is sluggish. Perhaps it's on a disk that is spun down. It may be on a CD drive. There's real legitimate answers to some of these launching performance problems. But those aren't really interesting and they're not really under your control. They're kind of environmental. The ones that we really want to talk about today are the things that you can do something about.
So when talking about app launching, I like to refer to two different kind of launches, two different environments in which you launch your app. The first one being a cold launch, and the second one being a warm launch. Cold Launch is your app launching on a bad day. Everything is going against your app. All the files that it has are not readily available or cached in the system. All the memory that it needs has to fight for some of that memory if some other app is using it. This is kind of a worst case scenario.
And it's actually-- it's an extreme that you're not really likely to hit as the system is actually being used. But what we do is we mimic this. A simple way to do it is to basically write a tool that allocates all of physical memory, then touch that memory, and that will flush out all the memory that's in the system, and then see how your app launches after that.
That doesn't quite cover all of it because there are cases where, even though the memory is no longer cached, you actually potentially have files that have been cached in the file system and other kernel objects that affect the performance. So it's actually hard to get to your real worst case scenario.
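For reference, here is a rough sketch of the kind of memory-flushing tool being described, assuming the goal is simply to push other apps' pages and cached data out of RAM before measuring a cold launch; the 512 MB total is a placeholder you would size to the machine under test.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Rough sketch: grab memory in large chunks and touch every page so other
 * apps' pages and cached data get pushed out of RAM before a cold-launch
 * measurement.  The 512 MB total is a placeholder, not a real figure.
 */
int main(void)
{
    const size_t kChunkSize = 16 * 1024 * 1024;   /* 16 MB per chunk */
    const size_t kTotalSize = 512 * 1024 * 1024;  /* assumed physical RAM */
    size_t       allocated;

    for (allocated = 0; allocated < kTotalSize; allocated += kChunkSize) {
        char *chunk = malloc(kChunkSize);
        if (chunk == NULL)
            break;
        memset(chunk, 0xFF, kChunkSize);   /* touch every page for real */
    }

    sleep(5);   /* hold the memory briefly, then exit and release it all */
    return 0;
}
```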
Warm Launch, on the other hand, is basically things are going well for the app. The libraries that you depend on are all loaded. They've already been instantiated. The best case is basically it's an app very much like yours or another instance of your app has just launched. And the reason I distinguish between the two is because they're really different in terms of how you optimize.
And the biggest point is the cold launch is predominated by what I would call low-level I/O, and I'll explain that in a little bit. So my challenge to you is to get your warm launches for your typical app, and this will vary depending on the memory and disk and configuration, to launch in one bounce. There are apps on OS X that launch in one bounce in this situation, and there's plenty of improvement for even the apps that we've shipped in the first release of X.
The other thing I would encourage is two measurement techniques that help constrain, or help give you boundaries as to, how fast or how slow your app can launch. The first one, what I refer to as the Do Nothing app, is: take your application, and the very first thing you do, right at the top of your main entry point, is just put in a call to ExitToShell.
Leave everything else the same. exit will work as well. What you're trying to do here is launch an app that basically does nothing. But not just any app: your app, with all of the libraries that it depends on. Everything else is the same. It's just that you're not executing any of your own initialization code. I think you'll be surprised when you actually measure that with either a stopwatch or any of the tools that we have on the system. For this case, I usually use the time command at the command line.
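A minimal sketch of that do-nothing measurement build: everything the app links against still loads and initializes, but none of the application's own code runs.

```c
#include <Carbon/Carbon.h>

int main(int argc, char *argv[])
{
    /*
     * Measurement build only: bail out before any of our own code runs.
     * Everything the app links against still gets loaded and initialized,
     * so timing this launch gives you the floor for the real app.
     * A plain exit(0) works as well.
     */
    ExitToShell();

    /* ... the real application code never runs in this build ... */
    return 0;
}
```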
It's an interesting data point because it's-- When you first look at it, that's your best case, right? And a warm launch when you just do absolutely nothing at main, that's as fast as you're going to get. Well, actually that's not quite true. Because before main is run, there is other code that runs that you actually have some control over.
Particularly, there's the init routines of your libraries that you pull in that execute code. There's the init routine of your app itself that executes code. And the third major category is static initializers for C++. These three areas are things you definitely have to look at because they contribute to your best case scenario. You haven't run any code in the app at all. So you may be doing things in your static initializers that you just don't even remember you put there, right? So make sure you take a look at those.
So that's the do nothing app approach, and that's a good data point to capture. The other one is more like the best case, just to kind of get an idea of how good things could be. And there I would just recommend that you basically have a very high end system, as much memory as you can put in it. Make sure that absolutely nothing else is running. Take it off the net.
Launch the app once, launch it again, measure it. Those are two boundaries that you should keep in mind as you do performance analysis of your app. And what you should be doing as you improve your application launch performance is see how you can get those two to converge, essentially.
I'm not actually bringing up or talking at length about some more conventional techniques that you are already familiar with. The whole area of perceived performance is something that you might also want to look into. I'm talking about real-time performance. I'm talking about clock time. There's still definitely advantages to, for example, putting a splash screen up.
Ideally, you'd want to put your first window up as fast as possible, and if you can't get that down to a one-second kind of granularity, maybe a splash screen is in order. Some feedback. Oh, the other thing I forgot to mention on the bouncing icon: the bouncing icon starts when you double-click on your app. It stops when your application is handling events.
So if you're doing a whole bunch of other stuff before handling events, you're not able to respond to events, and that's going to tie into this. It's going to tie into how your application's launch is perceived. It's also going to tie into when the user can actually use your application. And then there was one other thing on the bouncing icon: it will time out after some absurd time. And at that point, the user isn't quite sure if the app has actually launched or not.
Okay, so in looking at launch performance, I was able to profile two word processing apps. This is a typical profile of an untuned app. You can see the time is dominated by low level I/O. By that I mean the virtual memory system, paging, the dynamic loader, doing library loading, initializing libraries for the first time. This is largely completely out of your control. This is something the OS takes care of. But the other two sections are really interesting. That being the file I/O and the CPU time during launch.
By file I/O, a typical example of file I/O is when you're actually going and reading preferences, maybe you're enumerating plug-ins. That's the kind of file I/O I mean. On the CPU side of things, it could be as simple as determining, you know, maybe you've read your preferences in and now you're sanity checking them. Anything that's typically compute bound.
Now, what's interesting here is those two, file I/O and CPU time, compete with the lower level I/O. The ideal case, we minimize the file I/O and CPU time and we can make a much better use of low level I/O. The other thing that sometimes shows up in untuned applications is pauses.
By that I mean either an explicit call to a call like delay or a sleep call, left in accidentally, left in to work around some bug. Those are kind of hard to detect. Usually you have to see that basically the app is running, but it's not doing anything compute intensive and it's not doing I/O.
There's a couple of tools that I'll get into a little bit later that'll help find these kinds of problems. The other anomaly we see sometimes is writing during launches. There's really no good reason that your application has to write to the file system during a launch. Now, I'm not talking about the first time your app ever launches on that system for that user. It's perfectly okay to go ahead and write out your preferences for the first time. But statistically speaking, your typical launch should not have any file system writes.
The reason for that is first, writes are much more expensive than reads. And the whole launch facilities, the low level I/O has optimizations for reads. It's basically geared at reads. And a write right in the middle of that will interrupt it and will essentially discard some of the optimizations.
Here's the profile of a tuned app. Now, both of these are what I referred to, what I explained before, as a cold launch. This one, you can see, has a lot more low-level I/O, and that's good because that's the best case for us. We can optimize that into the largest chunks of I/O that we can do, and we can do them as efficiently as possible. And of course, the file I/O and the CPU are minimized in this case. If this were a warm launch, all that low-level I/O would go away. You might see some more compute cycles, but the profile is quite different for a warm launch.
Okay, so what does that mean? For both cold and warm launches, you should concentrate on CPU usage and file system usage. That's the areas that'll pay back the most. The best way of doing that is first do only what you need to do. Look at what you're doing in the launch of your applications.
If it's the typical app, you're probably initializing a whole bunch of stuff that you may or may not use during the life of that app. Look at deferring some of that initialization. This might be a very good use of setting up a Carbon event timer to send yourself a one-shot timer to defer some of this initialization.
Or don't even do it then, once the app is up and handling events. Do it when the user first uses that feature of your application. Particularly if the feature in question is a somewhat optional feature. I'm not saying it's a bad feature or anything. I'm just saying, if your typical user base isn't going to use that feature, why pay for it? Why pay for it up front in your initialization?
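A minimal sketch of the one-shot Carbon event timer approach mentioned above; the half-second delay and the commented-out helper are placeholders for whatever initialization you choose to defer.

```c
#include <Carbon/Carbon.h>

static pascal void MyDeferredInitTimer(EventLoopTimerRef inTimer, void *inUserData)
{
    /* Runs once, shortly after the app is up and handling events.
       Hypothetical helper for the expensive setup skipped at launch: */
    /* InitializeRarelyUsedFeature(); */
}

static OSStatus MyInstallDeferredInit(void)
{
    EventLoopTimerRef timer;

    /* Fire half a second after the event loop starts running; an interval
       of 0 makes the timer one-shot. */
    return InstallEventLoopTimer(GetMainEventLoop(),
                                 0.5 * kEventDurationSecond,
                                 0,
                                 NewEventLoopTimerUPP(MyDeferredInitTimer),
                                 NULL,
                                 &timer);
}
```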
Then the second speed-up tip I would have for you is eliminating some of these things that you see during launch. Just outright eliminating. Writing and pausing of some sort, that's a good example of that. Dead code. Make sure your tools are working for you in this regard. I'm talking about dead code that, you know, may be in there for debug reasons, may be in there for whatever reasons you have, for tracing, profiling, things like that.
Make sure that that doesn't end up in your final product. Just a little bit of code sprinkled around affects your locality: code that should be on the same page could potentially be split up across two pages, and that can make a difference.
Then there's another kind of dead code that I would encourage you to go after. That's the dead code that you've inherited over time. Now is the time to get rid of that kind of code that's checking to see if QuickDraw supports color. No need for that anymore. You probably still have the check in your code. You probably still have the code base that supports that check. At the very least, #ifdef that out of your app.
And then of course, redundant I/O. And that shouldn't be understated. Redundant I/O is where you can actually get a lot of time back from your launches. And now I'd like to bring up Nitin Ganatra, who's going to go through some of the details of file system performance and help with that redundant I/O.
OK, it's on now. Good afternoon. So as John mentioned, redundant I/O, and in fact any kind of file I/O, is a big burden on the clock time of your application. Doing anything to get rid of this file I/O or minimize it at launch will pay off immediately. And it can pay off anywhere from, at the minimum, reducing system call overhead in the best case scenario, where the buffer caches already have your data and they're hot, all the way up to the case where you're reading something off of a network disk and you stall because it's out on the network.
So here are the areas that I'd like to cover. First is file iteration, metadata, volume iteration. Well, actually, no. You can read those. Let's just get right into it. First one, the venerable PBGetCatInfo. You're all familiar with this call, I'm sure. And when we were creating the Carbon API, it was pretty much no question.
We couldn't get rid of PBGetCatInfo. It would have just caused a huge upheaval in people's source bases everywhere, ours included. And so there was just no choice. We had to provide it. It had to be compatible. Unfortunately, we couldn't make it performance compatible, but that was a secondary concern in the interest of getting your apps onto 10 quickly.
The bad news is that PBGetCatInfo is non-optimal on any file system. That goes for 9 and for 10, if you have file sharing on 9, for example. And for the most part, it's overkill for all clients. PBGetCatInfo just returns a huge amount of data, and the developer code out there that uses PBGetCatInfo ranges from using none of it, in other words just checking an error code, to maybe one field from this enormous data structure. Let's take a look at that data structure.
In fact, I couldn't even fit everything that GetCatInfo returns to you. It's just an enormous amount of stuff. And, you know, when it came time to, or back when PBGetCatInfo was first created and exported as a system call, it made perfect sense because it was a great reflection of the underlying volume format, right? On HFS disks, the catalog information is stored in one section of the disk.
It tends to be hot in the caches because everyone is using the catalog files. So PBGetCatInfo tends to be free. And, well, while you've paid the trap overhead, you know, on a classic Mac OS system, while you've paid the trap overhead of making the call and getting into the file system and what have you, let's just return back everything that we possibly can. And guess what? We did. Of course, and everyone uses this call. It's plenty fast on 9. There aren't any real problems with it. Problems slowly started creeping in with file sharing again, and things got much worse with 10.
Sort of as a graphical example of how PBGetCatInfo works, this is an optimal case right here. This is with file sharing turned off. This is on an HFS or an AFP disk. In other words, all the data that's given to you in one PBGetCatInfo call is in one contiguous part of the disk. And lo and behold, it fills in the param block in one shot.
Again, this is optimal. Let's look at what happens with PBGetCatInfo on other file systems. With Mac OS X now, we have the opportunity to support plenty of other file systems than we ever did before, and it turns out that PBGetCatInfo is just not a good reflection of the underlying volume format. In order to get some data, we have to go to parts of the disk, different parts of the disk.
And in fact, a lot of those different parts of the disk are completely disjoint, which means you make one PBGetCatInfo call on one of these file systems, you're doing numerous I/O operations. And I don't think I have to say that that's bad, potentially, if this is a network-based disk. And as we move forward, more of them will be.
[Transcript missing]
A good way to see exactly what's going on when you make a request, an FSGetCatalogInfo call, to see what's going on under the covers, is to use fs_usage. This is a tool that's on your systems, and it will be covered in the Performance Tools talk.
I believe it's session 705 that John talked about earlier. It's great to actually just write a simple little app, call FSGetCatalogInfo with the various bits that you are interested in, and just see what's happening. Particularly on these other file systems, you know, something that's not HFS or AFP: NFS or UFS. UFS is probably the most readily available.
So here's a quick little sample. Given an FSRef, tell me if this item is a folder. Notice that the only bit that's passed to the FSGetCatalogInfo call is kFSCatInfoNodeFlags, because that's the only bit that we're really interested in. So on an NFS or a UFS file system, that's all we have to worry about, and we can do that in one system call at most. In fact, in a lot of cases, we can do it in zero system calls.
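A reconstruction of the kind of sample being described, assuming only the node-flags bit is requested so the File Manager can answer with at most one system call.

```c
#include <Carbon/Carbon.h>

/* Given an FSRef, tell me if this item is a folder.  Only the node-flags
   bit is requested, so on volume formats like UFS or NFS the File Manager
   can answer with at most one system call. */
static Boolean MyItemIsFolder(const FSRef *ref)
{
    FSCatalogInfo info;
    OSErr         err;

    err = FSGetCatalogInfo(ref, kFSCatInfoNodeFlags, &info,
                           NULL /* name */, NULL /* FSSpec */, NULL /* parent */);
    if (err != noErr)
        return false;

    /* ...the field is ANDed and returned. */
    return (info.nodeFlags & kFSNodeIsDirectoryMask) != 0;
}
```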
And then, of course, the field is ANDed and returned. The next topic is volume iteration. So when it came time to create the Carbon API, we were going through and pruning out a lot of areas that we just couldn't support in Carbon on OS X. One of them was the low-memory global used to get at the VCB pointer.
A lot of Carbon apps, or a lot of Mac OS apps, saw this as a free way to get access to all volumes, to enumerate all volumes, with zero I/O. And, you know, I don't have to tell you, just in-memory copies are very fast, very efficient. However, when we created the Carbon API, it was pretty clear that we couldn't support direct VCB access. And our recommendation was, and continues to be, to use one of the get volume info type calls. Specifically in the documentation, we mentioned PBHGetVInfo.
However, the problems that we have with PBGetCatInfo are the same problems we have with PBHGetVInfo. It tends to be very expensive. It returns a large parameter block. For most uses, you probably just don't care about a lot of that information. And exactly analogous to FSGetCatalogInfo, there is an FSGetVolumeInfo call. Again, pass in the minimal bitmap that you require, and we will do the minimal I/O. In a lot of cases, it will just be an in-memory copy for us, out to your param block, and you do zero I/O.
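A minimal sketch of index-based volume iteration with FSGetVolumeInfo, requesting no extra information at all; the loop structure and error handling are illustrative.

```c
#include <Carbon/Carbon.h>

/* Iterate the mounted volumes by index, asking FSGetVolumeInfo for nothing
   beyond the volume refnum itself, so in most cases the answer is just an
   in-memory copy. */
static void MyForEachVolume(void)
{
    OSErr          err;
    ItemCount      index;
    FSVolumeRefNum refNum;

    for (index = 1; ; index++) {
        err = FSGetVolumeInfo(kFSInvalidVolumeRefNum, index, &refNum,
                              kFSVolInfoNone, NULL /* info */,
                              NULL /* name */, NULL /* root FSRef */);
        if (err == nsvErr)
            break;                  /* ran off the end of the volume list */
        if (err != noErr)
            continue;

        /* ... do something with refNum ... */
    }
}
```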
The FSRef APIs are the primary APIs and the preferred APIs in Carbon on OS X. In fact, all of the major clients of the File Manager on OS X today, the Finder, Navigation Services, and the Cocoa Open/Save panels, all use the FSRef API and have actually seen huge performance gains in doing so. We've seen performance gains not just on NFS and network-based file systems, but in fact we've seen them even on local HFS disks. So it's definitely something worth investigating.
So on to file I/O. First thing I'd like to plug here is what some of you probably recognize as being a relatively old tech note. I think it was put out in '93 or '92, I'm not really sure. File Manager Performance and Caching. Turns out a lot of the lessons that are taught in that tech note are relevant today, and I highly recommend that you go back and read that.
Some of the biggest points in it, of course, are to use large page-aligned I/O wherever you can. If you're picking through little bits of data in a file, don't push that down to the file manager or the file system and don't pay the system call overhead just because you're doing little bitty writes. Do larger page-aligned I/O into preferably page-aligned buffers and you'll get the maximum throughput from the file system, a lot of times without even having to do copies from the kernel buffer if you pass those page-aligned buffers as well.
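A sketch of the large, page-aligned read idea, assuming the whole fork fits comfortably in memory; valloc is used here simply because it returns page-aligned buffers on OS X.

```c
#include <Carbon/Carbon.h>
#include <stdlib.h>

/* Read an entire (already open) fork in one large call into a page-aligned
   buffer.  The caller frees the buffer; a real app would cap the size
   rather than trusting the fork length blindly. */
static OSErr MyReadWholeFork(SInt16 forkRefNum, void **outBuffer, ByteCount *outLength)
{
    SInt64    forkSize;
    ByteCount actualCount;
    void     *buffer;
    OSErr     err;

    err = FSGetForkSize(forkRefNum, &forkSize);
    if (err != noErr)
        return err;

    buffer = valloc((size_t) forkSize);        /* page-aligned allocation */
    if (buffer == NULL)
        return memFullErr;

    err = FSReadFork(forkRefNum, fsFromStart, 0,
                     (ByteCount) forkSize, buffer, &actualCount);
    if (err != noErr) {
        free(buffer);
        return err;
    }

    *outBuffer = buffer;
    *outLength = actualCount;
    return noErr;
}
```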
Again, let me put in another plug for fs_usage. This is a great place to look, and you can actually see exactly where reads are coming through, writes are going through, and how much data is actually being read per read or write. This will help you identify quickly where you're spending a lot of your time doing a lot of small I/Os. The next point is don't pollute the cache. This is covered in the Performance Tech Note, and it's something that a lot of clients overlook, and it really shouldn't be, because it tends to pay off in big wins.
What I mean by not polluting the cache is you as the app developers know exactly what your usage pattern is going to be for any bit of data or large chunks of data. you know that if you're streaming in, if you're importing a file that you're just not going to look at again, say you're importing from some foreign file format into your own internal representation, you're just not going to go back to that file again to do the read.
And if you're talking about a multi-megabyte or, you know, it doesn't even have to be that big, multiple hundreds of K file, if you just do reads without passing the no-cache mask, what you're actually doing is filling the buffer cache with data that you know that you're never going to read.
Turns out that by passing the no-cache mask, you're not actually hurting yourself, you're not hurting your throughput by doing those fresh reads from these files. But by passing it on these reads and writes, you're not going to pollute the user's buffer cache; in fact, you're not going to pay any performance penalty by passing the no-cache bit, you're just going to make it an overall better experience.

I think a big reason why this isn't used so often is because it doesn't look like it's a performance gain. In other words, when you go and change your code, it's really hard to see the benefit of this. The reads that you're doing are just as fast, the writes that you were doing are just as fast, and if you haven't evicted pages from the buffer cache that you really cared about, you're just not going to notice. But it is still very important, and I strongly encourage you to look hard at where you're doing I/O and what kind of I/O you're doing, and pass the no-cache mask where you can.

Internally, just a couple of examples of where we use it: when we're doing Finder copies. When the Finder is doing a copy of a folder from A to B, the Finder itself knows that it's never going to, or it's very likely that it's not going to, look at that data again unless the user requests it. So there's no reason to flood the entire buffer cache with these copied blocks; instead, the user's data can stay intact, and the Finder copy can execute just as quickly as it did before.

The other area is in iTunes, when it's encoding a file or ripping a file from CD and writing out to disk. iTunes itself knows that the chances are very slim that it's going to actually go back and read those pages again, so it passes the no-cache mask. And it turns out that because of that, the user experience on 10 is a little bit better, even though it's kind of hard to really quantify that and look at it; you kind of just have to know that it's better and know that you're doing the right thing.
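A small sketch of a one-time import read that passes noCacheMask, assuming the fork is already open; the helper name and parameters are illustrative.

```c
#include <Carbon/Carbon.h>

/* A one-time import read: OR noCacheMask into the position mode so the
   blocks we stream through don't evict the user's warmer data from the
   buffer cache. */
static OSErr MyImportRead(SInt16 forkRefNum, SInt64 offset,
                          ByteCount length, void *buffer)
{
    ByteCount actualCount;

    return FSReadFork(forkRefNum,
                      fsFromStart | noCacheMask,
                      offset, length, buffer, &actualCount);
}
```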
Okay, writing large files. One common technique that has sort of been passed down from generation to generation is, when you're writing a large file, to first do a SetEOF or an FSSetForkSize to the final length of the file that you're actually writing, and then back up and start filling in the data.
This has a couple of advantages, and this is why it's been done over time. First of all, it is a good pre-flight to let you know whether or not you have space on the disk to actually do the I/O. And then second, it's also a good way to reserve a portion of the disk, a hopefully contiguous portion of the disk, so that when you go back and you're actually doing your writes, you know that you're writing to contiguous parts of the disk, and subsequent reads of that document or that data off of the disk will be fast.
The problem is on OS X, for security purposes, when you extend a file, what we do is we zero fill the entire file from the current EOF out to the very end where you extend it. And the reason we do that, of course, is security. We don't want some malicious program to run on your disk, reserve as much space as possible, and then potentially sniff through looking for social security numbers or credit card numbers or what have you. So this is why we do the zero fill. Of course, it has the downside of producing double I/Os in this very common usage of the file manager.
The double I/Os come from first when we do the zero fill of that extended area, and then later on when you actually do your write. If you write a little app on OS X right now that all it does is just create and open a file and then do a SetEOF of a gig, you'll notice that before that SetEOF returns, your disk will be buzzing away, and when you go and do a subsequent read, you'll see that it is all zero filled, and that's exactly what I'm talking about.
We're looking at ways to fix this in the near future, but the truth of it is we've shipped this way. This is already on customers' disks, so it's something that you should probably address now. And fortunately, there are a couple of ways that you can address it. You can use the PBAllocate call. This does not have the zero-filling behavior. However, it does allow you to reserve a portion of the disk for a contiguous file on the disk.
Or the other thing you can do is just write. Just start writing. As long as you're not doing SetEOFs followed by a write, you won't get this double I/O. If you're just doing writes past the end of the file, then that's enough of a trigger to the file system that any subsequent reads are just going to pick up that data that was written, so we don't need to zero fill, and we don't.
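A sketch of the reserve-then-write idea using FSAllocateFork, the FSRef-era sibling of the PBAllocate call mentioned above; the flag combination shown (contiguous, all-or-nothing) is one reasonable choice, not the only one.

```c
#include <Carbon/Carbon.h>

/* Pre-flight and reserve space for a large file without the zero-fill
   penalty of SetEOF.  kFSAllocContiguousMask asks for one contiguous
   extent, and kFSAllocAllOrNothingMask makes the call fail outright if
   the space isn't there. */
static OSErr MyReserveSpaceForFile(SInt16 forkRefNum, UInt64 bytesNeeded)
{
    UInt64 actualBytes;

    return FSAllocateFork(forkRefNum,
                          kFSAllocContiguousMask | kFSAllocAllOrNothingMask,
                          fsFromLEOF, 0,
                          bytesNeeded, &actualBytes);
}
```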
Finally, file assumptions. I couldn't think of a better topic for this, or heading for this slide, so I just put this. Since the beginning of personal computers, we've been able to make some assumptions about the layout of disks and the layout of hardware and usage patterns and things like that. And as we move forward, more and more of those assumptions will prove to be false or can prove to be false under certain situations.
One of these assumptions, of course, is that your disks or the user disk is locally attached to the machine, that all user data is coming off of a local disk, all preferences are coming off of a local disk, and in fact, document directories and things like that are on local disks. Well, networks are getting faster all the time.
Pretty much right now, they're fast enough that in some situations you can actually create an environment where user preferences, user documents, and various other bits of data are actually stored on a network-backed disk, and it provides lots of benefits. In fact, we have this all set up at Apple right now, where users can log into their machines. You log in at one workstation, let's just call it. And you log in with your username and your password.
All of your preferences come up on that one machine. You can use your documents just as you were maybe in your office, or do whatever you want to. You get all your same preferences. You go back to your office, all of your preferences are updated, because everything is on the network. It's a beautiful thing, this sharing. And as we move forward, it's going to be one of those things that a lot more users are going to be exposed to. Thank you.
But it does mean some serious considerations for your applications. In other words, preferences and documents and things like that are no longer going to be backed by local disk. And this can have impacts on your code base that you're probably not even aware of. Just because when you're coding or designing with some assumptions in mind, a lot of times you're not even aware you're making those assumptions.
And you'll tend to do things like, let's say, "Oh, I know that I'm bringing this data down off the network and I need to cache it somewhere. Let me cache it in the preferences directory." Or, "Let me cache it to a temp file in the documents directory." Well, if those directories are backed by a network volume, you're really not buying much by caching something off the network to another network volume.
Or, this is a lot more common scenario, when you launch your app, you're doing tiny little I/Os to the preferences file, and this has never been a problem because it's a local disk. It's very fast. Well, if it's a network disk, that's a big window that you can stall in and your users will definitely notice.
Trust me, we've noticed at Apple, and we've been working with developers where we can to point out what's going on and help them work through it. But the best thing that you can do is try to set up one of these hostile test environments in your own offices and see for yourself.
One of the best examples of this is to set up an NFS-based user directory and log into that user and just double-click your app and see what happens. Or double-click your app with fs_usage running alongside and see what happens. You'll notice that if you're logged in as either a local user or as a network user, a lot of times you'll notice a great variation in the performance of your app, and a lot of that can be attributed to some of these design decisions.
The good news is that anything that you fix for the network case will also benefit the local case. So if you're working off of maybe slower media, or maybe not, say you're just working off of fast media, you can reduce the number of system calls that you're making and speed things up even in your local scenario. So it's definitely a good thing to look into doing, and check the Mac OS X Server documentation for more details. And with that, I'll bring John back up on stage.
Talk about watching the application bounce a lot. Those network users, when you have your system set up that way. We typically see two or three times the number of bounces we do in a local directory. It's really something I advise. Take Nitin's advice on that one. Okay, you've learned all of the details about what you can do to help with your file system performance. I'm going to talk a little bit now about your CPU usage.
The first thing I'd like to say is You're running on a preemptive multitasking system, but it's not magic. It doesn't give you more than 100% of the CPU on a single CPU system. It can't give you free cycles. Matter of fact, basically the gain is that one single thread on that system is not going to take over the whole system. It's not going to bring the system to its knees.
So if you have a hundred threads that all need to run, they're all sitting there, have something to do, even if it's very little, the scheduler has to take them into account. That's why we talk about making sure that your threads are blocked. That CPU is still a limited resource.
So make sure when you're using threads or timers, cooperative threads, that you're taking this into account. The best tools on the system to really look at this are top and CPU Monitor. CPU Monitor you've probably seen in some of the demos. I would advise just keeping that thing running as you're doing development. Just keep it off to the side, maybe on a second monitor. It'll show you very easily when there's a little bit of a CPU peak, and you can go in and see, is that your problem or not? Typically, that is a great indicator for when you have CPU-bound problems.
Top is another one that you could run because it shows you a little bit more than just CPU usage. Both of those I would encourage as you're just doing your ongoing development on your app. Keep them running in some window as clues to the possibility of a performance problem.
So, responsiveness. This is the next area after launching file system and CPU usage that I would encourage you to look into. Mostly things to do with responsiveness. You should be able to fix up fairly quickly by just taking a quick look at what your app is doing.
[Transcript missing]
The biggest indicator that you're not doing event handling right is probably that you're pegging the CPU. You've seen this in some demos. The best and simplest place to start is to look around and see if your app is showing this behavior, if it's CPU bound during tracking, during interaction with your UI.
Take a look. Use Sampler, which is a tool that lets you actually pinpoint where in your code the problem lies. Search your code. Search your code for StillDown and Button. And look at how you're calling these older calls that we really would rather you get off of. TrackMouseLocation is your friend. That's what you want to be using. That's the basic primitive for letting you do all sorts of tracking in the UI that blocks intelligently.
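A sketch of a tracking loop built on TrackMouseLocation rather than a StillDown/Button spin loop; the window-port argument and the drag handling are illustrative.

```c
#include <Carbon/Carbon.h>

/* Each TrackMouseLocation call blocks until the mouse actually moves or the
   button goes up, so the CPU stays idle in between -- unlike a spin loop
   polling StillDown or Button. */
static void MyTrackInWindow(WindowRef window)
{
    Point               where;
    MouseTrackingResult result;

    do {
        TrackMouseLocation(GetWindowPort(window), &where, &result);

        if (result == kMouseTrackingMouseDragged) {
            /* ... update whatever feedback follows the point in `where` ... */
        }
    } while (result != kMouseTrackingMouseUp);
}
```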
So after event handling-- oh, in addition to those tools, I would encourage you to look on a developer CD. There's an application called Appearance Sample, which has almost every widget the toolbox supports, every control you've ever seen. Go and play with that app. Run CPU Monitor. You'll see all of those controls block well. That's what your app should do. If you're seeing different behavior in your app, it's either a problem in that you've done your own kind of handling, your own custom control, or potentially in the way that you're using the toolbox.
The next area that contributes to your app's responsiveness, you know, maybe it feels a little sluggish. Maybe everything else is looking good. Your file system performance and your launch are good, but when you activate windows, things don't appear as snappy as they do, say, on 9. That's probably an indication that you have a drawing problem. The best tool for that is Quartz Debug. You've probably seen it in some of the other sessions. It should be in the Performance Tools session as well. Because it'll let you see when you're doing redundant drawing, when you're drawing the same things over and over.
The other typical pitfall that we've seen is people back buffering, doing their own double buffering for their drawing when the system is already doing that for them. So make sure you check the port using QDIsPortBuffered, and if it is buffered, then you don't have to do that buffering yourself. That's being done for you.
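A small sketch of the check being described, assuming you only fall back to an app-level offscreen when the port is not already buffered.

```c
#include <Carbon/Carbon.h>

/* Only fall back to an app-level offscreen when the window's port is not
   already buffered by the system.  On Mac OS X this is false for ordinary
   windows, so the extra offscreen can simply be skipped. */
static Boolean MyNeedsOwnBackBuffer(WindowRef window)
{
    return !QDIsPortBuffered(GetWindowPort(window));
}
```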
The next area on responsiveness has to do with flushing. Because you have a back buffer, it means there has to be a time when you actually get those bits in a back buffer to the screen. Generally, you should try to avoid flushing. The system will do flushing for you basically on event boundaries. It'll try to do that as intelligently as possible. So you shouldn't have to flush.
The two examples of exceptions are when you're doing some kind of animation, you want that to get to the screen right now, or when you're not really involved with events at all, the splash screen case. Those are good uses of explicit flushing. Otherwise, let the system do it for you.
Another common performance problem area is with regards to window resizing. This also has to do with the back buffer and the design of the window system on OS X. Basically, our advice there is to try to do this all in one fell swoop with SetWindowBounds, instead of trying to use SizeWindow and MoveWindow in combination. That's what that call exists for: to optimize those cases.
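A sketch of collapsing a MoveWindow/SizeWindow pair into a single SetWindowBounds call; expressing the new geometry as the content region is an assumption made for the example.

```c
#include <Carbon/Carbon.h>

/* Instead of MoveWindow followed by SizeWindow (two passes through the
   window system), describe the final content rectangle once. */
static void MyMoveAndResizeWindow(WindowRef window,
                                  short left, short top,
                                  short width, short height)
{
    Rect bounds;

    SetRect(&bounds, left, top, left + width, top + height);
    SetWindowBounds(window, kWindowContentRgn, &bounds);
}
```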
Also, the other thing that we've seen with regard to windows is pervasive or heavy use of invisible windows. Windows are generally more expensive on OS X: there's the back buffer, and in addition the overhead of the interaction with the Core Graphics window server. You may have invisible windows.
That doesn't mean that they don't cost anything, that manipulating invisible windows comes for free. As a matter of fact, I would look into why you're using invisible windows at all. It is often the case that you can ditch that window, dispose it, create a new one, and redraw faster than you can by twiddling that invisible window.
The last one I wasn't going to put on here at all, but I figured I would try some of the tools and look at various apps a couple days ago. And I just happened to notice that one of the apps that I use every day was doing file I/O when I activated a window.
And I'm sitting here scratching my head, trying to figure this out. And it's just a bug. But this is one of those things that you probably wouldn't notice unless you're running a tool that tells you that that's happening. fs_usage is perfect for that. fs_usage lets you basically see the file I/O that's going on in the whole system and in a particular app. You can filter things out. Another tool, if you're really going after file I/O in particular, is Sampler, which lets you tie the usage pattern back to your code.
Okay, polling versus blocking. I'm sure you've heard this a lot in various talks. I'm going to talk a little bit about some of the more atypical situations in which you find polling affecting performance. WaitNextEvent is actually pretty typical, but the fallout of using WaitNextEvent with a sleep of 0 is where we sometimes see some problems.
So just to make sure we're all on the same page here, you really shouldn't be calling WaitNextEvent with a sleep of 0. The very simple way to get rid of it is: if you have something that you want to do periodically, set a Carbon event timer to do it with the frequency that you need, and call WaitNextEvent with a very long sleep time.
Time and TickCount. It's something that surprised a few of us, basically, that we did some performance profiling and in various apps it shows up that TickCount is taking up a significant amount of time. And one of the reasons is that TickCount costs more than it does on OS 9.
But it's also used all over the place in a lot of UI in places where it really doesn't have to be used. First of all, we're talking about something that's fairly coarse-grained, right? Ticks, 60ths of a second. Calling it more often than 60 times a second doesn't make a lot of sense.
So the best advice I have for you here is to try to use the event system. Try to use the timestamps that are in events and look at those. There's often comparisons made to the time now. That can get you out of polling TickCount, essentially, and having it show up as a performance problem.
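A small, hypothetical example of leaning on event timestamps instead of TickCount; the double-click helper and its parameters are made up for illustration.

```c
#include <Carbon/Carbon.h>

/* Use the timestamp already carried by the event rather than calling
   TickCount again.  EventTime values are in seconds; GetDblTime returns
   ticks, so divide by 60 to compare. */
static Boolean MyIsDoubleClick(EventRef mouseDownEvent, EventTime lastClickTime)
{
    EventTime when = GetEventTime(mouseDownEvent);

    return (when - lastClickTime) < (GetDblTime() / 60.0);
}
```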
Another often asked for bit of information is the volume list. Nitin went through earlier how to do that as efficiently as possible. But I think it's a pretty rare case where you actually need the volume list. I would suggest you just get rid of that code altogether or figure out what you really need.
If you're trying to find out about new volumes or if you're trying to find out about volumes that have just been unmounted, register yourself for a Carbon event for volume mount and unmount. Ask the system to tell you about it instead of periodically going out and looking at all the volumes and trying to figure out what happened.
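A sketch of registering for the volume mounted/unmounted Carbon events instead of polling the volume list; the handler bodies are placeholders.

```c
#include <Carbon/Carbon.h>

static pascal OSStatus MyVolumeEventHandler(EventHandlerCallRef nextHandler,
                                            EventRef event, void *userData)
{
    if (GetEventKind(event) == kEventVolumeMounted) {
        /* ... a new volume appeared; refresh whatever cares ... */
    } else {
        /* kEventVolumeUnmounted: a volume went away */
    }
    return noErr;
}

/* Ask the system to tell us about volumes coming and going, instead of
   periodically walking the volume list ourselves. */
static OSStatus MyInstallVolumeNotifications(void)
{
    static const EventTypeSpec kVolumeEvents[] = {
        { kEventClassVolume, kEventVolumeMounted },
        { kEventClassVolume, kEventVolumeUnmounted }
    };

    return InstallApplicationEventHandler(NewEventHandlerUPP(MyVolumeEventHandler),
                                          GetEventTypeCount(kVolumeEvents),
                                          kVolumeEvents, NULL, NULL);
}
```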
Same kind of thing goes for preference change notifications. There's a theme-changed Apple event. There's various new Carbon events that let you know about things. Actually, up on the volume slide, I should probably also have another bullet for processes. If you're trying to find out what process just got launched or what process died, there's a Carbon event for that as well.
Generally, we've been really looking at the system to try to find out if there's any legitimate need to do polling. The answer should be no. We're trying to notify you of everything that you might find of interest. It's a much better solution on OS X. So if you guys see things that you still think you have to poll for, let us know about it. We'll figure out a better way to do it.
And on this final note, maybe some of you have heard this bit in the application packaging and document binding presentation yesterday. We need notification too, particularly in the parts of the system that present the file system visually. So the Finder, Nav Services, the Open/Save panel, those are showing you the file system objects. They're not polling. We don't poll. So we need you guys, if you participate in this kind of thing, if you're an installer or if you're copying files to a place that is likely to be visible, to use the FNNotify call.
FNNotify is a 10-only API. It's in Files.h. It basically says something happened, something changed in this directory. It takes an FSRef. It lets us know that something changed and that we should refresh the contents of that directory in any UI elements that care. Do this intelligently. If you're copying a whole bunch of files to a single directory, let us know when you're done with that copy operation, not at every file.
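A minimal sketch of the FNNotify usage being described: one notification for the destination directory, sent after the whole batch of copies is done.

```c
#include <Carbon/Carbon.h>

/* Tell the system that a directory's contents changed -- once, after the
   whole batch of files has been copied into it. */
static void MyFinishedCopyingInto(const FSRef *destinationFolder)
{
    FNNotify(destinationFolder, kFNDirectoryModifiedMessage, kNilOptions);
}
```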
Okay, resource manager use. The resource manager is very tied to the file manager. So in essence, I could just repeat what Nitin said earlier about the file manager. But, um, It's actually worse in that the resource file format was designed way before VM systems were really commercial like they are now. And really the file format is not designed for a VM system in that there's a resource map in one part of the file and resource data in the other part of the file. And there's no way around going to both places every time you need data.
You're asking for a resource. You have to go look up where the resource is in the file in the resource map. That's at least one I/O. Then you've got to go to where it told you the resource was. That's a second I/O. Okay, that's bad enough. Then you look at what's in a typical resource file and you see lots of little resources.
[Transcript missing]
Coalescing of your resources. If you have, for example, 'STR ' resources that could be combined into a 'STR#', that's a much better use of the Resource Manager. Going and reading eight bytes out of the Resource Manager is about one of the most expensive kinds of I/O you can do.
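A small sketch of the coalescing idea: many tiny 'STR ' resources folded into a single 'STR#' and read by index; the resource ID is an assumption.

```c
#include <Carbon/Carbon.h>

/* Strings kept together in one 'STR#' resource and fetched by index,
   instead of one 'STR ' resource (and one map lookup plus one data read)
   per string.  The ID 128 is a placeholder. */
static void MyGetUIString(SInt16 index, Str255 outString)
{
    GetIndString(outString, 128, index);   /* one resource holds them all */
}
```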
The other thing that we see with regards to resources, particularly in the use of plug-ins, is enumerating your plug-ins, opening up their resource files to find out something about them. Opening and closing the same files is a pattern that we've seen and we'd really like you to avoid that. Perhaps what you can do is cache that and open, you know, cache the results in one file and open that. Minimally, make sure that you just do that scan once when you actually have to find out about your plugins.
And then, for historical reasons, just lots of calls to UpdateResFile, which writes out the map and the data for your resource forks. People just kind of call it willy-nilly; it's treated more like a flush in people's minds, and that causes I/O. Okay, the last bullet item on here is something that we added to OS X, basically a feature in the Resource Manager to kind of help out with these sets of problems.
What we did is add a new key to your Info.plist called CSResourcesFileMapped. And if you set this, it's a Boolean key, if you set this to true, it'll change the behavior of the Resource Manager with respect to your application's resources. What it'll do is open them up read-only, okay, you can't write to them, and it'll file map them. And then there's some support in the Memory Manager
to support file mapping so that we don't have to allocate for the data that's in your resource fork, which saves on your memory footprint. And because it's all file mapped now, we get a lot better characteristics of I/O because, yes, we're still hitting that resource map and we're still hitting the data, but there's some locality there.
When we go to the resource map the second time, it's likely to be on the same page. And when we go for data, if you're going through data of a certain type, this will depend on the organization of the data, it's likely that we're going to get some good win there.
The only caveat to this, and the reason why we didn't turn this behavior on by default, is that it'll break some of your code. At the point where you say, "Yes, turn this on," then all of the resource handles that you get back, essentially the pointers in them, point to read-only memory. If you try to modify that, your application is going to crash.
So there's plenty of folks who have just turned this on and don't write back to the resources, right? I mean, in general, particularly the application resource file, you probably don't want to write to it, because it might be living on a CD, it might be sitting on a network volume that other people are using. In general, it's bad practice. But let's say you were writing things to a resource but not actually flushing them out to the file. You can still do that.
By detaching the resource, thereby getting an in-memory copy, you can mess with the copy, and everything else still works correctly. So this is something that you have to turn on yourself, and it's fairly straightforward to debug because it basically leads to crashes. And if you have any questions about exactly what the Info.plist is, I would recommend looking at Tech Note 2013.
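A small sketch of the detach-and-modify pattern for file-mapped (read-only) resources; the 'PREF' type and ID 128 are placeholders.

```c
#include <Carbon/Carbon.h>

/* With CSResourcesFileMapped turned on, resource handles point at read-only
   memory.  Detach the resource to get an ordinary, writable in-memory copy;
   the mapped original is left alone. */
static Handle MyGetWritableCopyOfResource(void)
{
    Handle h = Get1Resource('PREF', 128);

    if (h != NULL)
        DetachResource(h);   /* h is now a plain handle we own and may modify */

    return h;
}
```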
Okay, the next section is memory usage. This can be a real big problem on 10, largely because of the big difference between the memory models, in that you just have a very large and sparse address space. For example, just right off the bat, you could accidentally allocate an order of magnitude more than you intended to allocate and not even know it. The system will give it to you. memFull errors are relatively rare on this system.
And that could be a problem. So in order to keep on top of this, I really would recommend getting familiar with both the leaks and the MallocDebug tools. Leaks in particular you want to keep an eye on. You may not necessarily notice the performance hit so much in your app.
It may just be a slow leak over time, but it really does affect what's going on underneath the covers, in that you don't get reuse of the same memory blocks. It'll lead to paging, it'll lead to generally bad characteristics, and it'll lead to your app generally feeling sluggish.
Aside from leaks, I would really recommend that you get a good handle on the size of your application. Particularly, make sure your tools are doing the work that they should for you. Make sure that things that are actually constant in your application end up in the right section, so that the OS takes maximum advantage of that.
We went through a lot of the Carbon frameworks early on and got a lot of gains by doing this. Basically marking strings and other constant sections as constant so that they show up in a text section that gets shared across the system. Same thing goes for your app.
The third thing on memory usage is there's really been a reversal in terms of handles and pointers. On OS 9, the handle was really the first class citizen. It was designed to work with that limited application partition, that heap, and it was designed to be reused inside that limited space. On 10, the reverse is the case. Pointers are really the first class citizens and there's some cost to handles. So in performance critical code, look at rewriting to use pointers instead of handles.
We found one case where just the removal of HLock and HUnlock in a code path made a big difference in terms of performance. The reason there was that the locking costs are sufficiently higher on 10 than they are on 9, and the work that had to be done was really impacted by that.
And this is something that you should do kind of carefully. The OS itself-- I mean, if you're looking into this kind of an optimization, the OS itself doesn't rely or doesn't go and purge and move handles out from under you. That's really under your control on 10. And so if you know that you're not resizing the handle somewhere else in your code, if you know that you're not looking at the handle to see if it's locked, then you're likely able to make this kind of an optimization.
Then lastly, I hope it's pretty obvious, is there's really no purging. The calls are still there. They're largely there so you can run the same app on both 9 and 10. But your purge procs are not going to get called. The application heap is not going to fill up. And if you're relying on basically allocating, allocating, allocating until you get called in your purge proc, that's not going to happen. That's the biggest leak you can ever have. So take a look at that if you're in that kind of a category.
Okay, code loading. This is something that kind of referred to earlier on in launching a bit in that I said something to the effect of defer some of the things that you do at launch time to later on. And one of the ways in which you can do this is to factor your app. Most application code bases start off and are organized basically by the people that work on them first.
So you get Kelly's feature and Mary's feature and John's feature, and they go off and do those different pieces of it. And then Nitin comes along and he has a new feature to add, and pretty soon you've got one more little piece of code added to your application.
And then you add on to the rest of the app, whether it's a shared library or a plug-in, for each person that's working on it. And soon those features grow up and there's a whole team around each of those features. And before you know it, the organization of your app looks a little bit like the organization -- your organization, right?
And that's rarely the best organization in terms of performance. You really want to look at the features in terms of what the app really needs. Probably want to look at layering and dependencies. So factoring in your app in terms of performance is something that I would advise you to do. It's usually not something that you would do quickly. It's something you probably converge on over time.
Look at plug-ins for things that are truly optional. Again, I don't mean optional in the sense that no one would use it. An example of this in the real, live OS is Nav Services and printing. Those are both good categories, from the OS's point of view, of services that are purely optional, in that the application, your application, can run fine, do lots of good work, and never interact with Nav Services or printing. So why should they pay the cost up front?
The answer is that it shouldn't. So look for those kinds of opportunities in your app. Maybe there's a plug-in that has all the bells and whistles that you could ever want, but you only use it once in a blue moon. Factor that out so you don't pay any cost for it. Don't give that plug-in an initial load at launch just to ask it, are you happy with things? That'll cost. Then finally, look at your libraries, look at the number of libraries that you're using.
Libraries, when backed by files, are costs. There are costs all the way down to the kernel. The kernel has a fixed cost. There's a per-process cost. If your tools support it, take libraries and combine them together. Merge your PEF libraries together into one big file. That's infinitely better from a performance and resource use point of view.
Okay, in the async I/O space, we've seen some problems that are kind of interesting in the combination of async I/O and threading. Asynchronous I/O on 10, and by that I mean deferred tasks, the File Manager, the Time Manager, is all implemented by running the operation in question synchronously on a thread that the OS creates for you.
Particularly when used with cooperative threads, or when they're used in combination, there are additional costs there. In general, this is not performing as well as the equivalent on 9. This is a case where I would recommend continuing to do async I/O and chained completion routines on 9, factoring your app and dynamically checking, and doing something entirely different on 10. The simplest workaround, or the simplest solution, on 10 is really just to use threading with a synchronous I/O model.
And then just a data point: there was an app or two that we've seen that implement threading packages on top of the Time Manager. The Time Manager is itself implemented as a thread, which means you're threading on top of a thread that's scheduled by the kernel; you're trying to run threads on top of something that's already being managed by something else. Not a good performance solution.
In the context of cooperative threads, there's one basic flaw with cooperative threads. And that is that in order for cooperative threads to get scheduled, you have to yield, and there's no blocking going on. So by their very nature, cooperative threads are compute bound. That's the biggest problem. They're still there because we know you have code that depends on it. I would really look at not using cooperative threads or potentially using timers, Carbon event timers instead, or moving your code off to MP threads.
Often there's a performance problem with regards to messaging between threads. This usually has to do with polling to see if some message-based operation is complete, versus true messaging. And there I would just encourage, if you're doing things with multiple threads, even across processes, make sure that you're not getting into a situation where both of the threads are competing.
The case in the data point that I have: one thread was doing a lot of file I/O, reading and writing to a file, and the other thread was basically polling to see if it was done. The cleaner solution, and the better performing one, is to have that second thread just block, and when the file I/O thread is completed, have it send a Carbon event to get that whole thing to work well.
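A sketch of the completion notification being described, assuming an app-defined event class and kind; posting to the main event queue is the usual way to hand work back to the thread that blocks in the event loop.

```c
#include <Carbon/Carbon.h>

/* Assumed, app-defined event class and kind for "the file work is done". */
enum {
    kMyEventClassApp     = 'MyAp',
    kMyEventSaveFinished = 1
};

/* What the I/O thread does when it finishes: post a Carbon event to the
   main queue, so the waiting code can simply block in the event loop
   instead of polling a flag. */
static void MyNotifySaveFinished(void)
{
    EventRef event;

    if (CreateEvent(NULL, kMyEventClassApp, kMyEventSaveFinished,
                    GetCurrentEventTime(), kEventAttributeNone, &event) == noErr) {
        PostEventToQueue(GetMainEventQueue(), event, kEventPriorityStandard);
        ReleaseEvent(event);
    }
}
```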
And then finally, seeing situations where basically people just go thread happy. They have just way too many threads for no real apparent reason. And just bear in mind that each one of those threads has a real cost. There's a wired memory cost in the kernel and they're not free. So use them diligently.
And then finally, threading in general can be used to really help out with performance. I would say look, particularly look for things like when you're trying to do a safe save or a fast save kind of feature. That's a very good use of a thread. You can create the thread, do the work on it, and dispose it. A network listener is another model that works really well where the thread is just basically listening for incoming activity. And occasionally there's good use for threading when you're doing low priority idle kinds of computation. Maybe you're indexing something in the background, something like that.
Okay, so we're finally at the summary. Really want to encourage you to factor performance into your planning. Try to really make it be a feature of your application. We want that killer app to be that much better by performing well. Performance isn't a one-shot deal. You really have to keep it in your workflow. You've got to keep on top of it. Ideally, you would, you know, with different builds of your app, try to capture data about that, about how it performs and see and pinpoint where performance problems were introduced.
And then I really encourage you to get into the tools. The tools talk is later on this afternoon. And all those tools are on the system. You should just become experts at them. Those tools allow you to look at your app in various different ways and they're really helpful in pinpointing these problems.
And lastly, just go after those performance problems. All right, so now, let's see. Oh yeah, one last thing. So the first one, the Carbon Developer Documentation, you should just generally know. The second one: if you're not ready to do anything with performance at all, you're stuck behind a whole bunch of, you know, a couple months' worth of features on your app, you're still carbonizing, anything like that, at the very least remember this second URL up here. That performance PDF file has a lot of information on performance. A lot of what I've gone over and what we'll be going over in other sessions is in that one document. Okay, I'd like to bring Mark up, and then he'll do the road map and then we'll head off to Q&A.
Thank you, John. There we go. That one worked. Okay. So as John mentioned, we'd like you all to attend the Performance Tools session at 5:00 today if you possibly can. We're going to talk about the various tools that he introduced you to. And also, because we don't have a lot of time right now to take questions, I'd ask you to take your questions to that session, and we'll have some of the same people there to answer them.