
WWDC08 • Session 412

Cocoa Performance Techniques

Essentials • 1:06:37

Recent developments in Cocoa can help improve your application's performance profile. Using demonstrations and example code, we'll highlight topics and techniques that help you create Cocoa applications that get the most from all the memory and processor cores in the Mac. Deliver the great features of your application with the performance your users crave.

Speaker: Ali Ozer

Unlisted on Apple Developer site

Downloads from Apple

SD Video (710.8 MB)

Transcript

This transcript was generated using Whisper; it may contain transcription errors.

And welcome to Cocoa Performance Techniques. My name is Ali Ozer. I'm the manager of the Cocoa Frameworks team. Okay, so what are we going to discuss today? About half the talk is going to be about concurrency. As you heard during Bertrand's talk and during other talks, concurrency, multi-core, is an important aspect of Snow Leopard, and I do want to cover it in some detail during the first half of this talk.

Beyond concurrency, we're also going to talk about computational complexity, memory, and drawing, and also point out a number of other topics. Now, we're not going to cover all of the performance techniques and topics that we usually cover, but I am going to point at some documentation and outline them at the end to make sure everybody is on the same page.

Okay. So, the first topic is multithreading. Multithreading is the use of multiple threads of execution in one process. And this is in one address space, which of course has benefits but also has some downsides; we're going to see some of these. And these days, of course, thanks to multiple cores, multiple threads are really executed concurrently: two threads are executing at the same exact instant, touching the same exact memory locations. Now, multithreading can have many surprises, and I'm going to show a demo for this.

In the demo, we have a model object with an integer property, and we have various activity objects which are running threads, and these threads are updating this integer as fast as possible. So, they're just setting it with incrementing values. And we can run one thread or multiple threads, and we're going to see how this behaves. So, can we switch to the demo machine, please?

So, I'm not going to show you the source code very much, just this piece here. This is basically the one step of our little thread loop, and it's basically incrementing a value and setting it on the model. That's all it's doing. So let's go ahead and run this.

So, first -- oh, let me just show you this. In the demos we run today, you're going to see this pattern where we have a button up here that lets us switch between the various tests we're doing. And then we can specify the number of seconds to run the test, in this case three, and the number of threads. So, let's go ahead and set a single property from one thread for three seconds. And as you can see, we're getting about 300 million updates. Okay. That's fine. We're running on a pretty nice, fast machine here. So one question is what happens if we go up to two threads?

Now the optimists among you might be thinking we're going to get 2X and some of the realists among you are thinking we're not going to get 2X, we're going to get slightly less than 2X. So let's go ahead and run and see what happens. Now remember we got 300 million with one thread. And with two threads, we get 108 million. Now consider we're running on a machine with eight cores.

We have two threads. Presumably each one is running on its own core. And we're getting about a third of what we got with just one thread. So that's already pretty surprising. Now, you might be thinking, you know, you didn't show us the whole program, there's something shifty going on, maybe the Objective-C runtime is slowing you down. So let's look at that program a bit. I'm going to show another sample, one where instead of setting a value, all we're going to be doing is reading the value back. We're not going to bother setting the value. But again, we're doing a message send, and I'm using the exact same infrastructure for the application.

Let's go ahead and run this. So reading a property, and with one thread, we get 360 million, and with two threads, we do get twice the throughput. So the problem here is clearly in the setting of that value and it has nothing to do with the rest of the program. So if we can switch back to the slides, please.

Okay, so multiple threads aren't always better. Even with N threads running on N processors, and without any sort of locking, and we weren't doing any locking in there, there's nothing to stall us, we're seeing a slowdown. So the problem in this case is what's called contention. Multiple threads are writing into the same memory location, and the hardware is trying very hard to make sure that memory location, and it turns out other memory locations on the same cache line, are visible to all processors. And that update ends up stalling all these cores, and that's part of the problem here. And it's not really obvious what's going on when you look at the program. Now, what are possible solutions in this case?

Well, one is to not use the same shared state. For instance, you might decide to do the work locally, increment your value all locally, and only once in a while update shared state. In some cases, though, you might have to deal with conflicts. And we're gonna talk about some of these topics in a few seconds.

And one other thing to note, in our example, we didn't care about correctness at all. Our threads were just blasting that value in there, and we didn't care what happened to the value, whether it survived or whether it made sense and so on. And we're gonna talk about that right now.

So, atomic properties. When I say atomic properties here, I'm referring to Objective-C 2.0 properties. And you may know this or you may not, but by default they're atomic, and you can make them nonatomic by putting the nonatomic keyword in there. Now, what atomic guarantees is that in the presence of multiple threads, some setting, some getting, the value you're trying to set or get is set or retrieved fully. So it doesn't come back to you as garbage. You either get the previous value or the new value, but you never get some intermediate state. Let's see what this means. Can we switch to the demo machine?

So, in this example, I have a bunch of model objects, and I have a bounds property here. In one case, it's declared atomic, and in another case, it's declared nonatomic. So we're going to see the effects of atomic versus nonatomic. If we run this test case, first let's do the atomic struct case, with an NSRect.

So you can see that with one thread, we're getting some updates. Rather than the number of updates, the failure rate is what we're looking at right now, 'cause we're tracking correctness. With two threads, again, no failures. That's great. And our rate has slowed down a bit, again because of the contention problem. Now, let's go ahead and use a nonatomic struct property with two threads.

And in this case, we are seeing a failure rate. The way we're detecting a failure rate is we're setting a rectangle where the x, y, width, and height are all the same, and when we read it back, we see if they're all the same. And once in a while, they are not. So clearly, we read back an intermediate state. Let's see what happens when we do the same thing with NSStrings, which is, of course, an object. With an atomic NSString, no failure rate. Again, atomic is good. And let's see what happens with nonatomic NSStrings. Okay, crash. And don't panic, that was all expected. In fact, that's the purpose of the demo here. With nonatomic, in some cases, you may crash. Okay, so let's go back to the slides. Slides, please.

Let's look at the struct value case first. By default, multi-word structs are not copied atomically. I'm using a size here instead of a rect, just because it's simpler; as you know, a size has a width and a height. Now, when you see size = newSize, that looks atomic enough, but it's really assigning two values. So if we were to look at what happens here, let's say we have two threads: one is doing a setSize:, one is doing a size. Thread one sets the new width on the size; at the same time, thread two is trying to read the size, and it happens to fetch the width, then it happens to fetch the height, and then thread one happens to set the height.

These can happen non-deterministically. And as a result, what you got back is the new width and the old height, and that's the failure case we're seeing. And of course, not good. Fail. Let's look at the object value case. Now, you might be thinking, aren't objects just pointers? They must be atomic. Well, here's what the setName: and name methods look like, the nonatomic variants: we copy the assigned name and we release the old one. So, again, looking at the two threads: thread one fetches the name into oldName in the first line. And at the same time, before or after this, it doesn't really matter, thread two fetches name with the intent of returning it.

Thread one, in the meantime, ends up making a copy of the new name and releases the old name. And notice that what it's releasing is what thread two is about to return. So in thread two, the name is returned, the client tries to use it, and they get the crash we just saw. Now, a bug like this doesn't always crash. If you're lucky, it will crash for you during development and not when your users use it. Anyway, this is the reason why object-valued properties are also dangerous when they're nonatomic.

So, atomic properties: the whole notion of atomicity is not really overall thread safety. Just by slapping atomic on, or by not marking things nonatomic, you're not making yourself thread-safe. But it does provide a low level of thread safety for a single property. It does not provide consistency between properties, and we're gonna talk about this in the next section. It's often a good idea to leave properties atomic; as we introduce properties in Cocoa, our intention is to make them atomic. When would you consider not making properties atomic?

Well, performance-critical usages. It does have overhead. If you ever see objc_getProperty or objc_setProperty in your samples or your Shark analysis, it's time to see whether atomicity is the reason. Now, of course, if you're calling properties millions of times, you are gonna see these. Just see if the atomicity makes a difference, and if it does and it's critical to you, you can consider eliminating it.

Another case where it might be reasonable to eliminate atomicity, especially if it's performance-critical, is where you have some higher level of synchronization. For instance, the collections, NSArray, NSDictionary, et cetera, do not do atomic behaviors. That's because we expect you to have some other external locking if you're adding items to an array and extracting items from an array. It's a mutable collection, and as long as you're mutating it and reading it, you really should have some external synchronization. Another example is NSManagedObjectContext out of Core Data. It conforms to NSLocking, and if you want to use it from multiple threads, it's a good idea to lock and unlock the whole object graph when mutating it, and that will provide that synchronization.

So, speaking of synchronization: earlier I talked about consistency between properties. The problem is providing a consistent view across multiple related properties. Examples are first name and last name: even if we made the firstName and lastName properties atomic, if one thread set first name and then last name while some other thread read first name and last name, the reader might get an inconsistent view. They might get somebody's last name with somebody else's first name. Same with day, month, and year.

One solution to this is to use properties that combine related state. You might have a Person object that combines first and last name; you would set your first and last name there and assign the Person object. Instead of day, month, year, use an NSDate. Earlier we used NSRect, and that combines four properties: x, y, width, height. By using an NSRect and making the whole rect atomic, we avoid the problem. Instead of individual R, G, B properties, you can use an NSColor. And note that using these abstractions is better anyway because you're getting a higher level of abstraction; for instance, by using an NSColor, you're not limited to RGB, you can now use CMYK, pattern colors, et cetera. But this is just a solution that sometimes applies. Sometimes you do need locking or some other form of synchronization.

Okay, let's go into a case study. We're gonna look at a linked list. And in case you skipped out on your college classes or you just don't remember, a linked list is a simple data structure like this: there are next and previous pointers. Don't dwell too much on this.

And here's what the API looks like. So let's say the scenario is your boss wants you to create a linked list class. And you work really hard and write the linked list, and here's the API you create. As you can see, it's simple and efficient. And by efficient, I mean the whole API here, firstNode, nodeAfterNode:, insertNode:, removeNode:, these are all O(1) behaviors. They're constant time: no matter how big the linked list, these operations will execute very fast. And of course, you have a node object, which just has a value; it carries the value that's inserted at a certain location. Pretty straightforward.

Okay. So, let's say you write this and give it to your boss, and your boss says, great, but I also want indexed access; I would love to have a nodeAtIndex: method. Now, of course, you realize this is an O(N) method, meaning the execution time depends on the length of the linked list. Here's what the implementation looks like: you start from the first node and you iterate with that while loop until you get to the node you want. So, you know, your heart sinks. This is a crappy method, the rest of the API was so good, what can you do? Let's go look at this behavior. Demo machine, please.

Okay, so what we're doing here is, our test in this case is to run through the linked list, starting at some location and moving through incrementing locations, so our threads are gonna be running through the linked list just querying for values. And at every location, we put in an NSNumber with the same value so we can also detect failures if there happen to be any. So let's go ahead and run this. We have our linked list test with 1,000 objects. If we run it, you can see that we're gonna get about 1.1 million updates.

But if we go up to 100,000 objects, you see that the number of updates we can do goes down drastically, because it is O(N) behavior. So we're only getting 33,000 updates, and that's really bad. You may get fired for this. So if we can switch back to the slides, please. So you get an optimization idea: what if you saved away the most recent access and reused it if possible? This is a common usage pattern; often things are accessed sequentially or in clumps. So this might actually help.

So what do I mean here? Here's what nodeAtIndex: looks like. You sort of make some room, and then you go ahead and insert some code. You have this getCachedIndex method, which I'll show you in a second; it gets you the last cached node and index. And then if the node is nil or the index is not applicable, you fall back to the slow way of doing things, meaning looking from the beginning. Otherwise, you can look from this interim location, which is closer to your goal. And then you run through the while loop again, and if the optimization matches, the while loop will be much faster. And then if you did find a result, before you return it, you go ahead and call setCachedIndex, meaning you set that as the new cached result for the next time around.

And here's what setCachedIndex and getCachedIndex look like. You add two new instance variables, cachedIndex and cachedNode, and these methods set and retrieve those values. And we'll see why we just added one method here instead of two distinct properties like we might traditionally do. Okay, so let's see the behavior with this new optimization.

So, running the linked list with the cache, and I'm just going to run the 100,000 element case. Remember, we were getting 33,000. There you go. Now we're getting 30 million. We sped it up by, what, 1,000x right there. So that's great. And of course, you're thinking, if I run it with two threads, it will be even better.

It is running pretty fast, but notice the failure rate. There is a failure rate here. And again, given the topic we've been talking about, it's due to lack of synchronization on those two properties we're setting. So if we go back to the slides: there's a lack of synchronization. Now, by the way, I didn't describe the scenario here, but we mutate the linked list up front, and we don't mutate it anymore; we're just fetching values out of it. So just the act of fetching the values is, of course, setting the cached index.

And as one thread is setting the cached index and node, another thread is reading them and getting inconsistent results; they happen to catch it halfway. So we need synchronization. Note that having the combined set and get API prepares us for adding that synchronization fairly easily. Let's see how that works. First, we're gonna do the traditional locking solutions.

One is to use @synchronized, which is an Objective-C synchronization mechanism. It's fairly simple: you go into your setter, you go into your getter, and you add this @synchronized around the critical section. By doing this, you've guaranteed that the index and node will be set or retrieved together. No other initialization is needed; this is all you need to do. Another solution is to use an NSLock.

And again, the code looks very similar. You lock and unlock around your critical section. In this case, you have to declare the cache lock as an instance variable. You also have to initialize the lock, and you have to release the lock when you're done with it, when your object is going away. So there's a little more initialization here. The third solution is to use a spin lock. We have spin locks at the POSIX layer; it's called OSSpinLock. Again, the code looks very similar: you lock and unlock around your critical region. In the case of the spin lock, you still have to declare it as an instance variable, and you have to initialize it, but it's not an object, so it does not need to be allocated or released. So there's still initialization, but a little less. So if we go back to our demo, let's see what performance we get, first running with @synchronized.

So we're getting about 6 million of these a second. With NSLock, 10 million a second. And with the spin lock, 20 million a second. So we're actually almost doubling our rate with each one. So that's pretty good: we saw three locking techniques with drastically different performance. And 21 million a second is not too bad, or rather, 21 million over three seconds. So if we go back to the slides, please.

So just to recap those three: @synchronized provides recursive and exception-safe mutual exclusion, and it's very convenient, as we saw. NSLock is your basic Cocoa mutual exclusion lock, and it's fairly efficient. What I mean by that is when the lock is locked, a thread that's waiting is put into a sleeping state. The spin lock, on the other hand, is also very efficient and fast, but when it's in a locked state, a waiting thread is actually looping, and hence why it's called a spin lock.

So although it's the fastest, it's also got sort of the most caveats as well. If I were to make an analogy: imagine a traffic intersection with a red light and a green light. With NSLock, the cars at the red light are just standing there with their engines idling, but with the spin lock, the cars are sitting there with their engines revving as high as possible. And you can imagine that all these cars stopped there with their engines revving is really not good for your CPU, not good for your computer. And so you might be wondering, okay, wise guy, what's @synchronized like in this world of cars? Well, @synchronized is not a car you own. It's basically like a taxi or a limo you hire once in a while: very convenient, but you pay a lot for it. So there you go. So the lesson here is that @synchronized is convenient, and if you must use a spin lock, make sure you only use it around small bits of code where there's very small likelihood of contention.

Now I'm going to look at two lockless techniques. Most people have heard about the locking ones, but there are also some lockless techniques we can apply. The two techniques I'm going to talk about are a truly lockless technique and a caller-managed cache.

Now the lockless one is one where we access and update the cache locklessly, but we detect and react to inconsistencies. Now this is a tough one, so I'm going to use some graphics, and you might want to take a sip from your coffee or lemonade or whatever you have. So let's see what happens. So you have two variables, your node and your index that you want to update atomically.

So we put four additional words somewhere. Let's put these on the stack, it doesn't matter; these are not ivars, these are just on the stack. And then we take the high half of your node and put it into the high half of that first word we created. Then we take the low half of your node and put it into the high half of the second. And then we do the same with your index: we put it into the high halves of the remaining two words. Now we have another element, which is a generation count, and this is really a half-word in size.

This generation count is the only thing that's incremented atomically in this scenario. Whenever a thread wants to use it, it does an OSAtomicIncrement32, and it gets back a unique value. So each thread gets back a unique value for this. And then the gen count is placed into the low half of all the words.

So now we have these words, and we copy them into the cache. And by cache here, I mean the instance variables: these four words are now in your instance variables, shared by all the threads. Note that when we copied those four words in there, we didn't worry about synchronization, locking, anything. We just copied them in there, which means some other thread might have been copying at the same time, and these values might be all messed up.

However, when the time comes to read these values, we copy them all to local storage. And then the first thing we do is check to see if the gen counts are the same. If the gen counts are all the same, that means these cached values were not corrupted as a result of all that copying. So if the gen counts are the same, the cache is good, and we can reassemble node and index and profit. That's great. And note that if the gen counts don't match, we just dump the cache; we don't use it, and we go back and do it the slow way. So the way out here is actually pretty simple.

The other lockless approach is a caller-managed cache, and this one is actually fairly simple: the cache is owned and provided by the caller. In this case, instead of nodeAtIndex: as our API, we're going to use an API like nodeAtIndex:usingCache:. And note that the usingCache: argument is a struct pointer. The caller is expected to provide a cache holding the results from the previous call. So this means we're just letting the caller manage the cache. In this case, no locks or synchronization are needed.

So if we can switch to the demo machine, please, let's see what these last two techniques do. Okay. So remember, with the spin lock, we were getting about 20 million. Our lockless solution, okay, that's getting us 25 million, so we're doing a little better. And let's go ahead and look at the caller-supplied cache.

And this one is actually the big winner. This one is getting us 93 million, because it totally avoids all locks and synchronization. There's one more advantage. Note that all the other solutions are writing into the same memory locations; this one has none of that, because all the caches are per caller. So with this solution, if I were to go up to eight threads, instead of 93 million we would probably see close to 400 million, so we would actually be getting a good increment from the cores. So again, eliminating contention and working as locally as possible are the good lessons out of this. So if we can switch back to the slides: there are some caveats, of course. The lockless solutions are hard. That first solution I showed was hard to explain; we needed a lot of graphics, we needed coffee. And the example we used was simple: we didn't even care about consistency, because if the cache was inconsistent, we just didn't touch it. But if you really cared about those values being consistent, you'd have to do a lot more work in there. And most lockless approaches require the use of what are called memory barriers, which we didn't even show. Typically, when you use locks, they have barriers built in that guarantee a certain ordering to the reads and writes that the processor is doing. When you don't use locks, you don't have those barriers, and you might have to use them explicitly.

So most lockless approaches use those. If you want to experiment with or implement a lockless approach, please be sure to read about memory barriers; there is a function called OSMemoryBarrier on OS X that does this. The issue with the caller-managed cache is that it's inconvenient for the caller; we're putting the burden on the caller. And on mutation, all caches need to be invalidated. In the first approach, if there's a mutation, we can just invalidate the cache. But in the second approach, all these callers have their own caches, which somehow have to be invalidated, and that could be tough.

The cache is also not shared between callers, which could be a good thing and a bad thing. It's actually maybe a good thing, like in our example, because if each caller has their own access pattern, they will benefit from their cache being just their own state. So the last one in this case turns out to be the best, really.

So let's talk about multithreading alternatives. So far, we've just been using NSThreads explicitly, but there are some alternatives. Now, one set of alternatives is to not use multithreading at all and use background processing instead. For instance, you can use NSNotificationQueue with NSPostWhenIdle, which means that in the main event loop, whenever there are no events, you'll get notifications. You can use run loop observers. You can use timers. performSelector:withObject:afterDelay: is also a favorite. These all allow you to do background processing on the main thread. Of course, this is not true concurrency; your machine might have 64 cores in it, and 63 of them will be sitting idle, so it's not very good. But it is useful nonetheless, and it does give you responsiveness. For instance, the Cocoa text system uses background processing to lay out text, so when you open a large document, although the document is still being laid out, the user is still able to type and scroll and so on. And it can do this without using any sort of locking, because it knows everything is happening on the main thread.

Now, for true concurrency without using threads directly, another solution is NSOperation and NSOperationQueue. We talked about these last year and the previous year, because we introduced them in Leopard. These allow you to achieve concurrency without explicit use of threads; the system basically decides how many threads to throw at the problem, and if the system is very busy, you might get fewer threads, and so on.

NSOperation is an object that encapsulates a unit of work; you can specify priorities, dependencies, et cetera. And NSOperationQueue manages operations. You can optionally specify the amount of desired concurrency. By default, you would not do this, and NSOperationQueue will use as many threads, to execute as many simultaneous operations, as it decides is appropriate. But if you do specify desired concurrency, like don't run more than four at once, then it will use that as a limit. Now let's look at a special case of this.

That special case is using operation queues for synchronization, and by this I mean serial queues. What's a serial queue? It's a queue which has a max concurrent operation count of one, meaning it will only run one operation at any given time. And at first glance, such a thing does not look very useful.

So what you might use this for is something like this: you have a model, either a model object or an object graph, and you slap an operation queue on it. Then basically all accesses to this model go through this operation queue, so all the changes you want to make are operations on this queue. And if this operation queue has a max concurrent operation count of one, the model does not need any locking or synchronization, because it's only getting one request at any given time.

For instance, one of these operations might say, set the first name and last name, and that'll happen atomically, and the model doesn't have to lock itself at all. Now, this looks interesting, but you're thinking, you know, we've eliminated concurrency here; we're not using multiple cores. Well, one thing to note is that although all the operations will be performed one after another, they may still be performed on different threads, and that's important.

That by itself is not a big win, but if you have many such models in your application, and they all have their own queues, suddenly you've introduced concurrency into your application by having every model be processed on a separate thread. So this is actually a pretty powerful paradigm. It's not one we use in Cocoa very much yet. If you go to Session 419, Doug Davidson's talk at 3:00 in Russian Hill, he will actually show how this technique can be used to optimize a program he'll be demonstrating.

Okay, now we're going to switch gears a bit and talk about computational complexity. And here, what I'm referring to is the computational complexity of some Foundation objects. It's good to know the performance characteristics of your APIs. For instance, the linked list class from earlier: its original API was all O(1), everything was very fast, and then you were forced to add the slow method, which is O(N). But then adding that cache provided O(1) behavior for some usage scenarios, namely incrementing through the items or accessing items in bulk.

So before we talk about the computational complexity of Foundation objects, let's talk about primitive APIs. What we refer to as primitive APIs are the funnel points. These are usually the minimal APIs needed to implement a new subclass. That's not something we do for all Cocoa objects, but it applies to the most widely used ones, such as the collections, string, data, et cetera. The interesting thing about these primitive APIs is that they indicate the overall computational complexity of the object and what the expected behaviors are.

These are often listed in the base class. So if you were to open NSDictionary.h and look at it carefully, you'll see that there is an interface declaration for NSDictionary, which only has three methods. So this looks like NSDictionary is a class with just three methods. However, if you scroll down further, you'll see the rest of the methods declared in categories.

The interesting thing here is that the first three methods are the ones you'd have to override if you want to create your own subclass of NSDictionary. All these other methods are implemented in terms of these first three, and that's how it will all just work.

And these first three also give you a hint of what the performance characteristics are: count, objectForKey:, and keyEnumerator. Now, if you look at NSMutableDictionary, you'll see that there are only two primitive methods, removeObjectForKey: and setObject:forKey:. Pretty straightforward.

And if you scroll further down, you'll see many other methods that are extension methods, meaning implemented in terms of the first two. So, talking about the performance characteristics of NSDictionary: those methods we saw, objectForKey:, setObject:forKey:, and removeObjectForKey:, are all constant time and fast. And by constant time here, I'm not saying O(1), because big-O purists will get mad at this, but what I'm saying is it's fast. NSDictionary is a hashing collection, and when the hash function is good, it's really a very fast collection. If you have an NSDictionary whose objects have bad hash values, the performance really takes a nosedive: access becomes O(N)

and edits are O(N²). And what are bad hash functions? Well, a constant value, no matter how interesting a value, is a bad idea. A random value is also a bad idea, because your NSDictionary will take the object, put it somewhere, and when you ask for it, it won't be there anymore. Actually, it will be there, but the dictionary will look somewhere else. So that's not a good idea either. And here's one that looks better, but it's also bad: we take self's name, which is perhaps an NSString, and for some reason we extract just one character out of it and multiply it by 10,000. We could have used the whole name as the basis of the hash, but instead we use a single character times 10,000, which gives us a limited set of values. That's not good either. So if you do implement your own hash values, be sure to return dispersed hash values, and use as much of your state as possible. One other thing to remember is that if two objects are isEqual:, their hashes must also be equal. So a piece of advice: if you ever implement isEqual: on any object, go ahead and implement hash as well.
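A minimal sketch of that advice, assuming a hypothetical `Person` class whose identity is a `name` string and a `birthYear` integer (both names invented here): isEqual: and hash compare and combine the same state, so equal objects are guaranteed to have equal hashes, and the hash is reasonably dispersed because it uses the whole name.

```objc
// Hypothetical Person with ivars: NSString *name; NSUInteger birthYear;

- (BOOL)isEqual:(id)other {
    if (![other isKindOfClass:[Person class]]) return NO;
    Person *p = other;
    return [name isEqualToString:p->name] && birthYear == p->birthYear;
}

- (NSUInteger)hash {
    // Combine the hashes of exactly the state isEqual: compares.
    // [name hash] already uses the whole string, so values disperse well.
    return [name hash] ^ (birthYear * 31);
}
```

Contrast with the bad examples from the talk: a constant, a random number, or one character times 10,000 would all either collide constantly or send the dictionary looking in the wrong bucket.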

Now, NSArray's computational complexity: NSArray is basically like a C array. Therefore, access is O(1), appending and removing at the end are O(1), and replacing an object, taking an object out and putting another one in its place, is O(1). But inserting or deleting in the middle, where some objects come out and the rest have to slide over, is O(N). That's just like a C array.

At very large sizes, NSArray switches to a tree implementation. And the reason for this is we find that very, very large, big monolithic arrays are quite unwieldy when you start having to reallocate tens and hundreds of megabytes. So we switch to a tree implementation, and this is truly for very large sizes. At that point, all operations become O(log N), which is what a balanced tree will give you. However, using the trick we learned from the linked list, we have a last-access cache, and so for iterative, sequential access and edits, operations are still constant time on an NSArray.

NSString is much like NSArray: access is O(1), appending and removing at the end are O(1), but edits in the middle are O(N). Again, very similar. However, the string you get back from the Cocoa text system, which as you might know uses an NSString as its backing store, does use a tree implementation, for the same reasons NSArray does at large sizes, but in this case it uses it all the time. And again, there is a last-access cache. This actually fits very well with the way a text system is used. Typically, users click somewhere, they type, type, type, then they click somewhere else, they type, type, type. So the first access after a click might be O(log N), but the rest of the accesses are fast because they're all very localized and take advantage of that cache.

Now let's talk a bit about bulk operations. One way to speed up accesses to these objects is to use bulk operations. Now note that this doesn't fundamentally change the performance behavior. If it's O of N, it's still O of N. But bulk operations reduce the cost of access. It's like going to Costco and getting a huge bag of rice rather than getting teeny little bags of rice at your grocery.

So one way to do this is to call the highest-level API. For instance, with NSString, do not call characterAtIndex: repeatedly to do something. Call a method like getCharacters:range:, which returns a number of characters with one call. Or you can do the higher-level things, which are even better.

hasPrefix:, or rangeOfString: to see if a string occurs anywhere in another string, or the freshly added enumerateLinesUsingBlock:. With one line, you're making this high-level call; you don't have to worry about characters, line endings, et cetera. And with NSArray, instead of objectAtIndex:, go ahead and call high-level methods such as addObjectsFromArray: and objectsAtIndexes:. And in Leopard, of course, we added the new for-in fast enumeration, which is very fast, very efficient, and a great way to enumerate through arrays and some other collections.
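Here's a hedged sketch of bulk versus per-element access, assuming an existing `NSString *str` and `NSArray *array` (and a hypothetical search term); the point is fewer message sends for the same work:

```objc
// Bad: one message send per character.
//   for (NSUInteger i = 0; i < [str length]; i++)
//       { unichar c = [str characterAtIndex:i]; ... }

// Better: fetch a whole run of characters with one call.
NSUInteger length = [str length];
unichar *buffer = malloc(length * sizeof(unichar));
[str getCharacters:buffer range:NSMakeRange(0, length)];
// ... scan buffer directly ...
free(buffer);

// Or stay at the highest level and let NSString do the work:
BOOL hasNeedle = ([str rangeOfString:@"needle"].location != NSNotFound);

// For arrays, fast enumeration instead of repeated objectAtIndex::
for (id object in array) {
    // ... per-object work ...
}
```

Remember the Costco analogy: this doesn't change the big-O behavior, it just amortizes the per-access cost.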

While we're on the topic of collections, we're going to talk about concurrency. And you've heard in other talks that we've introduced block-based APIs for enumeration, searching and sorting. One thing these APIs allow you to do is specify a concurrent flag. When you do that, the operation can be done concurrently, meaning it can be done on multiple processors. However, the overall call is still synchronous.

So I call sort, the sort happens, and by the time it returns, the sort is done. But while it's happening, NSArray might choose to use more than one core to do the job. Of course, what this requires is that the block you're providing, like the compare function, is thread safe, and that's really the only requirement. So here's a sorting API, just to take one example: sortedArrayWithOptions:usingComparator:. One of the options you can pass is that highlighted NSSortConcurrent flag. So this is how you use it.
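A sketch of that call, assuming an existing `NSArray *array` of NSString objects. The comparator must be thread safe; this one only touches its arguments, so it qualifies:

```objc
// The call is synchronous: by the time it returns, the sort is done,
// but NSArray may have used several cores to get there.
NSArray *sorted = [array sortedArrayWithOptions:NSSortConcurrent
                                usingComparator:^NSComparisonResult(id a, id b) {
                                    return [a compare:b];
                                }];
```

If the block touched shared mutable state, you'd need to protect that state yourself, which would likely erase the win.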

And let me show you what gains we're seeing here. Now, one warning: our implementation is pretty preliminary, it's very fresh, and the measurements are also just very quick measurements. But I want to give you an idea of what sort of wins we see. On the bottom, you see the number of cores, and along the side, we're seeing speedup.

With 4,000 objects, an array that has 4,000 objects, you'll note that with two cores, we're actually seeing 2x the performance. So that's pretty good. But with eight cores, we're seeing only 2.4x the performance, and that's because with eight cores, the overhead of divvying the work up across eight cores overwhelms the wins we see.

But with 128,000 objects, you see that the curve is a little better. At 4 processors, we're seeing 3x, and with 8 processors, we're actually seeing a bit over 4x speedup. And with 4 million objects, the curve is slightly lower; however, we are still seeing a 3.5x win for 8 processors. Again, very preliminary numbers; we're going to be fine-tuning these. And NSArray will actually not do the job concurrently if it thinks it's not going to be a win.

However, this is a great way to get a 4x or 5x, maybe even greater, speedup pretty easily, assuming your sort block is thread safe. Okay, switching gears again, we're now going to talk about memory. And you can have whole WWDC sessions about memory; in fact, in the past, we have given whole sessions about memory. And often the conventional wisdom, which is still very applicable, is: don't use too much memory. Memory is really the number one performance problem in most applications, and that is true. However, we have this new thing called caching, which you've heard about in other sessions.

So the thing is, if the user has a lot of memory, why not use it? Just give it back quickly if the system needs it. That's another approach to using memory, and most apps already do something like this. One observation: if you run an app for a long time and then go look at its memory usage, you might see that it's using, say, 10 megabytes of memory. Quit the app, restart it, get it back to the same state, and you might often see that it's only using, say, two or three megabytes. Clearly, the app can be in the same state; however, in one case it had accumulated a lot of stuff throughout its run, and if you restart it, it somehow gives that back. So many apps already do this caching as a way to speed up their operations, but one thing they don't know is when to give up these caches. Caching is good, because if the user is using your application, the machine has four or eight gigabytes of memory, and the user is not doing anything else, your app should be able to use a little extra memory as a way to get some speed gain. However, the trick is to give it back to the system if the user switches to another app and starts doing other things. So to that end, we've added this new class, NSCache.

It provides key-value-based access to objects, sort of like an NSDictionary, and it has a strong reference to the objects. What happens is, if there is memory pressure or some other settings match up, the cache will evict objects, which means it will release them, and they are then gone from the cache. Let me just show you the API now.

There's objectForKey:, so the objects are fetched using keys. You can set an object for a key, of course, just like a dictionary, and you can also set an object for a key with a cost. This is the cost of the object's existence: the higher the cost, the sooner the object is likely to be evicted from the cache. If you use the first method, the cost is assumed to be zero, and the cost is in whatever units you choose. And there are, of course, ways to remove objects. We also have two other properties, cost limit and count limit. By setting these, you're setting approximate limits that you'd like the cache to follow. Now, these are very approximate. Just because you set a cost limit of 50, it doesn't mean the cache will wait until 50 before it starts evicting things. It also doesn't mean the cache will always maintain a cost of 50: there might be 50 objects in there, you go away and come back, and it might turn out the objects have been evicted in the meantime. So this is not an exact limit, just a rough boundary. Let's see a quick demo of this. We can switch back to the demo machine.

Okay, I have a simple application here which shows images, and it fetches the images from a folder, a file system location. Let's take a look at this method. It returns the image associated with a file, and the image is fetched from an NSCache, keyed by the image path. If it is found, it's returned, great. If not, we go ahead and create the NSImage and insert it into the cache. Now, the fact that we go to our cache to fetch it every time, of course, informs the cache that there was interest in this image, and allows the cache to update its most-recently-used statistics and so on. So that gives the cache an idea of which images are more popular than others. Let's go ahead and run this application.

So one thing you'll notice here is the background is turning red. Not very good UI, but it's good for a demo here. The idea is that as these images are fetched, if they're not in the cache, we highlight the background to show that there was a cache miss. So if I scroll, every time a new image is revealed, the background flashes red. So you can see that it's also a bit pokey here because it's loading these images. But if I were to scroll up, you can see it's much faster until we get to the top because those images at the top have been evicted from the cache. Again, go all the way to the bottom, we're seeing a lot of red, and then red, red, red, but if we were to go back to the region we're in, again, turns out these were evicted, but those aren't, and so on. So it's all automatically being handled. I think in this demo we're setting a cache count limit of 40, and if you don't set the limit then the system will decide it. And also note that even if all these images might be cached, and as I said, you go away and come back an hour later, they might all be evicted in the meantime since the system was using memory in other ways. Okay, if we can switch back to slides.
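The demo's lookup method can be sketched roughly like this, under assumed names (an `imageCache` ivar of type NSCache, keyed by file path, with the demo's count limit of 40); the actual demo code isn't shown in the transcript, so this is a reconstruction of the described pattern:

```objc
// Somewhere at setup:
//   imageCache = [[NSCache alloc] init];
//   [imageCache setCountLimit:40];   // approximate, as noted above

- (NSImage *)imageForPath:(NSString *)path {
    NSImage *image = [imageCache objectForKey:path];
    if (!image) {
        // Cache miss (the demo flashes the background red here):
        // load the image and remember it for next time.
        image = [[[NSImage alloc] initWithContentsOfFile:path] autorelease];
        if (image) [imageCache setObject:image forKey:path];
    }
    // Hits also update the cache's recently-used bookkeeping.
    return image;
}
```

Note the caller never assumes the entry is still there; every access goes through the cache, so eviction at any time is safe.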

Another memory topic: another thing we've added is a new protocol called NSDiscardableContent. It allows arbitrary objects to automatically free their memory when not in use. When an object adopts this protocol, to access its contents, whatever they may be (they'll be different things for different objects), you call the method beginContentAccess. beginContentAccess pins the contents down, and it's sort of like a count: you're using it.

If this returns NO, that means the contents have gone away and could not be pinned down. So the return value is important. And when you're done using that memory, you just call endContentAccess. These are used in a paired fashion, and there might, of course, be multiple such pairs in existence, like retain and release: there might be multiple parties interested in the contents of a given object.

If an object's contents are not being accessed, it's free to go ahead and free them at any time. We added a new class called NSPurgeableData, a concrete subclass of NSMutableData that implements the discardable content protocol. Now, the interesting thing about NSPurgeableData is that it manages this thing called purgeable memory. Purgeable memory is a special variant of malloc memory which the kernel can simply take back if it's not in use. And the kernel is very sneaky about it. It'll just take it back. It won't even tell you. You won't hear about it. So it's like a good pickpocket trick, very stealthy. You wake up in the morning: oh, where are my keys?

They're gone. Well, the kernel took them. And this is, of course, very nice, because the last thing you need when your system is running low on memory is for all the processes to be woken up and told, "Free all your memory," with all those processes stepping over each other trying to free their memory. You just want to take the memory away from them without even telling them, and hope that when they wake up, they notice the memory is gone. That's their problem. But anyway, that's what purgeable data does. To access its contents, you use the protocol I just showed you. In fact, let me show you a usage pattern.

Now, one interesting thing about purgeable data is that it starts off with content access enabled. So when you first create it, it's as if you've already called beginContentAccess. So here's one usage pattern. You have your data; it starts off nil.

Let's say you haven't allocated it, so myData is nil; or else you allocated it previously, but when you call beginContentAccess to use its contents, it returns NO, meaning the contents have been purged, discarded. In either case, I go ahead and free the data if I had it, and I recreate it, say from a file with initWithContentsOfURL:. Now, if that happens to fail, something really went wrong: the data is no longer around on the file system, so you return or do whatever you do. But if it succeeds, you now have the data and its contents are accessed. In the case of NSData, of course, the contents are the bytes. You go ahead and use the bytes to your heart's content, and when you're done, you call endContentAccess.

Now, the next time you come through this loop, you already have myData, so you call beginContentAccess, and if it returns YES, you're in good shape; you just go use it. If it returns NO, again, you discard the data and recreate it from scratch. NSPurgeableData is your primitive Cocoa-level object for implementing the discardable content protocol, if you wish. Of course, you can do it any way you want.
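The usage pattern just described, as a sketch. Here `myData` and `url` are assumed names for an NSPurgeableData ivar and the file's location; `initWithContentsOfURL:` is inherited from NSData:

```objc
// myData may be nil, or may have been purged since we last used it.
if (!myData || ![myData beginContentAccess]) {
    // Contents are gone (or never existed): recreate from the file.
    [myData release];
    myData = [[NSPurgeableData alloc] initWithContentsOfURL:url];
    if (!myData) return;   // really gone; nothing to recover
    // A freshly created purgeable data starts with access begun,
    // so no extra beginContentAccess is needed here.
}

const void *bytes = [myData bytes];
// ... use the bytes to your heart's content ...
[myData endContentAccess];   // contents are now purgeable again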

So let's talk about the connection between NSCache and this discarding stuff. NSCache is aware of this protocol and of discardable content. If you enable the property evictsObjectsWithDiscardedContent, then when NSCache notices that an object inside it has discarded its content, it will evict it from the cache and return nil when you ask for that object. Typically, when you're using the discardable content protocol, the object's contents go away but the object itself is still alive. By putting such objects in the cache, you're allowing the objects themselves to go away when their contents go away. So it automates the process even a little further.
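The combination is a one-line opt-in; a minimal sketch:

```objc
NSCache *cache = [[NSCache alloc] init];
// Let the cache drop entries whose discardable content has been purged.
[cache setEvictsObjectsWithDiscardedContent:YES];

// Now if a stored NSPurgeableData's bytes get purged by the kernel,
// objectForKey: returns nil and you simply recreate the entry, rather
// than getting back a live wrapper object with dead contents.
```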

Our last major topic is drawing. And we have talked about drawing before; it's a fascinating topic, and it's another major reason why programs tend to be slow. We did say we're going to talk about a lot of new stuff, but I just can't help talking about some old stuff. This is from 2001: a demo we showed back in 2001 at WWDC. How many of you were there? Oh, that's about 10%. Great. Well, it's good to see you back. Thank you.

And all the new people, welcome. Okay, so let's go ahead and run Worm. This is a fairly old app that has some good lessons about drawing, and we're going to show a few more lessons in conjunction with it. So let me go run this. By the way, we ship this as an example on your systems; we've been shipping it for seven years, and it hasn't really changed very much. The worm game is this game where you're a worm. You go around trying to eat this dot, and every time you eat it, you get bigger, and the bigger you are, the harder it is for you to move around. You might have seen this in movies such as "Tron" or in the Snake game on Unix. Anyway, it's pretty lame, really. One thing we do use it for, though, is to show animation speeds. Worm has a bunch of modes here to show you different animation techniques. And it's really showing drawing, not animation, because for true, fast animation these days we have Core Animation and so on. But let's go ahead and run the slowest mode. And when you run it, you'll notice that it's going along at a zippy 60 frames a second.

And back in the day, it was running, I think, 12 or 15 frames a second. So we're already seeing a big win here; this app is doing great. So let's go see what the OK mode does. It's also going along at 60 frames a second. Better? That's also going along at 60 frames a second. And it begins to dawn on you there's something fishy here. Many of you will know that this is because of what we call beam synchronization. Beam synchronization is the system's attempt to make sure apps don't draw faster than what's needed and don't hog the display, the screen, which is, after all, a shared resource. Now, we can disable beam synchronization; this is a good thing to do for testing and debugging purposes. And when we do that and run it again, we see that Worm is actually capable of running at 490 frames a second. So, yeah, try playing that. I'm just going to enable the OK mode, and in the OK mode we're seeing 500 frames a second. The better mode: we're seeing 1,200, 1,000, 1,400, it's up there. And let's look at the best mode. The best mode is giving us 1,300, 1,100, et cetera; it's actually very similar to the better mode. Now, I won't show you the new mode right now, but let's talk about what we just saw. Go back to the slides, please.

So what are the drawing tips from Worm? What do those settings do? Well, the first setting is our classic; we say it at every WWDC: override isOpaque to return YES. If your view draws everything in its bounds, overriding isOpaque ensures that views behind it are not drawn. So this is a big win, a good thing to do. The better mode basically gets us to redraw as little as possible, and we'll see this in action soon.

Instead of calling setNeedsDisplay:, which dirties your whole view, compute the smaller rect that needs to be displayed and dirty that: call setNeedsDisplayInRect:, and in your drawRect:, pay attention to the rect argument you've been given. This is again a classic, and it again helps. As you saw, this is what got us more than 1,000 frames a second. The best mode is not really applicable these days. It used to be that instead of using string drawing, NSString methods such as drawAtPoint:, drawInRect:, et cetera, you could use the Cocoa text system directly and cache away the components for huge wins. These days, string drawing has been sped up so greatly that you often do not have to do this. Use the text system when you need a text system; otherwise, just use string drawing and it will do the job most of the time. So it's not something you have to go the extra mile for these days, but if your string drawing seems a bit slow, this is something to consider. Now, one other tip that we saw is that the system limits you to 60 frames a second. You should not ask to be drawn at higher than 60 frames a second; in fact, if you do, your program might actually stall waiting for the display to become available, so it can even have an adverse effect on your program's execution. Now, let's switch back to the demo.
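The two classic tips can be sketched together; `WormView` and its methods are hypothetical names standing in for the sample's actual code:

```objc
@implementation WormView

// Tip 1: we fill every pixel of our bounds, so say so. AppKit then
// skips drawing the views behind us.
- (BOOL)isOpaque {
    return YES;
}

// Tip 2: dirty only the cells that changed, not the whole view.
- (void)wormDidMoveFrom:(NSRect)oldCell to:(NSRect)newCell {
    [self setNeedsDisplayInRect:oldCell];
    [self setNeedsDisplayInRect:newCell];
}

- (void)drawRect:(NSRect)rect {
    // Honor the rect we were handed: fill the background for just the
    // dirty area, then draw only the segments that intersect it.
    [[NSColor blackColor] set];
    NSRectFill(rect);
    // ... draw worm segments intersecting rect ...
}

@end
```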

And I want to show you the last mode, running on this machine. It's consistently giving us 1,300 frames a second, slightly faster than some of the previous modes; it's consistently faster, while the other one was jumping around 1,000, 1,200, and so on. So, okay, let me explain what this one is. Let me actually show you what the OK mode does. This app, Quartz Debug, with which we turned off beam sync, also has flash screen updates, which flashes every part of the screen that's been updated by drawing operations. So in the OK mode, you'll see that Worm is drawing the whole view; that yellow flash appears every time we update that view. In the better mode, you see that we actually take the time to compute one rectangle. And in this new mode we added, we update just the rectangles that are changing.

So if I start playing here, you can see that it's really doing a very nice job. Ah! Anyway, you can see that it's doing a very nice job of computing the update area. Okay. And if we can switch back to the slides: to minimize drawing even further, in drawRect:, instead of using the single union rect you were given, call the method getRectsBeingDrawn:count:.

It returns the current list of rects that are dirty. Or, if you have a rect of your own, you can call needsToDrawRect: to see if it's in the dirty region. And this gives you this nice behavior: instead of that whole union rect, it gives you just the little regions. So in this case, it helps; in other cases, it might help even more.
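A sketch of drawing against the fine-grained dirty list rather than the single union rect (`segmentRect` is a hypothetical rect of your own):

```objc
- (void)drawRect:(NSRect)rect {
    const NSRect *rects;
    NSInteger count;
    // Fetch the individual dirty rects instead of their union.
    [self getRectsBeingDrawn:&rects count:&count];
    for (NSInteger i = 0; i < count; i++) {
        // ... draw only what falls inside rects[i] ...
    }

    // Alternatively, test a candidate rect of your own against the
    // dirty region and skip it entirely if it doesn't need drawing:
    // if ([self needsToDrawRect:segmentRect]) { /* draw it */ }
}
```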

Now, more drawing tips that are good to know. Flash screen updates, which you just saw, is a very good way to see if you're overdrawing. You can also optimize live resize speed: if your drawing speed seems fine, but the window is a bit slow when resized, you can check inLiveResize and do some rougher, quicker drawing as a way to make live resize faster.

You should eliminate usages of NSCachedImageRep; this is now a mostly deprecated class. One thing we're doing is eliminating window backing stores in NSImage, and that is something that's happening automatically. But if you use NSCachedImageRep, for compatibility, we usually have no choice but to use window backing stores. So it's a good idea to just get off this class, and off window backing stores, which gives NSImage a huge performance boost in many cases.

You should hang on to NSImages, NSBezierPaths, NSColors, and NSGradients whenever possible. As you reuse these objects, we can cache some state down in the graphics subsystem, and you might get benefits you wouldn't see if you were recreating them every time. Of course, you might be asking: if I hang on to these, am I not hanging on to all this memory? Remember, NSCache is your friend. You can just throw them into an NSCache, and they'll go away automatically if needed. One other thing to consider is enabling QuartzGL. QuartzGL can be enabled on a whole-app basis by putting a key in your property list, or on a window-by-window basis by calling a method on the window, and it puts your window backing store in video memory. If you do this, you basically get many of the benefits you would get if you were using OpenGL to do your 2D drawing. So much of your Quartz drawing, and some of your Core Image operations, et cetera, will be sped up. But this also increases your memory usage. So this is something you might want to consider if your window does a lot of animation or heavy-duty drawing, to see if it helps. Thank you.

Now, there are other facilities available for drawing, of course. Quartz is what Cocoa uses as the graphics substrate. It's used heavily by Cocoa objects, but you can use it directly yourself if you need to. Core Animation is used indirectly in NSView's animator, but you can use standalone CALayer trees and assign them to views; inside a view, you can have many, many CALayers if you wish and manipulate them directly. Core Image gives you fast image processing, filtering, and so on. Quartz Composer is great for creating animations. QTKit, of course, is Cocoa-level access to QuickTime, media files, et cetera. And OpenGL, of course, gives you a very fast 3D engine. Note that most of these also do their work on separate threads, so they introduce behind-the-scenes concurrency into your app in various cases. That's another benefit of using some of these. Of course, don't go use them as your first resort, but when the situation is right, when you need that extra boost or that subsystem, they are very appropriate. Thank you. One other thing we added in Snow Leopard is concurrency in drawing. You heard about this perhaps earlier or perhaps yesterday. We now allow views to be drawn concurrently just by setting this property, canDrawConcurrently.

Now, this doesn't mean the view will be drawn concurrently all the time. It gives AppKit permission to do so, and if the situation is right, AppKit will draw the view concurrently along with its siblings. What happens is the display mechanism kicks in, and as your window is being redrawn, if NSView notices that a certain number of children have this flag set, those children can be drawn on separate cores. However, by the time the display method returns, all drawing has been done. So this is not equivalent to views being drawn randomly here and there on separate threads, totally concurrent with your app; this is still being done in a very controlled fashion. Let's see a quick demo of this.

Ah! Let's go back to the normal state here. Okay. So again, we just call setCanDrawConcurrently:YES as part of this demo. Let me run it. This is a simple little demo which has three table views with a bunch of data. As you can see, it's already fast enough, nothing to worry about, so we're going to give it a big kick by slowing it down: every fetch from the data source is going to introduce a 0.01-second delay. That's going to give us slightly sluggish behavior here. Let's go ahead and bring up the frame meter that Quartz Debug has, put it over here, and resize our window in an automated fashion. As you can see, the window is resizing, and it's not great: we're getting 20, 22, 24 frames a second. But if we enable concurrent drawing, you can see that it's a bit smoother, and we're going up to about 40, 42 frames a second. Let's go ahead and do that again.

44, 45. So, again, concurrent drawing really helps in this case. We've enabled concurrent drawing for these three views, and they're all table views. Concurrent drawing is really good for cases like this where you have big sibling views; Mail is this way, Xcode is this way, iCal is this way. Many applications tend to have these big sibling views, so they can benefit from this behavior. And since these views aren't being drawn haphazardly on background threads, the threading requirements of your model remain pretty much the same. It's just that instead of your model being called from the main thread, it's going to be called from a background thread; but it's not going to be called while, for instance, the user is doing some action, because it's only called in the limited window of a display pass, the whole display machinery kicking in. So this might be fairly straightforward to turn on in most situations; you should go ahead and experiment to see if it's a win. And you can even turn it on in your Leopard- or Tiger-based app just by checking to see if the method exists and calling it, because it's a dynamic runtime; you can do these things dynamically. If we can switch back to the slides, please.
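The dynamic opt-in just described can be sketched in a couple of lines; `tableView` here stands in for whichever big sibling view you want to opt in:

```objc
// Opt the view into concurrent drawing, guarded with a runtime check so
// the same binary still runs on releases without the new method.
if ([tableView respondsToSelector:@selector(setCanDrawConcurrently:)]) {
    [tableView setCanDrawConcurrently:YES];
}
```

On older systems, the respondsToSelector: check fails and the call is simply skipped, so the app behaves exactly as before.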

Okay, so I just want to enumerate a bunch of other topics, because this is a Cocoa performance talk, and if this is the first Cocoa performance talk you've come to, I want to make sure you're aware of some of the topics. Responsiveness: this includes things like app launch, how fast the app resizes, scrolls, and responds to user events. There are many things you can do here, many techniques for speeding up app launch and reacting quickly; be sure to pay attention to these. Fast shutdown is a part of this.

Mark Petrelli will talk about this in session 425 this afternoon at 5 o'clock in Russian Hill. This is basically how to get your app to quit as fast as possible so the logout or shutdown experience can be quick. Memory usage is a huge topic. Auto-release and how much memory do objects use, et cetera.

There's a lot written about this, especially with garbage collection versus explicit memory management, things to be aware of. 64-bit: you've seen that there are good CPU benefits to 64-bit. It turns out, of course, 64-bit does have a memory impact. Pointers are bigger. The integers that we use in Cocoa are bigger. So you need to be aware of the tradeoffs. Avoiding impedance mismatches. Impedance mismatch is a term we use for whenever you're calling through APIs and you have to convert one data type to another. Like, let's say you have an NSString that you use for a file reference, and you have to convert it to an FSRef, and so on.

And every time you do this, of course, it's sort of wasted energy. And if you're doing this a lot, it does show up in program samples. It's good to avoid impedance mismatches by keeping data in your program in consistent types. So don't convert from `char *`s to NSStrings or RGBs to NSColors back and forth all the time. Try to stick to high-level abstractions whenever possible. And finally, of course, you know, I didn't show you many of them except Quartz Debug, but we have a wonderful suite of performance tools. There are other sessions about them: Instruments, Shark, MallocDebug, Quartz Debug, and a bunch of others. Be sure to use these and check for hot spots in your applications.
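As a concrete sketch of the NSString-to-FSRef mismatch mentioned above: `FSPathMakeRef` is the Carbon call that performs the conversion, and the wrapper function name here is invented for illustration.

```objc
#import <Foundation/Foundation.h>
#include <CoreServices/CoreServices.h>

// Sketch: each call like this crosses an impedance mismatch --
// NSString -> C string -> FSRef. If you find yourself doing this in a
// loop, consider converting once and keeping the FSRef around.
static OSStatus MakeFSRefFromPath(NSString *path, FSRef *outRef) {
    return FSPathMakeRef((const UInt8 *)[path fileSystemRepresentation],
                         outRef, /* isDirectory */ NULL);
}
```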

In our documentation, there's a whole subsection dedicated to performance. You should go ahead and look at that. And also last year's session, session 141, "Boosting Responsiveness and Performance in Your Cocoa Application," talks about a good number of the responsiveness issues. You would find this in, I believe, ADC on iTunes. There are a number of other sessions you can get to from last year's talk, and you should go try to grab those before they might go away. I don't know if they will, but, you know, sometimes they do. So, in summary, it's a multi-core world. Multi-threading is hard to get right and to get fast, but if you do the right things and avoid contention, there are a lot of benefits. You might want to consider using NSOperation and NSOperationQueue, and of course the GCD stuff, which has been added in Snow Leopard, is also very powerful. We are looking at integrating that underneath NSOperationQueue and NSOperation as well. But using these facilities instead of threads gives you more of an abstraction, more leverage.
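A minimal NSOperationQueue sketch of the idea above. The `worker` object, its `recalculate:` selector, and `data` are hypothetical stand-ins for your own work, not names from the talk.

```objc
#import <Foundation/Foundation.h>

// Sketch: hand work to an NSOperationQueue instead of spawning a thread
// yourself. The queue decides how many operations to run concurrently
// based on the number of cores available.
NSOperationQueue *queue = [[NSOperationQueue alloc] init];
NSInvocationOperation *op =
    [[[NSInvocationOperation alloc] initWithTarget:worker
                                          selector:@selector(recalculate:)
                                            object:data] autorelease];
[queue addOperation:op];  // -[worker recalculate:data] runs on a background thread
```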

Know the performance characteristics of APIs. And, again, you know, use memory if it's going to help you, but know when to let go. And as far as drawing goes, if there's one thing we can say, it's to minimize it. Just minimize drawing and be aware of what you're drawing. We have two sessions left today for Cocoa, and both of these sessions have performance topics. They're both in Russian Hill, 419 and 425. They're definitely worth seeing.
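One common way to minimize drawing, sketched below: invalidate only the rectangle that actually changed. The function name is invented for illustration; `setNeedsDisplayInRect:` is the standard NSView method.

```objc
#import <Cocoa/Cocoa.h>

// Sketch: redraw only what changed. Invalidating a small rect keeps the
// work done in -drawRect: to a minimum, instead of repainting the view.
static void InvalidateChangedArea(NSView *view, NSRect changedRect) {
    [view setNeedsDisplayInRect:changedRect];  // vs. [view setNeedsDisplay:YES]
}
```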