Mac • 1:04:51
Concurrency is a powerful but complex opportunity for modern applications. Explore challenges and solutions for designing multi-threaded applications. Examine strategies for working with notifications between threads, processing data asynchronously, and maximizing your application's efficiency with the file system and network.
Speaker: Ben Trumbull
Transcript
Good afternoon everyone. I'm Ben Trumbull, I'm an Engineering Manager in Core Data and Developer Tools, and I'm going to be presenting about designing Cocoa applications for concurrency. So we're going to talk a little bit about the different design challenges faced when writing multi-threaded code, and in particular we're going to put up with a lot of this complexity because we're interested in better performance.
So we want more responsiveness from our applications. We want more raw speed, and we also want to use the resources of the computer, all these extra cores, more efficiently, and we're going to take that time saved and spend it on new features for a better user experience, unless you just like making things faster.
And one of the particular challenges faced is how finely to split up your application into separate pieces. If the pieces are too big you're not going to get very much concurrency. Obviously if you only have one piece you're not going to get any. At the same time, if you flip over to the other extreme and the pieces are too small, then you're going to have more overhead and more complexity.
So in particular the focus of this session is on finding the right granularity for the tasks that you're going to be offloading to other threads or other queues, and so in this case we're going to be a little bit like Goldilocks: not too big, not too small. And the perfect task granularity varies for everybody, but it's going to be limited by the amount of overhead each of your individual tasks incurs, in particular due to the coordination mechanisms you use to manage these tasks, and by the amount of resource contention these tasks start incurring as they operate. So in this session we're going to overview a cast of characters and define some terms. We're going to review some of the coordination mechanisms that are typically used in most programs.
We'll examine a variety of resource contention issues, since that's really going to impact how fast your tasks can operate independently, and then sum that together by looking at how to balance task granularity. And then we're going to explore two particular issues that impact a lot of Cocoa applications, multi-threaded notifications and the UI layer in general, and these are issues that tend to trip up a lot of Cocoa developers.
So for our cast of characters there are some basic introductions and terms that I'm going to be using throughout the presentation. I assume you're probably familiar with most of them, but a quick overview of tasks, threads, the difference between dispatch queues and NSOperation queues, locks, atomic operations, and this thing that pops up every now and again called futures. So, tasks in computer science mean a lot of different things, and for the purposes of this presentation I'm just going to refer to a unit of work, a distinct grouping of code.
It might be executed either by a thread or a dispatch queue; this would be, for instance, a block on a dispatch queue or an NSOperation object within an NSOperationQueue. And for a thread this is typically going to be its main function. So we're not going to be referring to something like a process or an NSTask. Threads: I assume you all know what threads are, but they're something of the protagonist, or perhaps the antagonist, of our presentation today.
So this is your thread of execution, an independent flow of control, and its key distinguishing feature is that it has a shared address space with all the other threads. This is what makes them different from processes, and it's a really powerful feature of threads, but it's also their Achilles heel, and we're going to spend most of the presentation resisting the urge to leverage this feature too much. So dispatch queues: this is Grand Central Dispatch, and this is new in Snow Leopard. They are easier and more efficient than threads. It's basically a way of coordinating different work units.
There are three separate presentations on dispatch queues this week, so I'm not going to go too heavily into what exactly they are, other than that they're a way of managing and enqueuing different tasks. In addition, Grand Central Dispatch has a library of common operations that you would have to write your own code for if you were working directly with threads, so they're really very convenient. Operation queues: this is NSOperationQueue.
It's a Cocoa API for work queues, so you're basically enqueuing tasks similarly to Grand Central Dispatch, and on Snow Leopard it's built on top of Grand Central Dispatch. One thing to note is that it's actually available on 10.5 and also on iPhone OS without Grand Central Dispatch, and it builds some additional functionality, dependency analysis, and some conveniences into the Cocoa layer on top of Grand Central Dispatch on Snow Leopard.
So locks are one of our primary mechanisms for coordinating between tasks. They're a simple way of keeping resources from being overused; they typically permit just one thread at a time, and there's a wide range available for you to use. Atomic operations are basically hardware-supported primitives. They're typically basic arithmetic or bit-shifting instructions, plus compare-and-swap is another very common one, and one way to think about it is as if the destination address of your computation had a teeny tiny encapsulated lock in it; that's kind of how they behave.
So when you add something together atomically, the address of that summation is going to act like its own lock.
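To make that concrete, here's a minimal sketch using the OSAtomic primitives from libkern; the shared counter and the claim-a-slot helper are made-up examples for illustration, not anything from the session.

    #include <libkern/OSAtomic.h>
    #include <stdbool.h>

    static volatile int32_t sCompletedTasks = 0;   // hypothetical shared counter

    // Any thread can bump the counter without an explicit lock;
    // the destination address behaves like its own tiny lock.
    void TaskDidComplete(void) {
        OSAtomicIncrement32Barrier(&sCompletedTasks);
    }

    // Compare-and-swap: only install our value if the slot is still empty.
    bool TryClaimSlot(volatile int32_t *slot, int32_t ourValue) {
        return OSAtomicCompareAndSwap32Barrier(0, ourValue, slot);
    }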
And then futures are placeholders that get passed around for an unknown value. They're typically returned by asynchronous APIs and they're used to improve the responsiveness of those APIs. They're a little bit like a coat check ticket. You don't necessarily need something back right away, but you do need a handle back to it, and there won't be any blocking until you absolutely need it. That brings us to the first section in which we're going to use these terms, and this is the coordination mechanisms that are going to help us coordinate between all the tasks we create within our application; they're also going to represent some of the cost of having separate tasks. And the reason why we have all of these different coordination mechanisms is to address some problems we're experiencing with threads.
First of all, as you're no doubt aware, threads are very non-deterministic. The execution between threads is interleaved pretty chaotically, and sharing resources and data between threads without extra steps is going to have a lot of side effects in your application. So this is just with two threads, and this is just with 2 to 32 instructions, and you can see that the number of different interleavings between threads is growing exponentially. I don't know about you, but most of my threads do more than 32 instructions, and already this has gotten out of hand.
So our goals for coordination are going to be to prune the complexity back from our application, to reduce the amount of non-determinism that using all these threads induces, and to safely share resources between threads. That brings us to locking, which is going to be the simplest and most frequent form of coordinating between tasks, for preventing simultaneous access either to a resource or to a section of code, and it's really the most prevalent; it's very easy to use.
Unfortunately, it does have the disadvantage that when we have a lot of locks together you find yourself prone to deadlocking. Some of the common locks that you'll find on Mac OS X are the @synchronized keyword in the Objective-C language itself, NSLock in Cocoa, OSSpinLock offered by OSAtomic in libkern, and the pthread library, which has a reader-writer lock and a bunch of other things. Now from a design perspective most of the locks are going to behave fairly similarly at a strategic level. Tactically they have very different performance characteristics, and some of them have different features.
Obviously a reader-writer lock is doing something a little bit different than a spinlock, but for the most part you can learn about these things in the Mac OS X Threading Programming Guide, which goes into a lot of detail about the different locks you have available to you on Mac OS X. So I mentioned that having lots of independent locks is prone to deadlock, and so what do I mean by an independent lock?
This is when you have locks that you can acquire and release in any order, and you can call out to other pieces of code while you hold these locks. So these locks don't really have any dependencies between each other, they stand apart, and a thread can accumulate multiple of these locks. So this thread, for instance, can have both Lock 1 and Lock 2 at the same time, can be running code that depends on those locks, and can release them later whenever it feels like.
This is in contrast to an encapsulated lock, where an encapsulated lock is something that's hiding behind API and is going to be released when that API returns. So it's released at the end of the scope, and one of the key features is that there's no call out to arbitrary code. This makes it impossible for a thread to hold onto multiple different encapsulated locks at the same time, and as you might imagine this is going to make it much easier to avoid problems like deadlock.
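As a rough sketch of what an encapsulated lock looks like in Objective-C (the cache class and its _storage ivar are hypothetical): the lock is taken and released inside a single method, and the critical section never calls out to arbitrary code.

    // _storage is a private NSMutableDictionary ivar of this hypothetical cache class.
    - (void)cacheValue:(id)value forKey:(NSString *)key {
        @synchronized (self) {
            [_storage setObject:value forKey:key];   // no call outs while the lock is held
        }
    }

    - (id)cachedValueForKey:(NSString *)key {
        @synchronized (self) {
            return [[[_storage objectForKey:key] retain] autorelease];
        }
    }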
Specifically, if you end up being able to take different locks and then call out to arbitrary code which might need more locks, you find yourself in the situation where multiple threads might each have, say, half the locks they need to get their job done. So this is a fairly textbook deadlock, where one thread has the first lock and it needs two locks, and the other thread is holding on to that second lock.
As an aside, for extra style points, it is possible to deadlock yourself on a single thread, and this is a common problem with spinlocks, because they're non-recursive. If you hold a spinlock and then you call out to arbitrary code, and that code just starts calling back into you, which is fairly common in object-oriented programming, you have managed to deadlock yourself.
So the most frequent way of managing all these independent locks is something called lock ordering. The problem with deadlocking is that we're acquiring multiple locks in an inconsistent order, so we need to define an order and build a protocol around it, sort of on the honor system if you will.
We're going to acquire the locks in a fixed order and then release them in the opposite order, and in this way these two threads, which are both using the two locks, actually cannot deadlock each other.
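A minimal sketch of such a protocol with NSLock (the locks and the work in the middle are made up): anyone who needs both locks takes them in the same fixed order and releases them in reverse.

    // Protocol: always acquire lockA before lockB; both are created once at startup.
    static NSLock *lockA;
    static NSLock *lockB;

    void DoWorkNeedingBothResources(void) {
        [lockA lock];
        [lockB lock];
        // ... touch both protected resources here, without calling out to other code ...
        [lockB unlock];      // release in the opposite order
        [lockA unlock];
    }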
So this works, and we can build up to more locks, but there are some obvious limitations. You have to use this locking protocol all the time, consistently, and it can be a little fragile when you're maintaining large code bases and adding new features. For example, one thread might think it only needs the second lock, so it doesn't necessarily grab the first lock all the time, and then later on you add more code, and that piece of code calls out to your new feature which does in fact use the first lock, and so you've broken the protocol.
So you probably want to avoid having more than two independent locks participating in any kind of lock ordering scheme, just to keep in mind some of the limits as you develop your application, add new features and whatnot. As an aside, the reason why it's two as opposed to three is because three is too many.
Some years ago I worked on a different project, and I was tasked with adding multi-threading support to an existing framework, which is also something you probably shouldn't do, because there's a bunch of API already in place and an architecture that's already in place. So it can be very difficult to design multi-threading properly in that environment.
But one of the key issues was that there were three separate resources that all needed independent locks, and because they were in different layers of this API it was very difficult for these locks to participate in any kind of fixed ordering; in particular, different layers didn't necessarily even know what kind of locks the other layers had. So it got to be very difficult, and I spent many months, six months or something, this was a long time ago so my memory is a little hazy, but I basically spent an entire release cycle debugging that one set of multi-threading issues.
So three is definitely too many locks. And here's a graphic that I wish I had sat down to think about when I was younger: here we go from four different permutations of locking with two locks and two threads, to eighteen permutations of how all of these locks might be acquired with three locks and three threads.
So you can also see, and this is a recurring theme with multi-threading, that the complexity of your application is going to grow exponentially if you don't take steps to prune it back. So the three locks have eighteen permutations, and just for fun, about a third of them are going to work correctly, or well, correctly may not be the right word, they will function. So again this is going to make it more challenging to debug multi-threaded code when you've overextended yourself.
Alright, so there are going to be a bunch of different locks available to you. The Mac OS X Threading Programming Guide has a wide list of them. In general you're going to want to prefer encapsulated locks, where a lock is hidden behind an API boundary and it doesn't allow call outs, so it controls a particular scope. But when you need to use independent locks you're going to want to have an order, and you want to have a protocol that you build on top of them.
So if all of these independent locks aren't going to scale very well to having lots and lots of them, we have another technique to use, and this is sort of the antithesis of locking: it's called isolation. The goal here, like with locks, is to turn a very complex set of interactions between tasks into something fairly simple.
If you imagine a single thread spawns off four subtasks which then rendezvous back with a result, this is something I feel much more confident about debugging, and at the same time we can see that you can actually get some real concurrency out of it. One of the key points to note is that these subtasks are not talking to each other. So they're not locking. They really have no particular interaction. They're quite independent. But in the real world we do have some constraints, and sometimes that's not always possible.
So at least here we have four separate clusters, and while the subtasks are not necessarily perfectly isolated, at least these clusters are isolated, and we have the opportunity to focus our debugging efforts within a cluster and not have to worry so much about what threads in other clusters are doing, because they're not going to be changing our state.
So we're going to focus on subdividing a multi-threaded application into a set of serial pieces to reduce our complexity, and in this way isolation is to concurrency a lot like encapsulation is to object-oriented programming. This is a pretty fundamental technique, and basically what we're talking about doing is trying to keep all these threads from interacting directly with each other.
It's also called confinement, and the reason for that is we're going to confine a resource's scope to a specific thread. So other threads are not going to have direct, arbitrary shared memory access to those resources. They're going to have to enqueue requests formally through a specific channel.
Now this brings up the question: if we're not going to be doing locking between these tasks, what are we going to do with the shared structures the tasks are working with? The first approach is more tractable than a lot of people might think, and that's to give each of these subtasks their own copy of the data to work on.
That's not always practical, but it actually turns out to be more reasonable than you might imagine. The other alternative is to basically create a resource-centric view of a particular task. So instead of having a task that owns a resource, you actually give a resource its own task, and people who want to use that resource have to enqueue requests specifically for that resource.
So those are two alternatives to doing a lot of locking on a shared data structure. You can see an example of this kind of pattern with the web. Here clients and servers own their own resources. I can't go poking around whatever NFS mounts news.yahoo.com might be using, and at the same time the servers don't necessarily know everything that my client is doing.
There's a formal communication channel that's very well defined, HTTP, and this really allows us to encapsulate a lot of complexity between the clients and servers. So this is probably the most commercially successful example of the isolation design pattern, and it's true this is for distributed computing more than concurrent programming, but this is the kind of pattern we're going to want to consider bringing into our own applications. And the nice feature about this is that the servers and the clients can themselves be very complicated.
They don't necessarily have to use isolation internally, and we've created clusters that are fairly independent. So a single web address might be backed by a huge server farm, or Safari might be running a new JavaScript engine, and it might be talking to multiple different websites simultaneously, and the servers don't need to worry about what Safari's doing, and Safari doesn't need to worry about the server farm backing some kind of DNS striping.
So some typical example mechanisms for this kind of isolation: dispatch queues themselves, in Grand Central Dispatch; each of these queues defaults to being serialized, so each block registered with a dispatch queue will execute in order, one at a time. So if the queue owns a resource and nobody outside of that queue gets to access it, then you basically have an isolation mechanism where the blocks take turns using that resource.
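Here's a minimal sketch of that with Grand Central Dispatch; the queue label and the cache it owns are made-up names. The serial queue owns the dictionary, and everything else touches it only by enqueuing blocks.

    // A hypothetical cache confined to one serial queue.
    static dispatch_queue_t sCacheQueue;
    static NSMutableDictionary *sCache;

    void SetUpCache(void) {
        sCacheQueue = dispatch_queue_create("com.example.cache", NULL);   // serial by default
        sCache = [[NSMutableDictionary alloc] init];
    }

    void CacheSetValue(id value, id key) {
        dispatch_async(sCacheQueue, ^{           // writers take turns, in order
            [sCache setObject:value forKey:key];
        });
    }

    id CacheGetValue(id key) {
        __block id result = nil;
        dispatch_sync(sCacheQueue, ^{            // readers wait for their turn on the queue
            result = [[sCache objectForKey:key] retain];
        });
        return [result autorelease];
    }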
NSOperationQueue defaults to being fully concurrent. However, you can turn the concurrency down to one, and now you've recreated the same kind of pattern that Grand Central Dispatch is using, and in this way any kind of resource that's owned by a specific NSOperationQueue is going to be isolated from the concurrency that's occurring on other queues.
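The NSOperationQueue version of that knob is a one-liner, sketched here:

    NSOperationQueue *resourceQueue = [[NSOperationQueue alloc] init];
    [resourceQueue setMaxConcurrentOperationCount:1];   // operations on this queue now run one at a time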
In Snow Leopard with Grand Central Dispatch we're discouraging people from using thread-local variables, because dispatch queues are going to be managing a lot of these things for you, but in the classic old-school case, thread-local variables were used to hold state that we wanted isolated from other threads.
And while not nearly as fancy, just having a plain old context pointer passed off to the work functions is another great way to isolate segments of a data structure, and this is the kind of thing you see with reentrant functions like qsort_r, which take a context pointer and then operate on it.
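For example, the BSD-style qsort_r on Mac OS X threads a caller-supplied context pointer through to the comparator, so per-call state doesn't have to live in a global. A small sketch (note this shows the Darwin/BSD argument order, which differs from glibc's):

    #include <stdlib.h>

    struct sort_context { int descending; };    // hypothetical per-call state

    static int compare_ints(void *thunk, const void *a, const void *b) {
        struct sort_context *ctx = thunk;
        int x = *(const int *)a, y = *(const int *)b;
        int result = (x > y) - (x < y);
        return ctx->descending ? -result : result;
    }

    void sort_values(int *values, size_t count, int descending) {
        struct sort_context ctx = { descending };
        qsort_r(values, count, sizeof(int), &ctx, compare_ints);
    }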
So, a diagram now of isolation with queues: we have some serial queues, and you can see that each of these queues owns a particular resource, and the tasks on each of the queues are basically unable to reach the other queue's resource. For those of you who aren't quite as familiar with queues, within a particular queue these tasks will execute in order. So Tasks 1, 3, and 4 are going to execute in that order, and there's no particular concurrency there. However, there is concurrency between queues, so all of those tasks can execute concurrently with respect to Task 2. Most queues also offer the option of running concurrently, and in that case Task 4 and Task 6 can run at the same time, whereas Task 1 and Task 3 can never run at the same time.
And in that case, in order to preserve isolation, each of the blocks on that queue would need its own resource. So isolation is going to work like an API boundary between tasks, and resources are going to be confined to particular tasks, kind of like private ivars. One of the nice features of this is that it avoids any locking overhead or the potential for deadlock.
So isolated tasks do from time to time actually need to communicate with each other, and in this case they're going to use a pattern called message passing. This is the formal communication channel I was talking about, and it avoids using shared memory to change another thread's state directly. Instead we enqueue our request, and that other thread will integrate those state changes at its leisure. So this is the kind of pattern you see with the NSRunLoop.
It's a pattern you see in the HTTP request-response cycle, and the key point is that the receiver controls the message queue, so arbitrary state changes to its own objects aren't happening out from underneath it. So, in Cocoa, just to sum up message passing: the main thread has a bunch of special support that's very convenient; we'll see it working later with the UI. With NSOperationQueue you can get a reference to the main queue and then add operations to it.
With Grand Central Dispatch there's a function to get the main queue and you can dispatch blocks to that, and then, more old school, all NSObjects have performSelectorOnMainThread:, which also works quite conveniently, and NSRunLoop has a way of getting the main run loop that you can add additional tasks to. By and large most of these are actually fairly well integrated. So performing a selector and adding an operation to the main queue are going to be fairly similar.
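For instance, the old-school version from a background thread looks like this (the -updateUI: method and the results object are stand-ins):

    // The main run loop will invoke -updateUI: at a safe point in its cycle.
    [self performSelectorOnMainThread:@selector(updateUI:)
                           withObject:results
                        waitUntilDone:NO];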
So you'll want to use whichever one is a little more convenient for your tasks. Overall, our goals for coordination are to prune away the complexity that adding all these threads generates and to manage our resources safely, and typically we're going to use locks and isolation as our main bread and butter for achieving these results.
But this time isn't spent doing our real work. So we have some additional overhead that we've added, whether it's copying data to isolate tasks, the communication channel, or just having locks. And so it's worth bearing in mind that locks and communication channels are themselves resources, and this is going to introduce problems of its own.
So this is going to be one of the key limiting factors on how small a task we can make, and that extra set of problems we're adding by generating more coordination between our tasks is basically resource contention, and this is something that a lot of people who are new to multi-threaded programming underestimate. In particular, application-level tasks typically have a lot of different resources that they're using.
This isn't quite like, say, OpenCL data parallelism with graphics. Cocoa applications typically do a mixture of various different things, such as CPU-intensive tasks and I/O all bundled up together, and so we're going to need to be mindful of the different kinds of things that our tasks are using. In particular, I/O and network latency are going to create bubbles in our tasks, and we're going to need to treat locks as if they were resources as well.
The nice thing about this is that for CPU scheduling we get a lot of support from the operating system. There are threads, which we've been using for a long time; in Snow Leopard there's Grand Central Dispatch and dispatch queues; we have NSOperationQueue on Leopard and iPhone OS; and there are also processes, which is more of an old-school web-server kind of way of achieving a fair amount of concurrency on a single machine.
So the OS gives us a lot of support here, which is nice. Unfortunately there's a lot less support for these other kinds of resources, like I/O or locks, so when we find ourselves having to manage resources by hand, one way to think about it is sort of the airline reservation model.
Here we want to overbook a resource, at least a little bit, because we don't want to leave resources idle. That's time we could have spent using that core for something else, or if the resource is the hard drive, it's something another task could be doing. So we want to keep them overbooked, but we also want to avoid overwhelming these resources.
Similarly, we can invert the problem a little bit and actually leverage the CPU scheduling by pairing the resource with a queue or a thread directly, so that the queue is basically managing that resource and not doing anything else. This is, as I mentioned, sort of a resource-centric version of a task, and it allows us to leverage the CPU scheduling the operating system provides in order to manage a particular resource.
And the reason why I'm focusing a lot of time on resources here is because we have some constraints on these resources, and as we get more and more cores we still have limitations, for instance on network bandwidth or seek-time latency on the hard drive. But something that catches a lot of people off guard is that as we overload a resource, many important resources degrade very, very badly. They don't degrade linearly: when you have twice as many users of a resource as you should, instead of being two times slower it's much, much slower.
So when we look at I/O, which is the gating factor in a lot of applications, we want to keep in mind as we break up our applications into a variety of different tasks that we don't necessarily want to chunk up the I/O in a way that produces lots of small, random-access I/O.
You want to remember that sequential I/O is going to be much faster than random-access I/O, and fewer, larger requests are also going to perform much better. So if we create a lot of tasks and they all do some small I/O requests, we're going to start getting hit by the fact that the hard drive has a fairly large seek time.
So if these tasks are working with different files, for instance, spread out over the drive, then we're going to add a lot of extraneous bubbles into our pipeline. Also something to keep in mind as you debug your application's performance: if you're doing that on a desktop, laptop drives are really much slower.
So, to give a more concrete example about these problems with I/O and subdividing your application: a few months ago someone came to me with a problem, in particular regarding database I/O, and they were experiencing a really egregious slowdown. It was much, much slower than it should have been. So they gave me a sample.
The SQL used to talk to the database, some files, and steps to reproduce, and of course it works fine on my machine, naturally; multi-threading problems always work fine on my machine, which is why whenever possible I give them to my engineer with the worst karma, so I suppose in some ways that's a vicious cycle.
So, it turns out that a wee bit of detail had been left out of the context, and that was how many other threads were performing I/O at the same time this database work was going on. In this particular case, Thread 1 is basically the main thread and it's responding to the user interface, and at a certain point in time the user executes a task that causes a new subsystem to need to be lazily initialized, and at that point there are actually two new subsystems that need to be brought online.
So, in what is a good design pattern, these are offloaded to separate threads so they don't block the UI. Unfortunately these two subsystems are both doing a lot of I/O, so what ends up happening is that not only are Thread 2 and Thread 3 contending over being scheduled on the CPU, they're also interfering with each other's access to the I/O subsystem and basically causing the drive to seek around additionally.
So with a two-line code change, which is: let's not spawn the third thread at all, and let's call that main function on the second thread once we're done, it's 40% faster. So this is basically about improving the locality of reference that the drive is seeing: fewer seeks and less contention.
So keep in mind that just because you have eight cores does not necessarily mean that you have eight hard drives. If you do have eight hard drives in your RAID system I'm very jealous, and in particular for a lot of Cocoa applications I/O is going to be one of the more expensive tasks you perform.
So you're going to want to be careful as you subdivide your application to be very mindful of this: many small random-access I/Os will create a lot of inefficiency between each of the tasks that you spawn off. So just as we're talking about trying to find the just-right granularity for the tasks you divide your application into, you're also going to want to be mindful of this same kind of pattern for I/O. And here's something that is again surprising to many people new to multithreading; in point of fact I had an Apple engineer who didn't believe me. This particular sample data is from the Kernel Team doing work on the performance of locks, so I responded with that, and here we have one lock.
This is in fact a spinlock, so it's a very, very fast lock, and it degrades quite rapidly as you have more threads competing for it; in fact the eight-thread case is not eight times slower, it's 50 times slower. So this is huge. This is really orders of magnitude worse performance than we would want. So as we divide up our application, locks are very, very fast when you have only a couple of threads using them, but if they're going to be a bottleneck in your tasks, then you're going to start paying a really large performance hit for that.
So you want to keep that in mind: you don't want to engineer your tasks in such a way that they all start hammering on the same locks. On the other hand, locks without contention are cheap: that single thread using a lock is only a tiny bit slower than no locking at all. So locks are really quite useful. But contended locks get very slow, very quickly, and these contention issues are actually generated by the hardware, as the different cores need to keep their cache lines in sync and they generate memory bus traffic.
So this is very slow, and something else to keep in mind is, as I mentioned earlier, that with atomic operations the destination address is kind of like its own little mini lock. So for that reason atomic operations will actually degrade in this same way, which is very badly, if you have lots of threads all contending on the same address. The nice thing about atomic operations, and the reason why a lot of people have this feeling that atomic operations are a lot faster than locks, is that they naturally encourage you to be locking many different addresses.
So an atomic operation is kind of like using a separate lock for each address. But you'll see a similar performance problem if you start having many threads call an atomic increment, for instance, on the same global. So, in addition to I/O and locks we're going to be using memory, and any pending tasks that aren't actually getting executed are just lying around consuming memory.
Now, blocks and queues in particular are cheap, so that's very nice, but they're not completely free, so we want to be careful about that. Threads, on the other hand, are actually very expensive to create, so we definitely don't want to do that excessively. So in a sense we obviously want to create a bunch of separate tasks to leverage, you know, eight or sixteen cores, but at the same time, if we only have sixteen cores there isn't much point in having thousands of pending tasks all queued up.
That's a very inefficient use of memory. So one way to think about this is as sort of a producer-consumer style problem, where tasks are being consumed by the cores as they get executed. We want to avoid overwhelming the consumer's queue, and you know, one of the easiest ways to do that is just to have the tasks enqueue new tasks as their last step.
So instead of spawning off thousands of tasks all at once up front, as each task gets finished its last step is to enqueue the next series of subtasks, and this is called recursive decomposition; it's a fairly popular pattern. Sometimes you can't do that very conveniently, and this would be a place where you might use something called a future, where you have a proxy for a whole collection of new tasks and you won't actually materialize those tasks until later, when they're absolutely necessary. At that point the future will be responsible for enqueuing all those new tasks.
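To sketch the recursive decomposition idea with Grand Central Dispatch (the chunk type and the SplitChunk helper are made up): rather than enqueuing every subtask up front, each task enqueues its children as its last step.

    // Process one chunk of work, then enqueue its sub-chunks as the final step.
    static void ProcessChunk(NSArray *chunk) {
        // ... do the real work on this chunk ...

        dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        for (NSArray *subchunk in SplitChunk(chunk)) {    // SplitChunk is a hypothetical helper
            dispatch_async(queue, ^{
                ProcessChunk(subchunk);
            });
        }
    }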
And as I mentioned a few times, a lot of Cocoa applications at this layer are going to use a very wide mix of resources, and for this reason, along with the thread complexity growing exponentially, it's going to be hard to guess up front how all of these things are going to interact.
So it's going to be very important that you identify the bottlenecks and that you start measuring these things, because that will give you an idea about whether or not more threads are actually going to help you. If you're I/O bound, more and more threads are probably not going to help you very much.
So you're going to need to find a way to design your tasks so that you're less I/O bound, and in this way tools like Shark and Instruments are great for design debugging, not just performance debugging: actually identifying where in your application you're hitting bottlenecks and where it makes sense to divide your application up further to use more cores. So it's actually quite valuable to profile your app throughout the design cycle, as opposed to waiting until the end.
And along these lines, thread safety is not concurrency, and some of these resources are going to degrade very badly, so you'll pay a really high penalty for over-scheduling them, particularly I/O and locks. You're going to want to avoid overwhelming them, and you might consider something like the airline model where you slightly overbook everything. Ok, so in terms of balancing these things: we want to manage the resources, and we don't want to pay too much in inefficiency for coordinating between all the tasks we generate.
We also need to keep in mind some correctness issues that catch people by surprise. And all these things group together into what I consider a theme for this session, which is that something that surprises a lot of people new to multi-threaded programming is that the most efficient task granularity is really quite large, and there are a number of reasons for that.
Single-threaded code has a lot of advantages: no locking, no messaging, it doesn't need to merge results together, it doesn't need to spend more memory to represent all these pending tasks, and there's no scheduling latency. In particular, the Intel chips can do an awful lot of work before you can even get another task scheduled. On top of that, serialized algorithms have been around a lot longer, so they're often older and better researched and optimal; not even optimized, but actually optimal.
Serialized algorithms are widely available in open source software or as part of Mac OS X in the libraries available to you. So we're all kind of beginning this new journey into multi-core programming, and in some ways there's a little bit of a learning curve for the entire system.
One example of this is concurrent string sorting using Grand Central Dispatch, and this is a very simple thing where we have a huge list of strings and we're just sorting them. One thing you can see is that roughly 2,000 strings is about the break-even point: the Intel chips can sort an awful lot of strings before it's worth spinning up another core.
The other thing to take away from this diagram is that the concurrent version has a lower bound around a fixed point, and as you add more and more work to the queue, sorting larger and larger sets of strings, it performs roughly about the same, which is very graceful. And you can see that there's a point at which the serial version really just can't keep up and degrades much worse as you add lots and lots of strings.
So this lower bound is what we're really trying to achieve, so we can add more features and do more work in the same amount of time. At the same time, when we divvy up our application we need to maintain correctness. What I mean by that is we typically have some implicit data dependencies in our objects, and we need to avoid dividing a task so much that we've broken our own assumptions. These assumptions typically involve multiple different properties on a specific object.
Most objects actually assume that all their properties are in sync with each other. What I mean by this is: if the properties were truly independent, even if you could set and get them safely, if they were completely independent and had no real relationship to the other properties, you probably would have written that API as two separate objects.
This is something to know. It's not always true, but there's this tendency for objects to have a lot of assumptions about the state their different individual properties are in. So as an example of this, the textbook example is first and last names, setting them on some kind of person object.
So we're using locks, so this is all thread safe, but it turns out that it's not correct. We're setting each part of the name independently, and in this case Thread 1 is trying to set the person to be Ali Ozer and Thread 2 is trying to set them to be Ben Trumbull, and we jumble these things around, and on my poor karma-afflicted engineer's machine we get this execution order in which nobody has really come out ahead. So even though we've used locks and we've set each of these properties independently, the truth is these properties really should never be set independently, because we have this assumption that the person is going to have consistency between these two properties.
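To sketch the difference (the Person class, its ivars, and _nameLock are hypothetical): per-property setters that each take the lock are thread safe but not correct, because the two halves of the name can interleave across threads; the fix is to set both parts inside a single critical section.

    // Thread safe, but wrong: another thread can slip in between the two halves of the name.
    - (void)setFirstName:(NSString *)first {
        [_nameLock lock];
        [_firstName autorelease];
        _firstName = [first copy];
        [_nameLock unlock];
    }
    // ... and a matching -setLastName: that takes the same lock ...

    // Correct: both properties change together, so the object stays self-consistent.
    - (void)setFirstName:(NSString *)first lastName:(NSString *)last {
        [_nameLock lock];
        [_firstName autorelease];  _firstName = [first copy];
        [_lastName autorelease];   _lastName  = [last copy];
        [_nameLock unlock];
    }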
So with the broken version, that leaves us where Ali Ozer has become Ali Trumbull, and not only have we successfully set the name incorrectly, we have also implicitly changed the gender of this person to my dear cousin. So, as an overview: coarse-grained tasks are going to have less contention, there are fewer of them so they're going to use resources more efficiently, you're going to amortize the coordination costs between these systems, and there are going to be fewer interdependencies between these tasks overall, making them much easier to debug.
On the other hand, it's going to be more difficult to load balance all of these tasks because they are larger-grained tasks. So for instance, if we only had one really large task we're not going to get any concurrency, and if we only had two tasks we're going to be limited to at most two cores. So as we reduce the granularity of these tasks we have more scheduling opportunities and we can load balance all of these tasks better. Typically there's better responsiveness; at the same time, we're trading off for more coordination costs.
So these tasks are going to have to communicate with each other more and typically do more locking. As you're adjusting these knobs in your tasks and in your applications, you're going to want to focus on the primary bottlenecks in those tasks. It doesn't necessarily make much sense to optimize for CPU efficiency until you've dealt with I/O bottlenecks. You're going to want to manage the complexity of your applications just as much as the performance.
Having something that is really fast but that you can never debug is not going to be very helpful, and in the future you'll probably learn about more advanced techniques: lockless programming techniques, lock striping, things like this, and in general you're not going to want to sprinkle those throughout your application. You're going to want to reserve them for the most crucial problems.
So that's why something simple like locking or isolation is going to be your bread and butter for divvying up your tasks, and overall you're definitely going to want to avoid guessing up front, as the tasks in most applications are going to use a whole range of resources. In particular you'll want to keep in mind the break-even threshold at which it actually makes sense to spin up other cores to help you with a particular task, and that's going to depend on the resources and, again, on the coordination costs.
So that brings us to something that trips up a lot of Cocoa application programmers, and that's multi-threaded notifications. Notifications are a very important piece of Cocoa programming, and unfortunately it can be a little tricky to make them work well with lots of threads, or in the new world order, queues. Most notifications are delivered synchronously; NSNotification and key-value observing are by default. The observer actions that you register are going to be executed in the observee's thread.
So from the perspective of an observing thread that has registered to observe notifications on an object owned by another thread, the visibility of those changes is going to appear pretty random. And the problem that comes up with this synchronous notification delivery is: if Thread 2 registers to receive notifications for an object owned by Thread 1, what ends up happening is that Thread 1 posts a notification, the NSNotificationCenter or the key-value observing mechanism goes through the callbacks,
it calls the method that Thread 2 has registered in the context of Thread 1, and then, typically, this is where things go awry: that callback is now operating on Thread 2's data, even though Thread 2 is somewhere else operating on that data at the same time.
So this is where the shared memory approach can be really perilous, and you're going to want to exercise a particular amount of self-discipline. And because of the features of the notification design pattern, where there's an API boundary between the observer and the object posting its notifications, that abstraction actually causes there to be a lot of dependencies between these two threads, because the notifications can come pretty much at any time.
So the behavior is particularly non-deterministic, and if the observing thread is attempting to implement something like isolation, then obviously this will break it. And in particular, to remind you about the complexity: a notification happening randomly throughout your application is really going to run into this wall where different threads interleave in many, many different combinations. So the natural response to this is to go to our old standby, which is to add some locks, and this ends badly. It typically ends so badly that it needs to be the first bullet point on this slide.
And the reason is that when we have all these locks we want to establish an order for acquiring them, to manage multiple different locks, but the API boundaries, the very feature of creating a notification system, defeat that: the object posting the notification doesn't even know about the observers at all, the NSNotificationCenter is the one that knows about the observers, and the observers themselves, their notification actions, can be fairly complicated and might need yet more locks.
So the way notifications work prevents us from creating a defined locking order for all the different locks we might want to use as part of a notification system. This makes it particularly prone to deadlock. Anybody who has tried to do this will have experienced many deadlocks in this kind of multi-threaded notification system, and I apologize, this last line has a little typo.
The observers often post a new notification during the callback, and this is especially problematic because the callbacks are happening synchronously. So when an observer posts a new notification, maybe it changed a value which creates a new KVO notification, or it does something that creates a new NSNotification, the entire chain of locks being held for the first notification is still being held. Ok so, locking doesn't work very well with multi-threaded notifications.
So we'll go to our other standby, which is to use isolation, and when we combine isolation and message passing for the purposes of notifications, we call this the receptionist pattern. The key point here is that the observation is indirect: what the observee sees is a proxy, the observee and the observer's proxy are isolated in the first thread, and the observer's proxy then uses message passing to let the second thread know that a notification has occurred. So this looks much healthier.
Basically, instead of calling the action you've registered, the notification system calls a proxy for the second thread, that proxy enqueues a message, and at some point in time Thread 2 will process that message queue and handle the notification itself. So this is much easier to manage than using shared memory to directly make state changes to objects owned by Thread 2 during the notification.
The initial notification is completely isolated to the observee's thread, so the observer's proxy has a chance to work with the notification data and wrap it up in such a way that it can be enqueued for the second thread, and at a safe point in time in the future the observing thread will process that message queue.
There's some new API in Snow Leopard on NSNotificationCenter which actually makes this a little bit easier to do. Here you can register a block to be the observer, and it actually takes a queue, so you can specify that when the notification is generated you want the block to be executed on a specific queue.
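A minimal sketch of that API; the notification name, the handler, and the queue here are made-up names. The handler runs as an operation on the queue you supply instead of directly in the posting thread's call stack.

    // A serial queue acting as the receptionist for the observing side.
    NSOperationQueue *receptionistQueue = [[NSOperationQueue alloc] init];
    [receptionistQueue setMaxConcurrentOperationCount:1];

    id token = [[NSNotificationCenter defaultCenter]
        addObserverForName:@"MyImportDidFinishNotification"     // hypothetical name
                    object:nil
                     queue:receptionistQueue
                usingBlock:^(NSNotification *note) {
                    [self mergeImportResults:[note userInfo]];   // hypothetical handler
                }];
    // Hold on to 'token' and pass it to -removeObserver: when you're done observing.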
So the notification posting is still happening synchronously, but this queue can be a queue that is working with the second thread a little more closely. And because notifications are such an important part of Cocoa programming, there's going to be a lot more on this in the Cocoa Tips and Tricks session, which will be on Thursday at 5, and there will be a much more extensive example of using the receptionist pattern.
But the summary is: queues make really good receptionists. And once you've set up a receptionist, keep in mind that there are actually a bunch of other things you can do now that those notifications are no longer being processed synchronously. You can filter and coalesce them pretty easily as you're processing the message queue on that second thread.
You might even reprioritize them. The most common thing you'll do, though, is start throwing away stale notifications, because the second thread is doing its own set of work and it might decide that it no longer cares about a particular notification. It might be, say, from an object that has since been deallocated.
As we're pushing these notifications around, and they have data that the threads want, we need to keep in mind that passing data between threads can get a little tricky. We don't necessarily know when the receiver of that data will run. It might happen that the sender loses access to the core, it might be scheduled out immediately upon passing the message along, so the receiver might finish with the data and release it before the sender gets the chance to do anything.
Or the receiver might experience a problem, either a page fault or some I/O latency, and it might start many seconds later, long after the sender itself is done using that object. So there's really a very simple rule to keep these objects alive as you pass data between threads, and that's: the sender retains and the receiver releases. This is just as true when you're passing dispatch queues and dispatch objects around as it is at the Cocoa layer.
So the sender retains once for each of the receivers it passes the data along to, and each of the receivers gets ownership of a retain count. You're transferring one specific instance of a retain from the sender to the receiver, and this way, no matter what order the senders and the receivers get scheduled in, these data objects will remain alive until both the sender and the receiver are done with them.
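Under manual reference counting, which is what this talk assumes, the rule looks like this in code; the payload, worker queue, and ProcessPayload function are all stand-ins. Using the plain C dispatch_async_f variant makes the hand-off explicit, since nothing retains the context pointer for you.

    // Receiver side: consume the data, then release the retain it was handed.
    static void ProcessPayload(void *context) {
        NSDictionary *payload = (NSDictionary *)context;
        // ... consume payload on this queue ...
        [payload release];
    }

    // Sender side: retain once on behalf of the receiver, then hand the pointer off.
    NSDictionary *payload = [[self collectResults] retain];     // -collectResults is made up
    dispatch_async_f(workerQueue, payload, ProcessPayload);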
Which brings us to key-value observing, which is a little bit different from the NSNotificationCenter style. Because we're observing specific properties, there's just a natural tendency for these notifications to be smaller, and because of their size it can be challenging to fit them in with the minimum granularity that we want to pass over to another thread.
We really want to reach that break-even threshold for communicating with other threads, and if there are a lot of them, if we're observing a lot of properties on a lot of objects, there can be a fairly high volume of these notifications, which in many ways is a feature, but as we work with different threads it's something to keep in mind. So in that sense KVO works best with isolation.
Here the observers and the observees are all confined to a specific thread, and we've amortized all of those coordination costs because, well, we're not actually passing any of the individual KVO notifications over to a different thread. So that's how things work in an idealized world, where most applications actually aren't being written.
So in order to start passing all this notification data over to other threads with KVO, typically what you'll want to do first is coalesce: you'll have a bunch of KVO notifications happening on one thread, and when they reach a certain point of usefulness, repackaged together as a batch, then you'll pass that over to the other thread to work with a larger granularity of data.
So this is just an example of how the natural tendencies of these two notification systems differ a little bit. NSNotifications typically are a little less frequent and a little larger grained, so they often fit into that granularity a little more naturally, and KVO on many properties on many objects is a little finer grained.
They are different systems, but they both, as multi-threaded notification systems, have the same overall behavior: they're both synchronously posted. So an example of using KVO with multiple threads is in Core Data, with the managed objects and the managed object context. Each of these managed objects, if you're not familiar with Core Data, is basically an object representing a row in a database, and the context is a transaction scope.
It's observing all of these particular objects, and it's required that each thread have its own context and its own peer set of objects. So all the changes you make to the properties on these objects, and all the KVO notifications, are happening within a single thread, and the managed object context, observing all of these objects, coalesces all of these changes together, and at the end of the particular event it repackages them into a coarser-grained notification and posts an NSNotification describing, at a higher level, all the things that have happened over the course of that event.
And that's probably the right level of granularity for most people to pass off to a receptionist and hand over to another thread. So each of the contexts and model objects is being confined to a single thread here. The properties are all completely observable within that thread, and they can be changed quite frequently without any real performance hit.
The context accumulates all of these changes through the end of the event and then reposts them later as a batch. So in particular, as you're working with multi-threaded notifications you're going to want to avoid locking, a lot. You're probably going to need to use something like message passing, and the receptionist pattern is a really easy way of doing that.
Queues, dispatch queues and NSOperationQueues, make good receptionists. And if you're working with a high volume of notifications, you're definitely going to want to coalesce them together. So this brings us to some specific challenges with the UI. This comes up in Cocoa a lot and should be brought up as we're dividing up our tasks in Cocoa. Specifically, the UI is single-threaded.
This is actually true for most commercial UIs. Multi-threaded UIs have a tendency to end up deadlocking, and the reason for this is the same problem that notification systems have with using locks: the different layers of API make it very difficult to create a consistent lock ordering, which is what we need to manage multiple independent locks, and there are two different directions in which multi-threaded UIs suffer this inconsistent lock ordering problem. The actions in an application flow in two different directions. The user is generating events through the keyboard, the mouse, and so forth, which get handed over to the operating system, which in turn delivers them as events to the application.
At the same time, the application is generating its own events, which get passed over to the operating system and then to external devices, saving files, drawing to the screen, and so forth. And the model-view-controller design pattern itself has a similar issue, where the controller talks to the view, and the view might be querying state on your model, or the controller might be talking to your model, which is generating change notifications, and the view might be involved in those too. So it's hard to come up with one lock ordering that works as these things flow in both directions. For this reason most UIs prefer thread confinement.
They're using isolation for the core tasks, and in Cocoa in particular the UI is bound to the main thread. The main thread is blessed, and this is where most of your work with views and with controllers, in particular Cocoa bindings, is all going to need to be confined.
I've seen a number of people, so I want to call this out specifically, attempt to fix this by adding locks, and the problem is that AppKit and Cocoa bindings don't know about your locks. So if you do something like add locks to, say, valueForKey: on your model objects, it's pretty much impossible to find all the different places where the UI might be expecting to work with those objects confined to the main thread, simply because it doesn't know about your locks.
None of its code is going to be expecting your locks. So that sounds unfortunate; however, it's not really that bleak. In particular, the UI can dispatch all kinds of operations quite efficiently off the main thread to different work queues to get you more responsiveness. And frequently in the past a lot of people have done this kind of thing using run loops to move I/O-intensive operations out of the main flow of control, and these days those are really the most common kinds of tasks that you use separate threads for.
So the key challenge, as we're sending off these tasks generated by the UI to background operations, is merging those results cooperatively back into the UI, and that's basically message passing, which I talked about a little bit earlier; you're going to dispatch these things either onto the main queue or you're going to do something like performing a selector on the main thread. So here's an example of what this might look like.
This is using blocks on Snow Leopard with NSOperationQueue. You'd have a background queue, in this case my app queue, and it's going to add a new operation that performs a very long task, and the last thing that task does is get a reference to the main queue and then enqueue whatever code is necessary to merge those results back into the UI.
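The code on the slide looks roughly like this; myAppQueue and the two methods are stand-in names.

    [myAppQueue addOperationWithBlock:^{
        NSArray *results = [self performVeryLongTask];           // hypothetical background work
        // Last step: merge the results back on the main queue, where the UI lives.
        [[NSOperationQueue mainQueue] addOperationWithBlock:^{
            [self updateUserInterfaceWithResults:results];       // hypothetical merge step
        }];
    }];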
So this is a really pretty simple pattern, and you're going to see a lot of this with Grand Central Dispatch. If you want to use the same pattern with Grand Central Dispatch directly, here are the functions you'd use: dispatch asynchronously to a concurrent queue, perform your long task there, and then dispatch asynchronously back to the main queue.
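And the straight Grand Central Dispatch version of the same pattern, again with made-up task names:

    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        NSArray *results = [self performVeryLongTask];           // hypothetical background work
        dispatch_async(dispatch_get_main_queue(), ^{
            [self updateUserInterfaceWithResults:results];       // merge back into the UI
        });
    });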
Ok, so to summarize: you're going to want to decompose your tasks above that break-even threshold, where splitting work onto additional cores is going to help you solve specific problems faster. And as you do that, you're going to want to consider the resource limitations you have. You don't want to add too much contention over things like I/O or locks, and you're going to want to avoid excessive coordination; passing along a high volume of very small notifications, for instance, is going to generate a lot of inefficiency. And as you're tuning your application for performance, you're similarly going to want to keep in mind that you really need to prune the complexity back from the application. So, some really good reading material here.
Mac OS X "The Threading Programming Guide" talks a lot about what's available in the Mac OS X environment; Cocoa as well as lower level pieces of the operating system. This is a great resource. It talks about using runLoops as well as threads directly. Another thing that's available to you is an article called "Real World Concurrency" and it's written by two people from Sun and it's it's a great article looking at their experiences with engineering multi-threading systems and some of the same kinds of things that I've talked about today.
Some of it in a little more in depth but these guys have done some really sophisticated multithreading work and again you know they talk about reserving very complicated systems for your core problems and focusing on getting a lot of tasks simply decomposed and working concurrently using fairly basic mechanisms.
Another article which is one of my favorite articles for exploring the complexity multi-threading programming to kind of go into a little more detail about why all of this is really so cumbersome is from Edward Lee at Berkeley and it's called "The Problem With Threads," so it's another article you can get on the internet. And it goes into a lot of detail about what is so difficult with this multi-threaded programming and exploring how to control all of those different permutations that I showed little graphs for.
And that brings us then to two hard cover books. One is "Java Concurrency in Practice" and I realize we're not actually talking about Java, however, a lot of this is relevant no matter what programming language you're in and it's one of the first books that I read that's focused on actual software engineering practices with concurrency which is a nice change of pace from API references or text books for people learning how to you know how to implement locks without actually having any locks and other things which might be academically fascinating but aren't particularly relevant to what we need to do today and finally there's the last book "Pattern Language for Parallel Programming."
And this book is nice in that it gives an overview about lots of other different kinds of ways to decompose tasks so we focused today pretty much on one general purpose thing called task based parallelism and that's great for a lot of Cocoa applications but as you start working with this you'll find that you'll you'll be interested in exploring other ways of decomposing your application and to have a nice survey of lots of different techniques and design patterns for subdividing an application and specific problems. So there's a lot of good advice in there. It's a little dry but it provides a nice overview.