Mac • 1:11:37
Grand Central Dispatch (GCD) offers you unprecedented power and ease in optimizing your code for maximum clarity and performance. Build on your knowledge of GCD by exploring the architecture and design patterns that will make your application perform at its best. This deep dive is highly recommended for anyone looking to master GCD.
Speakers: Kevin Van Vechten, Dave Zarzycki
Transcript
Good morning everyone, and welcome to understanding Grand Central Dispatch in Depth. My name is Kevin Van Vechten. I'm the manager of the GCD team and I'll be followed shortly by Dave Zarzycki who's our technical lead for Grand Central Dispatch. So today, we're going to go over a brief technology overview.
We'll do a real quick recap of some of the topics that we covered at yesterday's session, but then we're going to dive deeper into APIs that weren't covered yesterday. We'll follow that with an in depth discussion of the GCD event sources which we feel is a very powerful part of our API. And then Dave will be presenting some advanced topics and a section about performance and tuning with GCD.
So, our technology overview. First off, I'd like to say that GCD is part of libSystem. When you compile your program, there's no special linking you need to perform. All the symbols are there. And it's available to all applications regardless of where they are on the technology stack, so Cocoa applications as well as UNIX level applications. And all you need to do in your code is include the dispatch header file.
There's a single header file that includes all the API and that is dispatch/dispatch.h. So, yesterday we covered a variety of API, we covered objects specifically that all GCD API operates on, simple polymorphic object type. And we discussed a little bit about retain and release. We're going to go into the rest of this today.
We also talked about queues, which is one of the fundamental aspects of GCD that we work with queues of blocks. We talked about groups which is a way to group different blocks together and figure out when they've completed so you can move on to a new task. And we talked about something that wasn't an object, and it's the notion that GCD has some time keeping capabilities. So today we're going to cover the complete set of these API.
We're also going to talk about how to run blocks only once per process. We'll talk about classic counting semaphores, and dispatch event sources, which are layered on top of the kevent and kqueue interface and are really powerful for monitoring external events from your application. So we'll start off with queues, since that's really the main entry point to GCD. GCD is all about taking blocks and queuing them on a queue and having the system process them asynchronously in the background.
So, asynchronous blocks are what get put onto queues. And it looks something like this. It's very simple, there's a call, dispatch_async. The first parameter is the queue that you want to target the block on. The second parameter is the actual block of code that you want to run.
So in this example, the system will start up a thread, or potentially reuse an existing thread depending on availability, and will print "hello". And that whole process looks a little something like this. So you might have a thread, you might create a GCD queue, you instantiate a block, and the dispatch_async call queues the block onto the queue.
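As a rough sketch of what the code on that slide might look like (the global queue and the "hello" message here are placeholders, not the exact slide example):

```c
#include <dispatch/dispatch.h>
#include <stdio.h>

int main(void)
{
    // Enqueue a block; GCD runs it asynchronously on an automatic thread.
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        printf("hello\n");
    });

    dispatch_main();  // park the main thread so the block gets a chance to run
}
```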
An automatic thread is created, runs the block and then everything goes back to a steady state. This is pretty much the same code example, just written slightly differently. In our block, we're calling a function and passing a context parameter. And the point of this example is seen in the next slide and that is we have API that is suffixed with _f. So if you have existing code that's already structured using callback functions and context pointers, you can use all of our _f variants and naturally fit into that. The _f functions are actually kind of the basic primitive of GCD.
All of the block-based API is really a wrapper around that. And when you use _f, it's not necessary to perform the block copies, so there's actually a little bit of performance that can be gained if your code's already in that model. So like I mentioned, all GCD APIs have _f variants, and these variants use a context pointer and a function pointer instead of a block object.
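A minimal sketch of the same idea with the _f variant; the work function and context struct here are hypothetical stand-ins for your own callback-style code:

```c
#include <dispatch/dispatch.h>
#include <stdio.h>
#include <stdlib.h>

struct work_ctx { int value; };

// Classic context-pointer callback with the signature GCD expects: void (*)(void *).
static void do_work(void *ctx)
{
    struct work_ctx *work = ctx;
    printf("value = %d\n", work->value);
    free(work);
}

int main(void)
{
    struct work_ctx *work = malloc(sizeof(*work));
    work->value = 42;

    // No block and no Block_copy: just a function pointer and a context pointer.
    dispatch_async_f(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                     work, do_work);

    dispatch_main();
}
```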
Of course we think block objects are a great convenience and we use them a lot when we're writing with GCD API. So another concept which we touched on briefly yesterday were completion callbacks and we're going to recap on that again because it's a very fundamental part of GCD design and we'll be talking about it later in today's session and again in this afternoon session. So fundamentally, completion callbacks are all about continuing the code flow of your application. Taking a block of code, putting it on the queue, running it asynchronously is great, but you need to get the results back somehow.
It's very rare that you actually perform some work and you forget entirely about it and you don't care about any of the side effects. Most likely, the work is going to be doing some sort of computation and you need to get the results of that computation to update your UI or move on to the next stage of work to do. And so the model of completion callbacks are all about propagating results back to the caller, to the part of the process that started the work and then knows how to deal with the results.
And here's a little diagram of what that might look like. Perhaps you're running on the main thread and we, of course, have an interface to the main thread called the main queue. Any blocks submitted to the main queue will run on the main thread. And let's say this code that's running on the main thread creates some sort of block which we'll term the call, and the call is what's going to be executing asynchronously in the background. Well, it gets submitted to the queue and run on an automatic thread. And at some point, the call finishes and it's going to instantiate a block that will call the call back and that will get submitted back to the main queue.
So that's how you can get your results back from the asynchronous operations. When the call finishes, the callback block runs with the result, updates your UI or whatever it's doing, and then it finishes and everything returns to a steady state. And that can be done with a very small amount of code.
All it requires is 2 dispatch_asyncs. The first one is taking some queue that you've defined, we're naming it queue here, running a block in the background. Then that result from that first operation can actually be captured by the nested block, because remember each block is inheriting all the scope of-- or all the variables in the outside scope. And so then we can take that result and send it back to the original queue.
In this case, we've hard wired that to be the main queue, where we process it and then we can free the memory when we're done. And we think this nested block model is really, really powerful because you don't need to define any structures, you don't need to marshal and demarshal data.
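A minimal sketch of that nested-block completion-callback pattern; compute_result and update_ui are hypothetical placeholders for your own background work and UI update:

```c
#include <dispatch/dispatch.h>
#include <stdlib.h>

extern char *compute_result(void);     // hypothetical: runs in the background
extern void update_ui(const char *r);  // hypothetical: must run on the main thread

static void start_work(dispatch_queue_t queue)
{
    dispatch_async(queue, ^{
        // Runs asynchronously on an automatic thread.
        char *result = compute_result();

        // The nested block captures 'result' and carries it back to the main queue.
        dispatch_async(dispatch_get_main_queue(), ^{
            update_ui(result);
            free(result);   // free the memory when we're done
        });
    });
}
```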
You can just capture any of the side effects that you're interested in and get them back to where they would need to go. So now let's move on to the concept of target queues which is something we didn't discuss at all yesterday. Every queue in GCD has a target queue. Any of the queues that you'd create with the dispatch_queue_create API default to the global concurrent queue.
And as we mentioned yesterday, the global queues dequeue blocks concurrently. All queues are FIFO, but the global queues don't wait for one block to finish before starting up the next. They will just keep dequeuing and keep running on threads as they become available. The private queues that you allocate with the dispatch_queue_create API are serial. They're going to wait for one block to finish before they start the next block.
But multiple queues can run concurrently with respect to each other and that's because they, by default, target the global queue which is concurrent. And then the global queues themselves, you can think of them as targeting themselves. Well we also have a few other global queues. We have a high priority one and a low priority one in addition to the default queue.
And these priorities reflect the order in which the queues are processed. So any blocks submitted to the high priority queue are going to be dequeued before any block is submitted to the default priority queue which in turn will be dequeued before any block is submitted to the low priority queue.
So you can have some granularity on priority and usually the default priority queue is good enough but there's some cases where you might want to think about using the high or low priority queue. And so to give kind of a graph of what this looks like, yesterday we talked about the main queue and the default priority queue and now we've just introduced the low and high priority queues.
And whenever you create a queue, it's going to target the default. But you can change that, so if we create a couple of queues, we can change one of them to now target the high priority queue and we can change the other to target the low priority queue.
And of course any changes are going to take effect in between block executions. Once a block is already started running, it's going to run to completion on whatever thread it started running on. But the next block that comes off of the queue is going to see the update of the target queue. And the threads that back the processing of these queues, their pthread priority will be adjusted either up or down a little bit depending on whether this is on the high or low priority queue.
And this is very easy to do in code, so here's an example. The first thing we do is create a private queue. The first argument there is the label; we strongly recommend reverse DNS style labels. In fact if you see crash reports for your application (hopefully you don't, but if you do), or if you're running sample, then these labels actually appear in those diagnostic tools, and it really helps track down what part of the code is running with what queue because you can see the labels there nicely.
And after we've created a queue, we'll simply get a reference to the low priority global queue. And then with a single call we can set that to be the target queue of our new queue. So now any blocks submitted to the queue are going to be run at a low priority, because that's where the target of the queue is.
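A short sketch of those three calls, with a placeholder reverse-DNS label:

```c
#include <dispatch/dispatch.h>

static dispatch_queue_t make_low_priority_queue(void)
{
    // Create a private serial queue with a reverse-DNS style label.
    dispatch_queue_t queue = dispatch_queue_create("com.example.background", NULL);

    // Get a reference to the low priority global queue...
    dispatch_queue_t low = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0);

    // ...and make it the target of our queue: its blocks now drain at low priority.
    dispatch_set_target_queue(queue, low);
    return queue;
}
```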
But it's much more expressive than that; we actually support arbitrary hierarchies. So you can take any collection of queues you want and you can set one queue to be the target of another and essentially create a tree of queues. And it's going to be FIFO scheduling through all of them because all of our queues are FIFO. And I do want to point out that creating loops is undefined; if you create some crazy graph where one queue's target points all the way back around to itself, we don't even know what's going to happen there, so don't do that.
So here's what this might look like in terms of an object diagram. Let's say we create a queue and it's going to target the default priority queue of course. And we create a couple of others. Well, we can take one of the queues that we created and actually have it target an intermediate queue, so we might have a diagram like this. And it is going to be FIFO, so no concurrency in this case; everything is going to funnel through that single queue, and it's serial and it's FIFO.
So you might be asking yourself, why would I want to do that? Well, there's actually a few cases where this can be pretty beneficial. One practical example is maybe you recognize that a hard drive is essentially a serial device. There is only one, well conceptually only one, read/write head on the disc seeking around. So perhaps you've, you know, looked at the devices that are attached to the machine and you've created a queue hierarchy that models these physical discs.
And you have one queue per disc and perhaps you have one queue per partition, and so what you're trying to do is optimize your I/O and say "Well, I really don't need to be writing or reading from more than 1 partition at once, I'm doing a bunch of bulk data and I don't want to be thrashing the disc.
And in fact all these partitions live on the same physical spindle, so I don't really want to be thrashing that either." And so you could create this queue hierarchy, submit work logically to all the leaf node queues and then it will kinda funnel in a FIFO and serial fashion. And of course, if there are multiple discs on the machine, you might have multiple intermediate queues and so you can actually get some concurrency there, but you're avoiding thrashing on the disc.
Another advanced concept for queues is that of suspend and resume. So suspend and resume basically affect the queue's ability to dequeue blocks. If a queue is suspended, it's not going to dequeue any additional blocks. However if a block is already running on a queue, we don't interrupt it. It's not preemptive. So it's really going to take effect at the next block dequeue time. And you can use these target queue hierarchies to enforce some ordering in how queues are suspended and resumed.
So, as I mentioned just a moment ago, it is asynchronous. We don't interrupt a block, so you can't truly know if the queue is stopped. However, if a queue is suspending itself from within a block, then you know the next block isn't going to run because it can't until the current block is finished, and by that time it will for sure be suspended.
Similarly, if you do a suspension of a subordinate queue in the hierarchy, you can know with certainty that no further blocks are going to run because the scheduling of the subordinate queue is going to come up through the current queue. So you do need to pay attention to kind of where you are in the hierarchy when you suspend and resume in order to get deterministic results.
And I'll discuss that in a little more depth in a second. So here's an example of using suspend and resume. Alright, it's reference counted. Multiple parts of your application can suspend a queue and it's not going to be resumed until all of those have independently resumed it. And so in this case, if you were to have a queue that you suspend and this of course has to be one of your privately allocated queues. If you try to do this on a global queue, it's just going to be ignored. There's no way for an application to kinda silence all activity in GCD.
But if you suspend your queue and then submit a block to it, that block is not going to run until resume is called. And since this is linear with respect to the caller, the suspend happens first and then the dispatch_async. You can actually know with certainty in this example that the block will not run until the queue is resumed.
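Roughly, the deterministic case being described looks like this (queue is assumed to be one of your own private queues):

```c
#include <dispatch/dispatch.h>
#include <stdio.h>

static void deferred_work(dispatch_queue_t queue)
{
    // Suspend first: the queue will not dequeue any new blocks.
    dispatch_suspend(queue);

    // Enqueued, but guaranteed not to run until the matching resume.
    dispatch_async(queue, ^{
        printf("runs only after dispatch_resume\n");
    });

    /* ... later, when you're ready ... */
    dispatch_resume(queue);
}
```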
[ Pause ]
So in the second example though, I'd like to point out that all 3 of these dispatch_async calls are happening upfront. And this only enqueues the block on the queue. We don't know with any real fine granularity exactly when it's going to run. So by the time this dispatch_suspend happens, maybe all of those blocks have run.
Maybe only A has had a chance to run. I mean it really depends on the length of the block and the busyness of the system. So the main point here is it's nondeterministic whether A or B or C is run. There is no coordination with the serial queue itself, there's no waiting for completion, so you know this is a case where it's really nondeterministic, no guarantees are being made.
Perhaps suspension will happen before any of the blocks, perhaps all of them will have run by the time the queue is suspended. So be sure that if you're relying on some sort of deterministic behavior that you do properly enforce that, either through a queue hierarchy or through some technique like a semaphore, which we'll discuss in a minute. So, now let's talk about objects in general in dispatch since we've recapped what queues are all about.
Primarily, objects have reference counting. We have retain and we have release. And that is very important with a lot of asynchronous and concurrent code design. If you have one object that's being referenced by multiple threads, then it's very important that you retain that object before handing out a reference to each thread, so that each thread or each queue or each concurrent subsystem of your application knows with certainty that that pointer is going to remain valid for the lifetime of the operation.
When your subsystem is finished with the object, it can call dispatch_release and that's going to decrement the reference count. And once that reference count goes to 0, the object will be freed. And of course it's not valid to use the object after you've released it because you don't know if anyone else has released it. You don't know after any call to dispatch_release if the object is truly gone or not. So make sure you have balanced calls, make sure you retain before passing the object out to another party and then release when you're done.
Oh, the last bullet on this slide is that objects captured by blocks must be retained by your code. So in other words, if you're doing that nested block example: so far we've only shown that with the global queues, and similar to suspension, global queues don't really support retain/release. They're global, they're always there.
You can call retain and release on them, it's just going to have no effect. But if you're passing a privately allocated queue to a completion callback, you do need to retain it because you need to know that when the asynchronous operation completes, the reference to that reply queue is still valid and hasn't been released by the caller. And so this is what I mean in code. Before the dispatch_async, it's necessary to do a dispatch_retain of that reply queue. So now, it's retained. We know there's a valid reference.
When the inner block does the dispatch_async to the reply queue, that's going to succeed and then we can release it after we've done the dispatch_async. And you might be asking, well why are we releasing it in the outer block as opposed to the inner block? And that's because dispatch always does the right thing and it retains any parameters that are passed to it when necessary. So any call to dispatch_async or any of our other asynchronous APIs are going to retain the queue that's passed to them for the lifetime of that asynchronous operation. And any block parameters that are passed to our API are also going to be block copied.
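A sketch of that retain pattern, with hypothetical do_work and handle_reply functions standing in for the real work:

```c
#include <dispatch/dispatch.h>

extern int do_work(void);           // hypothetical background computation
extern void handle_reply(int r);    // hypothetical reply handler

static void async_with_reply(dispatch_queue_t work_queue, dispatch_queue_t reply_queue)
{
    // Retain the reply queue so it stays valid for the lifetime of the operation.
    dispatch_retain(reply_queue);

    dispatch_async(work_queue, ^{
        int result = do_work();

        // dispatch_async itself retains reply_queue for this call...
        dispatch_async(reply_queue, ^{
            handle_reply(result);
        });

        // ...so we can balance our own retain here, in the outer block.
        dispatch_release(reply_queue);
    });
}
```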
On the other hand, for efficiency, our synchronous API does not perform any retains and does not perform any block copies, and that's because synchronous API isn't going to return until the operation is finished. And since we're assuming that the object must have been valid at the time you called the GCD API, we can kind of borrow that reference count of the caller for the lifetime of the synchronous call. And here's a simple code illustration of what I mean by this.
If you were to create a queue and then you were to call dispatch_sync, putting this block in the queue, well dispatch_sync is borrowing the reference of the caller. So if you were to actually release inside of this block, that's not valid to do because it will potentially have freed the object before dispatch_sync has had a chance to return and there might be some, you know, final processing that dispatch_sync needs to do before returning.
And if the memory's been freed, well, it might have been overwritten. This could lead to a crash. So, be sure to retain and release in front of any asynchronous operation. Be sure not to over release in the middle of a synchronous operation.
And if you want to play it safe, if you're not sure what to do, just always retain/release. It's going to be, you know, a little less efficient because you're doing some extra operations that aren't strictly necessary, but it will always be correct and it will always be safe.
So another universal property of all dispatch_objects is that of application contexts. We wanted to make it very easy to use GCD in all of your applications and so we've also made it easy to associate any arbitrary data you want with the GCD object. So any context data, it's just a void pointer from our perspective.
You can attach it to a GCD object, and we've also provided an optional finalizer callback, so that when the object is finally freed, because it's been released enough times that its reference count goes to 0, we'll call the finalizer function that you provide if the context data is not NULL.
So if you provided some context data, you provided the finalizer function, that'll get called, you can free your data there. And we always guarantee that that finalizer will be called on the target queue of the object. So you actually get some determinism as to where this finalizer runs. It's not just a pure side effect of something happening on whatever thread it happens on.
It actually runs deterministically on the target queue. Of course for a lot of objects, the default target queue is the default concurrent queue, so it's not like we're bottlenecking the system with all sorts or serialization. But if you do need serialization in your finalizer, you can set the target queue accordingly and that's where the finalizer will run. So here's a code example of using the application context and the finalizer. The first thing we're going to do is create a structure, so it's just some X, Y, Z coordinates, something pretty simple.
And let's say we have a dispatch object like a queue. Well we can set these X, Y, Z coordinates as the context of the queue. And in fact, this is pretty powerful because one of the models we'll be talking about a little bit more this afternoon is the use of queues to protect resources in place of where you might traditionally have used a mutex. And what better way to know what resource the queue is protecting than to set it as the context? Then it's just there, it's always right there. We're going to set a finalizer that just frees the context.
When the queue goes away, the context goes away with it. And then this is a very simple example of using dispatch_async, submitting a block to the queue and all that block is going to do is call dispatch_get_context and that gets a reference to the application context, and then we can dereference that and bring out those coordinates. So again, we could use this pattern instead of a mutex to protect the X, Y, Z coordinates.
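A sketch of that context-plus-finalizer pattern; the coords struct and queue label are placeholders:

```c
#include <dispatch/dispatch.h>
#include <stdio.h>
#include <stdlib.h>

// Some simple application context: X, Y, Z coordinates.
struct coords { double x, y, z; };

static void coords_finalizer(void *ctx)
{
    // Runs on the queue's target queue once the queue is fully released.
    free(ctx);
}

static dispatch_queue_t make_coords_queue(void)
{
    dispatch_queue_t queue = dispatch_queue_create("com.example.coords", NULL);

    struct coords *c = calloc(1, sizeof(*c));
    dispatch_set_context(queue, c);                     // attach the data to the queue
    dispatch_set_finalizer_f(queue, coords_finalizer);  // freed when the queue goes away
    return queue;
}

static void print_coords(dispatch_queue_t queue)
{
    dispatch_async(queue, ^{
        struct coords *c = dispatch_get_context(queue);
        printf("%f %f %f\n", c->x, c->y, c->z);
    });
}
```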
We can post blocks to the queue that update these coordinates. And then with respect to any part of your application, X, Y, Z will all be updated atomically. So semaphores are a different type of dispatch object and they're basically a very classic synchronization technique. It's a simple counting semaphore that you've probably read about in a CS textbook. Their initial value determines their behavior. We allow you to specify any arbitrary initial value.
And there are two API, there is a signal API and a wait API. And signal simply increments the value and wait simply decrements the value. So signaling is what you do when you want to signal that some sort of resource is available or some sort of event has occurred and signaling is the wakeup function. It's going to wake up the other end that's maybe waiting on the semaphore.
But it's only going to do that if the result is less than 0, in other words, if we know for sure that somebody's waiting. Otherwise, it's just a simple increment. But when the other half of the equation catches up and does the wait, if the value is already above 0, well then it knows it can just proceed immediately. But if it's going to be decrementing the count below 0, then that's when that thread actually has to wait. So here's a code example of using a semaphore.
We're going to create it with an initial value of 0. So this is a good way to actually synchronize on the completion of some sort of event. So here's creating the semaphore, initial value of 0. Performing some operation in the background, we're using dispatch_async. We're doing something and then we can indicate the event's happened, the background task has completed. After dispatching the block, the first thread can continue doing some other processing.
So here's a potential for some concurrency. And then it says it's willing to wait around forever for the completion of the background operation, and dispatch is very efficient at doing this, especially in light of certain race conditions. So if the background task happens before we ever got to the part where we're waiting on the semaphore, then we don't even need to drop into the kernel, this can all stay in user space, it can be very efficient.
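A minimal sketch of that pattern, with hypothetical do_background_task and do_other_work functions:

```c
#include <dispatch/dispatch.h>

extern void do_background_task(void);  // hypothetical
extern void do_other_work(void);       // hypothetical

static void wait_for_background_task(void)
{
    // Initial value 0: the semaphore is used purely to signal completion.
    dispatch_semaphore_t sema = dispatch_semaphore_create(0);

    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        do_background_task();
        dispatch_semaphore_signal(sema);   // the event has happened
    });

    do_other_work();                       // some concurrency with the background task

    // Wait (forever, if need be) until the background task signals completion.
    dispatch_semaphore_wait(sema, DISPATCH_TIME_FOREVER);
    dispatch_release(sema);
}
```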
If on the other hand the calling thread does reach this wait first, then it will drop into the kernel because it needs to put the thread to sleep as a hint to the scheduler that we don't want to do any busy waiting here. And then that thread will get woken up when signal is called in the background. And once we've waited on the semaphore, we can release it by calling dispatch_release and indicate that we no longer need this resource. A different example of semaphore usage would be to pass a nonzero initial value.
And this might be useful if you're using a semaphore to kind of constrain the width of execution. In this example, we're saying our application should only use half of the available file descriptors. They are a finite resource, there's only so many spots available in the file descriptor table.
So, we're willing to use up to half of them. And here we've inverted the logic. We're actually going to wait first and basically what we're saying is we're waiting for this resource to be available, potentially forever although it's possible to specify an arbitrary timeout, which we'll talk about in a minute.
And if a resource is available, so actually, you know, if the file descriptor table, let's say it's 1024, and so we initialized our semaphore to 512, well the first 512 times something goes through the semaphore wait, it's just going to return immediately because it's decrementing the semaphore's value, it's still above 0, returns immediately.
Great, so we know we have a file descriptor available to us or at least we have some assurance that there will be. We go ahead and open a file, do some processing. We close the file descriptor and then we indicate that the resource has been placed back in the pool by calling signal, and this will increment the value to say, yes, there is another resource available there. And if any threads were sleeping, waiting for the resource to become available, this is where they'd be woken up.
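A sketch of that width-constraining use of a semaphore; the 512 budget is the assumed "half of a 1024-entry file descriptor table" from the example:

```c
#include <dispatch/dispatch.h>
#include <fcntl.h>
#include <unistd.h>

// Created once, e.g. at startup: fd_sema = dispatch_semaphore_create(512);
static dispatch_semaphore_t fd_sema;

static void process_file(const char *path)
{
    // Wait (forever, if need be) until one of the 512 "slots" is available.
    dispatch_semaphore_wait(fd_sema, DISPATCH_TIME_FOREVER);

    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        /* ... read and process the file ... */
        close(fd);
    }

    // Put the slot back in the pool, waking any thread waiting for it.
    dispatch_semaphore_signal(fd_sema);
}
```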
So, really the whole point of this example is that the code will not enter a critical section until the resource is available, and you can use semaphores in this way. So just to recap, signal increments, wait decrements, and the initial value will dramatically affect how the semaphore is used in practice, so you either probably want to pass 0 or you want to pass some well-thought out value that's going to constrain the width of execution.
And yesterday we talked a little bit about dispatch_groups which are related to semaphores in the sense that they allow you to synchronize events across multiple parts of the application. So, you should see yesterday's talk for kind of more of a high level view of what groups are all about.
But to quickly summarize, they allow you to track multiple blocks submitted to multiple queues, any combination you want. And once all those blocks have completed, you can either wait for that completion or receive yet another block on yet another queue to signal that additional work that needs to be performed and kind of enter the next stage of your pipeline.
And it is possible for blocks added to a group to in turn add additional blocks to the group, so we call this recursive decomposition. You might be able to iterate a tree or something like that. And as you go in each layer deeper and deeper in the tree, you're adding more and more blocks, all to the same group and some thread is waiting for the completion of all that processing, that's going to work just fine. But there's a slightly more advanced interface to groups as well. It's possible to manually manage the entering and leaving of a group, or when a block enters and leaves the group.
I want to point out though that when you manually manage this, it does not implicitly retain or release the group. That's a convenience that's provided by the dispatch_group_async function which we covered yesterday. And the code to do this is pretty simple. There is a dispatch_group_enter function and a dispatch_group_leave. This is essentially what dispatch_group_async does with a little extra retain and release magic. So actually, this would be a bad example, because we should have retained the group before doing the asynchronous operation.
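A corrected sketch of that manual pattern, with the missing retain in place; do_stage is a hypothetical unit of work:

```c
#include <dispatch/dispatch.h>

extern void do_stage(void);  // hypothetical unit of work

static void track_manual_work(dispatch_group_t group, dispatch_queue_t queue)
{
    // Unlike dispatch_group_async, enter/leave do not retain the group for us.
    dispatch_retain(group);
    dispatch_group_enter(group);

    dispatch_async(queue, ^{
        do_stage();
        dispatch_group_leave(group);    // the block has left the group
        dispatch_release(group);        // balance our manual retain
    });
}
```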
Remember to always retain your objects. But when the block completes, we can signal that it's left the group and this is going to adjust the count of the number of running blocks in the group accordingly. And so another primitive that we've provided in GCD is the ability to execute blocks exactly once.
So, this is very similar to pthread_once if you're familiar with that. We're going to execute a block exactly once and we're not going to let any thread finish or proceed until that block has been executed.
So, we're guaranteeing that it's executed only once and we're guaranteeing that it has run by the time we get further on the code. And you might use this to safely initialize some global variables. It's particularly useful for lazy initialization. You know, maybe you'll need to use that global, maybe you won't, maybe it's expensive to initialize.
Well, there is an API that will help you with that, and that is the dispatch_once API. So here's a very simple example where maybe we have created our own queue in our application. It's a serial queue. It's going to protect some resource, it's going to be used to run stuff in the background. And we have a once predicate which keeps track of whether this block has run or not, and it's very important to declare it static or global in scope.
If it's an automatic variable, it'll get reset every time the function is called, and then you're not going to get much of a guarantee. And so when you call dispatch_once and pass that predicate and that block to run, dispatch_once will again verify that that block has only been run once per application, and that it has actually run to completion before the function returns. And so once we know that queue is initialized, we can go ahead and return it. So this might be a good accessor function for some sort of singleton object in your application.
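A sketch of that singleton-style accessor (the queue label is a placeholder):

```c
#include <dispatch/dispatch.h>

// Lazily created, process-wide serial queue.
static dispatch_queue_t shared_queue(void)
{
    static dispatch_once_t pred;        // must be static or global in scope
    static dispatch_queue_t queue;

    dispatch_once(&pred, ^{
        queue = dispatch_queue_create("com.example.shared", NULL);
    });
    return queue;                       // guaranteed initialized exactly once
}
```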
And dispatch_once is a very efficient API. So as we alluded to a few minutes ago in the semaphore discussion, and as we alluded to yesterday in the group discussion, these APIs where you can wait for the completion of some event, we have a notion of time in GCD. You've already seen the special constants of DISPATCH_TIME_NOW and DISPATCH_TIME_FOREVER. Now literally means right now, whatever the current time is.
So that's a way that you could poll for completion should you need to do so, and then forever means literally wait forever, there is no timeout value. But these are just two special constants of an opaque type, dispatch_time_t, which can specify any arbitrary length of time for a timeout, and it's essentially either going to be a count of nanoseconds relative to now, or it could be a count of nanoseconds relative to the UNIX epoch, and you can specify that arbitrarily with the struct timespec so you can pick any date and time that you want and calculate the timeout relative to it.
So here's a few code examples, so these are independent examples. The first one is using the dispatch_time API to calculate a timeout relative to now. So basically we're saying 30 seconds from now, and it's 30 times the number of nanoseconds per second. And if we were to pass the returned value to one of the waiting routines, it would wait for up to 30 seconds and then return. You can also use a previous time value as the base for a new time value.
So we can take that 30-second timeout that we just calculated and we can say, well we want this to timeout 5 seconds before that so we have a negative 5-second offset from that original base time. And one thing I'd like to point out is that the forever time really means forever and it doesn't matter how much you add or subtract from forever, it's still going to be forever.
It's saturated at that value. So the other usage is an API that we call dispatch_walltime, and this is what calculates the UNIX relative times. So we can use a POSIX API to create an arbitrary date, number of seconds, you know, since 1970. And then we can pass that time structure into dispatch_walltime.
It also can take a relative nanosecond offset. In this case, we're not going to provide an offset but we've set a time for, you know, the end of this session this morning. And you can take this time and you can actually use it as a base time in the previous dispatch_time API. So kind of once the time is created, you don't really need to pay attention to what time it is. You can mix and match between the APIs.
And as part of our opaque type, we know which clock it was created against and we know when that timeout is supposed to happen. So you can get relative timeouts pretty easily, you can also get absolute timeouts pretty easily and we think this is pretty flexible that a single argument can specify both types of timeouts in your code. So as we mentioned, dispatch_group_wait takes a timeout, dispatch_semaphore_wait takes a timeout. There's also an API that's very similar to dispatch_async which allows you to defer block execution until some later point in time, and that's called dispatch_after.
So it really looks just like dispatch_async, but it takes this extra time parameter. And so in this example, what we're saying is that we want to execute a block 90 microseconds from now, and so we can submit it to the queue right away but we know it's not going to run for at least 90 microseconds. Of course, we're not guaranteeing that it's going to run exactly at 90 microseconds. If something else is already running on that queue for example, you know, it might be a serial queue, that block has to finish first before this one will proceed, but we are saying it will be enqueued 90 microseconds from now.
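A sketch pulling those time APIs together; the specific offsets and the timespec date are arbitrary placeholders, not the slide's exact values:

```c
#include <dispatch/dispatch.h>
#include <stdio.h>
#include <time.h>

static void time_examples(dispatch_queue_t queue)
{
    // A timeout 30 seconds from now, and another 5 seconds earlier than that;
    // these are the kinds of values you'd pass to the *_wait APIs.
    dispatch_time_t t30 = dispatch_time(DISPATCH_TIME_NOW, 30 * NSEC_PER_SEC);
    dispatch_time_t t25 = dispatch_time(t30, -5 * (int64_t)NSEC_PER_SEC);
    (void)t25;

    // An absolute (wall clock) time built from a POSIX timespec (arbitrary date).
    struct timespec ts = { .tv_sec = 1262304000 };
    dispatch_time_t deadline = dispatch_walltime(&ts, 0);
    (void)deadline;

    // Defer a block: it will be enqueued no sooner than 90 microseconds from now.
    dispatch_after(dispatch_time(DISPATCH_TIME_NOW, 90 * NSEC_PER_USEC), queue, ^{
        printf("at least 90 microseconds later\n");
    });
}
```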
And another usage for time is that of interval timers which is going to lead us into our discussion of event sources. So event sources are, we think, a very powerful construct in GCD that really help bridge your application to what we'll call the real world. You might have a bunch of computation that you can do asynchronously, but actually it's more often that you're doing some sort of I/O asynchronously or you're interacting with other processes asynchronously, and that's a real good source of asynchronous events. And so we've provided a mechanism that will monitor these events for you. It has some types that are monitoring events internally to your application.
Timers are an example of that. And there are some types that monitor events external to your application. Maybe you want to know when there's data waiting to be read on a file descriptor. And so when one of these events occurs, what GCD will do is deliver a callback to a queue that you've specified; event sources have target queues. And of course, the callback doesn't necessarily need to be a block, it could also be a function pointer with a context you provide.
We do have the _f variants for this. And just like queues, sources may be suspended and resumed so you can think of that as kind of like masking interrupts. You know, maybe you're not interested in this event for some short period of time because you've got a dialog up on the screen or something like that.
You can go ahead and suspend the event source. And this is all based on the BSD kqueue kevent mechanism which is really flexible, and so any of the events that are available to you through the kqueue mechanism are also available through GCD with this very convenient block callback pattern.
So here's some sample code, pretty straightforward. We're going to create an event source of type timer. All timers in GCD are going to fire on a repeat interval, and we'll show how to set that up in a moment. And in this example, the target queue for the timer is going to be the main queue. So that's where the event handler block will run.
Additionally, if you associate application context and a finalizer like we talked about earlier, that will also run on the main queue. And we can set a block that's an event handler. It doesn't take any arguments. If you need access to the source, just go ahead and use the source, it's going to be in scope at the time that you set the event handler.
And in this case, we're just going to print ping. Very simple. So here's some of the code that's specific to timers, and the first thing we're going to do is calculate some initial time values. So let's say the interval that we want this timer to fire on is every 30 seconds.
The second argument here, leeway, is a special concept that we have in GCD that allows you to specify some amount of leeway for when this timer should fire. So, a practical example of this might be mail. Maybe you've configured mail to check email every 5 minutes. Well, it doesn't really matter that it happens exactly at 5 minutes because it's just about every 5 minutes that you want to check email.
And by specifying some amount of leeway, what you've given the system the flexibility to do is align timers together for power efficiency. So, let's say you have a Mac that's running in a low power state because you're on the battery currently. Instead of going from low to high, low to high on the power states, you know, changing the clock speed of the CPU because all of these timers are firing at different times, specifying the leeway gives GCD the opportunity to look at multiple timers to say, well OK, these all need to run at about the same time and then it can actually line them up so that you get increased power efficiency.
The leeway is an amount of time after the interval during which the timer may run. We're not going to run before the interval. So if you were to say, you know, a 30-second interval like we did in this case with a 5-second leeway, that means the timer is going to run about 30 to 35 seconds from now.
The next interval will start up on an even 30-second boundary, so the leeway doesn't cause drift over time, it's just on each instance of the timer, you know, we might use some alignment.
And then finally, we can specify a start time which is allowed to be totally independent from either of the above parameters. So you could for example say, "Well, start this timer a minute from now but then run every 5 seconds from there on out." So once we've created all this time values, we can call dispatch_source_set_timer which is just going to initialize the timer with these values.
And it is possible to change the values on a timer after it's been created. So you can run at a certain interval for awhile and then later change it to a different interval. And this last line is also very important, it's dispatch_resume. All event sources start off suspended by default.
So it's just like when you call create and you have a reference count of 1 and you need to release it when you're done. When you call dispatch_source_create you have a suspend count of 1 on your source and you need to resume it in order to start getting activity from the event source. And the reason we do this is that it gives you the flexibility to do things like set the timer, set the event_handler, set the context, any of these other attributes.
There's not going to be a race condition; it's not like the event is going to fire right away when you haven't yet quite finished setting up all the context. So this lets you set up everything that needs to be set up and resume it when it's finished, so you know it's fully configured, everything's consistent, and we've avoided all race conditions.
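A sketch assembling those pieces into one timer (the 30-second interval and 5-second leeway follow the example above; the "ping" body is a placeholder):

```c
#include <dispatch/dispatch.h>
#include <stdio.h>

static dispatch_source_t start_ping_timer(void)
{
    // A timer source whose event handler will run on the main queue.
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER,
                                                     0, 0, dispatch_get_main_queue());

    dispatch_source_set_event_handler(timer, ^{
        printf("ping\n");
    });

    // Fire every 30 seconds, starting 30 seconds from now, with 5 seconds of
    // leeway so the system can align timers for power efficiency.
    dispatch_source_set_timer(timer,
                              dispatch_time(DISPATCH_TIME_NOW, 30 * NSEC_PER_SEC),
                              30 * NSEC_PER_SEC,
                              5 * NSEC_PER_SEC);

    // Sources are created suspended; resume to start receiving events.
    dispatch_resume(timer);
    return timer;
}
```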
So here's a little animation of this timer in action. We've created a timer. It's targeting the main queue. The main queue of course, runs on the main thread and after a second a new event is delivered. The event causes the source's event_handler block to be enqueued on the main queue which means at some point it will run on the main thread.
That processes the data and then everything goes back to a steady state. So as I mentioned before, the target queue of the source is passed at creation time. And it is changeable. You can call dispatch_set_target_queue at any point that you want. So going back to our familiar object diagram of the main queue and the low, default, and high priority global queues, we can create a source. In our example, we targeted the main queue first.
We can also create some queues and create more sources. It's possible for a source to target one of your own queues that you've created. It's possible to change the target_queue of that queue which then indirectly changes the priority of the event source. And it's also possible to change the target queue of the event source directly. So it gives a lot of flexibility.
Again, you can build up any type of hierarchy you want. So just to reiterate, the last parameter in dispatch_source_create is where you specify the target queue, but then you can also call dispatch_set_target_queue later to change that. We'd like to point out that the event_handler block of the dispatch_source is non-reentrant. And this was done as a convenience to you. You don't have to worry about locking inside your event source because the event source is going to process events in a non-reentrant fashion. But you're not going to miss events. Don't worry about that.
We will actually coalesce the event data. So while your event_handler block is running you'll get a snapshot of whatever data was available. If new data comes in during that time, it will get coalesced. We'll run the block again to let you know that additional data was delivered. This coalescing also happens while the event source is suspended. So if you suspend the event source, it will keep coalescing all of its data.
When you finally resume it, all the data will be delivered to you. And the GCD event sources are very, very high performance. All the coalescing of data is done with atomic operations. It's all wait free. It's all lock free. And once you've created the source and attached your event handler, which has likely done a block copy, that's it as far as memory allocations are concerned, so all the cost is up front. The event source can keep going through that loop of, you know, receiving events and going to the main queue, or whatever target queue you want, without any heap allocations.
So it's a very fast path. It's very efficient. So here's an example of what this might look like with data coalescing because of suspending a source. So the first step is we're going to call dispatch_suspend, suspend the source. This was a timer. It fires every second. The count of the number of times that the timer has fired is going to accumulate.
At some point, we call dispatch_resume. It goes through. It gets processed. And we're delivered all the data that we had missed while it was suspended. Sources also support a notion of cancellation. So let's say you're firing this interval timer every second, but eventually you want it to stop.
Well there's an API for that. You can stop event delivery with cancellation. It's not going to interrupt the event_handler blocks. So if you're already in the middle of processing an event, that's going to run to completion but that event_handler block is not going to be submitted again. It's also possible to specify an optional cancellation handler.
And this handler is only going to be delivered once for any dispatch_source and it's only going to be delivered in direct response to canceling the source. This is a great opportunity to deallocate any resources because by the time the cancellation handler is run, you know with certainty that GCD is not monitoring that event_source anymore. And I'd like to point out that suspension defers the cancellation handler. So if you suspend the source and you cancel it, you're not going to get any further events but you're also not going to get that cancellation handler until it's resumed.
It needs to be resumed in order to go through its normal life cycle. So here's an example of using a cancellation handler and doing some of the resource cleanup that I mentioned a moment ago. Oh, no, sorry, this is just cancellation. We'll get to the cancellation handling in a moment. So here we have our timer. We're creating it. We're setting an event_handler. We're referencing this __block variable which, as you saw yesterday, allows the block to modify the contents of the variable.
So really what we're going to do here is we're going to print ping five times. And once it's printed five times, we can call dispatch_source_cancel and that's going to prevent the timer from firing anymore. And as always after we've configured a source we need to resume it in order to fully activate it.
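A sketch of that counting example; the one-second interval and the "ping" message are placeholders, and releasing the source once you're done with it is left out for brevity:

```c
#include <dispatch/dispatch.h>
#include <stdio.h>

static void ping_five_times(void)
{
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER,
                                                     0, 0, dispatch_get_main_queue());

    __block int count = 0;   // the block may modify this captured variable

    dispatch_source_set_event_handler(timer, ^{
        printf("ping\n");
        if (++count == 5) {
            // Stop event delivery; the handler will not be submitted again.
            dispatch_source_cancel(timer);
        }
    });

    // Fire once a second, starting a second from now.
    dispatch_source_set_timer(timer,
                              dispatch_time(DISPATCH_TIME_NOW, NSEC_PER_SEC),
                              NSEC_PER_SEC, 0);
    dispatch_resume(timer);
}
```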
So there are two types of event sources. There are internal, application-defined event sources, which don't require any interaction with the kernel to monitor. And then there are externally monitored event sources that are kind of giving you events from other processes on the system. And the behavior of each event source is dependent on the type of the event.
So for some sources, the data that's returned to you is the count of the number of times an event has happened. We saw this with the timers. For other event sources, it might be a bitmask of different event flags that indicate what activity has occurred. And anytime that activity happens, that flag gets OR'd into the bitmask. And when your event_handler is called you can check the bitmask and see what's happened.
So the application internal types are timers as we saw, they deliver a count. We have a purely application defined DATA_ADD source where every time you add an event to the source, it's just going to accumulate by adding and then deliver that data to you. And we have a very similar pattern that's using an OR and that's what you can use to deliver a bitmask. So here's an example of using an ADD source in practice, very simple. We can create this. We can set an event_handler. And now here's something interesting.
Since we've created it to target the main queue, we can use it to safely update UI on the main thread. So when we get our event_handler called what we're doing is we're calling dispatch_source_get_data and that's returning to us whatever count has been added together. This is a 64 bit quantity. And we can increment our progressBar by that amount. So perhaps we have a progressBar that's reflecting how many bytes of a file we've processed. We resume the source. We get it running. And then somewhere else in the application, you know, probably in your I/O queue, you can call dispatch_source_merge_data.
And every time you read another chunk of data from the file, let's say we've processed another 500 bytes, well we can just merge 500 into the source and that's going to atomically add that 500 to the data of the source. And then as time permits and the event handler gets called on the main thread, it's going to go and increment the progressBar.
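A sketch of that DATA_ADD pattern; increment_progress_bar is a hypothetical UI helper standing in for the real progress bar update:

```c
#include <dispatch/dispatch.h>
#include <stddef.h>

extern void increment_progress_bar(unsigned long amount);  // hypothetical UI update

static dispatch_source_t progress_source;

static void setup_progress_source(void)
{
    // An application-defined ADD source targeting the main queue, so the
    // handler can safely touch UI state on the main thread.
    progress_source = dispatch_source_create(DISPATCH_SOURCE_TYPE_DATA_ADD,
                                             0, 0, dispatch_get_main_queue());

    dispatch_source_set_event_handler(progress_source, ^{
        // Whatever counts were merged since the last callback, coalesced by addition.
        increment_progress_bar(dispatch_source_get_data(progress_source));
    });
    dispatch_resume(progress_source);
}

static void did_process_bytes(size_t nbytes)
{
    // Called from the I/O queue: atomically add to the source's pending data.
    dispatch_source_merge_data(progress_source, nbytes);
}
```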
So you didn't need to do any locking. It's very efficient, and it lets you, in a wait-free manner, coordinate data from your background operations to your main thread without stalling either side of the pipeline. The external types that we allow you to monitor are things like read and write for file descriptors. Processes, you can see when other processes on the system come and go, and you can monitor those by their process ID, and it's a bitmask type.
So this is where you can get bitmask flags like the process has exited or the process has exec'd, and is now a new executable image. There's a signal type which allows you to monitor UNIX signals, things like SIGHUP or SIGTERM, and that's going to deliver a count of the number of times that the signal has been delivered.
I do want to point out on the signal type that we're using the kevent mechanism for monitoring signals. We are not installing a signal handler. The advantage of this is that multiple subsystems in an application and multiple frameworks can all monitor the same signal.
On the other hand, you do need to be careful about the default signal handler for any signals you're monitoring; for example, for SIGHUP the default handler is to exit your process. Well, unless you ignore that signal in your application, the kernel will exit your process before you ever get notified of the events.
So be sure to adjust that appropriately. The VNODE type allows you to monitor file system activity. You can notice when files have been renamed or deleted or written to. And again, that's a bitmask of all those different types of events that happen on a file. And finally, we support monitoring Mach ports, both send rights for dead name notifications and receive rights for messages that are available to be dequeued.
So here's a concrete example of using a file descriptor source. We're going to open the standard dictionary of words that's available on a BSD system. We're going to create an event source for readability and we're monitoring the file descriptor, you know, the actual UNIX file descriptor, for that activity. And we're targeting a queue that we've created for where the event_handler block will run.
So here we set an event_handler, very simple, we're reading some data off of the file descriptor, processing it. And then we're going to test for some sort of error condition. Maybe we've reached the end of the file. Maybe we've reached some non-recoverable error. So we know we don't need to read any additional data from this file.
It won't succeed. So we can go ahead and cancel the event source. Now we're not going to get any more callbacks. And here is that example of the cancellation handler. This will get run on the target queue of the source when dispatch_source_cancel is called. And by this time, we know that GCD is not looking at this file descriptor anymore, at least for this event source.
And so we can close the file because we're done with it and we know GCD is done with it. And as always after you've configured your source by setting the event_handler or the cancellation handler, go ahead and call dispatch_resume in order to start getting event activity from the source. Quick note, since we're talking about file descriptors.
We really highly recommend nonblocking I/O. And the reason for this is if you're targeting a serial queue and that serial queue is running your event_handler block and that's in a blocking read or write call, well, nothing else is going to happen on that serial queue. Maybe you're OK with that, you know, as long as you know that's going to happen that might be fine, but this works really great with nonblocking I/O because we'll monitor for the ability to read or write data. As soon as there is that ability, we'll invoke your event_handler. It can go read as much data as available or write as much data as will fit into the buffer and then you can continue on in an asynchronous fashion.
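A sketch combining the read source, the cancellation handler, and nonblocking I/O; process_data is a hypothetical helper, and releasing the source once it's cancelled is left out for brevity:

```c
#include <dispatch/dispatch.h>
#include <fcntl.h>
#include <unistd.h>

extern void process_data(const char *buf, ssize_t len);  // hypothetical

static void watch_words_file(dispatch_queue_t queue)
{
    // The standard BSD dictionary of words, opened for nonblocking reads.
    int fd = open("/usr/share/dict/words", O_RDONLY | O_NONBLOCK);
    if (fd < 0) return;

    dispatch_source_t src = dispatch_source_create(DISPATCH_SOURCE_TYPE_READ,
                                                   fd, 0, queue);

    dispatch_source_set_event_handler(src, ^{
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            process_data(buf, n);
        } else {
            // End of file or a non-recoverable error: stop monitoring.
            dispatch_source_cancel(src);
        }
    });

    dispatch_source_set_cancel_handler(src, ^{
        // GCD is no longer watching the descriptor, so it is safe to close it.
        close(fd);
    });

    dispatch_resume(src);
}
```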
And if you're not familiar with nonblocking I/O you can check out the fcntl man page for the O_NONBLOCK option on a file descriptor. And then of course, look at the read and write family of APIs and the EAGAIN error code, which lets you know that the operation couldn't complete because there wasn't data available and it would have blocked. And so now, in order to discuss some advanced topics of GCD, I'd like to turn it over to Dave Zarzycki.
Alright, thank you Kevin. So let's get into some advanced topics and I'd like to start with pipelining.
So what is a pipeline? Well you can imagine multiple stages in your program. You might want to represent those with dispatch queues. And what you can do is shuttle some context between those stages to accelerate your program. And the potential for concurrency is created between the different queues that exist in your pipeline. So while an individual stage might be a serial queue, the totality of serial queues and global queues will create an opportunity for concurrency.
And a different way to think about this is concurrent tail-calling. So what does that look like? Well, imagine you allocate some context to track your outstanding work. This is the big picture kind of context. And then you can have a few different queues. The first queue might represent reading data from a file.
The global queue in the middle would represent the computational work you want to do on that data read from the file. And the queue on the right is the ultimate final stage where you write the data back to a file. So, you can start by enqueuing that context on your reader queue and it's going to read that data from the file system, which for the vast majority of your customers is probably a single system with a single hard drive. So that's a very serial device.
So we read the data one piece at a time. Then we can redirect the context over to the global concurrent queue and have it proceed to process the data. It might call dispatch_apply to further fan out and take advantage of the available cores on the machine. Once dispatch_apply returns, we're back to our context and that context will shuttle itself over to the writer queue, which will then write the data back to the disc, and we'll proceed on and on down our pipeline.
So to give you an example of what that looks like in code. You're going to allocate your context structure, initialize it somehow, and then what you're going to do is dispatch_async_f, probably, to your first queue in the pipeline, passing the context and the first stage of the actual logic, the function.
So what does that look like? Here is your reader function and there's, you know, the basic C boilerplate. But ultimately what ends up happening is you run your input method, here my read-input-buffer function, which takes the context and does the right thing. And then we get the global queue, which is the next stage in the pipeline, and we dispatch_async_f right at the end, just like a tail-call.
So it will send it to our work queue, pass the context along, hand it to the next function in the pipeline. So here's that next stage in the pipeline. We're going to get the global queue again, call dispatch_apply with our count, and we're going to do some parallel work given the index and the context and work on some data in parallel. Once that completes, we'll hand it off to the next stage in the pipeline, we'll dispatch_async_f.
We're going to hand it to our writer queue, move the context over, pass to the writer function. And of course, the writer function is going to write the data back to the disc. Again, the writer queue is probably serial so that way it can deal with the reality of spindles these days.
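A sketch of that three-stage pipeline with the _f variants; the context struct, stage names, and the loop back from writer to reader are assumptions standing in for the real slide code:

```c
#include <dispatch/dispatch.h>
#include <stddef.h>

// Hypothetical big-picture context shuttled along the pipeline.
struct pipeline_ctx {
    dispatch_queue_t reader_q;   // serial: reads from the disc
    dispatch_queue_t writer_q;   // serial: writes back to the disc
    size_t chunk_count;
    /* ... buffers, file descriptors, etc. ... */
};

static void writer_stage(void *ctx);

static void worker_stage(void *ctx)
{
    struct pipeline_ctx *p = ctx;

    // Fan out across the available cores for the compute-heavy part.
    dispatch_apply(p->chunk_count,
                   dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                   ^(size_t i){ /* process chunk i of p's data */ });

    // Tail-call the next stage: hand the context to the serial writer queue.
    dispatch_async_f(p->writer_q, p, writer_stage);
}

static void reader_stage(void *ctx)
{
    struct pipeline_ctx *p = ctx;
    /* ... read the next piece of data from the file ... */

    // Tail-call into the concurrent work stage.
    dispatch_async_f(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                     p, worker_stage);
}

static void writer_stage(void *ctx)
{
    struct pipeline_ctx *p = ctx;
    /* ... write the processed data back to the disc ... */

    // Loop back to the reader for the next piece, or clean up when done.
    dispatch_async_f(p->reader_q, p, reader_stage);
}
```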
So that's what pipelining looks like. We'd like to point out that dispatch_async is a wrapper around dispatch_async_f. And dispatch_async_f can be more efficient, sometimes a lot more, depending on how well your code can do this pipelining technique. And the reason it can do that is, well, it avoids the Block_copy and Block_release at each stage. And then also we have a little trick inside that we can actually take our own little memory allocation and pass it along your pipeline with your code. So we don't even need to malloc or free along the stages with your code. So that was pipelining.
Now I want to talk about performance and tuning. So the topics we're going to talk about for the remainder are memory management, imbalanced queues, striding, which is an array specific technique, benchmarking in general, and finally a brief talk about Instruments. So first, memory management. dispatch_async and dispatch_after are asynchronous. They need a little bit of a tracking data structure to make that magic happen.
It's going to be about 64 bytes or more depending on the size of your block, in the heap per call. So let's say you had this for loop or some other iterative device and you did a dispatch_async and did some work. Well, if you happen to suspend that queue for the duration of that dispatch_async and for loop combination, well guess what? A lot of memory is now going to be sitting in the background with all these pending blocks waiting to be run. And it wouldn't be hard to make some mistakes and end up with lots of memory consumed just waiting to be run.
So what can you do? Well, first of all, you can look for this problem. You can check the size of the dispatch continuation zone with the heap tool. We'd like to point out this is just a debugging technique; this is not an API you should be trying to do something with. So for example, you could run heap against the program. We're going to grep through the output so that we can condense it to fit on the slide.
And what you can see is the zone. In this particular case, it's empty. The program is sitting idle, it's well behaved, and there are no continuation objects floating around. So that's a well-behaved app. On the other hand, if you saw megabytes upon megabytes and it was staying in that state, maybe you want to consider doing some more thorough performance analysis.
So we have a bunch of techniques available for tracking outstanding work, and these are the techniques you can use to try and balance things out. You can use semaphores for throttling, created with a value greater than zero, as Kevin talked about. You can use groups to track a set of work before starting the next stage of the work.
And you can use the callback technique again to schedule the next stage of the work after the previous stage completes, rather than trying to enqueue it all back to back. But the long and short of it is: pay as you go. Don't try to schedule everything that needs to be calculated right up front; try to smooth things out, and especially consider doing this if the user has the opportunity to cancel things.
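Here's a minimal sketch of the semaphore throttling technique just mentioned, assuming a hypothetical do_work function, a hypothetical item_count, and an arbitrary in-flight limit of four:

dispatch_semaphore_t sema = dispatch_semaphore_create(4);   /* value > 0 */
dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

for (size_t i = 0; i < item_count; i++) {
    /* The producer blocks here once four items are already in flight. */
    dispatch_semaphore_wait(sema, DISPATCH_TIME_FOREVER);
    dispatch_async(q, ^{
        do_work(i);                          /* hypothetical unit of work */
        dispatch_semaphore_signal(sema);     /* make room for the next item */
    });
}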
You don't want a huge delay between cancellation and when it takes effect. Another technique is to watch for latency in Instruments. The more latency there is, the less responsive the user interface, and latency can be a side effect of long queues. Only you can decide how much latency is too much.
Similarly, let's talk about imbalanced queues. Queues are a classic computer science paradigm; producer-consumer is another popular way of describing them. And dispatch queues make these designs easy, because we're doing the work for you. The producer does a dispatch_async, and GCD automatically assigns a thread to the consumer, which does the consumption and ultimately calls your work blocks.
However, optimal performance may require analysis and tuning. And why is that? GCD is very fast, but it does have ramp-up and ramp-down costs. There's an automatic thread pool. We need to bind these threads to queues so the queues and the blocks can run together efficiently, but the binding and unbinding takes time.
So for example, overwhelming the consumer may result in memory bloat, like we just described. The producer is producing more work than the consumer can drain, the memory high-water mark keeps rising, and that can be problematic. On the flip side, starving the consumer can result in wasted CPU cycles, where the producer takes more time than the consumer and creates only a tiny amount of work to run in the background.
Well, guess what? That second CPU is probably going from the idle state up to the busy state, finding no work to do, and coming back down to idle. That's a ramp up and a ramp down, and with the consumer being very fast or the producer being very slow, that can be inefficient. One way to look for this is a lot of statistical samples in sample or Shark showing calls to dispatch_wakeup.
This is our kind of "please ramp up" logic. So this is what you can do to fix it: you need to reconsolidate blocks. What does that look like? Well, imagine you had some iterative code. It probably won't be as obvious as this for loop, but we're going to use a for loop for simplicity.
The iterative work does a dispatch_async to a serial queue and only a tiny amount of work with that object per iteration. Well, the obvious thing to do is just rearrange the loop. If it is a serial queue, just do one dispatch_async and do the whole iteration on the serial queue. Then your code can run a lot faster; there's a lot less ephemeral memory allocation and a lot less dispatch overhead.
So we're spending more time in your code. Now, what we just showed was really obvious, but how can it happen in practice? Well, after prototyping and refactoring, really just trying to get things to work or to work better, your code might do this accidentally, and it may not be obvious; the pieces might be 100 lines apart, but that may be what it's doing. So just be on the lookout for that, and use the tools to notice it. If you consolidate your blocks as described, you can avoid lots of ephemeral heap allocations.
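A before-and-after sketch of that consolidation; serial_queue, objects, count, and touch_object are hypothetical stand-ins:

/* Before: one tiny block per iteration -- lots of dispatch overhead
 * and ephemeral heap allocation. */
for (size_t i = 0; i < count; i++) {
    dispatch_async(serial_queue, ^{
        touch_object(objects[i]);
    });
}

/* After: one block that runs the whole iteration on the serial queue. */
dispatch_async(serial_queue, ^{
    for (size_t i = 0; i < count; i++) {
        touch_object(objects[i]);
    }
});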
So a very specific example of consolidating work that is array-centric is striding. When the ratio of scheduling to actual work is bad, we can change the ratio, and this is an array-specific solution. So imagine we have an array, and dispatching each one of these elements individually costs too much relative to the work. Well, what you can do is rechunk the work and process it in sections. And this is what it looks like in code.
Take a dispatch_apply, make some wiggle room, and slide in some work. This example is out of the dispatch_apply man page, so you don't necessarily need to take notes right now. But what we're going to do is take the count, divide it by the stride, and then each block of the dispatch_apply will iterate over that stride individually, thus changing the ratio of dispatch_apply scheduling work to the real work that your code does.
Finally, at the very end of the dispatch_apply, we need to do the remainder of the work. I'd like to point out that in this case, if the count were less than 137, which is our stride, dispatch_apply would be a no-op because the result of the division would be 0. It would quickly return, and then the remainder would run on the local thread at optimal efficiency.
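Roughly what that looks like, along the lines of the example in the dispatch_apply man page; do_element stands in for your per-element work:

#define STRIDE 137

dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

/* Each dispatch_apply iteration now handles STRIDE elements instead of one,
 * changing the ratio of scheduling overhead to real work. */
dispatch_apply(count / STRIDE, q, ^(size_t idx) {
    size_t j = idx * STRIDE;
    size_t j_stop = j + STRIDE;
    do {
        do_element(j++);
    } while (j < j_stop);
});

/* Handle the remainder (count % STRIDE) on the current thread. */
for (size_t i = count - (count % STRIDE); i < count; i++) {
    do_element(i);
}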
And this makes sense: if we're striding to go faster, then anything smaller than our stride probably isn't worth doing concurrently. Alright, so I've been talking about measuring and improving things, but I haven't really talked about the actual measurement itself. Let's start with benchmarking. Good benchmarking is scientific. Don't expect to do something quick. Don't expect to avoid thinking.
You need to ask yourself: what is being measured? What isn't being measured? What variables aren't easy to control for? And what level of statistical analysis do we want to get involved with? But more to the point, we need you to start with big and obvious concurrency if you're going to try and improve things. And we also ask that you refine, verify, and repeat. So start with the serial case, then benchmark the concurrent case.
If your code doesn't seem to be working out concurrently, well, maybe you need to look for a different solution, or maybe your code just isn't doing enough work that is really parallel friendly. You also need to ask: is it worth the extra complexity and indirection? Only you can decide that. We also need to ask you to ensure that you're spending most of the CPU time in your code, that is, on the problem at hand. GCD is lightweight, but it's not free.
So if you've managed to chunk the work so tiny that the ratio of dispatch overhead to your work is high, then you probably need to reconsider and refactor. So please use the available tools like Shark, Instruments, and sample to get some idea of what's going on. Here's a coding example. What's highlighted is the function that you want to test. What we're going to use here is gettimeofday, which is a UNIX wall-clock timer.
And we're going to use it in a way that is friendly towards benchmarking. The reason is that we call gettimeofday to get a start time, iterate through a loop, and then call gettimeofday at the end. It's important to move the timing outside of the loop so we're not timing the timing itself. Then we can subtract the start from the end and calculate the average number of nanoseconds per lap.
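A self-contained sketch of that timing harness, with my_function standing in for the code under test:

#include <sys/time.h>
#include <stdio.h>
#include <stdint.h>

static void my_function(void) { /* the code you want to measure */ }

int main(void) {
    const int64_t laps = 1000000;    /* big enough to average out noise */
    struct timeval start, end;

    gettimeofday(&start, NULL);      /* time stamps stay outside the loop */
    for (int64_t i = 0; i < laps; i++) {
        my_function();
    }
    gettimeofday(&end, NULL);

    int64_t us = (int64_t)(end.tv_sec - start.tv_sec) * 1000000
               + (end.tv_usec - start.tv_usec);
    printf("%lld ns per lap\n", (long long)(us * 1000 / laps));
    return 0;
}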
So, we recommend that the laps variable be huge, maybe like a million, whatever it takes to run it for around a minute. But why is that? Well, long runtimes average out unusual events: preemption, lazy library initialization, the heap hitting the high-water mark, and other coincidences both good and bad.
The longer you run it, the more you can rely on the average and just work with that result. And like I said earlier, if you ensure the timestamps are outside of the loop, you can mitigate the cost of the timing logic itself. Finally, a brief reminder about Instruments. There are a couple of sessions on it this week.
In Instruments, they've added some GCD providers. You can look at the latency of blocks, you can track which blocks are enqueued on which queues, and you can also track which blocks are executed synchronously with dispatch_sync. It's important to know that because, again, of ramp-up and ramp-down costs: if a thread has to block to wait for a block to complete, that can be inefficient.
So maybe switching to an asynchronous design can improve things. Finally, it can help you optimize. You can look for the longest-running blocks, which are opportunities to rechunk into smaller, more discrete pieces of work, and you can look for the most frequently executed blocks at the opposite end of the spectrum and consider reconsolidating.
This is, I believe, the queue-centric view, where it's tracking statistics: the number of syncs, the number of blocks, the latency, the total CPU time. I strongly encourage you to go to the Instruments session to really see such a powerful tool in detail. So that's Understanding GCD in Depth. We gave a technology overview and reminded you about some of the basic technologies like queues, dispatch_async, and groups. We also dove in depth into a lot more of the object concepts like suspension, resumption, context pointers, and doing things once.
We also talked about semaphores, groups, and time, and Kevin gave a thorough discussion of event sources, both internal and external. We hope you can use those to accelerate your code. And finally, we talked about some performance and tuning techniques. So with that, I'd like to invite Michael back on stage; he is our Developer Tools and Performance Evangelist. We also have developer forums available and lots of documentation. There is a new concurrency guide available on the website. There are man pages. There's HeaderDoc.