
WWDC14 • Session 418

Improving Your App with Instruments

Tools • iOS, OS X • 55:25

Whether you're new to optimizing your app's performance or a seasoned veteran, learn about the new version of Instruments. See the new workflows, tools, and tips that will help you analyze and refine your app as it adopts the latest Apple technologies, including Swift and app extensions.

Speakers: Daniel Delwood, Joe Grzywacz, Kris Markel

Unlisted on Apple Developer site

Transcript

This transcript has potential transcription errors. We are working on an improved version.

Howdy. I'm Daniel Delwood and I'm excited to welcome you to Session 418, "Improving Your App With Instruments". I've got a lot of fun stuff to cover today, so let's get started. So there are five items on our agenda today, but really only one goal and that's to give you the tools you need to go out and proactively profile your apps, looking for memory problems, CPU problems, and overall to increase the efficiency of your applications.

So I'll start by talking about what's new in Instruments this year, and then we'll have three sections: memory management, time profiling, and performance counters, each with associated demos. Hopefully, whether you're new to profiling on our platform or an experienced Instruments veteran, you'll be able to take away some tips and techniques from those demos. And, finally, what session is complete without covering DTrace? So we'll talk about some new DTrace improvements in OS X Yosemite.

So to get started what is new in Instruments this year? Well, we've got two major themes, and the first theme is a new UI, and so you'll notice it looks quite a bit different than before, but this is to match the style of Xcode and OS X Yosemite, and we think that you'll really like it. At the top I'd like to point out the toolbar. Of particular note is the path control there, it allows you to quickly select both target device and process in one gesture, which we think will really speed up some of your workflows.

And on the right we've got a centralized place for all of your configuration options, so whether it's recording options before you run, viewing options, or extended detail like stack traces, there's now one place to go, and that's called the Inspector. Now the reason we did this is mainly for your content: we all spend a lot of time looking at statistics and call trees, and having more real estate, more space to clearly dive into your problem, is really what we did this for.

So that's the first theme. What's the second theme? Well, as you may have noticed, Swift is a pretty big deal this year, and we're really excited about what it offers. But as you go out and adopt Swift in your application, whether it's by adding a new Swift class or even converting some of your current code to Swift, we want to make sure that you have confidence, because you can measure and you can understand what's going on with the technology.

And so Instruments has support for Swift, and this will run across all of our demos today, so we'll be using both hybrid and pure Swift apps for those. Now there's a host of other features in Instruments, some of which we'll get to today in the demos, like simulators now behaving as devices in Instruments, some counters improvements, and the like, but those are the two sort of high-level things that are new.

So let's dive in and talk about the first problem that you may face when developing your application, and that's memory usage. So how can you reduce the memory usage of your app? Well, what we're talking about here is persistent memory usage, so that helps narrow it down a little bit, but we also need to define what your app is using. And for doing that, we can think of the memory in your application as sort of an iceberg. You've got the heap memory on top, which you know about because these are the objects you interact with: view controllers, NSImages, managed object contexts. These are the objects that your code directly interacts with.

And then under the surface there's everything else: there's the layer memory backing those images, or the data behind your managed object context; there's a lot of objects there, as well. And so if you manage the memory sort of on the surface, and you manage it well, it turns out that you actually do a pretty good job managing the whole usage of your application.

So how do we manage those objects that we deal with? Well, for that there's an ownership model. And Objective-C's ownership model is pretty straightforward. It's a reference counting model based on retain and release, and when the object's count drops to zero the object is freed. These rules are well-known, the rules are documented, and the key point here is that it's deterministic, simple, and pretty fast. So what about ARC? Well, this makes it so much easier because it manages the retain/release for us.

And so it transforms that problem from dealing with retains and releases to dealing with your object graph and managing relationships at a semantic level. So what about Swift? Well, Swift's Object Model, understandably, is a lot like Objective-C's, although I should point out that there are a couple of differences, namely that the retains and releases are inserted for you and there's really no need to worry about them.

If you're working with Objective-C, though, you do still have to manage autorelease pools, and for that there's a nice standard library function which accepts a closure, and it's really easy to use. So just make sure that you're inserting autorelease pools as makes sense with the Objective-C code that you're using.
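
To make that concrete, here's a minimal sketch (not code from the session) of that autoreleasepool standard library function wrapped around a loop body; thumbnailAll, imagePaths, and process are hypothetical names:

    import AppKit

    // Drain autoreleased temporaries from Objective-C APIs once per
    // iteration instead of letting them pile up for the whole loop.
    func thumbnailAll(imagePaths: [String], process: (NSImage) -> Void) {
        for path in imagePaths {
            autoreleasepool {
                // NSImage(contentsOfFile:) may produce autoreleased
                // temporaries; the pool frees them at the end of each
                // iteration.
                if let image = NSImage(contentsOfFile: path) {
                    process(image)
                }
            }
        }
    }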

So if these are the ownership models, what tools do we have available to us? Well, one of the heavy lifters that we have is the Allocations instrument, and this provides backtraces and information for all of your heap allocations, so it gives you things like class names, like NSMutableArray. If you're dealing with Swift objects, it's important to note that your Swift objects will be prefixed with the module that they are contained in. And so for your application this may be just your app's name.

But for Swift objects, the heap will only contain your reference types, and so if you're thinking about this at a code level, these are your classes, not your structs. But once you identify these allocations, Allocations will provide you the reference histories for them, so retains and releases, and it even allows you to pair these. It also provides a lot of information for anonymous VM usage, and so you can use a Call Tree and track it back to the source.

So I would actually like to show you a demo of using Allocations and profiling an app extension. So I have here a simulator of an iPhone 4S, and I'll launch Instruments, and you'll immediately notice that we have a new template chooser. And the template chooser allows you to select your device at the top, or even manage some devices.

For simulators we can go ahead and choose to launch a different simulator, and allowing you to choose the target that you want to profile before even selecting your template matches the workflow from Xcode a lot better. And so if you're used to choosing what you want to profile and then the tool, you can do that, or you can, as always, just select the tool first and configure it.

So I've selected the Leaks template, which has two instruments, Allocations and Leaks. And for this demo I'll pull up the list of apps installed on my simulator, and you'll notice it behaves just like a device. It gives me installed apps, and now, new in iOS 8, it gives me app extensions as well.

If I select, say, a sharing extension, the popup will also let me select what app it's hosted in. And since app extensions are accessible to multiple apps in the system (even system apps can have behavior provided by your extension), you may want to select a process other than your main app executable when you're profiling your sharing extension.

Now for this demo I'm actually going to select a Today extension, and since we know what view it's going to appear in, there's actually no need to select a hosting app. So I've actually launched my app in the simulator, and it's an app extension running in the Today view.

And what I have is an extension that allows me quick access to my game. Maybe I'm playing a move very slowly, a move every other day, and I just want to pull it down and make a quick move. It's a game of guessing what the pattern is; it tells me how many I get right and how many I get misplaced or wrong, so I'll make a couple of random guesses.

You'll notice the memory usage in Instruments is going up slightly, but what I'm really interested in is finding out how I'm doing cleaning up memory when I create a new game. And so I can just create a few games here at the end, and in Instruments you'll notice we detect quite a few leaks. Let's take a look at that.

So, first of all, in Allocations it's showing me all of the categories that I've recorded, things like CFString, and I can type in, for my example, "Dots", and we see four different categories. We see that the colored button and the DotsTodayViewController are prefixed by my module because these are Swift classes, and the other two are Objective-C, so they're running together, working fine.

If I select one of these Swift classes and dive in even more I can even see the retain/release history, paired or not, and if I go back to the summary I'll quickly notice that the persistent number, which is the one I want to check if I'm looking for memory growth, is, well, 100%.

Now this is where your expectations as a developer come into play. I expected my colored buttons to be reused, so there's a few of them persistent, a few of them transient. But my games, I expected there to be only one active at a time. So I've immediately identified a bug, and at this point I should go and fix it.

Now there's a couple of other ways to find memory growth. I'm going to show you two more. First, we've got Leaks, and Leaks will tell me about all the memory it could find that was unreferenced. This is a type of persistent memory growth, and so we've got malloc blocks, a couple of colors, and those games again.

And when it comes to persistent memory growth, there's really three types. There's leaked memory, there's abandoned memory, and there's cached memory. Leaks will tell us the first of those three, but if I want to investigate all three I can use a technique called generational analysis. Now normally you'd use this Mark Generation button while you're running to set down some flags, so that you'd later know when you were doing certain actions, but in this case I actually know when I did them and I can add them after the fact, which is a really powerful technique.

So if I select Mark Generation, it creates sort of a baseline snapshot, and what this is doing is categorizing all the persistent memory in my application by where it came from on the timeline. So anything before this flag is grouped into generation one, anything between flag one and flag two is generation, I guess, B, as it's labeled, and I can keep marking these.

I can even actually choose to move them around just by dragging them, and it updates the generation information immediately. So if I turn down generation B here, you'll notice that, oh, well, we have those DotsGame objects again, and we also have an equal number of NSMallocBlocks. Well, that's kind of interesting, I wonder where these are coming from. Perhaps we've got a retain cycle involving a block.

So on the right in the inspector area, I take a look at the stack trace, and I'll notice that here we have a Swift frame calling newGame, an action on my view controller. Let's take a look at that real quick. If I jump to Xcode, here's my newGame function, and it's just creating a DotsGame, assigning it to my library, and then registering for state changes on that game and passing in what looks like a block. Well, this is actually a Swift closure. We can take a look at the header, and we see here that registerForStateChanges takes a dispatch block, and here when we're providing it we just use a very natural syntax.

Well, if you're familiar with blocks and retain cycles, it's very easy to create a cycle by capturing the game in the block registered on that game. And so what we really need to do here is declare how we would like to capture the new game, and, well, it's probably safe to capture self weakly as well here. Now Swift offers a really good technique for doing this, and that's capture lists. So we can just say that we would like to capture self and the new game weakly in this block.

Now we get a compile error real quick, and that's because when we declare something weak it gets implicitly converted to an optional. Well, that's fine. If self is nil, we really don't care to call that method, so we'll go ahead and add some optional chaining, and there we go, we're done.
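
Put together, the fix looks roughly like this sketch; DotsGame, registerForStateChanges, and the surrounding names are stand-ins reconstructed from the transcript, not the demo's actual source:

    // The game retains the closure it's handed, so a strong capture of
    // the game (or self) inside that closure would form a retain cycle.
    class DotsGame {
        private var stateHandler: (() -> Void)?
        func registerForStateChanges(_ handler: @escaping () -> Void) {
            stateHandler = handler
        }
    }

    class GameViewController {
        var currentGame: DotsGame?

        func newGame() {
            let game = DotsGame()
            currentGame = game
            // The capture list makes both references weak, breaking the cycle.
            game.registerForStateChanges { [weak self, weak game] in
                // Weak captures are optionals, so optional chaining simply
                // skips the call if either object is already gone.
                self?.refreshBoard(for: game)
            }
        }

        func refreshBoard(for game: DotsGame?) { /* update the UI */ }
    }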

So, as I showed you, Instruments now supports app extension profiling. You can specify your host app either in the menu in Instruments or in your scheme in Xcode. And for app extensions, while they are transient processes, memory does matter, because if they crash, if they flicker, it's a very bad experience for your users, and you want to make sure that when they're running in other people's applications, when they're running in the system, you get the best user experience possible.

Now I'd like to direct you to two other sessions if you're interested in learning more about app extensions on iOS and OS X. So let's talk briefly about memory management with Swift. Objective-C code can still mismatch retain/release, especially if it's code that you've handwritten and you haven't converted to ARC.

So when you're using Swift objects and passing them to Objective-C code, you may need to do some retain/release pairing and track down those problems. Also, more important, is that you can still form cycles in Swift. Luckily there are two great language tools available to us to manage that graph and break these cycles.

So the first of them is weak, and this actually changes your type: it requires that you use an optional type, because what it does is return either the value or nil, depending on whether the object that it references is still alive at the time you attempt to use the variable.

Now for unowned, you may think this is a lot like unsafe_unretained in ARC, but it's actually safer than that. What it does is always return the type, but if the object doesn't exist you don't get an unrecognized selector bug or an indeterminate crash later. It actually validates whether the object is still there and then gives you a deterministic crash, so these are some very powerful tools at your disposal when dealing with Swift.
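
Here's an illustrative sketch (not from the session) of the two annotations side by side; the class names are hypothetical:

    class Person {}
    class Customer {}

    class Apartment {
        // weak: must be an optional type; the runtime sets it to nil
        // once the referenced Person is deallocated.
        weak var tenant: Person?
    }

    class CreditCard {
        // unowned: stays non-optional. Using it after the Customer is
        // deallocated traps with a deterministic runtime error instead
        // of an unpredictable crash later.
        unowned let customer: Customer
        init(customer: Customer) { self.customer = customer }
    }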

And, as I showed in the demo, block captures are a really easy way of creating retain cycles, and something you should be aware of. In Objective-C, if I reference self like this, it's going to capture it strongly, as well as the new game, and I would have to use these weak type of declarations and then use them inside the block.

In Swift, like I showed you, it's very, very easy using capture lists to quickly change the memory type of your variables within the closure, and so just by declaring them weak in the capture list we can use them with no problems. And Swift is also really powerful here in that if you reference a property, it will force you to declare that you really do intend to capture self by using self.property.

To learn more about Swift we've got a lot of sessions this week, "Swift Interoperability In-Depth", "Advanced Swift", and if you want to learn more about the generational analysis that I did during the demo I'd refer you back to last year's talk on "Fixing Memory Issues". So, with that, I'd like to turn it over to my colleague, Kris Markel, for telling you about Time Profiling.

Time Profiler, at a high level, is a tool within Instruments that helps you make your apps faster. But being faster in and of itself probably isn't your real goal. Why is it that you actually want to use Time Profiler? And the answer, I think, in a lot of cases is to provide a great user experience.

And what does that mean? That means you have faster app launch times. If it's important that your frame rate stay at 60 frames per second, Time Profiler is a tool that can help you do that. Buttery-smooth scrolling, Time Profiler is great at finding hiccups in your scrolling and helping you get past those problems.

And kind of a more general category, you want to have a responsive UI. If the user does something in the app, they need to see the app respond as quickly as possible. You don't want them to click a button and then wait and wait and wait for something to happen.

So what exactly is Time Profiler? Time Profiler is an instrument that samples stack traces at prescribed intervals, and its goal is to give you an idea of how much time is spent in each method. There's an important distinction here, in that Time Profiler is not actually tracing through every method and function call in your app and measuring how long it takes. It is capturing stack samples and recording how often frames appear in those samples.

Now this can be important if you have a function or method that doesn't run enough to show up in any samples or it's too fast to show up in any samples. This isn't an issue most of the time, but it's an important distinction to keep in mind when you're using Time Profiler. When should you use Time Profiler? Well, there's a couple different situations.

If you have a specific problem that you know about, Time Profiler is great, you know, if you have stuttering or frame rate slowdowns, some portion of your app is taking too long, parsing a file or something of that nature, but it's also fine to use Time Profiler to identify and fix hot spots before they become problems. You don't have to wait until you have a problem before firing up Instruments and using Time Profiler. And, actually, it's a good idea to keep an eye on the CPU gauge inside of Xcode.

If that's spiking in ways you don't expect or it just seems higher than you want, that's a little clue, you know, something should go off in the back of your heads, maybe I should look at Time Profiler and see exactly what's going on here. And so now I'm going to show you a demo of how you can make your app faster with Time Profiler.

And what I have here is a hybrid app and it's got some Swift files and it's got some Objective-C files, and I've been working on it for a while, and I want to get an idea of how well it's running before, you know, I don't know of any specific problems, but I want to kind of take a look and see if maybe there's some hot spots in there.

So I'm going to go ahead and Time Profile this app, and to do that I am going to click and hold on the Run button in Xcode and choose Profile. And this is going to launch Instruments into the template chooser, and once it's done building I'll show you a nifty little trick.

And if you're familiar with Instruments in the past, what would happen here is it would give you the template chooser, what you see here, and you would go down to the button down here, and you would click it, and Time Profiler or whatever (actually, let me choose that now while I'm thinking about it), whatever instrument you had chosen, would automatically start running and recording for your app.

With the new Instruments it gives you a chance to change your configuration before it starts running. Now if you want the old behavior it's still there, if you hold down the Option key you can see the button change to Profile, and that'll launch your app and start running it immediately. In my case I don't want to do that, I'm going to go ahead and go to Choose, and I'm actually going to make a couple adjustments.

And, actually, I'm right here. You can't quite see all of it, but the first of the two changes I want to make (why will become clear soon) is I want to choose to record waiting threads, and this means don't only collect samples from the threads that are doing work, but collect samples from the threads that are idle, as well.

I happen to know that I want to do that because I've rehearsed this talk before. If you choose that option, another really great option to choose, in the view options, is to select running sample times, and what this means is that even though you're collecting the idle samples, they won't be used when computing time spent, because they really sort of water down the results. You're going to have a lot of idle time and get some very small numbers that you don't expect.

So, with that, I am going to zoom out, I'm going to click the Record button to start my app, and let's see here. So here's my app, I've got a set of pictures over on the left, I can select them and get the thumbnail and then get the big view. And now I'm going to go ahead and click on my button, and whoa, that's awesome.

So let's go through and use it a little more. One thing I notice is that between the time I click the button there's a little pause before the eyes actually show up, and I don't like that. My app is supposed to be eyemazing (if you can read that title bar, that's clear), and having to wait for your googly eyes is not an eyemazing experience. Well, fortunately, I was capturing a time profile, so let's go ahead and see what's going on here.

So now if we look at the track view, we have new gestures for working on the track view. And to apply a time filter, all I have to do is click and drag, and that applies the time filter. So it's much simpler, hopefully far more intuitive than it was before. And then to actually zoom in you do the same thing, but you hold down the Shift key, so I'll do that now, and I can zoom in, and then I'll go back out.

And so now I can see, if I come down here, that most of the time is spent in the main thread and in the start method and, oh, you're not here to watch me click disclosure triangles all day. I'm going to go over and use Command 3 to bring up the extended detail pane, and this over here gives me my heaviest stack trace. And what I'm looking for is in this column here. I'm looking for a big drop in the numbers.

And I can see right here I go from 120 to 180, that kind of gives me an idea of where my problem is, and when I select it in the stack trace, it selects it in the Call Tree view. And there's still a lot of stuff in there, a lot of which I don't really care about.

So I'm going to hit Command 1, well, actually Command 2, and go into my view options, and one of the ones I'm going to choose over here is to hide missing symbols. No, I'm going to choose hide system libraries, and this narrows it down to just my code. And it's kind of nice, if you look in here you can see, I'm going to expand this one more.

Oop, you can see here's an Objective-C call, here's a Swift call. Oh, I got this backwards, right? This is the Objective-C call, this one down here is the Swift call, and the @objc frame is the bridge between them. And if you're using a hybrid app you'll see those, and just so you're aware that they're there: they're not causing any harm or anything, it's not anything wrong.

Another thing I want to do is I want to invert the Call Tree because this makes it a little easier to kind of see. Frequently the most expensive calls in your app are towards the end of the Call Tree, so that kind of helps you narrow in on that.

And another thing that makes this nice is I can really quickly see, per thread, what the expensive calls are. So I'm spending a lot of time in this method right here, so I'm going to double click on it, and it brings it up, let me make this window a little bigger, and I can see my Swift code.

Ooh, and it scrolls really quickly, and thanks to this big red bar I can see that most of my time is being spent on this particular line of code. What is this code doing? Well, I'm using the core image face detection to find faces and then to find eyes on those faces, and that's where most of my time is being spent.

What are my options here? Well, I can either change the accuracy level of the detection, so it doesn't find faces as well, or I can maybe turn off the blink detection. Those two things might make it faster, but I don't think they would provide as eyemazing an experience as having high-fidelity face detection, because, you know, the googly eyes need to be on the closed eyes as well.

I could maybe write my own face detection and maybe that'll be faster than what's provided in the kit, probably not. So what are my choices here? Well, I actually notice using the app that even if I can't make this faster maybe I can change the way the work is being done.

And to do that, I'm going to change it so that when you click on the thumbnail in the sidebar, we'll go ahead and build the googly eye image then, and then by the time you click on the button it'll already be ready for you.

And so I'm going to scroll down to my tableViewSelectionDidChange, and I happen to already be caching my googly eye images, but I'm currently doing it when you click the button. So I'm just going to add a little code here to generate that when the selection changes. We're just going to move the work around here. And I'm actually cheating a little bit here, I want to point this out.

I'm using a mutable dictionary as my cache, and so instead of having to deal with a mutable dictionary and multiple threads, I'm actually dispatching this back to the main thread so that all of my dictionary access happens on the main thread. I can do this because this is a fairly simple app, it's not a lot of threading going on, but if you're doing more complex stuff you're going to need a more involved solution, but this works for me.
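
A sketch of the pattern just described, with all names hypothetical: the expensive googly-eye rendering happens on a background queue, but every touch of the mutable cache dictionary is funneled through the main queue:

    import AppKit

    struct Photo { let name: String; let image: NSImage }

    final class EyeCache {
        private var cache: [String: NSImage] = [:]   // main-thread-only state

        func warm(_ photo: Photo, render: @escaping (Photo) -> NSImage) {
            DispatchQueue.global(qos: .userInitiated).async {
                let eyed = render(photo)             // expensive face detection
                DispatchQueue.main.async {
                    // All reads and writes of the dictionary happen on the
                    // main queue, so the dictionary needs no extra locking.
                    self.cache[photo.name] = eyed
                }
            }
        }

        // Call from the main thread only.
        func image(for name: String) -> NSImage? { cache[name] }
    }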

And if we scroll up to the actual button handler (this is what runs when you click the button), you can see here we basically get the selected row and see if the image is in the cache. If it's not, we go ahead and build it and put it in the cache, and then we display it. So let's go ahead and see if this makes it any better.

So I click the Profile button, and you can see it profile again. Now if you keep your eye on the track view you'll see that it still does work, and you can still see the spike, right? But by the time I actually go to click the button the work is already done. So now there's a little spike, click the button, oh, scroll down here, try this one, oh, I love this app, I could do this all day. Oh, wait, what's going on here? Where is my scrolling? What's going on? Oh, it's stuttering, oh, this is not eyemazing.

I need to fix this. Okay, let's see what we can do. This is really terrible. Fortunately, I'm still running Time Profiler, so I'm going to go ahead and select some of these little spikes in here, and I apply a time filter and hold down the Shift key, and go ahead and zoom in.

And then if I look, I still have Invert Call Tree on, so if I expand my main thread I can see its most expensive calls. And right here the thing that's really hurting us is scaling the image; that's where we're building the thumbnail and making it smaller. But the other thing I happen to notice is that I can see the CPU spikes, and there's a few of them, but they don't really quite explain that much stuttering. I would expect the CPU to be doing a lot more work if I was stuttering like that.

So another thing I'm going to go check is I'm going to go up here in the toolbar to our new strategy buttons, and I'm going to check out the thread strategy, which is this button over here. And this is going to show me what each thread is doing. So now you can see the whole thing. Each of those little stopwatches represents a sample that was taken on that thread.

The ones that are solid are ones where the thread was doing work, and the ones that are translucent are the ones where the thread was idle. So if I go ahead and click on one of these, you can kind of see, if you look up at the top, it says idle call stack sample, not running on CPU, so this is an idle case. And now I want to make one thing clear here because, again, this demo is well rehearsed; I know exactly which of these to click on to find my problem.

When you're dealing with your app, there's thousands of these samples. You're going to have to use some detective work to really narrow down where in your app you're seeing the stuttering or you're having the problem, and you're probably going to have to click through several of these to find out exactly what's going on.

But in my case I can look through, and in each of these I can see, what is this: NSURLConnection sendSynchronousRequest. Whoa, that's a big red flag: a synchronous request, bad idea. A synchronous request on the main thread? What the heck is going on here? So let's go back to the app. I know there's something going on here, and I know it's associated with building the thumbnail.

So I'm going to go into my photo class, and go to where, oh, there's a little comment here. What does this say? "I read on the internet...", Kris from six months ago. I don't even need to read this whole thing. That guy is a fool, if only there was some tool. I'll show you a tool.

Okay, I'm just deleting all this code. We don't need to use a URL request to load a local resource. I happen to know (that's good-bye) that for the image we're looking to create the thumbnail from, if you look down here in my initializer, I actually have a property that already contains that image. So instead of doing whatever that guy from six months ago was thinking, I'm just going to use the image that I already have, and that'll free up my blocked thread.

Now I also happen to know, this is a well rehearsed demo, that doesn't quite solve the problem. It helps a bunch, but there's still a little stuttering, and to fix that we're actually going to do another little cheat, we're going to move some work around. I'm actually going to go and pre-generate those thumbnails when the app starts, and I'm going to do that here.

This Swift class here has a loadPhotos method, and this just goes and pulls all the photos out of the bundle and creates a list. And in here I'm going to add some code that goes through and simply touches each photo, asking for its thumbnail, and that causes the generation of the thumbnail, so that work happens.

We're doing this on a background thread; basically, as soon as we load all the photos, the next thing we do is go and generate all the thumbnails. This doesn't actually technically make my app more efficient, it just moves the work around, but from the user experience side they won't really know.
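
As a sketch (reconstructed, not the demo's source), the pre-warming looks something like this; the Photo class and its lazy thumbnail are hypothetical stand-ins:

    import AppKit

    final class Photo {
        let image: NSImage
        init?(url: URL) {
            guard let image = NSImage(contentsOf: url) else { return nil }
            self.image = image
        }
        // Generated on first access, then cached.
        lazy var thumbnail: NSImage = Photo.scaledDown(self.image)

        static func scaledDown(_ image: NSImage) -> NSImage {
            // Expensive scaling work elided for the sketch.
            return image
        }
    }

    // Runs on a background queue: touching each thumbnail forces the
    // expensive generation now, before the user ever scrolls.
    func loadPhotos(from urls: [URL]) -> [Photo] {
        let photos = urls.compactMap(Photo.init)
        for photo in photos {
            _ = photo.thumbnail
        }
        return photos
    }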

So now if I go ahead and profile my app again, we can take a look; you can watch. At the beginning of the run you can see it doing more work, and then if you actually use the app, I should have some nice buttery-smooth scrolling here. Oh, this is a good one: fun at the park, fun at the park with googly eyes. So now my app is great, it's super responsive, buttery-smooth scrolling, it's just so much better. And that's an example of Time Profiler.

[ Applause ]

So, to real quickly (or, I don't know, maybe I won't be that quick, we'll see) review what we've done: we talked about the new track view, where you can identify and zoom in on the problem areas. You drag and select to apply a time range filter, Shift+drag zooms in, and Control+drag will zoom you out.

We also talked about the new inspector panes. There are some new keyboard shortcuts, so you can quickly move between them. If you're familiar with a certain IDE, this should kind of ring a bell. Command 1 is the record settings. These are the settings that apply to the next recording session.

Command 2 is display settings, so it alters how you see things, either in the detail and extended detail or in the track view itself. And then Command 3 is the extended detail, so it's basically details on whatever is selected, either in your detail view, like the Call Tree view, or in the actual instrument tracks themselves.

We also looked at two of the strategy views. The one we didn't look at is the CPU strategy, and this will show you what work each core on your machine is doing. We did see the Instruments strategy, the one all of you are probably familiar with: each track shows you what an individual instrument is recording. And then, finally, the thread strategy, which shows you what each thread is doing.

And then just a quick reminder: if you're in Time Profiler you can enable record waiting threads and see any blocked threads you have, and then you can go through them and see if maybe there's something in your code that's preventing your threads from doing work when they should be.

We also talked about a few Call Tree settings. Since expensive calls are frequently near the end of the call stack, you can turn on Invert Call Tree. It's also a great way to see what each thread is spending its time doing. And then you can hide system libraries to help you focus on your code and not, you know, what the frameworks are doing, which you probably can't do a whole lot about.

And now I want to discuss a couple of tips. In the context menu on the Call Tree view, you can choose charge, prune, and focus settings. Charge moves the associated cost, based on which option you select. Prune removes the associated cost, so those things aren't even sort of part of the calculation anymore. And then focus you can think of as a prune everything but.

And then, finally, I want to talk about two more guidelines, some things you should always do when you're using Time Profiler. First, you should always profile Release builds. The performance characteristics of a debug build versus a release build are significantly different, and profiling debug builds could cause you to fix problems that aren't actually there, problems that get optimized away.

The default in Xcode is to profile Release builds, but this is not necessarily the best option for every instrument. For example, with the Allocations instrument, when you're using Objective-C code, the debug information provides it extra data it can use to report on your allocations. And you can change this in the Scheme Editor in Xcode: you just choose the Profile action and you can change the build configuration between Release and Debug.

And then, finally, for iOS it is really important for you to profile on the device. The simulator is much, much faster than the device for some things, and for other things it's actually much slower. It really is not a good use of your time to run Time Profiler in the simulator, all right? And, with that, I'm going to hand it off to my colleague, Joe Grzywacz, who is going to show what options you have when Time Profiler is not quite enough.

Thank you, Kris. My name is Joe Grzywacz. I'm going to talk to you about a few things, starting with: what are performance counters? Each of the processors in your machine contains some small number of these 64-bit hardware registers. You typically have like 4 to 8, some very small number; it's a very limited resource. And these are separate from the regular floating-point and integer registers that you may think of when you're talking about your program actually executing and doing work.

Now what can you do with these registers? Well, they're good for counting some small number of events, things such as the number of instructions you've executed, cache misses, branches taken, you know, low-level, CPU-specific things. Alternatively, you can have them take a call stack every time some predetermined number of events occurs. These are called performance monitoring interrupts, or PMIs for short. So what does that look like? Let's say we have a program executing here over time, and the blue bar represents the number of branches we are executing.

And so higher means we actually did more in this period. With the time-based sampling we just saw in Time Profiler, we're going to take samples at a regular periodic interval, 1 millisecond by default (and this is true for Time Profiler as well as the Counters instrument). But if we're really interested in what our program is doing when it's being very branchy, what we can do is set up one of these PMIs so that we actually get more samples during the periods we care about.

So, let's say every 10,000 branches or so, we actually take a sample, and so we get more samples in the areas we care about and fewer everywhere else, because that's just noise to us. And this should let us hone in on the problem in our application just a little bit quicker.

So what else can you do with Performance Counters? Well, you can get more information about your app's performance beyond just time that we saw, so it's sort of a Time Profiler on steroids, and this is why we consider it maybe a little bit more of an advanced instrument, but it's something you should still have in your toolbox.

So what can you learn? Things about how you're utilizing your CPU resources, things like the various levels of caches in your machine, the execution units, translation lookaside buffers, if you know what that is. So you can look at all sorts of low-level details that might be bottlenecks for your particular application.

You can also learn more about your application itself. For instance, how often are you branching? How often are those branches mispredicted? You can even look at your instruction mix, you know, how many integer operations are you doing, how many floating-point operations are you doing? These are things you can't always determine statically by looking at your program, because you're depending on some user input or something from the internet, that sort of thing.

So we've had counters for a few years now, but this year we've added a few new things. First, you can now specify formulas: you can get those raw events in, and then actually apply some basic arithmetic to them on top of that. So you can see, we can generate IPC, which is instructions per cycle. That's a measure of how much work you're doing in a given clock tick. Generally, a higher number here means you're doing more work and performing better.

You can calculate a branch misprediction rate, or things like your L1 cache miss percentage: take the total number of loads to your L1 cache that missed, add them to the total number of stores that missed, then divide that by the number of total loads plus the number of total stores, and multiply it by 100 for a miss percentage.
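
Written out, the two formulas just described have roughly this shape (the exact event names vary by CPU, so treat these as the shape of the formulas rather than real event names):

    IPC = instructions retired / CPU cycles

    L1 miss % = 100 x (L1 load misses + L1 store misses) / (L1 loads + L1 stores)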

Now we've had x86 support for a few years, and this year, with iOS 8, 64-bit ARM devices get counters as well. And then the Event Profiler instrument, if you're familiar with that, that's where you used to go for any kind of PMI-based sampling. You can just do that right within the Counters instrument now, and there's actually no need to use two separate instruments. Anything counters-related, head over to Counters.

And, with that, we'll do a little demo of iOS performance counters. So I have a little basic matrix multiplication app. In case your linear algebra is a bit hazy: to do a matrix multiplication of two N by N matrices, for each element in your output matrix C you have to multiply, piece by piece, an entire row of the input matrix A by an entire column of the input matrix B, and then add each of those parts together.

So in pseudocode it's basically just this triple-nested for loop here, where all the work is really the sum of products happening here. So in Swift code (this does scroll fast, all right), we have, basically, our triple-nested for loop, with all that work being done again in that summation of matrix A times matrix B. And so what I want to do is go ahead and profile this in Instruments. So now this is going to run on my device.
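
Since the demo's source isn't reproduced in the transcript, here's a self-contained sketch of that naive triple-nested loop, for square n by n matrices stored row-major in flat arrays:

    func multiply(_ a: [Double], _ b: [Double], n: Int) -> [Double] {
        var c = [Double](repeating: 0.0, count: n * n)
        for i in 0..<n {
            for j in 0..<n {
                var sum = 0.0
                for k in 0..<n {
                    // Walks a row of A (contiguous) and a column of B
                    // (stride n); the strided B accesses are what thrash
                    // the cache, as the counters will show.
                    sum += a[i * n + k] * b[k * n + j]
                }
                c[i * n + j] = sum
            }
        }
        return c
    }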

I'm going to use Option-Profile, so that's just going to launch Time Profiler, and it's going to start recording on my device. And I can tap a button, and it starts my matrix multiplication. Now this is going to take a little while, because I'm multiplying two 1,000 by 1,000 matrices, and each element is a double, so that's 8 bytes, which is 8 megabytes per matrix, 24 megabytes in total. And so it takes about, well, it took about, what, 12 seconds to do that calculation. So I'm going to go ahead and stop, and click-drag to time filter to the region of interest. Let me make this a little bigger.

And what I want to see is: where was my app spending time? So if we zoom in here on our extended detail view, we can follow that really big number; that's how many samples there were. It gets really small as we drop down here, so most of the samples are actually occurring in this matrix multiplier routine.

And if we double click on that routine we get taken to the source code, just like before, and we can see, just as I expected, 92% of my time is being spent right on that summation line. I don't have much code, it's all being spent on the one line I wrote, so I'm kind of at the end of the road with Time Profiler. I mean, I think I've done as much as I could do. Well, we can actually dig a little bit deeper if we're willing to go look at this assembly.

And what we have here is these ldr instructions that are taking a bunch of time, and an ldr is basically just loading a register from memory. And so we can see these two ldr instructions, one of them is loading Matrix A, one of them is loading Matrix B, and they're taking a long time. So now I can use the Counters Instrument to actually go and figure out why. So I'm going to start a new template, select Counters, there we go.

And now I want to add some interesting events to record using Counters. And so if we choose to add an event, we can see the list of all of the events that this device supports, and I'm interested in my L2 cache, so I'm going to add the number of L2 cache loads. And I have a favorites list here already of the L2 cache loads that missed, the number of stores, and the number of stores that missed. So I have those events. I'm going to create a formula using them.

So I want to get one number that kind of represents those four things, and what I'm going to do is actually generate a miss rate by taking the number of loads that missed plus the number of stores that missed, and dividing that by the total number of accesses, which is loads plus stores, okay.

And if I multiply that by 100, now I have my miss percentage. That's a little negative for me; I prefer to think in hit percentages, so I'll just take 100 minus that number and give it a name: L2 Hit Percentage. Add that event, and, as I mentioned before, IPC is one of those handy guys to have. There's a predefined formula for IPC, so I'm going to go ahead and add that as well.

And so I'm going to start my recording again, and this time, as well as the Call Tree view, the track view has a bunch more things being drawn, and those are all my events. So I'm going to start my matrix multiplication again; you can see all those events going by, and they're being calculated. And as this wraps up we can actually start looking at the information in detail. So it just finished, and what we have here is the number of L2 cache loads, which was at 462 million.

Okay, that's a lot. 124 million of them missed, and there's a much more insignificant number of stores. Our overall hit percentage was 72% or so. Okay, that's not too bad. Our IPC was 0.6, that means we're doing less than one instruction per clock. I don't know, I think that sounds kind of bad, I thought computers could do multiple things per clock these days.
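
As a quick sanity check of the hit-percentage formula against those numbers (the store counts are small enough to ignore): 100 - 100 x (124 million / 462 million) is about 73 percent, right in line with the reported figure.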

But the number that sticks out to me most is that L2 cache load number: 462 million loads to my L2 cache when my input matrices are only 1 million elements each, so a kind of back-of-the-envelope calculation says I should only probably do about 2 million, so why is it so big? Well, it turns out if you go online and search, there's actually much better ways to do Matrix Multiplication that are a lot more cache friendly for today's processors.

And one of those ways is to actually go and block up our matrix multiplication: instead of trying to do an entire row and an entire column at once, we're going to do it in smaller pieces. Because this matrix is so big, trying to do it all at once means we're going to load in a lot of data, kick it back out to memory, then load it back in again and kick it back out again, and that's very inefficient and slow.

So I have another implementation where I actually block this matrix up into, oh, like 50 by 50 pieces, so much smaller. And we can see it execute here, and there, it already finished. So, let's see, time filtered down to that region, we can see our running time is now at 3.5 seconds, so we already sped it up about three times, and it took about 3 lines of code, very straightforward.
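
A sketch of that blocked (tiled) version, reconstructed rather than taken from the demo; the 50 by 50 tile size is the one mentioned above:

    func multiplyBlocked(_ a: [Double], _ b: [Double], n: Int,
                         blockSize: Int = 50) -> [Double] {
        var c = [Double](repeating: 0.0, count: n * n)
        for i0 in stride(from: 0, to: n, by: blockSize) {
            for k0 in stride(from: 0, to: n, by: blockSize) {
                for j0 in stride(from: 0, to: n, by: blockSize) {
                    // Each pass touches only small tiles of A, B, and C,
                    // so the working set stays cache-resident.
                    for i in i0..<min(i0 + blockSize, n) {
                        for k in k0..<min(k0 + blockSize, n) {
                            let aik = a[i * n + k]
                            for j in j0..<min(j0 + blockSize, n) {
                                c[i * n + j] += aik * b[k * n + j]
                            }
                        }
                    }
                }
            }
        }
        return c
    }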

And our L2 cache load number is now at 24 million, so we went from 450 million or so down to 24 million, which is much improved. Cache stores are still somewhat insignificant, although they're growing in size relative to loads. Our L2 hit percentage, interestingly enough, has come down: we were in the low 70s before, now we're at 64%, which is a little bit odd. Our IPC, though, is much better; it's at 1.6. Now we're actually doing more than 1.5 instructions per cycle, so that's a lot better.

So this is where we kind of consider this a little bit more advanced instrument because you can generate a lot more of these numbers, but you have to know how to interpret the results. And so our IPC got better, so that was great. Our cache hit rate went down, but was that a problem? Well, no, because we actually did hundreds of millions of loads less to that cache, so the fact that that hit rate came down a little bit, not really that big of a deal.

To kind of wrap up this matrix multiplication thing: you could continue, keep on iterating, and try to write a better version; however, we've already done that for you. There's the Accelerate framework that has that done, and if I tap on that version, you can see it actually finished here in about 200 milliseconds. It's one line of code, it works, and it's optimized for multiple cores and multiple devices. A piece of cake; please don't go write your own matrix multiplication.
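
The session doesn't say which Accelerate entry point the demo used, but as a hedged example, vDSP_mmulD is one natural one-call way to multiply double-precision matrices:

    import Accelerate

    func multiplyAccelerate(_ a: [Double], _ b: [Double], n: Int) -> [Double] {
        var c = [Double](repeating: 0.0, count: n * n)
        // Arguments: A, strideA, B, strideB, C, strideC, then M, N, P
        // for an M x P times P x N multiply; here all three are n.
        vDSP_mmulD(a, 1, b, 1, &c, 1,
                   vDSP_Length(n), vDSP_Length(n), vDSP_Length(n))
        return c
    }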

Now the last thing to point out on this demo is that I spent a lot of time over here, down here, adding these events and adding these formulas. They're really useful, and I'm going to come back to this instrument many, many times. I don't want to type them in again, so be sure to go up to File > Save As Template, give it some interesting name, L2 Hit Rate Counter or something like that, and save that off. Then the next time you go to the template chooser, use the custom filter and load up this template, and you don't have to type all that in again and again and again.

[ Applause ]

So what did we see? With counters you're basically going to collect data by default, similar to Time Profiler. It's a statistical sampling of your application. This is an important point: that basically for the past 1 millisecond by default, we're aggregating the time and all the events, and then we're going to attribute it to that one moment in time where we take a call stack sample of your application.

This means you want to get a lot of CPU samples to get a really accurate picture of what your application is doing. If you're not very CPU-intensive, this isn't the instrument for you. You can tweak that slightly by using the performance monitoring interrupts. So this allows you to vary the sampling rate based on the number of events that are happening. So if you're really interested in branches or L2 cache hit rate, you can actually sample based on how many of those things are occurring and get more refined results on what you're looking at.

Note that PMI sampling can be a little bit imprecise, in that by the time the interrupt fires and you take a call stack, a small number of instructions may have gone by, which means you won't be on the precise instruction that caused it, but you're going to be right in the neighborhood, and it'll get you to the problem a lot more quickly.

And be sure to use formulas: you can gather all these raw events, but a lot of times it's easier to generate one metric, which makes it a lot easier to see the problem at hand. So generate those formulas and then save them inside a template so that you can get back to them quickly in the future.

Changing gears entirely: DTrace. If you're unfamiliar with DTrace, I recommend you go online; there's plenty of information about it. Basically, it's a kernel-based recording technology that lets you get more information about your application, especially how it interacts with the kernel. Today we're only going to talk about what's new in OS X Yosemite.

First, there's dynamic tracemem. tracemem is a way to actually print out the contents of your memory. Previously you had to know the size of your array at compile time, and that happens to work a lot of the time, but not always. Now it'll work with variable-size arrays. So say I want to trace this CGContextFillRects guy: arg1 is a variable-size rects array, and arg2 is the count of how many things are in there.

So you can write a DTrace script where, upon entry into CGContextFillRects, we copy in arg1, which is the array, and the size of that copy is the size of a single CGRect times arg2, which is the count. So once you have that copied into the kernel, we call tracemem on it: we pass the array, we pass 512, which is the maximum size we ever expect it to be for our application, and then again we pass that same computed size. So now tracemem works on your variable-sized arrays.
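
Put together, the script described reads roughly like this; it's a reconstruction from the description above, with sizeof(CGRect) hard-coded for a 64-bit process:

    /* On entry to CGContextFillRects, copy in the variable-length rects
       array (arg1); its element count arrives in arg2. */
    pid$target::CGContextFillRects:entry
    {
        this->size = 32 * arg2;               /* 32 == sizeof(CGRect): 4 doubles */
        this->rects = copyin(arg1, this->size);
        /* New three-argument tracemem: buffer, max size, dynamic size. */
        tracemem(this->rects, 512, this->size);
    }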

The next big improvement is that histograms have been vastly improved. There are a number of new options, too many to discuss here, but basically you'll get much better, easier-to-read histograms, better ASCII art in your terminal when you're generating these things. I refer you to the excellent blog post on dtrace.org that will describe all the options and all the ways it will be helpful to you.

There's a few other new features. Previously, if you wanted to make sure you got every single probe firing from DTrace, you might have to start your application as root; now you can actually ask DTrace to wait for the process to start with -W. So you fire up DTrace, and then you start your application. And with a little help from the kernel (stopping your application, letting DTrace come in and install its probes, its hooks), your application will then be released by the kernel to run, and you won't miss any of the probe firings at the beginning.

There's also a more advanced feature, which is tunable variables. So if you run sysctl kern.dtrace you'll get a list of what variables your current kernel supports, things like maximum buffer sizes and enabling additional static kernel probes; some pretty advanced features that you can now toy with that you couldn't before. And to learn more about them, all the documentation has been updated, so just run man dtrace and you'll get all that info.

So to summarize, today we saw a lot of things. We saw Allocations, Leaks, Time Profiler, and Counters; there's a lot of tools in the toolbox, and you can use them on Swift and Objective-C code alike; it'll just work. And we want you to be proactive: use these tools throughout the application development process.

Even while you're debugging something, maybe you see out of the corner of your eye one of those CPU or memory gauges spike in a way you don't expect. Go ahead and profile your application in Instruments right then, and actually try to improve your application right then, not just at the end. And while you're doing that, remember: don't assume. Go ahead and profile your app, make a change in Xcode that you think will fix it, then profile in Instruments again, and keep on iterating until the problem has been resolved.

If you have any more questions, you can always contact Dave Delong, our Evangelist. We have documentation on developer.apple.com and there's excellent dev forums that you should know about, where you can talk to other instruments users, post questions, post answers. Instruments developers will be on there from time to time answering your questions, as well.

There's a few related sessions, and we saw some stuff about App Extensions today, so you'll see that. There's a second talk tomorrow. There's some Swift talks, of course, all throughout the week, and even some interesting Swift debugging stuff with LLDB on Friday. And, with that, I'd like to thank you for coming. Have a great week.

[ Applause ]