Essentials • iOS, OS X • 57:47
Core Data is a powerful framework on OS X and iOS for automatic persistence and object graph management. Master the techniques you need to get the most out of Core Data. Learn the best practices for working with nested managed object contexts, passing data between threads, and how to use Instruments to identify and fix performance bottlenecks in your app.
Speakers: Ben Trumbull, Melissa Turner
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.
Good morning, everyone. My name is Ben Trumbull, and I'm the engineering manager for Core Data, and I'm here to begin the Core Data Best Practices session. Today we're going to talk about a number of topics. We're going to talk about concurrency and nested contexts, and then I'm going to bring Melissa Turner on stage to talk to you about performance, schema design, and search optimization. So as part of some of these topics, we're going to talk about using Core Data with multiple threads, sharing unsaved changes between contexts, debugging performance with Instruments, tuning your model, and improving your predicate usage. So first up, concurrency.
So when using Core Data or really any modeling objects, there's some challenges you're going to face. First is obviously thread safety. But then we're going to look at some issues with transactionality when you have a bunch of changes together. And of course, you're going to need to balance that with performance.
So in the past, a lot of people have done something like this. They have a bunch of different contexts together and they use perform selector to route, say, a merge notification or some other method onto the main thread or a specific thread to get to a context. This makes us a little sad, though. So in Lion and iOS 5, we introduced some new methods and some new concurrency types for managed object contexts.
And instead of having to trampoline through perform selector, you can use perform block and perform block and wait. It's going to look a little bit something like this. When you create a managed object context, you'll specify what type of concurrency you want it to use. It will manage that itself, and you then use perform block to route tasks to it.
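A minimal Swift sketch of this pattern, not the code from the session's slides. It uses the current Swift spellings of the methods named here (`perform` for perform block, `performAndWait` for perform block and wait). The helper names `makeBackgroundContext` and `saveIfNeeded` are hypothetical, and the coordinator is assumed to be set up elsewhere.

```swift
import CoreData

// Hypothetical helper: creates a context that owns its own private serial queue.
func makeBackgroundContext(coordinator: NSPersistentStoreCoordinator) -> NSManagedObjectContext {
    let context = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
    context.persistentStoreCoordinator = coordinator
    return context
}

// Hypothetical helper: routes work to the context's queue instead of
// trampolining through performSelector. Safe to call from any thread.
func saveIfNeeded(_ context: NSManagedObjectContext) {
    context.perform {
        guard context.hasChanges else { return }
        do {
            try context.save()
        } catch {
            // Errors are handled inside the block; nothing is thrown out of it.
            print("Save failed: \(error)")
        }
    }
}
```

The caller never touches the context's queue directly; the context decides where the block runs.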
So there are three concurrency options you can use in Core Data. The first one is you can tell a managed object context that you want it to be bound to the main thread. And this is great for interacting with view controllers and other aspects of the system that are bound to the main thread or don't really know very much about Core Data and its concurrency. And then for a lot of your background tasks in your own work, you can use private queue concurrency. And finally, there's the confinement concurrency type, which is basically what people have been using in the past before we introduced these new options.
So for the confinement concurrency type, you're basically required to have a separate context for every thread, and managed object contexts can only be used on the thread or queue that created them. And this is the default legacy option. So with the confinement type, everything is going to be serialized against your workflow.
And in addition to a specific thread, you can use either a serial dispatch queue or an NSOperationQueue with its maximum concurrency manually set to one. So here I just want to point out that Core Data isn't using any thread local state. And we're really interested in having a single control flow. We're not really as focused on whether or not dispatch queues work with multiple threads or how that's happening underneath the covers.
So thread confinement is pretty straightforward. It's safe. It's efficient. The transactions are obviously all scoped to the managed object context since nothing else gets to interfere with it. But it does put a burden on you to manage all of these issues. So in particular, you're tracking which context goes with which thread and potentially keeping extra threads around for background tasks. And then all of the special behaviors that Core Data uses to integrate with view controllers, Cocoa Bindings, and other things, these main thread behaviors like undo management, we're going to have to infer based on whether or not you created the managed object context on the main thread. And those things, which we call user events, are typically driven by the application's run loop.
So in contrast to confinement, with the private queue concurrency type, NSManagedObjectContext maintains its own private serialized queue, and you can only use it on this queue. And you do that by setting up blocks as tasks and queuing them using perform block and perform block and wait. Now, within those blocks, you can use the managed object context API normally. And I just want to really emphasize that in this case, the queue is private, and you really shouldn't yank it out and interact with it directly. If you want to, you can dispatch work to your own queues with dispatch sync at the end of those blocks.
There are a number of advantages to this. It lets the managed object context maintain which queue it's using and handle whether or not it's in the right state, the right thread. And other threads can easily interact with that managed object context by just calling perform block. With the confinement concurrency type, by contrast, those other threads really can't message that managed object context at all. And these can be created from any thread. And the queues are going to be much more efficient than keeping extra threads lying around in the background to do other tasks like background fetching.
And the third type is the main queue concurrency type. This is going to behave very similarly to the private queue concurrency type, only the queue is obviously always the main thread. And non-main threads can just call perform block on that as well. And it will integrate all of those behaviors that I talked about, undo management and other application lifecycle events with the main run loop.
So what that means is when you create a managed object context with the main queue concurrency type, your view controllers and other things can just message it directly. They don't have to know about all these different perform block APIs. And it's very easy for other tasks that you have in the background to just enqueue perform block on it and have those then update view state.
So just sort of a diagram of what I mean going on here is a background thread can enqueue a block directly, but the view controllers can just start using managed object context API. So in this way, Cocoa Bindings, for instance, doesn't know about concurrency types or perform block, but it can just work with the managed object context the way it always has, and you can have background threads or other queues enqueue messages to happen on the main thread context that way.
So I mentioned that we have these notions of user events, and for the main thread, that's going to be tightly integrated with the application's run loop. But for contexts running off the main thread, either in a private queue or in your own thread, what I really mean here is Core Data is going to defer a bunch of tasks and then coalesce work later on. So this is the coalescing changes for notifications, delete propagation, setting up the undo groupings, stuff like that. And for the most part, on background threads, we consider this to be the time in between calls to process pending changes.
So a couple of useful points for all the concurrency types is that managed objects are always owned by their managed object context and that object IDs are a great way to pass references around between contexts because they're going to be safe, immutable value objects. And something else that's a nice point is that retain and release are going to be thread safe on all Core Data objects everywhere, all the time, without exception. They should be thread safe on all Cocoa objects, but your mileage may vary on that one.
But that means you can actually retain a managed object independently of its requirement on the managed object context. You just can't necessarily use it directly. So some good times for you to pass around updates to other contexts or to update the views are going to be with these NS notifications that Core Data provides, with the objectsDidChange notification and the objectsDidSave notification.
And you can refresh other managed object contexts pretty easily after they save with the merge changes from context did save notification method. And here, I'd just like to call out that you're responsible for the thread safety of the managed object context receiving this message, but you don't have to worry about the notification data that's being generated here. Core Data will manage the thread safety of the contents of that notification. So you just have to maintain the rules that we've outlined in the past on the receiver of the merge method.
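A hedged Swift sketch of that merge, with the receiver's thread safety handled by routing the call through its own queue. The helper name `observeSaves` and its parameter labels are hypothetical; `mergeChanges(fromContextDidSave:)` is the Swift spelling of the merge method named here.

```swift
import CoreData

// Hypothetical helper: keeps `receiver` up to date with saves made in `source`.
// Returns the observer token so the caller can remove the observer later.
func observeSaves(from source: NSManagedObjectContext,
                  mergingInto receiver: NSManagedObjectContext) -> NSObjectProtocol {
    return NotificationCenter.default.addObserver(
        forName: .NSManagedObjectContextDidSave,
        object: source,
        queue: nil
    ) { notification in
        // The notification arrives on whatever thread performed the save,
        // so hop onto the receiver's own queue before merging.
        receiver.perform {
            receiver.mergeChanges(fromContextDidSave: notification)
        }
    }
}
```

This assumes the receiver uses one of the two queue-based concurrency types; a confinement context would need to be messaged on its own thread instead.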
And when you're inside some of these notifications as an observer, you can find some useful methods for taking a look at the state of what's changed in the managed objects. Something that we added last release was changed values for current event, which will give you the values that changed since the previous call to process pending changes. And then some older methods, changed values and committed values for keys, will go back to the last time the object was saved.
So now I'm going to go into a little more depth about these perform block and perform block and wait methods that I mentioned earlier. And our challenge here is to find a way to pass work to other threads, these managed object contexts running on their own queues or the main queue, and to sort of demarcate the actual group of changes you want to be coalesced together, whether it's for an undo grouping or validation or potentially to save, as well as a convenient way to integrate with all the other APIs on the platform. And that's part of the reason why we chose blocks.
So perform block is an asynchronous request to enqueue this. We consider it its own self-encapsulated user event, and it also includes an autorelease pool. I really want to call out that in all of these methods, it is very illegal to throw an exception outside of the block. So if you do have exceptions, please catch them and resolve them inside the block.
And there's no support for re-entrancy here in this perform block method. And by that, what I mean is when you call perform block on a managed object context, and within that perform block call, you call perform block again, you're basically just sort of getting the same effect as if you had iteratively called perform block. So this is an asynchronous call, and all it's doing is queuing up a task to happen later.
So in contrast, we also have perform block and wait. This is synchronous. It's very lightweight. We don't consider it to be any kind of event, so there won't be any change notifications or undo groupings coalesced at this point in time. It doesn't even include an autorelease pool. But what it does do is it will support some re-entrancy. So if you call perform block and wait from within another perform block, you will basically get them nested. They'll be executed immediately inline as opposed to enqueued later. So this is very convenient as long as you're just working with one managed object context for these blocks.
So these APIs are very fast and lightweight. The perform block and wait API is on the same order of magnitude as value for key, for instance. And the changes there from Core Data's perspective are going to be scoped by the block. So however large or small you make the block is going to be sort of one self-encapsulated change set.
[Transcript missing]
So a simple example for how you might use some of these APIs. Here we have a context and it's synchronously calling perform block and wait to execute a fetch request that's been captured by this block from some code further up. And if we don't have an error, then we just ask the array of managed objects to give us back its object IDs and we return those out of the block with a block variable.
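A Swift sketch along the lines of the example described, not a copy of the slide. The helper name `fetchObjectIDs` is hypothetical; a local variable plays the role of the block variable that carries the result out of the block.

```swift
import CoreData

// Hypothetical helper: synchronously runs a fetch on the context's own queue
// and hands back only object IDs, which are safe to pass between contexts.
func fetchObjectIDs(matching request: NSFetchRequest<NSManagedObject>,
                    in context: NSManagedObjectContext) -> [NSManagedObjectID] {
    var objectIDs: [NSManagedObjectID] = []
    context.performAndWait {
        do {
            let objects = try context.fetch(request)
            objectIDs = objects.map { $0.objectID }
        } catch {
            // Resolve the error inside the block rather than letting it escape.
            print("Fetch failed: \(error)")
        }
    }
    return objectIDs
}
```

The managed objects themselves never leave the block; only their IDs do.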
So as I mentioned, the queue is often going to be very private to the managed object context, and we don't want you changing anything about it. So if you need to, and you're using your own queues, as I would expect, you can just simply, at the end of the work block that you passed to managed object context, enqueue another block back onto your own queue as the callback to let it know that it's done and process any results.
There are a number of other ways you can coordinate with either your own queues or other queues on the system. Dispatch semaphores, for example, are one way of doing that. You can create a semaphore and then at the end of the block signal the semaphore. And then in this particular code snippet, the context is asynchronously performing this block and the code that is calling perform here is actually waiting until that is done on the semaphore.
And then something else that I'd sort of like to give a little shout out are dispatch groups. If you haven't used them, they have some very interesting behaviors. And you can use them to organize some pretty complex dependencies between a variety of queues and blocks between them. So when you use dispatch group enter, it's a little bit like incrementing a retain count on when the queue will be done. And then the worker blocks can call leave to decrement it. And then when it ends up getting back down to zero, conceptually, dispatch group wait will return or dispatch group notify will enqueue a block that you passed it onto your own queue.
So what this lets you do is basically you don't actually have to know in advance how many waiters you want to float around. You can just keep calling dispatch group enter as you create more work or as you decide to build in new dependencies and then have them call dispatch group leave. So this is a very simple example. It's very similar to the semaphore example. This becomes more interesting when you have more queues involved.
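The enter, leave, and notify dance can be sketched in Swift as follows. The helper `performOnAll` and its parameters are hypothetical, and the contexts are assumed to use a queue-based concurrency type.

```swift
import CoreData
import Dispatch

// Hypothetical helper: runs `work` on each context's own queue, then calls
// `completion` on `callbackQueue` once every block has finished.
func performOnAll(_ contexts: [NSManagedObjectContext],
                  callbackQueue: DispatchQueue,
                  work: @escaping (NSManagedObjectContext) -> Void,
                  completion: @escaping () -> Void) {
    let group = DispatchGroup()
    for context in contexts {
        group.enter()                // one more outstanding piece of work
        context.perform {
            work(context)
            group.leave()            // balances the enter above
        }
    }
    // Enqueued on the caller's queue once the outstanding count is back to zero.
    group.notify(queue: callbackQueue) {
        completion()
    }
}
```

Because `enter` is called as work is created, the caller does not need to know the number of blocks in advance.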
So now I'd like to move on from concurrency to talk about nested managed object contexts. And in particular, the reasons why you'd be interested in nested managed object contexts are going to be passing unsaved changes around between contexts and implementing something like an asynchronous save. So in the past, working with managed object contexts, you can push and pull changes that have been saved between contexts and use, like, the merge notification to do that. But passing unsaved changes between contexts or having them really work with unsaved changes can be very difficult. And similarly, it's difficult to sort of break up the save operation to be asynchronous.
So here, for a nested context, the parent contexts are going to sort of act like the persistence store from the perspective of the child context. And the child context is going to see the state of its objects as they currently exist in the parent. Children will then inherit unsaved changes from their parent whenever they fault things in or they execute a fetch request. And they'll marshal their saves in memory. So instead of saving back to disk, the children will just turn around and save to their parent context.
So it looks a little bit something like this. And the child doesn't really know that it's not actually talking to the persistent store. It's just talking to a parent context. And the behaviors are going to be very analogous in terms of the ways both saving works, fetching works, and faulting.
So in this way, peers that all inherit from the same parent context can push and pull changes between them. And you can implement an asynchronous save by setting up the parent context to have a private queue and have the child context typically on the main thread save into the parent context and then tell the parent context to save. And one of the ways you might leverage that is something like a detail inspector. The detail inspector will inherit the view state as it is in your main context.
So for sharing unsaved changes, when you save the child contexts, the changes will just push up one level, and then you can pull those changes back down using a fetch or the merge notification between child contexts or calling refresh object. It's the same way you would with non-nested managed object contexts.
For an asynchronous save, when you save the child, like I said, the parent context gets those changes and holds on to them until it's told to save. And the changes won't be written to disk until the rootmost parent context calls save. So that would look something like this here where a parent context has a private queue concurrency type, so it will execute requests asynchronously, and the child context gets set up. It just creates a reference to its parent context. So when the child saves, it pushes its changes up to the parent, and then here it enqueues an asynchronous block to tell the parent that you want the parent to save.
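A minimal Swift sketch of that two-level arrangement, written from the description rather than from the slide. The helper names `makeNestedContexts` and `saveAsynchronously` are hypothetical, and the child is assumed to be used from the main thread.

```swift
import CoreData

// Hypothetical helper: the parent talks to the store on a private queue;
// the child is the main-queue context the UI works with.
func makeNestedContexts(coordinator: NSPersistentStoreCoordinator)
    -> (parent: NSManagedObjectContext, child: NSManagedObjectContext) {
    let parent = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
    parent.persistentStoreCoordinator = coordinator
    let child = NSManagedObjectContext(concurrencyType: .mainQueueConcurrencyType)
    child.parent = parent            // the child saves to its parent, not to disk
    return (parent, child)
}

// Hypothetical helper: call on the main thread.
func saveAsynchronously(child: NSManagedObjectContext) throws {
    try child.save()                 // pushes changes up exactly one level, in memory
    guard let parent = child.parent else { return }
    parent.perform {                 // returns immediately; the write happens on the parent's queue
        do {
            try parent.save()
        } catch {
            print("Parent save failed: \(error)")
        }
    }
}
```

Requests flow upward here: the child only ever enqueues work on its parent, never the other way around.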
For inheriting changes in the Detail Inspector, you just create a child context for the Detail Inspector. And if you decide to commit the changes within the Inspector, they'll get pushed back into the parent, which is probably going to be something like the main queue context for your view state.
And any fetching you do in the child context for the Inspector is just going to incorporate the current unsaved state in the parent. And you don't even necessarily have to do anything special. If you decide to cancel out of the Inspector, you can just throw away the child context.
So some important things to remember is that saving with nested contexts is only going to push the changes up a single level, but fetching is going to go to the database and pull data through all the levels. Keep in mind, though, that in general, Core Data isn't going to change any objects that you already have out from underneath you, so if you fetch an object that you already have, you will see its previous state. So say it's been dirtied, we're not going to blow away your changes.
We're simply going to keep in the fetch results the reference to that object, and you can call refresh object if you want to get new values for it. Object with ID on a child context will pull from the fewest numbers of levels necessary to get that data, so it might go to the database or it might only go up a single level to the parent. And all parent contexts must adopt one of the two queue types for concurrency, so they can either be a main queue concurrency type or a private queue concurrency type, but we don't support them with the legacy confinement concurrency type.
And child contexts depend pretty heavily on their parents. So the parent context really should not do blocking operations down on their children. And by this, the children are going to call perform block and wait to do a lot of operations for you. So like, for instance, execute fetch request on a child context internally is going to turn around and ask its parent context to do part of the fetch and then pull down those changes into itself.
So what this means is there's sort of naturally a dependency there. And if the parent contexts turn around and call perform block and wait on their children, you'll basically end up deadlocking, right, because you'll have all these different queues trying to synchronously wait on each other. So in general, you should imagine that requests are going to flow up this hierarchy of managed object contexts finally to the database at the root. And results are going to flow back down. And now I'm going to bring Melissa Turner on stage to talk to you about performance.
[Transcript missing]
How do you know when you've got a performance problem? How do you figure out what to do when you've got a performance problem? Lots of questions. The first stage when you're starting to sit down in front of your application and say, "Okay, is this thing ready to release to my customers? Is it performant enough? Are they going to be annoyed with me? Are they going to file bad reports on the App Store? Are they going to give me five stars?" Start asking yourself questions about the application.
What environment does it run in, and have I designed it to be compatible with that environment? What should it be doing, and are the shoulds and the dos compatible? What kind of things do you need to know about the environment? Well, actually very little nowadays. As long as you're using the Apple-supplied frameworks, things like, you know, libdispatch, then we will take care of making sure that you're doing things properly from, say, the confinement standpoint.
But you will need to do things like design for your network environment. If you have an application that goes out and uses the NSIncrementalStore APIs to build a store that talks to a web service, you probably want to make sure that whenever your user triggers an action that will require going out to that web service, it doesn't block the main UI of the application. You'll need to think about stuff like that. That is a performance issue.
You'll need to think about what is sufficient performance versus what is optimal performance. Sufficient is your application gets up and gets the job done. Optimal is it really wows your user and allows you to spend more time doing interesting things in your application because you're not wasting cycles doing things inefficiently. And one crucial, crucial, crucial point to remember is if you're building an application that supports multiple platforms, test on the minimal configuration. This cannot be emphasized enough because if it works really well on your minimal configuration, it's going to blow people away on all other platforms.
What should your application be doing? You should know this. You've written it. You know things like, well, it opens documents. If you open a document, there's very little way to get around it. You need to do file system access and load at least some of the data so you can show it to the user. That's what they're expecting.
If the user instigates network access, it's the same thing. You should know when the user is accessing the network and how they're accessing the network so you don't do things like accidentally, you know, go out and fetch the same piece of data three, four or five times. And you need to know what kind of random processing your user is likely to kick off. Calculate me some transform on an image, scale it, apply a filter, do something interesting like that. These are the things you know your application can do and you should expect to see those in your performance.
And then there's what the application does do, stuff that it does automatically. Do you have a dataset that you need to go out and check periodically to see if there's new data on your web service? That kind of thing happens automatically. You should build that into your calculations. Try not to do it when the user's kicked off that image transform.
Does it post notifications? Try and do that in some unobtrusive way, using our APIs that will make it all happen nice and smoothly. And you know, if for some reason you want to calculate the 2,438th digit of pi, try and do it at 3:00 in the morning on a Friday when they're not likely to be using the application.
How do you figure out what your application does once you know what you think it should be doing? Well, measure it. Measure, measure, measure, measure. This is where everything starts. Figure out where your application is actually spending time so you don't end up spending two weeks optimizing what turns out to be 1% of your application's workload. It's much better to spend two weeks optimizing 50% of your application's workload.
Start with the Time Profiler in Instruments. This will tell you exactly where your application is spending all of its time method by method. There's also the Core Data template in Instruments. This will tell you when Core Data is touching the file system. We have a template that contains instruments for fetching, for saving, for firing relationship faults, and for when we have to go to the database because the data that we're looking for is not in the cache.
And there's also the com.apple.CoreData.SQLDebug default, and if you pass this to your application when you launch it or have it in your defaults, it will cause Core Data to print out all of the SQL that's being sent to the database. You can have a look at that, see what you're sending to the database, look at the SQL that's being generated, figure out if this is really the SQL that should be generated in that case. If you're doing too much work, doing too little work, doing too many trips to the database, this kind of thing, this default will tell you that.
And many of you have probably heard this before, because it's a very common phrase in the real world. If you're building anything with your hands, measure twice, cut once. You cannot uncut a piece of lumber. And that's less important in the virtual world because, well, we have SCM systems. It's always possible to revert to yesterday's build. But the thing is, you can't get back the time you have invested going down that false path. So, you know, make sure you're actually fixing the right thing before you go off and fix it.
So for the rest of this presentation, I'm going to do a series of demos, or I will be having my lovely assistant do a series of demos that are based around a table view. This is primarily because table views are easy to visualize. If I say there's too much data being loaded, you can sort of get a grasp of what that says. If I say there's too little data or the wrong data, it's badly formed, you can get an idea of what that means.
But the lessons are generally applicable to anything that's going to be loading and processing data from a store. And just as a disclaimer, the demos were specifically chosen so that they have performance issues that are visible on stage. Any performance issues you have in your apps will probably be a little bit more subtle, but they'll have the same basic patterns.
In the beginning, there is a table view. And you know, your customers are not going to pay you for this. Because, you know, that's not terribly interesting. You need something. And in my case, I went on vacation. Those of you who are familiar with this picture will probably realize I was in Rome and that this is a picture of the Colosseum, and it's an architecture picture. These are all pieces of information that I want to build into an application that displays my holiday photos.
So my first pass is going to be to take all of those pieces of information that I've got and combine those into an object that I can use to back my table view. Call it a photo object. It's got a label. This was taken in Rome. It's got a blob that is the photo bytes, some tags, architecture and Colosseum, and a timestamp when the photo was taken. And at this point, I'm going to bring Shane up on stage, and he's going to see how well that worked in a first pass.
Hello, my name is Shane Hossel, and I am a QA engineer with the Core Data team.
All right. So here we have the first demo that Melissa mentioned. This is version one of the photos application. And as you can see, this is simply mapped over a simple photo entity. It's a single entity application. And when we click on the record, we get to see the photo.
So this works as promised. Now what we're going to do is we're going to hook this up to Instruments and get some measurements. Now for those of you who haven't used Instruments before, I'd like to show you what you see when you first launch it. What you'll notice here is you get a sheet with all your instrument templates. In our case, we're going to use the iOS simulator.
Off to the left, you have some groups which allow you to target a specific platform, either OS X, iOS, or the simulator. Now, you want to keep in mind when you're using the simulator what Melissa mentioned earlier about your environment. This is actually a simulated application, so while it looks like iOS, it's running on our development hardware. So we don't have the same constraints that we would have if we were using a device such as memory, processor, and disk space.
If you select the Core Data template, you'll get the instruments Melissa mentioned earlier: Core Data fetches for our fetch activity, the Core Data cache misses, which gives us the file system access during Core Data faulting, and the Core Data saves. And of course, we want to add the Time Profiler here as well.
Now, normally what you do here is hook up your target application, but I already have that set up, so let me go right to that. So let's run this in Instruments and see if we can get any interesting measurements. And as you can see here on the left when we launch, you'll start to see live measurements occurring. It's taking a little bit here.
All right, we had to wait for a few seconds, but we still have our photos as promised. And I'll do some scrolling real quick, because I always like to make sure everything scrolls nice and smooth. Good. So let's look at our Instruments measurements real quick and see what we have here. We notice that there is actually some delay when we launch this application. And that's evident in the time profiler as we can see the duration of activity that occurred on launch.
Coincidentally, our Core Data fetches shows us that there's some duration going on and some fetch activity that's taking place. If we click on that, we can see that the fetch count is 184. We're actually loading all these objects in at launch. This isn't good. If we had thousands and thousands of photos, our users would have to wait for this to load.
So I'm going to file this as a bug and send it over to Melissa and see if she can do something to fix this.
So when you want to optimize your application's performance, the first place you're probably going to end up is at NSFetchRequest. This is really ground zero for your optimization because this is where you tell your application what data it should be loading.
There's all kinds of flags, little things you can say to specify the data. A few of the interesting ones are batching. What's batching? Well, you specify that you want to return batches of objects. This is primarily good for underlying something like a table view where you can scroll. And Core Data will return basically information about all of the objects that need to be displayed when the user wants them. But it won't actually fetch them in because, well, the user may never get there. So once the user scrolls along in their scroll view, Core Data will go out and fetch the data the user is interested in seeing. This can be a lot more efficient than doing an upfront fetch because, well, you don't fetch 20,000 photos, especially if the user only wants to see the first page. If you're not using a table view but you're using a more page-based application, something like, I don't know, Google search results, you probably want to think about using fetch limits and offsets. This allows you to page data in and do batching yourself. It's a little bit more time efficient because you'll only load a specific number of objects. It's a little bit more complicated. This is the dance of performance.
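Both options can be sketched in Swift as follows. The entity name `Photo` and the `timestamp` key follow the photo object described earlier in the session but are assumptions about the actual model; the helper names are hypothetical.

```swift
import CoreData

// Hypothetical helper for a scrolling list: rows are pulled from the store
// `batchSize` at a time as they are actually accessed.
func batchedPhotoRequest(batchSize: Int) -> NSFetchRequest<NSManagedObject> {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Photo")
    request.sortDescriptors = [NSSortDescriptor(key: "timestamp", ascending: true)]
    request.fetchBatchSize = batchSize
    return request
}

// Hypothetical helper for a page-based UI: only one page of rows is loaded,
// and the caller does the paging itself.
func pagedPhotoRequest(page: Int, pageSize: Int) -> NSFetchRequest<NSManagedObject> {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Photo")
    request.sortDescriptors = [NSSortDescriptor(key: "timestamp", ascending: true)]
    request.fetchLimit = pageSize
    request.fetchOffset = page * pageSize   // pages are numbered from zero
    return request
}
```

A stable sort order matters in both cases; without one, successive pages or batches are not guaranteed to line up.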
And you're always going to run across this in every application that you write that you try to do performance optimization. Every application is going to be different. Every application has a different data set. Every application has a different UI. And you're going to need to make different tradeoffs between memory use, CPU use, code complexity. There's no right answer here.
But here's some of the things you can do. Some of your options. You can use predicates to filter down the number of objects that are ever going to be returned. Don't need to return a thousand photos if your user only wants to see the ones that were taken in Rome on February 2nd.
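As an illustration of filtering at the store, here is a hedged Swift sketch. The `Photo` entity with `label` and `timestamp` attributes is assumed from the photo object described earlier, and `photosRequest` is a hypothetical helper.

```swift
import CoreData
import Foundation

// Hypothetical helper: only photos with the given label taken in [start, end)
// are ever returned, so the filtering happens before objects are created.
func photosRequest(label: String, from start: Date, to end: Date) -> NSFetchRequest<NSManagedObject> {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Photo")
    request.predicate = NSPredicate(
        format: "label == %@ AND timestamp >= %@ AND timestamp < %@",
        label, start as NSDate, end as NSDate)
    return request
}
```

Passing the start and end of a single day gives the "taken in Rome on February 2nd" style of query.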
We support aggregate operators. How many photos are there in the library? What's the earliest photo taken? What was the latest photo taken? Or, you know, jumping from photos to something else, what's the average salary of all employees working for a given manager? We also support grouping, which is SQL's GROUP BY operator, if you're familiar with how that works.
You can go find documentation of how to use that in NSFetchRequest.h. There's other things as well. You can specify return types, dictionary fetching if you want to, that allow you to very precisely target the information that's being brought into your application so that you load exactly the information that your user wants to see.
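As a rough sketch of the aggregate and dictionary-fetching options mentioned above, the request below asks the store for the earliest and latest photo timestamps for one city, without loading any managed objects. The names `Photo`, `timestamp`, and `city` are hypothetical.

```swift
import CoreData

// Hypothetical names: "Photo" entity, "timestamp" date attribute, "city" string attribute.
func makeTimestampRangeRequest(city: String) -> NSFetchRequest<NSDictionary> {
    let request = NSFetchRequest<NSDictionary>(entityName: "Photo")
    // Filter first, so the aggregates only consider matching rows.
    request.predicate = NSPredicate(format: "city == %@", city)
    // Ask for dictionaries rather than managed objects.
    request.resultType = .dictionaryResultType

    let timestamp = NSExpression(forKeyPath: "timestamp")

    let earliest = NSExpressionDescription()
    earliest.name = "earliest"
    earliest.expression = NSExpression(forFunction: "min:", arguments: [timestamp])
    earliest.expressionResultType = .dateAttributeType

    let latest = NSExpressionDescription()
    latest.name = "latest"
    latest.expression = NSExpression(forFunction: "max:", arguments: [timestamp])
    latest.expressionResultType = .dateAttributeType

    request.propertiesToFetch = [earliest, latest]
    return request
}
```

Fetching this request against a SQLite-backed context should yield a single dictionary with the keys `earliest` and `latest`. Grouping works along the same lines through the request's `propertiesToGroupBy` property.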
But you know, fetch requests will only get you so far. From there, you really need to go look at your schema and figure out what it is. Your application concept is going to drive your UI. I want to build a photos application. Well, that tells me certain things. I've got photos at some level. I've got these large binary objects that I'm going to have to put on screen.
I've got information about those photos. I want probably a list or some way of selecting them. In our case, we have a table view that allows users to scroll through a large collection of photos and figure out which one they're actually interested in. So the application concept drives the UI. The UI, in turn, drives the schema. There's no point in having the labels for my photo if I never actually display them to the user. There's no point in having tags if I never display them to the user.
But there is no one true schema. I mean, you'll go out and research and read a little bit about databases, about building correct databases, and you'll read things like first normal form, second normal form, fifth normal form. And this tends to give the impression that there is a perfect schema. And really there isn't. The perfect schema is the perfect schema for your app.
Combine your entities and relationships in ways that make logical sense. Relate entities that sit on logical transitions as your storyboard flows through your application. And, you know, don't try to follow ten joins to get from page one to page two. Just make that one join. That makes more sense. That's faster. And faster is important.
Normalization. I talked about normal forms. This is where normalization comes in. This is all about reducing data duplication. There's a lot of reasons for doing this, all of which help with your performance. This reduces the possibility of data skew. If you store the same piece of information in 10 different locations, then if you decide you want to change that piece of information, well, you have to go change it in 10 different places. It's easy to forget one.
You'll minimize the amount of storage space. If something is stored in ten different places, well, it's stored in ten different places. It's going to take ten times the amount of space storing it in one place would. Minimize memory usage. That's the flip side of storage space. Well, we all want to make our applications have a smaller memory footprint, especially on an iOS device.
And it'll make your searching faster because as you search in something like SQLite, well, all of the data for a given record is contiguous in memory. As you go through looking at each individual record, database -- well, somebody has to load that and look at it. The more stuff you have in that record, the more memory we have to load and the slower that is. Over time, that can build up if you're managing a large data set.
So how does normalization work? Well, you start with something like this. This is what we have in the database for that schema we saw earlier. And the first thing we see is we've got this column here, and it's got two signs of trouble. One, there are multiple values in that field, which is never a good sign. Two, you see the same values being repeated in multiple rows.
Well, that's also a bad sign. If I move to Europe, move to England, and decide that in order to communicate with my new friends, I want to change all of the vacation tags to holiday tags, I'm going to have to find every single vacation tag and change it. It's much better to move that off into its own single table and unique them. Make sure that family only appears once, vacation is there once, work is there once, and so on.
And at this point you're probably expecting me to talk about tables and keys and foreign keys and primary keys and blah blah blah blah blah. Core Data does all of that for you. You create the relationships, set them on the objects, and Core Data deals with all of your key management. That's the last I'll say about keys.
Okay, so now we've eliminated some of the data in our table. We see more duplicate data there. But is that really duplicate data? Not really, because when you think about it, the timestamp is an important piece of information about a given photo, and changing a timestamp on one photo does not mean you want to change it on every other photo. So we're going to leave those ones alone.
Data contiguity. Photos are big, arbitrarily big, arbitrarily small, we have no idea. If you know you're going to be searching on the photo entity, it's not hugely efficient to store the photo data in the same table as the meta information backing the table view. So we're going to split that off onto its own separate table as well.
At this point, I'm going to talk a little bit about external data references, which is about going one step further. In normalization, we move the data object off of the source table and onto a destination table. In Lion and iOS 5, we went one step further and allowed you to move it out of the database file entirely by using what's called an external data reference.
Like normalized objects, these are best stored on dedicated objects. And because we can't read your mind and tell when you've created an object that you're done with it, in order to free up the underlying memory, you're going to need to refresh the object, tell Core Data you're done with it, so we can release the memory that's being used.
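A small sketch of both halves of that advice, under assumed names: a dedicated entity (here called `PhotoData`, with a `bytes` attribute) whose binary attribute allows external storage, and a helper that turns an object back into a fault once the caller is done with it.

```swift
import CoreData

// Hypothetical names: a dedicated "PhotoData" entity with a "bytes" attribute.
func makePhotoDataEntity() -> NSEntityDescription {
    let bytes = NSAttributeDescription()
    bytes.name = "bytes"
    bytes.attributeType = .binaryDataAttributeType
    bytes.isOptional = true
    // Lets Core Data keep large values in files outside the database.
    bytes.allowsExternalBinaryDataStorage = true

    let entity = NSEntityDescription()
    entity.name = "PhotoData"
    entity.properties = [bytes]
    return entity
}

/// Tells Core Data we are done with the object so it can release the
/// loaded data. With `mergeChanges: false`, unsaved changes on the
/// object are discarded, so save first if they matter.
func releaseLoadedData(of object: NSManagedObject) {
    object.managedObjectContext?.refresh(object, mergeChanges: false)
}
```

The same flag is available as a checkbox on the attribute in the Xcode model editor; building the entity in code here only keeps the sketch self-contained.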
So we had this schema originally, and we decided that really didn't work. There was a huge performance hit when we loaded, and sadness happened. So we redesigned it, and we're going to try this. Shane, how'd that work?
Hello again. I have the fresh version of our Photos application here.
I already got it loaded up in Instruments, so let's take some measurements. Pay particular attention to the Core Data fetches when we launch to see if we actually got a performance gain and fixed this bottleneck. Oh, that was fast. That's actually quite nice. Let's make sure the application still works when we click on a photo. Indeed, it does. You might notice over here on the Core Data cache misses, when I click on a photo, we get some activity going on there.
Right there. I have zooming enabled now. I'm going to do some scrolling really quick to make sure this scrolls as we expect it to. You might notice some Core Data fetches going on while I'm scrolling as well, right through here. All right, let's look at this instrumentation and see what the measurements are telling us.
Of course, we have the fetch count here, which is showing us it's still 184, but the duration's a lot better. So we're still loading all the objects in, but we're not loading the photos themselves, because the photos are now a relationship.
So we're not loading those in, we're delaying the load of those when we click, and we can actually fire a cache miss, and we can see that activity here. Sorry, let me scroll in here. Let me zoom in. You might have noticed also, when I was scrolling, we had some other fetch activity going on.
This is us loading that in as needed, on demand, as Core Data fulfills its promise. So with that, now that we've fixed that performance bottleneck, it's time to add some features. Of course, as soon as the application gets fast, we can add some new features. So I'm going to put in a feature request for probably a search bar, and we'll see what Melissa can do for us on that. Melissa?
Okay. Thank you.
So when we start talking about searching, this is a good time to start talking about denormalization. Because one of the main things you're going to denormalize is information about the data in your database. And why do you want to do this? Well, there are two types of denormalization. One is attempting to minimize the number of joins you have to do by putting meta information about relationships onto the source table. Things like whether or not there is an object on the other end of a relationship, how many objects are at the other end of a relationship, or aggregate information like, you know, what's the average salary for people who are directly reporting to a given manager.
You also want to minimize repeated transforms. And what does that mean? Well, remember I said earlier that you want to eliminate duplicate data where you can, and that skew is bad, and having the same or similar bits in multiple places can make it more painful to update. There's a caveat to this, and that's if there is some CPU-heavy process that has to be done repeatedly. Say, for example, you're searching: you don't want to have to do that work every time your user searches. You want to instead put your data into a canonical form that's more easily searchable.
What do I mean by that? Well, I want to build my application. Shane has filed a bug and said, well, you know, searching. I have like 184 pictures. It's hard to find the one I'm interested in. I want to be able to search by tag. Okay. That's nice. I want to know how many tags are on a given photo first.
I'm going to do that. Also, because Shane thought searching was neat and I thought, you know, we're graphical creatures, we like searching for things visually, I want to add a thumbnail to the master view so you can get a sense of which picture it is you're looking at as you scroll through the table view. And that's a transform.
I don't want to have to calculate the thumbnail images from the main image data every single time the user scrolls. It's going to be something they do a lot. They're going to be scrolling through this scroll view regularly. I don't want to have to calculate that thumbnail every single time they scroll.
So I'm going to precalculate that and put that in the database. And here's where that there-is-no-one-true-schema thing comes in. I can put this in a couple of different places. I can put this on the data object because, well, I mean, it's a piece of data. It's photo information. But, you know, to minimize relationship fault firing, I can also put it on the photo object, which is what I'm going to do.
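One way the two denormalized values discussed here, the tag count and the precomputed thumbnail, might be maintained on the photo object itself. This is only a sketch: the key names `tags`, `tagCount`, and `thumbnail` are assumptions, and producing the thumbnail bytes is left to the caller.

```swift
import CoreData

// Hypothetical keys on a photo object:
//   "tags"      to-many relationship to tag objects
//   "tagCount"  integer attribute caching the size of "tags"
//   "thumbnail" binary attribute holding a precomputed thumbnail
func updateDenormalizedFields(on photo: NSManagedObject, thumbnail: Data?) {
    // Cache the relationship count so list views and searches can read it
    // without firing the relationship fault.
    let tags = photo.value(forKey: "tags") as? NSSet
    photo.setValue(tags?.count ?? 0, forKey: "tagCount")

    // Store the thumbnail computed once by the caller, instead of
    // recomputing it from the full image on every scroll.
    if let thumbnail = thumbnail {
        photo.setValue(thumbnail, forKey: "thumbnail")
    }
}
```

The cost of this design is the skew risk described earlier: the helper has to run whenever the tags or the image change, or the cached values go stale.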
And now that search optimization I talked about. There are a few things you need to think about and consider when you're starting to build search into your application. The first is, well, what is searching for your user? There are two ways to actually implement it. The first way is to put everything in the view and let the user filter out stuff they're not interested in. The second way is to put nothing in the view and let the user filter in stuff they're interested in.
Why are these different? Why do you care? Well, we saw what kind of a performance hit we had in the first version of our application when we loaded 184 objects. It can be expensive to load everything and put it into the search view simply so your user can filter things out. So, you know, if you know your user's going to be working with really large data sets, help them filter in.
There's also the question of whether you want to do search-as-you-type, running a new search with every keystroke the user gives you and pulling up incremental data sets, or whether you want to let them enter an entire search term and then return those results. At that point, it comes down to your environment. If you're using a SQLite database on the local disk, search-as-you-type is going to be relatively fast.
If you have an NSIncrementalStore talking to a web service, and your user's running on an EDGE phone, well, you've got one of two choices. Either do search on completion, or make sure that if you're doing search-as-you-type, that search does not block the user interface, because they're going to want to continue typing while you're waiting for the first batch of search results to come back.
I mentioned canonicalizing strings. Well, what does that mean? This is where I got off track earlier. It means putting your strings into a very simple pre-processed form. In the same way that I don't want to have to calculate the thumbnail in order to display it every single time the user scrolls, I don't want to have to do all of the case and diacritic stripping that's necessary to get a search that is intuitive to the user. They want to type in family. They don't care what case the tag was in originally. They just want to type family. They want to type vacation.
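A minimal sketch of that pre-processing step using Foundation's string folding. The function name is made up for illustration; where the result is stored (for example in a separate normalized attribute next to the original tag) is up to the model.

```swift
import Foundation

/// Strips case and diacritics once, at write time, so that searches can
/// compare against the stored result with a plain comparison.
func canonicalSearchString(_ text: String) -> String {
    return text.folding(options: [.caseInsensitive, .diacriticInsensitive],
                        locale: nil)
}
```

The same function should be applied to whatever the user types into the search field, so that both sides of the comparison are in the same canonical form.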
Along these lines, normalize, because case and diacritic insensitivity is important. But it also lets you use BEGINSWITH instead of CONTAINS. Most importantly, it allows you to avoid wildcards in a lot of cases. Regex is not your user's friend. For example, if you want to do type-as-you-go searching in a standard search field, you might think as your first pass, oh, I want every label that is, case and diacritic insensitively, LIKE something starting with red. No, really. You actually want a normalized search string. You want to search with BEGINSWITH on the normalized label and normalized red. This is much, much faster. We can turn this into something in SQLite that will be at least an order of magnitude faster than the original search.
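The two predicates being contrasted might look roughly like this. The attribute names `label` and `normalizedLabel` are assumptions; the second is taken to hold a pre-folded copy of the first.

```swift
import Foundation

// Hypothetical attributes: "label" (as entered) and "normalizedLabel"
// (a copy with case and diacritics already stripped).

/// First-pass version: a case- and diacritic-insensitive wildcard match
/// against the raw label.
func naiveLabelPredicate(prefix: String) -> NSPredicate {
    return NSPredicate(format: "label LIKE[cd] %@", prefix + "*")
}

/// Faster version: fold the user's input once, then do a plain prefix
/// comparison against the pre-normalized attribute.
func normalizedLabelPredicate(prefix: String) -> NSPredicate {
    let folded = prefix.folding(options: [.caseInsensitive, .diacriticInsensitive],
                                locale: nil)
    return NSPredicate(format: "normalizedLabel BEGINSWITH %@", folded)
}
```

Both return the same matches for well-formed data; the difference is how much work the store has to do per row.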
Well, you say, well, I actually want matching at the beginning and end. Find me things that are substrings. We can do substrings too. That is again faster than MATCHES. Okay, okay, you really, really want MATCHES. I give. Doesn't mean you can't still normalize it and fire up the regex engine to look at the pre-normalized strings.
This is still going to be faster because I'm not going to have to do additional passes across that string to strip case and diacritic information. Try and do those last, though, please. Why? Because, well, SQLite doesn't have a query optimizer. This is part of knowing your environment.
There isn't a query optimizer, which means you're going to have to manually optimize your predicates. Think about building your predicates such that they will eliminate the largest group first. You also have to trade off group size against comparison speed. And as I said, put the MATCHES operator last. For example, if you're looking for a search string containing something plus a timestamp comparison, well, your computer is really, really good at doing math. Put the timestamps first. That'll minimize the group that you actually need to do the substring work on.
If you've got tag search strings, well, that's really better done as an uncorrelated fetch. And we've got syntax that allows you to specify, do an uncorrelated fetch on a separate table, and then do the relationship join. There's lots of really neat stuff in the predicates. Go have a look at the BNF. It's really fascinating.
How does it work? Well, if I want to search for all photos taken in Rome in February, well, I've got two ways I can start looking at that. I can look for photos that were in Rome or photos that were in February. As I said, computers are really good at math, so February is the first column I'm going to look at. And that's going to dump two rows right off the bat. And now it's much faster when I go through and do my string operations on the labels.
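The ordering advice can be sketched as a compound predicate with the cheap timestamp comparison listed before the string comparison. The attribute names `timestamp` and `normalizedLabel` are hypothetical.

```swift
import Foundation

// Hypothetical attributes: "timestamp" (date) and "normalizedLabel"
// (pre-folded string).
func makeSearchPredicate(from start: Date, to end: Date,
                         labelPrefix: String) -> NSPredicate {
    // Listed first so the cheap numeric comparison can narrow the set.
    let inRange = NSPredicate(format: "timestamp >= %@ AND timestamp < %@",
                              start as NSDate, end as NSDate)
    // Listed last so the more expensive string comparison only has to
    // look at the rows that survived the date check.
    let labelMatches = NSPredicate(format: "normalizedLabel BEGINSWITH %@",
                                   labelPrefix)
    return NSCompoundPredicate(andPredicateWithSubpredicates: [inRange, labelMatches])
}
```

For the Rome-in-February search, `start` and `end` would bracket February and `labelPrefix` would be the normalized form of the place name.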
So after all of that, we had a model, we did some search optimization, normalized our data in the database, and now our model looks a lot more like this. And I'm going to bring Shane up once more to have a final look and see what we've managed to make.
Instead of using Instruments, we're going to use the Core Data SQL debug default (com.apple.CoreData.SQLDebug) that Melissa mentioned earlier. And I'm simply going to set that up as an argument passed on launch to our demo application. And here's version three of our demo, hot off the presses.
We've got some nice thumbnails. And of course, I get to see the photo when I click on it. That's important. Functionality still works. And you might notice here, we're already getting some information logged to our console. In particular, we have a SQL statement. So this is really nice because this gives us a window into what's going on.
There's some other annotations that occur here as well. The connection fetch time as well as the total fetch execution time. So I'm going to use the total fetch execution time to look at the difference between some of our search predicates. We're running right now with a Matches search predicate, and I'm going to do a search for Dog real quick. We already have this set up, so we have a nice cute dog we can show you.
And I'm searching for the first character of this that I started the search on. Let me zoom in for you guys here. I have zooming enabled now. And in particular, we can see we do have our select statement, and we are doing a MATCHES on this. And in particular, I'm going to look at the total fetch execution time, which is 0.0385 seconds. So as Melissa mentioned earlier, we can probably use a faster search predicate. So we're going to do the BEGINSWITH. Oh, and this is no longer case insensitive.
Run this and we'll do the same search for dog. Let me zoom in for you again. And the moment of truth here. Proof is in the pudding, if I can get that to scroll correctly. Our total fetch execution time is now 0.0023 seconds. So I'm sure our users are going to appreciate this. Looks good. Thank you, Melissa.
I'll bring Melissa back up on stage now to talk about the rest of this.
So, you know, knocking an order of magnitude off a search doesn't look that impressive when you've got 184 objects in the database and the search is returning in, you know,
[Transcript missing]
And, you know, on the minimal configuration you expect your users to be using, to make sure that the performance there is acceptable. Use some of the tricks we showed you to squeeze every last second out of the app, so you can add new and interesting features that make your app a lot more interesting. That's the difference between a three-star app and a five-star app.
And I've talked a lot about specifically targeting the information you're interested in, only loading the information you're interested in. Equally important is getting rid of the stuff you're no longer interested in. Use autorelease pools judiciously; that will purge batches of objects. Use NSManagedObjectContext's refreshObject:mergeChanges: to turn objects back into faults when you're done with them.
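A sketch of how those two pieces of advice might combine when walking a large set of objects. The function name and the batch size are illustrative assumptions; `refresh(_:mergeChanges:)` is the Swift spelling of refreshObject:mergeChanges:.

```swift
import CoreData

/// Visits objects in batches. Each batch runs inside its own autorelease
/// pool, and each object is turned back into a fault after use.
func processInBatches(_ ids: [NSManagedObjectID],
                      in context: NSManagedObjectContext,
                      batchSize: Int = 100,
                      work: (NSManagedObject) -> Void) {
    precondition(batchSize > 0, "batchSize must be positive")
    var start = 0
    while start < ids.count {
        let end = min(start + batchSize, ids.count)
        autoreleasepool {
            for id in ids[start..<end] {
                let object = context.object(with: id)
                work(object)
                // Discards unsaved changes on this object, so save first
                // if the work closure modifies anything.
                context.refresh(object, mergeChanges: false)
            }
        }
        start = end
    }
}
```

This assumes it is called on the context's own queue, for example from inside a perform block.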
Turning objects back into faults allows Core Data to free the underlying attribute and relationship information. If you're done with a whole bunch of objects, say you had a detail view inspector, the kind of thing Ben talked about earlier, up in your application and you're now done with it, call reset on the context, and that will free all of the resources on all of the objects that were in that context.
So this is what we went over today. Ben talked about concurrency and nested contexts. I talked about performance, schema design, and search optimization. And now I want to talk about something we would love you to do for us.
Bugs. We don't know about bugs unless you tell us. I mean, we are not psychic. We don't see your user crash reports. We don't see the e-mails people send to you. If you don't tell us there's a problem with our code, we can't fix it. So please file bugs.
Bugs will be fixed faster if you give us steps to reproduce. Or a sample project. Best case is I tried to do -- I started my application, I did this, this, and this. I expected it to do this. It actually did this. That gives us a very clear case of how you're seeing the problem, what problem you're seeing, what you're expecting, and this gives us all the information we need to try and figure out whether that's actually a bug or we need to improve our documentation. Bugreport.apple.com. Use it. Frequently.
Often. I suppose that's the same meaning. Also use it for feature requests. If there's something that you think Core Data should be doing that it doesn't do, bugreport.apple.com. Enhancement requests. If you run across a performance issue, we love when you submit performance issues, especially if you give us Instruments traces. Hi, I did this in my application, and Core Data was really slow and it looked like this. Can you make it faster? Also documentation requests. If there's something that's missing, something you think is confusing, bugreport.apple.com. Makes the crocodile smile.
If you need information, there's always Mike Jurewitz. He's our technology evangelist. There's also Cocoa.apple.com-- [email protected]. There's lots of documentation at developer.apple.com. And there's always the Apple forums. We hang out there when we're not being crushed by workloads. We've got a Core Data session this afternoon, Using iCloud with Core Data.