Application Technologies • 1:03:32
Learn how to use Core Data to manage large amounts of data efficiently while maintaining peak application performance.
Speakers: Melissa Turner, Ben Trumbull
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Hi, welcome to session 122, Optimizing Core Data Applications, or more specifically, optimizing your Core Data applications, because really those are the only ones you care about. This session is basically a checklist of things that you should look at when you're optimizing your application. The first of which is: it's awfully hard to optimize a Core Data application if you don't have a Core Data application.
Last year when we introduced Core Data, we had a fixed set of file formats supported: the XML, binary, in-memory, and SQLite stores. And a lot of you made it very, very clear that you wanted to use Core Data with your own file formats, that you had legacy file formats that you'd been shipping, that your customers were using, that were in wide distribution, and gosh darn it, you wanted Core Data to work with those. You asked, we obeyed: we now have an Atomic Store API that you can use to put your own file formats under Core Data.
Basically: your data, in our world. Much like the XML and binary stores, and much like the name would hint, files are read and written atomically. This means that all of your application data comes into Core Data when the store is added, and all of it is written when the store is saved. There are two classes involved, NSAtomicStore and NSAtomicStoreCacheNode. And this is sort of how they work.
You can see that we've got a persistent store coordinator and we've got a file. As many of you know, the persistent store coordinator is the center of the Core Data world; it's really your first entry point for stores. The user decides they want to open a file containing your legacy data.
They tell the persistent store coordinator to add a store. The persistent store coordinator looks in its dictionary of store types to find out what class should be instantiated to handle that store data, and creates an instance. It then sends a load message, which is the atomic store's cue to go off, look at the store file, find each managed object's external representation in that file, and create a cache node. It does this for all of the data in the file and then calls addCacheNodes:, which causes those cache nodes to be registered in the atomic store.
load: returns, addPersistentStoreWithType: on the coordinator returns, and you've got a fully configured Core Data stack. Just like with any of the other store types, users can now query, edit, save, delete, update, do whatever they'd like with the user data in the managed object context, living on top of your store format.
Once the user's done with whatever it is they were planning on doing, adding new objects, deleting objects, updating data, they're going to hit a save button, and you're going to call save on the context. That's going to get sent down to the persistent store coordinator.
And then interesting things start to happen. First off, for every newly inserted managed object in the context, newReferenceObjectForManagedObject: is going to be called. This is your cue to create some form of persistent object identifier that can be used to uniquely identify that managed object within its entity's inheritance hierarchy.
Once we've done this, we switch the reference data you've created into the managed object IDs, thus assigning persistent object IDs to all of those objects that were just created. Once we've done that, we call newCacheNodeForManagedObject: for all of these newly inserted objects. This is your cue to go off and create a cache node, much like the ones you created in the load phase earlier: an external representation of your managed object.
Once we've handled all of the newly inserted objects, we start calling updateCacheNode:fromManagedObject: with all of the managed objects that have been edited in the context. This method should go through, take all of the attribute values out of the edited managed object, and push those into the cache node.
Once we've done that for all of the updated objects, we send a willRemoveCacheNodes: message to the atomic store, telling it which cache nodes belong to managed objects that have been marked for deletion. Finally, we call save:, which is your cue to find whatever set of cache nodes exists in the atomic store and write those out to file.
That's pretty much it. That's all there is to it. We provide, as I said, two classes to help you do this. NSAtomicStore, which is an abstract class, provides all of the mechanisms necessary for Core Data to interact with your external data. It does, however, require that you provide the translation layer, since we have no clue what it is you want your external data to look like. And it also requires that you handle the metadata: that you somehow take care of saving the store's unique identifier and any of the versioning information.
In order to implement an atomic store, you need to override five methods. Don't worry, if you miss one, we'll throw an exception and let you know that something really important somewhere got missed. Those five methods are load: (we already explained that), newReferenceObjectForManagedObject:, newCacheNodeForManagedObject:, updateCacheNode:fromManagedObject:, and save:. Those are the only five methods you really need to concern yourself with.
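A minimal sketch of what such a subclass might look like; the class name, the record-parsing helpers in the comments, and the identifier scheme are hypothetical placeholders, not code from the session:

```objc
#import <CoreData/CoreData.h>

@interface MyLegacyStore : NSAtomicStore
@end

@implementation MyLegacyStore

- (BOOL)load:(NSError **)error {
    // Parse the file at [self URL], build one cache node per record,
    // then register them all at once with addCacheNodes:.
    NSMutableSet *nodes = [NSMutableSet set];
    // ... for each record in the legacy file (parsing left out here):
    //     NSManagedObjectID *oid = [self objectIDForEntity:entity
    //                                      referenceObject:recordIdentifier];
    //     NSAtomicStoreCacheNode *node =
    //         [[[NSAtomicStoreCacheNode alloc] initWithObjectID:oid] autorelease];
    //     [node setPropertyCache:recordValues];
    //     [nodes addObject:node];
    [self addCacheNodes:nodes];
    return YES;
}

- (id)newReferenceObjectForManagedObject:(NSManagedObject *)managedObject {
    // Return a unique, persistent identifier for this object within its
    // entity's inheritance hierarchy; a GUID is one simple choice.
    return [[[NSProcessInfo processInfo] globallyUniqueString] retain];
}

- (NSAtomicStoreCacheNode *)newCacheNodeForManagedObject:(NSManagedObject *)managedObject {
    NSAtomicStoreCacheNode *node =
        [[NSAtomicStoreCacheNode alloc] initWithObjectID:[managedObject objectID]];
    [self updateCacheNode:node fromManagedObject:managedObject];
    return node;
}

- (void)updateCacheNode:(NSAtomicStoreCacheNode *)node
      fromManagedObject:(NSManagedObject *)managedObject {
    // Push the managed object's current attribute values into the node.
    // (A real store would handle relationships here as well.)
    NSArray *keys = [[[managedObject entity] attributesByName] allKeys];
    NSDictionary *values = [managedObject dictionaryWithValuesForKeys:keys];
    [node setPropertyCache:[[values mutableCopy] autorelease]];
}

- (BOOL)save:(NSError **)error {
    // Write every node in [self cacheNodes] back out to the file atomically.
    return YES;
}

@end
```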
I've mentioned cache nodes. What are they? They're an intermediary representation that lives between whatever's in the file format and managed objects. Why do we have them? Why don't we just use managed objects? Because managed objects have to exist inside a managed object context. They can only be used by that managed object context; they're tracked by that managed object context.
A persistence stack can have multiple managed object contexts, so there's a question of where we would put those. So instead of trying to invert our stack and put things under that should be over, and all that kind of fun stuff, we have this intermediate representation: one cache node per managed object that is going to be stored in the external store. The only requirement on a cache node is that it be KVC compliant. It needs to return wrapped values for the attributes, and it needs to return collections of related cache nodes for relationships.
This is also true of to-one relationships: if you have a to-one relationship, you need to return a collection containing one element. We provide a concrete class that allows you to wrap a dictionary. You alloc and init it, then hand it a dictionary, at which point everything just works. You're free to override it if you'd like; all you need to do is make sure it remains KVC compliant.
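For instance, building one of those concrete cache nodes by hand might look roughly like this; the entity, the reference value, and the related cuisineNode are hypothetical stand-ins from the recipes example used later in the session:

```objc
// Ask the store for an object ID tied to our persistent reference value.
NSManagedObjectID *oid = [self objectIDForEntity:recipeEntity
                                 referenceObject:@"recipe-42"];
NSAtomicStoreCacheNode *node =
    [[[NSAtomicStoreCacheNode alloc] initWithObjectID:oid] autorelease];

NSMutableDictionary *values = [NSMutableDictionary dictionary];
// Attributes are stored as wrapped (object) values.
[values setObject:@"Spaghetti Carbonara" forKey:@"name"];
// Relationships are stored as collections of related cache nodes,
// even for a to-one: a collection containing one element.
[values setObject:[NSSet setWithObject:cuisineNode] forKey:@"cuisine"];
[node setPropertyCache:values];
```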
So just to prove that we're not just talking, that we can actually do this, I have a demo. This is actually, I believe, associated with this session on the ADC website, so you can go off and find the code for this. It's a simple HTML store. I'm just going to build it and run it.
I've got a little application. It lets me enter user information. I'll pick on one of my coworkers, whose goal is pretty much to take over the world with bindings. He also wrote this demo application, so I figure he deserves some credit here. Add another user, if I can remember not to hit the middle button on the mouse: another one of my coworkers, whose goal is to make sure everybody in the world unit tests everything.
And a lot of coworkers with grandiose plans. So I've saved it. Now if I go off and look in Library/Application Support, we have something called HTML Store, and it's storing something that looks awfully like an HTML file. And voila, here's the information I entered in my application. But wait, you say, that might just be dummy data. Wouldn't be the first time in a demo. So we've got this image well over here, so I'm going to go off and I've got some pictures. That's where I want to be.
That's actually a pretty good picture for all of my coworkers, since foosball is really big on our floor. Save it. Reload this file. And voila, we've added an image. It's a pretty simple store format. Every entity has its own table. Attributes are stored in individual columns. If you've got relationships, they're stored as hyperlinks in a final column. I could add more and more stuff to show it off, but you get the basic point here.
We have an HTML store. We've shipped the source code for it. Like I said, you can go off and get it from the ADC website. And really, this is the meat of it: one class, HTMLStore. It has an init method, it has a dealloc method, and it returns its identifier, because we wanted to do fancy things there.
It goes off and calls helper functions for load:, though most of those are concerned with parsing HTML. Then there's save:, newCacheNodeForManagedObject:, updateCacheNode:fromManagedObject:, and newReferenceObjectForManagedObject:. And we override one more method here; we don't really need to, but it's convenient. All of that in about 200 lines of code. Your very own store.
So you can do this too. You can go off and write your own store. Some things to think about: because these are atomic stores, everything's going to get read up front. You might want to consider deferring some of the work of that loading if you know that you've got a lot of objects and you only ever need to deal with a tiny subset. It's expensive to parse stuff. You're pulling bytes in, and that's expensive enough; you might not want to add the additional load of all the parsing. So you can defer parsing and wrapping values until you actually try and access those objects.
But say you really want to be living in Core Data; you want to be using a SQL store. It still might be interesting for you to write an atomic store, simply as a way to migrate your legacy data into Core Data. It's probably going to be easier for you to write an atomic store that can interpret your file format, use that to load your data, and then call migratePersistentStore: on the persistent store coordinator and have it translate all of the data into one of the more standard formats, like, say, the SQLite format.
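As a sketch, that import path might look roughly like this, assuming a coordinator set up elsewhere, a custom store type string of your own, and your own file URLs:

```objc
NSError *error = nil;

// Add the legacy file through your custom atomic store class.
NSPersistentStore *legacy =
    [coordinator addPersistentStoreWithType:@"MyLegacyStoreType"
                              configuration:nil
                                        URL:legacyFileURL
                                    options:nil
                                      error:&error];

// Translate everything into a SQLite store in one call.
NSPersistentStore *sqlite =
    [coordinator migratePersistentStore:legacy
                                  toURL:sqliteFileURL
                                options:nil
                               withType:NSSQLiteStoreType
                                  error:&error];
```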
It can also give you interoperability with other standards. You can load data from a Core Data file and save it through a legacy-format atomic store, which will give you some kind of format that some other application you use might need. But like all atomic stores, they're a little bit problematic with large data sets, simply because you have to load everything up front and save everything all at once. There's a cycles performance impact there, and there's also a memory space issue because, well, if you've got a significantly large data set, you're going to be using a lot of memory. So now you're in Core Data, and we can start doing interesting things.
The first thing to do if you want to make your Core Data application fast is pick the proper store. We've got a number of store types, each of which has its own characteristics, depending on what you want to do. We have an in-memory store, which allows you to create objects that are only useful for the duration of that application run. But boy, is it blazingly fast when you save.
We've got our atomic stores. Their big benefit is they're very easy to use. They don't require that you learn anything about SQL. They're fairly fast for small data sets. And in the case of the XML store, at least, it's human readable, so it makes it easier for you to debug. Then there's your own atomic store class, which may or may not be human readable depending on what you've decided to use as a format.
And, and this is the only store where you can get this, it contains whatever format characteristics are important to you. We can't know what those are, but you can, when you build your store file. And then there's the SQLite store, which is honestly the most scalable of our stores. It's best at dealing with large data sets.
It's more efficient both in memory and on disk. Why? Because it doesn't need all the extra overhead that goes into archiving or creating an XML file. It's faster loading, because you only load the objects you need. There's no upfront overhead when you add a SQLite store because, well, until you ask it for objects, we don't try and load any. You're going to have faster searching; SQL stores are generally designed for doing things like relational queries and searching. It is, however, less forgiving if you've got a bad model, and this is where we really get into the meat of optimization.
Optimizing your model is really the place to start when you want to make a Core Data application fast, because no matter what you do, if you've got a model that really doesn't work for your application's data, there's just nothing you can do to make it fast. You need to design your model around your application's data, but to a certain extent, you also need to consider designing your application around the data you've got.
One of the things you really don't want to do is build the kitchen-sink window, where you try and display everything all at once. That means you're going to have to load all of your data up front. That's slow, that's painful. It's probably also going to be really confusing to your user, who's going to be stuck sorting through all kinds of stuff that doesn't necessarily need to be seen at the same time.
So how do you design a data model for efficiency? This is a pattern we see a lot: people just throw everything onto one entity. Here we have a recipes entity; that seems to be our theme. Recipes: we like to cook. If we look at this entity, it's often actually easier to optimize if you put some data into a table representing what you would actually create if you created instances of that entity. So I've done that, and I'm looking at my table. The first thing I see is this column, which looks suspiciously like a flattened array, an array of strings actually.
That's really kind of inconvenient, and you really don't want to do that because, well, Core Data and a lot of other things get much happier if they just have simple value-typed attributes to work on. So we're going to break those out into their own table. Call it normalization. Move it off. We've now got an ingredients entity that contains all of the ingredients that were in that column in the original table, along with their quantities. One of the advantages of this is that it's now a lot easier to search for recipes by ingredient.
Now, if I'm doing normalization, the first thing I need to do is put some kind of a foreign key on this table so I can link it back to the recipe the ingredient came from. You'll notice that just using the recipe name is going to be kind of inconvenient because I've got multiple recipes with the same name.
So instead you'll probably want to use a numerical ID and put that back on the recipes table. And you know what? This is Core Data. Forget all of that. You don't need to know about it, except insofar as you know that this is how normalization works.
There's something like that going on behind the scenes; Core Data takes care of all of it for you. So we're rid of the ingredients column. What can we do next? Well, for those of you who are familiar with Objective-C, you've probably zoomed in on this column, because that looks an awful lot like an NSData. And given that the title of the column is Picture, you'd be right. That's a picture of the recipe. What do we know about image data? We know it's big.
We know, given the entity we saw earlier, that we're going to load that large chunk of data even if all we want to do is display recipe names in a table view or an outline view. So we're going to move that off onto its own separate entity as well.
We've just sped up our load time a lot if we don't have to pull five megabytes worth of image data in for every recipe we want to display in our table view. So what can we do next? Here's a good place to start: if you're doing normalization, one of the things you're always thinking is, I want to reduce, as much as I can, the duplicate data in my tables. And there's some duplicate data.
This is actually kind of a deceiving example, because in the interest of getting everything into Keynote, I made the table small. The chef would probably be split across several columns: first name and last name, possibly some biography information. It's going to be entered in every single row in the table. What does that mean? I have to load it every time, and it bloats my database, because I've got all of this data repeated over and over and over again.
And if for some reason I ever have to change one element, say Ben gets sick and tired of his last name and decides that no, he wants a different one, I'm going to have to go change how many rows in my database? It's much easier if I just stick that all in its own table, have one entry per chef, and have those related. Similarly, we've got a cuisine column going here.
Italian cuisine, times two. Do I really need that times 2, or times 10, or times 1,000, once for every recipe in the database? Nope. Let's make that go away too. And since we're on a roll, we'll look at the recipe names and, hey, wow, we've got duplicate data in there too. This is where you sort of have to stop and think: what does that duplication mean? In the case of a chef, or in the case of a cuisine, the duplicate data really is identical; it's actually referring to the same thing in all places. Is a recipe name unique in the same way? No; if I change a recipe name, it changes just that one recipe. It doesn't change all recipes with that name.
These recipe names are pretty integral to the identity of a recipe, so we're going to leave those in this table. We're going to call it done, because pretty much everything we've got left really is integral to this table. So we had this, and we've made it look like that. It's more complex, but odds are very good it's going to be faster. Recipes are going to load faster, querying by cuisine is going to be faster, querying by chef is going to be faster. And hey, we've added the ability to easily query by ingredient as well.
But you know, normalization isn't always what you want to do. Sometimes you actually want to go the other way. Take, for example, iCal. You might have heard in a session yesterday that they've started using Core Data, which means we actually had to think about their model a little bit ourselves, to make sure they weren't doing anything wrong. You'll notice they've got these badges. They display primarily events, but the events have information about things that are related to the events.
If I were making a first pass at their model, I'd build something that looks a lot like this. I have a base event, and it has relationships off to people, alarms, and notes. But you know, that's really expensive when I start doing their UI, because in order to display any of those badges, I have to fault in all of these relationships. And that's a lot of extra work. What I'd probably want to do instead is create Booleans that contain metadata about the relationships, and put those on the base entity.
Then I can put up my badges or not, depending on what the Boolean value is set to. This is going to require that you write a little bit more code. You're going to have to actually add accessor methods and put code in those accessor methods to update the Booleans as the relationships are updated. But it's going to be faster.
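A sketch of what those accessor methods might look like, assuming a hypothetical Event entity with an alarms to-many relationship and a hasAlarms Boolean attribute; this isn't the actual iCal code:

```objc
#import <CoreData/CoreData.h>

@interface Event : NSManagedObject
@end

@implementation Event

- (void)addAlarmsObject:(NSManagedObject *)alarm {
    NSMutableSet *alarms = [self mutableSetValueForKey:@"alarms"];
    [alarms addObject:alarm];
    // Keep the denormalized flag in sync so the badge can be drawn
    // without ever firing the alarms relationship fault.
    [self setValue:[NSNumber numberWithBool:YES] forKey:@"hasAlarms"];
}

- (void)removeAlarmsObject:(NSManagedObject *)alarm {
    NSMutableSet *alarms = [self mutableSetValueForKey:@"alarms"];
    [alarms removeObject:alarm];
    [self setValue:[NSNumber numberWithBool:([alarms count] > 0)]
            forKey:@"hasAlarms"];
}

@end
```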
So what are the design tips we can give you to take away? If you can break entities down, putting subsidiary attributes onto entities at the other end of relationships in order to speed up loading of your base entities for display in table views, outline views, or anything that really wants a smaller subset of the data, do it.
Move NSData onto the other end of to-one relationships; it's almost always the right thing to do. Consider relationships carefully. They're very useful. They speed up searching. They allow you to minimize the amount of duplicate data you've got, which makes updates faster, makes your database smaller, and makes your data set smaller, because you don't have that data in memory in multiple places. They do take a little bit of time to update.
And they're a little bit slower if you're doing key path queries across a lot of relationships. But it's probably also going to be faster, like I said, to do a strict equality check when you're querying than to do string comparisons across a table. You also want to consider inheritance carefully. There's not a lot of cost to inheritance, but if you've got a very large data set, those costs can build up over time. So, I've told you to change your model.
And I know at least some of you are thinking: but you know, it's a pain to migrate data. Or at least it was in the Tiger timeframe. We've tried to help solve that problem as well by adding a mechanism for model migration, which conveniently helps you optimize, because you no longer need to worry about the pain of writing migration code every time you decide to change your model to make it faster. Core Data provides support for model versioning and for migration.
We put version information in the store metadata, and we provide the ability to create mapping models that allow you to describe transformations from one version of your model to another. If you want to see how this works, you can go look at NSMigrationManager, NSMappingModel, NSEntityMigrationPolicy, and a bunch of other classes that Malcolm has kindly documented for you.
Out of the box, what you need to know is that we provide facilities for adding and removing entities, adding and removing properties, factoring properties out into separate entities at the end of to-one relationships, absorbing entities that were at the other end of to-one relationships, and changing relationship cardinality: making a to-one into a to-many, or a to-many into a to-one as long as there was only one object in the relationship. Unfortunately, a lot of optimization is a little bit more complex, and you're probably going to have to write code.
If you're uniquing objects, which you would do to improve query speed and to minimize the amount of data you've got in the database or in memory, you're probably going to have to write a custom entity migration policy. If you want to do data duplication (we'll get more into that later, when we're talking about database queries), you might want to consider doing some text canonicalization up front, and for that, you're probably going to want to write a custom function expression. That's really all I'm going to say about migration, other than that it's there, it's very useful in the context of optimizing your data model, and Malcolm has written really great documentation about it.
So hopefully by now you've got a reasonably optimized model in your application, and you can start getting down to the nitty-gritty. How do you get data into your application quickly? Well, there are a number of ways. The fetch request cycle is pretty basic. The user does something which causes a fetch request to be created and executeFetchRequest: to be called on the managed object context. That's sent down to the persistent store coordinator, which sends it down to the appropriate store.
If it's an atomic store, the predicate is evaluated on the cache nodes in memory. Otherwise, if it's a SQLite store, we create a SQL statement, send that out to the database where it's evaluated, and see if we can find any matching data. If we find matching data, it's brought in and registered.
Managed objects are created or found, depending on whether you'd already loaded them, passed back to the coordinator, and passed back to the context, which registers them if it doesn't already know about them and returns them to the user. It's a pretty basic process. Many of you are probably familiar with it. You've read our documentation.
We've written a lot of documentation. This isn't always what you want, though. This is really good for the basic small data set, moderate data set, even the largish data set use case. But it's not always what you want, because sometimes you don't need all of the data that ends up getting loaded.
For example, a lot of the time you're only actually going to want to know how many objects in the database meet the criteria set in the predicate. With the old API, you'd actually have to fetch all of those objects into memory and count the number of them in the result array. You don't need to do that.
We don't need to do that anymore. We've added API to the managed object context, countForFetchRequest:error:, which allows you to just get the number back. We've also added some stuff to NSFetchRequest. If you're doing a fetch on an inheritance hierarchy, you often don't want all of the entities in that hierarchy.
You may want just one specific entity. You can tell the fetch request to include or exclude subentities to determine precisely which objects you want back. And if all you want back is managed object IDs, because you're interested in the identity of the objects and not any other information about them, you can tell the fetch request to return the result as managed object IDs instead of creating faults.
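Pulled together, those options might look something like this; the Recipe entity and the predicate are just the running example from earlier, not code shown in the session:

```objc
NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
[request setEntity:[NSEntityDescription entityForName:@"Recipe"
                               inManagedObjectContext:context]];
[request setPredicate:[NSPredicate predicateWithFormat:@"cuisine.name == %@",
                                                       @"Italian"]];
// Only Recipe itself, not any of its subentities.
[request setIncludesSubentities:NO];

NSError *error = nil;
// Just the count: no objects are materialized in memory.
NSUInteger count = [context countForFetchRequest:request error:&error];

// Or, if identity is all you need, ask for managed object IDs back.
[request setResultType:NSManagedObjectIDResultType];
NSArray *objectIDs = [context executeFetchRequest:request error:&error];
```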
Related to that is whether or not you actually want to fetch the attribute values from the database when you do the fetch. Normally, we fetch all of the values for the object, register them in the row cache, and return a fault. When you trip the fault, we go to the row cache, populate the managed object, and you now have attribute values you can work with. Sometimes that's not what you want. You may have a very large data set coming back for which you still want faults, because you're going to need to treat them as managed objects to access their attribute values.
But you may not need all of that data. You may have 100,000 objects that you're trying to bring back, and you're only ever going to display 10, 20, 30. At that point, it may make more sense to leave the attribute values in the database until you actually trip the managed object fault, and then make the round trip to the database to pull in just the specific attributes you actually need.
An extension of that, or actually not an extension, sort of the other extreme, is when you know up front that you're going to want all of the attribute information from those managed objects pretty much immediately after you do the fetch. At that point, you can tell the fetch request that it should not return the objects as faults. In that case, we'll go out to the database, pull all the attribute values back, create your managed objects, and populate them immediately.
This saves you the trip through the fault handlers if you know up front that you're actually going to need all of those managed objects realized. Related to that is the ability to set relationship key paths for prefetching. What does this do? Sometimes you know that when you load an object, you're always going to want some related object. For my recipes, I may know that, because of the way I'm displaying data, whenever I use a recipe I'm always going to immediately try and access its cuisine. So I can specify that relationship key path.
This will tell Core Data that whenever you fetch recipes from the database, it should also fetch back their associated cuisines and register those. This can minimize the number of trips you have to make to the database, minimize the amount of work you have to do after every trip, and make handling relationship faults a lot faster.
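A sketch of those remaining fetch request knobs, again using the recipes example with assumed attribute and relationship names:

```objc
NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
[request setEntity:[NSEntityDescription entityForName:@"Recipe"
                               inManagedObjectContext:context]];

// Huge result set, only a handful ever displayed? Leave the attribute
// values in the database until each fault actually fires:
// [request setIncludesPropertyValues:NO];

// Opposite extreme: you'll touch everything immediately, so skip the
// fault machinery and realize the objects up front...
[request setReturnsObjectsAsFaults:NO];

// ...and pull each recipe's cuisine back in the same trip.
[request setRelationshipKeyPathsForPrefetching:
             [NSArray arrayWithObject:@"cuisine"]];

NSError *error = nil;
NSArray *recipes = [context executeFetchRequest:request error:&error];
```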
So there's no one silver bullet. None of these things is going to be right for all of your applications. You're going to have to decide how your users are trying to use the data, what it is you're trying to achieve, and which set of flags is appropriate for you.
Once you've got your data, there are a couple of other things you really want to think about, mostly having to do with how much of it you want to keep in memory at one time. We ended up like this at the end of the last fetch request cycle: we just had a couple of objects in memory. But you know, your user's not just going to look at two objects. They're going to ask for more and more, and eventually you're going to have lots.
At some point, if they've only looked at those objects once, you're probably going to want to release them. So first, you only want to fetch what you need; but second, you only want to retain what you need. Some things that can help you figure out what it is you're retaining, whether you knew it or not: contexts retain objects that have been modified, but not objects that haven't been modified. So if you're not retaining something and it hasn't been modified, it's not being retained. When its retain count goes to zero, it's released and deallocated, the memory gets scavenged, and you get to reuse it at some point.
It will often surprise you if you've got a wild pointer and other things end up in the same place; weirdness happens when you're trying to debug, and heisenbugs aren't fun. Undo managers retain object IDs only; there's something to think about. And you know, we're subject to the rules of Cocoa memory management just the way you are. So if you've tripped a relationship fault, remember that relationships retain the objects they're related to.
This is called, in the vernacular, a retain cycle. If you want some of those objects to go away, you're going to need to figure out how to break that retain cycle. There are a couple of ways to do that. The one you're probably going to want to use most of all is refreshObject:mergeChanges:. This is on the managed object context, and it causes the object to be turned back into a fault.
If you say mergeChanges:NO, the object remains a fault. If you say mergeChanges:YES, it will be re-realized and the changes will be reapplied, but the relationship faults won't be re-tripped. This allows you to break relationships between objects, break retain cycles, and allow whatever is at the other end of those relationships to be, well, released, deallocated, scavenged.
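In code, that might look like this, where recipe stands in for any fully realized managed object whose relationship faults have been fired:

```objc
// Turn the object back into a fault, discarding unsaved changes;
// the objects at the other end of its relationships can now be released.
[context refreshObject:recipe mergeChanges:NO];

// Or keep the unsaved changes but drop the realized relationship targets.
[context refreshObject:recipe mergeChanges:YES];
```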
There is another option, which is much more of a nuke-and-pave option, and that's reset. You can call that on the managed object context to blow away everything that's been done in that context and reset its state to the way it was initially. This includes blowing away the undo stack, so it's a pretty heavyweight option. But sometimes it's really what you're going to want to do, and it's going to be a lot faster than refreshing every single object in the context.
Things start to get tricky at this point. A lot of optimization really is: it becomes more a matter of looking at what you've got, trying to figure out what it's telling you, and then trying to figure out what to do about it. How do you do that in Core Data? We've added a default, com.apple.CoreData.SQLDebug. This allows you to tell Core Data to log all of its transactions with the database. You can set it to a value between 1 and 3.
1 displays the least information, and 3 the most. For most of you, 1 is probably actually going to be the most useful. It'll log all of the SQL being sent to the database, tell you how many objects were returned as a result of that SQL, and tell you how long it took. (You can turn it on by passing -com.apple.CoreData.SQLDebug 1 as a launch argument.) If you find that's not enough for some reason, there's always Shark, which will tell you exactly where your application is spending cycles, and where you really want to be concentrating the bulk of your optimization work.
What kind of things might you see in the logs that Core Data prints when it goes to the database? You might see that you're doing one big fetch up front, and then a lot of individual fetches on related objects that are each only returning one object.
What this means is that you're probably firing a lot of relationship faults individually. This is a good place for you to use relationship key path prefetching: go grab those objects up front instead of doing it one at a time. This is a specialized subset of what we call a bad access pattern: repeated trips to the database.
Going to the database is I/O. We all know that I/O is slower than operating on stuff in memory. You don't want to do it repeatedly for very small object sets. You want to try and maximize the amount of data you get for the overhead of the trip to the database.
Try and load as much as you can. If you see you're bringing back lots of individual objects, try and find out if there's a way to aggregate that somehow. IN operators are a very good way to do this: you can write a predicate that basically says "my object ID is IN some collection of object IDs" that you've gathered somehow. This will bring a whole bunch of otherwise unrelated objects back into memory all at the same time.
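For example, a sketch of that IN-style batching, assuming you've collected a set of object IDs (objectIDsToRealize) from somewhere, such as an earlier IDs-only fetch:

```objc
NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
[request setEntity:[NSEntityDescription entityForName:@"Recipe"
                               inManagedObjectContext:context]];
// One round trip realizes the whole batch of otherwise unrelated objects.
[request setPredicate:
    [NSPredicate predicateWithFormat:@"SELF IN %@", objectIDsToRealize]];

NSError *error = nil;
NSArray *batch = [context executeFetchRequest:request error:&error];
```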
Sometimes you'll find out you're loading more data than you need: the predicate you thought you were setting isn't actually being set, or you're not setting a specific enough predicate. That's something else to look at. And sometimes you'll notice that you're not actually doing a lot of database queries, but boy, are you spending a lot of time in the database.
At this point, you want to look at your WHERE clause and see what it says. If you've got a complex predicate, it might just be that you're doing things in the wrong order. You want to order the simple parts first. If you have a salary comparison and then some kind of a regex, you want to do the numerical salary comparison first, so you can cut down the number of rows you actually have to run the regex on.
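As an illustrative sketch (the attribute names and variables here are invented), the reordering amounts to this:

```objc
// Expensive text match runs against every row the query examines.
NSPredicate *slow = [NSPredicate predicateWithFormat:
    @"name MATCHES[cd] %@ AND salary > %@", namePattern, minimumSalary];

// Cheap numeric comparison first: the regex only runs on rows
// that survive the salary test.
NSPredicate *better = [NSPredicate predicateWithFormat:
    @"salary > %@ AND name MATCHES[cd] %@", minimumSalary, namePattern];
```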
Well, text. That's our last bullet point up there. Regex is expensive. There's just no way around it; there's a whole bunch of stuff we have to do to set that up and make it work. The more flexibility you give your user in querying text, the more expensive your query is going to be.
Basically, case- and diacritic-sensitive comparisons are faster than case- and diacritic-insensitive comparisons. Equality is faster than substring comparison, which is faster than regex. If you notice that you're doing case- and diacritic-insensitive regex, you might want to ask yourself: is that really what you need? Do you really need the case and diacritic insensitivity, or are all of your users Americans who wouldn't know an accent if one snuck up on them? Do you really need full regex, or is substring support good enough? A lot of the time, substrings are actually what you really want to be using.
If you know that you really do need case and diacritic insensitivity, you might actually want to think about doing some string canonicalization up front: creating a shadow attribute that contains a canonicalized version of whatever text it is that you want to search on. This is going to make it a little bit more expensive to create your objects, and a little bit more expensive to save them.
But boy, is it going to save you time when you're searching, because every time you do a case- or diacritic-insensitive search, we have to normalize the text, and when we're done with the search, we just throw that work away. You can imagine how that builds up if you're doing 30 queries across the same column, all of which have to do the same whole bunch of work and then throw it away. You might want to consider doing that work once, up front.
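One way you might implement that shadow attribute, sketched with a hypothetical normalizedName attribute and Leopard's string folding API in a managed object subclass:

```objc
- (void)willSave {
    NSString *name = [self valueForKey:@"name"];
    // Fold away case and diacritics once, at save time.
    NSString *canonical = [name stringByFoldingWithOptions:
                                    (NSCaseInsensitiveSearch |
                                     NSDiacriticInsensitiveSearch)
                                                    locale:nil];
    // Guard against re-triggering willSave with a no-op change.
    if (![canonical isEqualToString:[self valueForKey:@"normalizedName"]])
        [self setValue:canonical forKey:@"normalizedName"];
    [super willSave];
}
```

Queries can then fold the search string the same way and run a fast, sensitive comparison against normalizedName instead of an insensitive one against name.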
Relationships are something that can also be really expensive. They can be very useful if you're just following one link: you're following a single relationship key path, and you're doing an identity-based comparison. It gets a lot uglier if you're following 13 of them and doing some kind of obscure text manipulation at each step.
You want to try and minimize the set of objects you're working with. So you might want to see if there's a way you can refactor your UI or your data set to move some of that relationship information onto other objects, and not do large amounts of key path following.
Some other stuff we've seen. This is actually a fairly common pattern when people are creating lots and lots of objects all at once: they'll hit the database repeatedly to find out if they've already created an object. This is common when they want some kind of unique value that is not, however, unique in the data set they started with. Cuisines are actually a good example.
When I write my code to create new recipes that have cuisines, I could look at the cuisine field, pull out that string value, and do a database query to find out whether or not I've already created a cuisine with that name. This is going to be slow. What you're probably going to want to do instead is take the cuisine name, create a cuisine, and then register it in a local cache dictionary that you can use to find out if you've already created that object. Avoid the database trip; everybody's happier.
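A sketch of that import loop; the record dictionary and entity names are hypothetical:

```objc
// Built once, before the import loop begins.
NSMutableDictionary *cuisineCache = [NSMutableDictionary dictionary];

// ... for each incoming recipe record:
NSString *cuisineName = [record objectForKey:@"cuisine"];
NSManagedObject *cuisine = [cuisineCache objectForKey:cuisineName];
if (cuisine == nil) {
    // First time we've seen this cuisine: create it and remember it
    // locally, instead of fetching from the database to check.
    cuisine = [NSEntityDescription insertNewObjectForEntityForName:@"Cuisine"
                                            inManagedObjectContext:context];
    [cuisine setValue:cuisineName forKey:@"name"];
    [cuisineCache setObject:cuisine forKey:cuisineName];
}
[recipe setValue:cuisine forKey:@"cuisine"];
```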
Don't save after every insert. We've also seen this. Try and batch them. Do your inserts in lots of 100 or 1,000 or whatever makes sense for you. One of the reasons people were doing a lot of saves after inserts was so they could get permanent object IDs assigned as quickly as possible.
We've added API to the managed object context, obtainPermanentIDsForObjects:error:, that allows you to get permanent object IDs assigned without doing a save. Deletions. As I mentioned before, relationships are one of those mixed blessings. If an object participates in a relationship that has an inverse, then in order to delete the object, we have to fault in all the objects on the other end of that relationship, so we can tell them that their relationship to this source object needs to be cleaned up.
Well, if you know up front that you're deleting all the objects at the other end of that relationship, you may want to nil out the relationship first. This will save Core Data a whole bunch of work doing delete propagation for stuff that, well, is going to be going away anyway.
If you don't want to do that, but you know that there are a whole bunch of objects at the other end of the relationship, then rather than tripping those faults one by one (we mentioned this a few slides ago), you might want to consider doing some prefetching when you load those objects in the first place. Fetch whatever's on the other end of the relationship so you're not tripping those faults individually, one by one by one, with whatever order of magnitude of I/O work that implies. And again, don't save after every delete.
So you've got all those patterns and you're looking at the numbers you're getting. You're saying, "Does this make sense? Can I do better?" A large part of knowing whether or not you can do better, whether or not it's worth spending the time, is knowing what to expect in the first place. Core Data performs really well.
for non-large data sets. What's a large data set? Unless your objects are 5 megabytes each, 5,000 objects is not a large data set. A large data set, depending on the size of your objects, is going to start somewhere in the hundreds of thousands of objects. Core Data can load moderately sized objects at about 100,000 a second. That's in the latest version of Tiger.
Ben will talk about some performance improvements we've made in Leopard. We can save about 5,000 objects a second. We can delete about 1,000 objects a second. If these are the numbers you're seeing, you're probably doing about as well as Core Data can do, which means you're at the point where you've done all you can for the simple case. Now I'm going to bring Ben up here to talk about the perils and patterns of the more complicated case, otherwise known as attempting to multithread Core Data.
Hey, good afternoon, everyone. So I'm going to show you some stuff about multithreading. I'm going to preface this by saying that it's a lot of work, and I don't really recommend you go down this path until you absolutely have to. But everybody asks eventually. So, some of the motivations for using threads: probably the best reason to add threads to an application is responsiveness, where you have a long-running operation and you don't want your users to get the little spinning pizza.
And in this case, a long-running operation is maybe 200 milliseconds or so. So threads can be a way to avoid using a progress bar, do something in the background, and get your application's UI back to the user as quickly as possible. And in Core Data, another way you can use threads is to do batch saves.
So if you have a really large block of saves, that can take a little bit of time. We know we have some more performance work to do here, but one way you can get a boost is that if you're saving unrelated changes, you can break them up into different groups, and you can get a performance improvement that way.
So, for Core Data's thread safety: like much of Cocoa, there isn't a whole lot of intrinsic thread safety going on. You need to do some locking, and we need you to help us understand the scope of a transaction: what changes you do and don't want to be controlled by that thread.
So we need you to do some locking on the managed object context and on the persistent store coordinator, to give us some scoping information. The managed objects themselves are always owned by their managed object context, so whenever you're concerned about the thread safety of a managed object, you need to take a look at who owns its context. Managed object IDs are immutable, and they're always thread safe.
So here's our recommended approach when you do decide to jump into this pool, and that's basically: a thread creates a managed object context. A thread can have many managed object contexts, but in particular, a managed object context stays exclusively under the control of the thread that created it. Only the creator thread ever sees it or knows about it; there are no violations of encapsulation. Basically, the managed object context is confined to that thread.
This is easier to do correctly: there are fewer interdependencies that require synchronization, and there's more concurrency. It basically requires that you break things up a little bit more and plan your application around larger aggregate operations. But you can get more performance this way.
So in this scenario, Core Data is actually automatically going to do some of the locking for you. You don't have to explicitly lock a context if you're never sharing it with another thread, in the same way that if you have a mutable dictionary and no other thread ever sees it, you don't have to lock it.
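A minimal sketch of that confinement pattern, assuming a sharedCoordinator created at startup; the method and selector names are hypothetical:

```objc
- (void)backgroundWork:(id)unused {
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // This context is created on, and confined to, this thread;
    // no other thread ever sees it, so no explicit locking is needed.
    NSManagedObjectContext *context = [[NSManagedObjectContext alloc] init];
    [context setPersistentStoreCoordinator:sharedCoordinator];

    // ... fetch, edit, and save entirely within this thread ...

    [context release];
    [pool drain];
}

// Spawned elsewhere with, e.g.:
// [NSThread detachNewThreadSelector:@selector(backgroundWork:)
//                          toTarget:self withObject:nil];
```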
And the managed object context, when it needs resources from the persistent store coordinator or from the database, will handle all the locking for you, so you don't have to worry about whether the database resources need extra synchronization. You still need to lock the persistent store coordinator when you message it explicitly. So if you're, say, adding a new persistent store, or you're asking the persistent store coordinator to take a URI and give you back a managed object ID, some stuff like that, then you'll need to lock the coordinator.
You lock the persistent store coordinator if it's being shared amongst multiple threads. If you have a document-based application, you might not have to do this, because the persistent store coordinator might also be confined to a thread in a similar way. And another reason you might want to lock the persistent store coordinator is actually to get less concurrency.
So if you have a bunch of operations, you want to do some fetches and a couple of saves, and you want one context to own the coordinator for a specific block and not let anybody else interpose edits until you're done, you can lock the coordinator so that only one context has access to the database for some amount of time.
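That block-level locking might look something like this, with the request, context, and error set up elsewhere:

```objc
// NSPersistentStoreCoordinator adopts NSLocking; holding the lock keeps
// other contexts from interleaving work against the same stores.
[sharedCoordinator lock];
@try {
    NSArray *results = [context executeFetchRequest:request error:&error];
    // ... edit the results ...
    [context save:&error];
}
@finally {
    [sharedCoordinator unlock];
}
```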
And in Leopard, we're now shipping a debug version of the Core Data framework. When you use the debug version, you get some threading debug assertions we've added. We basically keep track of additional state about what you've locked and what you haven't locked, and which thread has got its fingers on which objects. And if you break any of our assumptions, we'll throw an assertion failure and basically kill your app. If you're running in GDB, this is really useful for getting a backtrace of all the things that you've locked.
You set the DYLD_IMAGE_SUFFIX environment variable, and this turns on the debug framework; if you man dyld, you can get some more information about this. And when you've set that up correctly and you've passed the user default, com.apple.CoreData.ThreadingDebug, at threading debug level 1, we'll log a little message saying that we've enabled the multithreading assertions. And then you can break on the NSAssertionHandler failure methods; I just look them up in the Foundation documentation and then set my GDB breakpoint there.
And you can see an example of multithreading where each thread has a confined context and they don't pass contexts between threads. This actually shipped on Tiger; it's in Developer/Examples/CoreData, the background fetching example. The threads pass the immutable object IDs between each other, and since those are immutable, no synchronization is required. So threads can talk to each other.
They can pass information about the objects they're working with without actually exposing their private copies of those objects. And it's an example of threading for responsiveness: the background fetching doesn't actually make the fetching any faster, but it does mean that when you bring up the window, the user doesn't wait for all the fetching to complete before the table view is rendered.
Then there's also the hard way, which we really try to dissuade people from doing, but some user scenarios actually require it. That's when you want to pass a managed object context between threads: you can't confine it to the thread that created it; you need to pass it off to another thread at a later point in time. In this case, you really have to lock and unlock the contexts as you move them between threads, and you want to basically transfer ownership of who's using that managed object context.
Managed objects, again, and I really need to reiterate this: even if you just do valueForKey: or call an accessor method on a managed object, that thread needs to own the lock on that object's managed object context. Reading is not thread safe. Managed objects are not immutable; they're not NSDictionaries.
And one of the real concerns here is the side effects of Core Data's caching. You might trigger a fault; you might trigger a callback. There are any number of things that can go on. So we don't consider reading to be intrinsically thread safe. You need to lock.
And here are some higher-level notes. Whenever you move any kind of Cocoa object between threads, you need to retain anything you lock. So if you lock a managed object context, or you lock a persistent store coordinator, or you lock an NSLock, you need to own a retain on that.
If that object gets deallocated from underneath you while you're depending on its synchronization, you will be very unhappy. And objects that get pushed across threads, whether they're managed objects or any other kind of Cocoa object, need to be retained by the sending thread and released by the receiving thread. That's the only way you can make sure that the lifespan of the object actually survives long enough for the receiving thread to do something with it. And managed objects go wherever their contexts go.
So again, our threading assertions work for doing things the hard way as well. You bump the threading debug level up to level 3, and that kind of lets you run with scissors. And it's basically the same: you just break on the NSAssertionHandler handleFailureInMethod:... method, and at that point you can do thread apply all bt, and that will get you backtraces for all the running threads.
Another note: Core Data's undo facility is built on top of Foundation's NSUndoManager. We're just pushing our own functions onto the undo stack, so it's not really all that magical; we're just observing all the changes to your managed object context and doing some aggregation on them. What this means is that the undo manager is not thread safe, and the groupsByEvent facility in particular cannot be thread safe.
There's no amount of locking that will make groupsByEvent work across threads, so it needs to be disabled on any undo managers you're using in background threads. You can either set groupsByEvent to off, or you can tell the context that you don't need undo management at all on a background thread. In fact, many background threads don't really need undo management.
If you're doing a batch save and something goes wrong and you'd want to undo, you might just redo that work and try again. Or if you're fetching in the background, you don't need undo for that. And when you disable groupsByEvent, you can just manually push and pop undo groups, basically bracketing the work yourself; that's just the Foundation API.
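In code, those two options might be sketched like this, for a hypothetical backgroundContext:

```objc
// Option 1: no undo management at all, typical for batch work.
[backgroundContext setUndoManager:nil];

// Option 2: keep undo, but disable event-based grouping and bracket
// groups manually with the plain Foundation API.
NSUndoManager *undoManager = [backgroundContext undoManager];
[undoManager setGroupsByEvent:NO];

[undoManager beginUndoGrouping];
// ... edits on this thread ...
[undoManager endUndoGrouping];
```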
And a final note on threading, about detached threads. You can review some information about POSIX threading to get the full technical detail about detached threads, but basically, all NSThreads are considered detached. What this means is that detached threads are optional workers: the process is not required to wait for them to finish. So an application can quit even if there are detached threads running in the background doing work.
And if you want the application, the main thread, to remain active and not quit, then you're required to do some manual synchronization, whether it's a condition signal or a thread join or some other work there. Basically, what this means is that if you save in the background, the application could quit before your save is done: halfway through the save, at any point during the save. Now, this isn't necessarily a big deal for the database; the SQLite database has full transactional support.
It's completely ACID compliant. So the database will be fine, but your users may not really appreciate that they had a clean document window when you quit, and yet their changes aren't actually saved to disk. So you'll want to do some additional synchronization if you go very far into threading with NSThreads.
And here I'm going to switch over to the demo machine now and do sort of a modified reprise of some of the demonstrations from last night (yeah, I don't care about the HTML store), and in particular show you some stuff in Shark: both how to use Shark and how I would approach it, looking at the hotspots and what evolved between the different versions of the application we're working on.
So this is basically just fetching 500,000 words from a dictionary database, and I'm just warming it up so it's sort of a fair comparison for everybody. And here in the first one, this is a basic project that's using SQLite directly, and basically it's written in raw C.
It's using the SQLite APIs, and it's using Core Foundation to do the collection management. And the key point here is we don't have a model object; we just have a generic dictionary. So this is pretty convenient using Core Foundation, and we're running our own SQL. Pretty easy: we iterate through everything, and that's pretty much it.
So in Shark, we will-- this one. It takes just a second here to run. Shark runs quite nicely on a quad processor.
So basically, we can see here in the main function that we're spending a lot of time releasing values and a lot of time doing the set. We spend a little bit of time in SQLite and a lot of time wrapping values. And when working with a database, that's actually pretty typical of pulling data back and forth: you have to wrap the values to put these things into a Cocoa view. Let's see if I can persuade Shark to-- eh. Shark doesn't want to open my source. How sad.
This is the penalty I get for not building on this machine. Basically, just walking you through it: a lot of this dictionary work is going on right in here, as we pull back each of the values using the SQLite API to get each column value. And we need to wrap those.
And because we don't actually have a model object, we just have these keys loosely associating an attribute name with a column, we need to wrap the values in CFNumbers and CFStrings all the time. And as you can see in the Shark trace, that starts to add up: something like 15% of the time just working that little dictionary.
You can see that the dictionary object is really convenient, but probably not as fast as we can possibly get. It took about 100 lines of code to do that. In this next sample, I did some work to address that, and I'm rerunning the warm-up just to keep everything fair, so to speak. This is actually a good 40% faster; we're fetching about 200,000 rows per second this way. It's also using SQLite exclusively.
And we're doing some stuff in Cocoa now, because we can get some stuff for free here. The key performance enhancement is to throw out that generic dictionary and have a real model object that we can directly assign ivars to, so we really cut back on all the overhead we spent in hashing. And all that hashing also contributed to the retain and release overhead. You can also see that we're spending some serious time here hacking on retain and release, because that came up pretty heavily in the Shark sample. So I'll run this in Shark as well. Okay.
So release is still taking up a big amount of time, but you see most of the dictionary hashing has now gone away, which makes sense. It's now focused on creating the CFString objects to wrap the string values. But in here, we get to assign the numbers directly into our ivars, so that's actually a pretty big performance improvement. And we also cut down on some of the message sending going on.
So in here, where we used to be spending more time with the dictionary hashing, we're now spending most of our time creating the string wrappers, which is about what we expect. We basically have a CFDictionary here to unique the values, so this is kind of like how you might implement a row cache. And we've cut back on some of the retaining and releasing here using custom callbacks.
With the Core Foundation collections, you can tweak the custom callbacks. (And I keep hitting the middle mouse button, excuse me.) Instead of using the default callbacks, which will call the retain and release methods, we're using basically static functions, which are sort of the Objective-C equivalent of a non-virtual method. And we've gotten rid of the overhead of spin locking by using the OSAtomic increment and decrement operations.
And as you can see down here, we're a good 40% faster or so, in maybe a third more code. And I will give you the performance data. So here in Core Data, basically we're doing all this for you, and we're getting there at just under 500,000 rows per second.
And here's my model. It's a pretty simple model, obviously; it's just a word. There's not a whole lot of data there, but it does highlight the overhead that Core Data has to work with, because each row has a certain amount of overhead regardless of how much data you put in it. So for some of you, this is hopefully an introduction to the new Objective-C property syntax, which is actually a very exciting addition to the Objective-C programming language.
It allows you to do type-safe, compiler-checked key paths, which is a feature I know a lot of people have been asking for for a long time. And Core Data is going to generate all the supporting accessor methods on the fly, so you don't actually have to write any of the accessor methods yourself, which doesn't look very exciting until we look at the main body of the code.
Here we're using the new property syntax, and this dot access on the text property is syntactically equivalent to calling the accessor method. That method gets generated for you on the fly; we take a look at the object and see if there are any optimizations we can apply.
And it's quite a bit faster than Tiger. Another introduction here is that we're using the new Objective-C fast enumeration protocol. This is probably the fastest way to iterate through a Cocoa collection. It's quite nice, and it's also much more succinct than the first example, where we had to ask the array for every object and stuff like that.
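The loop being described looks roughly like this; results stands in for the fetched array, and text is the word entity's attribute from the demo model:

```objc
NSUInteger totalLength = 0;
// Fast enumeration: no explicit indexing or objectAtIndex: calls.
for (NSManagedObject *word in results) {
    totalLength += [[word valueForKey:@"text"] length];
}
```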
This is really a simple app. We create a Core Data stack here, we create a fetch request, and we use some of the new API to tell Core Data that we don't really want any of the lazy initialization that goes on with faulting, because we're going to touch all the objects immediately. And we execute the fetch request. If we look at this in Shark, obviously being twice as fast, it will look quite a bit different.
The first thing you'll note is that main is actually kind of down here somewhere. There we go. So when we execute a fetch request in Leopard (and this didn't quite make the seed, but it will be forthcoming for the ADC members who get Leopard seeds very soon now), if you ask for enough data, we're going to spawn some background threads and do some pipelining for you.
And here, we do a much better job of load balancing across this G5, whereas in the previous apps we basically left a whole processor idle, doing nothing. We populate some of the row values, we do some of the registration in the context, some of the fault firing, and you can see everything's broken up pretty neatly here, in no one particular place.
[Transcript missing]
But most of the work is going on as we basically create wrapper objects for-- I can't show you that. Wow, Shark did something fun. All right, but most of the work is going on as we create the Cocoa wrapper objects for each of the column values.
And we'll be wrapping up and taking questions shortly, if we can go back over to the slides. Yeah. So we spent a lot of effort optimizing stuff for you in Leopard. Like I said, you'll get a chance to play with this in an upcoming seed really soon now. But as Melissa mentioned, the Leopard seed is already about five times faster than Tiger GM. You can get about 125,000 to 150,000 rows per second on a 2 gigahertz G5, and the new machines are a good twice the speed of that.
And that's what we have. We've done a lot of work optimizing a bunch of other things as well. And I really had a lot of fun testing my app on the quad in the lab yesterday. So, 800,000 rows per second; I'm happy about that. Anybody else? So, some things to look forward to: we're really committed to fixing performance bugs, so please file those.
In an upcoming seed, you'll see our support for Objective-C properties, you'll get to play with some of our pipelining, and we're building across four architectures now. I don't know what else the hardware guys will spring on us, but we'll be participating in pretty much all of the Leopard platform initiatives.
And for more information, you can send an email to Derek Horn, who's the Application Frameworks Evangelist, and the documentation and all the good stuff is there. The Core Data Programming Guide is fantastic: it has a section on multithreading and a section on common performance problems. I really recommend you take a look at that. Malcolm is adding new sections in response to your feedback.