WWDC18 • Session 610

Understanding ARKit Tracking and Detection

Graphics and Games • iOS • 57:59

With ARKit your app can see the world and place virtual objects on horizontal and vertical surfaces and recognize images and objects. Go beyond the API to gain insights into the innovative methods and techniques underlying these capabilities. See how ARKit combines device motion with the camera to provide accurate tracking and plane detection. Get a deeper understanding of persistence and multi-device AR and learn the recommended approach for image tracking and object detection.

Speakers: Marion Maerz, Michele Stoppa

Unlisted on Apple Developer site

Transcript

This transcript has potential transcription errors. We are working on an improved version.

Hello, everybody. I'm very excited to be here today to talk about understanding ARKit Tracking and Detection, to empower you to create great augmented reality experiences. My name is Marion and I'm from the ARKit team. And what about you? Are you already an experienced ARKit developer, but interested in what's going on under the hood? Then this talk is for you. Or you may be new to ARKit. Then you'll learn about the different kinds of tracking technologies, as well as some basics and terminology used in augmented reality, which will help you create your very own first augmented reality experience. So, let's get started.

What's tracking? Tracking provides your camera's viewing position and orientation in your physical environment, which then allows you to augment virtual content into your camera's view. In this video, for example, the table and the chairs in front are virtual content augmented on top of the real physical terrace. This, by the way, is the Ikea app. And the virtual content will always appear visually correct: correct placement, correct size, and correct perspective appearance.

So, the different tracking technologies just provide a different reference system for the camera. Meaning the camera with respect to your world, the camera with respect to an image, or maybe a 3D object. And we'll talk about those different kinds of tracking technologies in the next hour, so that you'll be able to make the right choice for your specific use case.

We'll talk about the already existing AR technologies: Orientation Tracking, World Tracking, and Plane Detection. We'll then have a closer look at our new tracking and detection technologies, which came out with ARKit 2: saving and loading maps, image tracking, and object detection. But before diving deep into those technologies, let's start with a very short, high-level recap of ARKit. This is especially interesting if you are new to ARKit.

So, the first thing you'll do is create an ARSession. An ARSession is the object that handles everything from configuring to running the AR technologies, and also returning the results of the AR technologies. You then have to describe what kind of technologies you actually want to run: what kind of tracking technologies and what kind of features should be enabled, like Plane Detection, for example.

You'll then take this specific ARConfiguration and call the run method on your instance of the ARSession. The ARSession will then internally start configuring an AVCaptureSession to start receiving the images, as well as a Core Motion manager to begin receiving the motion sensor data. So, this is basically the built-in input system of your device for ARKit.
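
As a minimal sketch of that setup, assuming a plain view controller that owns the session (the class name and property are placeholders):

    import ARKit
    import UIKit

    class ViewController: UIViewController {
        // The object that configures and runs the AR technologies.
        let session = ARSession()

        override func viewWillAppear(_ animated: Bool) {
            super.viewWillAppear(animated)
            // Describe which technologies to run...
            let configuration = ARWorldTrackingConfiguration()
            // ...and start the session; ARKit sets up the camera and
            // motion input internally.
            session.run(configuration)
        }
    }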

Now, after processing, the results are returned in ARFrames at 60 frames per second. An ARFrame is a snapshot in time which gives you everything you need to render your augmented reality scene. Like the captured camera image, which will be rendered in the background of your augmented reality scene.

As well as the tracked camera motion, which will then be applied to your virtual camera to render the virtual content from the same perspective as the physical camera. It also contains information about the environment, like, for example, detected planes. So, let's now start with our first tracking technology and build up from there.
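
To give an idea of what arrives in each ARFrame, here is a small sketch of an ARSessionDelegate implementation on the view controller from the previous sketch (it assumes session.delegate = self is set when the session is configured):

    import ARKit

    extension ViewController: ARSessionDelegate {
        // Called with the latest snapshot in time, at 60 frames per second.
        func session(_ session: ARSession, didUpdate frame: ARFrame) {
            // The captured camera image, rendered as the background of the scene.
            let pixelBuffer = frame.capturedImage
            // The tracked camera transform, applied to the virtual camera.
            let cameraTransform = frame.camera.transform
            // Information about the environment, e.g. anchors such as detected planes.
            let anchorCount = frame.anchors.count
            _ = (pixelBuffer, cameraTransform, anchorCount)  // silence unused warnings in this sketch
        }
    }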

Orientation Tracking. Orientation Tracking tracks, guess what? Orientation. Meaning it tracks the rotation only. You can think about it as only being able to use your head to view virtual content, which also only allows rotation. Meaning you can experience the virtual content from the same positional point of view, but no change in position is going to be tracked.

The rotation data is tracked around three axes. That's why it's also sometimes called three-degrees-of-freedom tracking. You can use it, for example, in a spherical virtual environment, like, for example, experiencing a 360-degree video, in which the virtual content can be viewed from the same positional point. You can also use it to augment objects that are very far away. Orientation Tracking is not suited for physical world augmentation in which you want to view the content from different points of view.

So, let's now have a look at what happens under the hood when Orientation Tracking is running. It is actually quite simple. It only uses the rotation data from Core Motion, which applies sensor fusion to the motion sensor data. As motion data is provided at a higher frequency than the camera image, Orientation Tracking takes the latest motion data from Core Motion once the camera image is available, and then returns both results in an ARFrame. So, that's it. Very simple. So, please note that the camera feed is not processed in Orientation Tracking. Meaning there's no computer vision under the hood here.

Now, to run Orientation Tracking you only need to configure your ARSession with an AROrientationTrackingConfiguration. The results will then be returned in an ARCamera object provided by the ARFrames. Now, an ARCamera object always contains the transform, which in this case of Orientation Tracking only contains the rotation data of your tracked physical camera. Alternatively, the rotation is also represented in eulerAngles. You can use whichever fits best for you.
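
A minimal sketch of running it and reading the results, using the same session as before:

    import ARKit

    func runOrientationTracking(on session: ARSession) {
        // Rotation-only, three-degrees-of-freedom tracking.
        session.run(AROrientationTrackingConfiguration())
    }

    func readRotation(from frame: ARFrame) {
        let camera = frame.camera
        // With Orientation Tracking, the transform carries rotation only.
        let transform = camera.transform
        // The same rotation, expressed as Euler angles.
        let angles = camera.eulerAngles
        print(transform, angles)
    }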

Let's now move over to more advanced tracking technologies. We'll start with World Tracking. World Tracking tracks your camera's viewing orientation, and also the change in position in your physical environment, without any prior information about your environment. Here, you can see on the left side the real camera's view into the environment, while on the right side you see the tracked camera motion while exploring the world, represented in a coordinate system. Let's now explain better what happens here when World Tracking is running. World Tracking uses the motion data of your device's accelerometer and gyroscope to compute its change in orientation and translation at a high frequency.

It also provides this information at the correct scale, in meters. In the literature, just this part of the tracking system is also called Inertial Odometry. While this motion data provides good motion information for movement across small time intervals and whenever there is sudden movement, it does drift over larger time intervals, as the data is not ideally precise and is subject to cumulative errors. That's why it cannot be used on its own for tracking. Now, to compensate for this drift, World Tracking additionally applies a computer vision process in which it uses the camera frames. This technology provides a higher accuracy, but at the cost of computation time.

Also, this technology is sensitive to fast camera motions, which result in motion blur in the camera frames. Now, this vision-only part of the system is also called Visual Odometry. Now, by fusing those two systems, computer vision and motion, ARKit takes the best of both systems. From computer vision, it takes the high accuracy over larger time intervals. And from the motion data it takes the high update rates and good precision for the smaller time intervals, as well as the metric scale.

Now, by combining those two systems, World Tracking can skip the computer vision processing for some of those frames, while still keeping an efficient and responsive tracking. This frees CPU resources, which you can then additionally use for your apps. In the literature, this combined technology is also called Visual Inertial Odometry. Let's have a closer look at the visual part of it.

So, within the computer vision process, interesting regions in the camera images are extracted, like here, the blue and the orange dot. And they are extracted such that they can also be robustly extracted in other images of the same environment. Those interesting regions are also called features. Now, those features are then matched between multiple images over the camera stream, based on the similarity of their appearance.

And what then happens is pretty much how you are able to see 3D with your eyes. You have two of them and they are a small sideways distance apart. And this parallax between the eyes is important, as it results in slightly different views into the environment, which allows you to see in stereo and perceive depth.

And this is what ARKit now also does with the different views of the same camera stream, during the process of triangulation. And it does this once there's enough parallax present. It computes the missing depth information for those matched features. Meaning those 2D features from the image are now reconstructed in 3D. Please note that for this reconstruction to be successful, the camera position must have changed by a translation to provide enough parallax. For example, with a sideways movement. A pure rotation does not give enough information here.

So, this is your first small map of your environment. In ARKit we call this a World map. In this same moment, the cameras' positions and orientations of your sequence are also computed, denoted with a C here. Meaning your World Tracking just initialized. This is the moment of initialization of the tracking system. Please note that in this moment of the initial reconstruction of the World map, the world origin was also defined. And it is set to the origin of the first camera of the triangulated frames.

And it is also set to be gravity aligned. It's denoted with a W in the slides. So, you now have a small representation of your real environment reconstructed as a World map in its own world coordinate system. And you have your current camera tracked with respect to the same world coordinate system.

You can now start adding virtual content to augment it into the camera's view. Now, to place virtual content correctly in an ARSession, you should use ARAnchors from ARKit, which are denoted with an A here. ARAnchors are reference points within this World map, within this world coordinate system. And you should use them because World Tracking might update them during tracking. Meaning that all the virtual content that is assigned to them will also be updated and correctly augmented into the camera's view. So, now that you've used the ARAnchors, you can add virtual content to the anchor, which will then be augmented correctly into the current camera's view.
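
A small sketch of that idea, assuming a placementTransform you computed yourself, for example from a hit test against a detected plane (sketched later):

    import ARKit

    func placeContent(at placementTransform: simd_float4x4, in session: ARSession) {
        // Create a reference point in the World map at the desired pose.
        let anchor = ARAnchor(transform: placementTransform)
        session.add(anchor: anchor)
        // When rendering (for example with ARSCNView), attach the virtual content
        // to the node created for this anchor, so it moves along whenever
        // World Tracking refines the anchor.
    }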

From now on, this created 3D World map of your environment is your reference system for the World Tracking. It is used to reference new images against. And features are matched from image to image and triangulated. And at the same time, new robust features are also extracted, matched, and triangulated, which then extend your World map. Meaning ARKit is learning your environment. This then allows, again, the computation of tracking updates of the current camera's position and orientation. And finally, the correct augmentation into the current camera's view.

While you continue to explore the world, World Tracking will continue to track your physical camera and continue to learn your physical environment. But over time, the augmentation might drift slightly, which can be noticed, as you can see in the left image, in a small offset of the augmentation. This is because even small errors will become noticeable when accumulated over time.

Now, when the device comes back to a similar view which was already explored before, like for example the starting point where we started the exploration, ARKit can perform another optimization step. And this addition turns the Visual Inertial Odometry system that ARKit supplies into a Visual Inertial SLAM system. So, let's bring back this first image where the World Tracking started the exploration.

So, what happens now is that World Tracking will check how well the tracking information and the World map of the current view align with the past views, like the one from the beginning. It will then perform the optimization step and align the current information and the current World map with your real physical environment. Have you noticed that during this step, the ARAnchor was also updated? And that is the reason why you should use ARAnchors when adding virtual content to your scene.

In this video, you can see the same step again with a real camera feed. On the left side you see the camera's view into the environment, and also features which are tracked in the images. And on the right side, you see a bird's-eye view of the scene, showing what ARKit knows about it and showing the 3D reconstruction of the environment.

The colors of the points are just encoding the height of the reconstructed points, with blue being the ground floor and red being the table and the chairs. Once the camera returns to a similar view it has seen before, like here the starting point, ARKit will apply this optimization step. So, just pay attention to the point cloud and the camera trajectory. Have you noticed the update? Let me show you once more.

This update aligns ARKit's knowledge with your real physical world, and also the camera movement, and results in a better augmentation for the coming camera frames. By the way, all those computations of World Tracking, and also all this information about your learned environment, everything is done on your device only.

And all this information also stays on your device only. So, how can you use this complex technology now in your app? It is actually quite simple. To run World Tracking you just configure your ARSession with an ARWorldTrackingConfiguration. Again, its results are returned in an ARCamera object of the ARFrame.

An ARCamera object, again, contains the transform, which in this case of World Tracking contains, in addition to the rotation, also the translation of the tracked camera. Additionally, the ARCamera also contains information about the tracking state and the tracking state reason. This will provide some information about the current tracking quality. So, tracking quality. Have you ever experienced opening an AR app and the tracking worked very poorly, or maybe it didn't work at all? How did that feel? Maybe frustrating? You might not open the app again.

So, how can you get a higher tracking quality for your app? For this, we need to understand the main factors that are influencing the tracking quality. And I want to highlight three of them here. First of all, World Tracking relies on a constant stream of camera images and sensor data. If this is interrupted for too long, tracking will become limited.

Second, World Tracking also works best in textured and well-lit environments, because World Tracking uses those visually robust points to map and finally triangulate its location. It is important that there is enough visual complexity in the environment. If this is not the case, because it's, for example, too dark or you're looking at a white wall, then the tracking will also perform poorly.

And third, World Tracking also works best in static environments. If too much of what your camera sees is moving, then the visual data won't correspond with the motion data, which might result in drift. Also, the device itself should not be on a moving platform like a bus or an elevator, because in those moments the motion sensor would actually sense a motion, like going up or down in the elevator, while visually your environment has not changed.

So, how can you get notified about the tracking quality that the user is currently experiencing with your app? ARKit monitors its tracking performance. We applied machine learning, trained on thousands of data sets for which we knew how well tracking performed in those situations. To train a classifier which tells you how tracking performs, we used annotations like the number of visible features tracked in the image and also the current velocity of the device.

Now, during runtime, the health of tracking is determined based on those parameters. In this video, we can see how the health estimate, which is reported in the lower left, gets worse when the camera is covered while we are still moving and exploring the environment.

It also shows how it returns back to normal after the camera view is uncovered. Now, ARKit simplifies this information for you by providing a tracking state. And the tracking state can have three different values. It can be normal, which is the healthy state and is the case most of the time. It can also be limited, which is whenever tracking performs poorly. If that's the case, then the limited state will also come along with a reason, like insufficient features or excessive motion or being currently in the initialization phase.

It can also be not available, which means that tracking did not start yet. Now, whenever the tracking state changes, a delegate is called: cameraDidChangeTrackingState. And this gives you the opportunity to notify the user when a limited state has been encountered. You should then give informative and actionable feedback on what the user can do to improve the tracking situation, as most of it is actually in the user's hands. Like, for example, as we learned before, a sideways movement to allow initialization, or making sure there's adequate lighting for enough visual complexity. So, let me wrap up the World Tracking for you.
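
A sketch of reacting to that delegate call; showGuidance is a placeholder for whatever UI feedback your app provides:

    import ARKit

    extension ViewController {
        func session(_ session: ARSession, cameraDidChangeTrackingState camera: ARCamera) {
            switch camera.trackingState {
            case .normal:
                showGuidance(nil)
            case .notAvailable:
                showGuidance("Tracking has not started yet.")
            case .limited(.initializing):
                showGuidance("Move your device sideways to initialize tracking.")
            case .limited(.excessiveMotion):
                showGuidance("Slow down; the device is moving too fast.")
            case .limited(.insufficientFeatures):
                showGuidance("Point at a well-lit, textured surface.")
            case .limited:
                showGuidance("Tracking is limited.")
            }
        }

        func showGuidance(_ message: String?) {
            // Placeholder: update a label or overlay in your UI.
            print(message ?? "Tracking is normal.")
        }
    }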

World Tracking tracks your camera's six-degrees-of-freedom orientation and position with respect to your surrounding environment, without any prior information about your environment, which then allows physical world augmentation in which the content can actually be viewed from any point of view. Also, World Tracking creates a World map, which becomes the tracking's reference system to localize new camera images against. To create a great user experience, the tracking quality should be monitored and feedback and guidance should be provided to your user.

And World Tracking runs on your device only. And all results stay on your device. If you have not done it already, try out one of our developer examples, for example Building Your First AR Experience, and play around for just 15 minutes with the tracking quality in different situations; lighting situations or movements.

And always remember to guide the user whenever they encounter a limited tracking situation, to guarantee that they have a great tracking experience. So, World Tracking is about where your camera is with respect to your physical environment. Let's now talk about how the virtual content can interact with the physical environment. And this is possible with Plane Detection.

The following video, again from the Ikea app, shows a great use case for Plane Detection: placing virtual objects into your physical environment and then interacting with them. So first, please note how, also in the Ikea app, the user is guided to make some movement. Then, once a horizontal plane is detected, the virtual table set is displayed and is waiting to be placed by you.

Once you've positioned it and rotated it as you want, you can lock the object in its environment. And did you notice the interaction between the detected ground plane and the table set in the moment of locking? It bounces shortly on the ground plane. And this is possible because we know where the ground plane is. So, let's have a look at what happened under the hood here.

Plane Detection uses the World map provided by the World Tracking I just talked about a moment ago, which is represented here in those yellow points. And then it uses them to detect surfaces that are horizontal or vertical, like the ground, the bench, and the small wall. It does this by accumulating information over multiple ARFrames. So, as the user moves around the scene, more and more information about the real surface is acquired. This also allows Plane Detection to provide an extent for the surface, like a convex hull.

If multiple planes belonging to the same physical surface are detected, like in this part now, the green and the purple one, then they will be merged once they start overlapping. If horizontal and vertical planes intersect they are clipped at the line of intersection, which is actually a new feature in ARKit 2.

Plane Detection is designed to have very little overhead, as it repurposes the mapped 3D points from the World Tracking. It then fits planes into those point clouds, and over time continuously aggregates more and more points and merges the planes that start to overlap. Therefore, it takes some time until the first planes are detected. What does that mean for you? When your app is started, there might not directly be planes to place objects on or to interact with.

If the detection of a plane is mandatory for your experience, you should again guide the user to move the camera with enough translation to ensure a dense reconstruction based on the parallax, and also, enough visual complexity in the scene. Again, for the reconstruction, a rotation only is not enough.

Now, how can you enable Plane Detection? It's, again, very simple. As Plane Detection reuses the 3D map from the World Tracking, it can be configured by using the ARWorldTrackingConfiguration. Then the planeDetection property just needs to be set to either horizontal, vertical, or, like in this case, both. Then just run your ARSession with this configuration, and the detection of planes will be started.
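
A minimal sketch of that configuration:

    import ARKit

    func runPlaneDetection(on session: ARSession) {
        let configuration = ARWorldTrackingConfiguration()
        // Detect horizontal and vertical surfaces; use just one of the two
        // if that is all your experience needs.
        configuration.planeDetection = [.horizontal, .vertical]
        session.run(configuration)
    }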

Now, how are the results, the detected planes, returned to you? The detected planes are returned as an ARPlaneAnchor. An ARPlaneAnchor is a subclass of ARAnchor. Each ARAnchor provides a transform containing the information about where the anchor is in your World map. Now, a plane anchor specifically also has information about the geometry of the surface of the plane, which is represented in two alternative ways.

Either as a bounding box with a center and an extent, or as a 3D mesh describing the shape of the convex hull of the detected plane, in its geometry property. To get notified about detected planes, delegates are going to be called whenever planes are added, updated, or removed. This will then allow you to use those planes, as well as react to any updates.
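
Here is a sketch of consuming those callbacks and properties, again on the view controller acting as the session delegate:

    import ARKit

    extension ViewController {
        func session(_ session: ARSession, didAdd anchors: [ARAnchor]) {
            for case let planeAnchor as ARPlaneAnchor in anchors {
                // Bounding-box representation: center and extent in the anchor's space.
                print("New plane, center:", planeAnchor.center, "extent:", planeAnchor.extent)
                // Mesh representation: the convex hull of the detected surface.
                print("Plane mesh vertices:", planeAnchor.geometry.vertices.count)
            }
        }

        func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
            // Planes grow and get refined over time; update attached content here.
        }

        func session(_ session: ARSession, didRemove anchors: [ARAnchor]) {
            // Planes can be removed, for example after being merged into another plane.
        }
    }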

Now, what can you do with planes? What we've seen before in the Ikea app is a great example: place virtual objects, for example with hit testing. Or you can interact with them physically; as we've seen, bouncing is a possibility. Or you can add an occlusion plane onto the detected plane, which will then hide all the virtual content below or behind the added occlusion plane. So, let me summarize what we've already gone through. We've had a look at the Orientation Tracking, the World Tracking, and the Plane Detection. Next, Michele will explain, in depth, our new tracking technologies, which were introduced in ARKit 2. So, welcome Michele.
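
Before Michele takes over, here is a minimal sketch of the hit-testing placement just mentioned, assuming an ARSCNView called sceneView and a screen point from a tap gesture (both placeholder names):

    import ARKit
    import UIKit

    func placeObject(at screenPoint: CGPoint, in sceneView: ARSCNView) {
        // Hit test from the screen point against the extent of detected planes.
        guard let result = sceneView.hitTest(screenPoint, types: .existingPlaneUsingExtent).first else {
            return
        }
        // Anchor the virtual object where the ray meets the plane.
        let anchor = ARAnchor(transform: result.worldTransform)
        sceneView.session.add(anchor: anchor)
    }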

[ Applause ]

Thank you, Marion. My name is Michele, and it's a pleasure to continue with the remaining topics of this session. Next up is saving and loading maps. This is a feature that allows you to store all the information that is required in a session, so that it can be restored in another session at a later point in time, to create augmented reality experiences that persist in a particular place. Or it could also be restored by another device to create multi-user augmented reality experiences. Let's take a look at an example.

What you see here is a guy; let's name him Andre, who's walking around the table with his device, having an augmented reality experience. And you can see his device is now making this scene more interesting by adding a virtual vase on the table. A few minutes later his friend arrives at the same scene. And now they're both looking at the scene. You're going to see Andre's device on the left and his friend's on the right.

So, you can see that they're looking at the same space. They can see each other. But most importantly, they see the same virtual content. They're having a shared augmented reality experience. So, what we have seen in this example can be described in three stages. First, Andre went around the table and acquired the World map. Then, the World map was shared across devices.

And then his friend's device relocalized to the World map. This means that ARKit, on the new device, was able to understand that this was the same place as the other device, computed the precise position of the device with respect to the map, and then started tracking from there, just as if the new device had acquired the World map itself.

We're going to go into more detail about these three stages. But first, let's review what's in the World map. The World map includes all the tracking data that is needed for the system to relocalize, which includes the feature points, as Marion explained before, as well as the local appearance of these points.

It also contains all the anchors that were added to the session, either by the system, like planes, or by the users, like the vase, as we have seen in the example. This data is serializable and available to you, so that you can create compelling persistent or multi-user augmented reality experiences. So, now let's take a look at the first stage, which is acquiring the World map.

We can play back the first video where Andre went around the table. You can see his device's view on the left here. And on the right, you see the World map from a top view, as acquired by the tracking system. You can recognize the circle is the table and the chairs around it.

There are a few things to pay attention to during this acquisition process. First, everything that Marion said about tracking also applies here. So, we want enough visual complexity in the scene to get dense feature points in the map. And the scene must be static. Of course, we can deal with minor changes, as you have seen with the tablecloth moving in the wind. But the scene must be mostly static. In addition, when we are specifically acquiring a World map for sharing, we want to go around the environment from multiple points of view. In particular, we want to cover all the directions from which we later want to localize.

To make this easy, we also made available a world mapping status, which gives you information about the World map. If you have been to the What's New in ARKit talk, Arsalan explained this in great detail, but to quickly recap: when you start the session, the world mapping status will start as limited. It will then switch to extending as more of the scene is learned by the device. And finally, it goes to mapped when the system is confident you're staying in the same place. And that's when you want to save the map, in the mapped state.

So, that's good information, but this mostly applies to the user acquiring the session. So, what does this mean to you as a developer? You need to guide the user. So, you can indicate the mapping status and even disable the saving or sharing of the World map until the mapping status goes to the mapped state.
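
A sketch of gating a share button on that status; shareButton is a placeholder outlet, and the helper would be called from session(_:didUpdate:) with each new frame:

    import ARKit
    import UIKit

    func updateShareButton(_ shareButton: UIButton, for frame: ARFrame) {
        // Only offer saving or sharing once the World map status is mapped.
        shareButton.isEnabled = (frame.worldMappingStatus == .mapped)
    }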

We can also monitor the tracking quality during the acquisition session and report to the user if the tracking state has been limited for more than a few seconds, and maybe even give an option to restart the acquisition session. On the receiving device, we can also guide the user to a better localization process.

So, back on the acquisition device, when we are in the mapped state, we can take a picture of the scene and then ship that together with the World map. And on the receiving end, we can ask the user to find this view to start the shared experience.

That was how to acquire the World map. Now, let's see how you can share the World map. First, you can get the World map by simply calling the getCurrentWorldMap method on the ARSession. And this will give you the World map. The World map is a serializable class, so we can simply use the NSKeyedArchiver utility to serialize it into a binary stream of data, which you can then either save to disk, in the case of a single-user persistent application.

Or you can share it across devices. And for that, you can use the MultipeerConnectivity framework, which has great features like automatic nearby-device discovery, and allows efficient communication of data between devices. We also have an example of how to use that with ARKit, called Creating a Multiuser AR Experience, which you can check out on our developer website.
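
A sketch of the save path described above; mapSaveURL is a placeholder for a file URL you choose (sending the same data over MultipeerConnectivity works the same way):

    import ARKit
    import Foundation

    func saveWorldMap(from session: ARSession, to mapSaveURL: URL) {
        session.getCurrentWorldMap { worldMap, error in
            guard let worldMap = worldMap else {
                print("Could not get a World map:", error?.localizedDescription ?? "unknown error")
                return
            }
            do {
                // ARWorldMap supports secure coding, so it can be serialized with
                // NSKeyedArchiver and written to disk or sent to a nearby peer.
                let data = try NSKeyedArchiver.archivedData(withRootObject: worldMap,
                                                            requiringSecureCoding: true)
                try data.write(to: mapSaveURL, options: [.atomic])
            } catch {
                print("Saving the World map failed:", error)
            }
        }
    }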

On the receiving device, once you've got the World map, let's see how you can set up the World Tracking configuration to use it. Very simple. You just set the initialWorldMap property to that World map. When you run the session, the system will try to find that previous World map.

But it may take some time, because the user may not be pointing at the same scene as before. So, how do we know when localization happens? That information is available in the tracking state. So, as soon as you start the session with the initial World map, the tracking state will be limited with reason relocalizing. Note that you will still get tracking data here, but the world origin will be the first camera, just like a new session.

As soon as the user points the device at the same scene, the system will localize. The tracking state will go to normal and the world origin will be the same as in the recorded World map. At this point, all your previous anchors are also available in your session, so you can put back the virtual content.
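
Putting the receiving side together, a sketch might look like this, whether the archived data comes from disk or from a peer:

    import ARKit
    import Foundation

    func runSession(_ session: ARSession, withWorldMapData data: Data) throws {
        // Deserialize the World map that was archived on the other device.
        guard let worldMap = try NSKeyedUnarchiver.unarchivedObject(ofClass: ARWorldMap.self,
                                                                    from: data) else {
            return
        }
        let configuration = ARWorldTrackingConfiguration()
        // Ask World Tracking to relocalize to the previous map.
        configuration.initialWorldMap = worldMap
        session.run(configuration, options: [.resetTracking, .removeExistingAnchors])
        // While relocalizing, the tracking state is limited with the relocalizing
        // reason; once it returns to normal, the world origin and the previous
        // anchors match the recorded map and the virtual content can be restored.
    }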

Note here that because what's happening behind the scenes is that we're matching those feature points, there needs to be enough visual similarity between the scene where you acquired the World map and the scene where you want to relocalize. So, if you go back to this table at night, chances are it's not going to work very well. And that was how you can create multi-user or persistent experiences using the saving and loading of maps. Next, image tracking.

So, augmented reality is all about adding virtual content on top of the physical world. And in the physical world, images are found everywhere. Think about pieces of art hanging on walls around the world, magazine covers, advertisements. Image tracking is a tool that allows you to recognize those physical images and build augmented reality experiences around them.

Let's see an example. You can see here two images being tracked simultaneously. On the left, a beautiful elephant is put on top of the physical image of the elephant. On the right, the physical image is turned into a virtual screen. Note also that the images can freely move around the environment, as tracking runs at 60 frames per second.

Let's take a look at what's happening behind the scenes. So, let's say you have an image like this one of the elephant, and you want to find it in a scene like this. We're using grayscale for this. And the first step is pretty similar to what we do in tracking. So, we'll extract those interesting points from both the reference image and the current scene.

And then we try to go into the current scene and match those features to the ones in the reference image. By applying some projective geometry and linear algebra, this is enough to give an initial estimate of the position and orientation of the image with respect to the current scene.

But we don't stop here. In order to give you a really precise pose and track at 60 frames per second, we then do a dense tracking stage. So, with that initial estimate, we take the pixels from the current scene and warp them back into a rectangular shape, like you see on the top right there. So, that's a reconstructed image obtained by warping the pixels of the current image into the rectangle.

We can then compare the reconstructed image with the reference image that we have available, to create an error image like the one you see below. We then optimize the position and orientation of the image such that that error is minimized. So, what this means for you is that the pose will be really accurate, and it will still track at 60 frames per second.

So, let's see how we can do all of this in ARKit. As usual, the ARKit API is really simple. We have three simple steps. First, we want to collect all the reference images. Then, we set up the AR session configuration. There are two options here. One is the World Tracking configuration, which also gives the device position. And this is the one we have talked about so far. And in iOS 12, we introduced a new configuration, which is a standalone image tracking configuration.

Once you start the session, you will start receiving the results in the form of an ARImageAnchor. We're now going into more detail on these three steps, starting from the reference images. The easiest way to add reference images to your application is through the Xcode asset catalog. You simply create an AR Resource Group and drag and drop your images in there. Next, you have to set the physical dimensions of the image, which you can do in the property window on the top right.

Setting the physical dimensions is a requirement, and there are a few reasons for that. First, it allows the pose of the image to be in physical scale, which means your content will also be in physical scale. In ARKit, everything is in meters, so your virtual content will also be in meters. In addition, it's especially important to set the correct physical dimensions of the image in case you combine image tracking with World Tracking, as this will immediately give a consistent pose between the image and the world. Let's see some examples of these reference images.

You can see here two beautiful images. These images will work really well with image tracking. They have high texture, a high level of contrast, well-distributed histograms, and they do not contain repetitive structures. There are also other kinds of images that will work less well with the system.

You can see an example of this on the right. And if we take a look at these top two examples, you can see that in the good image we have a lot of those interesting points, and you can see that the histogram is well distributed across the whole range. While in the snow image, there are only a few of those interesting points and the histogram is all skewed toward the whites.

You can get an estimation of how good an image will be directly in Xcode. As soon as you drag an image in there, the image is analyzed and problems are reported to you in the form of warnings to give you early feedback, even before you run your application.

For example, if you click on this bottom image, which could be a magazine page, you can see that Xcode says that the histogram is not well distributed. In fact, you can see there's a lot of white in the image. It will also say that this image contains repetitive structures, mainly caused by the text.

Another example: if you have two images which are too similar and are at risk of being confused at detection time, Xcode also warns you about that. You can see an example with these two images of the same mountain range, the Sierra. There are a few things that we can do to deal with these warnings. For example, let's go back to this image that had repetitive structures and a poorly distributed histogram.

You can try to identify a region of this image which is distinctive enough, like in this case, for example, the actual picture on the page. And then you can crop that out and use it as the reference image instead. With this, of course, all the warnings are going to be removed, and it will give you better tracking quality.

Another thing that we can do is use multiple AR Resource Groups. This allows many more images to be detected, as we recommend having a maximum of 25 images per group to keep your experience efficient and responsive. But you can have as many groups as you want. And then you can switch between groups programmatically. For example, if you want to create an augmented reality experience in a museum that may have hundreds of images.

Usually, though, those images are actually physically located in different rooms. So, what you can do is put the images that will physically be present in one room into a group, and the images of another room into another group. And then use, for example, Core Location to switch between rooms.

Note also that you can now have similar images, as long as they are in different groups. So, that was all about reference images. Let's now see our two configurations. The ARImageTrackingConfiguration is a new standalone image tracking configuration, which means it doesn't run the World Tracking. Which also means there is no world origin. So, every image will be given to you with respect to the current camera view.

You can also combine image tracking with a World Tracking configuration. And in this case, you will have all the scene understanding capabilities available, like Plane Detection, light estimation, everything else. So, when is it more appropriate to use which configuration? Let's see. The ARImageTrackingConfiguration is really tailored for use cases which are built around images. We can see an example on the left here.

We can have an image that could be a page of a textbook. And to make the experience more engaging, we are overlaying a dynamic graphic, in this case, of how to build an equilateral triangle. So, you can see that this experience is really tailored around an image. Let's see this other example: image tracking is used to trigger some content that then goes beyond the extent of the image. In this case, you want to use the ARWorldTrackingConfiguration, as you will need the device position to keep track of that content outside the image.

Also, note that the image tracking configuration doesn't use the motion data, which means it can also be used on a bus or in an elevator, where the motion data doesn't agree with the visual data. So, let's see now how we can do this in code. You can easily recognize those three steps here. The first one is to gather all the images. And there's a convenience function for that in the ARReferenceImage class, which gathers all the images that are in a particular group. In this case, it's named Room1.

We can then simply set the trackingImages property of the ARImageTrackingConfiguration to those images, and run the session. You will then start receiving the results, for example, in the session(_:didUpdate:) anchors delegate method, where you can check if the anchor is of type ARImageAnchor. In the anchor, you will find, of course, the position and orientation of the image, as well as the reference image itself, where you can find, for example, the name of the image as you named it in the asset catalog, so that you know which image has been detected. There's also a Boolean property which tells you if this image is currently being tracked in the frame.
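
Putting those three steps together, a sketch might look like this, with Room1 being the group name from the example:

    import ARKit

    func runImageTracking(on session: ARSession) {
        // 1. Gather the reference images from the asset catalog group.
        guard let images = ARReferenceImage.referenceImages(inGroupNamed: "Room1",
                                                            bundle: nil) else { return }
        // 2. Set up the standalone image tracking configuration.
        let configuration = ARImageTrackingConfiguration()
        configuration.trackingImages = images
        session.run(configuration)
    }

    // 3. Receive the results, for example in your ARSessionDelegate:
    func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
        for case let imageAnchor as ARImageAnchor in anchors {
            // The name of the image as it appears in the asset catalog.
            let name = imageAnchor.referenceImage.name ?? "unnamed"
            // Whether the image is currently visible and being tracked.
            print(name, "tracked:", imageAnchor.isTracked)
        }
    }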

Note here that, beyond the use cases we have seen so far where you build experiences around images, image detection and tracking allows a few more things. For example, if two devices are looking at the same physical image, you can detect this image from both devices. And this will give you a shared coordinate system, which you can then use as an alternative way to have a shared experience.

Another example: if you happen to know where an image is physically located in the world, like, for example, you know where the map of this park is in the physical world, you can use image tracking to get the position of the device with respect to the image and, therefore, also the position of the device with respect to the world. Which you can then use, for example, to overlay directions really attached to the physical world. So, that concludes the image tracking. Let's now go and look at Object Detection.

So, with image tracking we have seen how we can detect images, which are planar objects in the physical world. Object detection extends this concept to the third dimension, allowing the detection of more generic objects. Note, though, that these objects are assumed to be static in the scene, unlike images, which can move around. We can see an example here. That's the Nefertiti bust, a statue that could be present in a museum. And now you can detect it with ARKit, and then, for example, display some information on top of the physical object.

Note also that with object detection in ARKit, we are talking about specific instances of an object. So, we're not talking about detecting statues in general, but this particular instance of the Nefertiti statue. So, how do we represent these objects in ARKit? You first need to scan the object.

So, really, there are two steps to it. First, you scan the object, and then you can detect it. Let's talk about the scanning part, which is mostly going to be on your side as a developer, to basically create that representation of the object that can be used for detection.

Internally, an object is represented in a similar way as the World map. You can see an example of the 3D feature points of the Nefertiti statue there on the left. And to scan the object, you can use the Scanning and Detecting 3D Objects developer sample that's available on the website. And note here that the detection quality that you will get at runtime later is highly affected by the quality of the scan. So, let's spend a few moments to see how we can get the best quality during the scanning.

Once you build and run this developer sample, you will see something like this on your device. The first step is to define the region of space around your object. The application will try to automatically estimate this bounding box, exploiting the detected feature points. But you can always adjust this box by dragging on a side to shrink it or make it larger.

Note here that what is really important is that when you go around the object, you make sure that you don't cut off any of the interesting points of the object. You can also rotate the box with a two-finger gesture from the top. So, make sure that this box is around the object and not cutting off any interesting part of it.

The next part is the actual scanning. In this phase, what we want to do is really go around the object from all the points of view from which you think your users will want to detect it later. To make it easy for you to understand which parts of the object have already been acquired, there is this beautiful tile representation. And you can also see a percentage on top, which tells you how many tiles have already been acquired.

And it's really important in this phase that you spend time on the regions of the object which have a lot of features that are distinctive enough, and that you go close enough to capture all the details. And again, that you really go around from all sides, like you see here.

Once you're happy with the coverage of your object, you can go to the next step, which allows you to adjust the origin by simply dragging on the coordinate system. And this will be the coordinate system that will later be given to you at detection time in the anchor. So, make sure that you put it in a place which makes sense for your virtual content. So, at this point, you have a full representation of your object, which you can use for detection. And the application will now switch to a detection mode.

We encourage you to use this mode to get early feedback about the detection quality. So, you may want to go around the object from different points of view and verify that the object is detected from all these different points of view. You can point your device away, come back from another angle, and make sure that the scan was good enough to detect the object. You can also move the object around so that the lighting conditions will be different, and you want to make sure that it is still detected. This is particularly important for objects like toys, where you don't know where they're actually going to be physically located.

We also suggest that you take the object and put it in a completely different environment and still make sure that it is detected. In case it is not detected, you may want to go back to the scanning and make sure that your environment is well lit. A well-lit environment during the scanning is very important. If you have a lux meter, about 500 lux will be best. And if that is still not enough, you may want to keep different versions of the scans.

So, at this point, once you're happy with the detection quality, you can simply transfer the model to your Mac and add it to an AR Resource Group, just like you did for the images. Also note that there are some objects that will work really well with this system, objects like you can see on the left. First of all, they are rigid objects, and they are also rich in texture and distinctive enough.

But there are also certain kinds of objects that will not work well with the system. You can see an example of this on the right. For example, metallic or reflective objects will not work. Transparent objects, like objects made of glass, will also not work, because the appearance of these objects really depends on where they are in the scene.

So, that was how to scan the objects. Again, make sure that you have a well-lit environment. Let's now see how we can detect these in ARKit. If this looks familiar to you, it's because the API is pretty similar to the one for images. We have a convenience method to gather all the objects in a group. This time it's in the ARReferenceObject class. And to configure your ARWorldTrackingConfiguration, you simply pass these objects to the detectionObjects property.

Once you run the session, again, you will find your results. And in this case, you want to check for the ARObjectAnchor, which will give you the position and orientation of the object with respect to the world, and also the name of the object as it was given in the asset catalog.
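
A sketch of that, mirroring the image tracking code; Gallery is a placeholder group name:

    import ARKit

    func runObjectDetection(on session: ARSession) {
        // Gather the scanned reference objects from the asset catalog group.
        guard let objects = ARReferenceObject.referenceObjects(inGroupNamed: "Gallery",
                                                               bundle: nil) else { return }
        let configuration = ARWorldTrackingConfiguration()
        configuration.detectionObjects = objects
        session.run(configuration)
    }

    // In your ARSessionDelegate:
    func session(_ session: ARSession, didAdd anchors: [ARAnchor]) {
        for case let objectAnchor as ARObjectAnchor in anchors {
            // The object's pose is given with respect to the world origin,
            // together with the name set in the asset catalog.
            print("Detected:", objectAnchor.referenceObject.name ?? "unnamed", objectAnchor.transform)
        }
    }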

So, you may have noticed some similarities between object detection and world map relocalization. But there are also a few differences. In the case of object detection, we always give the object's position with respect to the world, while in world map relocalization, it is the camera itself that adjusts to the previous World map. In addition, you can detect multiple objects. And object detection works best for objects which are tabletop- or furniture-sized, while the World map is really the whole scene that's been acquired. With this insight, we conclude the object detection. Let's summarize what you have seen today.

Orientation Tracking tracks only the rotation of the device and can be used to explore spherical virtual environments. World Tracking is the fully featured position and orientation tracking, which will give you the device position with respect to a world origin, and enables all the scene understanding capabilities like Plane Detection, which lets you interact with the physical horizontal and vertical planes, where you can then put virtual objects.

We have seen how you can create persistent or multi-user experiences with the saving and loading maps feature in ARKit 2, how you can detect physical images and track them at 60 frames per second with image tracking, and how you can detect more generic objects with object detection.

And with this, I really hope you now have a better understanding of all the different tracking technologies that are present in ARKit, how they work behind the scenes, and how you can get the best quality out of them. And we're really looking forward to seeing what you are going to do with that.

More information can be found at the session link on the developer website. And we have an ARKit lab tomorrow at 9 a.m. Both Marion and I will be there, answering any questions on ARKit you may have. And with that, thank you very much and enjoy the bash.

[ Applause ]