Tools • 51:14
Scaling the performance of your application to get the most out of a multi-core Mac requires an understanding of the various tools and libraries available at your disposal. Intel engineers will demonstrate, with the help of a real-world use case, how you can employ Intel's Threading Building Blocks and Intel Integrated Performance Primitives to achieve sophisticated and scalable applications on Mac OS X.
Speakers: Phil Kerly, Pallavi Mehrotra, Justin Landon, Mohamed Ahmad
Unlisted on Apple Developer site
Transcript
Good morning. I'm Phil Kerly, and I'm from Intel's Software and Solutions Group. Today I'm going to be talking about performance optimization techniques using Intel libraries. This is actually going to build on the topic presented previously by James. I hope that you had a chance to listen to that session as well, because this fits right into his one, two, and three recommendations: use libraries that are already optimized for threading, use OpenMP, and then build on top of that with things like the Intel Threading Building Blocks.
So I had an opportunity, this is actually my fourth WWDC, the third one that I've had an opportunity to talk at. Last year, I talked in quite a bit of detail on P-Threads, OpenMP, and the Intel Threading Building Blocks. And we actually had a demo last year that was a little bit far-fetched.
It was more just a demo of the different techniques. And this year, I thought, you know, we ought to look at something that's a little bit closer to the real world. Something that, you know, kind of addresses what James talked about in terms of looking at increasing complexity, functionality, and quality of software.
So we're going to talk about a face tracker application, which is using the open computer vision library, which Intel has open sourced. We're going to look at the existing threading model. The interesting thing about the existing threading model is I figured, you know, Intel released OpenCV as an open source application or a library to support computer vision.
It released the face tracker application, and I thought, wow, this will be a great opportunity to look at its threading model and see how it scales on the Mac Pro up to eight cores. And then we're going to look at how we can actually improve that using the Intel Threading Building Blocks. So the existing application doesn't use TBB, and so we're going to look at how we can actually improve that, and then we'll wrap that up with a summary.
So first of all, OpenCV is an open-source computer vision library. It's a collection of algorithms and sample applications; Face Tracker just happens to be one of them. You can actually download the source code, and prebuilt libraries are available for Mac OS X, Linux, as well as Microsoft Windows. So it's cross-platform supported.
It is released under the Intel License Agreement for Open Source Computer Vision Library. You can go to the website and find out exactly what that license is, but it's pretty much free use for both private as well as commercial use. And there are references on the opencv.org site; again, you can go and check out the actual details of OpenCV. What OpenCV actually targets is the computer vision domain. At the base of it is object identification.
Being able to recognize different objects. That can be extended to facial recognition. If you can do facial recognition and you can do motion tracking, then you can start looking at gesture recognition and can improve on the human-computer interaction. So this is an area where I think there's a lot of opportunity for parallelism, and there's a lot of opportunity to take advantage of the processing power that is available on the Intel multi-core platforms.
First of all, before we get too far into it, we might as well just give you a little demo that actually shows you what the Face Tracker application looks like, and you can begin to see what we're talking about. So Pallavi's joining me on stage, and she'll take over for the demo. Well, hello, everyone.
I'm also an engineer with the Software Solutions Group back at Intel, and I'm going to demo the open-source version of the Face Tracker application, which you can download from the opencv.org site. So back to the demo machine. So this demo, which I'm going to show you, is going to take input from a camera, which we have here. And let me just run it here.
So as you can see, it's taking input from the camera. It's me who you see on screen. And it's tracking my face as I'm moving across the screen. So this is the original version. And look at the chart, the CPU utilization on the right side. As you can see, it's only about 100%, out of a possible 800%, on this 8-core machine with 3.2 gigahertz processors. And back to Phil.
We're going to switch back to the slides. So this application is fairly straightforward. If you download the executable and look at the source code, there's basically a face tracker executable. It includes or takes advantage of the QuickTime framework, and it's using the OpenCV library for object detection. Again, the application itself is not threaded, per se.
It's strictly a sequential, continuing loop, which basically does image acquisition. We call the face or object detection, where we pass a heuristic that looks for a particular pattern to identify faces. The object detection library returns a rectangle, or coordinates, around the face back to the face detection routine, which returns that back to us. We filter that, identify where it's located, put a little circle around it, and display it. And if you go download the actual source code, there's a short delay before it actually loops around.
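For illustration, here's a minimal sketch of that kind of serial loop using the OpenCV 1.x C API. The cascade file, window handling, and parameters are assumptions for the sketch, not the shipping Face Tracker source, which acquires and displays frames through QuickTime.

    #include <cv.h>
    #include <highgui.h>

    int main() {
        CvCapture* capture = cvCaptureFromCAM(0);               // iSight camera
        CvHaarClassifierCascade* cascade = (CvHaarClassifierCascade*)
            cvLoad("haarcascade_frontalface_default.xml", 0, 0, 0);
        CvMemStorage* storage = cvCreateMemStorage(0);
        cvNamedWindow("Face Tracker", 1);

        while (cvWaitKey(10) < 0) {                              // fixed ~10 ms delay each pass
            IplImage* frame = cvQueryFrame(capture);             // image acquisition
            if (!frame) break;
            cvClearMemStorage(storage);
            CvSeq* faces = cvHaarDetectObjects(                  // heuristic face/object detection
                frame, cascade, storage, 1.1, 3, 0, cvSize(30, 30));
            for (int i = 0; i < (faces ? faces->total : 0); ++i) {
                CvRect* r = (CvRect*)cvGetSeqElem(faces, i);     // rectangle around each face
                CvPoint c = cvPoint(r->x + r->width / 2, r->y + r->height / 2);
                cvCircle(frame, c, r->width / 2, CV_RGB(0, 255, 0), 3, 8, 0);
            }
            cvShowImage("Face Tracker", frame);                  // display
        }
        cvReleaseCapture(&capture);
        return 0;
    }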
There's no code in there that tries to do any kind of real-time pacing, so it's not trying to schedule frames at 30 frames a second. But as you can see, the CPU utilization was very low. And in fact, if you noticed, it really wasn't all that great in terms of performance. There tended to be a little bit of hesitation and drag in this particular application.
Now, because Intel released OpenCV, it was designed to work with the Intel Integrated Performance Primitives library. And this library supports a number of domains: cryptography, small matrices, image processing, signal processing, and more. That's really why the OpenCV library was released the way it was: to show the benefit of using the Intel Performance Libraries instead of writing or rolling your own version of these particular routines.
So how's it actually implemented? Within the OpenCV library, there's a master array of function pointers. The way it's implemented is that all of the routines that are needed to do the object detection are actually within the CV library, so you can actually get the source code to that.
If it doesn't detect the Intel Performance Libraries on your system, which would be in the default location for frameworks within Mac OS X, it'll just point all the pointers directly at the original functions that were written. But if you have the libraries installed, it'll actually go and load those modules, do a search on the names of the functions that are available, find the ones that it's interested in, and just change the function pointers to point into the Intel Performance Libraries.
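A rough sketch of that dispatch-table idea using dlopen/dlsym; the table layout, symbol name, and path handling here are illustrative, not OpenCV's actual internals.

    #include <dlfcn.h>
    #include <stddef.h>

    // Plain-C reference implementation used when IPP is not installed.
    typedef void (*erode_fn)(const unsigned char* src, unsigned char* dst, int len);
    static void erode_c_reference(const unsigned char* src, unsigned char* dst, int len) {
        for (int i = 0; i < len; ++i) dst[i] = src[i];           // placeholder body
    }

    // Master table: every accelerated routine has a function-pointer slot.
    static erode_fn erode_impl = erode_c_reference;

    struct dispatch_entry { const char* ipp_symbol; void** slot; };
    static dispatch_entry dispatch_table[] = {
        { "ippiErode_8u_C1R", (void**)&erode_impl },             // illustrative symbol name
        // ... one entry per routine the library wants to accelerate ...
    };

    static void try_switch_to_ipp(const char* path) {
        void* ipp = dlopen(path, RTLD_LAZY);                     // e.g. the IPP dylib's install path
        if (!ipp) return;                                        // not installed: keep C fallbacks
        for (size_t i = 0; i < sizeof dispatch_table / sizeof dispatch_table[0]; ++i) {
            void* sym = dlsym(ipp, dispatch_table[i].ipp_symbol);
            if (sym) *dispatch_table[i].slot = sym;              // repoint into the IPP library
        }
    }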
This is what it would actually look like when you're all done, if you have the Intel Performance Primitives installed, and we'll go ahead and access it. One of the things that we noticed when you actually ran the routine is that it's using the iSight camera.
And so one of the problems with doing performance optimization is kind of having a workload that's repeatable so that you know that when you make an improvement to the application that you've actually improved it and that something in your data hasn't actually changed that gives you false positives or false negatives.
So one of the things that we did is we actually replaced the code that used the iSight camera with code that pulls in a QuickTime video file. And it's actually very easy using the QuickTime framework, so it was just a couple of lines of code different in order to implement that.
The other thing we wanted to do is that the code doesn't really tell you exactly how fast anything is running. It's got this timing loop that's in there that basically just has a fixed delay. And so what we wanted to do was actually have a way that we can actually measure the performance improvement from the optimizations that are being actually made to the application. And so we added a little bit of timing code at the very bottom of it.
And we wanted it to run full throttle. So right now, there's like a 10 millisecond delay at the end of the routine. We really wanted to look at this from a full throttle performance standpoint to maximize the actual performance. And then, when we're done, we can go back and implement a real-time synchronization routine to make it so that it runs at 30 frames a second.
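A minimal sketch of that timing change and the full-throttle loop, assuming gettimeofday(); the helper and loop names here are ours, not the original source.

    #include <sys/time.h>
    #include <stdio.h>

    static double now_seconds(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    static bool process_one_frame() { return false; }   // stand-in for decode + detect + display

    int main() {
        double start = now_seconds();
        int frames = 0;
        while (process_one_frame()) ++frames;            // no fixed delay: run full throttle
        double elapsed = now_seconds() - start;
        printf("%d frames in %.2f s (%.1f fps)\n", frames, elapsed, frames / elapsed);
        return 0;
    }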
So these are the changes that we made, from a very high level. Again, we changed the image acquisition to a file decode, we eliminated the delay, and we added some timing at the end. And so we're using QuickTime. We have a little funny video that Pallavi is going to show us in the demo, and we'll see where that takes us.
Okay, so Phil talked about having a way to measure repeatable performance, having some kind of metric to see how fast we are running, and thirdly, eliminating any delays that are there. So I'm going to show you the version which I showed you before, but it's a little different. It now takes input from a video stream instead of the iSight camera, so that we can establish a baseline of how fast we are running.
We have this funny video running. It's tracking the face. It's going to complete pretty soon. Look at the CPU utilization on the right. It's again only about 100%. We are not utilizing the majority of the eight cores that we have on hand. And it's going to be done soon.
Okay, so there were about 309 frames in the stream, and it took about 24 seconds to complete the video. I'm sure we can do better. Phil talked about using performance libraries and such, but before we do that, why don't we see what a Shark profile looks like? It should load very soon. So looking at the Shark profile, the very first hot spot that we have is coming from the OpenCV libraries that we are utilizing here, and I'm sure we can do something about it.
Phil mentioned that OpenCV supports the Intel Performance Libraries, so next I'm going to show you a version where we actually are using the Intel Performance Libraries and see what kind of performance gains we can get with that. So keep in mind that we ran 309 frames in about 24 seconds. We had about 100% CPU utilization, using probably just one core out of the eight, and we are spending about 76% of the time in the OpenCV libraries. So I'm going to quit out of here.
So this is the IPP version, where now the OpenCV calls are being replaced by the Intel Performance Libraries. Look at the CPU utilization on the right. We are using close to all eight cores that we have on hand. And let's see how much time it takes. Oh, it took only about 15 seconds to complete the same workload. So we are about 50% faster, about a 1.5x speedup. Let's see what the profile looks like.
As you can see, we don't see that 76% of time being spent in the OpenCV libraries. Instead, we are calling some of the IPP libraries, if I expand this. Plus, you notice we see something else. We see some of the KMP calls here, which are coming from OpenMP, since the Intel Performance Libraries use OpenMP to implement their threading. And that's it for this demo. Phil is going to come back and talk more about OpenMP, and then we're going to have some more demos following it.
So I think it's great that the Intel Performance Libraries produced about a 50% improvement in performance. But I was really shocked when I saw that we had an eight-core machine and we're running almost 100% across all eight cores. So we got from about 24, 25 seconds down to 15 seconds, but it took us eight times the CPU processing power.
So, the reality is that the Intel Performance Libraries are also using OpenMP. And we'll talk a little bit about what those additional functions were actually doing and why. But OpenMP is a specification which was targeted at symmetric multiprocessing systems. So, it's really targeted for these types of processors where you have shared memory.
And it really is focused on encouraging you to write threaded code incrementally. So, one of the things that James Reinders talked about was the fact that whenever you write code, that in addition to having parallel code or threaded code, you would want to be able to go back to the single-threaded code. Well, OpenMP actually supports that paradigm. You write it initially in a straight serial fashion. And then you describe to it exactly how you want to do the threading.
So, you can actually go to the OpenMP website and look up the specifics on it. But the interesting thing is that both Apple's GCC 4.2 compiler and Intel's C/C++ and Fortran compilers support OpenMP. So, there's a lot of advantage there. And the Intel compilers are supported across Mac OS X as well as Linux and Windows.
So, what is OpenMP? It's really a set of programming directives. So you describe how you want your single-threaded implementation to run in parallel. There are some support routines that allow you to get additional information, so you can query things like how many CPUs are actually in your system.
You can set how many threads you actually want to use if you don't want the default, which would be the number of cores that are available. There are also environment variables, so when your application starts up, it will look at the environment variables. Again, you can control the number of threads, you can control the scheduling paradigm that is used, and the specification allows vendor extensions.
So let's take a real quick look at what the code looks like in OpenCV for the object detection. It's basically a simple for loop. It breaks up the image into a number of strips across the image and basically applies a heuristic for object detection across each of those strips. If you had a single core, the number of strips would be one. You would just actually do the full image.
But if you had multiple cores, you could actually break up the strips and have one per core or even more than one per core, depending on how you wanted to implement your code. So a simple way to do this in OpenMP is you could just add the pragma: #pragma omp parallel for.
And if you compiled it with a compiler that supported the OpenMP pragmas, this code would actually become threaded. In fact, that's what you see when you look at the code. So we went from single-threaded performance to eight threads actually running 100% on the CPU.
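Roughly what that stripe loop looks like with the pragma added; the names here (stripe_count, detect_in_stripe) are illustrative, not OpenCV's actual identifiers.

    #include <vector>

    // Illustrative stand-in: each "stripe" is one horizontal band of the image.
    static void detect_in_stripe(int stripe, std::vector<int>& hits) {
        hits[stripe] = 1;   // placeholder for the real per-stripe detection heuristic
    }

    void detect_all_stripes(int stripe_count) {
        std::vector<int> hits(stripe_count, 0);
        // With a compiler that supports OpenMP, these iterations are split across threads;
        // without OpenMP, the pragma is ignored and the loop just runs serially.
        #pragma omp parallel for
        for (int stripe = 0; stripe < stripe_count; ++stripe)
            detect_in_stripe(stripe, hits);
        // Implicit barrier here: every stripe is finished before execution continues.
    }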
Now, you can extend these pragmas so that you can specify the number of threads that are actually used. There are some cases where you don't necessarily want your application to use all of the cores on your system, or there are times when you actually want to have more threads running than you have physical cores available.
So if you have a thread that's going to go out and do some synchronization across the network and there's going to be a fair amount of delay, you might actually want to run 100 of these threads, and you can do that with OpenMP if you like. The default, if you don't specify this, is that it's going to launch the same number of threads as you have cores available on your platform.
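A sketch of those controls, using the standard OpenMP query routine and the num_threads clause; the loop bodies are placeholders.

    #include <omp.h>
    #include <stdio.h>

    int main() {
        printf("cores available: %d\n", omp_get_num_procs());    // query the system

        // Cap this region at 4 threads even on an 8-core machine.
        #pragma omp parallel for num_threads(4)
        for (int i = 0; i < 16; ++i) { /* compute-bound work */ }

        // Oversubscribe when the work mostly waits (network, disk, etc.).
        #pragma omp parallel for num_threads(100)
        for (int i = 0; i < 100; ++i) { /* latency-bound work */ }

        return 0;
    }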
You can go further in terms of scheduling how you want your threads to be able to interact with the workload in terms of dividing up the workload. I'm going to talk a little bit more about how you can influence the scheduling and why you would want to do that.
So, if you didn't define anything at all in that pragma, and you just left it blank with the first statement, #pragma omp parallel for, you would basically get static scheduling. Now, depending on how many strips you actually had in your image and how you wanted to define that, this is essentially what you would have: each of these rows between the serial code would be an operation on a frame, so it's frame-based.
And you would divide up the number of stripes that you're actually going to operate on and assign those to threads. So that's static scheduling. It says: if I have 16 stripes and only four threads, I'm just going to divide them up. The first thread is going to get 0 through 3, the next one 4 through 7, and so on.
And that would be static scheduling. So it's already fixed by the framework. The OpenMP framework has already decided exactly how the threads are going to operate on those particular strips of the image. And that's great, because there's really no synchronization. The synchronization only happens at the end, when all of the threads are done. There's no coordination between any of the threads on which stripe they're actually gonna work on. It's predetermined before they even start.
The problem is, what happens if that workload is not exactly balanced? So, in this particular case, maybe a slight exaggeration, but suppose that the image was such that you had a very white background that was very uniform at the very top of the image, but a lot more detail and a group shot of people actually in an image. There would be a lot more processing that would need to take place on the lower part of the image to detect the face.
So, you could actually have this broken up such that, for a given image, one thread takes a lot longer than the threads working on the rest of the image. And so, you would not want static scheduling in this case. You would want something more dynamic. Now, you can actually change it a little bit and still have static scheduling by basically doing round robin.
So, you can change the chunk size. I told you that if we had 16 stripes in the image and we broke that up across four threads, the first thread would get zero through three. Well, maybe those are the first four stripes on the bottom that have a lot of detail, and the top part of the image, which is all white, would be the last four. Those would still go really fast, but if you round-robined it such that the first thread got stripe zero and the second one got stripe one, then you kind of even out the workload a little bit.
The problem is that you can still end up with an imbalanced workload, but not nearly as severe. But again, we can improve on that. So, what we can do is we can actually say dynamic. It says that each thread works on one stripe, and only when it's complete does it go back and actually get the next one.
Now, the problem with this is that you have a little bit more overhead, because now every time a thread finishes, it has to do some synchronization with the framework to say, okay, what is the next stripe that I have to work on? Is it one, or is it two, or is it number 16? But you get a more balanced view. So, as long as the workload per stripe is fairly large versus the synchronization overhead, dynamic actually works pretty well. And that's actually what's been implemented in the OpenCV library.
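In pragma form, the three scheduling variants just described; work_on_stripe is a placeholder.

    static void work_on_stripe(int s) { /* per-stripe detection */ }

    void scheduling_examples() {
        #pragma omp parallel for schedule(static)     // contiguous blocks, fixed up front
        for (int s = 0; s < 16; ++s) work_on_stripe(s);

        #pragma omp parallel for schedule(static, 1)  // round-robin: with 4 threads, thread 0 gets 0, 4, 8, 12
        for (int s = 0; s < 16; ++s) work_on_stripe(s);

        #pragma omp parallel for schedule(dynamic)    // each thread grabs the next stripe as it finishes one
        for (int s = 0; s < 16; ++s) work_on_stripe(s);
    }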
So, I told you that the Intel Performance Libraries support OpenMP as well, because we saw that in the Shark profiles. So, the problem that we have is that we have our main application, which has no threading. We're using the QuickTime framework, which has its own implementation, whatever that may be, that Apple has provided. Then we have the object detection OpenCV library framework, which is using OpenMP.
So every time you call into that library, if you don't give it any parameters, it's going to invoke as many threads as there are cores on the system. But the libraries that we load from the Intel Performance Libraries are independent modules. When they get initialized, they are also using OpenMP, and so they're also creating as many threads as there are cores.
When we actually looked at the Shark profile, we saw that the KMP fork was like 20% or 25% of the actual workload. It turns out that one of the things OpenMP does is it likes to not have threads be context switched out, in case the synchronization objects they need become available immediately or soon after. So the threads actually enter spin-waits, and that's what you're actually seeing: spin-waits. But the problem with spin-waits is that as long as you're holding on to the CPU, nothing else can actually get scheduled.
So one of the things is, Intel has a vendor extension environment variable, KMP_BLOCKTIME, which allows us to change the spin-wait block time. And in this case, we wanted to see what the real performance improvement was, or what the real CPU utilization was, so we can actually set it to zero. And that will eliminate the OpenMP spin-waits. So we'll go ahead and switch to the demo again and show you what the performance looks like with OpenMP and that setting at zero.
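One way to do that from inside the application, assuming the Intel OpenMP runtime reads KMP_BLOCKTIME when it initializes; setting it in the shell before launching works just as well, and Intel's runtime also has a kmp_set_blocktime() call. The run_face_tracker name is hypothetical.

    #include <stdlib.h>

    static void run_face_tracker() { /* hypothetical: the rest of the application */ }

    int main() {
        // Must be set before the first parallel region executes, i.e. before the
        // OpenMP runtime reads its environment. 0 = sleep immediately, no spin-wait.
        setenv("KMP_BLOCKTIME", "0", 1);
        run_face_tracker();
        return 0;
    }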
Okay, so in the previous demo, we utilized Intel's Performance Libraries, which also use OpenMP, and we showed you the Shark profile. Now, in this demo, as Phil mentioned, the OpenCV libraries also use OpenMP. So we're going to run a version with that and also show you the Shark profile, and we're going to see the same kind of calls showing up in the profile. And then I'll follow it with another version of the demo where we'll set KMP_BLOCKTIME to zero and see how it changes our profile as well as our CPU utilization. So let's run the first version. This is where OpenCV is running with OpenMP enabled.
So keep an eye on the CPU utilization. It's the same funny video, the bobblehead video. Look, we are utilizing close to all the cores and, yeah, we are tracking the face and so on. But I'm sure we can do better. And in order to do better, let's first eliminate all the KMP wait calls that we were seeing. But before that, let's see how the profile looks for this version.
You see the CV calls are there, which we had seen in the original demo, but we also see the KMP calls. And all these are actually coming from the OpenMP spin-waits, which Phil just talked about. So, next let's set KMP_BLOCKTIME to zero, like Phil mentioned, and see how it changes.
So I'm going to keep this in the background. And also notice we are running at about 10 seconds, which is good. I mean, we started with about 24 seconds. And now, instead of using Intel's Performance Libraries, we are utilizing the OpenMP threading in OpenCV itself. So we did good.
So now, this version is where KMP_BLOCKTIME has been set to zero. And it's the same thing which I ran just previously. Look at the CPU utilization. It's only about, and I'll run it one more time, just 400 or 500 percent. Not the 700 or close to 800 percent which we were seeing before.
You can see a lot of the random circles, and I'm going to talk about that and how we actually utilize the multi-core feature of the Intel platform and solve that with other threading implementations. Before I turn over to Phil, let me show you the profile for this version.
So as you can see, if you compare with the previous profile, all the KMP fork, barrier, pause, and yield calls are gone from the top 5, 10, or 15 hotspots. And that's due to setting KMP_BLOCKTIME to zero. So like I said, as far as the random circles we see, I'm going to talk about it later. Back to the slides and Phil. Thanks.
So one of the things that we saw in that demo was, first of all, when we set the spin-wait time to zero, I don't know if you noticed, we actually went from like 10.7 seconds to over 11 seconds. So there is benefit from those spin-waits, right, in terms of actual performance. But the problem with seeing those spin-waits when you're trying to do performance analysis is the fact that it hides the potential. You don't know exactly where you're spending your time.
If you just looked at the CPU utilization, you would have been up at 800%. When you actually look at the final result with the spin-waits turned off, it was much lower. It was down probably around, I don't know, 300% or 400%. So now that we got rid of all that extra spin-waiting, we can clearly see that there's a lot more processing power available to take advantage of and actually improve performance. The video obviously goes a lot faster, but we are also introducing a few false positives, and we'll talk about that a little bit further. But what we decided to do was actually implement a pipeline approach.
This is one of the things that James talked about earlier, was changing the kind of the paradigm on how you look at the actual problem. The original code was very serial, and we relied on the library to provide us with the parallelism. But the reality is that if you really look at this serial code, you can break it up into a file decode step, or task, as James called it, the face detection.
So once you actually have the image, you can do the face detection on the image. You don't have to worry about file decode at that point. And then the display task. So that's why we decided to come up with the pipeline approach, where each of those steps are actually separate tasks, and they're all operating or delivering a frame.
And that's instead of dividing up the frame into stripes. So the other thing that we decided to do, instead of operating on strips of the image, so that we can get rid of a lot of the overhead, was to say, hey, once we have the image from the file decode, then we can start the face detection on that whole frame immediately. And these blocks are fairly representative of the time that it actually takes to do each of these steps. So if you look at file decode and display combined, that's about a third of the total processing time across file decode, face detection, and display.
If we can eliminate the face detection time, or get it reduced down, what's left is the file decode and the display, which are basically serial operations. There's not much that we can do about those from our application standpoint. Now, QuickTime may actually be threaded and provide some additional benefit. But from our perspective, out of QuickTime we're basically getting a frame at a time. We can't get a second frame until we get the first frame. We can't display the second frame until we've displayed the first frame.
So those are serial operations. But clearly, the face detection itself can be a parallel operation. And instead of doing this by stripes, in other words making the workload smaller for each thread, we're basically giving each thread the full workload of an image and allowing that to be very localized. And there's a lot of benefit to that as well.
Now, each frame that's having face detection, all of that data that's associated with that thread can run on that same CPU, can take advantage of all the cache that's available for that. And there's not a whole lot of data migration from one core to another or from one cache to another.
So, how did we actually implement this? We used the Intel Threading Building Blocks. It really targets more complex threading models than what you can get with P-threads, certainly, which really don't provide you much support other than the ability to implement your own model. But also over OpenMP, which is much more data-decomposition oriented with the parallel for; it does allow for some functional decomposition, but it doesn't give you the higher-level threading paradigms that you would really like. And in this case, we actually used the linear pipeline with filter stages to implement that.
You can also do things like parallel scan, parallel sort. You can do the parallel for, which is what's available on OpenMP, as well as reduction and parallel while implementations. Now, the Threading Building Blocks are optimized libraries for the Intel-based platforms, again, that are available on the Mac, Linux, and Windows.
They're not limited to the Intel compiler because it's actually available in source. They are C++, so if you're using Fortran, you don't have TBB available because we're actually using templates to implement it. But it is cross-platform, and the best benefit is the fact that, unlike last year, TBB is now open-sourced. So you can actually download it, compile it yourself, modify it, implement it, and use it as you see fit under the GPL license.
So, we saw those extra rings that were showing up. And so the idea was that, hey, we're only using, you know, three or four cores out of the eight-core system. Why not add more value to the end user? And the end user, you know, is going to see those additional false positives. So what we decided to do was, let's take advantage of that and add a frame detection history.
So, in other words, as we get these frames or these images in, and we've detected these regions that we think are faces, we'll actually keep a history of that. And if we see from frame to frame that we have a fairly consistent view, that that face has shown up multiple times through that history, then we keep the circle. But if it's a flash, if it only shows up one time or in just a couple of frames and isn't there consistently, then we treat it as a false positive.
And so we can actually remove that. And we actually implemented, again, some more stages, which could be done in parallel. So as soon as we added the frame detection history, we determine what that history is. Then on each frame, we can decide whether we actually want to keep the circle or not keep the circle, and then display it.
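A hedged sketch of what such a history filter could look like; the rectangle type, window length, and overlap test here are our own illustration, not the demo's exact code.

    #include <deque>
    #include <vector>

    struct Box { int x, y, w, h; };

    static bool overlaps(const Box& a, const Box& b) {
        return a.x < b.x + b.w && b.x < a.x + a.w &&
               a.y < b.y + b.h && b.y < a.y + a.h;
    }

    class DetectionHistory {
        std::deque< std::vector<Box> > past;   // detections from the last few frames
        static const size_t kFrames = 5;       // history window (illustrative)
    public:
        void push(const std::vector<Box>& detections) {
            past.push_back(detections);
            if (past.size() > kFrames) past.pop_front();
        }
        // Keep a circle only if a nearby detection shows up in most recent frames;
        // a one-frame flash gets treated as a false positive and dropped.
        bool confirmed(const Box& d) const {
            size_t hits = 0;
            for (size_t f = 0; f < past.size(); ++f)
                for (size_t i = 0; i < past[f].size(); ++i)
                    if (overlaps(d, past[f][i])) { ++hits; break; }
            return hits * 2 >= past.size();    // present in at least half the window
        }
    };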
But we can actually go further. If you look at what Pallavi showed before when she was using the video camera, when she turned sideways, the circle went away. So what we were really doing is only using the heuristic for frontal facial recognition. Only if you were basically looking straight ahead would we actually detect the face. But there's also a heuristic in OpenCV which allows you to do profile face detection.
So if you turn sideways, it'll actually show up as well. And so we decided that, hey, that's an end-user benefit. Why not add both frontal as well as profile detection, add some quality filtering to the capability, and then display that? So that's what we've done, and we'll let Pallavi show you the result.
Okay, so this version which I'm going to show you is where we implemented a pipeline approach using the Threading Building Blocks. And I'm going to show you a part of that code and how it looks. Phil talked about the various filters and the various stages. And in the previous talk, James talked about how one of the ways to increase scaling with parallelism is by increasing your workload or making it more complex or value-added, as Phil mentioned.
So not only are we increasing the scalability, we are doing the parallelism using the pipeline approach, where we end up hiding some of the latencies of the more serial or slower stages of the pipeline, but we are also going to add more and more stages for improving quality or just increasing features, like adding side-face detection in addition to the front face as an additional stage of the pipeline. So I'm going to first show you some of that code and then I'm going to demo how it looks.
So, as mentioned earlier, we are working at a higher, more abstract layer rather than going into the nitty-gritty details of writing and managing your native threads. This is TBB code, and see how clean it looks. We start with, let's say, initializing the task scheduler for TBB.
And we follow it by just adding the various stages of the pipeline, starting with the input filter, where we are reading from the video file, followed by the face detection filter, where we are detecting the front face, followed by the side face detection, and then adding some more quality filters for maybe removing some of those random circles, which we were seeing as false positives.
And then we end up with the drawing stage and the output stage. And finally, we have this run call, where we actually get the TBB pipeline going. So as you can see, it looks pretty neat, pretty clean, pretty well encapsulated; the various stages can be implemented very cleanly.
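A minimal sketch along those lines, assuming the TBB pipeline API of that era (tbb::pipeline, tbb::filter, tbb::task_scheduler_init); Frame and the helper functions stand in for the QuickTime decode and OpenCV detection calls.

    #include "tbb/task_scheduler_init.h"
    #include "tbb/pipeline.h"
    #include <cstddef>

    struct Frame { /* decoded image plus detection results */ };

    Frame* decode_next_frame() { return NULL; }   // stand-in: QuickTime file decode
    void detect_front_faces(Frame*) {}            // stand-in: OpenCV frontal cascade
    void draw_and_display(Frame*) {}              // stand-in: circles + output

    class InputFilter : public tbb::filter {
    public:
        InputFilter() : tbb::filter(/*is_serial=*/true) {}        // frames must be read in order
        void* operator()(void*) { return decode_next_frame(); }   // returning NULL ends the pipeline
    };

    class FaceDetectFilter : public tbb::filter {
    public:
        FaceDetectFilter() : tbb::filter(/*is_serial=*/false) {}  // detection can run in parallel
        void* operator()(void* item) {
            detect_front_faces(static_cast<Frame*>(item));
            return item;
        }
    };

    class OutputFilter : public tbb::filter {
    public:
        OutputFilter() : tbb::filter(/*is_serial=*/true) {}       // display stays serial and ordered
        void* operator()(void* item) {
            draw_and_display(static_cast<Frame*>(item));
            return NULL;
        }
    };

    int main() {
        tbb::task_scheduler_init init;   // default: one worker thread per core
        tbb::pipeline pipe;
        InputFilter in; FaceDetectFilter detect; OutputFilter out;
        pipe.add_filter(in);
        pipe.add_filter(detect);         // side-face and quality filters would slot in here
        pipe.add_filter(out);
        pipe.run(8);                     // at most 8 frames in flight at once
        pipe.clear();
        return 0;
    }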
So I'm going to now demo the TBB pipeline version. And this will include the additional feature of detecting the side faces, as well as a quality feature where it's removing some of the random circles which we saw. And again, we are reading from a video file. Look at the CPU utilization. We have many more features added for quality and functionality. And we actually were down to like 300% or 400% utilization. And now we are back to using most of the cores, but this time with additional functionality. I'm going to run it one more time.
The red circles are for the side faces, and the green ones are for the front faces. You also saw some of the blue ones, and those are where we detected both a side face and a front face, just to show you how you can add more functionality. Also look at the time it took to complete those 408 frames: it's only about seven seconds, so you can see a vast improvement in your speedup for the video.
So we started this whole presentation by demoing the original version, where we were taking input from the camera. Then we switched to reading from a video file so that we can get more repeatable performance and some performance metrics, and we were trying to run it full throttle, you know, as fast as we could.
And now I'm going to show you a version where we go back to where we started. Now we are again going to take input from the iSight camera. We are going to run real-time and see how we do. And again, this is the TBB pipeline version, but this time taking input from the camera.
So I'm like sideways, probably it's reading, it's trying to see, maybe detected a front face and a side face, so it's looking blue. If I look forward... It kind of switches to green in the middle, but I think it's also detecting it as a side face in addition to a front face. That's why it's looking blue.
It's running pretty fast. It's running real-time. Look at the CPU utilization. We are probably about, what, 500, 400%. So as you can see, we can utilize Intel's Threading Building Blocks to probably get the same kind of performance speedups in terms of CPU utilization and overall time. Okay, I'm out of the camera, so I'm going to do it again. That's pretty much it for this video. Phil? Thank you.
So I don't know if you noticed, there were green circles, there were blue circles, and there were red circles. What that was really trying to highlight is the fact that the green circles were the frontal face detection. Now, the profile heuristic could also pick up some frontal views as well.
And so what we decided to do was, in addition to drawing the green circles, if both heuristics detected the face, then we would make it blue. But if it was only the side profile, then it would turn red. So at the very end, when Pallavi was actually looking a little bit more sideways, you could see the red circles showing up as well.
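The color choice boils down to something like this; the flags and the helper are hypothetical, using OpenCV's C drawing API.

    #include <cv.h>

    // Hypothetical helper: the frontal/profile flags come from the two heuristics.
    void draw_detection(IplImage* frame, CvPoint center, int radius,
                        bool frontal, bool profile) {
        CvScalar color;
        if (frontal && profile)  color = CV_RGB(0, 0, 255);   // blue: both heuristics agree
        else if (frontal)        color = CV_RGB(0, 255, 0);   // green: frontal only
        else                     color = CV_RGB(255, 0, 0);   // red: profile only
        cvCircle(frame, center, radius, color, 3, 8, 0);
    }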
So, one of the things that was interesting is we also got rid of the spin-waits, which, other than improving task or context switching between threads, meant that by getting rid of that and improving the quality and adding features, we were down to somewhere around seven seconds in terms of performance. So, it was actually the best performance.
So, in summary, by adding IPP, we definitely got an improvement, about a 50% overall performance improvement. And again, that's from a single-threaded perspective. Keep in mind that the video decode and the display were not threaded at all by IPP. So, we already started with only part of the total work being accelerated; we were only focused on the face detection part.
With OpenMP, we saw that we got scaling from 1P to 8P, and that was about a 2x improvement with OpenCV and the Intel Performance Libraries using OpenMP. Now, the problem is not that OpenMP was bad, right? That's the wrong message. If you look at it from a library perspective, the only thing the library has is an image to work on.
So their point of view was that they would take the image and subdivide it in terms of how they were going to do the parallelism. And they did see performance gains. The other thing is that if you looked at this on a dual-core or a quad-core system, you would never notice all of the performance improvement that you can get on an 8-core, because you're already doing fairly well with OpenMP and IPP. But 2x to 2.5x on an eight-core system is actually fairly disappointing with just OpenMP.
But by introducing TBB and up-leveling the problem and actually focusing on, instead of data decomposition of an image, we actually up-leveled to the image level, we were able to actually improve the real-time performance on the facial detection. We could add filtering and profiling to get rid of the false positives. You could add additional profile detection and additional features.
And that's not the end of it. I mean, we were probably up around 5x in terms of performance. We were already at about 600% CPU utilization. But clearly, you can't do that on a quad-core system. And there's still room for improvement in terms of capability. And the nice thing about the pipeline, it was very easy to add stages. Each of those stages, actually, in terms of implementation, was probably done in a half a day from the original code.
Because we were able to basically cut and paste the original code into the filter stages and then just add those lines to the pipeline. And what you really want to do is focus on the fact that if you're targeting an eight-core system, you're adding value for the people that have eight-core systems.
But at some point, you're also going to be marketing to people who only have a dual-core or a single-core. So you want to be able to dial back the features and functionality. And one way to do that is simply to not add those stages to the actual implementation. So in summary, first of all, develop for the high-end quad-core platforms.
And then test and reduce the amount of functionality that you're actually going to provide to the lower end. You want to be ahead of the curve. You have platforms already out there for eight-way. James already talked about, you know, the vision of Intel adding more and more quad-cores. Or adding more cores in general.
Eight cores are here today, and most applications don't take advantage of that yet. 16, 32, 40, 100: the vision's already there within Intel. And if you're not developing on the high end, you can't actually see, or think about, or get the benefit of being able to innovate on those higher-end platforms.
It's not just about data decomposition. It's not just about taking your problem today and making it faster. It's really about increasing the complexity. So in terms of images, making the images larger. In terms of video, it's taking simple video codecs into the high-definition range. Adding functionality, things like the filtering capability or side profile detection.
I mean, at some point, you could even envision that on an 8 or a 16 way that you would move to facial recognition, that you would actually be able to look up in a database and extract the image and do a compare to see, you know, if that image matches somebody in your database. And of course, the improving quality. So high-definition video, the filtering that we did to remove those false positives. So use the available compute resources that you have to really innovate and improve the end-user experience. Thank you.
There's a lot of new techniques that are out there. Some of them are newer than others. Certainly, OpenMP has been around for quite a while, but TBB is also available, and that's actually open source. So that's something that you can take and utilize in however you see fit going forward. Certainly, Apple's innovating in the same region, moving that threading paradigm higher and higher so that you're focused on the task and not at the implementation or the details of managing your individual threads.
You have to take advantage of those capabilities. The benefit of doing that is as processors get better and better and hardware innovation improves and various curveballs get thrown at you from a NUMA architecture perspective, those threading libraries are going to be evolving and taking advantage of that. They're going to be optimized for those particular changes in the architecture going forward. If you're using those libraries, you're going to get the benefit when those things become available.
And for me, you know, here was a prime example of an application that Intel released as open source, built on OpenCV using the Intel Performance Primitives. While it was very much focused on image processing and added value at that level, the reality is that as more threading becomes implemented, not only in your own applications but in other libraries that you might be using, there's going to be conflict, right? So libraries need to allow you to control how threading happens, even at the library level. And as you move into those areas of multi-threading, think about abstracting. Don't just take the benefit that you get from the libraries for threading, but start thinking about your own applications and how you can abstract that threading higher into your own code.