Core OS • iOS • 30:47
Introduction to the Accelerate framework APIs, which consist of state-of-the-art signal processing, matrix and linear algebra computation, and now image processing and array-based elementary math functions.
Speakers: Steve Canon, Luke Chang
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it may contain transcription errors.
So good morning, everyone. My name's Steve Canon. I'm a senior engineer in the Vector Numerics Group. And today I'm going to be talking to you about the Accelerate framework for iOS. So just a little bit of introduction. What is the Accelerate framework? We like to think of it as being sort of one-stop shopping for high-performance computational libraries. What kind of libraries?
We introduced it on iOS in iOS 4, and we introduced three libraries there. The first was vDSP, which is a digital signal processing library. It can do everything from basic operations on vectors, adding them and multiplying them and things like that, on up to one-dimensional convolutions and fast Fourier transforms.
The second library that we introduced in iOS 4 is called BLAS, which is an industry standard library of basic linear algebra routines. And that includes things like matrix-matrix operations and products between matrices and vectors, all that kind of stuff. And the last one was LAPACK, which stands for Linear Algebra Package. That's also an industry standard library.
And LAPACK provides sort of high-level linear algebra operations. If you want to solve a system of equations or that kind of thing, then LAPACK is what you want to use. On the Mac, where we've had Accelerate for a very long time, we also have a few other things. We have vImage, which is an image processing framework, and we have vForce, which is a vector math library. It provides the same kind of operations that you find in the system math library, but on vectors of data instead of on scalar data. And today, we're bringing all of those same libraries to iOS 5. So we're adding vImage, and we're adding vForce.
So my goals for this session are to introduce you to the new-to-iOS components of the Accelerate framework. Those of you who are primarily iOS developers and don't have a lot of experience on the Mac may be asking, what's in these new libraries? So we're going to give you a broad overview of what's there. The other thing we want to do is spend a little bit of time talking about the improvements we've made to the stuff that was already there, the components that we had in iOS 4. And what I really want to do is help you identify places in your code where you might be able to use the components we provide in the Accelerate framework.
So let's dive right in. We're going to start with vImage, which, as I mentioned, is new in iOS 5. vImage is the vectorized image processing framework. We introduced it on the Mac in Mac OS X 10.3, and it's proved to be pretty popular there. In fact, it's used in six out of seven of the top-grossing apps on the Mac App Store. We've had a lot of requests to bring it to iOS, so we're really excited that we have it there now.
So just to start out, for those of you who aren't familiar with it, I'm going to give you a small example. We'll start by looking at convolution, which is one of the most important and also one of the most complicated operations in vImage. It forms the core of a lot of common image processing tasks. Convolution is basically a weighted average of nearby pixels: for each pixel in an image, you compute a new pixel based on the pixels that surround the one you're interested in.
And it gets used to do blur, sharpen, edge detection, emboss filters. Lots and lots of image processing algorithms are fundamentally convolutions. So just to explain a little more what I mean: if we were going to evaluate the convolution on that center pixel in the image, what we do is we have a matrix of weights, called a kernel, and we multiply that by the window around that pixel. So here we have a 3 by 3 kernel, so we take a 3 by 3 window around the pixel that we're interested in. We multiply each weight by the corresponding pixel, and then we sum those all up, and that gives us a new color in the output image. Now here, there were some white pixels and some darker purple pixels, so when we mix them together, we get a lighter purple color. And then we just repeat this process across the whole image, and you get something like this. This is a small blur kernel, so you can see it's smoothed a hard edge from dark purple to white into a smoother transition. That's very basic convolution. Now, you could write a convolution yourself. It's not terribly hard. This is about the simplest implementation of a convolution that you could write.
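The slide with the code isn't in the transcript, but the "simplest implementation you could write" described here looks roughly like this sketch: a single-channel 3x3 convolution over interior pixels only, with integer kernel weights scaled by a divisor. The function name and parameters are illustrative, not from the session.

```c
#include <stddef.h>

/* Naive 3x3 convolution on a single-channel 8-bit image, interior pixels
   only. Each output pixel is the weighted sum of the 3x3 window around it,
   divided by `divisor` and clamped to 0..255. Edge pixels are simply
   skipped, which is exactly the kind of shortcut that causes the problems
   discussed next. */
static void convolve3x3(const unsigned char *src, unsigned char *dst,
                        size_t width, size_t height,
                        const int kernel[9], int divisor) {
    for (size_t y = 1; y + 1 < height; y++) {
        for (size_t x = 1; x + 1 < width; x++) {
            int acc = 0;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    acc += kernel[(ky + 1) * 3 + (kx + 1)] *
                           src[(y + ky) * width + (x + kx)];
            acc /= divisor;
            if (acc < 0)   acc = 0;    /* clamp so the result stays */
            if (acc > 255) acc = 255;  /* inside the 8-bit range    */
            dst[y * width + x] = (unsigned char)acc;
        }
    }
}
```

With an all-ones kernel and a divisor of 9, this is the small box blur from the slide.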
But you really shouldn't do this. So why not? Well, that simple example I showed you has a lot of problems. It doesn't handle the edges properly. When you look at a pixel that's close to the edge of an image, the kernel is actually going to hang off the side of the image, and you need to handle that. That results in a lot of edge-handling code, which is kind of a pain. It doesn't handle integer overflow. That's a basic thing, but it's something you have to deal with. And more to the point, and what I care about, it's really slow. Now, you're all capable developers, and I'm sure you could fix this up, but once you take care of all these things and write a really good convolution, you're going to have hundreds, maybe even thousands of lines of code. So instead of doing that, we already did it for you.
So instead of writing thousands of lines of code, you just write a couple of lines and call vImageConvolve_ARGB8888. This is really nice. It's much simpler than writing your own convolution. But even better than being simpler, it's a lot faster. I mentioned that vImage is vectorized. Compared to high-quality scalar code, calling the vImage convolve is about 14 times faster. And not only is it faster, it also uses less energy. So if we look at this, again, this is a 7 by 7 convolution on a 1024 by 768 image, running on the iPad 2.
And it uses about a tenth the energy to use the vImage convolve as it does to use a good, high-quality scalar implementation. So this is really nice. It means your battery is going to last longer, and your users can use your app longer, which is great. That's good for you. That means they like your app and they get to play with it.
So that's a brief introduction to why you should use vImage. Three good points: it's simple, it's fast, and it uses less energy. So what else do we provide? We provide a big variety of operations, and we're going to walk through each of the categories and see roughly what's there. We just talked about convolution; just to give you a little bit more about that. As I mentioned, it's the weighted average of nearby pixels. We also let you supply a bias that can be applied to it, which is often useful for certain kernels. And we allow you to use a different set of weights for each color channel. I mentioned the problem of edges before; we give you a bunch of different ways to handle them. You can use the background color. You can do edge extend, where you basically treat the edge pixels as though they went off to infinity. We also have a mode called truncate, and a sort of do-nothing mode, where you just assume the data is there.
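The edge-extend mode described above has a simple mental model: any coordinate that falls outside the image is clamped to the nearest valid one. This is a sketch of that indexing rule only, with hypothetical helper names; it is not how vImage is implemented internally.

```c
/* Clamp an index into the valid range [0, n). */
static long clamp_index(long i, long n) {
    if (i < 0)  return 0;
    if (i >= n) return n - 1;
    return i;
}

/* Sample a single-channel image with edge-extend semantics: out-of-bounds
   reads repeat the nearest border pixel, as if the edges went off to
   infinity. */
static unsigned char sample_extend(const unsigned char *src,
                                   long width, long height, long x, long y) {
    return src[clamp_index(y, height) * width + clamp_index(x, width)];
}
```

A convolution written on top of `sample_extend` never reads out of bounds, which is why this mode needs no special casing at the borders.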
And then we also have geometry operations. These are pretty self-explanatory, but we have some pictures here to help out. We have rotation operations. We have shearing operations. You can reduce and enlarge your image. I should mention that a lot of these geometry operations need to do resampling. We use Lanczos resampling, so you're going to get really good, high-quality output images from this. Not only is it fast, but you get really nice image quality. We also provide affine warp, which is just a general affine transformation applied to the image, and we have reflect, which is very straightforward.
We have transformations. These operate on the color space of the image, so they operate pixel by pixel and do some transform to the colors. We have matrix multiplications, which you can use for color space conversions, like if you want to move from RGB to YUV or HSV. You can also use them to directly manipulate the hue or the saturation of an image, or to just twist the color space. We also have gamma correction operations, and we have fast polynomial and rational evaluators. Since you can approximate any function well with a polynomial or rational approximation, this lets you do just about anything you want to the color space of the image. So these are just a few examples of the sorts of things you can do with the transform operations, which are basically color space operations.
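To make the matrix-multiply idea concrete, here is a scalar sketch of a pixel-by-pixel color transform: each RGB pixel is multiplied by a 3x3 matrix. The function is illustrative only; vImage's matrix-multiply routines do the same job vectorized, with fixed-point weights.

```c
#include <stddef.h>

/* Apply a 3x3 color matrix to every RGB pixel. With the right matrix this
   does RGB-to-YUV conversion, desaturation, hue twists, and so on. */
static void apply_color_matrix(const float *rgb_in, float *rgb_out,
                               size_t pixel_count, const float m[3][3]) {
    for (size_t i = 0; i < pixel_count; i++) {
        const float *p = rgb_in + 3 * i;
        float *q = rgb_out + 3 * i;
        for (int r = 0; r < 3; r++)
            q[r] = m[r][0] * p[0] + m[r][1] * p[1] + m[r][2] * p[2];
    }
}
```

For example, a matrix whose every row is the standard luminance weights (0.299, 0.587, 0.114) converts the image to grayscale.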
We have morphology operations. These are kind of cool. Morphology operations basically give you a way to sort of grow either the highlights or the shadows of an image. Min and max grow them sort of in rectangular ways. Erode and dilate apply a probe that you supply, which can be an arbitrary shape. And you can do some really cool tricks like this. Like here we're going to use a star-shaped probe to sort of grow the highlights into stars in this image. You can do things that are a little less gaudy than this with it, too, but this is kind of a cool example.
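The "grow the highlights in rectangular ways" behavior of max is easy to sketch: each output pixel becomes the brightest pixel in a rectangular window around it. This is a toy single-channel version for interior pixels only, not vImage's tuned implementation; erode/dilate generalize the window to an arbitrary probe shape.

```c
/* Max filter (morphological dilation with a rectangular probe): each output
   pixel is the maximum of the (2r+1) x (2r+1) window around it, which grows
   the bright regions of the image. Interior pixels only. */
static void max_filter(const unsigned char *src, unsigned char *dst,
                       long w, long h, long r) {
    for (long y = r; y < h - r; y++) {
        for (long x = r; x < w - r; x++) {
            unsigned char best = 0;
            for (long dy = -r; dy <= r; dy++)
                for (long dx = -r; dx <= r; dx++) {
                    unsigned char v = src[(y + dy) * w + (x + dx)];
                    if (v > best) best = v;
                }
            dst[y * w + x] = best;
        }
    }
}
```

Min is the same loop with the comparison flipped, and it grows the shadows instead.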
Another cool thing is histograms. We have a really nice set of histogram functions, which you can use to calculate histograms, but you can also use to perform a lot of operations on the image. For example, you can take one image and sort of infuse it with the color palette of another image. That's kind of a cool trick, which I like a lot.
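The building block underneath all of those tricks is the plain histogram itself: count how many pixels take each 8-bit value. A minimal sketch; vImage's histogram functions layer operations like equalization and specification (the palette-transfer trick mentioned above) on top of this.

```c
#include <stddef.h>

/* Compute a 256-bin histogram of 8-bit image data. */
static void histogram8(const unsigned char *src, size_t n,
                       unsigned long bins[256]) {
    for (int i = 0; i < 256; i++) bins[i] = 0;
    for (size_t i = 0; i < n; i++) bins[src[i]]++;
}
```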
And the last category of image operations we have is called alpha operations; alpha is basically another word for transparency. These give you a lot of ways to work with the alpha channels of images, all the standard operations you might want to have. So what data types do we support? There are two main core formats. We support 8-bit unsigned integer with four channels per pixel, and a four-channel-per-pixel, 32-bit floating-point format.
Within those data types, we have a couple of data layouts that we support. We have interleaved layouts, where you have the first pixel's alpha channel, red channel, green channel, and blue channel stored in some order, then the second pixel's channels, then the third pixel's channels. We also have planar layouts, where you have all the red channel image data, then all the green channel image data, then all the blue channel image data. Sometimes planar formats are really nice to work with, so that's an important thing. And we provide conversions between all of these core formats. So we have conversions from planar to interleaved. We have conversions between 8-bit and floating point. And we have conversion operations that just let you swap around channel orders. So if you have an RGBA image and you need a BGRA image or whatever, we give you ways to do that.
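The interleaved-to-planar conversion described here is mechanically simple, which a short sketch makes obvious; the function name and channel order below are illustrative, and vImage provides tuned routines for these conversions.

```c
#include <stddef.h>

/* Split interleaved ARGB8888 data (A,R,G,B, A,R,G,B, ...) into four planar
   buffers, one per channel. */
static void argb_to_planar(const unsigned char *interleaved, size_t pixels,
                           unsigned char *a, unsigned char *r,
                           unsigned char *g, unsigned char *b) {
    for (size_t i = 0; i < pixels; i++) {
        a[i] = interleaved[4 * i + 0];
        r[i] = interleaved[4 * i + 1];
        g[i] = interleaved[4 * i + 2];
        b[i] = interleaved[4 * i + 3];
    }
}
```

Swapping channel orders (RGBA to BGRA and so on) is the same kind of loop with the indices permuted.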
Now, we also support a lot of other data formats as storage formats. So we have conversions that take most common image formats and convert from those into one of the core formats, and then also from the core formats back to the storage formats that you're going to want. So when you're working with image data that's not in a core format, your workflow usually looks like this: you convert to a core format, you do a bunch of processing in that core format, and then if you need to get back to one of the other formats at the end, you convert back after you do all your processing. You generally don't want to be going back and forth all the time. We have very lenient data requirements as far as alignment and things like that. 8-bit data only needs single-byte alignment, so you don't really need to worry about that at all. For the floating-point data types, we require four-byte alignment, which is just native float alignment. So there's no heavyweight vector alignment you have to satisfy. And we don't need containerized data. We're not passing around objects, none of that stuff. We basically want a pointer to your image data, how tall the image is, how wide the image is, and what the offset between consecutive rows of the image is. What this means is that we can usually operate on your image data in place. You don't need to copy it into some special structure for us to work on it. You just give us a pointer to your image, call the function, and you're fine. So this reduces our overhead a lot. And as I mentioned, vImage is vectorized, so we give you really good performance. Each function has some dispatching, so it will use the best implementation available for the hardware that it's running on. On the Mac, this means that we take advantage of SSE3, Supplemental SSE3, and SSE4.1 when they're available. And on iOS, we take advantage of NEON, which is the vector engine on ARM.
If you're running on the A5 processor in the iPad 2, then we'll also take advantage of some other things, like hardware half-precision to single-precision floating-point conversions, and a few other things like that. If you're not running on the A5, you'll fall back to a very fast software path, so you still get good performance. The thing I want to drive home here is that we take care of all these hardware-specific details behind the scenes so you don't have to worry about them. You write one code path, and you get fast performance across the supported devices.
vImage is designed to have low latency. We don't use a JIT; all the code is precompiled, so the execution time is fairly deterministic. There's none of this "the first time you call it, you have to wait several microseconds." You just call it. It's a normal function. Great.
And you can provide your own scratch space to avoid getting hidden malloc calls or anything like that. If you're going to provide your own scratch space, you need to first query the function that you're going to use to find out how much space is needed. You can use the kvImageGetTempBufferSize flag to do that. The way you use this flag is that you call the function exactly the same way that you're going to call it, with all the same arguments, except that the final argument to most vImage functions is a flags argument, and here you just add in the kvImageGetTempBufferSize flag, and the function returns the amount of scratch space that it requires. So typically, the way you use this is what's illustrated here: you get the amount of scratch space that's needed, you allocate that much space, maybe with your own custom allocator or something like that, and then you can apply the filter numerous times using that same scratch space. vImage is threaded using GCD, so on devices that have multiple processors, we take advantage of that. And again, this just happens without any involvement from you; you don't need to worry about it.
So as I said, you write your code once, and if there's more than one processor, you're going to get multiprocessor performance. If there's only one processor, that's fine too; you get good performance either way. And you can disable threading if you already have your own threading engine or your own tiling engine and you don't want us to conflict with that. You pass the kvImageDoNotTile flag to the functions, and then we just turn off tiling, and you can take care of all of that. So if you've already written some of your own image processing algorithms and you already have your own threading model, that's fine. You can still work with vImage. You can just swap us in for whatever routines you're already using internally. And our APIs also provide ways to handle edges that will support your tiling model. So if you look here, suppose you had already written your own tiling model and you had broken this image up into tiles. If you look at, say, tile 6: on the right-hand side of that tile, there's an edge. But on the top, there's no edge, and on the left, there's no edge. Maybe the bottom has a couple of pixels that are influenced by the edge. We give you ways to handle this, even in the face of tiling; you can specify which sides need edge handling. So that's really nice. And that's the quick introduction to vImage. At this point, I'd like to bring up my colleague, Luke Chang, who will tell you about vForce.
[Luke Chang]
All right, thank you, Steve. Hi, everybody. My name's Luke Chang. I'm an engineer in the Vector Numerics Group. Today I'm going to tell you about vForce. In the following eight to ten minutes, I'm going to tell you what vForce is, how to use vForce, and why you want to use vForce. So let's get started. What is vForce? We made vForce to satisfy your computational needs. We know the smartphone is not just a phone anymore.
There are tons of apps out there, and many of them process a lot of data. Don't even mention the iPad: doctors are using it, pilots are using it. So how do we process that data efficiently? This is where vForce comes into play. vForce has elementary math functions for arrays, so it handles array data efficiently.
If you're a Mac developer, you're probably familiar with it already; we introduced it in Mac OS X Tiger. Now we're bringing it over to iOS 5. Let's say you want to write an app to generate a sine wave, a signal generator. Here's one way you can do it in plain C. It's very simple and straightforward. You have an output buffer, and you have an input buffer of indices, because you probably want to generate a frequency-modulated sine wave instead of just a fixed-frequency sine wave. And then you write a for loop to calculate the sine value of each index and store the results to the output buffer. This is very simple; nothing could be simpler than that. But how can we use vForce for this? How can we improve this code? Here's how. First, you have to include the Accelerate header to have access to the Accelerate framework. And then you just have to replace your for loop with a single function call to vvsinf. vv is the prefix for vForce functions, followed by the functionality; in this case it's sinf, single-precision sine. It takes three arguments. The first argument is the output buffer. The second argument is the indices. And the third argument is a pointer to the array length. So vvsinf will do the same thing as the for loop, calculating the sine value of each index and storing the results to the output buffer.
I said it's doing the same thing; why did I say it's better? Well, let's look at a performance comparison. This chart shows how many sines you can calculate in one microsecond. As you can see, vForce is more than twice as fast as the for loop. This means your app will be more responsive; the users of your app don't have to wait as long to get their results. That's nice. But this is not the only thing vForce has to offer. Let's also look at the energy consumption. This is the energy comparison.
The chart shows how much energy is consumed per sine result. Again, vForce is more than twice as energy efficient as the for loop. This means your app will run longer. If you're writing apps for doctors or pilots, you really want your app to run longer. So we've seen that vvsinf is pretty good. What else is available in vForce? We have transcendental functions: power, sine, cosine.
We also have the rounding functions; all four rounding modes are supported. And there's a lot of other stuff, like square root and remainder, and if you dig into the vForce header, you'll find yourself pleasantly surprised by the nice functions we put in there. I'll talk about that in a bit, but first let me show you the performance of some of my favorite functions in vForce. For the first function, truncf, vForce is more than three times faster than a for loop. The other functions are about twice as fast. And I want to bring your attention to the last one, sincosf. What is sincosf? sincosf calculates sine and cosine at the same time for the same input.
Why do you need this? A lot of times when you calculate sine, you also need cosine. For example, if you're doing polar to Cartesian coordinate conversion, you need the sine value for the y-coordinate and the cosine value for the x-coordinate. This function will come in handy. So let's look at vForce in more detail.
vForce supports both single- and double-precision floating-point numbers. It handles edge cases correctly, so if your input has infinities or NaNs, or positive or negative zeros, you don't have to worry. Put your data into the input buffer; vForce will handle those for you. And again, there's a minimal alignment requirement, just like other Accelerate functions. We only require native data alignment. For single-precision floating-point numbers that's four-byte alignment, and for double-precision floating-point numbers it's eight-byte alignment. Nothing more than that, so you can concentrate on getting your data ready in the input buffer. And your results will be ready in the output buffer. So let's recap. vForce is fast. vForce is energy efficient. And it's quite easy to use; there aren't a lot of requirements you have to satisfy.
So how did we do it? We know that the ARM libm is already highly optimized on the iOS platform, so how do we make vForce even faster? Well, vForce has its own advantage: vForce operates on data from an array, so there are multiple data elements available. We can use the powerful vector engine, NEON, to generate multiple results at the same time. And we also did loop unrolling and software pipelining to minimize processor stalls, so the processor won't be waiting for the result of a previous instruction to feed into the next instruction. And we did it in assembly. The bottom line is, we did all the work so you don't have to. All you need to do, after the presentation, is check out the vForce header, see what's available in vForce, and integrate some of the vForce functions into your apps. You'll see the performance improvement yourself. So that's it for vForce. Thank you. Let me hand the presentation back to Steve.
Thanks, Luke. I just want to say, we really put a lot of effort into making a great math library on iOS, so the fact that vForce is able to go even faster is actually really quite impressive. So now, that was the quick overview of what we just introduced in iOS 5. I want to spend a little bit of time talking about my favorite thing in the Accelerate framework, which is LAPACK and BLAS. We've improved these tremendously for iOS 5. They're really nice. So I mentioned we improved them; how did we improve them? We improved the performance a lot.
So, LAPACK and BLAS, now, the A5 processor in the iPad 2 has a great double-precision floating-point unit. It's really fast, and we want to take advantage of that in LAPACK and BLAS to give much better performance than we've been able to give so far. How do you measure performance of these libraries? One of the most commonly used benchmarks is something called LINPACK. And this is a very old benchmark. It actually derives from an old linear algebra library called the LINPACK library.
And it basically measures how fast you can solve a system of linear equations. When people talk about the LINPACK benchmark, there's a lot of confusion, because it's not actually just one benchmark; it's three separate benchmarks that all go by that name. In the first one, you measure how fast you can solve a system of 100 equations using a reference implementation of the solver. Now, that reference implementation comes from, as I mentioned, the LINPACK library, and it's not tuned to the hardware.
It's nothing like that. The benchmark is formally in Fortran, so it would be Fortran reference code, but people have ported it to C and Java and a lot of other languages. Most of the time you see people talk about LINPACK benchmark numbers, this is the benchmark they're talking about in the mobile space.
When you see someone talk about a new device's LINPACK score or something like that, this is the kind of thing they're talking about: reference code running on a fairly small matrix. The second benchmark is 1,000 equations, and you get to use your own tuned implementation. You get to go ahead and write the fastest solver you can write and solve a system of a thousand equations. And the third benchmark that goes by this name is the one people are usually talking about in the high-performance computing world when they talk about LINPACK. If you hear about some new supercomputer that has, you know, a hundred petaflops or something, that's this benchmark.
This is sort of the no-holds-barred division of LINPACK. You get to choose your problem size. If your computer runs fastest on a 1.2 million by 1.2 million matrix, so be it. You get to use the fastest solver you can come up with. So that's LINPACK. And when you see people talk about LINPACK on mobile devices, you often see numbers like 35 megaflops or 50 megaflops; that kind of number gets tossed around a lot. The iPad 2, as I mentioned, has a nice processor, and the compiler does a pretty good job. So with the reference code, we do a little bit better than that, typically. We get about 90 megaflops on LINPACK.
I think this is basically criminal, because you've got this beautiful hardware, and you're going to run this reference code on it that doesn't take advantage of it at all, really. And so we want to see, you know, what can we get on LINPACK using Accelerate, using a tuned implementation, you know, actually take advantage of the hardware, make it as good as we can.
So on the iPad 2, using Accelerate, forget megaflops. We give you 1.5 gigaflops on LINPACK. That's more like it. I should note that not all of this work is in the seed; in the seed, you'll get about 1.3 gigaflops. When iOS 5 ships, you get the full 1.5. Even 1.3 is pretty good, though. That's, you know, twelve-plus times faster than the reference code. So I think that's great. And in fact, the iPad 2 would have been one of the fastest supercomputers in the world in 1994. How cool is that? Yeah, that's awesome, right?
Just like everything else in Accelerate, we give you this performance without you needing to write a lot of code. You don't have to be an expert in high-performance computing to get 1.5 gigaflops on the iPad 2. You include Accelerate, and you call two routines: the first factors the matrix. That's a standard, industry-standard LAPACK function. And the second solves the system using that factorization. And that's it. Three lines of code. That's great. And also, just like everything else we've talked about today and everything else in Accelerate, not only do you get great performance on the iPhone, the iPad, and the iPod touch, but if you decide you want to port your app to the Mac, or if you already have an app running on the Mac and you want to port it to iOS, the same code gives you great performance on all of those devices. So we scale from an iPhone with one processor, to an iPad with two processors, right up to a new MacBook Pro, where we give you 35 gigaflops.
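To illustrate the factor-then-solve pattern those two LAPACK routines follow, here is a toy dense solver: Gaussian elimination with partial pivoting, doing the factorization and the solve in one pass. This is purely an illustration of what the two calls accomplish; it is nothing like LAPACK's blocked, hardware-tuned implementation, and the function name is made up for this sketch.

```c
#include <math.h>

/* Solve the n x n system a*x = b in place. `a` is row-major and is
   destroyed; the solution overwrites `b`. Returns 0 on success, -1 if the
   matrix is (numerically) singular. */
static int solve_linear_system(double *a, double *b, int n) {
    for (int col = 0; col < n; col++) {
        /* Partial pivoting: pick the largest remaining entry in the column. */
        int piv = col;
        for (int r = col + 1; r < n; r++)
            if (fabs(a[r * n + col]) > fabs(a[piv * n + col])) piv = r;
        if (fabs(a[piv * n + col]) < 1e-12) return -1;
        if (piv != col) {
            for (int c = 0; c < n; c++) {
                double t = a[col * n + c];
                a[col * n + c] = a[piv * n + c];
                a[piv * n + c] = t;
            }
            double t = b[col]; b[col] = b[piv]; b[piv] = t;
        }
        /* Eliminate entries below the pivot. */
        for (int r = col + 1; r < n; r++) {
            double f = a[r * n + col] / a[col * n + col];
            for (int c = col; c < n; c++) a[r * n + c] -= f * a[col * n + c];
            b[r] -= f * b[col];
        }
    }
    /* Back substitution. */
    for (int r = n - 1; r >= 0; r--) {
        for (int c = r + 1; c < n; c++) b[r] -= a[r * n + c] * b[c];
        b[r] /= a[r * n + r];
    }
    return 0;
}
```

The LAPACK split (factor once, then solve) pays off when you solve many right-hand sides against the same matrix, since the expensive factorization is reused.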
And all the way up to a Mac Pro, where you get 80-something gigaflops of double-precision performance. So: three lines of code, portable across all these devices, with really nice performance on all of them. And that's what we want from Accelerate. We want to give you something that's easier to use than writing your own code, something that gives you great performance across all the hardware we provide, and something that saves energy, saves the battery, and makes your app last longer. If you want more information about any of this, we have a couple of people you can contact. There's documentation online; the vImage Programming Guide is really nice. And if you have questions, post them to the Apple Developer Forums, or just file a bug if something doesn't work the way you think it should.