Core OS • iOS • 30:47
Introduction to the Accelerate framework APIs, which consist of state-of-the-art signal processing, matrix and linear algebra computation, and now image processing and array-based elementary math functions.
Speakers: Steve Canon, Luke Chang
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it may contain transcription errors.
So good morning, everyone. My name's Steve Canon. I'm a senior engineer in the Vector Numerics Group. And today I'm going to be talking to you about the Accelerate framework for iOS. So just a little bit of introduction. What is the Accelerate framework? We like to think of it as being sort of one-stop shopping for high-performance computational libraries. What kind of libraries?
We introduced it on iOS in iOS 4, and we introduced three libraries there. The first was vDSP, which is a digital signal processing library. It can do everything from basic operations on vectors, adding them and multiplying them and things like that, on up to one-dimensional convolutions and fast Fourier transforms.
The second library that we introduced in iOS 4 is called BLAS, which is an industry standard library of basic linear algebra routines. And that includes things like matrix-matrix operations and products between matrices and vectors, all that kind of stuff. And the last one was LAPACK, which stands for Linear Algebra Package. That's also an industry standard library.
And LAPACK provides sort of high-level linear algebra operations. If you want to solve a system of equations or that kind of thing, then LAPACK is what you want to use. On the Mac, where we've had Accelerate for a very long time, we also have a few other things. We have vImage, which is an image processing framework, and we have vForce, which is a vector math library. It provides the same kind of operations that you find in the system math library, but on vectors of data instead of on scalar data. And today, we're bringing all of those same libraries to iOS 5. So we're adding vImage, and we're adding vForce.
So my goals for this session are to introduce you to the new-to-iOS components of the Accelerate framework. Those of you who are primarily iOS developers and don't have a lot of experience on the Mac may be asking, what's in these new libraries? So we're going to give you a broad overview of what's there. The other thing we want to do is spend a little bit of time talking about the improvements we've made to the stuff that was already there, the components that we had in iOS 4. And what I really want to do is help you identify places in your code where you might be able to use the components we provide in the Accelerate framework.
So let's dive right in. We're going to start with vImage, which, as I mentioned, is new in iOS 5. vImage is the vectorized image processing framework. We introduced it on the Mac in Mac OS X 10.3, and it's proved to be pretty popular there. In fact, it's used in six out of seven of the top-grossing apps on the Mac App Store. We've had a lot of requests to bring it to iOS, so we're really excited that we have it there now.
So just to start out, for those of you who aren't familiar with it, I'm going to give you a small example. We'll start by looking at convolution, which is one of the most important and also one of the most complicated operations in vImage. It forms the core of a lot of common image processing tasks. Convolution is basically a weighted average of nearby pixels: for each pixel in an image, you compute a new pixel based on the pixels that surround the one you're interested in.
And it gets used to do blur, sharpen, edge detection, emboss filters. Lots and lots of image processing algorithms are fundamentally convolutions. So just to explain a little more what I mean: if we were going to evaluate the convolution on that center pixel in the image, what we do is we have a matrix of weights, called a kernel, and we multiply that by the window around that pixel. So here we have a 3 by 3 kernel, so we take a 3 by 3 window around the pixel that we're interested in. We multiply each weight by the corresponding pixel, and then we sum those all up, and that gives us a new color in the output image. Now here, there were some white pixels and some darker purple pixels, so when we mix them together, we get a lighter purple color. And then we just repeat this process across the whole image, and you get something like this. This is a small blur kernel, so you can see it's smoothed a hard edge from dark purple to white into a smoother transition. That's very basic convolution. Now, you could write a convolution yourself. It's not terribly hard. This is about the simplest implementation of a convolution that you could write.
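The slide with the code isn't in the transcript, but the "simplest implementation you could write" described here looks roughly like this sketch: a single-channel 3x3 convolution over interior pixels only, with integer kernel weights scaled by a divisor. The function name and parameters are illustrative, not from the session.

```c
#include <stddef.h>

/* Naive 3x3 convolution on a single-channel 8-bit image, interior pixels
   only. Each output pixel is the weighted sum of the 3x3 window around it,
   divided by `divisor` and clamped to 0..255. Edge pixels are simply
   skipped, which is exactly the kind of shortcut that causes the problems
   discussed next. */
static void convolve3x3(const unsigned char *src, unsigned char *dst,
                        size_t width, size_t height,
                        const int kernel[9], int divisor) {
    for (size_t y = 1; y + 1 < height; y++) {
        for (size_t x = 1; x + 1 < width; x++) {
            int acc = 0;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    acc += kernel[(ky + 1) * 3 + (kx + 1)] *
                           src[(y + ky) * width + (x + kx)];
            acc /= divisor;
            if (acc < 0)   acc = 0;    /* clamp so the result stays */
            if (acc > 255) acc = 255;  /* inside the 8-bit range    */
            dst[y * width + x] = (unsigned char)acc;
        }
    }
}
```

With an all-ones kernel and a divisor of 9, this is the small box blur from the slide.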
But you really shouldn't do this. So why not? Well, that simple example I showed you has a lot of problems. It doesn't handle the edges properly. When you look at a pixel that's close to the edge of an image, the kernel is actually going to hang off the side of the image, and you need to handle that. That results in a lot of edge-handling code, which is kind of a pain. It doesn't handle integer overflow. That's a basic thing, but it's something you have to deal with. And more to the point, and what I care about, it's really slow. Now, you're all capable developers, and I'm sure you could fix this up, but once you take care of all these things and write a really good convolution, you're going to have hundreds, maybe even thousands of lines of code. So instead of doing that, we already did it for you.
So instead of writing thousands of lines of code, you just write a couple of lines and call vImageConvolve_ARGB8888. This is really nice. It's much simpler than writing your own convolution. But even better than being simpler, it's a lot faster. I mentioned that vImage is vectorized. Compared to high-quality scalar code, calling the vImage convolve is about 14 times faster. And not only is it faster, it also uses less energy. So if we look at this, again, this is a 7 by 7 convolution on a 1024 by 768 image, running on the iPad 2.
And it uses about a tenth the energy to use the vImage convolve as it does to use a good, high-quality scalar implementation. So this is really nice. It means your battery is going to last longer, and your users can use your app longer, which is great. That's good for you. That means they like your app and they get to play with it.
So that's a brief introduction to why you should use vImage. Three good points: it's simple, it's fast, and it uses less energy. So what else do we provide? We provide a big variety of operations, and we're going to walk through each of the categories and see roughly what's there. We just talked about convolution; just to give you a little bit more about that. As I mentioned, it's the weighted average of nearby pixels. We also let you supply a bias that can be applied to it, which is often useful for certain kernels. And we allow you to use a different set of weights for each color channel. I mentioned the problem of edges before; we give you a bunch of different ways to handle them. You can use the background color. You can do edge extend, where you basically treat the edge pixels as though they went off to infinity. We also have a mode called truncate, and a sort of do-nothing mode, where you just assume the data is there.
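The edge-extend mode described above has a simple mental model: any coordinate that falls outside the image is clamped to the nearest valid one. This is a sketch of that indexing rule only, with hypothetical helper names; it is not how vImage is implemented internally.

```c
/* Clamp an index into the valid range [0, n). */
static long clamp_index(long i, long n) {
    if (i < 0)  return 0;
    if (i >= n) return n - 1;
    return i;
}

/* Sample a single-channel image with edge-extend semantics: out-of-bounds
   reads repeat the nearest border pixel, as if the edges went off to
   infinity. */
static unsigned char sample_extend(const unsigned char *src,
                                   long width, long height, long x, long y) {
    return src[clamp_index(y, height) * width + clamp_index(x, width)];
}
```

A convolution written on top of `sample_extend` never reads out of bounds, which is why this mode needs no special casing at the borders.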
And then we also have geometry operations. These are pretty self-explanatory, but we have some pictures here to help out. We have rotation operations. We have shearing operations. You can reduce and enlarge your image. I should mention that a lot of these geometry operations need to do resampling. We use Lanczos resampling, so you're going to get really good, high-quality output images from this. Not only is it fast, but you get really nice image quality. We also provide affine warp, which is just a general affine transformation applied to the image, and we have reflect, which is very straightforward.
We have transformations. These operate on the color space of the image, so they operate pixel by pixel and do some transform to the colors. We have matrix multiplications, which you can use for color space conversions, like if you want to move from RGB to YUV or HSV. You can also use them to directly manipulate the hue or the saturation of an image, or to just twist the color space. We also have gamma correction operations, and we have fast polynomial and rational evaluators. Since you can approximate any function well with a polynomial or rational approximation, this lets you do just about anything you want to the color space of the image. So these are just a few examples of the sorts of things you can do with the transform operations, which are basically color space operations.
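To make the matrix-multiply idea concrete, here is a scalar sketch of a pixel-by-pixel color transform: each RGB pixel is multiplied by a 3x3 matrix. The function is illustrative only; vImage's matrix-multiply routines do the same job vectorized, with fixed-point weights.

```c
#include <stddef.h>

/* Apply a 3x3 color matrix to every RGB pixel. With the right matrix this
   does RGB-to-YUV conversion, desaturation, hue twists, and so on. */
static void apply_color_matrix(const float *rgb_in, float *rgb_out,
                               size_t pixel_count, const float m[3][3]) {
    for (size_t i = 0; i < pixel_count; i++) {
        const float *p = rgb_in + 3 * i;
        float *q = rgb_out + 3 * i;
        for (int r = 0; r < 3; r++)
            q[r] = m[r][0] * p[0] + m[r][1] * p[1] + m[r][2] * p[2];
    }
}
```

For example, a matrix whose every row is the standard luminance weights (0.299, 0.587, 0.114) converts the image to grayscale.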
We have morphology operations. These are kind of cool. Morphology operations basically give you a way to sort of grow either the highlights or the shadows of an image. Min and max grow them sort of in rectangular ways. Erode and dilate apply a probe that you supply, which can be an arbitrary shape. And you can do some really cool tricks like this. Like here we're going to use a star-shaped probe to sort of grow the highlights into stars in this image. You can do things that are a little less gaudy than this with it, too, but this is kind of a cool example.
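The "grow the highlights in rectangular ways" behavior of max is easy to sketch: each output pixel becomes the brightest pixel in a rectangular window around it. This is a toy single-channel version for interior pixels only, not vImage's tuned implementation; erode/dilate generalize the window to an arbitrary probe shape.

```c
/* Max filter (morphological dilation with a rectangular probe): each output
   pixel is the maximum of the (2r+1) x (2r+1) window around it, which grows
   the bright regions of the image. Interior pixels only. */
static void max_filter(const unsigned char *src, unsigned char *dst,
                       long w, long h, long r) {
    for (long y = r; y < h - r; y++) {
        for (long x = r; x < w - r; x++) {
            unsigned char best = 0;
            for (long dy = -r; dy <= r; dy++)
                for (long dx = -r; dx <= r; dx++) {
                    unsigned char v = src[(y + dy) * w + (x + dx)];
                    if (v > best) best = v;
                }
            dst[y * w + x] = best;
        }
    }
}
```

Min is the same loop with the comparison flipped, and it grows the shadows instead.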
Another cool thing is histograms. We have a really nice set of histogram functions, which you can use to calculate histograms, but you can also use to perform a lot of operations on the image. For example, you can take one image and sort of infuse it with the color palette of another image. That's kind of a cool trick, which I like a lot.
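The building block underneath all of those tricks is the plain histogram itself: count how many pixels take each 8-bit value. A minimal sketch; vImage's histogram functions layer operations like equalization and specification (the palette-transfer trick mentioned above) on top of this.

```c
#include <stddef.h>

/* Compute a 256-bin histogram of 8-bit image data. */
static void histogram8(const unsigned char *src, size_t n,
                       unsigned long bins[256]) {
    for (int i = 0; i < 256; i++) bins[i] = 0;
    for (size_t i = 0; i < n; i++) bins[src[i]]++;
}
```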
And the last category of image operations we have is called alpha operations; alpha is basically another word for transparency. These give you a lot of ways to work with the alpha channels of images, all the standard operations you might want to have. So what data types do we support? There are two main core formats. We support 8-bit unsigned integer with four channels per pixel, and a four-channel-per-pixel, 32-bit floating-point format.
Within those data types, we have a couple of data layouts that we support. We have interleaved layouts, where you have the first pixel's alpha channel, red channel, green channel, and blue channel stored in some order, then the second pixel's channels, then the third pixel's channels. We also have planar layouts, where you have all the red channel image data, then all the green channel image data, then all the blue channel image data. Sometimes planar formats are really nice to work with, so that's an important thing. And we provide conversions between all of these core formats. So we have conversions from planar to interleaved. We have conversions between 8-bit and floating point. And we have conversion operations that just let you swap around channel orders. So if you have an RGBA image and you need a BGRA image or whatever, we give you ways to do that.
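The interleaved-to-planar conversion described here is mechanically simple, which a short sketch makes obvious; the function name and channel order below are illustrative, and vImage provides tuned routines for these conversions.

```c
#include <stddef.h>

/* Split interleaved ARGB8888 data (A,R,G,B, A,R,G,B, ...) into four planar
   buffers, one per channel. */
static void argb_to_planar(const unsigned char *interleaved, size_t pixels,
                           unsigned char *a, unsigned char *r,
                           unsigned char *g, unsigned char *b) {
    for (size_t i = 0; i < pixels; i++) {
        a[i] = interleaved[4 * i + 0];
        r[i] = interleaved[4 * i + 1];
        g[i] = interleaved[4 * i + 2];
        b[i] = interleaved[4 * i + 3];
    }
}
```

Swapping channel orders (RGBA to BGRA and so on) is the same kind of loop with the indices permuted.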
Now, we also support a lot of other data formats as storage formats. So we have conversions that take most common image formats and convert from those into one of the core formats, and then also from the core formats back to the storage formats that you're going to want. So when you're working with image data that's not in a core format, your workflow usually looks like this: you convert to a core format, you do a bunch of processing in that core format, and then if you need to get back to one of the other formats at the end, you convert back after you do all your processing. You generally don't want to be going back and forth all the time. We have very lenient data requirements as far as alignment and things like that. 8-bit data only needs single-byte alignment, so you don't really need to worry about that at all. For the floating-point data types, we require four-byte alignment, which is just native float alignment. So there's no heavyweight vector alignment you have to satisfy. And we don't need containerized data. We're not passing around objects, none of that stuff. We basically want a pointer to your image data, how tall the image is, how wide the image is, and what the offset between consecutive rows of the image is. What this means is that we can usually operate on your image data in place. You don't need to copy it into some special structure for us to work on it. You just give us a pointer to your image, call the function, and you're fine. So this reduces our overhead a lot. And as I mentioned, vImage is vectorized, so we give you really good performance. Each function has some dispatching, so it will use the best implementation available for the hardware that it's running on. On the Mac, this means that we take advantage of SSE3, Supplemental SSE3, and SSE4.1 when they're available. And on iOS, we take advantage of NEON, which is the vector engine on ARM.
If you're running on the A5 processor in the iPad 2, then we'll also take advantage of some other things, like hardware half-precision to single-precision floating-point conversions, and a few other things like that. If you're not running on the A5, you'll fall back to a very fast software path, so you still get good performance. The thing I want to drive home here is that we take care of all these hardware-specific details behind the scenes so you don't have to worry about them. You write one code path, and you get fast performance across the supported devices.
vImage is designed to have low latency. We don't use a JIT; all the code is precompiled, so the execution time is fairly deterministic. There's none of this "the first time you call it, you have to wait several microseconds." You just call it. It's a normal function. Great.
And you can provide your own scratch space to avoid getting hidden malloc calls or anything like that. If you're going to provide your own scratch space, you need to first query the function that you're going to use to find out how much space is needed. You can use the kvImageGetTempBufferSize flag to do that. The way you use this flag is that you call the function exactly the same way that you're going to call it, with all the same arguments, except that the final argument to most vImage functions is a flags argument, and here you just add in the kvImageGetTempBufferSize flag, and the function returns the amount of scratch space that it requires. So typically, the way you use this is what's illustrated here: you get the amount of scratch space that's needed, you allocate that much space, maybe with your own custom allocator or something like that, and then you can apply the filter numerous times using that same scratch space. vImage is threaded using GCD, so on devices that have multiple processors, we take advantage of that. And again, this just happens without any involvement from you; you don't need to worry about it.
So as I said, you write your code once, and if there's more than one processor, you're going to get multiprocessor performance. If there's only one processor, that's fine too; you get good performance either way. And you can disable threading if you already have your own threading engine or your own tiling engine and you don't want us to conflict with that. You pass the kvImageDoNotTile flag to the functions, and then we just turn off tiling, and you can take care of all of that. So if you've already written some of your own image processing algorithms and you already have your own threading model, that's fine. You can still work with vImage. You can just swap us in for whatever routines you're already using internally. And our APIs also provide ways to handle edges that will support your tiling model. So if you look here, suppose you had already written your own tiling model and you had broken this image up into tiles. If you look at, say, tile 6: on the right-hand side of that tile, there's an edge. But on the top, there's no edge, and on the left, there's no edge. Maybe the bottom has a couple of pixels that are influenced by the edge. We give you ways to handle this, even in the face of tiling; you can specify which sides need edge handling. So that's really nice. And that's the quick introduction to vImage. At this point, I'd like to bring up my colleague, Luke Chang, who will tell you about vForce.
[Luke Chang]
All right, thank you, Steve. Hi, everybody. My name's Luke Chang. I'm an engineer in the Vector Numerics Group. Today I'm going to tell you about vForce. In the following eight to ten minutes, I'm going to tell you what vForce is, how to use vForce, and why you want to use vForce. So let's get started. What is vForce? We made vForce to satisfy your computational needs. We know the smartphone is not just a phone anymore.
There are tons of apps out there, and many of them process a lot of data. Don't even mention the iPad: doctors are using it, pilots are using it. So how do we process that data efficiently? This is where vForce comes into play. vForce has elementary math functions for arrays, so it handles array data efficiently.
If you're a Mac developer, you're probably familiar with it already; we introduced it in Mac OS X Tiger. Now we're bringing it over to iOS 5. Let's say you want to write an app to generate a sine wave, a signal generator. Here's one way you can do it in plain C. It's very simple and straightforward. You have an output buffer, and you have an input buffer of indices, because you probably want to generate a frequency-modulated sine wave instead of just a fixed-frequency sine wave. And then you write a for loop to calculate the sine value of each index and store the results to the output buffer. This is very simple; nothing could be simpler than that. But how can we use vForce for this? How can we improve this code? Here's how. First, you have to include the Accelerate header to have access to the Accelerate framework. And then you just have to replace your for loop with a single function call to vvsinf. vv is the prefix for vForce functions, followed by the functionality; in this case it's sinf, single-precision sine. It takes three arguments. The first argument is the output buffer. The second argument is the indices. And the third argument is a pointer to the array length. So vvsinf will do the same thing as the for loop, calculating the sine value of each index and storing the results to the output buffer.
I said it's doing the same thing; why did I say it's better? Well, let's look at a performance comparison. This chart shows how many sines you can calculate in one microsecond. As you can see, vForce is more than twice as fast as the for loop. This means your app will be more responsive; the users of your app don't have to wait as long to get their results. That's nice. But this is not the only thing vForce has to offer. Let's also look at the energy consumption. This is the energy comparison.
The chart shows how much energy is consumed per sine result. Again, vForce is more than twice as energy efficient as the for loop. This means your app will run longer. If you're writing apps for doctors or pilots, you really want your app to run longer. So we've seen that vvsinf is pretty good. What else is available in vForce? We have transcendental functions: power, sine, cosine.
We also have the rounding functions; all four rounding modes are supported. And there's a lot of other stuff, like square root and remainder, and if you dig into the vForce header, you'll find yourself pleasantly surprised by the nice functions we put in there. I'll talk about that in a bit, but first let me show you the performance of some of my favorite functions in vForce. For the first function, truncf, vForce is more than three times faster than a for loop. The other functions are about twice as fast. And I want to bring your attention to the last one, sincosf. What is sincosf? sincosf calculates sine and cosine at the same time for the same input.
Why do you need this? A lot of times when you calculate sine, you also need cosine. For example, if you're doing polar to Cartesian coordinate conversion, you need the sine value for the y-coordinate and the cosine value for the x-coordinate. This function will come in handy. So let's look at vForce in more detail.
vForce supports both single- and double-precision floating-point numbers. It handles edge cases correctly, so if your input has infinities or NaNs, or positive or negative zeros, you don't have to worry. Put your data into the input buffer; vForce will handle those for you. And again, there's a minimal alignment requirement, just like other Accelerate functions. We only require native data alignment. For single-precision floating-point numbers that's four-byte alignment, and for double-precision floating-point numbers it's eight-byte alignment. Nothing more than that, so you can concentrate on getting your data ready in the input buffer. And your results will be ready in the output buffer. So let's recap. vForce is fast. vForce is energy efficient. And it's quite easy to use; there aren't a lot of requirements you have to satisfy.
So how did we do it? We know that the ARM libm is already highly optimized on the iOS platform, so how do we make vForce even faster? Well, vForce has its own advantage: vForce operates on data from an array, so there are multiple data elements available. We can use the powerful vector engine, NEON, to generate multiple results at the same time. And we also did loop unrolling and software pipelining to minimize processor stalls, so the processor won't be waiting for the result of a previous instruction to feed into the next instruction. And we did it in assembly. The bottom line is, we did all the work so you don't have to. All you need to do, after the presentation, is check out the vForce header, see what's available in vForce, and integrate some of the vForce functions into your apps. You'll see the performance improvement yourself. So that's it for vForce. Thank you. Let me hand the presentation back to Steve.
Thanks, Luke. I just want to say, we really put a lot of effort into making a great math library on iOS, so the fact that vForce is able to go even faster is actually really quite impressive. So now, that was the quick overview of what we just introduced in iOS 5. I want to spend a little bit of time talking about my favorite thing in the Accelerate framework, which is LAPACK and BLAS. We've improved these tremendously for iOS 5. They're really nice. So I mentioned we improved them; how did we improve them? We improved the performance a lot.
So, LAPACK and BLAS, now, the A5 processor in the iPad 2 has a great double-precision floating-point unit. It's really fast, and we want to take advantage of that in LAPACK and BLAS to give much better performance than we've been able to give so far. How do you measure performance of these libraries? One of the most commonly used benchmarks is something called LINPACK. And this is a very old benchmark. It actually derives from an old linear algebra library called the LINPACK library.
And it basically measures how fast you can solve a system of linear equations. When people talk about the LINPACK benchmark, there's a lot of confusion, because it's not actually just one benchmark; it's three separate benchmarks that all go by that name. In the first one, you measure how fast you can solve a system of 100 equations using a reference implementation of the solver. Now, that reference implementation comes from, as I mentioned, the LINPACK library, and it's not tuned to the hardware.
It's nothing like that. The benchmark is formally in Fortran, so it would be Fortran reference code, but people have ported it to C and Java and a lot of other languages. Most of the time you see people talk about LINPACK benchmark numbers, this is the benchmark they're talking about in the mobile space.
When you see someone talk about a new device's LINPACK score or something like that, this is the kind of thing they're talking about: reference code running on a fairly small matrix. The second benchmark is 1,000 equations, and you get to use your own tuned implementation. You get to go ahead and write the fastest solver you can write and solve a system of a thousand equations. And the third benchmark that goes by this name is the one people are usually talking about in the high-performance computing world when they talk about LINPACK. If you hear about some new supercomputer that has, you know, a hundred petaflops or something, that's this benchmark.
This is sort of the no-holds-barred division of LINPACK. You get to choose your problem size. If your computer runs fastest on a 1.2 million by 1.2 million matrix, so be it. You get to use the fastest solver you can come up with. So that's LINPACK. And when you see people talk about LINPACK on mobile devices, you often see numbers like 35 megaflops or 50 megaflops; that kind of number gets tossed around a lot. The iPad 2, as I mentioned, has a nice processor, and the compiler does a pretty good job. So with the reference code, we do a little bit better than that, typically. We get about 90 megaflops on LINPACK.
I think this is basically criminal, because you've got this beautiful hardware, and you're going to run this reference code on it that doesn't take advantage of it at all, really. And so we want to see, you know, what can we get on LINPACK using Accelerate, using a tuned implementation, you know, actually take advantage of the hardware, make it as good as we can.
So on the iPad 2, using Accelerate, forget megaflops. We give you 1.5 gigaflops on LINPACK. That's more like it. I should note that not all of this work is in the seed; in the seed, you'll get about 1.3 gigaflops. When iOS 5 ships, you get the full 1.5. Even 1.3 is pretty good, though. That's, you know, twelve-plus times faster than the reference code. So I think that's great. And in fact, the iPad 2 would have been one of the fastest supercomputers in the world in 1994. How cool is that? Yeah, that's awesome, right?
Just like everything else in Accelerate, we give you this performance without you needing to write a lot of code. You don't have to be an expert in high-performance computing to get 1.5 gigaflops on the iPad 2. You include Accelerate, and you call two routines: the first factors the matrix. That's a standard, industry-standard LAPACK function. And the second solves the system using that factorization. And that's it. Three lines of code. That's great. And also, just like everything else we've talked about today and everything else in Accelerate, not only do you get great performance on the iPhone, the iPad, and the iPod touch, but if you decide you want to port your app to the Mac, or if you already have an app running on the Mac and you want to port it to iOS, the same code gives you great performance on all of those devices. So we scale from an iPhone with one processor, to an iPad with two processors, right up to a new MacBook Pro, where we give you 35 gigaflops.
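To illustrate the factor-then-solve pattern those two LAPACK routines follow, here is a toy dense solver: Gaussian elimination with partial pivoting, doing the factorization and the solve in one pass. This is purely an illustration of what the two calls accomplish; it is nothing like LAPACK's blocked, hardware-tuned implementation, and the function name is made up for this sketch.

```c
#include <math.h>

/* Solve the n x n system a*x = b in place. `a` is row-major and is
   destroyed; the solution overwrites `b`. Returns 0 on success, -1 if the
   matrix is (numerically) singular. */
static int solve_linear_system(double *a, double *b, int n) {
    for (int col = 0; col < n; col++) {
        /* Partial pivoting: pick the largest remaining entry in the column. */
        int piv = col;
        for (int r = col + 1; r < n; r++)
            if (fabs(a[r * n + col]) > fabs(a[piv * n + col])) piv = r;
        if (fabs(a[piv * n + col]) < 1e-12) return -1;
        if (piv != col) {
            for (int c = 0; c < n; c++) {
                double t = a[col * n + c];
                a[col * n + c] = a[piv * n + c];
                a[piv * n + c] = t;
            }
            double t = b[col]; b[col] = b[piv]; b[piv] = t;
        }
        /* Eliminate entries below the pivot. */
        for (int r = col + 1; r < n; r++) {
            double f = a[r * n + col] / a[col * n + col];
            for (int c = col; c < n; c++) a[r * n + c] -= f * a[col * n + c];
            b[r] -= f * b[col];
        }
    }
    /* Back substitution. */
    for (int r = n - 1; r >= 0; r--) {
        for (int c = r + 1; c < n; c++) b[r] -= a[r * n + c] * b[c];
        b[r] /= a[r * n + r];
    }
    return 0;
}
```

The LAPACK split (factor once, then solve) pays off when you solve many right-hand sides against the same matrix, since the expensive factorization is reused.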
And all the way up to a Mac Pro, where you get 80-something gigaflops of double-precision performance. So: three lines of code, portable across all these devices, with really nice performance on all of them. And that's what we want from Accelerate. We want to give you something that's easier to use than writing your own code, something that gives you great performance across all the hardware we provide, and something that saves energy, saves the battery, and makes your app last longer. If you want more information about any of this, we have a couple of people you can contact. There's documentation online; the vImage Programming Guide is really nice. And if you have questions, post them to the Apple Developer Forums, or just file a bug if something doesn't work the way you think it should.