Core OS • iOS, OS X • 52:12
The Accelerate framework has always provided fast, energy efficient signal and image processing and linear algebra libraries. Learn about new interfaces that are easier to adopt and give you the same speed, and about new primitives that make it easier to write your own vector code for iOS and OS X.
Speakers: Geoff Belter, Steve Canon
Transcript
Good morning everyone. My name is Geoff. I'm an engineer in the Vector and Numerics Group where we maintain the Accelerate Framework. The Accelerate Framework is a collection of routines which deliver a huge range of functionality. All of this functionality is going to be extremely fast and very energy efficient. Today I want to introduce some new features and functionality in the Accelerate Framework which are designed to really simplify the way that you access this high-performance functionality.
So what are you going to find in the Accelerate Framework? We break this into four broad categories. The first is image processing. Here you're going to find conversions between various pixel formats, warps, shears, convolution, etc. We've got digital signal processing: FFTs, DFTs, biquads, various vector operations. Vector math functionality, so a lot of things that you're going to find in math.h for example, operating on vectors, so sine of a vector, cosine of a vector, etc. And then finally, linear algebra. Solving systems of linear equations, eigenvalues, matrix-matrix operations, a lot of functionality in here as well.
The Accelerate Framework brings a lot more than just functionality to the table. First, it's extremely high performance. When we say this, there are two main metrics that we pay a lot of attention to. The first is speed. It's going to be extremely fast. There are two key tools that we use to achieve this.
The first is short vector units. So on Intel we're taking advantage of SSE and AVX and on ARM we're taking advantage of NEON. Also in some situations we're utilizing multiple cores. We're going to do this automatically. So we're really going to take advantage of all of the processing that's available for you.
The other metric that we spent a lot of time looking at is energy efficiency. So we're increasingly relying on our portable devices. It's important that we keep an eye on this. Generally, when we improve speed and performance, energy efficiency improves as well. So when you adopt the Accelerate Framework, you're going to be fast and energy efficient.
The Accelerate Framework is available on both OS X and iOS. And it's optimized for all generations of hardware. So when you adopt the Accelerate Framework, you're going to write once. You're going to get code that runs extremely fast and is energy efficient no matter where it ends up running. So it's really convenient for you.
Today I want to talk about the new features and functionality that make it easier to get to this high performance. We've got some great new features in vImage which really round out what you can do with vImage. And then I want to spend the rest of the time introducing two new pieces of work. The first is designed to really simplify the way that you access high-performance linear algebra. We're calling this LinearAlgebra. It's a part of the Accelerate framework.
The other piece is not actually a part of the Accelerate framework. It's a collection of vector programming primitives. It's found in simd.h. And for those of you that want to roll your own high-performance vector implementations, there are going to be some great tools in here to help you do that.
So now let's jump right into vImage. This is our high-performance image processing library. It's got a huge range of functionality. I want to show you some of the things that you can do with a short video. You can perform alpha blending, dilation, erosion. You can create Sobel filters to do edge detection. Convolutions for blur and de-blur. You can create multi-kernel convolutions. There are min and max filters. Various color transformations.
And warps and shears. This is just some of what you can do with vImage. Really you can handle almost any of your image processing needs with the tools that are available in vImage. I want to move now into some work that we introduced last year. And this is about getting your image into a format that vImage can consume. Specifically, if you're coming from a CGImageRef.
So until last year this was a difficult task. If you didn't know exactly what the pixel format of your CGImageRef was for whatever reason, it could be difficult to get it into 8-bit ARGB or whatever format in vImage that you wanted to work with.
So last year we introduced a single routine that allows this to happen. I'm just going to move through this at a high level to make you aware of it. For further details, please see last year's talk. But all you do now is you create a structure which describes the pixel format that you're trying to get to.
And then you're going to make a single function call, vImageBuffer_InitWithCGImage. This takes an uninitialized vImage buffer, the structure describing the format, and the CGImage. At the end of this, it's going to return a fully initialized vImage buffer, and you can do whatever you need to do.
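As a rough sketch of what that call looks like in C (assuming an existing CGImageRef named image, with error handling abbreviated; the round trip described next is included as well):

    #include <stdlib.h>
    #include <Accelerate/Accelerate.h>

    // Describe the pixel format we want to work in: 8-bit ARGB here.
    vImage_CGImageFormat format = {
        .bitsPerComponent = 8,
        .bitsPerPixel     = 32,
        .colorSpace       = CGColorSpaceCreateDeviceRGB(),
        .bitmapInfo       = (CGBitmapInfo)kCGImageAlphaFirst,
    };

    // One call: an uninitialized buffer, the format, and the CGImage in;
    // a fully initialized vImage_Buffer out.
    vImage_Buffer buffer;
    vImage_Error err = vImageBuffer_InitWithCGImage(&buffer, &format, NULL,
                                                    image, kvImageNoFlags);

    // ... run whatever vImage operations you need on buffer ...

    // The round trip: same format, back to a CGImageRef.
    CGImageRef result = vImageCreateCGImageFromBuffer(&buffer, &format, NULL,
                                                      NULL, kvImageNoFlags, &err);

    free(buffer.data);                         // vImage allocated this for us
    CGColorSpaceRelease(format.colorSpace);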
The round trip is just as easy, a single function call. So now we've performed all the operations on the vImage buffer. We stayed in the same format, so we can use that same structure describing the pixel format. And this is going to return a CGImageRef. So there's some really great interoperability with CGImageRef. It's really easy to get your image data in and out of vImage this way. Last year we also introduced some high-level entry points to some really amazing conversion support. And this is through vImageConvert_AnyToAny.
It does exactly what it sounds like it's going to do. It allows you to convert between nearly any pixel format and any other pixel format. Again, just at a high level for further details, see last year's talk. But the way that it works is you're going to create two of these structures describing the pixel formats. One for the source format, one for the destination type.
Then you create a converter. And then with that converter, you can convert between the two image formats. You can convert as many images as you want with a single converter. So this allows you to convert between nearly any pixel format. To the power user, this means you can get almost any of your image formats into a format that vImage can consume very easily, very efficiently, and it's going to run extremely fast.
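A hedged sketch of that flow in C (srcFormat, dstFormat, srcBuffer, and dstBuffer are assumed to be set up along the lines shown above):

    #include <Accelerate/Accelerate.h>

    // Build a reusable converter between the two CG pixel formats.
    vImage_Error err;
    vImageConverterRef converter =
        vImageConverter_CreateWithCGImageFormat(&srcFormat, &dstFormat,
                                                NULL, kvImageNoFlags, &err);

    // Convert as many images as you like with the same converter.
    err = vImageConvert_AnyToAny(converter, &srcBuffer, &dstBuffer,
                                 NULL, kvImageNoFlags);

    vImageConverter_Release(converter);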
You guys had some really great things to say about these two new features. One Twitter user said, "functions that convert vImage objects to CGImage objects and back," thumbs up. Another Twitter user said, "vImageConvert_AnyToAny is magical. Threaded and vectorized conversion between nearly any two pixel formats." We really appreciate the feedback. We're very happy that you guys are using this and find it useful. Please keep the feedback coming.
So with that I want to introduce video support to vImage. This is new in both iOS 8.0 and OS X 10.10. And I'm going to start with the high level functionality from a CVPixelBufferRef. So this is a single video frame. And we're introducing the same interoperability and the same ease of use that we saw with core graphics.
So now if you want to get your CVPixelBufferRef into a format that vImage can operate on, it's a single function call. You're going to use that same structure which describes the format which you're trying to get to. And then you're going to call vImageBuffer_InitWithCVPixelBuffer.
It takes an uninitialized vImageBuffer. It takes the structure describing the format, and the CVPixelBuffer. There's some additional arguments for the power user which we'll see a little bit more about in a second. At the end of this you've got a freshly initialized vImage buffer, and you can perform any operation you want.
The round trip is just as easy. So vImageBuffer back to CVPixelBufferRef is vImageBuffer_CopyToCVPixelBuffer. It takes the vImageBuffer that you just finished working with, the structure describing the pixel format, and then the CVPixelBuffer that you're trying to copy back to. So there's some great interoperability now with Core Video as well.
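A hedged sketch of that round trip in C (pixelBuffer and format are assumed; the CV-specific arguments such as the vImageCVImageFormatRef and background color are simply left NULL here, so check the vImage headers for the full parameter details):

    #include <stdlib.h>
    #include <Accelerate/Accelerate.h>
    #include <CoreVideo/CoreVideo.h>

    // Pull a single video frame into a vImage_Buffer in the desired format.
    vImage_Buffer buffer;
    vImage_Error err = vImageBuffer_InitWithCVPixelBuffer(&buffer, &format,
                                                          pixelBuffer, NULL,
                                                          NULL, kvImageNoFlags);

    // ... operate on buffer with any vImage routines ...

    // Copy the result back into the CVPixelBuffer.
    err = vImageBuffer_CopyToCVPixelBuffer(&buffer, &format, pixelBuffer,
                                           NULL, NULL, kvImageNoFlags);
    free(buffer.data);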
To support the high level functionality that we just saw, there's a lot going on behind the scenes. All of this is exposed to you as well. So, the lower level interfaces. There's forty-one new video conversions which are supported. You can, through some of the other arguments that we saw, do things like manage the chroma siting, work with transfer functions, and conversion matrices.
There's a lot that you can do with this. Another one that is really neat, if you've worked with video formats before, is RGB colorspaces. There are some subtleties; it's just a little bit tricky and complicated to get an RGB colorspace right. And vImage makes this really simple and easy to do.
vImageConvert_AnyToAny is extended to support all of the video formats now. And there are two great new convenience routines which allow you to create converters to convert back and forth between Core Graphics and Core Video. So now with video support in vImage we've got great interoperability with both Core Graphics and Core Video, really fast conversions for both image and video pixel formats, and really fast operations once you're in vImage.
I want to show you some typical performance. So what I have here is performance from VideoToolbox. This is available in CoreMedia, and what I've got on this graph is the speed in megapixels per second on the Y axis to convert from the BGRA 8-bit pixel format to the pixel format shown on the X axis.
The gray bar is OS X 10.9. This is before VideoToolbox had adopted vImage. And then the blue bar is OS X 10.10, after VideoToolbox adopted vImage. We see a few things here. First we see some really great performance improvements. So vImage conversions are going to be really fast. In some cases we're up to five times faster.
The other thing that we see, all the way at the right, is that the v210 pixel format wasn't even supported before. vImage supports a wide range of formats, and it made it really easy for them to produce new features once they adopted the vImage video support. So this is what you can expect out of vImage. Great performance. Simple, easy to use, good interoperability with Core Graphics and Core Video.
Now I want to move on to LinearAlgebra. This is a new sub-framework in the Accelerate Framework. It is designed to provide simple access to high-performance linear algebra. I want to begin with a motivating example. How do you solve a system of linear equations? Let's look at how you do this with LAPACK, also available in the Accelerate Framework. And this is saying, if we've got a system of equations with a matrix A and a right-hand side b, how do we find x such that Ax = b? So with LAPACK, it's going to look something like this. It's not terribly straightforward.
The naming convention in LAPACK uses short names, so you're going to have to figure out that sgesv means solve a system of linear equations. Once you're there, the argument names are not going to be much better. You're passing by reference here. All of the argument types are __CLPK_integer. So there are going to be a lot of explicit casts.
Additionally, there's going to be a lot of memory management that you need to do explicitly: workspaces, or in this case a pivot vector that you need to create. So there's a lot to just finding the right routine and then using it correctly. We think it should be much simpler than this. Let's look at how you solve the system of linear equations with LinearAlgebra. It's going to be really simple. It's simply going to be la_solve. All of the details are going to be managed for you behind the scenes.
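For a concrete, hedged comparison in C (sizes and variable names are illustrative, not the code from the slide):

    #include <stdlib.h>
    #include <Accelerate/Accelerate.h>

    // LAPACK: solve A*x = b in place with sgesv_. Every argument is passed
    // by reference, dimensions are __CLPK_integer, and you allocate the
    // pivot array yourself.
    __CLPK_integer n = 100, nrhs = 1, lda = 100, ldb = 100, info = 0;
    __CLPK_integer *ipiv = malloc(n * sizeof *ipiv);
    sgesv_(&n, &nrhs, A, &lda, ipiv, b, &ldb, &info);   // A, b are float arrays
    free(ipiv);

    // LinearAlgebra: one call, no workspaces, no pivots. matA and vecB are
    // la_object_t wrappers around the same data.
    la_object_t x = la_solve(matA, vecB);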
So with that let's dive into what exactly you're going to get out of the LinearAlgebra sub-framework. It's new in both iOS 8.0 and OS X 10.10. It is designed to be simple with good performance. It's got single and double precision support, not mixed, much like BLAS and LAPACK. It's got support for Objective-C, so the object is going to be a native Objective-C object.
What are you going to find in LinearAlgebra? There's a huge range of functionality. We've got element-wise operations, add, subtract. Matrix products. This could be inner product, outer product, matrix-matrix. Transposes. There's support for norms and normalizations. Support for solving systems of linear equations. And then two pieces which are unique to the LinearAlgebra sub-framework, and those are slice and splat. And we'll see about those in further detail a little bit later. So let's begin with the new LinearAlgebra object. The LinearAlgebra object is a reference counted opaque object. As I said, it's an Objective-C object in Objective-C. It still works in C though.
It manages a lot of things for you. So in that initial LAPACK example we saw that for each argument you're tracking a pointer, the row and column dimensions, leading dimension or a stride. There's a lot of things for each argument. It means you have a lot of arguments. There's a lot going on.
Here the object is going to keep track of the data buffer. It's going to keep track of the dimensions of each of these objects. Errors and warnings are attached directly to the object making it really convenient. And then finally scalar type. So with BLAS and LAPACK you've got all the APIs duplicated, one for single and one for double. We can collapse all that down to half the number of APIs.
Memory management for these LinearAlgebra objects. Again, these are reference counted objects. There's a lot of documentation about reference counted objects. There's nothing new here. Just very briefly, from C you're going to use la_release and la_retain. You do not ever free these. From Objective-C, they take the standard release/retain messages. And then finally, Objective-C with ARC, which is what we recommend, just lets you write exactly what you want with no explicit memory management. From here on out, all the examples that I show are going to be Objective-C using ARC, so there's going to be no memory management.
So how do you get your data into one of these LinearAlgebra objects? In this example, we're going to allocate a buffer. It's going to be some number of rows by some number of columns, and we know the row stride and number of elements. We're going to fill that as a row-major matrix. Then to get that matrix into the LinearAlgebra domain, we're going to just call la_matrix_from_float_buffer or la_matrix_from_double_buffer. It takes the pointer, the dimensions of the matrix, and the row stride.
And then hints, which we'll see a little bit more about on the next slide. And then attributes, which are attached to objects. These attributes allow you to do things like enable additional debug logging. In this particular case, the data is copied out of A, so the user retains ownership of A; in this case they need to free it.
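A hedged sketch of that in C (dimensions are illustrative; hints are covered next):

    #include <stdlib.h>
    #include <Accelerate/Accelerate.h>

    const int rows = 100, cols = 100;
    float *A = malloc(rows * cols * sizeof *A);
    // ... fill A as a row-major matrix ...

    // Wrap the buffer in a LinearAlgebra object. The data is copied out of A,
    // so the caller still owns A and must free it.
    la_object_t matA = la_matrix_from_float_buffer(A, rows, cols,
                                                   cols,            // row stride
                                                   LA_NO_HINT,
                                                   LA_DEFAULT_ATTRIBUTES);
    // Passing LA_ATTRIBUTE_ENABLE_LOGGING instead turns on debug logging
    // for this object.
    free(A);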
So hints, when you're passing data to LinearAlgebra, there's some information that can be beneficial to the framework to deliver the maximum performance. So hints are designed to allow for this, to allow for you to give us details and insights about the buffer so that we can use the right routines behind the scenes.
So for example, if you know that you have a diagonal or a triangular matrix, we can leverage that. These are hints, so if you pass the wrong hint it's not going to give you a wrong result. It may just add additional overhead. If you don't know, just use LA_NO_HINT.
The next piece I want to talk about is Lazy Evaluation. I want to do that with a fairly large example for a slide. So it's not important that you understand exactly what's going on in all of this code. I just want to walk through it at a high level so that you can understand what's going on behind the scenes.
LinearAlgebra uses an evaluation graph. When you create an object, evaluation is not necessarily going to occur. It's going to be added into this evaluation graph. So at the start of this function, we've got two evaluation graphs with a single node in each of them. And as we step through this code we're going to create additional objects. So in this case we create a transpose. We add that to our evaluation graph.
Then we take the sum of the odd elements of x and the even elements of x. Again, we just add that to the evaluation graph. And we continue. This time the product of At and x2, all scaled by 3.2. All of this is just added to this evaluation graph. At no point has any evaluation occurred or any temporary data structures been allocated.
So no computation is going to occur until you trigger it. This allows us to not perform a lot of frivolous memory allocations and computations. And right now we don't trigger a computation until you explicitly ask for data back. This is going to happen with la_matrix_to_float_buffer or la_matrix_to_double_buffer, or la_vector_to_float_buffer or la_vector_to_double_buffer. So again, creating these objects is going to be lightweight. We're going to do a lot of work behind the scenes to make this run extremely fast. And we're only going to compute the data that you request at the end.
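As a hedged, simplified sketch of such a chain in C (matA and matB are la_object_t matrices along the lines of the previous sketch; in plain C you would also la_release the intermediate objects, which ARC handles for you in Objective-C):

    // Each call just adds a node to the evaluation graph; nothing is
    // computed and no temporaries are allocated yet.
    la_object_t At = la_transpose(matA);
    la_object_t S  = la_sum(matA, matB);
    la_object_t P  = la_scale_with_float(la_matrix_product(At, S), 3.2f);

    // Evaluation is triggered only here, when we ask for the data back.
    float *result = malloc(rows * cols * sizeof *result);
    la_status_t status = la_matrix_to_float_buffer(result, cols, P);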
I want to show you some performance results for the routine that we just saw. Before I do that, I want to introduce Netlib BLAS. This is an open source implementation of BLAS. If you weren't aware that BLAS was available in the Accelerate Framework, this is probably the implementation that you would find yourself using.
So now let's look at the performance of that routine that we were looking at before. On the X axis we've got various matrix sizes. On the Y axis we've got gigaflops, so higher is going to be better. Here's the performance of the LinearAlgebra framework. We can see it's pretty good. Let's compare it to the Accelerate BLAS, an extremely high-performance benchmark here.
What we see here is that LinearAlgebra is getting most of the performance that the Accelerate Framework can deliver, and it's much simpler to get that performance from LinearAlgebra. There is a discrepancy on the small end. There are fixed costs associated with these objects which are magnified for smaller matrices.
But overall, you're getting most of the performance with a really simple, clean API. I just want to put this performance comparison into perspective. What if you had used that open-source Netlib implementation of BLAS? Your performance would look like this. So you can see, you're getting a lot of the possible performance from LinearAlgebra.
Next I want to talk about error handling. So what I've got here is just a sequence of operations with LinearAlgebra. After each operation we're checking the error status. We don't recommend you doing it this way. What we recommend you doing is checking the error once at the end.
So errors are going to be attached to and propagated through these evaluation graphs. So if we have an error in the first statement, that error is going to be attached to the object AB. Sum is going to see that there is an error there and just propagate it through.
Additionally with Lazy Evaluation, there's a class of errors that may not be triggered until computation time. So it's always best to check the status as late as possible. In this case we're trying to write back to the buffer before we even check the status. The way that we recommend you checking the status is: if the status is zero, or LA_SUCCESS, then everything went well. In this case, you've got data in your buffer.
If it's greater than zero, there was some warning; you're going to have data there, but you may not have full accuracy. And then finally, if it's less than zero, some hard error occurred. In this case there's going to be no data in that buffer. This might be something like a dimension mismatch or something we just can't recover from.
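A hedged sketch of that pattern, continuing the earlier example:

    // Do the write-back first, then check the status once, at the end.
    la_status_t status = la_matrix_to_float_buffer(result, cols, P);
    if (status == LA_SUCCESS) {
        // result holds valid data
    } else if (status > 0) {
        // a warning: the data is there, but may not be fully accurate
    } else {
        // a hard error (for example, a dimension mismatch): no data was written
    }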
So this sort of begs the question, how do we debug this if we've got all this late error checking and Lazy Evaluation? The best way to do this is to enable debug logging with LA_ATTRIBUTE_ENABLE_LOGGING. When you do this and you encounter an error or warning, you're going to get a message like this to standard error.
This is going to help you determine what the error was and where it occurred, which really helps you to quickly narrow down where the problem is coming from. I want to talk a little bit about the details of the solve. So if you're familiar with LinearAlgebra, if you've worked with LAPACK before, you know there's a lot of options here.
So I just want to talk about what our solve is doing at this point. If A is a square, non-singular matrix, it's going to compute the solution to Ax = b. If A is square and it's singular, it's going to produce an error. So right now, it's pretty straightforward, and this is what you're going to get out of it.
The next piece which is unique to LinearAlgebra is slicing. So slicing is lightweight access to partial objects. I say lightweight access, so there's going to be no buffer allocation and no copy. Things that you can do with slices are, for example, taking the odd elements of a vector.
We shouldn't have to allocate a temporary buffer and copy those odd elements out into that buffer if we don't need to. And when I say that there's no allocation and no copy, don't confuse this with Lazy Evaluation; this is at evaluation time. We're going to do everything that we can to just access that data in place.
There are three pieces of information that you need to create a slice. That is offset, stride, and dimension. And let's look at an example. Let's say we wanted to slice and get some of the elements out of an existing vector. The first argument is going to be the offset. This is a zero-based offset. So if you start at the 8th element, it's going to be 7.
The stride is the direction and number of elements that we're going to move. In this case it's negative 2, so we're going to move back two elements. And then finally, the dimension is 3. So we're going to have this view of a three element vector, which is really elements out of some larger vector. Again, no copy, no allocation here, just a lightweight access of elements in some larger object.
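In C, that slice looks something like this (vector is an existing la_object_t; names are illustrative):

    // Offset 7 (the 8th element), stride -2, three elements:
    // a lightweight view, no allocation, no copy.
    la_object_t slice = la_vector_slice(vector, 7, -2, 3);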
One of the ways that you might use this is to create a tiling engine. Let's just look at a simple example. You want to sum two matrices together. One of the ways you can do this is with this simple nested loop. And you would put your slices inside the loop. You're slicing the two operands, A and B in this case, and you're creating a partial result C, using that C, and then getting the next partial sum.
So you can do it this way. And it's going to work. But we can actually do a lot of this work for you. So instead, what we recommend doing is hoisting that sum out of the loop. With Lazy Evaluation, nothing is going to happen here. And instead you just put the slice on the result. So our picture has changed a little bit. It looks like something different is happening here. But behind the scenes, you're actually getting what you saw on the previous slide.
So you're getting exactly what you want. We're doing all the work for you behind the scenes. So it's really easy to work with these slices. And the rule of thumb is to put them as close to the result as possible. The next piece is a splat. A splat is a way to work with scalar values alongside vectors and matrices. So let's say you want to add 2 to every element of a vector.
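Here's a quick hedged sketch of what that looks like, explained next (v is an existing la_object_t vector):

    // Splat the scalar 2 so it can be combined with a vector.
    la_object_t two   = la_splat_from_float(2.0f, LA_DEFAULT_ATTRIBUTES);
    la_object_t vplus = la_sum(v, two);   // adds 2 to every element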
The way that you're going to do this is you're going to call la_sum with your vector object. And then you're going to splat the scalar value 2. So it's really easy to do certain operations now with scalars on matrices and vectors. So that's a high-level summary of LinearAlgebra. It's got a really simple, easy-to-use API. It's got some great modern language and runtime features. And it's going to deliver really good performance. With that I want to turn it over to Steve to talk about LINPACK.
Thanks Geoff. So I'm Steve Canon. I'm a Senior Engineer in the Vector and Numerics Group. I work with Geoff. And I'm going to talk about our other new feature shortly, but before I do that, I thought we'd have a little bit of fun and talk about LINPACK real quickly.
So LINPACK is a benchmark that originated in the high-performance computing community. And what it really measures is how fast are you able to solve a system of linear equations? Now this might seem like kind of an arbitrary benchmark. But it turns out that lots of computations that we do every day boil down to solving linear problems.
So this is really an important thing to be able to do quickly. Now when you talk about LINPACK, it's important to keep in mind that LINPACK is measuring the speed of both hardware and software. You can't have great performance on LINPACK without good hardware and without good software that takes advantage of the hardware that you have.
The past few years, we've shown you a shootout between Accelerate running on iOS devices and what we like to call Brand A. Last year we showed you a chart that looked like this comparing Accelerate running on the iPhone 5 against the best LINPACK score that we were able to find anywhere for any Brand A device.
So this is performance in gigaflops. This is double precision. See that Accelerate on the iPhone 5 gives you about 3 1/2 gigaflops on LINPACK which is a really impressive number. It's great. Now the past few years we showed you a chart like this, and then the next year Brand A hardware would have improved enough to make the comparison more interesting. And then we could blow you away again with how much faster Accelerate was.
But since last year, Brand A hardware hasn't changed that much. And so the great software primitives that we give you in Accelerate, well this is still on the iPhone 5, and you can see, it's not that interesting of a comparison. So this year we thought we'd do something different.
We're going to find some new competition. Instead of comparing current iOS hardware against Brand A, we're going to compare Accelerate running on the iPhone against Accelerate running on some other device. What should we pick? We chose to look at the 2010 MacBook Air. Now this was a sweet laptop. I had one of these. It's fantastic. This was like the first one that we shipped with the current hardware design on the outside. It's a really nice machine.
It's just a few years old. You can see it's more than twice as fast as the iPhone 5 was. So, how do you think the iPhone 5s stacks up? Well, you should have some clue. I probably wouldn't be showing you the graph if it wasn't at least going to be close.
But on the other hand, this is a pretty sweet laptop from just a few years ago. And we're going to compare it against the phone that fits in your pocket like so. I don't know. Who thinks that the iPhone 5s is faster? Who thinks that the MacBook Air is faster? Ok. So let's see what happens. The iPhone 5s would give you 10.4 gigaflops double precision LINPACK.
And we have other iOS devices too. On the iPad Air, we give you 14.6 double precision gigaflops. And you don't need to be an expert in high-performance computing in memory hierarchies, in vectorization, in multithreading to get this. You just use the simple primitives that we give you for matrix operations, and you get this kind of performance. So I think this is really cool.
With that, I'm going to move on to our last new feature for the day, which is called SIMD. Now SIMD traditionally is a name used to talk about hardware, and it stands for single instruction multiple data. And that's not exactly what we're talking about here. This is a new library that we're introducing in iOS 8.0 and OS X Yosemite. And it has three primary purposes. The first one is to support 2D, 3D and 4D vector math and geometry.
The second purpose for SIMD is to provide a lot of the features of Metal in C, C++ and Objective-C running on the CPU. So it's going to make it easier to prototype code. Maybe you want to run on the CPU before you deal with GPU stuff. Maybe you want to move code between the CPU and the GPU. It makes it a little bit easier to do that.
And finally, SIMD library provides an abstraction over the actual hardware SIMD and the types and intrinsics that you often use to program against it to make it easier to write your own vector code when you need to. So I think the most interesting thing about this is the vector math and geometry, and I'm going to dive right into that.
There are already a couple of vector math and geometry libraries on the platform. There's all the features in Accelerate, which can do just about anything you want. There's GLKit, SpriteKit, SceneKit, and the physics library that goes with them. So if we're going to do a whole new one, we had better get some things right.
So, my wish list of what a library like this should look like is kind of like this. First off, we should have inline implementations of everything we possibly can. Because when you're doing, you know, a 4D dot product or something, there's not a lot of arithmetic. It's just four multiplies and three adds. So to have to actually make an external function call, a jump, that's not what you want to do when you're only going to do seven arithmetic operations.
And because of this, essentially everything in SIMD is header inlines. So it just gets inserted into your code. That gives you really nice performance. Next, we should have concise functions that don't have a lot of extra parameters. If you want to do a dot product, a 3D dot product, using BLAS, it looks like this. You've got all these extra parameters. We don't think you should need to write this.
If you're going to do it using GLKit, which is a great library (I love GLKit), the compiler should know that x and y are three-dimensional vectors. You shouldn't need to tell it that in every function you call. With SIMD, you just write this: vector_dot(x, y). The function is overloaded to support all the different vector types that we have.
It just works. It inserts the correct implementation into your code. You get great performance. If you're writing C++, then we have even shorter names under the SIMD namespace. And these look just like Metal. So you can take Metal code, add the using namespace SIMD, and a lot of it will just work using SIMD headers. This is really convenient when you're writing your own code.
The last feature that I think is important is that arithmetic should use operators. So if you want to average two vectors, rather than needing to write this, you should just be able to write 0.5 times ( x + y). Now you have the average of two vectors. This is a lot easier to write. It's a lot easier to read. It makes your code more natural.
Alright, so let's dive into what's actually available here and what we're doing. First, the basic types. We have a lot of vector types available in SIMD. But the ones that you're going to use most often when you're doing vector math and geometry are the 2, 3, and 4 dimensional float vectors, which are just vector_float2, vector_float3, and vector_float4.
If you're writing C++ code, again we have the names that match Metal. They're in the SIMD namespace. You can just say float2, float3, float4. And these are based on a clang feature called extended vectors. And that gives us a lot of functionality for free that made writing this library really pleasant.
So first off, arithmetic on vectors pretty much just works. You can use all your favorite arithmetic operators on vectors and on scalars. Everything is nice. It makes your code easy to read. And I'm going to show you another example of that right now. So, a pretty basic function for a graphics library is a vector reflect. So we take a vector x, and we take a unit vector n. That unit vector determines the plane. And we're going to reflect x through that plane. This is a really common operation in graphics. And there's a simple mathematical expression that gives the result.
Now, before, we might have had to have a lot of verbose function calls to compute this expression. But with SIMD, it's really simple. We just write x minus twice the dot product of x and the normal vector, times the normal vector. This is just as simple as the mathematics is. It makes your code, again, really easy to write, really easy to read. I think it's much nicer.
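As a hedged sketch of that style in plain C, using the vector_float3 type, the overloaded geometry functions, and ordinary operators:

    #include <simd/simd.h>

    // Average two vectors with ordinary operators.
    vector_float3 average(vector_float3 x, vector_float3 y) {
        return 0.5f * (x + y);
    }

    // Reflect x through the plane defined by the unit normal n,
    // written exactly like the mathematics.
    vector_float3 reflect_through(vector_float3 x, vector_float3 n) {
        return x - 2.0f * vector_dot(x, n) * n;
    }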
There are a bunch of other features that we get with these vectors without needing to call any functions or do anything. We get access to vector elements and subvectors really easily. Array subscripting just works. If you want to pull out the second element of a vector, you just subscript it like you would if it were an array.
Named subvectors just work. So if you have a vector of 4 floats, you can get the low half of it, the first two elements, by just writing the name of the vector followed by .lo. The high half is just .hi. You can get the even elements with .even, and the odd elements with .odd.
And I should point out that these subvectors and elements, they're L values. So you can assign to them as well as reading from them. And this is real useful when you're writing your own vector code, especially if you're doing perspective coordinates or something like that. A lot of times you need to just set some value in the fourth coordinate for example.
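A hedged sketch of those accessors in C:

    #include <simd/simd.h>

    vector_float4 v = { 1.0f, 2.0f, 3.0f, 4.0f };
    float second     = v[1];      // array-style subscripting
    vector_float2 lo = v.lo;      // first two elements
    vector_float2 hi = v.hi;      // last two elements
    vector_float2 ev = v.even;    // elements 0 and 2
    vector_float2 od = v.odd;     // elements 1 and 3

    // These are l-values, so you can assign to them, for example to set
    // the fourth (perspective) coordinate:
    v[3] = 1.0f;
    v.hi = (vector_float2){ 0.0f, 1.0f };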
This is really nice. If you go totally hog wild with this, it will make it harder for the compiler to generate great code for you. But used sparingly, this is really a powerful feature. That's about what you get for free with the types. Now we also give you lots of functions that give you the operations that you want. We have three headers that have tons of stuff that comes up all the time for math and geometry: math, common, and geometry.
In C and Objective-C, those functions look like this. Notice the math functions look just like the math functions that you use for scalars. They're overloaded, so now they work for floats, for doubles, for vectors of floats, for all our other floating point vector types, just works. You want the square root of a vector? Just call square root.
Everything is there. The common functions you may be familiar with if you've written shader code before or if you've done a lot of graphics programming. These are operations that are really useful when you're dealing with coordinates or colors. If you haven't done a lot of that before, they may be new to you. But don't worry about that. They're easy to understand and there's a lot of documentation for them in the headers. And then there's the geometry functions as well.
Now in C++ and Metal, again we have shorter names available in C++. These are under the SIMD namespace. And these exactly match the Metal functionality. So again, this makes it really easy to move code between C, C++ and Objective-C and Metal when you need to. Now I want to call out that some of these functions come in two variants. There's a precise version. And there's a fast version.
Now precise is the default because if you don't know which one you need, it's better to be safe and give you the most accurate one we have. But there is also a fast version. If you compile with -ffast-math, then you get the fast ones by default. The fast ones just may not be totally accurate to the last bit. We give you about half the bits in a floating point number with the fast variants.
Now even if you compile with -ffast-math, you can still call the precise ones individually when you need to by just introducing precise into the name. And similarly vice-versa: if you don't have -ffast-math specified, you can always call the fast variant. And in C++ we do this with namespaces. There's a sub-namespace called fast and a sub-namespace called precise that you can use to override the defaults really easily.
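A hedged sketch of how those variants are spelled in C, using normalize as the example (v is an assumed vector_float3; the same pattern applies to the other functions that come in two flavors):

    #include <simd/simd.h>

    vector_float3 n0 = vector_normalize(v);          // precise by default,
                                                      // fast under -ffast-math
    vector_float3 n1 = vector_precise_normalize(v);  // always full accuracy
    vector_float3 n2 = vector_fast_normalize(v);     // faster, roughly half the bits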
Now last, a talk about vector math and geometry wouldn't really be complete if we didn't have matrices. So we have a set of matrix types, which are matrix_floatNxM, where N and M can each be 2, 3, or 4, and they don't need to be square. You can have a 4x2 matrix or a 2x3 matrix. I want to point out that N is the number of columns.
M is the number of rows. If you're a mathematician this may be a little strange to you. A 2x3 matrix has two columns and three rows instead of vice-versa. But if you come from a graphics background, this is very natural. This follows the precedent that Metal and OpenCL and DirectX and GLSL and all of these libraries have always used. So that's why we do it.
There are lots of operations available on matrices as well. You don't get the operators for free in C and Objective-C, sorry, so you do have to make some function calls. But we have a nice set of functions to create matrices. We have a nice set of functions to operate on matrices, and on matrices and vectors. This is just sort of the broad overview. We have some other stuff as well.
In C++ you get operator overloading. So you can add and subtract, multiply by scalars, multiply matrices and vectors. We have some nice constructors that make it easier to create these objects. It's really nice to work with. Really easy to write your vector code. So that's, that's sort of the vector math and geometry story. And now I want to talk a little bit about writing your own SIMD code using the library.
So we also have lots of other types. I mentioned this at the beginning. The vector_float types are just a few of them. We also have vectors of doubles, vectors of signed and unsigned integers. We've got 8-bit, 16-bit, 32-bit, and 64-bit integers. We support longer vector types: 8, 16, and 32 elements. This is really useful to write just a little bit of code and have the compiler effectively unroll your loops for you.
We also have unaligned vector support. All of the normal vector types are aligned by default, which is great when you're doing geometry because you're not usually getting the data from somewhere else. You're, you know, we just want to align it. We want to give you the best performance you can. However, when you're writing your own vector code, usually you're operating on data buffers that came in from somewhere. And those buffers may not be aligned. So we also provide unaligned types for you to work with. And I'll show you an example of that a little bit later.
Now, just like the floating point vectors I showed you, you get lots of operators for free. You get the normal arithmetic operators. These just work. You also get the bitwise operators. Those just work on vectors. They work with vectors and scalars, so you can shift every element right by 3 by just writing vector >> 3.
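For example (a hedged sketch in C):

    #include <simd/simd.h>

    vector_int4 a = { 1, 2, 3, 4 };
    vector_int4 b = (a + 3) & 0xFF;   // mix scalars in freely
    vector_int4 c = a >> 3;           // shift every element right by 3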
We also have a big set of conversion functions for you. These let you convert from one vector type to another. I want to point out that you should use the conversion functions. Don't cast vectors, because it almost surely doesn't do what you want it to do. When you cast vectors, it reinterprets the data in the vector as the other type.
This means that you can't even cast, say, a vector of 4 16-bit integers into a vector of 4 32-bit integers, because they have different sizes. So rather than casting them, call the conversion functions, which will convert one vector type to another vector type for you and give you the right behavior.
You also get comparisons. So comparisons just work on vectors. It's a little bit strange though because I can't really talk meaningfully about one vector being less than another vector, right. That doesn't make sense geometrically. So comparisons produce a vector of results. Where each lane of the result vector is minus 1, that's all 1s if the comparison is true in that lane.
And it's zeros if the comparison is false in that lane. I'll show you an example. Here's a vector of 4 floats. Compare it against another vector of 4 floats, we'll see if x is less than y. So, in the first lane, zero is not less than zero, the comparison is false. We'll get a result of zero.
Now 1 is less than 3.14159, so the result is all 1s. 2 is not less than minus infinity, and 3 is less than 42. Now I just went through this, but it's going to turn out this doesn't matter a lot to you most of the time, because almost always when you do a comparison, you're going to consume the result of that comparison with one of three operations: vector_any, vector_all, and vector_bitselect.
vector_any is true if the comparison is true in any lane of the vector. vector_all is true if it's true in every lane of the vector. And vector_bitselect lets you select between the elements of two vectors based on the result of the comparison. So most of the time, these give you the functionality that you really want from comparisons anyway. You don't need to worry about the nitty-gritty details of what the type of the result is.
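A hedged sketch of a comparison being consumed by those three operations:

    #include <math.h>
    #include <simd/simd.h>

    vector_float4 x = { 0.0f, 1.0f,      2.0f,      3.0f  };
    vector_float4 y = { 0.0f, 3.14159f, -INFINITY, 42.0f  };
    vector_int4 mask = x < y;    // 0 or -1 (all 1s) per lane

    if (vector_any(x < y)) { /* true: at least one lane compares less */ }
    if (vector_all(x < y)) { /* false: lanes 0 and 2 do not           */ }

    // Take y's bits in lanes where the mask is set, x's bits elsewhere;
    // with this mask that's a per-lane maximum of x and y.
    vector_float4 m = vector_bitselect(x, y, mask);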
So now I'm going to show you an example of using this to write your own vector code. I'm going to choose an example that's something that we normally don't really think about vectorizing. It's not that hard to vectorize, but it's something that, you know, is outside the realm of the sort of floating point computations that we normally think of. We're going to look at string copy. So here's a simple scalar implementation of string copy that's sort of right out of K&R.
And all we do is we iterate through the bytes of the source and we copy them to the destination. And when we reach a byte that's zero, we stop copying. That's it. Complete implementation right there. Now, as I said, this isn't too hard to vectorize. Here's sort of a typical SSE intrinsic implementation of string copy. I haven't pulled out all the stops here, but this is sort of a reasonable implementation. And this is fine. It wasn't too hard to write. It's a little bit ugly. I find it kind of a pain to read.
The big problem with this is that it works for 32-bit and 64-bit Intel, but we might want to now run our code on ARM. We'd have to either write a whole new implementation, or just fall back on the scalar code. So we want to give you the tools to write fast implementations that you can run on all of our platforms.
Here's what a SIMD implementation of string copy looks like. First off, it's a little bit shorter than the SSE intrinsic version. And I think it's a little bit cleaner. I'm going to walk you through it. In the first part here, we just go byte by byte until the source has 16-byte alignment. That's going to enable us to use aligned loads from the source from that point on.
I'm not going to get too much into the nitty-gritty of the details of why it's important to do this. But when you're dealing with implicit-length data objects like strings, you do really need to align your source buffers. Having aligned our source buffer, now I'm just going to cast the source and destination pointers into pointers to vectors. And you notice I used two different vector types here. Remember, I aligned the source buffer. So it's a vector_char16. That's aligned; it has 16-byte alignment.
The destination vector is not necessarily aligned. There's no guarantee that by aligning the source the destination is aligned. So instead, I'm going to use this packed_char16 type, which is an unaligned vector type, for the destination. So now that I've set up my types, the actual meat of the copy is really just these two lines. All we do is load the vector from the source and compare it to zero.
If any lane is equal to zero, then we stop the copy. So while no lane of the vector is zero, we continue; as soon as a lane is zero, we're done. Then in the copy, we just copy that vector from the source to the destination and advance both the pointers to the next vector. Really simple.
And then finally, if we did find a zero, if we found the end of the string in the next 16 bytes, well, we just copy it byte by byte from there until we reach the end. This is a really simple implementation. It's not the best implementation that's possible to write, but it was really easy and it's going to give us a nice performance win.
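The slide code itself isn't reproduced in the transcript, but a sketch along the lines Steve describes might look like this (the function name and exact structure are illustrative, not Apple's slide code):

    #include <stdint.h>
    #include <simd/simd.h>

    char *vector_strcpy(char *restrict dst, const char *restrict src) {
        char *result = dst;

        // Go byte by byte until the source has 16-byte alignment, so we can
        // use aligned loads from here on. Stop early if we hit the terminator.
        while ((uintptr_t)src % 16 != 0) {
            if ((*dst++ = *src++) == 0) return result;
        }

        // Aligned vector type for the source; the destination has no
        // alignment guarantee, so use the packed (unaligned) type there.
        const vector_char16 *vsrc = (const vector_char16 *)src;
        packed_char16 *vdst = (packed_char16 *)dst;

        // Copy 16 bytes at a time until some lane of the source is zero.
        while (!vector_any(*vsrc == 0)) {
            *vdst++ = *vsrc++;
        }

        // The terminator is somewhere in the next 16 bytes; finish byte by byte.
        src = (const char *)vsrc;
        dst = (char *)vdst;
        while ((*dst++ = *src++) != 0) ;
        return result;
    }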
So let's look at that. We're going to look at the performance measured in bytes per nanosecond for a variety of string lengths. Now bytes per nanosecond is how fast we're copying, so the more data we copy per unit time, the better off we're doing. Higher is better on this graph.
We start with that scalar code we had. And you can see we get up to about half a byte per nanosecond, which is still 500 megabytes a second. We're moving a lot of data. But we're going to do a lot better. Let's look at that SIMD implementation, which you recall is just a few lines of code. It's almost ten times faster.
And, as I said, it's possible to do better if you really pull out the stops. Here's the performance that we get from libc on the platform, which is also a vectorized implementation, and it does some really clever things with fetching and alignment to get more performance. But you notice, we're getting most of the performance of libc.
We got nearly a 10X win. We're within 80 percent of the performance of libc for long strings. And we got that with just a few lines of code that were really easy to write. The libc implementation is in assembly. We wrote four lines of C, basically, to get the performance we see here.
And that's really what our message for today is: Accelerate has always given you really fast vector code, and what we're doing now is trying to make it even simpler for you to get at that, to make it so that more developers can easily take advantage of the performance that the hardware offers.
Now I want to note that LinearAlgebra and SIMD are both brand new libraries. They do have some rough edges. I'm sure you'll try to do things that we haven't thought of. But that also means that you can tell us what use cases are really important to you, and we'll take that into account. You can have an enormous impact on the way that these libraries develop.
If you want more information about Accelerate or SIMD, we have two great contacts, Paul Danbold and George Warner. There is a bunch of documentation available for vImage and vDSP online. I would also recommend looking at the headers in Accelerate if you need documentation. vImage, vDSP, LinearAlgebra, and SIMD all have lots and lots of documentation in the headers.
It's a fantastic resource if you want to know more about how things work. The developer forums are a pretty good place to ask for help with things. If you're going to file any comments in the developer forums, the place to file them is under Core OS. That's where you're most likely to get the attention that you want.
And the Bug Reporter is also a great way to report issues, make feature requests. You don't only need to use this if there's a bug in the conventional sense. You can say hey, it would be great if, you know, I could do this thing that's a little bit different from what you're doing. Or make an entirely new feature request.
Or say I tried to do this, and the performance wasn't quite as good as I thought it should be. Those are absolutely bugs, and those are things that we want to look at. So file bugs early and often. We love to get them, and we love to get feature requests from you guys.
A ton of the stuff that we've done in the past few years has been motivated by feature requests we got from external developers. There are some related sessions that are worth checking out. If you're here, you're almost certainly going to be interested in the Metal sessions that are tomorrow morning. Those are a great thing to check out. Thanks a lot for coming by guys.