OS Foundations • 42:53
Optimize your performance-sensitive applications with guidance from Intel's performance engineers. Discover how to take advantage of the Intel SIMD (SSE/SSE2/SSE3) instruction set and other features of the Intel architecture. The session will emphasize moving PowerPC-optimized code--including highly-tuned Velocity Engine routines--to the Intel architecture. Additional in-depth coverage on the use of Intel's compilers will help you round out your optimization toolkit.
Speakers: Justin Landon, Phil Kerly
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.
Good morning. It's amazing. If you had asked me a year and a half ago, would I be on stage talking to a bunch of Apple developers at WWDC, I thought you would have been crazy. But it's here, and I'm overwhelmed but excited. I've had a wonderful time actually attending, talking to a lot of people, doing a lot of great things on the Mac, and I think it's fantastic that it's finally materialized on Intel hardware. In fact, it wasn't until about a year and a half ago that I had first walked into an Apple retail store for the first time. And I have to tell you, I'm a convert.
So when I was at-- So when we came up with the topic for performance optimization for Intel-based Macs, I took a lot of time to think about exactly what that meant. There's a lot of different avenues that that can take, top to bottom, from algorithms to implementation to instruction set, down to the actual hardware and how the hardware architecture is actually implemented. And I thought, well, first of all, we should focus on things that are unique to Intel versus what's to Mac. So that would rule out things like algorithms. Most of those are going to be common across both architectures.
And I thought that, well, I could talk about the hardware and the architecture, but there's another talk that actually talks about the architecture. Plus, I don't think that you actually get the bang for the buck if you look that low level. The first part is you have to look at algorithms, then look at implementation, then you get down to the hardware and the actual architecture and tuning for the architecture. So I wanted to pick someplace in between. And one of the unique places where you can get one of the biggest bangs for the buck is actually looking at your AltaVec implementation and converting that to SSE.
So that's really what we're going to focus on today, is really demonstrating the steps to optimizing something that's already been well thought out of in terms of the algorithm, has already been implemented in AlteVec, and really looking at the steps of what do we need to do to get to SSC quickly and gain the performance from the SSC implementation.
I'll highlight, hopefully mostly, similarities, because if you look at the instruction set, there's a lot of overlap. It's almost straightforward, one-for-one implementation, with a few differences. And then, you know, GCC has been optimized for the PowerPC for a very long time. Apple's done a great job for tuning for PowerPC.
And it's only been a year, year and a half, that we've actually looked at the Intel processor. And so that's where the low-level architecture comes in. And the experience from the Intel compiler group at Intel in terms of optimizing for our architecture. So we'll look at, you know, what are the benefits of, in this particular case, of using the Intel compiler on this particular application.
So we're going to introduce Noble Apes. If you were here at WWDC last year, I know that it was a topic of using Shark and how to do optimization. In that presentation, they started out with a scalar implementation, converted that to AlteVac, got significant speedups, and then actually focused on optimizing that. So I thought that would be a great place to start because the application's already been optimized for PowerPC. It's already gone through the implementation to AlteVac. It's been optimized for AlteVac.
We'll take that as a starting base. And now we get into something that's more unique to Intel, which would be the SSE or the Streaming SIMD instruction or extension instructions. So we'll translate that for Noble Apes. We'll look at the performance, and then we'll apply the Intel compiler and then close with questions at the end.
So what is Noble Apes? Noble Apes is a very simple application that simulates a bunch of apes running around an island. There's one particular ape that has a view of his terrain. That happens to be the one that's actually with the red dot. And when I get into the demonstration, I'll point it out a little bit more.
And then the AlteVec part is actually focused on the brain of the ape as it's thinking, as it goes into its sleep cycle, as it wakes up, as it eats, as it wanders around during the day. And then we measure how many thoughts or how many calculations we're doing per second for that ape to actually think about what it's doing. And this is an application that ships with the Chud tools in order to be able to use with Shark and follow along with the tutorial that goes with that. It's also multithreaded.
So that was another thing that I didn't really feel like we needed to talk about in terms of optimization. Even though Intel has multi-core, Apple Mac has had had multi-core for a long time as well, and so that's not something that we were going to necessarily cover. But it's great for this particular application. So let me go ahead and switch to Noble Apes just to give you a little preview of what it looks like.
OK. So see if I can zoom in on this a little bit, if that helps. OK. So up here in this corner, you can see that the-- oops, my mouse is moving as I move along-- that the daytime, nighttime is changing. The apes are wandering around the screen.
This is the particular ape that we're actually looking at that happens to be wandering around the screen. This is the terrain view that that particular ape is looking at. And these are the thoughts that the ape, or actually what the brain is looking like as it's wandering around this particular island. And this is what we're looking at. So we have op thoughts per second.
Right now, we're at about 1,300 or so for the scalar implementation. There was also a vectorized implementation that got a significant improvement. We're up to-- over 3,000 op thoughts. There's a vectorized optimized version, which gets even more. We're up around 48, 49, 5,000 timeframe. And then it's also threaded.
So I thought this was a good application to start with. It kind of covered a lot of the ideas for general optimization techniques. It had multi-threading in it, was already AlteVec, and then I would take a look at it and see what we could do from an Intel perspective for optimizing this particular application.
[Transcript missing]
So a real quick brief overview of what SSC means. It's the streaming SIMD instructions or extensions for the Intel processor. We started out with the multimedia extensions, which was focused on 64-bit versions for integers. We added SSC, which was the first incantation of the SSC instructions, which was focused on single precision floating point instruction set.
SSC2 went back and added the 64-bit integer versions that were similar to the MMX version but were actually implemented in the 128-bit registers. And then SSC3 added some additional complex math and some horizontal add/subtract instructions, a very much smaller delta. All three of these instruction sets are supported on the Intel Mac-based processors today.
So as I was going through the porting, I thought, well, maybe it's good to put together a little bit of a recipe of what I did in terms of actually porting the application. The first one was I wanted to make sure that we supported universal binaries. We've already done all the work, or Apple has already done all the work, to implement the AlteVec version. There's still a lot of platforms out there that are going to be PowerPC-based. So we definitely don't want to take that away from them. So we want to make sure that we support universal binaries.
Then we really need to look at the data structures and the data and modify the types that are being used between AlteVec and SSC. Then it was fairly straightforward for, I would say, almost 90% of the instructions to go through and just implement or transcribe the one-to-one correspondence between the AlteVec instruction to the SSC instruction.
There's a few instructions that we don't support. We don't support on a one-to-one basis, but with a combination of two, three, or four instructions, you can achieve the same thing. It's pretty much of a recipe type of implementation. There's really not a great alternative unless you can modify your algorithm to avoid it. So there are some just very standard ones that you just go in and replace with the single one.
And then I would say there is nonstandard combinations, things that use instructions like permute, some of the multiplies. There are some just very standard ones that you just go in and replace with the single one. And then I would say there is nonstandard combinations, things that use instructions like permute, some of the multiplies. So there's a little bit more harder part to it.
Not all of the data type variants are supported, so you start having to think about what is your precision requirements? Why did you do the algorithm the way that you did? So there's a little bit more thinking to it. And then actually looking to see how you can optimize the implementation to take advantage of the SSC instructions. So the AlteVec implementation was done.
They looked at the set of best instructions to use. And the same is true for SSC. You can look at the full set and see if there's an alternative approach that you might be able to take better advantage on SSC that you might not have done with AlteVec.
And then we'll apply the Intel compiler and see what the speedup is. So the first step was really just to go into the project and flip the switch from PowerPC on the architecture type to I386. Try compiling it. You'll get a lot of errors. Wherever there's an alt-devec instruction, you're certainly going to get an error.
And go in there and actually weed out that part or partition that part into conditional compilation by using the standard defines that the compiler supports. So on alt-devec or on the PowerPC side, the alt-defec define is automatically enabled by the compiler. On Intel compiler and GCC running on I386, SSC and SSC2 are automatically defined. SS3 is turned off by default, but you can easily turn it on. All of the Intel-based Macs support SS3. From the laptop on up. And so let's go ahead and just kind of show what we actually did there. Let me take this back out of zoom mode.
I have a lot of code that's in here, but we'll go up to the first vectorized implementation. start at the beginning. Now one of the things that I wanted to do as part of partitioning was to make sure that I kept as much common code common between the two implementations.
There's no sense just partitioning the full function, because there's a lot of C code, setup code at the beginning, prolog, epilog at the beginning that was common that we didn't need to comment out, and if you made changes to one, then you'd have to remember to make changes to the other. Here's the beginning of the function for the scalar implementation. Let's go ahead and search through for... The vector. Okay. And so here's the vector implementation. We have some initial declarations, okay, that we kept exactly the same.
There's one instruction here that we had to modify for initializing the vector. You'll notice that I'm actually using the data type declared here using the Accelerate Framework data type. And so I decided to just partition out these two into the implementation for initializing the variable. But for the rest of this function, a lot of this setup code at the beginning is just straight standard C code.
It's all within the function itself. Didn't have to do anything different until we got to the alt-vect section. In the original code, they had commented out the scalar implementation, so you could see what that scalar loop would look like. But then immediately, we start getting into these vect unsigned char declarations.
A bunch of more declarations and initialization. So I just went to the beginning of where I saw all the errors to begin with for this routine and put an ifdef around the full block of code. Cut and pasted that so I would have a reference to go through. Let me scan down here.
So this is all the AlteVec code. We get down there some debug. And now we get into SSC. So I ended up just cutting and pasting the original code. So that I could use that as the template to start my Alphabet implementation. So if we go back to the slides.
So the next step after having done that was to really start looking at the data types and modifying them to fit the Intel architecture. There's three basic intrinsic data types to SSC. That's 128-bit, which is an integer-based. It doesn't matter whether it's a vector of char, shorts, ints, longs. It's all the same. It's just a vector of integers, and that's the way the instructions look at them.
There is a M128, which was the original SSC data type, which was a vector of four floats, and then an M128d. So it's underscore, underscore, M128d for a vector of doubles. Now, you could stick with those data types and actually just replace the vector unsigned int or unsigned char declarations that were in the AltaVec.
But one of the problems that you have if you use those particular data types is that you can lose track of exactly what your variable type is and what you're actually operating on. So one of the recommendations is to actually use the AlteVec--or the Accelerate framework and their vector types, because they've already been type-deafed for the Intel SSE data types. And it also helps support the readability of the code.
One of the other things that you'll run into is the fact that the SSE instructions are C-style. That means that no matter what variables you give, as long as they're M128s, the functions--you have to specify exactly what the function is going to operate on within that vector. For example, here, this is a compare-equal extended packed integer of 8 bits.
So it's no longer a C++ style where you could just give it the data type, the arguments will be over--or the functions will be overloaded, take those--the arguments to the functions and know exactly which function to actually implement. So that was the next step, was really just to go through and modify the data types.
And so here you can see we have vector-unsigned chars. We just went and basically declared this table. In this particular case, it was a table of M128Is. It was a lookup table. It didn't seem like it was necessary to specify exactly what their data type. This really wasn't being used in any of the particular calculations.
But then I have a special initializer that allows me to set all of the vector types, all of the values within the vector to the same value using a set1, extended, packed, integer 8 instruction. And that's the equivalent to the initialization that we see with the vector-unsigned char.
Did that for a couple of lookup tables. Here, there were a couple of constants that were defined, which, because you would have to, in all of the functions, actually cast them to M28s if I defined them differently, I decided to also keep these as M128s as well, just because I'm not writing to them. They're going to be constant throughout, and I knew exactly how they were going to be operated on. There was no sense actually casting them.
There were a couple of constants that actually get generated on the fly in the AlteVec, but in this case I used constants and declared them up front so that we would load them from cache when we needed them. It was cheaper to load them from cache than to actually try to generate the data on the fly.
Keep going down through the routine. And now we start getting into the inner loop. Okay, and this is where I just took exactly what the AltaVec instruction said. If it said vector signed short, I replaced it with the Accelerate Framework data type vector signed in 16. And in fact, if you were actually implementing this between a PowerPC with AlteVec and the Intel architecture, you could actually move these declarations out such that you would have one set of declarations for both architectures.
You wouldn't need to separate them out in the if-def between the two, AlteVec versus SSC. Make the code a little bit shorter. Puts all that stuff in a common area, and you're not rewriting the code. or if you make a change in one, you don't have to modify it in the other.
And so that was, you know, for all of the declarations, was very straightforward to actually implement. There was just a few cases where I decided to go ahead and use the M128 because of what it appeared to be more applicable for. In this particular case, this happened to be differences between the brain versus the old brain settings. And so it was actually a bit--a string of bits that you actually compared against, and so I kept it as an M128i.
Let's go ahead and switch back. So after we had the data types all declared and transferred and got rid of all of those warnings, then I looked at all of the instructions that were in the AlteVec and did the transcribe from one instruction to the other one. It's just a one-for-one. There's a significant overlap. You'll find that the modulo, integer, add, subtracts are fairly straightforward.
Pretty much all of the compare instructions, the packs, unpacks, integer, floating point conversions are all there. Floating point operations, pretty straightforward. There's a little bit less limited in terms of the overlap in the min/max instructions. So if you look at it, we're not quite as orthogonal in terms of the data types that we support.
But in any case, all of those were pretty much a one-to-one translation. Again, the SSE intrinsic instructions have to be data type aware, so you actually have to understand essentially what the data type that you're operating on to the end of the instruction. So let's go ahead and look at a couple of those instructions as well. We switch back.
Okay, so some of the first instructions that we actually run into are these Vector Merge High and Vector Merge Lows. Again, we do the exact same thing, only now instead of using the Merge High and Merge Low, we're using the Unpack High and Unpack Low. I'm just casting the value here because if I didn't, I would get a bunch of warnings about vector signed int 16 values not knowing how to convert that to an M128.
So I just go ahead and do it so that if you actually--you know, in my code, I like to get rid of all the warnings that I possibly can so that if this is something that I know that I need to do, I just go ahead and do the casting. So we had a whole series of unpacks. It was very straightforward. We came down to the vector adds. Okay. Again, a series of two adds together. Exactly the same thing. We have the two adds together. One-to-one correspondence.
Again. So it's a very straightforward translation up to this point. Even these particular unpacks And these subtracts, still again very straightforward, up until we get to the vec-multi or the multiply-even and multiply-odd. So that's where we ran into the first difference that was not really a one-to-one translation between the two.
And so that's the first one where we hit where it's not really a non-standard one. But the standard translations, we also hit a little bit further down in the code, which I'll show you. And one of them is the vector select, which is essentially the implementation is here in AltaVec. It's just a straight vector select, A, B, and C, for the C, B, and the mask, on which you want to select out of the two registers, and you're just picking the bits between each one.
The implementation here for SSE is just doing an and not and an and, and then or-ing those values together. A great resource for migrating AltaVec to SSE is actually in Apple's own documentation on this translation. They have a lot of these very standard types of translations between AltaVec and SSE.
So here's basically a graphical display. You see that the C variable has some 1s and 0s. Those are just the selector values. In this particular case, we just represent them as 0, 1, but they would actually be 8-bit values or 16-bit values, depending on the input. And here we have the same implementation in SSE. We're doing the AND, the AND NOT, and then OR-ing the two to get basically the same result that you get from the SELECT. So let me go ahead and just show you that instruction real quick. Down here.
Okay, so here's the VEX Select instruction, taking the three arguments. Again, we have our AND NOT with our AND instruction. And then we're ORing them together. So that's just a, you know, straight recipe formula implementation for that particular instruction. But now I'm going to go back to the non-standard implementation, which is using the mole-high and the mole-even and the mole-odd. Which is right here. Okay. So let's go back to the slides again.
So this is really the hardest part of the non-standard part of translating AltaVec to SSE. There are some integer multiplications that have limited support for the data type. So you have to look at what your precisions are. You may have to scale to support the precision that you're looking for.
The PERMUTE instruction is a very significant instruction that's supported on AltaVec that is not currently supported on Intel processors. And you really have to implement a series of packs and unpacks and shuffles in order to be able to get basically the same equivalent implementation to the PERMUTE instruction.
Then there's some also miscellaneous instructions that are non-standard between the two that make it a little bit tougher. And that's the hard part where you have to actually think about the algorithm that you might implement. And then there's also some unique instructions that you might take advantage of for Intel that you might not actually have on the AltaVec that'll get you some improvement as well. So here's the MUL-E and the MUL-ODD, or even and odd.
And essentially, they multiply the two vectors. In the case of the even, you take the even vector input and you come up in this case it's a 16-bit implementation you end up with a multiplication of the two 16-bit values for a 32-bit value and you get them in an all odds or all even vector and this is very different from the way the intel style works in that we tend to always do everything in order so you get all of the vectors zero through seven if you happen to be doing this particular operation instead of the even and odds but we take it in the upper half and lower half So here's the MOL low. And this is a case where we actually do the same multiplication, but we take the lower 16 bits.
And we put that in a vector. And if we do that to the high, then we get the upper 16 bits in the high. If we unpack those together, then we get the 32-bit values. So we do an unpack high and an unpack low, and we can get the 32-bit values.
But they're still in the same vector order that they originally were in the original vectors. So we're still going from, in this case, 0 to 4, or 0 to 3, and then from 4 to 7 in terms of the vector, which is very different than the MOL even and odd. So let's go ahead and switch back to the code.
So here is a case where we implemented the mole high and mole low, and the high unpack and the low unpack, and this basically gives us our 32-bit values. If we wanted to model the AltaVec exactly, one-for-one instruction, we would have to do a further unpack of these instructions in order to get all the evens together and all the odds together. But when we actually looked at the algorithm, it turned out that further down in the algorithm, after they did a bunch of calculations, they ended up doing that same packing and unpacking in the result.
So right here, they ended up doing the merge at the end between the high and the low to get them back into vector order from the original straight linear vector sequence. So we decided that instead of trying to mirror the odd and even, we would go ahead and just leave them in our particular format and wait to the end. And then if we still needed to do that, we would do that at the end, and it turned out that we actually could avoid a couple of instructions by not having to do that.
[Transcript missing]
So that was pretty much the full implementation. It actually vectorized very well following the standard implementation that we had for AlteVec. But there were some additional things that we could do. One was we didn't have to do the additional unpacking on the unpacks in order to get it into the even and odd order, then to further have to go back and do packs to get it back into the normal sequence. And so we could take advantage of that SSC feature and avoid having to do that extra packing and unpacking.
We also looked at re-evaluating the precision requirements. A lot of this stuff is done in 16-bit, goes back to 32-bit. It's possible that maybe in this particular case the precision isn't quite necessary. You could actually keep it all in 16-bit. But to be compatible with the implementation as it was, and to be fair in terms of the comparison between the two, we decided to keep it at 32-bit precision.
There are a couple other things that you have to be aware of that we looked at. Actually, in this particular case was looked at memory alignment, watch for cache line splits, because when you're loading data and storing data, you have to be careful of that on the Intel architecture. There was no floating point, but if you were doing floating point, you would want to make sure that you looked at or had caution for denormals. On the Intel architecture, denormals are enabled by default. On PowerPC, they are disabled.
And then we looked at partial register stalls. It turned out there was no partial register stalls in this particular application. And then I already showed you that we actually generated--instead of generating constants on the fly, we actually declared them and initialized them up front. But there was an additional instruction that we could take advantage of that, once we looked at the algorithm, to actually improve the performance of this particular sample app. And that was the packed fused multiply add instruction.
And what that does is basically takes the two 16-bit values, multiplies them to get a 32-bit result, and then adds it to its neighbor. And it turned out that's exactly what this application was doing towards the end. So let me go ahead and show you that particular piece of code. So here, we did all our multiply, even, and odds--in our case, high and lows. But down here, we had this addition. Okay, that happened to take the result of that multiplication.
For the two. And so what we decided to do was instead of going through all of these multiplies and unpacks, we were able to combine that into the multiply-add--the fuse multiply-add instruction. We were able to get rid of the unpacks. Bring it down here to the implementation. We're actually in the optimized version now.
So here's the equivalent that we were able to do instead. Okay, so we did a couple of unpacks, but now we did the multiply and the add, and we got the 32-bit value that we wanted, plus we were able to complete the addition that we needed to do in the instruction further down.
So we replaced all of those mole-highs and mole-lows with the multiply-add on all four implementations here. And then down here at the bottom, we no longer had to do the addition. All we're doing is the subtraction that was actually implemented in the original code, which is up here. So originally, we had this shift-right arithmetic, the subtraction, and the add. Now we just had the shift and the subtraction. We've already done the addition in the multiplication part, in the multiply-add instruction. And it gave it to us in 32-bit values.
Once we made that optimization, then we were doing pretty well in terms of the performance. Let me go ahead and switch back. And so I'm going to bring up Justin Landon, my colleague, who will go through the Intel compiler. He'll talk about the actual performance that we got for this implementation of Noble 8s.
Thank you, Phil. Now that Phil's done all the hard work of translating this application over to use SSE, let's take a look at what the performance of the application is. This is Noble Ape, and this was built with a GCC compiler that ships with Xcode 2.3. And as you can see, we have three different computation modes. We have a Scalar, a Vector, and a Vector Optimize.
What we're comparing here is G5 iMac versus the 2.0 GHz Intel Core Duo iMac. And we basically have the three-- Scalar, Vector, and Vector Optimize. And these are the ape thoughts per second that we're seeing on each platform. As you can see, we're doing quite well on the Scalar.
We're doing pretty good with the Vector, which is just basically a Vector-optimized-- or not an optimized version, but just a scale-- a Vector implementation of the Scalar version. And then we have the Vector-optimized version, which, you know, we take math tricks and try to optimize as much as possible.
As you can see, we're not doing so great on the last case there. But fear not. Before we take a look at the algorithm and try to change things around, we'll try to use the Intel compiler to see if that can improve performance for us. So if we can switch over to the demo machine. machine.
So what I'm going to show here is the ease in which you can use the Intel compiler to recompile your application and try to obtain additional performance gains. This is as simple as going to the target. Going to the Rules section here. And under the System C rule, we can change the compiler. We can just change it to the Intel compiler.
It will ask me to make a copy, and I can change it again here. And now we're all set there. The other thing that we need to take a look at is we want to make sure that we evoke certain optimization settings. So I'm going to go ahead and change those. And you can see right now, now that we've had the Intel compiler selected, we have the ability to set some switches within the Intel compiler.
The biggest ones that we want to take a look at here is we want to make sure the optimization level is set to O3. We also want to make sure that we have inline function expansion turned on, and OB2 is the highest setting for that. And we also want to have the IP flag, and this is a checkbox right here, and this allows function expansion within single files. And that's all we're going to do. So I'm going to go ahead and close that. And I'm going to clean and build.
So we're going to go ahead and let that build for a second. And as you can see on the bottom, you can tell that the Intel compiler is being used because it says "compile ICC." So you'll know that the compiler is actually being invoked. Now, if you're doing a universal build, what will happen is the Intel part of it will build the I native code, and then it will use GCC to build the PowerPC code.
So I'm going to go ahead and switch the Vector Optimize mode here. And as you can see, we've gotten a substantial speed-up just from doing that. This is on the MacBook Pro laptop. So if we switch back to the slide set, This is now what we're seeing as far as GCC versus ICC performance.
So as you can see in the vector-optimized case, we went somewhere around 8,800 apethots per second, and we've increased that to almost 12,000 apethots per second. So that's given us a 34% performance gain just by recompilation alone. So there was no extra engineering requirements or time investment needed in order to get us that extra performance gain. We've also gotten a substantial increase on the scalar and vector portions as well. So this is a pretty good case for using the Intel compiler. It's easy to use, and you can get substantial performance gains.
I'm going to go back to this slide so we can take a look at how we're going to look at the G5 comparison. So now if we're comparing against the previous iMac model using the G5, we're about 19% better on the vector-optimized case. So this is indeed showing that we can translate the code over to use our SC instruction set, and at the same time get better performing code. I'm going to go back to the demo and show you one more optimization switch that can easily be thrown.
So if we go back to the target here and make sure that we have the Intel compiler selected, If we scroll all the way to the bottom--looks like it's already been set--this additional flag that you can set is IPO. And this is basically set in the additional C options flag.
And what this does is it allows inline expansion within multiple files. So this will give you increased performance. So since I already built it with that, there won't be really, really need, but we'll do a clean anyways. While it's building, we'll go back to the slide deck here.
So now that I've thrown the IPO switch, this is what we were seeing on the 2 GHz iMac. So as you can see, I have the three columns here, the first one being GCC. The second column there is what we just compiled with the O3 and OB2 switch and the IP switch. And then the last one is where we have the O3, OB2, and IPO. And as you can see, we've gained an additional 6% just from throwing that switch. So that really is a really simple way to get additional performance within your application with minimal time investment.
Here again, we're comparing against the previous iMac model using the G5. And as you can see, now we're 24% better. So this is a clear indication that, you know, with a little bit of time and using the Intel tools, you can really get good performance gains out of your applications.
So, in summary, you basically want to use care in translating. You don't want to really translate verbatim. Going one-to-one will work sometimes, but you want to take some thought and care into how you're exactly translating your application over to use the Intel instruction sets. You may want to re-architect the data. You want to try to avoid using the permute since we don't have a direct analogous instruction that uses that.
And sometimes just reorienting the data to use our instruction set can save you some overhead of data shuffling. We highly recommend using the ADC Reference Guide that's available online on the Apple website. That's very helpful, and they have some good tips and tricks in there to help you translate your code. And utilize the Intel compiler. It's really easy to use. You let the compiler do the work, and there's no time investment on your part.