2009 • 47:36
OpenGL ES provides access to the stunning graphics power of iPhone and iPod touch. Learn how your application can create incredible visuals while maintaining high frame rates using the programmable pipeline enabled by OpenGL ES 2.0. Get specific recommendations for maximizing OpenGL ES performance and understand the best practices to keep your application on the fast path. This two-part session is essential for developers utilizing interactive graphics for games and other mobile 3D applications.
Speaker: Allan Schaffer
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.
Hello, I'm Allan Schaffer, the graphics and media technology evangelist at Apple. This presentation is Part 2 of Mastering OpenGL ES for iPhone. Part 1 focused on the environment around OpenGL ES, covering versions, the display system, the rendering loop, and the processor architecture. Part 2 goes deep into mastering OpenGL ES itself. I explain the architecture and the capabilities of the graphics processors, and then dive deep into specific performance recommendations for optimizing texture and geometry. So, welcome back. Let's continue.
So I begin this presentation with a deep look into the capabilities of the graphics processors and the specific differences between the PowerVR MBX Lite and the PowerVR SGX. And as I do this, you should keep in mind this chart from Part 1, outlining the platforms where each of those GPUs are supported. The PowerVR SGX is used in the iPhone 3GS and the third generation iPod touch, and the PowerVR MBX Lite is used in the iPhone, the iPhone 3G, and the first and second generation iPod touch. So, let's start with the details of the MBX Lite.
Now, as I said in Part 1, the MBX Lite supports the fixed function rendering pipeline of OpenGL ES 1.1. And its architecture is what's known as a tile-based deferred renderer. And I'll describe this in just a moment. Now, the same devices that support the MBX Lite are also very sensitive to memory bandwidth. And you'll notice that a lot of the optimization strategies that I talk about later in the presentation have to do with reducing the pressure on memory bandwidth for this reason.
There's a number of implementation limits on the MBX Lite, but one that deserves special mention is the 24 megabyte limit on textures and render buffers. This isn't an on-board limit per se, but rather reflects the fact that the MBX Lite can only address a contiguous 24 megabyte region in system RAM. So we always load the textures and surfaces into that region. And again, here's another reminder about the platforms.
Now, in addition to supporting the core functionality of OpenGL ES 1.1, the MBX Lite also supports a number of extensions, and I've listed them here. And you can see there's a number of multi-vendor extensions with the prefix EXT. There's a few extensions from Imagination Technologies, those are the ones with the IMG prefix, and a large number of standard extensions with the OES prefix.
Here I've highlighted a few particular extensions of interest. The extension for PVRTC texture compression is very important and a great way to reduce memory bandwidth usage, since the texture actually remains compressed even after it's loaded into system memory. I'll talk a lot more about PVRTC when we get to the texture section.
BGRA 8888 happens to be a native 32-bit texture format for the device. This avoids a byte swizzle, so it's a bit faster to load. And framebuffer objects. As I mentioned in Part 1, framebuffer objects are absolutely fundamental to how we do all rendering on the iPhone. All of your rendering gets directed to the color render buffer attachment of a framebuffer object, and this is picked up by your CAEAGLLayer and the rest of the display system.
Or you can have a texture attachment to your framebuffer object and perform a render to texture, and then use that texture in another pass that's bound to the color render buffer. But in any case, anytime you're using these extensions or the ones coming up in the next few slides, you should code defensively and always query for the presence of the extension and provide some fallback or error case in case it's not present.
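As a minimal sketch of that kind of defensive check, you can search the extension string returned by glGetString for the names you need (the PVRTC extension name shown here is one from the list above; a production version would match whole names rather than substrings):

    #include <string.h>
    #include <stdbool.h>
    #include <OpenGLES/ES1/gl.h>
    #include <OpenGLES/ES1/glext.h>

    // Returns true if the named extension appears in the extension string.
    // Note: strstr does a substring match, which is good enough for a sketch.
    static bool HasExtension(const char *name)
    {
        const char *extensions = (const char *)glGetString(GL_EXTENSIONS);
        return extensions != NULL && strstr(extensions, name) != NULL;
    }

    static void LoadTextures(void)
    {
        if (HasExtension("GL_IMG_texture_compression_pvrtc")) {
            // load the PVRTC versions of your textures
        } else {
            // fall back to an uncompressed or 16-bit format
        }
    }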
Next, here are some of the standard implementation limits for the MBX Lite. The maximum texture size is 1K by 1K, and as I'd mentioned, there's a 24MB limit for textures and surfaces. On the MBX, your texture dimensions must be a power of two for both width and height, and you have two texture units available for multi-texturing. Then, getting into the more detailed limits, your point size can be from 1 to 64 pixels, and your line width can be up to 64 pixels.
Your maximum LOD bias, which applies if you're generating texture mipmaps and want to artificially bias which mipmap level is chosen, is 2. The matrix palette extension supports up to 9 palette indices and 3 vertex units. And finally, there's one user clip plane. So, that's the MBX Lite: the general capabilities, the extensions available, and the limits that are enforced. Now, let's contrast that with the PowerVR SGX.
So the SGX is also a tile-based deferred renderer. And as I said in Part 1, it supports both the fixed function pipeline of OpenGL ES 1.1 and the programmable shading pipeline of OpenGL ES 2.0. The 2.0 pipeline is what's supported natively by the hardware. And actually, we express the 1.1 pipeline in hardware through some specially written shaders.
Now, the devices with the PowerVR SGX have higher memory bandwidth, so they'll be less sensitive to some of the memory bandwidth concerns that I'll be discussing later. And also, these devices have no particular limit on texture and surface memory. But the next three capabilities are very important. First is vertex buffer object acceleration in hardware. Now, both the MBX and SGX support vertex buffer objects. But on the SGX, you should see a significant performance benefit from using them due to the hardware support.
There's also hardware support for stencil with an 8-bit stencil buffer. And textures on the SGX can have any arbitrary dimensions up to the maximum texture size. They don't have to be a power of two. And just remember, the SGX is the graphics processor used in the iPhone 3GS and the third-generation iPod Touch.
Now, when your OpenGL ES 1.1 application happens to be running on a device with the SGX, a number of new extensions appear. I've highlighted them here. There are several new extensions dealing with blending and the blend equation. You now have the ability to bind mipmap levels to a framebuffer object and do a render to texture directly to a mipmap level. There's support for the stencil buffer, either separate or interleaved with the depth buffer. And our extension for limited non-power-of-two texture dimensions.
So, let me emphasize, these are new extensions or added features that your existing OpenGL ES 1.1 application can take advantage of anytime one of your users happens to be using an iPhone 3GS or third generation iPod touch. If your application could take advantage of non-power-of-two textures, for example, they're just there now and you can just make use of them. Just make sure that you always query for the presence of any extensions you're using and conditionally handle the case where you're running on a device with the SGX.
Now here are the implementation limits you'll see when running an OpenGL ES 1.1 application on the SGX. The maximum texture size is now up to 2K by 2K, and that 24 megabyte limit for textures and surfaces is gone. There's no set limit for textures and surfaces. Your texture dimensions can be any arbitrary value up to the maximum size, so either non-power of 2 or power of 2.
And now you have eight texture units available for multi-texturing. Then again, the more detailed limits. Your point size can now go up to 511 pixels. Your line width is now up to 16 pixels. Your maximum LOD bias is 4 instead of 2. And the matrix palette extension supports up to 11 palette indices and 4 vertex units. And you have up to 6 user clip planes. Okay, so that's the SGX with an application that's created an OpenGL ES 1.1 context. Next, let's take a look at what happens with a 2.0 context.
First, here's the extensions. Now, these are, of course, extensions on top of the OpenGL ES 2.0 baseline specification. And there is a tremendous amount of new functionality in the core of OpenGL ES 2.0. Things like programmable shaders or framebuffer objects are now part of the core. And this is a list of extensions that are provided in addition to that. So most of these we've already seen. Render to mipmap levels, packed depth and stencil, PVRTC, and so on. But the one that's new is standard derivatives. This is a shader extension. It provides the dFdx, dFdy, and fwidth functions in your shaders.
And here are the limits that apply when you're using a 2.0 context and programmable shaders. As before, your maximum texture size is 2K by 2K, and you can sample from a maximum of eight textures in your fragment shaders. You can have up to 16 vertex attributes, eight varyings, 128 uniforms in vertex shaders, and 64 uniforms in fragment shaders. And then point size and line width ranges are the same as before.
Next thing I'd like to do is to briefly describe the tile-based deferred renderer architecture. Both the MBX and SGX have this architecture, and it's different than what you may be accustomed to if you came to OpenGL ES from the desktop, where you typically have a stream renderer. The basic way a tile-based renderer works is to spatially divide your screen into a number of tiles, and as OpenGL commands arrive, they'll be distributed to the appropriate tile on screen. Each tile will be building up and buffering a list of commands to be drawn within itself.
And then when we're done at the end of the frame, each tile will perform what amounts to a hidden surface removal pass over the commands in its buffer. And it will only rasterize the geometry that's actually visible. So you get a huge benefit. This architecture just tremendously reduces the depth complexity in your scene in a typical 3D rendering. And reducing the depth complexity lightens the fragment processing burden quite a lot, which would otherwise be a significant bottleneck.
But of course, this is a different architecture, different than what's been broadly used in desktop hardware. So many of the desktop assumptions about what's fast or slow may have changed in some instances. In particular, on a tile-based architecture, it's relatively expensive to change your scissoring, viewport, or dither state mid-frame, as those can lead to changes in the tile layout that would have to be managed. Likewise, binding a different framebuffer attachment mid-frame, or calling glReadPixels, introduces a synchronization point between the CPU and the graphics processor, and this lessens the very deep pipelining that would be achievable otherwise.
And finally, calls like glTexSubImage2D to subload a texture or update a portion of a texture. These can have unintended consequences in a deferred renderer. We may actually need to keep a copy of the texture as it existed before the change and also make a second copy with your changes, since we won't actually start rasterizing those pixels and sampling from the textures until the entire frame has been submitted.
So if you're doing any of these things, they actually need to be very carefully scheduled as to where in the timeline of submitting commands for a new frame you'll put these calls. And I'll show you how to do this in just a moment, but there's one other topic first.
I just want to make one more point about the hidden surface removal calculation. The thing to understand is that on a per pixel basis, that calculation is only effective for as long as we're drawing something opaque. As soon as we draw a non-opaque object, an alpha blended object, that invalidates the pixel and everything that follows for that pixel isn't going to be able to take advantage of the hidden surface removal calculations.
So that means you should draw all of your opaque objects first so that all of them can take advantage of hidden surface removal. And you should draw all your alpha blended objects last, sorting those from back to front. Now, there's actually one other category to draw in between these other two groups, and that's any geometry that's drawn along with a shader that's using the discard command to throw away fragments conditionally. So when it does come time in your frame for submitting geometry, you should draw all your opaque objects first. Sorted by state, of course. And then any objects using discard, also sorted by state. And finally, any alpha blended objects sorted back to front.
And this can sound counterintuitive to beginners. You would think that if you were, say, drawing a bunch of cars, you'd prefer to draw all of the first car, and then all the next car, all the third car, and so on. But that's not optimal. Once you've calculated which cars are actually on screen, what's optimal is to draw all the opaque parts of all the visible cars first, sorted by state.
So the wheels of all the cars, the bodies of all the cars, and so on. And then to draw all the alpha blended parts last, sorted back to front. So the windows of the car that's furthest away from the eye point, and the windows of the next closest car, and the next closest, and so on.
So in terms of the frame architecture that we'd suggest, it would look something like this. Imagine if you were doing all the things I just mentioned, the good ones and the ones to be careful about. This will show you where in the sequence of commands you should put those calls. If you're calling glTexSubImage2D to update a subregion of a texture, then the best place to call that would be before you've submitted any commands actually using that texture. This allows us to avoid making a copy since you actually haven't used that texture yet.
If you're binding a framebuffer object, then this is a sync point, and you generally want to put any sync points at the very beginning or very end of your frame, since those are natural sync points anyways, and it will minimize any impact on the command pipelining. You can then draw all your objects as I've just explained, opaque first, then discard, then alpha blended last. And at that point, you'll have presumably made the last drawing command in this frame.
So now we're at the very end of the frame, and only then it might be okay to issue a command like glReadPixels, since this will introduce a sync point and cause all the tiles to execute their command buffers. But so would the presentRenderbuffer call, so in this case, it's okay.
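Putting those pieces together, a frame on this architecture might be laid out roughly like this sketch; DrawOpaqueObjects, DrawDiscardObjects, DrawBlendedObjects, and the texture, framebuffer, context, and pixel-buffer variables are all hypothetical placeholders for your own code:

    // Sub-loads first, the framebuffer bind at a natural sync point, drawing
    // in the opaque / discard / blended order, and any readback at the very end.
    - (void)drawFrame
    {
        // 1. Texture sub-loads, before this frame uses the texture being updated.
        glBindTexture(GL_TEXTURE_2D, dynamicTexture);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 64, 64,
                        GL_RGBA, GL_UNSIGNED_BYTE, newPixels);

        // 2. Bind the framebuffer object at the start of the frame.
        glBindFramebufferOES(GL_FRAMEBUFFER_OES, framebuffer);

        // 3. Opaque objects first, sorted by state.
        DrawOpaqueObjects();
        // 4. Then anything drawn with a shader that uses discard, also sorted by state.
        DrawDiscardObjects();
        // 5. Alpha-blended objects last, sorted back to front.
        DrawBlendedObjects();

        // 6. Only now, if you need it at all: a readback, then present.
        glReadPixels(0, 0, 480, 320, GL_RGBA, GL_UNSIGNED_BYTE, screenshotPixels);
        [context presentRenderbuffer:GL_RENDERBUFFER_OES];
    }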
Now, one more platform to mention is the iPhone Simulator. In terms of its feature set, it's very similar to an iPhone 3GS. It supports both OpenGL ES 1.1 and 2.0-based applications. But it's also very different because it's running on your Mac. It does not have a tile-based deferred renderer architecture. It does not enforce the memory limitations you find on an actual device.
It does not actually try to achieve pixel-for-pixel parity with what you see on the device. And obviously, it doesn't fit in your pocket. But it's really a great tool for debugging, and not just in the obvious way that you might expect. Rather, it's wonderful for comparative debugging and trying to isolate the source of a problem.
So imagine you have a texture that's always showing up white on the device when it should contain an image. Well, you can try your app in the simulator and compare the results. If that texture still shows up white in the simulator, then you know that the problem is probably in your code and related to the way that you're using OpenGL ES.
But if the result is different, even this is interesting. It's not a case where you should jump to conclusions and assume you found a bug in OpenGL. Instead, this can be frequently a telltale sign that you've tried to set a mode that isn't supported or exceeded a limit of the implementation for that specific device. So using the simulator for these sorts of clues can really help you. We all make these kinds of mistakes, but the simulator can be really helpful in finding them.
Now, in terms of the specific capabilities, the simulator supports the complete OpenGL ES 1.1 implementation. And as for the extensions, the only one from the SGX list that isn't supported is the packed depth stencil extension. And then likewise, it also provides a full implementation of the OpenGL ES 2.0 pipeline, but omits two of the extensions we saw on the SGX: packed depth stencil again, and standard derivatives. So to summarize, again, to master OpenGL ES, you really need to code your app defensively.
Rather than reading the device name and making assumptions based on that, you should check the list of extensions directly for any that you need to use. And query the limits of the device that you happen to be running on. And then have code in your app that takes advantage of the expanded capabilities of the SGX when it finds them and provides a fallback when it doesn't for the MBX.
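For the limits, here's a quick sketch of querying at startup rather than hard-coding per-device values (shown for an ES 1.1 context; under ES 2.0 the texture-unit query is GL_MAX_TEXTURE_IMAGE_UNITS):

    GLint maxTextureSize = 0, maxTextureUnits = 0;
    glGetIntegerv(GL_MAX_TEXTURE_SIZE, &maxTextureSize);    // 1024 on MBX Lite, 2048 on SGX
    glGetIntegerv(GL_MAX_TEXTURE_UNITS, &maxTextureUnits);  // 2 on MBX Lite, 8 on SGX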
Now, we're going to change gears and go very deep into optimizing textures. And there's six topics to cover. The recommended formats, the details about non-power-of-two textures, and I'll talk about compression, atlases, mipmaps, and then have a number of specific best practices that you'll need to adopt. So, I'll start with texture formats, and this is pretty simple. There's a variety of texture formats supported across both the MBX and SGX, and they're listed here on the right. There's a few 32-bit formats, several 16-bit formats, a couple 8-bit formats, and then the compressed formats for PVRTC.
Now, I should clarify that when we're discussing formats, we're talking about the actual formatting of the image in system memory and not image file formats like TIFF, JPEG, or PNG. All those file formats end up with the image itself just being expanded in memory out to its full native depth and resolution.
We really recommend using PVRTC if you can. So again, particularly systems with the MBX Lite are more sensitive to memory bandwidth pressure, and PVRTC is the only format that's able to remain compressed even when it's in memory. The graphics processor is able to decode it directly from its compressed format. And I'll cover this a lot more in just a moment.
Then all the way on the flip side, as I'd mentioned earlier, BGRA 8888 is the native format for the device, enabling you to avoid a particular swizzle as we set up that texture. But really, all these formats are native, with the exception of RGB 888. You might think that format could have a 24-bit packing, but really it just gets padded out to 32 bits per pixel by the implementation.
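As a small sketch of using that native layout: you keep the internal format as GL_RGBA and pass BGRA as the source format. The exact token name (GL_BGRA, GL_BGRA_EXT, or GL_BGRA_IMG, all the same value) depends on which BGRA extension your SDK headers expose, and width, height, and bgraPixels stand in for your own data:

    // Upload 32-bit pixels that are already in BGRA order, avoiding the swizzle.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_BGRA, GL_UNSIGNED_BYTE, bgraPixels);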
Now, remember in the previous section, I'd said that the MBX only supports textures whose dimensions are a power of two, and the SGX supports both power-of-two and non-power-of-two size textures. So, there's several important things to know about both of these. Power-of-two textures have some additional flexibility. They're able to support mipmap generation and sample from the mipmap stack as they're being rasterized, and they're more flexible in terms of their wrap modes. They support repeat as well as clamp to edge.
But, of course, the dimensions are fixed to values that are a power of two. So, these are sizes like 512 by 512 or 64 by 128 and so on. And none of these sizes are perfectly matched to the size of the screen on the iPhone. So, there's always either going to be some scaling, or if you size the textures to maintain a one-to-one mapping of texels with pixels, then there will always be some region of the texture that's not being used. By contrast, non-power of two textures, of course, remove this restriction on their dimensions. So, they can be perfectly sized for the screen dimensions at 480 by 320.
And there's a number of advantages to that. Particularly for cases where you want to render to texture, or you just want some source imagery that maintains a one-to-one texel-to-pixel mapping without the unused region you'd get with, say, 512 by 512. But just bear in mind that non-power of two textures aren't necessarily as versatile. They don't support mipmaps and they only support clamp to edge.
The third topic is compression with PVRTC. So this is the ability to take a source image, say a 32-bit PNG image of a particular size, and perform a lossy compression that reduces it to either 2 or 4 bits per pixel. And that's a huge savings, either 8 to 1 or 16 to 1. And the GPU is able to directly sample from the compressed representation in memory, which means that far fewer bytes are being pushed around in a given frame to draw with a PVRTC texture.
Now there are some requirements. In our implementation, the dimensions of the texture must be square, the size must be a power of two, and the height and width of the base level must be at least eight pixels. We provide a compression tool with the iPhone SDK that will take any of the variety of desktop image formats supported by Image I/O on your Mac and compress them into PVRTC images. And here's the path again. On your Mac, it's /Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/texturetool.
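The command line I'd expect for that tool looks something like texturetool -e PVRTC --bits-per-pixel-4 -o shark.pvrtc shark.png, though you should check its built-in help for the exact flags. Once the compressed data is loaded into memory, uploading it is a single call; this sketch assumes a 256 by 256 image at 4 bits per pixel, so the data size is 256 * 256 * 4 / 8 = 32768 bytes, and pvrtcData is a placeholder:

    // Upload a PVRTC-compressed texture level directly; the data stays compressed in memory.
    glCompressedTexImage2D(GL_TEXTURE_2D, 0,
                           GL_COMPRESSED_RGBA_PVRTC_4BPPV1_IMG,
                           256, 256, 0, 256 * 256 * 4 / 8, pvrtcData);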
Now, here's a comparison between an original image and some versions that were compressed with PVRTC. So, on the left is the original uncompressed image. This was a 32-bit PNG that was 256 by 256. So, that one takes up 256K of RAM. Then in the middle is the same image compressed to PVRTC with 4 bits per pixel, which totals to 32K of RAM.
And on the right is PVRTC with 2 bits per pixel, that's 16K. And the quality is really pretty good. If we blow up the images and really compare them, here's the original, then here's the PVRTC at 4 bits per pixel, and that's really a close comparison. And I find that the 4-bit PVRTC is great for this kind of image.
And here's the PVRTC with two bits per pixel. So this entire image fits in 16K of RAM, but I see some artifacts when we get down to two bits per pixel. But depending on the usage, even this might be okay for you. So now a bit of guidance for when PVRTC will be most appropriate. It's really well suited for images from our analog world, things like wood and gravel, smoke or skin. But it's not really made for images that come from the digital world, images with really high contrast line art, for example, or fonts and symbols and so on.
And one last thought to leave you with about PVRTC is instead of doing a direct comparison between an original 32-bit image and its PVRTC equivalent, consider doing a comparison between images that fit into a fixed memory budget. A lot of the time, the decision you're making is really going to be between a 16-bit version of the original texture and a 4-bit PVRTC version. So at the sizes that I'm showing here, both of these would require the same amount of RAM, 128K in this case. But quite frequently, the PVRTC image will actually look better.
All right, our next topic for texture optimization is the concept of texture atlases. And what these are intended to solve is the very common mistake of defining a whole bunch of really small individual textures. And then when you go to draw them, you end up doing something like this.
You bind the first little texture and draw the shark. Then bind the second little texture and draw the character. And bind the third texture and draw the moon, and so on. And this is really inefficient because of all these state changes. Every texture bind you do is going to chip away at your performance.
So instead, you take all these little textures and combine them together into a big texture mosaic or texture atlas. The concept is really simple but very important because now you can just bind that big texture once. And by setting your texture coordinates correctly, draw the shark, draw the character, and draw the moon without ever changing textures. And over the course of a whole frame, this can be a big savings.
Now even better, to combine this with something I'll be talking about in the geometry section, is to use texture atlases in combination with combined geometry arrays. So now to draw all three of these objects, we just bind the texture once and we submit a single combined array with all the geometry. And this is orders of magnitude more efficient than that first example where we were binding each texture and then just drawing one little quad.
So, we highly recommend using texture atlases. And here's a few things to keep in mind. You'll have to modify the texture coordinates on your geometry. And I hope this is obvious. For example, on the shark image on the left, in terms of S and T, its T coordinates run from 0 to 1. But in the atlas shown on the right, it only takes up the top half of that larger image. So, its T coordinates run from 0.5 to 1.
Also, you need at least a pixel of padding between the elements so that any sampling just off the edge of one part doesn't pick up any colors from a neighboring region. You can't use the repeat wrap mode. You have to use clamp. And that's also so you don't sample off the edge of one region into a neighbor. And since they tend to be larger images, texture atlases are really best suited for static textures.
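As a tiny illustration of the coordinate remapping (the numbers are made up to match the shark example): a quad that used to sample T from 0 to 1 now samples only the top half of the atlas.

    // Texture coordinates for the shark quad, remapped into the top half of the atlas.
    static const GLfloat sharkTexCoords[] = {
        0.0f, 0.5f,   // bottom-left of the shark region
        1.0f, 0.5f,   // bottom-right
        0.0f, 1.0f,   // top-left
        1.0f, 1.0f,   // top-right
    };
    glTexCoordPointer(2, GL_FLOAT, 0, sharkTexCoords);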
Here's an example of some texture atlases from the game Pocket God from Bolt Creative. These are some really great examples. They're both 1K by 1K. The image on the left has more than 300 elements in it, and the one on the right has over 175. So just think about this. What they're able to do is first bind the left-hand image and draw all the objects in their scene that are using any of the textures in this atlas. In one draw call, perhaps.
And then bind the right-hand image and draw all the objects in their scene that are using any of those elements. Again, in one or just a few draw calls. So using this technique can really reduce both the number of state changes and the number of draw calls you make in a given frame. And both of those things can really improve your performance.
Next topic is mipmaps. So this is just the concept of taking an original large texture, call that your base level texture, and then creating a series of successively smaller and smaller copies of it. The first copy, half the size in each dimension of the original. The next copy, half the size in each dimension of that first one. The next one, half of that, and so on. All the way down to where you end up with a texture that's just one pixel by one pixel.
And then when it's time to draw, the graphics processor will choose which of those copies most closely matches essentially the size that your texture is being scaled to when it's presented on screen. And it'll sample the actual texels from whichever level or whichever copy it chooses. So this is very useful and we recommend generating and using mipmaps anytime you're presenting textures in a 3D scene. The only time you shouldn't bother with this is if you're only presenting unscaled images in 2D.
The benefits are that using mipmaps improves image quality in your scene. And actually, since you're potentially sampling fewer actual texels when you're using one of the smaller levels, it can help you with memory bandwidth. And that's potentially a nice win, especially on the MBX. There's two filtering modes, either linear mipmap linear or linear mipmap nearest.
And these have either a quality or performance tradeoff. And of course, there's the tradeoff of those extra smaller copies of your original texture. If you sum up the sizes of the copies, it comes out to an additional one-third memory usage compared to just having the original base texture.
Now, just a couple more tips before we wrap up this section. First is a special note about glTexParameter. Starting in iPhone OS 3.1, and only on devices with the PowerVR SGX, there's a benefit to putting any calls you make to glTexParameter before the call to glTexImage2D.
So, you use glTexParameter to configure texture filter settings, and of course, you use glTexImage2D to load the texture. But if you put the glTexParameter calls first, that will effectively act as a hint to the implementation for how to lay out the texture that's about to be loaded. And this will both improve its memory usage, avoiding a copy, and improve its load time.
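So a texture setup under ES 1.1 might be ordered like this sketch, with all the parameters, including the mipmap generation hint, set before the image is loaded (the texture name and pixel buffer are placeholders):

    glBindTexture(GL_TEXTURE_2D, texture);
    // Parameters first: they act as a layout hint for the upload that follows.
    glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP, GL_TRUE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    // Then the image itself.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 256, 256, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);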
A second tip, this one about UIImage. So UIImage is what many OpenGL based applications are using to load their images. You load an image from the file system, you get the CGImage from it, you render that into a CGBitmapContext, and now you finally have the pixels that you can hand directly off to glTexImage2D.
The tip here is to load that image with the imageWithContentsOfFile: factory method rather than imageNamed:, since imageNamed: keeps a cached copy of the image around in memory, and once you've handed the pixels off to OpenGL you don't need that extra copy.
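A condensed sketch of that loading path (the file path variable is a placeholder, and details like premultiplied alpha are glossed over):

    UIImage *image = [UIImage imageWithContentsOfFile:path];
    CGImageRef cgImage = image.CGImage;
    size_t width  = CGImageGetWidth(cgImage);
    size_t height = CGImageGetHeight(cgImage);

    // Draw the image into a bitmap context to get at the raw RGBA pixels.
    void *pixels = calloc(width * height * 4, 1);
    CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
    CGContextRef bitmapContext =
        CGBitmapContextCreate(pixels, width, height, 8, width * 4,
                              colorSpace, kCGImageAlphaPremultipliedLast);
    CGColorSpaceRelease(colorSpace);
    CGContextDrawImage(bitmapContext, CGRectMake(0, 0, width, height), cgImage);
    CGContextRelease(bitmapContext);

    // Hand the pixels to OpenGL, which makes its own copy, then free ours.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, (GLsizei)width, (GLsizei)height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    free(pixels);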
So we've said a lot about texture. And to summarize, we really recommend using PVRTC if your content is suited for lossy compression. Otherwise, consider using one of the 16-bit color formats like RGBA 5551 or RGB 565. We recommend generating mipmaps. We recommend using texture atlases. And when you're running on the SGX hardware, note that non-power-of-two textures are supported even in an OpenGL ES 1.1 app. Then those tips. Set up your texture parameters before you load the texture, and use the UIImage imageWithContentsOfFile: factory method.
But actually, we're not done yet. As I mentioned in the architecture section, doing any texture sub-loads in the middle of a frame can be expensive. So if you're doing this, we recommend putting those calls at the start of the frame before you've used the texture that you're about to modify. On the MBX hardware, it's imperative that you stay below the 24 megabyte limit for textures and surfaces. And then there's two things I hadn't covered yet. The first is to pre-load and warm all of your textures.
So the work done by the driver to actually prepare a texture for rendering is deferred until the first time you actually go to draw something with that texture. And sometimes this lazy loading is okay, but many applications are really trying hard to maintain a steady frame rate and would rather take that hit up front rather than at some arbitrary time while the render loop is running.
So the practical advice is to run through all your textures at the beginning of your app and draw something with each one. It can be just one pixel and it can be off screen. It doesn't matter. But just draw something from each texture to get them all pre-loaded.
Now, an obvious place you might think to put that work would be in your applicationDidFinishLaunching method. But actually, that's not a good place, since you want to return from that as quickly as possible during application startup. So you'll have to get clever, and this is something that apps will have to figure out on a case-by-case basis.
If after your app starts up, you're sitting on the menu screen, for instance, and waiting for the user to start their game, then that's an ideal time to be warming your textures. And if you don't get through them all, well, that's okay. Every little bit will help.
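A rough sketch of what that warming pass could look like, assuming you already have an array of texture names; it just draws one tiny triangle with each texture so the driver finishes its deferred preparation:

    static void WarmTextures(const GLuint *textures, int count)
    {
        // A tiny triangle tucked into one corner of the viewport; it can even be off screen.
        static const GLfloat verts[]  = { -1.0f, -1.0f,  -0.99f, -1.0f,  -1.0f, -0.99f };
        static const GLfloat coords[] = {  0.0f,  0.0f,   1.0f,  0.0f,   0.0f,  1.0f };

        glEnable(GL_TEXTURE_2D);
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glVertexPointer(2, GL_FLOAT, 0, verts);
        glTexCoordPointer(2, GL_FLOAT, 0, coords);

        for (int i = 0; i < count; i++) {
            glBindTexture(GL_TEXTURE_2D, textures[i]);
            glDrawArrays(GL_TRIANGLES, 0, 3);
        }
    }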
And finally, the last bit of advice for texturing is to remember that the call to glTexImage2D will make its own copy of your image data. So you should free your copy of that image data immediately afterwards. All right, wow, so that was a lot to absorb. But now we're going to go even deeper into optimizing geometry. And there's seven topics to cover. First, the fundamental data structures, vertex arrays, triangle strips, indexed arrays, and VBOs. Then, the optimizations on those structures through interleaving, smaller data types, and proper alignment.
The starting point is to talk about vertex arrays. These are the method for defining geometry with OpenGL ES, and they form the foundation of indexed arrays and vertex buffer objects. But I'm not going to teach the basics of how to create a vertex array. I bring them up simply to give you a reminder. They're called arrays for a very important reason. And it's to encourage you to fill them with a lot of data, to enable the implementation to do a bulk data transfer into the graphics pipeline.
And I see a lot of applications that are creating really short vertex arrays, sometimes just one or two triangles in an array, and submitting one or two triangles at a time to the graphics pipeline. If you're doing this, you're dropping orders of magnitude of performance on the floor.
So developers who have really mastered OpenGL ES become very focused on maximizing the length, of their vertex arrays and managing state very carefully, to allow them to pack yet more data into those arrays. All with the goal of having just a few very long arrays, rather than a lot of shorter ones.
Now, depending on your dataset, one optimization you might be able to apply to your vertex arrays is to batch up individual triangles into triangle strips. This is a method of reducing the number of vertices that need to be submitted to the graphics processor without any change to the number of triangles being rendered. So, here in this diagram, we see that the vertices on the inside edges of these coincident triangles are really just duplicates and can be removed, and all three of these triangles batched up into a single strip.
And the reason why this is so helpful is because it is again a way of reducing the number of bytes that you need to be pushing around to render a frame. And here you see that just in these three triangles, we go from nine vertices down to five. And if each vertex is defined to be 48 bytes in size following this formula on the bottom, then this takes us from 432 bytes down to 240.
Now, something that comes up once you start creating triangle strips is that each strip requires its own call to glDrawArrays. And yet we really want to reduce the number of calls to draw arrays and basically get all our drawing done from a big long array. Well, there's a technique that can help you with this called degenerate triangles. And the idea is that we'll actually add vertices, a copy of the last vertex of the first array and a copy of the first vertex of the next array.
And that will join these two separate strips together into one without changing what's actually rendered on screen. And it's because these degenerate triangles we've created won't be rasterized. The reason is this, the vertices ABC, BCD and CDE each define a valid triangle. But DEE, that's a line, it's not a triangle. So it won't be drawn if our array is defined to consist of triangles.
Likewise, EEF, EFF, FFG, all those are just more lines and won't be drawn. But then FGH, GHI and HIJ are valid triangles again and will be rendered normally. So this actually causes a net increase in the number of vertices and the number of bytes in the array, but it eliminates a call to glDrawArrays, which is a big deal.
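Spelled out as data, the stitched strip from that example looks roughly like this sketch (the positions are arbitrary placeholders; the point is that E and F each appear twice):

    // A, B, C, D, E from the first strip, E and F repeated to form the
    // degenerate triangles, then F, G, H, I, J from the second strip.
    static const GLfloat stitchedStrip[12][2] = {
        {0, 0}, {0, 1}, {1, 0}, {1, 1}, {2, 0},   // A B C D E
        {2, 0},                                   // E again (degenerate)
        {3, 0},                                   // F inserted ahead of the second strip
        {3, 0}, {3, 1}, {4, 0}, {4, 1}, {5, 0},   // F G H I J
    };
    glVertexPointer(2, GL_FLOAT, 0, stitchedStrip);
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 12);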
Then the third topic is indexed arrays. And again, this will be very dependent on your actual data set. But if you have data that ultimately defines a mesh, then the inner vertices within that mesh will generally be shared among a large number of triangles each. So in this example, the vertices in the middle each touch six different triangles.
Now, if I were to define this mesh as triangle strips, I'd have a partial strip containing the blue triangles, then some degenerate vertices, and then another partial strip with the purple triangles. But all the shared vertices in the middle would have to be duplicated in both of the partial strips.
So instead, if we create just one array containing all the vertices here, with each vertex only listed once in that array, and then a separate index array that defines the triangles or triangle strips, then over a large mesh, I can have a huge net savings. Here, I'm drawing a total of 16 triangles and only using 742 bytes to do it. And you remember on the previous slide with the degenerate triangles, that was only six triangles, and it took 576 bytes.
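A minimal sketch of that indexed approach: the shared vertices live once in the vertex array, and a separate index array names them per triangle (meshVertices is a hypothetical array of packed positions, and only the first few triangles are shown):

    // Three of the triangles that share a middle vertex, expressed as indices into the mesh.
    static const GLushort meshIndices[] = {
        0, 1, 5,   1, 6, 5,   1, 2, 6,
        // ...the remaining triangles of the mesh...
    };
    glVertexPointer(3, GL_FLOAT, 0, meshVertices);
    glDrawElements(GL_TRIANGLES, sizeof(meshIndices) / sizeof(meshIndices[0]),
                   GL_UNSIGNED_SHORT, meshIndices);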
And finally, the ultimate optimization generally comes from using vertex buffer objects. And as I mentioned in the devices section, these are a huge win on SGX because they're directly supported by the hardware. Now, on the MBX, we basically just treat VBOs as an indexed array since there isn't any special hardware for them.
So you don't see the same performance differential just from using them like you should on the SGX. But they don't hurt at all. So it's still considered a highly recommended practice. And bear in mind, even if you're tuning for the MBX, if your user just happens to have an SGX, they'll be able to take advantage of VBO hardware automatically from the same binary.
Now, the way VBOs work is to essentially take the vertex array API and add an object model to it. So, rather than you just providing OpenGL with a pointer to an array that you might be changing any time, you give OpenGL that array and then let it act as the gatekeeper for any changes. So, let's contrast VBOs with vertex or indexed arrays.
Any time you just use a vertex array or an indexed array, all the data in that array will be copied to the GPU every single frame. That's right: if you're not using VBOs, a copy of your vertex data will be made every frame and then discarded once it's been rendered. So, this is how all rendering happens on the MBX, and it's true for any non-VBO rendering on the SGX. And if you had wondered what copy index data was all about in Instruments, for example, this is it.
Now, there is a silver lining to that approach. It means that making changes to your vertex array really can happen without any penalty, since the implementation is just going to make its own copy of it anyways. But now, instead, if you use vertex buffer objects, then on the SGX, this data gets copied once and then reused in subsequent frames. This has a great effect on performance. The data is faster to access, and it's faster for you to draw a frame.
But the question becomes, how about making changes? Well, you do this by calling glMapBufferOES. And as I said, OpenGL acts as the gatekeeper for data once you define VBOs. But actually, since we have a unified memory architecture on the iPhone, there's really no extra overhead to a map buffer call. It's just giving you a pointer to the wired system memory that's being directly referenced by the GPU. So, we really, really recommend that you define your vertex data with a vertex buffer object.
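Here's a brief sketch of that: create and fill the buffer once, then map it when you need to change it (glMapBufferOES and glUnmapBufferOES come from the OES_mapbuffer extension, so query for it like any other extension; vertexData is a placeholder):

    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(vertexData), vertexData, GL_STATIC_DRAW);

    // Later, to modify the vertex data in place:
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    GLvoid *ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
    // ...write the updated vertices through ptr...
    glUnmapBufferOES(GL_ARRAY_BUFFER);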
Now, changing gears from vertex optimization to the organization of the arrays themselves. It's very typical for a 3D object to be defined by an array of vertices and a separate array of normals, another separate array of texture coordinates, and so on. This tends to be how you find data organized in some of the popular 3D file formats. It's also an easy way to print and understand sample code in a book and so on. So it's just a routine practice that's seen quite a lot.
But actually, it's not the most optimal way of submitting that data to the graphics processor. And the reason is because to load the attributes for one vertex requires fetches from all these separate arrays, which probably don't have good locality and probably aren't all in cache and so on.
What's better is if you can interleave each vertex attribute and construct a single array. So now when we go to fetch the data for a particular vertex, there's excellent locality, and the array is likely to wind up in cache as we walk through it. And this has nice benefits for both the MBX and SGX. Now, on the MBX, there's even an optimal order for the interleaved attributes. And it's vertices, then normals, then colors, then texture coordinates. On the SGX, the order doesn't matter.
Now, an exception for interleaving your vertex array is if you have a lot of dynamic attributes. Meaning you plan to run through one of the attribute lists every frame, let's say the texture coordinates, and update all of them. Well, this can be a lot of work since you have to move along the array with a particular stride. So in that case, we recommend that you separate out any dynamic attributes. Leave all the other attributes interleaved, but take the ones that you plan to update out of the interleaving, or any that you might share among multiple objects.
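Here's a sketch of what an interleaved layout can look like, using the MBX-preferred order and one shared stride (the vertices pointer is a hypothetical array of this struct):

    typedef struct {
        GLfloat position[3];
        GLfloat normal[3];
        GLubyte color[4];
        GLfloat texCoord[2];
    } InterleavedVertex;            // 36 bytes; every attribute starts on a 4-byte boundary

    const GLsizei stride = sizeof(InterleavedVertex);
    const InterleavedVertex *v = vertices;
    glVertexPointer(3, GL_FLOAT, stride, v->position);
    glNormalPointer(GL_FLOAT, stride, v->normal);
    glColorPointer(4, GL_UNSIGNED_BYTE, stride, v->color);
    glTexCoordPointer(2, GL_FLOAT, stride, v->texCoord);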
And then really getting now into some specifics. You remember in my example that I had defined a single vertex as being 48 bytes, and that was because I used floats for all my attributes. Well, you should actually try to use smaller data types if you can, like shorts for the vertices and normals, and perhaps just bytes to define the colors. It's common for colors to be defined from 0 to 255 per component, for example. And since texture coordinates frequently just run from 0 to 1, you can sometimes quantize them into bytes or at least use shorts.
So, here's a couple of examples. At first, I started out with 48 bytes per vertex with every attribute defined as a float, which are four bytes each. And then, in the middle example, I've reduced my color attributes to bytes, and my texture coordinates to shorts, and this cuts a single vertex down to 32 bytes, or two thirds the size of the original, and that's great.
Now, I'll go even further using shorts for the vertices and normals, bytes for the colors, and shorts for the texture coordinates. This takes me down to 20 bytes per vertex, which would be fantastic, but there's a problem with it that actually would ruin everything if we left it this way. And the attributes in that case are misaligned.
So here I'll just stack those vertices into an interleaved array to show the problem. The issue is that each attribute group must start at a 4-byte boundary. And you can see that the normals don't start at a 4-byte boundary in this case. They're starting at byte 6, and that throws everything off for the rest of the array.
So, if you were to do this, we actually have to correct it for you. The implementation will fall off the fast path, make a copy of your array, and pad it to fix your alignment problem. And remember, this will have to happen every single frame. So, don't leave this to the implementation to fix. Just when you choose the size of your attributes, make sure each attribute group will always start on a 4-byte boundary.
And this is nothing complicated. Instead of the vertices being defined as, say, short vertex[3], I'm saying to make it short vertex[4] and leave the fourth element unused. To basically add the padding that the driver would have had to do for you. And so, with this fixed, our size per vertex gets down to 24 bytes, which is half the original size. Half as many bytes to copy each frame for a vertex array or index array. And half as much memory bandwidth being used to submit this geometry to the graphics pipeline. So, it's a huge win.
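And that packed, properly aligned layout might look like this sketch: both the position and the normal are padded to four shorts so each attribute group lands on a 4-byte boundary, for 24 bytes per vertex (packedVertices is a hypothetical array):

    typedef struct {
        GLshort position[4];   // 8 bytes, fourth element is unused padding
        GLshort normal[4];     // 8 bytes, fourth element is unused padding
        GLubyte color[4];      // 4 bytes
        GLshort texCoord[2];   // 4 bytes  -> 24 bytes per vertex
    } PackedVertex;

    const PackedVertex *v = packedVertices;
    glVertexPointer(3, GL_SHORT, sizeof(PackedVertex), v->position);
    glNormalPointer(GL_SHORT, sizeof(PackedVertex), v->normal);
    glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(PackedVertex), v->color);
    glTexCoordPointer(2, GL_SHORT, sizeof(PackedVertex), v->texCoord);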
So let's summarize what we've covered for geometry. Long vertex arrays are the starting point, and you should strive to create just a few very long arrays rather than a lot of short arrays. Use vertex buffer objects. Even if your current test platform is the MBX, where you won't see the benefit, your users who happen to have a device with the SGX will.
Interleave your vertex data, but separate out any dynamic data from the interleave list. Use smaller types if you can, and you almost always can for colors and texture coordinates. And if you do use smaller types, make sure that the alignment is perfect. Ensure that each attribute group starts on a 4-byte boundary.
All right, so that brings us to the end of Part 2. We've covered a lot of ground. And I hope this series has been helpful for you in giving you a lot of platform-specific information that you can put into use right away. In the meantime, here's my contact information if you have questions about the content presented here, and a link to the iPhone Dev Center for documentation, sample code, and our developer forums. Thank you for watching.