Video hosted by Apple at devstreaming-cdn.apple.com


Tech Talk 111431

Boost your graphics performance with the M5 and A19 GPUs

2026 • iOS, iPadOS, macOS • 35:08

Take a deep dive into the advanced graphics features of the M5 and A19 GPUs. Learn how to greatly accelerate graphics workloads for games and improve pro workflows for apps. Plus, find out how to get the best performance possible by using brand-new profiling tools to tune your apps for M5 GPUs.

Speaker: Irfan Zaidi

Open in Apple Developer site

Transcript

Hi, I’m Irfan from Apple’s Graphics, Games, and Machine Learning software group. This year, Apple's new product line has made a significant leap in graphics performance. At the heart of these products is the new Apple Family 10 GPU architecture, found in the A19, A19 Pro, and M5 families of chips. Today, I'll show you how to leverage the new features.

Whether you're shipping a AAA game, developing a professional app, or pushing the limits of real-time rendering, these improvements are for you. I'll start with the architectural improvements in M5 GPUs that automatically improve the performance and power efficiency of your existing apps, giving players a better experience right out of the gate while setting you up for the next generation of apps.

Next, M5 GPUs are built for games and graphics, bringing the most complete GPU-driven pipeline yet. Extended indirect command buffers give the GPU full control over the entire rendering process, and the new graphics features unlock visuals and workflows that simply weren't possible before. And finally, getting the most out of M5 starts with understanding how your app is performing.

In this section, I'll cover new profiling tools that give you deeper visibility into GPU occupancy, along with best practices for maximum performance. Let's now look at each of these sections in more detail, starting with the updated architecture in the M5 GPU family of chips that makes it all possible.

M5 GPUs are the result of years of innovation. M1 brought Apple's GPUs to the Mac for the first time, delivering a unified memory architecture with the power efficiency that made A-series chips so successful on mobile. M2 scaled up with 25% more shader cores, larger caches, and higher memory bandwidth, delivering the headroom for more demanding math and memory workloads.

M3 introduced a new shader core with dynamic caching, hardware-accelerated ray tracing, and mesh shading. It was a major leap in both performance and efficiency. M4 brought faster cores, a 2x faster ray tracing engine, and higher memory bandwidth, helping developers deliver high-quality visuals in real time. M5 continues this momentum with significant advancements across multiple categories.

GPU neural accelerators bring AI and ML acceleration directly onto the GPU. If you're working with ML workloads, check out “Accelerate your machine learning workloads with M5 and A19 GPUs” for details. M5 also brings major graphics performance improvements across the board. Let’s take a look. M5 doubles FP16 and complex ALU execution speed, which means math-heavy shaders run significantly faster. Geometry throughput is also doubled, providing immediate performance gains for scenes with complex geometry. And with up to 30% more memory bandwidth, memory-bound workloads will see an immediate performance boost without any changes to your code.

M5 GPUs feature second-generation dynamic caching, providing faster on-chip register memory access. The occupancy management unit now monitors a broader set of signals to keep your shaders running efficiently. Compute shader texture writes are now compressed automatically, so your advanced post-processing effects get bandwidth savings without any changes to your code. And third-generation hardware ray tracing brings faster instance transforms, fully hardware-accelerated intersection function buffer indexing, and reduced memory alignment requirements. I'll walk you through each of these, starting with second-generation dynamic caching.

Building on M3's dynamic caching for efficient memory allocation, M5 delivers lower register access latency and higher cache access energy efficiency. Alongside this, a redesigned, smarter occupancy management unit maximizes shader core utilization while keeping data on chip. To understand these improvements, let me briefly recap how dynamic caching works; for more details, refer to “Explore GPU advancements in M3 and A17 Pro.” Without dynamic memory allocation, the maximum register usage in a SIMD group dictates the occupancy, that is, how many SIMD groups can run concurrently in the shader core.

With dynamic caching, on-chip registers and other shader-private memories are allocated dynamically from the L1 cache, backed by the entire memory hierarchy. This happens over the lifetime of the shader based on live register usage: as the register count goes down, resources are freed up for other work. This allows many more SIMD groups to run concurrently, increasing occupancy and ultimately performance.

A SIMD group may use a large amount of various on-chip memories. As the number of SIMD groups increases, there may come a point when your workload uses more memory than the available on-chip storage. This causes spills to the next cache levels, increasing cache evictions and memory stalls.

Occupancy management boosts performance by adjusting occupancy to keep shader resources in on-chip memory, thereby reducing memory stalls and increasing GPU utilization. The redesigned occupancy management unit in M5 incorporates a much broader set of real-time metrics, including stalls, cache residency, and evictions, allowing the hardware to maximize GPU utilization while minimizing resource conflicts more intelligently than ever.

It throttles occupancy for these reasons. If your shaders have too many live registers staying resident for too long, you get cache thrashing at high SIMD group counts; the GPU lowers occupancy to keep registers on chip. When your shaders' private memory, whether threadgroup or stack, exceeds the available on-chip storage, you get excessive eviction and data thrashing; the GPU lowers occupancy to reduce resource conflicts.

When memory requests start backing up in the last-level cache, MMU, or main memory, the GPU lowers occupancy to reduce the pressure. And when you're sampling textures faster than the GPU can decompress them, a common case with scattered access patterns, the GPU lowers occupancy to keep the stalls under control.

The new GPU profiling tools give you deeper insight into what's influencing occupancy; that will be covered in detail in the tune-your-app section later. So that's dynamic caching, working smartly behind the scenes. Now let me show you another feature that saves memory bandwidth without any changes to your app. With universal texture compression on M5, you get automatic bandwidth savings on texture reads and writes in render and blit encoders.

And for the first time, it works for compute, so you can have both the flexibility of compute and the power and performance benefits of texture compression. This builds on Apple silicon's existing texture compression. Here's how it worked before M5: for years, Apple GPUs have supported lossless block-based texture compression, completely transparent to your app.

Compression is enabled for render targets, texture sampling, and blit operations, reducing memory bandwidth during rendering. However, compression was disabled for textures with MTLTextureUsageShaderWrite, because the hardware could not compress single-pixel writes. This meant compute shaders were limited to uncompressed textures; with modern engines relying heavily on compute for post-processing, this became a significant limitation.

M5 changes that: compression now works with MTLTextureUsageShaderWrite. New hardware tracks pixel-level reads, writes, and block writes across the GPU, ensuring memory coherence: every read sees the correct data, even when multiple shader invocations write to the same texture. Modern engines using compute for advanced post-processing now get automatic bandwidth savings. And the best part is that no code changes are required. Now let's talk about ray tracing, where M5 delivers significant improvements.

M5 represents the third generation of hardware ray tracing, bringing meaningful improvements across performance, power efficiency, and developer flexibility. The key advancements made for ray tracing include faster hardware-accelerated instance transforms that deliver faster ray tracing while reducing power consumption, fully hardware-accelerated intersection function buffer indexing, and reduced memory alignment requirements for acceleration structures. Here's a closer look at each of these.

Instanced geometry is fundamental to modern 3D scenes; think of a forest with hundreds of trees or a city block with repeating architectural elements. In previous generations, processing these instances required substantial data movement between the ray tracing unit and the shader core, creating bottlenecks that affected both performance and power consumption. With M5, instance transforms are accelerated by dedicated hardware that delivers faster ray tracing while reducing power and freeing up shader resources for your rendering work.

To take full advantage of faster instance transforms, be intentional with your acceleration structure flags. For example, only specify extended limits for your acceleration structures when your content actually requires the increased capacity. Similarly, choosing the appropriate instance type, static versus motion, based on your actual content behavior will help you get the best results.
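As a sketch, that host-side intent might look like the following Swift, where `primitiveStructures` and the capacity numbers are assumptions for illustration, not the session's code:

```swift
import Metal

// Sketch: be deliberate about acceleration structure flags on M5.
func makeInstanceStructureDescriptor(instanceCount: Int,
                                     primitiveStructures: [MTLAccelerationStructure])
        -> MTLInstanceAccelerationStructureDescriptor {
    let descriptor = MTLInstanceAccelerationStructureDescriptor()
    descriptor.instancedAccelerationStructures = primitiveStructures
    descriptor.instanceCount = instanceCount

    // Prefer plain (static) instance descriptors unless the content
    // actually animates; motion instances cost more to traverse.
    descriptor.instanceDescriptorType = .default

    // Leave extended limits off unless the scene exceeds default
    // capacity, keeping the hardware on its fast path.
    descriptor.usage = []   // i.e. not .extendedLimits

    return descriptor
}
```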

Intersection function buffers already provide a flexible indexing mechanism for custom intersection functions, and this aligns with DirectX and Vulkan, making cross-platform development easier. While IFBs are supported across all hardware ray tracing GPUs, M5 takes this further by making the indexing scheme fully hardware accelerated. In various games, GPU time spent in intersection function buffer indexing dropped by up to 70% compared to emulation, giving you the performance you need for complex procedural geometry and custom intersection tests in your scenes. Watch “Go further with Metal 4 games” for more details. Next is alignment requirements for acceleration structures.

Memory alignment requirements for acceleration structures drop from 16 KB to just 1 KB. That eliminates hundreds of megabytes of padding overhead in scenes with many small objects. Previously, you'd often merge small structures by transforming geometry; that still works for static scenes, but for dynamic scenes, M5 offers a better path.

One-kilobyte alignment plus hardware-accelerated instance transforms lets you keep objects separate without sacrificing performance. To recap, M5 delivers significant performance gains: faster math, doubled geometry throughput, increased bandwidth, second-generation dynamic caching, third-generation ray tracing, and universal texture compression. Together, these enable more realistic visuals, faster rendering, and smoother gameplay in titles like Cyberpunk 2077, all without changing your code.

The hardware does more for you automatically so you can focus on your content. Now let's look at how to put that performance to work. Everything I've covered so far works automatically, but M5 also brings new features you can adopt. Let me show you how it's designed for games and graphics.

GPU-driven pipelines represent a fundamental shift in how you structure your rendering: moving work from the CPU to the GPU for better scalability and performance. M5 enables new advanced rendering techniques that can greatly improve your pipeline's efficiency. And games are not the only workloads that benefit, as M5 also accelerates many pro app workflows.

If you want better scalability and performance, GPU-driven pipelines are the way to go. Let me show you how, starting with a quick recap of how rendering traditionally worked. In a traditional CPU-driven pipeline, the CPU does most of the heavy lifting: it traverses the scene graph, performs visibility analysis, binds resources, and encodes draw commands for each visible object.

The GPU has to wait until the command buffer is submitted, and if the CPU needs that rendered data, you'll have to synchronize. This is why moving rendering decisions from the CPU to the GPU increases performance and efficiency. The first step in that direction is indirect command buffers: in Metal, indirect command buffers can be encoded from the GPU as well as the CPU.

This allows the GPU to handle visibility analysis and render encoding entirely in compute. This reduces CPU load, improving frame rate and responsiveness in CPU-bound apps, and less synchronization allows more efficient GPU usage. Metal 4 makes this easy. With GPU-driven pipelines in Metal 4, you aren't giving up any flexibility: using argument buffers, the GPU can access all your resources and use them directly while encoding indirect command buffers.
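A minimal host-side sketch of creating such a GPU-encodable ICB, assuming a `device` obtained elsewhere and an illustrative command capacity:

```swift
import Metal

// Sketch: an indirect command buffer a compute kernel can fill.
func makeIndirectCommandBuffer(device: MTLDevice) -> MTLIndirectCommandBuffer? {
    let desc = MTLIndirectCommandBufferDescriptor()
    desc.commandTypes = [.draw, .drawIndexed]  // command kinds the GPU may encode
    desc.inheritBuffers = false                // the GPU kernel binds buffers itself
    desc.inheritPipelineState = false          // pipeline state set per draw on the GPU
    desc.maxVertexBufferBindCount = 4
    desc.maxFragmentBufferBindCount = 4

    // Room for one command per potentially visible object.
    return device.makeIndirectCommandBuffer(descriptor: desc,
                                            maxCommandCount: 4096,
                                            options: [])
}
```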

Using argument buffers, GPU kernels can encode textures, material data, acceleration structures, and even set pipeline states per draw. Metal 4 makes this extremely easy to use in your shaders, and now M5 completes that story. You can now set device states, including depth and stencil state, depth bias, clip and cull modes, winding, and fill mode, all through the ICB encoding API in MSL.

Setting these states on the GPU is a natural extension of your existing encoding code. Here's some MSL showing a very simple argument buffer definition containing textures, buffers, and even pipeline states, all written as simple C-style structures, and below it, a very short encoding function that shows how easy it is to encode a draw command.
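The on-screen code isn't in the transcript, so here is an MSL sketch in the same spirit: a C-style argument buffer and a short encoding kernel. Names are illustrative, and the commented-out per-draw cull-mode call stands in for the new M5 device-state API, whose exact spelling is an assumption:

```metal
#include <metal_stdlib>
using namespace metal;

// Argument buffer as a plain C-style struct: textures, buffers,
// and even a pipeline state, all accessible while encoding.
struct DrawData {
    texture2d<float>       baseColor;
    constant float        *materialParams;
    render_pipeline_state  pipeline;
    uint                   vertexCount;
    bool                   doubleSided;
};

struct ICBContainer {
    command_buffer icb;   // the indirect command buffer to fill
};

kernel void encodeDraws(uint draw_id [[thread_position_in_grid]],
                        device ICBContainer &container [[buffer(0)]],
                        device DrawData     *draws     [[buffer(1)]])
{
    device DrawData &d = draws[draw_id];
    render_command cmd(container.icb, draw_id);

    cmd.set_render_pipeline_state(d.pipeline);  // per-draw pipeline state
    cmd.set_vertex_buffer(d.materialParams, 0);

    // New M5 extended ICB state (assumed API): encode cull mode per
    // draw so double-sided materials can live in the same ICB.
    // cmd.set_cull_mode(d.doubleSided ? cull_mode::none : cull_mode::back);

    cmd.draw_primitives(primitive_type::triangle, 0, d.vertexCount, 1, 0);
}
```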

It uses the new cull mode state to allow double-sided materials to be encoded. Let's see how these new ICB-encoded states can be used to optimize a shadow rendering path. Using Metal ICBs and argument buffers, shadow paths can be completely GPU-driven: the CPU just encodes a single kernel dispatch and an ICB execution.

Because we can encode render state changes, we can even mix and match different materials within a single ICB. Without device state support, you'd have to split the ICB and insert CPU state changes between executions. Using the new extended ICB API, depth bias and cull mode can be set per draw, allowing for high-quality shadows without compromising on flexibility or efficiency.

This is all possible with the new M5 GPUs. In summary, you dramatically reduce CPU overhead for large and dynamic scene graphs: the CPU isn't iterating over thousands of objects anymore, as the GPU handles that in parallel. Argument buffers enable full scene encoding in GPU space, and with the extended ICB API, complex work can be encoded more efficiently than ever.

You get increased frame overlap, improved performance, and greater efficiency. Let's move on to advanced graphics features. M5 enables new rendering techniques that can significantly improve your pipeline's efficiency. First, M5 enables efficient visibility buffer rendering for complex geometry. The first pass is lightweight, writing just primitive IDs and barycentric coordinates; this reduces parameter buffer usage and saves bandwidth compared to a G-buffer approach.

The second pass fetches vertex attributes and interpolates manually. M5 exposes non-interpolated vertex values directly to fragment shaders, letting you decouple visibility from shading and evaluate materials only where needed. M5 also introduces depth bounds testing for efficient fragment culling: you specify a min and max depth range, and any fragment outside that range is discarded before shading.

This benefits many rendering techniques. With deferred lighting, you shade only the pixels within a light's volume, and other techniques, like volumetric fog, also use depth bounds to limit work to relevant depth ranges. That's GPU-driven pipelines and the new graphics features. But games aren't the only workloads that benefit from M5; pro apps see major improvements too. Pro apps are about working with massive data while staying in the creative flow, and M5 makes demanding tasks feel responsive.

Neural accelerators run ML models in the same command buffer as your Metal workloads, making ML feel responsive in your apps. Universal texture compression now works for compute, providing full bandwidth savings for your heaviest assets. Second-generation dynamic caching optimizes occupancy on the fly for faster renders and better GPU utilization.

Third-generation ray tracing accelerates instance transforms, so thousands of animated instances update in real time without CPU bottlenecks. Beyond performance, M5 also improves rendering quality. M5 enables 8x multisampling for even smoother edge anti-aliasing. By leveraging the Apple GPU's TBDR architecture and memoryless textures, 8x MSAA is resolved entirely on chip. Combined with universal texture compression, you get the spatial accuracy of eight samples per pixel with the bandwidth profile of a single-sample renderer: massive savings that would be prohibitively expensive on other platforms.
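A minimal sketch of that on-chip 8x MSAA setup in Swift. It assumes an M5-class device (check `supportsTextureSampleCount(8)` before relying on it) and a `drawable` texture to resolve into:

```swift
import Metal

// Sketch: an 8x multisample color target that never leaves the tile.
func makeMSAAPass(device: MTLDevice, drawable: MTLTexture) -> MTLRenderPassDescriptor {
    let msaaDesc = MTLTextureDescriptor()
    msaaDesc.textureType = .type2DMultisample
    msaaDesc.pixelFormat = drawable.pixelFormat
    msaaDesc.width  = drawable.width
    msaaDesc.height = drawable.height
    msaaDesc.sampleCount = 8                 // 8x on M5-class GPUs
    msaaDesc.usage = .renderTarget
    msaaDesc.storageMode = .memoryless       // TBDR: samples stay on chip

    let msaaTexture = device.makeTexture(descriptor: msaaDesc)!

    let pass = MTLRenderPassDescriptor()
    pass.colorAttachments[0].texture = msaaTexture
    pass.colorAttachments[0].resolveTexture = drawable
    pass.colorAttachments[0].loadAction = .clear
    // Only the resolved single-sample result is written to memory.
    pass.colorAttachments[0].storeAction = .multisampleResolve
    return pass
}
```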

M5 increases maximum texture dimensions to 32K, supporting gigapixel textures and render targets. Modern productions are already pushing past 16K: LED volumes, projection domes, and 17K cinema cameras are here today, and 32K support helps future-proof content for tomorrow's displays. And with sparse textures and universal texture compression, M5 handles these massive assets efficiently without blowing through your memory budget.

To recap, M5 moves the heavy lifting from the CPU to the GPU. Metal's extended ICB APIs let the GPU handle everything from visibility to state encoding, so your CPU isn't the bottleneck anymore. Visibility buffers and depth bounds testing mean you only shade what actually matters, saving bandwidth and GPU cycles.

Pro apps benefit from faster execution: second-generation dynamic caching, GPU-accelerated ML, compute-friendly texture compression, and 32K texture support enable DaVinci Resolve to accelerate a wide variety of workloads, like AI video masking, with the performance it demands. To get the full benefit, you need to see what's happening on the GPU. That's where tuning comes in. In this final section, I'll show you how to tune your app using our updated profiling tools and best practices for universal texture compression.

M5 has a redesigned occupancy management unit with new performance counters for occupancy profiling. I'll show you an updated workflow to identify and fix low-occupancy issues using Xcode's new tooling. As we discussed earlier, compute shaders can now benefit from texture compression on M5, but there are cases where you might want to disable it; I'll show you how to identify those scenarios and make decisions based on performance data.

For a deeper dive into occupancy profiling fundamentals, check out “Discover new Metal profiling tools for M3 and A17 Pro.” Let me show you what's new for M5. Xcode 26.4 has a new set of performance counters and tools that can help you easily identify and address the cause of low occupancy in your workload and achieve great performance. Next, I'll show you an updated workflow to help you meet your performance targets by increasing occupancy using the new performance counters on M5. But before I show you the workflow, here's a quick overview of the key counters and what their values mean.

The occupancy target counter shows the upper bound on occupancy set by the occupancy management unit. At 100%, there is no throttling: the GPU isn't limiting occupancy, and your workload can use all available SIMD groups. Below 100%, the GPU is actively throttling, capping SIMD groups to keep shader data on chip and reduce cache thrashing.

Lower values indicate stronger throttling. If both the occupancy target and ALU utilization are low, that's a clear signal that increasing occupancy will boost performance. The next step is identifying the cause, and that's exactly what the new occupancy target influence counters reveal. You have seen the reasons why the M5 occupancy management unit throttles occupancy; now you can measure their influence on occupancy directly using the new occupancy target influence counters.

For the register pressure influence counter, non-zero values indicate the GPU is throttling occupancy because register caches are thrashing. If this counter is high and driving your occupancy low, reduce register usage in your active shaders; using GPU tools in Xcode, you can now see the live register count at each shader line, so use this to focus your optimization. For the L1 cache pressure influence counter, non-zero values show that the working set is too large to fit in L1 and is thrashing the cache.

If this counter is driving your occupancy low, analyze your workload's L1 memory bandwidth and residency counters to identify the cause: look for which L1-backed on-chip memory type is accessing L1 most frequently and has allocated the largest percentage of cache lines. For the memory request stalls influence counter, non-zero values show memory requests are stalling in the memory hierarchy, including the last-level cache, MMU, and main memory. If this counter is driving your occupancy target low, reduce random access patterns to large buffers, and use texture compression and mipmapping to reduce memory traffic.

For the texture decompression stalls influence counter, non-zero values indicate texture sampling is outpacing decompression: the GPU can't keep up with the number of sample requests. If this counter is driving your occupancy low, try interleaving ALU work between texture samples. You may also want to turn off compression for expensive textures.

Later in this session, I'll show you a workflow to confirm whether compression is helping or hurting. Now, let me show you how to use these counters in practice to investigate a low-occupancy issue in Xcode's Metal debugger, using a pre-profiled GPU trace.

Looking at the timeline tab, at the top is the encoder and shader track, showing the fragment shader taking a long time. Let me investigate why. In the bottom section are our performance counter tracks; looking at the occupancy track, I can see it's low throughout the duration of the shader.

The measured occupancy target is also low, so the GPU is actively throttling SIMD groups. I can also see that ALU utilization is low, so increasing occupancy should boost performance. To investigate, I want to see when occupancy started dropping and what influenced it most at that moment. Using the top occupancy target influences track, I can see it's L1 cache pressure.

Let me check the L1 bandwidth and residency counters. High stack bandwidth with high residency indicates the shader is spilling registers to the stack. Let me quickly check the shaders tab for the highest register count and stack spills for this shader. I will now check the shader cost graph with live register count enabled; I can see register counts in the left column.

I can see the total count of stack instructions for the entire fragment shader by hovering over the small pie chart in front of the fragment function definition. At the bottom of the pop-up menu, under instruction memory types, I can see that it exceeds 2,300 stack instructions per thread, indicating a lot of spills.

The calls to computation block ten have a high register count. Each call needs its own set of registers, and the compiler is keeping all of them alive at the same time, so the register count adds up. If the shader consistently requires more live registers than the GPU's maximum per-thread limit, the compiler runs out of registers and must spill to the stack; when those values are needed again, they must be filled back from the stack. The instruction pop-up shows stack instructions for spills and fills, confirming the source of L1 cache pressure. The fix is straightforward: reduce live registers so the compiler doesn't need to spill.

I've written an optimized version that puts these calls in a loop. This lets the compiler release registers after each iteration and reuse them for the next one, keeping the live register count low. The result: a huge drop in live register usage, and with fewer registers, no stack spills, which I can confirm by clicking on the shader tab and seeing that the register count per thread is much lower compared to the unoptimized version, with no stack spills.
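The transcript doesn't include the demo's shader, so here is an MSL sketch of the transformation it describes. `accumulate` is an assumed stand-in for the demo's heavy computation block:

```metal
#include <metal_stdlib>
using namespace metal;

// Assumed heavy helper standing in for the demo's "computation block".
float accumulate(uint pass, float2 uv);

// Before: ten back-to-back calls keep every intermediate value live at
// once, so the register count adds up and the compiler spills to stack:
//   float r = accumulate(0, uv) + accumulate(1, uv) + /* ... */ + accumulate(9, uv);

// After: a loop lets the compiler release each iteration's registers
// and reuse them for the next one, keeping the live count low.
float shadeOptimized(float2 uv) {
    float result = 0.0f;
    for (uint pass = 0; pass < 10; ++pass) {
        result += accumulate(pass, uv);  // registers recycled per iteration
    }
    return result;
}
```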

Checking the timeline tab, I can see the optimized fragment shader is running much faster compared to the unoptimized version. The occupancy target and occupancy are back up, and ALU utilization is much higher, resulting in overall higher performance. To recap: first, I spotted low occupancy and checked the occupancy target, which confirmed the GPU was throttling.

The influence counters pointed me to the cause, L1 cache pressure; the L1 bandwidth and residency counters identified stack memory as the issue, and the shader stats confirmed it with a high register count and stack spills. I fixed the source by reducing live registers, and the results were immediate: no spills, no throttling, and much higher utilization and performance. Now let's move to best practices for universal texture compression. Universal texture compression works by default on M5; that said, there are a few things to keep in mind to maximize performance.

As you saw earlier, universal texture compression brings automatic bandwidth savings and performance improvements to compute shaders, even when writing to textures. If compute fits your algorithm better than render, use it: with compute, you have access to threadgroup memory and finer control over your dispatch dimensions. When you call replaceRegion, you're using a CPU-side function that copies data directly into texture memory. It bypasses the GPU entirely, so the data never gets compressed; it just sits in memory uncompressed, and you miss out on bandwidth savings. Instead, use blit commands to populate your textures.

The GPU handles the transfer and compresses the data as it writes; from that point forward, every read from that texture benefits from reduced bandwidth. Using blit has always been good practice on Apple silicon, but on M5 it's especially important: with universal texture compression, more stages of your pipeline can benefit, but only if the data starts out compressed.
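A minimal sketch of the blit-based upload path, assuming the pixel data is already staged in an `MTLBuffer` and the texture uses a 4-byte pixel format:

```swift
import Metal

// Sketch: populate a texture through a blit so the GPU compresses the
// data on write, instead of texture.replace(region:...), which leaves
// the contents uncompressed in memory.
func upload(pixelData: MTLBuffer, to texture: MTLTexture, queue: MTLCommandQueue) {
    let bytesPerRow = texture.width * 4   // assumes a 4-byte-per-pixel format
    let commandBuffer = queue.makeCommandBuffer()!
    let blit = commandBuffer.makeBlitCommandEncoder()!
    blit.copy(from: pixelData,
              sourceOffset: 0,
              sourceBytesPerRow: bytesPerRow,
              sourceBytesPerImage: bytesPerRow * texture.height,
              sourceSize: MTLSize(width: texture.width, height: texture.height, depth: 1),
              to: texture,
              destinationSlice: 0,
              destinationLevel: 0,
              destinationOrigin: MTLOrigin(x: 0, y: 0, z: 0))
    blit.endEncoding()
    commandBuffer.commit()
}
```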

Now, there's one scenario where you might want to disable compression: scattered access patterns. Remember, compression works well when access is spatially coherent. The opposite is scattered access, with pixels spread randomly across the texture. With scattered access, you run into block overfetch: the GPU needs one pixel but has to fetch and decompress the entire block.

You end up fetching many blocks but only using a fraction of the pixels in each, so you're paying the decompression cost without getting the benefit. This applies to both reads and writes. Scattered reads cause overfetch; scattered writes are even more expensive, because the GPU has to read the block, decompress it, modify your pixels, recompress, and write it back.

If you have a texture with inherently scattered access, maybe a lookup table, a noise texture, or certain specialized compute effects, you can disable compression by setting allowGPUOptimizedContents to false on your texture descriptor. This tells the GPU to skip compression and optimize for random access instead. But don't guess: don't disable compression just because you think access might be scattered. Next, I'll show you how to identify exactly when this is happening so you can make a data-driven decision.
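As a sketch, opting a scattered-access texture out of compression looks like this in Swift; the lookup-table size and format are illustrative, and `device` is assumed to exist:

```swift
import Metal

// Sketch: disable compression for a texture with inherently scattered
// access (e.g. a lookup table), only after counters confirm the pattern.
let lutDesc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .r32Float,
                                                       width: 1024,
                                                       height: 1024,
                                                       mipmapped: false)
lutDesc.usage = [.shaderRead, .shaderWrite]
lutDesc.allowGPUOptimizedContents = false  // skip compression, favor random access
let lookupTable = device.makeTexture(descriptor: lutDesc)
```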

This counter measures the ratio of original uncompressed data to the actual compressed data read from memory. When it's higher than 1.0, you're saving bandwidth and compression is doing its job; a ratio of 2.0 means you're cutting memory traffic in half, which is where you want to be. When it's less than 1.0, the GPU is overfetching and wasting bandwidth.

The upcoming workflow shows how to diagnose performance issues due to scattered reads. This workload has a high degree of scattered reads when sampling high-resolution textures, a common pattern in volumetric fog techniques. Notice the occupancy target is much lower than 100%; that means the occupancy management unit is actively throttling occupancy. The new occupancy influence counters reveal the cause: texture decompression stalls.

The compression ratio of texture memory read track confirms this: the ratio is well below 1.0, meaning the GPU is fetching entire compressed blocks but only using a fraction of the pixels. That's wasted bandwidth. The fix is to disable compression for these textures by setting allowGPUOptimizedContents to false. Now let's look at the performance counter for diagnosing scattered writes.

This counter measures the inefficiency of block writes to compressed textures. At 0%, every write covers a full block, so the GPU just compresses the block and writes it back to memory. At 100%, no block is full, and every write triggers a read-modify-write cycle. Typically, the value will fall somewhere on this spectrum.

The higher it is, the lower the performance. If it is high, restructure writes to cover full blocks, or disable compression for those textures. The upcoming workflow shows how to diagnose performance issues due to scattered writes. This compute workload writes scattered pixels to a compressed texture instead of writing in contiguous blocks. You will see this pattern in techniques like sparse particle updates, decal splatting, or incremental texture painting.

The compressed texture write inefficiency counter reveals the problem: a value near 100% shows writes are highly scattered, and most of the blocks written are not full. Here the inefficiency is high, and compression is counterproductive. The fix is to restructure your writes to cover complete blocks, or disable compression for these textures.

All right, let's recap. M5 delivers 2x faster math, 2x geometry throughput, and 30% more bandwidth out of the box. Add second-generation dynamic caching, universal texture compression for compute, and third-generation ray tracing, and you are getting major gains without changing code. M5 is built for games and graphics: GPU-driven pipelines for scalability, visibility buffers and depth bounds testing for efficiency, plus 8x MSAA and 32K textures for pro-grade quality. And with new occupancy profiling tools and best practices, you can see exactly what's limiting performance and optimize with precision. Thank you for watching.