Debug GPU-side errors in Metal - WWDC 2020

Graphics and Games • iOS, macOS, tvOS • 20:56

Track down even the trickiest GPU-side programming errors with enhanced reporting in Xcode 12. While Metal’s API validation layer can catch most problems in a project, GPU errors can cause a host of difficult-to-debug issues. Get an introduction to GPU-side errors and learn how to find and eliminate problems like visual corruption, infinite loop timeouts, out of bounds memory accesses, nil resource access, or invalid resource residency with Xcode 12. Discover how to enable enhanced command buffer error reporting and shader validation, use them effectively as part of your debugging strategy, and automate them in your production pipeline.

Speaker: Michael Harris

Transcript

Hello and welcome to WWDC. Hi. I'm Michael Harris, and I'm a GPU software engineer at Apple. Today I'd like to talk about the improvements we've made to Metal's debugging tools, specifically for errors in GPU-side Metal shader code. So what are a few examples of errors that we can make in Metal shader code? You could have an out of bounds access in global or shared memory. You could attempt to access a null texture resource.

Or you might have forgotten to call useResource when using argument buffers, resulting in invalid resource residency. You may have a timeout, which can be caused by long-running or infinite loops. This isn't an exhaustive list, but it's some of the more common errors we Metal developers experience. These errors can often cause one another. An infinite loop may be caused by an out of bounds access of the loop iteration count. The result of a GPU-side error is a message like this one. In comparison, here's what we get from an API usage error on the CPU.

Let's compare and contrast these two errors because there's a pretty large gap in useful information. For a GPU error, all we get is a message that says something went wrong, but not much about what or where. But when there's an API usage error, Metal provides a lot of useful information. It shows the API entry point the error occurred on. It shows what type of error it was. In this case, we set an offset larger than the buffer's length.

There's also a call stack of exactly where the error occurred, including line and file information from your codebase. Wouldn't it be nice if the GPU errors looked a bit more like the API errors? Today, we'll show you some new tools to help improve the debugging experience of GPU errors. To help illustrate where our new tools fit in, we'll use a debugging workflow of detect, locate, classify and fix.

Metal has always had API validation to help you catch issues early. Finding them early means that they're caught before they can cause problems further down the line. Using it, you can detect when there's API misuse, locate the function causing the issue, and classify the error message. We'll leave fixing the error up to you.

But what about errors in your Metal shading language usage on the GPU? You've seen how these errors appeared on iOS 13 and macOS Catalina. Metal provides a basic error message. It's enough to tell you something happened in the execution of that command buffer, but not much else. So today, Metal is introducing two new diagnostic tools that will help improve the debug workflow: enhanced command buffer errors and the shader validation layer. First, let's talk about enhanced command buffer errors. What is it? Well, it enhances your command buffer's errors. To be more specific, it improves the existing command buffer error mechanism by helping you detect and locate execution errors at the encoder level.

Here's that GPU error again. There isn't a lot of actionable information here. When you're debugging a command buffer that might have hundreds of encoders, making progress is a lot of work. Here's that same error, but this time we've turned on enhanced command buffer errors. It's an obvious improvement over what we had before. You have information about each encoder within this command buffer, and that helps you narrow down the failure.

Most of our encoders completed their work. But there are a few suspect encoders that have been marked as affected or faulted. That narrows our search down significantly. Enabling enhanced command buffer errors is simple. All you have to do is create your command buffer with the new descriptor-based API and set the errorOptions to encoderExecutionStatus.
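The setup described above can be sketched in Swift as follows. This is a minimal sketch, not the session's own listing: the helper name `makeDiagnosticCommandBuffer` is hypothetical, and the availability annotation assumes the descriptor-based API shipped with macOS 11, iOS 14 and tvOS 14.

```swift
import Metal

/// Hypothetical helper: creates a command buffer that records
/// per-encoder execution status for error reporting.
@available(macOS 11.0, iOS 14.0, tvOS 14.0, *)
func makeDiagnosticCommandBuffer(on queue: MTLCommandQueue) -> MTLCommandBuffer? {
    let descriptor = MTLCommandBufferDescriptor()
    // Opt in to enhanced command buffer errors for this command buffer only.
    descriptor.errorOptions = .encoderExecutionStatus
    return queue.makeCommandBuffer(descriptor: descriptor)
}
```

Because the option lives on the descriptor, it can be applied to some command buffers and not others.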

That's it. If an error occurs while the feature is enabled, you can get encoder-level information about that error. Here in our code example, we're using the encoder info error key to access the user info data of this error. This is where we'll find the array of our encoder info objects to iterate over.

As you can see here, each encoder info object has the label and debugSignposts that you're already using to uniquely identify each command encoder. If you're not already using labels and signposts, now's a great time to start. The error state tells you the status of the command encoder at the time of the fault.

Or, alternatively, if you don't want to format it yourself, you could just log the whole error. That will print all the information related to the error. The ErrorState property has a few possible values: completed, pending, faulted, affected and unknown. Faulted is the most important error state because it means that this encoder was directly responsible for the command buffer fault.
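Reading the encoder info back might look roughly like the sketch below. The function names `describe` and `logEncoderStatus` are hypothetical, and the output format is only an illustration. The sketch assumes the command buffer was created with the encoderExecutionStatus error option and has already finished.

```swift
import Foundation
import Metal

/// Hypothetical helper: turns an encoder error state into a readable name.
@available(macOS 11.0, iOS 14.0, tvOS 14.0, *)
func describe(_ state: MTLCommandEncoderErrorState) -> String {
    switch state {
    case .completed: return "completed"
    case .pending: return "pending"
    case .faulted: return "faulted"
    case .affected: return "affected"
    case .unknown: return "unknown"
    @unknown default: return "unrecognized"
    }
}

/// Hypothetical helper: prints one line per encoder of a failed command buffer.
@available(macOS 11.0, iOS 14.0, tvOS 14.0, *)
func logEncoderStatus(of commandBuffer: MTLCommandBuffer) {
    // Nothing to report if the command buffer finished without an error.
    guard let error = commandBuffer.error as NSError? else { return }
    guard let infos = error.userInfo[MTLCommandBufferEncoderInfoErrorKey]
            as? [MTLCommandBufferEncoderInfo] else {
        // No encoder-level data available: fall back to logging the whole error.
        print(error)
        return
    }
    // Entries arrive in recorded order, one per encoder.
    for info in infos {
        let signposts = info.debugSignposts.joined(separator: ", ")
        print("\(info.label) [\(signposts)]: \(describe(info.errorState))")
    }
}
```

Calling `logEncoderStatus(of:)` from a completion handler keeps the check close to where the command buffer finishes.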

Affected could still indicate the faulting encoder, but unlike the faulted state, we're not a hundred percent sure. A fault on one encoder could have affected multiple encoders that are running in parallel, including encoders from different processes. In the rare event where we can't tell the state of an encoder, we'll report the unknown state.

There's synergy with the existing GPU tools as well. Since the encoder info objects are in recorded order and use your labels and debug signposts, you can easily associate them with the same encoder in Metal Debugger, Metal System Trace, and other tools built into Xcode. For example, with this information, you can jump right to the relevant encoder in Metal Debugger. So, when should you turn it on? First off, you should enable it on every command buffer during development and QA. That will enhance all your internal error reports and give you quick feedback on any errors.

Since enhanced command buffer errors are built right into Metal, the feature doesn't require any auxiliary layers. It's designed so that the API can leverage hardware functionality in its implementation, which makes it a low-overhead feature. Because it is so low-overhead, you can even ship your application with the feature enabled. Since it's command buffer-specific, you can target which command buffers to enable it on. As you get telemetry and bug reports, you can tune the set of command buffers to home in on the problem.

That said, test your performance before enabling on user devices. The performance impact varies across devices and workloads, so you'll want to check whether the overhead is acceptable to you. The challenge with debugging Metal shader code is that the codebase can be large and can contain a lot of places for errors to occur.

The first step is knowing where to look, and enhanced command buffer errors helps with that. It can get us to the encoder level, but to go deeper, Metal provides another tool. That's where shader validation comes in: to detect, locate and classify the error at the draw call level to help you debug and fix it.

So let's talk about the shader validation layer and what it can do for you. It's a layer similar to the API validation layer, but running on the GPU. It instruments your Metal shaders to detect logical issues as well as locate and classify them. When it detects that an operation would've caused undefined behavior, that operation is prevented and a log is created that can be used to locate the draw call, Metal function, possibly even the line in the shader causing the error. This tool can help you debug issues that cause command buffer errors, and it can help you detect ones that don't. This is important because there are many types of errors that don't actually cause a command buffer to fail but are still undefined behavior.

Let's walk through one of those cases now. We'll start by allocating two buffers, A and B. We want to read from buffer A but have a logic issue such that it causes us to read out of bounds. What happens next is undefined and depends on Metal's allocation behavior.

You could get lucky, and there's unallocated memory in between the two buffers. If you go out of bounds in this case, you can get a command buffer fault. Since there's a fault, it's obvious feedback that something bad has happened, and enhanced command buffer errors will narrow it down to the encoder. But if you're unlucky, out of bounds access won't cause a command buffer fault. Metal may place the buffers one after another in virtual memory with no unallocated space between.

Here, our logic error won't cause a command buffer fault. We still go out of bounds but end up landing in another allocation and either read the wrong data or corrupt another buffer. Such issues can be hard to detect and frustrating to debug, as they may appear intermittent. The most important thing to take from that example is that you should always test with API validation and shader validation before shipping. Just because you're not seeing a command buffer fault does not rule out undefined behavior. Undefined behavior isn't always obvious, and it can appear intermittent.
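The unlucky case can be modeled on the CPU with a toy example. This is only an illustration under stated assumptions: one flat Swift array stands in for GPU virtual memory, and the names `memory`, `bufferABase`, `bufferALength` and `read` are hypothetical. Real Metal allocation behavior is not specified here.

```swift
// Toy model: elements 0...3 are "buffer A", elements 4...7 are "buffer B",
// placed back to back with no unallocated gap between them.
let memory: [Float] = [1, 2, 3, 4, 99, 99, 99, 99]
let bufferABase = 0
let bufferALength = 4

/// Reads relative to a buffer's base address without any bounds check,
/// which is how the faulty shader access behaves in this model.
func read(_ memory: [Float], base: Int, index: Int) -> Float {
    return memory[base + index]
}

// Index 5 is out of bounds for buffer A, but nothing faults:
// the read silently lands in buffer B and returns B's data.
let badIndex = 5
let value = read(memory, base: bufferABase, index: badIndex)
let wasOutOfBounds = badIndex >= bufferALength
```

In this model `value` comes from buffer B, so the result is wrong even though no error was raised.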

But the good news is that shader validation is meant to detect these cases, including the ones that aren't obvious. Let's go over what shader validation can and cannot detect. It can detect out of bounds device and constant memory access, out of bounds threadgroup memory access, and attempting to use texturing functions on a null texture object.

This doesn't cover all of the common issues mentioned, but for everything else, enhanced command buffer errors can help. You won't get draw information, but it will narrow down to the encoder. The most powerful way to use this feature when debugging is from within Xcode, and enabling it is easy. First, bring up the scheme settings in your project. In the Diagnostics tab, we have a new section for diagnostics specific to Metal.

Checking the box next to Shader Validation will enable the layer and enhanced command buffer errors for all command buffers. Once the layer is enabled, you still need to enable the Metal diagnostics breakpoint. The Metal diagnostics breakpoint tells Xcode to stop the execution of the program when a shader validation error occurs and to show the recorded GPU and CPU backtrace for that error. Clicking the arrow to the right of shader validation will add the breakpoint. Once the breakpoint has been added, you can find it in the Debug Navigator on the Breakpoints tab. You can view the settings of this breakpoint by clicking on the blue arrow.

That will bring up this interface, where you can customize the breakpoint. To configure the breakpoint for shader validation, first make sure the breakpoint is enabled. Then set the Type to System Frameworks and enter "Metal Diagnostics" into the Category field. At this point, you're ready to use the feature within Xcode.

Now let's jump into a demo showing it in action. We're using the Metal Performance Shader Ray Tracing sample code. We've introduced an easy-to-make GPU error into the sample for this demo. During the demo, we'll go through using shader validation to detect and debug this issue. First, I'll start by launching the app without shader validation.

That doesn't look quite right. There are missing shadows and a bunch of lines on the screen. Why this isn't rendering correctly isn't obvious, though. We're not getting any command buffer errors, so we don't know which encoder or Metal function has the bug. Before we start trying to debug this line by line, I'll use Metal's new debugging workflow by enabling shader validation. First, I'll bring up the scheme settings in my project.

And then I'll go into the Diagnostics tab and then down at the bottom, there'll be options for API validation and shader validation. The API validation has been moved from a different tab to this one. Now I'll enable API validation and shader validation. Since I want to have Xcode break on the first validation error, I'll click this arrow to add the Metal diagnostics breakpoint. Now I'm set up to use shader validation, and I'll relaunch the application.

So we have some logs being printed in the console from shader validation indicating that it detected an error. Since I have the breakpoint enabled, Xcode has stopped my application and brought up the Metal shader where the error occurred. Xcode is also showing a shader annotation on the line that shader validation found had an error. I can click the shader annotation, and it'll show some more details about this error. Based on the annotation, I'm hitting an out of bounds memory read.

Looking at this expression, there's only one memory access going on. We're reading the maxDistance field from the shadowRay argument, which is a pointer in device memory. There are two possibilities here: either shadowRay is null or shadowRay points to invalid memory. Since we enabled API validation, that would've caught a null buffer binding, so we can rule that one out. Just looking at the function here, it's not clear how the address of shadowRay is being calculated.

So we'll use the GPU backtrace view in the bottom left-hand side of Xcode. This view shows the GPU backtrace of the error, which has the recorded call stack of the error at the time the error occurred. We can traverse this call stack just like you would any other recorded call stack. I'll click on the stack below our function, which will jump me to the call site of the function shadowRayIntersection.

It looks like the variable shadowRay is what's being passed in, which is computed by taking the shadowRays argument and indexing it using the rayIndex variable. Since we suspect an invalid offset, we need to investigate rayIndex. Looking at the comment above the computation of rayIndex, its code intends to convert a 2D grid coordinate into a 1D array coordinate.

That's typically accomplished by multiplying the grid Y with the grid width and then adding the grid X. However, looking closely at this expression, we see that instead of multiplying the grid Y and the width, we're multiplying the X and the width. That's definitely a typo, so let's correct that and rerun the application.
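The typo is easy to reproduce outside the shader. The sketch below restates the index arithmetic in Swift rather than in the Metal shading language the sample uses. The function names and the 4 by 3 grid are assumptions chosen for illustration.

```swift
/// Buggy version from the demo: multiplies X by the width instead of Y.
func buggyRayIndex(x: Int, y: Int, width: Int) -> Int {
    return x * width + x
}

/// Corrected version: row-major index of a 2D grid coordinate.
func rayIndex(x: Int, y: Int, width: Int) -> Int {
    return y * width + x
}

// Assumed 4 x 3 grid, so the ray buffer holds 12 elements (indices 0...11).
let width = 4
let height = 3
let rayCount = width * height

// At the right edge of the first row the buggy index is already past the
// end of the buffer, while the correct index stays inside it.
let buggy = buggyRayIndex(x: 3, y: 0, width: width)   // 3 * 4 + 3 = 15
let fixed = rayIndex(x: 3, y: 0, width: width)        // 0 * 4 + 3 = 3
```

With these numbers the buggy index is 15 against a buffer of 12 elements, which is the kind of out of bounds read shader validation reported.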

Now our app is fixed. With the help of shader validation and API validation, we were able to quickly locate and classify this issue. We realize you're not always able to run everything under Xcode, so with some additional setup, you can use shader validation without Xcode. That lets you use shader validation for use cases like automated testing. Similar to API validation, shader validation can be enabled using two new environment variables we've added to the new macOS and iOS 14.

These variables must be set before any Metal device is created for that process. Once a device is created, we latch their values, so any changes to them after that point will not have an effect. To enable API validation, set MTL_DEBUG_LAYER to any non-zero value. And to enable shader validation, set MTL_SHADER_VALIDATION to any non-zero value.
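For automated testing on macOS, one way to meet the "set before any device is created" requirement is to launch the test binary as a child process with the variables already in its environment. This is a sketch under that assumption. The helper names are hypothetical, and only the two variable names come from the session.

```swift
import Foundation

/// Hypothetical helper: returns a copy of `base` with both validation
/// layers switched on. Any non-zero value enables them; "1" is used here.
func validationEnvironment(base: [String: String]) -> [String: String] {
    var environment = base
    environment["MTL_DEBUG_LAYER"] = "1"        // API validation
    environment["MTL_SHADER_VALIDATION"] = "1"  // shader validation
    return environment
}

#if os(macOS)
/// Hypothetical helper: starts `executable` with validation enabled, so the
/// variables are in place before the child creates its first Metal device.
func launchWithValidation(executable: URL, arguments: [String]) throws -> Process {
    let process = Process()
    process.executableURL = executable
    process.arguments = arguments
    process.environment = validationEnvironment(base: ProcessInfo.processInfo.environment)
    try process.run()
    return process
}
#endif
```

Setting the variables in the parent's environment avoids any question about whether the child has already latched their values.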

Both of these can be set at once or used independently. The command buffer now has a new logs property which allows you to retrieve the details for any validation errors that occurred. The first thing to note is that the logs property is only valid after a command buffer finishes. For that reason, we're doing all of our work inside the completion handler. We'll walk through this code sample, showing how to use the new API and what information it provides. Each command buffer can have multiple shader validation errors.

So we're gonna iterate through all of them. Every log object contains information about the shader validation error, such as the label of the encoder that had the error. But there can be more information. If your Metal library was compiled from source or was compiled with debug symbols, each log may also have a debugLocation property. This property is the GPU stack frame containing the error, and it will hold the file URL and line of the faulting expression.
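The walkthrough above refers to a code sample shown on screen. A rough Swift equivalent might look like the sketch below. The function name `reportValidationLogs` and the printed format are hypothetical. The sketch iterates the logs with `NSFastEnumerationIterator`, an assumption made to stay conservative about how the log container bridges to Swift.

```swift
import Foundation
import Metal

/// Hypothetical helper: prints every shader validation log once the command
/// buffer has finished. Call it before committing the command buffer.
@available(macOS 11.0, iOS 14.0, tvOS 14.0, *)
func reportValidationLogs(for commandBuffer: MTLCommandBuffer) {
    commandBuffer.addCompletedHandler { finished in
        // The logs property is only valid once the command buffer has finished.
        var iterator = NSFastEnumerationIterator(finished.logs)
        while let element = iterator.next() {
            guard let log = element as? MTLFunctionLog else { continue }
            print("Encoder: \(log.encoderLabel ?? "unlabeled")")
            // debugLocation is only present when debug information is available.
            if let location = log.debugLocation {
                let file = location.url?.lastPathComponent ?? "unknown file"
                print("  at \(file):\(location.line)")
            }
        }
    }
}
```

A command buffer with no validation errors prints nothing.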

Alternatively, you could just use the description property. This contains all the same information formatted in an easy-to-read string. You'll also be able to find this information in the system log. You can access this log by running this highlighted command in your Terminal. When a validation error occurs, it'll show up like this. The first thing in the log is the process name the error is occurring from. The next will be the type of error and then the error details. Finally, the name of the Metal file and the line information.

We have some tips to help you get the most out of shader validation. You can expect pipelines to take a bit longer to compile. Because of that, you should really be using the asynchronous compilation methods. That will parallelize compilation across multiple threads, which will help mitigate the increased load times during development.
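As one example of the asynchronous compilation methods, a compute pipeline can be requested with a completion handler instead of the blocking call. The wrapper name `compilePipelineAsync` is hypothetical, and error handling is reduced to a print for brevity.

```swift
import Metal

/// Hypothetical wrapper: compiles a compute pipeline without blocking the
/// calling thread and hands the result (or nil on failure) to `completion`.
func compilePipelineAsync(device: MTLDevice,
                          function: MTLFunction,
                          completion: @escaping (MTLComputePipelineState?) -> Void) {
    device.makeComputePipelineState(function: function) { pipeline, error in
        if let error = error {
            print("Pipeline compilation failed: \(error)")
        }
        completion(pipeline)
    }
}
```

Requesting several pipelines this way before waiting on any of them lets their compilation overlap.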

You should also enable debug symbols when compiling your Metal libraries. That should automatically happen if you're using a debug scheme in Xcode. But if you are invoking the Metal front-end manually, symbols can be enabled by adding the "-g" flag. If any of your libraries are compiled from source online, debug symbols will automatically be enabled.

If you are compiling libraries online, we recommend using the line preprocessor directive. The backtrace we report uses the file name to identify a shader. Offline-compiled metallib files include this information automatically, but it's missing when compiling from source at runtime. You can manually add the file name information by using the line directive to tell the compiler what file it was sourced from or to provide a useful identifier. Due to the nature of its instrumentation, there are a few things to be aware of when enabling shader validation.
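A runtime-compiled shader with a line directive could look like the sketch below. The file name "FillBuffer.metal", the kernel and the helper `makeRuntimeLibrary` are hypothetical and exist only to show where the directive goes.

```swift
import Metal

// Shader source compiled at runtime. The first line tells the compiler to
// treat the following lines as starting at line 1 of "FillBuffer.metal",
// so reported locations carry a recognizable file name.
let shaderSource = """
#line 1 "FillBuffer.metal"
#include <metal_stdlib>
using namespace metal;

kernel void fill(device float *out [[buffer(0)]],
                 uint id [[thread_position_in_grid]]) {
    out[id] = 1.0f;
}
"""

/// Hypothetical helper: compiles the source above into a library.
func makeRuntimeLibrary(device: MTLDevice) throws -> MTLLibrary {
    return try device.makeLibrary(source: shaderSource, options: nil)
}
```

Any identifier that helps you find the source works as the file name, since the directive is only a label for the compiler.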

Shader validation is a process-wide switch that when enabled causes all Metal commands, including UI rendering, to go through the shader validation layer. Unlike enhanced command buffer errors, using shader validation does have a high performance and memory impact. We recommend enabling this feature in development and during QA but not for users because of this impact.

Enabling the feature may also change some queries to return different values. In particular, you should always check the maxTotalThreadsPerThreadgroup and the threadExecutionWidth properties of a compute pipeline state, as these two may change when shader validation is enabled. We support some level of customization on how this feature behaves, such as disabling specific checks.
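One way to follow that advice is to derive the threadgroup size from the pipeline state at runtime instead of hard-coding it. The helper name `threadgroupSize(for:)` and the 2D layout are assumptions for illustration.

```swift
import Metal

/// Hypothetical helper: picks a 2D threadgroup size from the limits the
/// pipeline reports, so it stays valid if shader validation changes them.
func threadgroupSize(for pipeline: MTLComputePipelineState) -> MTLSize {
    let width = pipeline.threadExecutionWidth
    // Use as many rows as the per-threadgroup limit allows, but at least one.
    let height = max(1, pipeline.maxTotalThreadsPerThreadgroup / width)
    return MTLSize(width: width, height: height, depth: 1)
}
```

Since both values are read from the pipeline state on each call, the same code works with and without shader validation enabled.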

For example, if you're already doing null texture checks, you can safely disable texture usage instrumentation by setting the environment variable MTL_SHADER_VALIDATION_TEXTURE_USAGE to zero. While disabling some instrumentation can improve runtime and compile time performance, it's at the cost of no longer detecting some possible issues. More information about what flags are supported can be found at the new MetalValidation man page.

Some features are not supported when using shader validation. Binary function pointers and dynamic linking are not supported. There's an additional limitation for MTLGPUFamilyMac1, as well as MTLGPUFamilyApple5 and older devices, which is that global memory accesses through pointers coming from an argument buffer are not checked. Thank you very much for coming to our session about the two new Metal debugging tools we've added this year.

First, we covered enhanced command buffer errors, which is a low-overhead in-framework tool that helps you detect and locate your faulting encoders in multiple environments, like during development and QA, or even after you've shipped. And we just covered shader validation, which helps you detect, locate and classify both subtle and obvious shader errors during development and QA. Now go out and try the features. Test your apps with enhanced command buffer errors and shader validation. Thanks, and have a great WWDC.