Real Life CUDA Programming, Part 4: Error Checking

Error checking is an important part of every program. We must be able to know when our operations fail, so that we can retry or, at the very least, log the problem for later analysis.

tl;dr

Error checking in CUDA must be done by hand. Fortunately, the Toolkit gives us useful functions to do just that.

Error checking in CUDA

Unfortunately for us, CUDA kernels run on the GPU, and for this concurrent (parallel) code there is no call stack through which errors can propagate back to us, as we’re used to having in C/C++ programs. A kernel returns no value and is launched asynchronously, so unless we explicitly ask, a failing kernel fails silently and we are none the wiser.
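To make the silent failure concrete, here is a minimal sketch. The kernel `fill` and the function `launchWithoutChecking` are hypothetical names made up for this illustration, and the sketch assumes a GPU whose per-block limit is 1024 threads, so that a launch with 2048 threads per block is rejected.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 writes 42 so the host can tell whether it ran.
__global__ void fill(int *out) {
    if (threadIdx.x == 0) {
        out[0] = 42;
    }
}

// Launches `fill` with an invalid configuration and deliberately checks nothing.
// Returns whatever the host reads back from the device buffer.
int launchWithoutChecking() {
    int *d_out = nullptr;
    int h_out = -1;

    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_out, 0, sizeof(int));   // device value starts at 0

    // Assumption: 2048 threads per block exceeds the device limit,
    // so the runtime refuses to launch the kernel.
    fill<<<1, 2048>>>(d_out);

    // Nothing is printed and nothing is thrown; the program simply continues.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // The buffer still holds the value written by cudaMemset, because the
    // kernel never executed.
    return h_out;
}
```

Calling `launchWithoutChecking()` from `main` and printing the result should show the untouched initial value rather than 42, with no indication that anything went wrong.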

The good news is that error checking in CUDA code is possible. The bad news is that it needs to be done by hand, and it is still lacking. Let me show you what I mean.

The runtime provides an error variable that is initially set to cudaSuccess and is overwritten every time there is an error in the CUDA code.

CUDA provides us with two functions for error checking: cudaPeekAtLastError and cudaGetLastError. The difference between the two is how they treat the error variable. The first only returns the error variable, while the second also resets it to cudaSuccess. Because most calls in the GPU context are asynchronous, an error raised while a kernel is running may not be recorded yet when we check. To make sure the variable is up to date, we must first call the synchronization function cudaDeviceSynchronize, which also returns an error code of its own.
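The difference between the two functions can be sketched as follows. The kernel `noop` and the function `peekVersusGet` are hypothetical names used only for this illustration, and the invalid launch again assumes a 1024-thread per-block limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel; only the launch itself matters here.
__global__ void noop() {}

// Returns true if the error variable behaves as described in the text.
bool peekVersusGet() {
    cudaGetLastError();                          // start from a clean error state

    // Assumption: 2048 threads per block exceeds the device limit.
    noop<<<1, 2048>>>();

    cudaError_t peeked = cudaPeekAtLastError();  // reads the error, leaves it set
    cudaError_t first  = cudaGetLastError();     // reads the same error, then resets it
    cudaError_t second = cudaGetLastError();     // the variable is back to cudaSuccess

    printf("peek:       %s\n", cudaGetErrorString(peeked));
    printf("first get:  %s\n", cudaGetErrorString(first));
    printf("second get: %s\n", cudaGetErrorString(second));

    // A valid launch: check the launch itself, then synchronize so that any
    // error raised while the kernel was running is reported as well.
    noop<<<1, 32>>>();
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr   = cudaDeviceSynchronize();

    return peeked == cudaErrorInvalidConfiguration
        && first == peeked
        && second == cudaSuccess
        && launchErr == cudaSuccess
        && syncErr == cudaSuccess;
}
```

Note that this reset behaviour applies to recoverable errors such as a bad launch configuration. Errors that corrupt the CUDA context, for example an illegal memory access inside a kernel, keep being reported by later calls.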

Tips and Tricks

There are two main tips to follow when using the CUDA error checking methods.

This function can be used to wrap any call that returns a CUDA error code (cudaError_t). For example:
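A wrapper of this kind might look like the following minimal sketch. The names `checkCuda`, `CUDA_CHECK`, `scale`, and `scaleOnGpu` are assumptions made for illustration and are not necessarily the ones the author used. The macro exists only to capture the file and line of the failing call.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: reports a failed CUDA call together with its location.
// Returns true on success; on failure it logs and, by default, exits.
inline bool checkCuda(cudaError_t code, const char *file, int line,
                      bool abortOnError = true) {
    if (code == cudaSuccess) {
        return true;
    }
    fprintf(stderr, "CUDA error: %s (%s:%d)\n",
            cudaGetErrorString(code), file, line);
    if (abortOnError) {
        exit(EXIT_FAILURE);
    }
    return false;
}

// Hypothetical macro that fills in the call site automatically.
#define CUDA_CHECK(call) checkCuda((call), __FILE__, __LINE__)

// Multiplies each of the n elements by `factor`.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Doubles an array of ones on the GPU and returns the first element.
float scaleOnGpu() {
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) {
        host[i] = 1.0f;
    }

    float *dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice));

    scale<<<(n + 127) / 128, 128>>>(dev, 2.0f, n);
    CUDA_CHECK(cudaGetLastError());        // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during execution

    CUDA_CHECK(cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));
    return host[0];
}
```

A kernel launch returns nothing, so it cannot be wrapped directly. Instead, the sketch wraps cudaGetLastError right after the launch and cudaDeviceSynchronize after that. The synchronization stalls the host, so in performance-sensitive code it may be worth enabling that second check only in debug builds.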

Conclusion

Error checking is a critical part of any program. While error checking in CUDA is not as straightforward as in most programs, it is not complicated in and of itself; it just requires more manual work than usual.

Using the error checking mechanisms that are included in the CUDA Toolkit gives us the ability to successfully monitor and manage our errors, even in the GPU code.

Good Luck!
