Error checking is an important part of every program. We must be able to know when our operations failed, so that we can retry or at the very least log the problem for later analysis.
Error checking in CUDA must be done by hand; fortunately, the Toolkit gives us useful methods to do just that.
Error checking in CUDA
Unfortunately for us, CUDA code runs on the GPU, so for the concurrent (parallel) code there is no call stack to receive errors from, like we're used to having in C/C++ programs. Instead, the kernel code will fail silently, and we will be none the wiser.
The good news is that error checking in CUDA code is possible; the bad news is that it must be done by hand and is still somewhat lacking. Let me show you what I mean.
The runtime provides an error variable that is initially set to cudaSuccess and is overwritten every time there is an error in the CUDA code. CUDA provides us with two functions for reading it: cudaPeekAtLastError and cudaGetLastError. The difference between the two is how they treat the error variable: the first only returns it, while the latter also resets it back to cudaSuccess. To make sure the variable is up to date after asynchronous calls (most calls in the GPU context), we must first call the synchronization method cudaDeviceSynchronize.
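As a sketch of how these pieces fit together (the kernel and launch configuration here are placeholders of my own, not from the Toolkit):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;  // placeholder work
}

int main() {
    int *d_data;
    cudaMalloc(&d_data, 256 * sizeof(int));

    dummyKernel<<<1, 256>>>(d_data);

    // Catch launch-time errors (bad configuration, etc.)
    // without resetting the error variable.
    cudaError_t err = cudaPeekAtLastError();
    if (err != cudaSuccess)
        printf("Launch error: %s\n", cudaGetErrorString(err));

    // Kernel launches are asynchronous: synchronize first so that
    // errors raised during execution become visible, then read
    // (and reset) the error variable.
    cudaDeviceSynchronize();
    err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Execution error: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```

Note the two checks: one right after the launch for configuration problems, and one after synchronization for errors that occur while the kernel actually runs.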
Tips and Tricks
There are two main tips to follow when using the CUDA error checking methods.
- Always check for errors before you run the main functions of your code. I've lost count of the number of times I could not find the error the error function was reporting, simply because it originated from something silly unrelated to the main code, such as a bad cast or badly allocated memory.
- Use a wrapper function to report the errors correctly and easily.
This function can be used to wrap any call that returns a CUDA error. For example:
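A common shape for such a wrapper (the names gpuCheck and gpuAssert are my own choice, not part of the Toolkit) is a macro that passes the call site's file and line to a small reporting function:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap any call that returns cudaError_t: gpuCheck(cudaMalloc(...));
#define gpuCheck(call) gpuAssert((call), __FILE__, __LINE__)

inline void gpuAssert(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s at %s:%d\n",
                cudaGetErrorString(code), file, line);
        exit(EXIT_FAILURE);
    }
}

int main() {
    int *d_buf;
    gpuCheck(cudaMalloc(&d_buf, 1024 * sizeof(int)));
    gpuCheck(cudaMemset(d_buf, 0, 1024 * sizeof(int)));
    gpuCheck(cudaFree(d_buf));
    return 0;
}
```

The macro is what makes the report useful: __FILE__ and __LINE__ expand at the call site, so the message points at the exact line that failed rather than at the reporting function.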
Error checking is a critical part of any program. While error checking in CUDA is not as straightforward as in most programs, it is not complicated in and of itself; it simply requires more manual work than usual.
Using the error checking mechanisms included in the CUDA Toolkit gives us the ability to successfully monitor and manage our errors, even in GPU code.