Real Life CUDA Programming — part 2 — Hello CUDA

Ori Cohen
5 min read · Jan 23, 2019

If you haven’t read the first two posts in this series, they are An Overview and An Introduction to the GPU.

Setup

The first things needed for compiling GPU code are the tools we described previously. They are included in the CUDA Toolkit, which can be downloaded and installed from NVIDIA’s developer site. Depending on your GPU and driver version, you might need to use an earlier version of the CUDA Toolkit.

Please note that the code sections that follow can be compiled for any OS, but were tested on Linux.

I like compiling and profiling the code using a simple Makefile that looks like this:
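A minimal sketch of such a Makefile (the source file name add.cu and binary name add are placeholders; nvcc compiles the code and nvprof profiles it):

    # Minimal sketch of a build/profile Makefile. Recipe lines must start with a tab.
    NVCC = nvcc
    TARGET = add

    all: $(TARGET)

    $(TARGET): add.cu
    	$(NVCC) -o $(TARGET) add.cu

    profile: $(TARGET)
    	nvprof ./$(TARGET)

    clean:
    	rm -f $(TARGET)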

Using the Makefile, compiling is done simply by running make, and profiling with make profile.

Let’s get coding!

The classic starter code, the “Hello World” of GPU programming, is a simple program that adds two vectors. Simple though it is, the program demonstrates a few key points: how to launch GPU threads, how different launch configurations affect running time, and how to adapt regular code so it can run on the GPU.

We’ll be working on improving a simple program that adds two vectors. To properly see the benefit of the GPU’s parallel computing capabilities, we need a large enough array; we’ll use a size of 1,000,000. Adding two arrays, each a million elements long, benefits significantly from running in parallel on the GPU. Using the GPU we can achieve up to a roughly 11,000x performance improvement, but to get that much of a boost we must go through several iterations of improving our code.

The CPU version of the code
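A minimal sketch of such a CPU-only version, using a million-element array and the addArrays function discussed below (the float element type and the initial values are illustrative choices):

    #include <cstdio>

    const int N = 1000000; // one million elements

    // Plain C++ function: add each element of x into y, one after the other.
    void addArrays(int n, float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = x[i] + y[i];
    }

    int main() {
        float *x = new float[N];
        float *y = new float[N];

        // Fill the inputs with simple values on the CPU.
        for (int i = 0; i < N; i++) {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        addArrays(N, x, y); // runs sequentially, one element at a time

        printf("y[0] = %f\n", y[0]); // expect 3.0

        delete[] x;
        delete[] y;
        return 0;
    }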

We’ll start with the most basic element needed to run code on the GPU: the kernel function. A kernel is a regular C++ function marked with the __global__ qualifier, which tells the CUDA compiler to treat it as a function that is called from the CPU but runs on the GPU. More on function types in a later post. This function will always have the same basic structure.

Thread away

The first type of kernel function is a thread-only function. This function will be run by a number of threads, but only in a single thread block, which, if you remember from the previous post, is a group of threads that can share resources and cooperate with one another. Running the kernel this way will speed up our code, but not by the factor we’re aiming for. The kernel function is as follows.
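A minimal sketch of such a single-block kernel (the exact names and layout are illustrative):

    // Kernel: called from the CPU, runs on the GPU in a single thread block.
    __global__ void addArrays(int n, float *x, float *y) {
        int index = threadIdx.x;  // this thread's position within the block
        int stride = blockDim.x;  // how many threads the block contains
        // Each thread starts at its own index and jumps ahead by the block size,
        // so together the threads cover all n elements.
        for (int i = index; i < n; i += stride)
            y[i] = x[i] + y[i];
    }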

Let’s understand what’s going on here.

First, we can see that the kernel has access to special variables, such as threadIdx and blockDim (we’ll see more of them soon). These variables correspond to the launch configuration the kernel was launched with. Broadly speaking, a launch configuration consists of two parts: the number of blocks to run and the number of threads in each block. In the case of this kernel, there is only one block. We can also see that the loop itself (inside the addArrays function) has changed, but not by much. Looking at the new loop, the only difference is that instead of going over each element sequentially, we run multiple threads, each responsible for some of the elements in the array, which is the reason for the new loop structure.

Running any kernel is simply a matter of using the special syntax that CUDA provides. The syntax is rather straightforward, consisting of two sets of triple angle brackets: <<< and >>>. (You can ignore the memory allocation lines; we’ll deal with them in the next post.)
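A sketch of a host main that launches the kernel above, using one block of 256 threads (the block size is an arbitrary but typical choice):

    int main() {
        int N = 1000000;
        float *x, *y;

        // Memory allocation: covered in detail in the next post.
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));

        // Initialize on the CPU for now (see the bonus exercise at the end).
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        // Launch configuration: 1 block of 256 threads.
        addArrays<<<1, 256>>>(N, x, y);

        // Wait for the GPU to finish before using the results on the CPU.
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }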

Between these brackets we give the launch configuration, in the form of the two elements we’ve already discussed: the grid dimensions (the number of blocks to run) and the block dimensions (the number of threads per block). For convenience, each of these can have up to three dimensions, which makes it easier to build programs around structures that themselves have more dimensions. (The classic example is matrix multiplication, where we want a 2-dimensional array of threads, each responsible for one element of the matrix.)
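As a hypothetical illustration (not part of the vector-addition program), a two-dimensional launch configuration built with dim3 might look like this, with each thread handling one element of a width x height matrix:

    // Hypothetical 2D kernel: each thread scales one element of a width x height matrix.
    __global__ void scaleMatrix(float *m, int width, int height, float factor) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            m[row * width + col] *= factor;
    }

    void launchScale(float *m, int width, int height) {
        dim3 threadsPerBlock(16, 16);        // a 16x16 tile, 256 threads per block
        dim3 numBlocks((width + 15) / 16,    // enough blocks to cover every column...
                       (height + 15) / 16);  // ...and every row
        scaleMatrix<<<numBlocks, threadsPerBlock>>>(m, width, height, 2.0f);
    }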

Looking back at the vector-addition launch above, we’ve already noted that there is only one thread block and a fixed number of threads, which means the launch will improve the running time of the code, but only by a constant factor (in this case, the number of threads). This is cool, but we can do better.

Blocks galore

How can we enhance this code? There is still one dimension we haven’t taken advantage of yet: thread blocks. Launching our kernel with several blocks allows us to scale up to even larger and more complex programs. Changing the kernel to run on several blocks is just a matter of recalculating the indices used inside the kernel. The launch configuration changes accordingly, with each element in the array now receiving its own thread on the GPU. The new launch configuration finally gives us the speedup we wanted!
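A sketch of the multi-block version, assuming the same addArrays kernel and a block size of 256; the block count is rounded up so every element is covered:

    // Kernel: now indexed across the whole grid, not just one block.
    __global__ void addArrays(int n, float *x, float *y) {
        int index = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
        int stride = blockDim.x * gridDim.x;               // total threads in the grid
        for (int i = index; i < n; i += stride)
            y[i] = x[i] + y[i];
    }

    int main() {
        int N = 1000000;
        float *x, *y;
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        int blockSize = 256;
        int numBlocks = (N + blockSize - 1) / blockSize; // round up to cover all elements
        addArrays<<<numBlocks, blockSize>>>(N, x, y);

        cudaDeviceSynchronize();
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

With one thread per element the loop body runs at most once per thread, but keeping the loop means the same kernel still works if the array ever grows larger than the grid.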

To understand the code, it might be helpful to see how the blocks and threads are structured:

Block and Thread structure (source: http://geco.mines.edu/tesla/cuda_tutorial_mio/pic/Picture1.png)

Using this image as a reference, we can now understand the loop in the kernel function, which gives each thread a different element to work on, based on the different block and grid dimensions.

The difference between this code and the previous one is simply the way we arrange our threads. Since we are using multiple blocks, the beginning index and stride of each thread change accordingly.

Summing Up

So, what have we learned?

  • We saw how to write and launch kernels.
  • We understood what launch configurations are and how they affect our kernels and their performance.
  • We wrote our very first CUDA program!

Next time — CUDA memory and error checking. See you then!

Bonus points: write another kernel, this one responsible for filling the arrays (instead of the CPU function that is currently called).
