Real Life CUDA Programming — part 3 — Unified Memory

Ori Cohen
5 min read · Feb 19, 2019

Or: going back to basics

This is part 3 of the CUDA programming series; you are welcome to check out the previous posts.

The next few posts will be geared towards understanding key concepts of general-purpose GPU programming in general and CUDA programming in particular. While each post will include some functional code to demonstrate the concept being discussed, the main purpose is to explain the concepts as clearly as possible.

In short

In this post we will talk about the memory model used by the GPU in general and by CUDA in particular: a model you need to understand both to write code efficiently and to write efficient code.

What is a memory model?

Generally speaking, a memory model is the way the programmer views the machine's memory. In classical programs, we usually talk about two types of memory: the stack and the heap. The stack is where your statically sized variables go, the ones you declare inside functions and classes with a size known in advance. The heap is where dynamic memory lives. The main difference between the two is that static memory can be calculated ahead of time by the compiler, which allocates space on the stack for all those variables (and also for function calls and returns, but that is a different subject), while dynamic memory is only requested at runtime.

In good old C code, this distinction is obvious. Allocating memory dynamically (for example, creating an array whose size is decided at runtime) is only possible through special functions such as malloc. Any and all other variables must have their size known at compile time (for example, a simple array must have its size specified by a hard-coded number or a constant, so that the compiler knows how big the array is). In later, more sophisticated languages the boundary is blurred (for example, in C++ and Java you can create arrays whose size is decided at runtime), but since CUDA extends the C memory model, that is the one we will keep in mind.
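As a quick illustration in plain C (the function name, array size, and values here are only for illustration):

#include <stdlib.h>

void example(int n) {
    int on_the_stack[16];           /* size fixed at compile time, so the compiler
                                       can reserve room for it on the stack        */

    int *on_the_heap = malloc(n * sizeof(int));  /* size known only at runtime, so
                                                    the memory comes from the heap */
    if (on_the_heap == NULL)
        return;                     /* heap allocation can fail and must be checked */

    on_the_stack[0] = 42;
    on_the_heap[0]  = 42;

    free(on_the_heap);              /* heap memory is released explicitly           */
}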

The CUDA memory model

When programming for the GPU, you must remember that there are two machines your memory can be stored on: the host (the computer running the program) and the device (the GPU executing the CUDA code). Each of them implements a C memory model, i.e. each has a separate stack and heap. You might be beginning to see the problem inherent in this separation. When writing code that includes CUDA sections, at some point you have to transfer your memory from the host memory (which is used by the CPU) to the device memory (which is used by the GPU). This is a tedious and inefficient task, for two main reasons.

First, it requires that you manually write the code for copying the memory from one to the other, which is easy to forget and/or get wrong. Second, this method is inefficient at runtime. As the GPU can't access the host memory, the memory sections must be copied serially by the CPU into the device memory, a costly operation.
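To make the problem concrete, here is roughly what that manual approach looks like. This is only a sketch: the addOne kernel, the array size, and the launch configuration are made up for illustration, and error checking is omitted.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(int);

    int *h_data = (int*)malloc(bytes);          // host copy of the data
    for (int i = 0; i < n; ++i) h_data[i] = i;

    int *d_data = NULL;
    cudaMalloc(&d_data, bytes);                 // separate device allocation

    // manual, easy-to-forget copies between the two memory spaces
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    addOne<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("h_data[0] = %d\n", h_data[0]);
    cudaFree(d_data);
    free(h_data);
    return 0;
}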

Luckily for us, the good folks at NVIDIA who develop CUDA gave us a great solution that solves both problems: unified memory.

(Image: the CPU and the GPU sharing a single unified memory space. Source: NVIDIA developer blog)

As both the image and the name convey, the unified memory model has one simple purpose: it gives programmers a single memory space to work with. Simple in concept, but in practice this model saves us developers both time and effort. With a single heap to work with, we can allocate memory that is accessible from both the CPU and the GPU. Moreover, this method allows us to prefetch the memory before it is used, which means it will be more readily available for the GPU at runtime.

Coding Time!

Let’s see some code in action.

We’ll go through several iterations of this code in order to demonstrate the various features of unified memory.

This code will be based on the basic code you first saw in the previous post.

The simple Hello World program from the last post
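The original gist is not reproduced here, so below is a minimal sketch in the same spirit, assuming a managed array that a regular CPU function fills and a kernel then doubles. The names doubleElements and initWith, the array size, and the launch configuration are illustrative, not taken from the original post.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void doubleElements(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

// a regular CPU function that receives the very same pointer
void initWith(float value, float *a, int n) {
    for (int i = 0; i < n; ++i) a[i] = value;
}

int main(void) {
    const int n = 1 << 20;
    float *a = NULL;

    // one allocation call, usable from both the CPU and the GPU
    cudaMallocManaged(&a, n * sizeof(float));

    initWith(1.0f, a, n);                           // the CPU touches the memory
    doubleElements<<<(n + 255) / 256, 256>>>(a, n); // the GPU uses the same pointer
    cudaDeviceSynchronize();                        // wait for the kernel to finish

    printf("a[0] = %f\n", a[0]);
    cudaFree(a);
    return 0;
}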

There are several things worth noting about this code.

First, there is only one allocation call per pointer. The pointer now holds an address that can be used by both the CPU and the GPU. You might be wondering how this is possible. The short answer is that behind the scenes CUDA moves the memory to device memory when requested by the GPU (for the longer answer you are welcome to read this post by the NVIDIA developer team, which goes into more detail about the unified memory migration system). If you think this is inefficient, you’re right. We’ll see how to improve the performance in a second.

Second, using pointers to this unified memory means that you can pass the same pointers both to regular functions and to kernel functions, as demonstrated in the code. This allows us to write vastly more flexible code, as we don't need to constantly make sure the memory is synchronized between host and device.

Here’s a second version of the same code, this time using prefetching. The only changes are in the main function, right before the kernel launch.
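As before, the original snippet is not reproduced here; the sketch below shows what a prefetching variant of the illustrative program above might look like, with the cudaGetDevice query and the cudaMemPrefetchAsync call as the only additions to main.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void doubleElements(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

void initWith(float value, float *a, int n) {
    for (int i = 0; i < n; ++i) a[i] = value;
}

int main(void) {
    const int n = 1 << 20;
    float *a = NULL;
    cudaMallocManaged(&a, n * sizeof(float));

    initWith(1.0f, a, n);                 // all CPU work on the memory happens first

    int deviceId;
    cudaGetDevice(&deviceId);             // ask CUDA which GPU we are currently using

    // start migrating the memory to the GPU in the background,
    // before the kernel actually needs it
    cudaMemPrefetchAsync(a, n * sizeof(float), deviceId);

    doubleElements<<<(n + 255) / 256, 256>>>(a, n);
    cudaDeviceSynchronize();

    printf("a[0] = %f\n", a[0]);
    cudaFree(a);
    return 0;
}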

We use the cudaMemPrefetchAsync function to make the memory available to the GPU before it actually requests it. Notice where we call the function: immediately before the kernel launch, but after all calls to CPU functions. Calling the prefetch function before launching the kernel causes the memory to start being copied into device memory in the background. This operation involves both the CPU and the GPU, so we issue it only after all other CPU work is done. Notice that we must also tell the prefetch function which device the memory should be copied to. Getting the device id is usually done by asking the CUDA library for the current device, but it can also be overridden by giving the function a different number (for example, if you want to write multi-GPU code). For those interested, the full docs for the prefetch function can be found here.

Using prefetching can as much as double our performance, enabling us to write much more scalable and robust code.

Summary

So, what have we learned?

  • We understood what a memory model is and the problems that come from having separate host and device memories
  • We saw how the unified memory model solves those problems
  • We used prefetching to improve our memory performance so that we could write better code!

Next time — CUDA error checking. See you then!

Bonus points: Write the function initializing the arrays as a kernel. Where do we need to prefetch the memory now?

