Real Life CUDA Programming - Part 1 — A gentle introduction to the GPU

  • Introduction (why care about the hardware)
  • Your new best friends (common tools)
  • GPU deep dive (useful terms)


If you’re also interested in the motivation behind writing all this, check out part 0.

Unlike most programming problems, coding for the GPU requires some understanding of the basic terms and concepts of the hardware itself. This is necessary for two main reasons:

  1. If you’re coding in CUDA, you’re looking for speed. The GPU can give you a lot of it, but only if you know how to use the device properly.
  2. Many of the effects and restrictions you see when programming in CUDA are caused by the hardware underneath, and understanding that hardware will also improve your ability to program successfully.

Beyond that, as in any craft, programming in CUDA means knowing the tools available to you.

Your new best friends

Let me start from the end and introduce you to two of the most powerful tools you will use for GPU programming.


NVCC (the NVIDIA CUDA Compiler) is the compiler for any CUDA program. It behaves like any other C/C++ compiler, and it also takes care of compiling and linking the regular, non-CUDA code in your project (behind the scenes it hands that code to your ordinary host compiler). What makes NVCC special is that it can compile CUDA functions and symbols, which a regular compiler can’t. Other than that, you simply swap your current compiler for NVCC and everything should work as before.

This also means that you can use NVCC in any build system you used before, for example a makefile.
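To make this concrete, here is a minimal sketch of a file that mixes ordinary host code with one GPU function. The file name hello.cu and the kernel name hello_kernel are made up for illustration; the build line in the comment is the plain NVCC invocation, used the same way you would call any other compiler.

```cuda
// hello.cu (hypothetical file name)
// Build:  nvcc hello.cu -o hello
#include <cstdio>
#include <cuda_runtime.h>

// __global__ marks a function that runs on the GPU and is launched from the CPU.
// A regular C++ compiler does not understand this keyword; NVCC does.
__global__ void hello_kernel() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    std::printf("Hello from the CPU\n");  // ordinary host code

    hello_kernel<<<1, 4>>>();             // launch 1 block of 4 threads

    // Kernel launches are asynchronous, so wait for the GPU to finish.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

In a makefile, the only change from a regular C++ project would be that the compile rule calls nvcc instead of your usual compiler.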


NVProf (NVidia Profiler) is the second most common tool you’ll find. Profiling (measuring how long each part of your program takes) is commonplace everywhere, but it is especially important for GPU programming. The reason is simple: as I said before (and as I’ll probably say again), we have a need for speed. This drives us to keep improving our run time, and the profiler is a tremendous help in that regard. The downside of this tool is that it focuses on code that runs on the GPU (“device code”) and on calls into the CUDA API, so by default it tells you very little about the rest of your CPU code (“host code”). Note that for recent GPUs NVIDIA has replaced NVProf with the Nsight Systems and Nsight Compute profilers, but the idea is the same.

GPU Deep Dive

So, now we must face the huge wall that is the GPU architecture.


(Full disclosure: the learning curve I have in mind is the Vim learning curve, but I think it applies just the same.)

Although it is difficult to understand at first, once you get the hang of the architecture and capabilities of the GPU, the rest of the way is much easier. There are three main ideas that need to be understood:

  1. GPU Cores
  2. GPU Threads & thread blocks
  3. GPU Memory

Once you are confident in those three ideas, you’ll be ready to program!

Let’s tackle them one at a time.

GPU cores

First, the architecture itself.

A GPU, unlike a CPU, is designed for massively parallel computing. As a result, a GPU is built from hundreds or even thousands of simple cores that together can handle many thousands of threads running in parallel.

It is important to remember that on the CPU, the number of operations that truly run in parallel is limited by the number of cores (often no more than 8 on a desktop machine). The GPU is not programmed that way: you are able, and even encouraged, to schedule thousands of parallel threads, far more than there are cores, so that the hardware always has work ready and the cores stay busy.

This view is important because, as we will see, we will schedule many more threads than there are cores.
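If you are curious what your own GPU looks like, the CUDA runtime can report it. The sketch below assumes a single GPU at index 0. It uses one term not introduced above: the hardware groups its cores into units called streaming multiprocessors, and the runtime reports limits per multiprocessor rather than per core. The helper max_resident_threads is a name I made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on the number of threads that can be
// resident on the whole GPU at once
// (multiprocessors x max resident threads per multiprocessor).
long long max_resident_threads(const cudaDeviceProp& prop) {
    return static_cast<long long>(prop.multiProcessorCount) *
           prop.maxThreadsPerMultiProcessor;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // 0 = first GPU (assumed)
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device:                      %s\n", prop.name);
    std::printf("Multiprocessors:             %d\n", prop.multiProcessorCount);
    std::printf("Max threads per block:       %d\n", prop.maxThreadsPerBlock);
    std::printf("Warp size:                   %d\n", prop.warpSize);
    std::printf("Max resident threads (GPU):  %lld\n", max_resident_threads(prop));
    return 0;
}
```

The last number is typically in the tens of thousands, which is why launching "only" a few hundred threads leaves most of the device idle.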

GPU Threads

As I said, the GPU is capable of running thousands of threads in parallel, but in order to launch that many threads, we must group them into smaller units called thread blocks. Every block in a launch has the same size, which can be at most 1024 threads on current GPUs and should be a multiple of the warp size (32 on all current NVIDIA GPUs). The hardware executes threads one warp at a time, so a block of any other size leaves part of its last warp idle. We can then launch as many blocks as we need in what is called a grid. These two terms will appear every time you want to run something on the GPU, as you must specify both the block size (which is usually static and hardcoded) and the grid size (which is usually determined at runtime so that there is at least one thread for each element of the computation).

For example, when adding together two arrays (an example we’ll analyze in depth in the next post), we run a separate thread for each element in the array.
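As a quick preview, here is a minimal sketch of how the block size and grid size show up in code. The names blocks_for and add_arrays are my own, and the sketch uses managed memory only to keep the example short. The key lines are the ceiling division that picks the grid size and the index computation inside the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: smallest number of blocks with blocks * block_size >= n.
int blocks_for(int n, int block_size) {
    return (n + block_size - 1) / block_size;  // integer ceiling division
}

// Each thread adds one pair of elements.
__global__ void add_arrays(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n) {  // the last block may contain threads past the end of the arrays
        out[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1000;
    const int block_size = 256;                       // hardcoded, a multiple of 32
    const int grid_size = blocks_for(n, block_size);  // 4 blocks, i.e. 1024 threads

    // Managed memory keeps the sketch short; allocation error checks are omitted.
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }

    add_arrays<<<grid_size, block_size>>>(a, b, out, n);
    cudaError_t err = cudaDeviceSynchronize();  // wait for the GPU to finish
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("grid_size = %d, out[%d] = %.1f\n", grid_size, n - 1, out[n - 1]);
    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return 0;
}
```

Note that 1000 elements with blocks of 256 gives 1024 threads, so 24 threads have nothing to do. The `if (i < n)` check is what keeps them from writing outside the arrays.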

GPU Memory


The last element that must be understood before we can start the actual coding is the memory that can be accessed by the GPU.

Generally speaking, there are three types of memory that are in play when running GPU code.

  1. Regular CPU memory (“host memory”). This is the memory you get when allocating in standard code (i.e. when using the new keyword or the malloc function). This memory is only accessible by the CPU and not the GPU.
  2. GPU-only memory (“device memory”). This memory exists on the GPU and is accessible by the GPU and not the CPU.
  3. Unified Memory. This memory is accessible by both the CPU and the GPU. In practice, the unified memory is automatically managed by CUDA and transferred as needed between the CPU and the GPU. Using unified memory simplifies the programming process and is discussed in a later post.

The immediate implication of having two separate memories is that data must be copied back and forth: inputs have to be copied from host memory to device memory before the GPU can work on them, and results have to be copied back afterwards (because, as we said, the GPU has no access to the memory on the host). These copies take time and are often a bottleneck in the program.
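Here is a rough sketch of what that copying looks like with explicit host and device memory. The function double_on_gpu and the kernel double_values are hypothetical names for illustration; cudaMalloc, cudaMemcpy and cudaFree are the standard CUDA runtime calls.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread doubles one element, in place, in device memory.
__global__ void double_values(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
    }
}

// Hypothetical helper: doubles every element of host_data on the GPU.
// Returns false if a CUDA call fails.
bool double_on_gpu(std::vector<float>& host_data) {
    const int n = static_cast<int>(host_data.size());
    if (n == 0) return true;  // nothing to do, and a grid of 0 blocks is invalid
    const size_t bytes = n * sizeof(float);

    float* device_data = nullptr;
    if (cudaMalloc(&device_data, bytes) != cudaSuccess) return false;  // device memory

    // Host -> device: the GPU cannot read host_data directly.
    bool ok = cudaMemcpy(device_data, host_data.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        double_values<<<(n + 255) / 256, 256>>>(device_data, n);
        // Device -> host: this blocking copy runs after the kernel has finished.
        ok = cudaMemcpy(host_data.data(), device_data, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(device_data);
    return ok;
}

int main() {
    std::vector<float> values = {1.0f, 2.0f, 3.0f, 4.0f};  // lives in host memory
    if (!double_on_gpu(values)) {
        std::printf("A CUDA call failed\n");
        return 1;
    }
    for (float v : values) std::printf("%.1f ", v);
    std::printf("\n");
    return 0;
}
```

For four numbers, the two copies cost far more than the computation itself, which is exactly the bottleneck described above. The GPU only pays off when there is enough work per byte copied.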


So what have we learned today?
We looked at the common tools that are used with every CUDA project (NVCC, NVProf).

We’ve also looked at the basic terms surrounding GPU and CUDA programming and saw some of their implications on how we should write code for the GPU.

Next time — our first CUDA program!

See you then!

A math and computer science enthusiast.
