Accelerated Computing with CUDA on C/C++
This is a certified course from the NVIDIA Deep Learning Institute (DLI).
Work launched on the GPU is asynchronous with respect to the host. The host must wait for it to complete by calling cudaDeviceSynchronize().
CUDA kernel launch syntax (the execution configuration):
kernelName<<< #blocks, #threads_per_block >>>()
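A minimal sketch of declaring, launching, and synchronizing a kernel (the helloFromGpu name is purely illustrative):
#include <cstdio>

// Illustrative kernel: each thread prints its block and thread index.
__global__ void helloFromGpu()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    helloFromGpu<<<2, 4>>>();   // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // GPU work is asynchronous; wait for it to finish
    return 0;
}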
Introduction
CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model, released in 2007. It lets developers use NVIDIA GPUs for general-purpose computing (GPGPU), extending beyond graphics rendering into scientific computing, deep learning, data analytics, and more.
CUDA provides:
- A C/C++/Fortran API with GPU extensions (__global__, __device__, etc.)
- A runtime and driver API for GPU management
- Optimized libraries (cuBLAS, cuDNN, cuFFT, etc.)
- Compiler toolchain (NVCC)
Host vs Device
- Host = CPU & system memory (RAM)
- Device = GPU & global memory (VRAM)
- CUDA programs split work between host and device.
Kernels
- A kernel is a function executed on the GPU.
- Defined with the __global__ keyword in CUDA C/C++.
- Invoked with the <<<grid, block>>> execution configuration syntax (a host-side launch is sketched after the kernel code below).
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) c[i] = a[i] + b[i];
}
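A host-side sketch of launching this kernel, assuming a, b, and c already point to device-accessible memory holding n elements:
int threadsPerBlock = 256;
int numberOfBlocks = (n + threadsPerBlock - 1) / threadsPerBlock; // round up so every element is covered
vectorAdd<<<numberOfBlocks, threadsPerBlock>>>(a, b, c, n);
cudaDeviceSynchronize(); // the launch is asynchronous; wait before reading c on the host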
Thread Hierarchy
CUDA organizes threads in a grid–block–thread model:
- Thread: smallest unit of execution
- Block: group of threads (1D/2D/3D)
- Grid: group of blocks (1D/2D/3D)
Each thread has unique IDs (threadIdx, blockIdx) to identify its work; a 2D example follows.
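A sketch of a 2D launch using dim3 (the scale kernel is hypothetical, and m, width, and height are assumed to exist on the host):
// Each thread handles one element of a width x height matrix.
__global__ void scale(float *m, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        m[row * width + col] *= factor;
}

// Host side: 16x16 threads per block, enough blocks to cover the whole matrix.
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
scale<<<grid, block>>>(m, width, height, 2.0f);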
Memory Hierarchy
Different memory types with trade-offs:
- Registers (per-thread, fastest)
- Shared memory (per-block, low latency, useful for collaboration within a block; sketched below)
- Global memory (device-wide, high latency, but large)
- Constant & texture memory (special cached memories)
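A minimal sketch of per-block shared memory (the blockSum kernel is hypothetical and assumes at most 256 threads per block, a power of two):
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];                 // low-latency scratch space shared by the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f; // stage one element per thread from global memory
    __syncthreads();                            // all writes must finish before any thread reads

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];        // one partial sum per block back to global memory
}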
SIMT (Single Instruction, Multiple Thread)
- Threads are executed in groups called warps (32 threads).
- All threads in a warp execute the same instruction at a time (though divergence is possible); warp-level primitives are sketched below.
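As an illustration (a sketch, not from the course, assuming a single 32-thread warp), warp-level primitives such as __shfl_down_sync let the threads of a warp exchange register values directly:
// Sums the 32 values held by one warp; assumes the kernel is launched with <<<1, 32>>>.
__global__ void warpSum(const float *in, float *out)
{
    float val = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset); // add the value held by the lane `offset` positions higher
    if (threadIdx.x == 0)
        *out = val;                                       // lane 0 ends up with the warp-wide sum
}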
Streams & Concurrency
- Streams allow overlapping computation and data transfers.
- Enables pipelining (compute on one batch while transferring another).
CUDA Version Evolution (Highlights)
NVIDIA has steadily expanded CUDA since 2007. Here are the key version milestones:
Version | Release Year | Key Features / Changes |
---|---|---|
1.0 | 2007 | First release; GPU programming in C, targeting the Tesla architecture. |
2.x | 2008–09 | Double-precision support, improved atomics, multi-GPU support. |
3.x | 2010–11 | Fermi architecture: unified cache, ECC memory, improved concurrency. |
4.x | 2011 | Unified Virtual Addressing (UVA), peer-to-peer GPU memory access. |
5.x | 2012 | Kepler architecture: dynamic parallelism (kernels launching kernels), shuffle instructions. |
6.x | 2014 | Maxwell support, Unified Memory (single address space across CPU and GPU). |
7.x | 2015 | C++11 support, cuSOLVER, NVRTC runtime compilation. |
8.0 | 2016 | Pascal architecture: NVLink, half precision (FP16), Unified Memory page migration. |
9.x | 2017 | Volta architecture: Tensor Cores, cooperative groups, warp-level primitives. |
10.x | 2018–19 | Turing architecture, CUDA Graphs, enhanced Tensor Cores (mixed precision). |
11.x | 2020–21 | Ampere architecture, CUDA Python, expanded C++17 support. |
12.x | 2022–23 | Hopper architecture, FP8 precision, improved asynchronous execution, new memory models. |
12.5 | 2024 | Hopper and Ada Lovelace optimizations, streamlined multi-GPU workflows, expanded cooperative groups. |
CUDA provides built-in variables inside a kernel that describe the executing thread, block, and grid (these combine into the grid-stride loop sketched below):
- gridDim.x : number of blocks in the grid
- blockIdx.x : index of the current block within the grid
- blockDim.x : number of threads per block
- threadIdx.x : index of the current thread within its block
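A sketch of the grid-stride loop pattern used by the sample programs below (the doubleElements kernel is hypothetical):
__global__ void doubleElements(float *a, int N)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x; // this thread's global index
    int stride = gridDim.x * blockDim.x;                // total number of threads in the grid
    // Each thread processes elements index, index + stride, index + 2*stride, ...
    for (int i = index; i < N; i += stride)
        a[i] *= 2.0f;
}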
Streaming Multiprocessors
NVIDIA GPUs contain functional units called Streaming Multiprocessors or SMs.
- Blocks of threads are scheduled to run on SMs.
- Depending on the number of SMs on a GPU and the resource requirements of a block, more than one block can be scheduled on an SM.
- Grid dimensions that are a multiple of the number of SMs on the GPU can promote full SM utilization, as in the sketch below.
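A sketch of sizing the grid from the SM count (the same query the sample program below uses):
int deviceId;
int numberOfSMs;
cudaGetDevice(&deviceId);                                                       // currently active GPU
cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId); // SMs on that GPU
int numberOfBlocks = 32 * numberOfSMs;                                          // grid size as a multiple of the SM count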
Unified Memory behaviour
Unified Memory (UM) is allocated with cudaMallocManaged. The returned pointer is accessible from both host and device code, so the GPU can write its output there and the host can read it directly.
Behaviour:
- At UM allocation, the memory may not be resident initially on either the CPU or the GPU.
- When some work accesses the memory for the first time, a page fault occurs.
- The page fault triggers migration of the demanded memory.
- The process repeats any time the memory is requested somewhere in the system where it is not resident.
- If it is known that the memory will be accessed somewhere it is not resident, it can be prefetched asynchronously with cudaMemPrefetchAsync, passing the destination device id (or cudaCpuDeviceId for the CPU); see the sketch below.
- This moves the memory in larger batches and prevents page faulting.
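A sketch of prefetching managed memory to the GPU before a kernel and back to the CPU before host access (deviceId, size, blocks, threads, and someKernel are assumed to exist):
float *a;
cudaMallocManaged(&a, size);
cudaMemPrefetchAsync(a, size, deviceId);        // migrate pages to the GPU ahead of the kernel
someKernel<<<blocks, threads>>>(a);
cudaMemPrefetchAsync(a, size, cudaCpuDeviceId); // migrate results back before the host reads them
cudaDeviceSynchronize();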
Concurrent CUDA Streams
CUDA kernels run in the default stream. 1 kernel must complete before the next kernel launch. In any stream an instruction in it must complete before the next can begin.
However, Multiple streams can be made. Instructions in the stream must run in sequence, whereas the kernels in different, non-default streams can interact concurrently.
Default stream is special, as it blocks all kernels in all other streams.
Creating a stream:
cudaStream_t stream; // CUDA streams are of type `cudaStream_t`.
cudaStreamCreate(&stream); // Note that a pointer must be passed to `cudaStreamCreate`.
someKernel<<<number_of_blocks, threads_per_block, 0, stream>>>(); // `stream` is passed as the 4th execution configuration argument.
cudaStreamDestroy(stream); // Note that a value, not a pointer, is passed to `cudaStreamDestroy`.
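Building on this, a sketch of launching the same hypothetical kernel in several non-default streams so the launches can overlap:
const int numStreams = 3;
cudaStream_t streams[numStreams];
for (int i = 0; i < numStreams; ++i)
{
    cudaStreamCreate(&streams[i]);
    someKernel<<<number_of_blocks, threads_per_block, 0, streams[i]>>>(); // one launch per stream
}
cudaDeviceSynchronize();                   // wait for all streams to finish
for (int i = 0; i < numStreams; ++i)
    cudaStreamDestroy(streams[i]);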
Non-Unified Memory
Memory can be allocated directly on the GPU with cudaMalloc (and pinned host memory can be allocated with cudaMallocHost, described below). Memory allocated in either of these ways can be copied to other locations with cudaMemcpy, using a direction flag such as cudaMemcpyHostToDevice. Copying leaves two copies of the data in the system.
cudaMemcpyAsync can transfer memory asynchronously over a non-default stream.
This allows memory copies to overlap with computation, as sketched below.
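A sketch of the basic non-unified flow with explicit, synchronous copies (someKernel, blocks, threads, and size are assumed):
float *h_a = (float *)malloc(size);                   // pageable host memory
float *d_a;
cudaMalloc(&d_a, size);                               // device-only allocation; not dereferenceable on the host
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);   // host -> device
someKernel<<<blocks, threads>>>(d_a);
cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);   // device -> host (blocks until the kernel and copy finish)
cudaFree(d_a);
free(h_a);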
More on Memory management
- cudaMalloc allocates memory directly on the active GPU. This prevents all GPU page faults. In exchange, the pointer it returns is not available for access by host code.
- cudaMallocHost allocates memory directly on the CPU. It also "pins" the memory (page-locks it), which allows asynchronous copying of the memory to and from a GPU. Too much pinned memory can interfere with CPU performance, so use it only with intention. Pinned memory should be freed with cudaFreeHost.
- cudaMemcpy can copy (not transfer) memory, either from host to device or from device to host; a combined sketch follows.
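A sketch combining these calls: pinned host memory plus cudaMemcpyAsync in a non-default stream (the names here are assumptions, not from the course sample):
float *h_a;                                           // pinned (page-locked) host buffer
float *d_a;
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMallocHost(&h_a, size);                           // pinning enables truly asynchronous copies
cudaMalloc(&d_a, size);
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream); // copy can overlap work in other streams
someKernel<<<blocks, threads, 0, stream>>>(d_a);                 // ordered after the copy within this stream
cudaMemcpyAsync(h_a, d_a, size, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);                        // wait for this stream only
cudaFree(d_a);
cudaFreeHost(h_a);                                    // pinned memory is freed with cudaFreeHost
cudaStreamDestroy(stream);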
Sample program
#include <stdio.h>
void initWith(float num, float *a, int N)
{
for(int i = 0; i < N; ++i)
{
a[i] = num;
}
}
__global__
void addVectorsInto(float *result, float *a, float *b, int N)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for(int i = index; i < N; i += stride)
{
result[i] = a[i] + b[i];
}
}
void checkElementsAre(float target, float *vector, int N)
{
for(int i = 0; i < N; i++)
{
if(vector[i] != target)
{
printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
exit(1);
}
}
printf("Success! All values calculated correctly.\n");
}
int main()
{
int deviceId;
int numberOfSMs;
cudaGetDevice(&deviceId); // id of the currently active GPU
cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId); // number of SMs on that GPU
const int N = 2<<24;
size_t size = N * sizeof(float);
float *a;
float *b;
float *c;
cudaMallocManaged(&a, size);
cudaMallocManaged(&b, size);
cudaMallocManaged(&c, size);
initWith(3, a, N);
initWith(4, b, N);
initWith(0, c, N);
size_t threadsPerBlock;
size_t numberOfBlocks;
threadsPerBlock = 256;
numberOfBlocks = 32 * numberOfSMs; // size the grid as a multiple of the SM count
cudaError_t addVectorsErr;
cudaError_t asyncErr;
addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);
addVectorsErr = cudaGetLastError();
if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));
asyncErr = cudaDeviceSynchronize();
if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));
checkElementsAre(7, c, N);
cudaFree(a);
cudaFree(b);
cudaFree(c);
}
Running:
nvcc -o vector-add-no-prefetch 01-vector-add/01-vector-add.cu -run
Profiling:
nsys profile --stats=true -o vector-add-no-prefetch-report ./vector-add-no-prefetch
Sample Prefetching
Changes in main:
int main()
{
int deviceId;
int numberOfSMs;
cudaGetDevice(&deviceId);
cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);
const int N = 2<<24;
size_t size = N * sizeof(float);
float *a;
float *b;
float *c;
cudaMallocManaged(&a, size);
cudaMallocManaged(&b, size);
cudaMallocManaged(&c, size);
initWith(3, a, N);
initWith(4, b, N);
initWith(0, c, N);
cudaMemPrefetchAsync(a, size, deviceId); // prefetch the managed arrays to the GPU before the kernel launch
cudaMemPrefetchAsync(b, size, deviceId);
cudaMemPrefetchAsync(c, size, deviceId);
size_t threadsPerBlock;
size_t numberOfBlocks;
threadsPerBlock = 256;
numberOfBlocks = 32 * numberOfSMs;
cudaError_t addVectorsErr;
cudaError_t asyncErr;
addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);
addVectorsErr = cudaGetLastError();
if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));
asyncErr = cudaDeviceSynchronize();
if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));
checkElementsAre(7, c, N);
cudaFree(a);
cudaFree(b);
cudaFree(c);
}
Kernelized initialization
Changes:
__global__
void initWith(float num, float *a, int N)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for(int i = index; i < N; i += stride)
{
a[i] = num;
}
}
...
int main()
{
int deviceId;
int numberOfSMs;
cudaGetDevice(&deviceId);
cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);
const int N = 2<<24;
size_t size = N * sizeof(float);
float *a;
float *b;
float *c;
cudaMallocManaged(&a, size);
cudaMallocManaged(&b, size);
cudaMallocManaged(&c, size);
cudaMemPrefetchAsync(a, size, deviceId);
cudaMemPrefetchAsync(b, size, deviceId);
cudaMemPrefetchAsync(c, size, deviceId);
size_t threadsPerBlock;
size_t numberOfBlocks;
threadsPerBlock = 256;
numberOfBlocks = 32 * numberOfSMs;
cudaError_t addVectorsErr;
cudaError_t asyncErr;
initWith<<<numberOfBlocks, threadsPerBlock>>>(3, a, N);
initWith<<<numberOfBlocks, threadsPerBlock>>>(4, b, N);
initWith<<<numberOfBlocks, threadsPerBlock>>>(0, c, N);
addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);
addVectorsErr = cudaGetLastError();
if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));
asyncErr = cudaDeviceSynchronize();
if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));
checkElementsAre(7, c, N);
cudaFree(a);
cudaFree(b);
cudaFree(c);
}
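As a follow-up sketch (not part of the course sample), the three initWith launches could be placed in separate non-default streams so they may overlap; the default-stream addVectorsInto launch still waits for them:
cudaStream_t streamA, streamB, streamC;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);
cudaStreamCreate(&streamC);
initWith<<<numberOfBlocks, threadsPerBlock, 0, streamA>>>(3, a, N);
initWith<<<numberOfBlocks, threadsPerBlock, 0, streamB>>>(4, b, N);
initWith<<<numberOfBlocks, threadsPerBlock, 0, streamC>>>(0, c, N);
addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N); // default stream waits for the streams above
cudaStreamDestroy(streamA);
cudaStreamDestroy(streamB);
cudaStreamDestroy(streamC);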