This is an overview of what I’ve been up to for the past 2 weeks. It doesn’t go into much technical detail or the actual code, but just walks through the general idea.
Convolution is a fundamental operation in various domains, such as image processing, signal processing, and deep learning. It is an important module in Gnuastro and is also used as a subroutine in other modules.
Convolutional operations can be broken down into smaller tasks, such as applying the kernel to different portions of the input data. By utilizing multiple threads, each thread can independently process a subset of the input, reducing the overall execution time. This parallelization technique is particularly effective when dealing with large input tensors or performing multiple convolutions simultaneously.
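To make this concrete, here is a minimal, hypothetical sketch (plain C, not Gnuastro’s actual code) of a serial 2D convolution with a 3x3 kernel. Every output pixel is computed independently, which is exactly what makes the work easy to split among threads:

```c
/* Hypothetical serial sketch: convolve a width x height image with a
   3x3 kernel.  Each output pixel depends only on its own neighbourhood,
   so the (x, y) iterations are independent and easy to split among threads. */
static void
convolve_serial(const float *image, const float *kernel, float *out,
                int width, int height)
{
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
      {
        float sum = 0.0f;
        for (int ky = -1; ky <= 1; ++ky)
          for (int kx = -1; kx <= 1; ++kx)
            {
              int ix = x + kx, iy = y + ky;   /* input pixel under the kernel */
              if (ix >= 0 && ix < width && iy >= 0 && iy < height)
                sum += image[iy*width + ix] * kernel[(ky+1)*3 + (kx+1)];
            }
        out[y*width + x] = sum;   /* independent of every other output pixel */
      }
}
```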
While traditional CPUs (Central Processing Units) excel at performing a wide range of tasks, they are not specifically designed for heavy parallel computations like convolutions. On the other hand, GPUs (Graphics Processing Units) are highly optimized for parallel processing, making them ideal for accelerating convolutional operations.
CPUs have fewer, more powerful cores optimized for sequential processing, while GPUs have thousands of smaller cores designed for parallel processing. This parallelism allows GPUs to perform computations on multiple data elements simultaneously, leading to significant speedup in parallelizable tasks like graphics rendering and deep learning.
CPUs typically have larger caches and more advanced memory management units (MMUs), focusing on low-latency operations and complex branch prediction. GPUs prioritize high memory bandwidth and use smaller caches to efficiently handle large amounts of data simultaneously, which is crucial for tasks like image processing and scientific simulations.
CPUs are designed with an emphasis on executing a single thread very fast. GPUs are designed with an emphasis on executing many threads concurrently.
For programming GPUs, several frameworks (high-level APIs) are available. CUDA, the one used here, consists of:

- Programming Language - Based on C, with extensions to write code for the GPU.
- Compiler - Based on clang; it offloads host code to the system compiler and translates device code into binary code that can be executed on the GPU.
- Runtime Library - Provides the necessary functions and tools to manage the execution of code on the GPU (it interacts with the driver).

Note: When a single system has multiple devices (GPUs, FPGAs, etc.) that can execute tasks apart from the main CPU, they’re generally referred to as the device, whereas the main CPU is referred to as the host.
CUDA programs consist of normal host code along with some kernels.

Kernels are like other functions, but when you call a kernel, it is executed N times in parallel by N different CUDA threads, as opposed to only once like a normal function. Kernels are defined using the __global__ keyword.
E.g.:
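The original snippet isn’t reproduced here, but a minimal sketch of such a kernel (following the standard vector-addition example; the names VecAdd, A, B, C are placeholders) could look like this:

```c
/* Hypothetical sketch: each thread handles exactly one element. */
__global__ void VecAdd(const float *A, const float *B, float *C)
{
  int i = threadIdx.x;   /* index of this thread within its block */
  C[i] = A[i] + B[i];    /* one element-wise addition per thread  */
}

/* Launched from host code with one block of N threads:           */
/* VecAdd<<<1, N>>>(A, B, C);                                      */
```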
Normally (on a CPU), we would put the above per-element operation inside a loop, so that all elements are covered.
With GPUs, there’s no need for a loop: for N elements, we launch N threads, each of which handles one element, all at the same time!
Can we launch an arbitrarily large number of threads? Technically, no: threads are grouped into blocks (limited to 1024 threads each on current GPUs), and blocks are organized into a grid.
Blocks and grids can be 1D, 2D, or 3D structures.
When calling a GPU kernel, we specify the dimensions of the grid (the number of blocks) and of each block (the number of threads per block) - this is called the Execution Configuration.
Example:
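The code itself wasn’t carried over, but a hypothetical launch that matches the numbers quoted below (the kernel name myKernel and its argument are placeholders) would be:

```c
#include <cuda_runtime.h>

/* Placeholder kernel, just to illustrate the launch syntax. */
__global__ void myKernel(float *data)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;   /* global column index */
  int y = blockIdx.y * blockDim.y + threadIdx.y;   /* global row index    */
  data[y * 512 + x] = 0.0f;                        /* 512 = 32 x 16 columns */
}

int main(void)
{
  float *d_data;
  cudaMalloc((void **)&d_data, 512 * 512 * sizeof *d_data);

  dim3 numBlocks(32, 32, 1);      /* grid: 32 x 32 x 1 blocks       */
  dim3 threadsPerBlock(16, 16);   /* block: 16 x 16 threads         */
  myKernel<<<numBlocks, threadsPerBlock>>>(d_data);

  cudaDeviceSynchronize();
  cudaFree(d_data);
  return 0;
}
```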
The above code launches:
- 32x32x1 = 1024 blocks,
- each having 16x16 = 256 threads,
- for a total of 1024x256 = 262,144 threads.
CUDA threads may access data from multiple memory spaces during their execution:
- Local memory, private to each thread.
- Shared memory, shared between all threads of the same block.
- Global memory, shared between all blocks.
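A small, hypothetical kernel showing the three memory spaces (the names here are made up for illustration, and it assumes blocks of at most 256 threads):

```c
__global__ void memory_spaces(const float *in, float *out)
{
  /* Global memory: the buffers passed in, visible to all blocks.     */

  __shared__ float tile[256];            /* Shared memory: one copy per
                                            block, visible to all of its
                                            threads.                   */

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float val = in[i];                     /* Local (per-thread) storage:
                                            private to this thread.    */

  tile[threadIdx.x] = val;
  __syncthreads();                       /* make the writes visible to
                                            the whole block            */

  out[i] = tile[threadIdx.x];
}
```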
The entire GPU is divided into several Streaming Multiprocessors (SMs), which have a different architecture than a typical CPU core. Each SM contains several CUDA cores, which are the actual processing units.
SMs are designed with the SIMT/SIMD philosophy, which allows many threads to execute concurrently on them. Each block is scheduled onto a single SM and runs entirely there.
All tests were performed on a system with the following specifications:
CPU :
GPU :
The input image was a 10k x 20k FITS file with 32-bit floating point values. The kernel was a 3x3 matrix with 32-bit floating point values.
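For reference, a naive, hypothetical CUDA version of such a convolution, with one thread per output pixel, might look like the sketch below; this is only an illustration and not the actual implementation used for the benchmark:

```c
/* Hypothetical naive sketch: one thread computes one output pixel of a
   3x3 convolution.  Not the actual benchmarked implementation. */
__global__ void
convolve_gpu(const float *image, const float *kernel, float *out,
             int width, int height)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;   /* guard partial edge blocks */

  float sum = 0.0f;
  for (int ky = -1; ky <= 1; ++ky)
    for (int kx = -1; kx <= 1; ++kx)
      {
        int ix = x + kx, iy = y + ky;
        if (ix >= 0 && ix < width && iy >= 0 && iy < height)
          sum += image[iy*width + ix] * kernel[(ky+1)*3 + (kx+1)];
      }
  out[y*width + x] = sum;
}
```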
The overall speedup seems to be only about 6x, but this also counts the time taken to transfer the data from the CPU to the GPU and back. If we consider only the time taken to perform the convolution itself, the speedup is around 700x!