Introduction to Parallel Programming - Part I
Course from Udacity by NVIDIA, with an accompanying GitHub repo.
1 The GPU Programming Model
GPUs favor throughput over latency, and are optimized for it.
Typical GPU program
- CPU allocates storage on GPU (`cudaMalloc`)
- CPU copies input data to GPU (`cudaMemcpy`)
- CPU launches kernel(s) on GPU to process the data
- CPU copies result back from GPU (`cudaMemcpy`)
💡 Kernels look like serial programs. Write your program as if it will run on one thread; the GPU will run that program on many threads.
GPU is good for
- launching a large number of threads efficiently
- running a large number of threads in parallel
`kernel<<<gridSize, blockSize, shmem>>>(...)`
`kernel<<<dim3(bx, by, bz), dim3(tx, ty, tz)>>>(...)`
The GPU efficiently runs many blocks, and each block has a maximum number of threads.
(Thread/block hierarchy images from Wikipedia's thread block article.)
2 GPU Hardware & Parallel Communication Model
Abstraction & Communication Patterns
Map 1-to-1: tasks read from and write to specific data elements (memory)
map(element, function)
Gather n-to-1: tasks compute where to read data
Scatter 1-to-n: tasks compute where to write data
Stencil n-to-1: tasks read from a fixed neighborhood in an array (data re-use)
Transpose 1-to-1: tasks re-order data elements in memory
Hardware & Model
The GPU is responsible for allocating blocks to SMs (streaming multiprocessors)
- a thread block contains many threads
- an SM may run more than one block
- all threads in one block may cooperate to solve a subproblem, but cannot communicate with threads in other blocks (even on the same SM)
CUDA makes few guarantees about when and where thread blocks will run
- Pros: flexibility -> efficiency and scalability
- Cons: no assumptions about which block runs on which SM; no communication between blocks
CUDA guarantees:
- all threads in a block run on the same SM at the same time
- all blocks in a kernel finish before any blocks from the next kernel run
Synchronization
- Barrier
`__syncthreads()`: wait until all threads arrive, then proceed
- Atomic Ops: only certain ops & data types, still no ordering
Strategies
Maximize arithmetic intensity (math ops / memory ops)
- maximize compute ops per thread
- minimize time spent on memory
- move frequently-accessed data to fast memory: local > shared >> global >> host
- coalesce global memory accesses (coalesced >> strided)
Avoid thread divergence (branches & loops)
💡 How to use shared memory (static & dynamic)
UPDATE: Cooperative Groups in CUDA 9 features
- define groups of threads explicitly at sub-block and multiblock granularities
- enable new patterns such as producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire Grid
- provide an abstraction for scalable code across different GPU architectures
```cpp
#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void cooperative_kernel(...)
{
    // obtain default "current thread block" group
    thread_group my_block = this_thread_block();

    // subdivide into 32-thread, tiled subgroups.
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    thread_group my_tile = tiled_partition(my_block, 32);

    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_block.thread_rank() < 32) {
        …
        my_tile.sync();
    }
}
```
3 & 4 Fundamental GPU Algorithms
Step complexity vs. work complexity (total amount of operations)
Reduce
Input:
- set of elements
- reduction operator, which is:
  - binary
  - associative: (a op b) op c = a op (b op c)
Serial reduce, ((a op b) op c) op d, n-1 steps:
```
a   b
 \ /
  +   c
   \ /
    +   d
     \ /
      +
```
Tree reduce, (a op b) op (c op d), log n steps:
```
a   b   c   d
 \ /     \ /
  +       +
   \     /
    \   /
      +
```
log(n) step complexity. What if we only have p processors but n > p inputs? See Brent's theorem.
Scan
Input:
- set of elements
- binary associative operator
- identity element I, s.t. I op a = a
Two types of scan:
- Exclusive scan - output at position i combines all elements before, but not including, the current element
- Inclusive scan - output at position i combines all elements before and including the current element
| Algorithm | Desc. | Work | Step | Notes |
| --- | --- | --- | --- | --- |
| Serial Scan | - | O(n) | n | |
| Hillis & Steele Scan | Starting with step 0, on step i, op yourself to your 2^i left neighbor (if no such neighbor, copy yourself) | O(n*logn) | logn | step efficient (more processors than work) |
| Blelloch Scan | reduce -> downsweep (paper, wiki) | O(n) | 2*logn | work efficient (more work than processors) |
Sparse matrix/dense vector multiplication (SpMv) & Segmented scan
Histogram
- Accumulate using atomics
- Per-thread local histograms, then reduce
- Sort, then reduce by key
Compact
Compact is most useful when we compact away a large number of elements and the computation on each surviving element is expensive.
- Predicate
- Scan-in array A (0 for false and 1 for true)
- Exclusive-sum-scan on array A -> scatter addresses (dense)
- Scatter using addresses
Allocate
Using scan (good strategy)
- Input: allocation request per input element
- Output: location in array to write your thread's output
Sort
- Odd-even sort - O(n) steps & O(n^2) work
- (Parallel) Merge sort
- Tons of tasks (each task small) - task per thread
- Bunches of tasks (each task medium) - task per block
- One task (big) - split task across SMs
- Sorting network (e.g., bitonic sorter; image from Wikipedia)
- Radix sort - O(kn), where k is the number of bits in the representation (quite brute-force)
- Quick sort
Oblivious - behavior is independent of some aspects of the input (e.g., a sorting network performs the same comparisons regardless of the data)