Introduction to Parallel Programming - Part I
Course from Udacity by NVIDIA, with an accompanying GitHub repo.
1 The GPU Programming Model
GPUs favor throughput over latency, and are optimized for it.
Typical GPU program
- CPU allocates storage on GPU (`cudaMalloc`)
- CPU copies input data to GPU (`cudaMemcpy`)
- CPU launches kernel(s) on GPU to process the data
- CPU copies result back from GPU (`cudaMemcpy`)
💡 Kernels look like serial programs. Write your program as if it will run on one thread; the GPU will run that program on many threads.
GPU is good for
- launching a large number of threads efficiently
- running a large number of threads in parallel
`kernel<<<gridSize, blockSize, shmem>>>(...)`
`kernel<<<dim3(bx, by, bz), dim3(tx, ty, tz)>>>(...)`
The GPU efficiently runs many blocks, and each block has a maximum number of threads.
(Thread/block hierarchy images from Wikipedia's thread block article.)
2 GPU Hardware & Parallel Communication Model
Abstraction & Communication Patterns
Map 1-to-1: tasks read from and write to specific data elements (memory)
map(element, function)
Gather n-to-1: tasks compute where to read data
Scatter 1-to-n: tasks compute where to write data
Stencil n-to-1: tasks read from a fixed neighborhood in an array (data re-use)
Transpose 1-to-1: tasks re-order data elements in memory
Hardware & Model
The GPU is responsible for allocating blocks to SMs (streaming multiprocessors)
- a thread block contains many threads
- an SM may run more than one block
- all threads in one block may cooperate to solve a subproblem, but cannot communicate with threads in other blocks (even on the same SM)
CUDA makes few guarantees about when and where thread blocks will run
- Pros: flexibility -> efficiency and scalability
- Cons: no assumptions about which block runs on which SM; no communication between blocks
CUDA guarantees:
- all threads in a block run on the same SM at the same time
- all blocks in a kernel finish before any blocks from the next kernel run
Synchronization
- Barrier
`__syncthreads()`: wait until all threads arrive, then proceed
- Atomic Ops: only certain ops & data types, still no ordering
Strategies
Maximize arithmetic intensity (math ops / memory ops)
- maximize compute ops per thread
- minimize time spent on memory
- move frequently-accessed data to fast memory: local > shared >> global >> host
- coalesce global memory accesses (coalesced >> strided)
Avoid thread divergence (branches & loops)
💡 How to use shared memory (static & dynamic)
UPDATE: Cooperative Groups in CUDA 9 features
- define groups of threads explicitly at sub-block and multiblock granularities
- enable new patterns such as producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire Grid
- provide an abstraction for scalable code across different GPU architectures
```cpp
#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void cooperative_kernel(...)
{
    // obtain default "current thread block" group
    thread_group my_block = this_thread_block();

    // subdivide into 32-thread, tiled subgroups.
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    thread_group my_tile = tiled_partition(my_block, 32);

    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_block.thread_rank() < 32) {
        …
        my_tile.sync();
    }
}
```
3 & 4 Fundamental GPU Algorithms
Step complexity vs. work complexity (total amount of operations)
Reduce
Input:
- set of elements
- reduction operator, which is:
  - binary
  - associative: (a op b) op c = a op (b op c)
Serial reduce, ((a op b) op c) op d, n-1 steps:
```
a   b
 \ /
  +   c
   \ /
    +   d
     \ /
      +
```
Tree reduce, (a op b) op (c op d), log n steps:
```
a   b   c   d
 \ /     \ /
  +       +
   \     /
    \   /
      +
```
log(n) step complexity. What if we only have p processors but n > p inputs? See Brent's theorem.
Scan
Input:
- set of elements
- binary associative operator
- identity element I, s.t. I op a = a
Two types of scan:
- Exclusive scan - output at position i combines all elements before, but not including, the current element
- Inclusive scan - output at position i combines all elements before and including the current element
| Algorithm | Desc. | Work | Step | Notes |
| --- | --- | --- | --- | --- |
| Serial Scan | - | O(n) | n | |
| Hillis & Steele Scan | Starting with step 0, on step i, op yourself to your 2^i left neighbor (if no such neighbor, copy yourself) | O(n*logn) | logn | step efficient (more processors than work) |
| Blelloch Scan | reduce -> downsweep (paper, wiki) | O(n) | 2*logn | work efficient (more work than processors) |
Sparse matrix/dense vector multiplication (SpMv) & Segmented scan
Histogram
- Accumulate using atomics
- Per-thread local histograms, then reduce
- Sort, then reduce by key
Compact
Compact is most useful when we compact away a large number of elements and the computation on each surviving element is expensive.
- Predicate
- Scan-in array A (0 for false and 1 for true)
- Exclusive-sum-scan on array A -> scatter addresses (dense)
- Scatter using addresses
Allocate
Using scan (good strategy)
- Input: allocation request per input element
- Output: location in array to write your thread's output
Sort
- Odd-even sort - O(n) steps & O(n^2) work
- (Parallel) Merge sort
- Tons of tasks (each task small) - task per thread
- Bunches of tasks (each task medium) - task per block
- One task (big) - split task across SMs
- Sorting network (e.g., bitonic sorter; image from Wikipedia)
- Radix sort - O(kn), where k is the number of bits in the representation (quite brute-force)
- Quick sort
Oblivious - behavior is independent of some aspects of the input (e.g., a sorting network performs the same comparisons regardless of the data)