GPU Hardware

Thread Blocks And GPU Hardware

  • A thread maps to a GPU core, a thread block maps to an SM (Streaming Multiprocessor), and a grid maps to several SMs (a device-query sketch follows the figure below)

  • A thread block contains many threads

  • A thread block must be on one SM; an SM may run more than one block

  • The programmer is responsible for defining blocks

  • The GPU is responsible for allocating thread blocks to hardware SMs

  • The threads run in parallel, and CUDA makes few guarantees about when and where thread blocks will run; thread blocks execute in parallel with no ordering constraints between them

 2024-09-23 191041.png
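
As a minimal sketch (assuming device 0 and the standard CUDA runtime API), the number of SMs that blocks get scheduled onto can be queried at run time:

#include <cstdio>
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // properties of device 0
    printf("SM count: %d\n", prop.multiProcessorCount);
    return 0;
}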

A Thread-Block Programming Example

#include <stdio.h>

__global__ void hello()
{
    printf("Hello world! I'm a thread in block %d\n", blockIdx.x);
}

int main(int argc, char **argv)
{
    // launch the kernel
    hello<<<16, 1>>>();
    // !!! force the printf()s to flush
    cudaDeviceSynchronize();
    printf("That's all!\n");
    return 0;
}
  • Without cudaDeviceSynchronize(), the CPU launches the kernel asynchronously, immediately goes on to print "That's all!", and the host program may exit before the GPU has finished

  • The execution order is not fixed: whichever block finishes first prints first, so there are 16! possible output orderings

#include <stdio.h>

__global__ void hello(float f)
{
    printf("Hello world! I'm thread %d, f=%f\n", threadIdx.x, f);
}

int main(int argc, char **argv)
{
    // launch the kernel
    hello<<<1, 5>>>(1.2345f);
    // force the printf()s to flush
    cudaDeviceSynchronize();
    printf("That's all!\n");
    return 0;
}
  • In the output, the execution order differs between warps, but every thread within a warp executes the same code and finishes at essentially the same time

  • Threads, thread blocks, and warps can all be regarded as running in parallel

What does CUDA guarantee

  • CUDA makes few guarantees about when and where thread blocks will run; different blocks / threads run in parallel

  • All threads in a block run on the same SM at the same time

  • All blocks in a kernel finish before any blocks from the next kernel run
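
A minimal sketch of the last guarantee on the default stream (the kernel names stepA/stepB and the buffer d_array are illustrative assumptions, not from the lecture):

stepA<<<64, 256>>>(d_array);   // all blocks of stepA complete first...
stepB<<<64, 256>>>(d_array);   // ...before any block of stepB starts, so stepB may read stepA's results
cudaDeviceSynchronize();       // the host waits here for both kernels to finish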

SIMT: Single-Instruction, Multiple-Thread

  • The multiprocessor of GPU creates, manages, schedules, and executes threads in groups of 32 parallel threads, called warps

  • A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path

  • If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths

Thread Divergence

 2024-09-23 192811.png

 2024-09-23 192825.png

  • Parallel programming tip: minimize data-dependent branches and for loops to reduce divergence (see the sketch below)
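
A minimal divergence sketch (a hypothetical kernel, not from the lecture): threads within one warp take different branches, so the warp executes the two paths one after the other with part of its threads disabled each time.

__global__ void divergent(float* data)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        data[i] = 2.0f * data[i];    // even lanes run this while odd lanes are disabled
    else
        data[i] = data[i] + 1.0f;    // then odd lanes run this while even lanes are disabled
}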

Synchronization and Parallel Patterns

Synchronization and Parallel Patterns

  • Threads can access each other's results through global and shared memory

  • When multiple threads write (modify) the same data, synchronization is required

  • warning:

    1. A thread reads a result before another thread writes
    2. Multiple threads write into the same memory location (e.g., sum)
  • Threads need to synchronize to collaborate

  • Barrier: A point in the program where threads stop and wait until all threads have reached the barrier; then threads proceed

Use Barriers to Avoid Data Races

  • When multiple threads write concurrently, one thread may have already modified a value by the time a later thread reads it
const int N = 128;

__global__ void shift_sum(float* array)
{
    // do the "shift and sum": each element adds its right neighbor
    int idx = threadIdx.x;
    if (idx < N-1)
    {
        array[idx] = array[idx] + array[idx+1];
    }
}

// . . .
shift_sum<<<1, N>>>(array);
  • Using a barrier to resolve the race:
__global__ void shift_sum(float* array)
{
    // do the "shift and sum"
    int idx = threadIdx.x;
    float tmp = 0.0f;
    if (idx < N-1)
    {
        tmp = array[idx] + array[idx+1];
    }
    // every thread in the block must reach the barrier,
    // so it must not sit inside the divergent branch
    __syncthreads();
    if (idx < N-1)
    {
        array[idx] = tmp;
    }
}
  • Use shared memory to reduce accesses to global memory
__global__ void shift_sum(float* array)
{
    // shared memory can be accessed
    // by all threads in the block.
    __shared__ float shared[N];

    // fill the shared memory
    int idx = threadIdx.x;
    shared[idx] = array[idx];
    __syncthreads();
    // the lines above are a template that can be reused directly;
    // always synchronize before operating on the shared data

    // do the "shift and sum"
    if (idx < N-1)
    {
        array[idx] = shared[idx] + shared[idx+1];
    }

    // the following code has NO EFFECT:
    // results are only visible if written back to global memory
    // (shared memory is discarded when the block finishes)
    shared[idx] = 3.14f;
}

Another Example on Thread Collaboration

  • Lots of threads reading and writing the same memory locations; for example, in the code below, 10^6 threads increment a 100-element array
const int kNumThreads = 1000000;
const int kArraySize  = 100;
const int kBlockWidth = 1000;

__global__ void increment_naive(int *g)
{
    // thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread to increment consecutive elements
    i = i % kArraySize;
    g[i] = g[i] + 1;
}

increment_naive<<<kNumThreads / kBlockWidth, kBlockWidth>>>(d_array);

 2024-09-23 202433.png

  • Many threads, a small array, repeated writes: __syncthreads() does not help here, because it only synchronizes threads within one block

  • We have to use atomicAdd() instead

const int kNumThreads = 1000000;
const int kArraySize  = 100;
const int kBlockWidth = 1000;

__global__ void increment_atomic(int *g)
{
    // thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread to increment consecutive elements
    i = i % kArraySize;
    atomicAdd(&g[i], 1);
}

increment_atomic<<<kNumThreads / kBlockWidth, kBlockWidth>>>(d_array);

Atomic Memory Operations

  • Atomic operation: only one thread at a time can modify the memory location

  • Atomic memory operations: perform read-modify-write operations on a memory location in a thread-safe manner

  • atomicCAS (compare and swap): general atomic operations can be constructed on top of atomicCAS (see the sketch after this list)

  • Limitations of Atomic Memory Operations:

    1. The results are not fully reproducible (floating-point results can fluctuate from run to run because the order of operations changes)
    2. Greatly slow down the program
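
As a sketch of building a general atomic operation from atomicCAS, here is the well-known retry loop for an atomic add on double (essentially the recipe in the CUDA programming guide; the helper name atomicAddDouble is our own):

__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // swap in (assumed + val) only if the location still holds `assumed`
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // another thread intervened: retry
    return __longlong_as_double(old);
}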

Measure Speed

Measure Speed On CPU

  • A CPU timer cannot directly time a CUDA kernel, because CPU and GPU execution are asynchronous (the kernel launch returns before the kernel finishes)
// A naïve method for measuring time on CPU
#include <ctime>
std::clock_t start = std::clock();
////////////////////////////
// put your C++ code here //
////////////////////////////
std::clock_t end = std::clock();
double time_elapsed = double(end - start) / CLOCKS_PER_SEC;   // seconds
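
A possible alternative sketch using std::chrono (measures wall-clock time rather than the process CPU time that std::clock reports):

#include <chrono>
auto t0 = std::chrono::high_resolution_clock::now();
////////////////////////////
// put your C++ code here //
////////////////////////////
auto t1 = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();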

Measure Speed On GPU

  • Use the CUDA event APIs; a small C++ class can wrap them to simplify usage (a possible wrapper is sketched after the code below)
// Measure time on GPU
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
///////////////////////////////
// put your CUDA kernel here //
///////////////////////////////
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsed;
cudaEventElapsedTime(&elapsed, start, stop);   // milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);
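
A possible wrapper class, sketched here as one way to simplify the calls above (the name GpuTimer and its interface are our own assumptions, not the course's code):

struct GpuTimer
{
    cudaEvent_t start_, stop_;
    GpuTimer()  { cudaEventCreate(&start_); cudaEventCreate(&stop_); }
    ~GpuTimer() { cudaEventDestroy(start_); cudaEventDestroy(stop_); }
    void start() { cudaEventRecord(start_, 0); }
    void stop()  { cudaEventRecord(stop_, 0); }
    float elapsed_ms()
    {
        cudaEventSynchronize(stop_);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start_, stop_);
        return ms;
    }
};
// usage: GpuTimer t; t.start(); my_kernel<<<grid, block>>>(...); t.stop(); float ms = t.elapsed_ms();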

PyTorch Wrapper around the CUDA Event

import torch
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
# whatever you are timing goes here
end.record()
# Waits for everything to finish running
torch.cuda.synchronize()
print(start.elapsed_time(end))

Communication Patterns

Map: One-to-One

  • Read from and write to specific memory locations; after reading, the mapping is one-to-one, e.g., an activation function (a kernel sketch follows the figure below)

 2024-09-23 203531.png
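
A minimal map sketch (a hypothetical element-wise ReLU kernel; the names are our own):

__global__ void relu_map(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaxf(in[i], 0.0f);   // one-to-one: input i -> output i
}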

Gather: Many-to-One

  • Read then write, many-to-one: each output gathers several inputs, e.g., convolution is a 2D gather (second figure); a 1D sketch follows the figures

 2024-09-23 203641.png

 2024-09-23 203659.png
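
A minimal gather sketch (hypothetical kernel; idx holds, for each output position, the input position to read from, so several outputs may read the same input):

__global__ void gather(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];   // each thread writes its own slot but may read anywhere
}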

Scatter: One-to-Many

  • Scatter is the reverse of Gather and often appears paired with it (a sketch follows the figures)

 2024-09-23 203811.png

 2024-09-23 203822.png
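
A minimal scatter sketch (hypothetical kernel; because several threads may target the same output location, the write uses atomicAdd):

__global__ void scatter_add(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&out[idx[i]], in[i]);   // each thread reads its own slot but may write anywhere
}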

Stencil

  • Read input from a fixed neighborhood in an array

  • Stencil is a special kind of gather

  • The backward pass of stencil is a special kind of scatter

 2024-09-23 204001.png
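
A minimal 1D three-point stencil sketch (hypothetical kernel): every output reads the same fixed neighborhood of the input.

__global__ void stencil3(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0f;   // fixed neighborhood {i-1, i, i+1}
}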

Transpose

  • Transpose: a special kind of map, still a one-to-one mapping, but each element moves to a different location

 2024-09-23 204200.png

 2024-09-23 204217.png

  • For example, a 4D tensor can be reordered from N*(HWC) layout to N*(CHW) layout; a 2D matrix-transpose sketch follows
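
A minimal (naive) 2D transpose sketch (hypothetical kernel; in is rows×cols in row-major order, out is cols×rows):

__global__ void transpose_naive(const float* in, float* out, int rows, int cols)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in the input
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        out[x * rows + y] = in[y * cols + x];        // one-to-one, but to a different location
}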