GPU Hardware

Thread Blocks And GPU Hardware

  • A thread maps to a GPU core, a thread block maps to an SM (Streaming Multiprocessor), and a grid maps to several SMs (a device-query sketch follows the figure below)

  • A thread block contains many threads

  • A thread block must be on one SM; an SM may run more than one block

  • The programmer is responsible for defining blocks

  • The GPU is responsible for allocating thread blocks to hardware SMs

  • The threads run in parallel, and CUDA makes few guarantees about when and where thread blocks will run; thread blocks execute in parallel with no ordering constraints between them

 2024-09-23 191041.png
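
As a minimal sketch (assuming device 0 and the standard CUDA runtime API), the number of SMs that blocks get scheduled onto can be queried at run time:

#include <cstdio>
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // properties of device 0
    printf("SM count: %d\n", prop.multiProcessorCount);
    return 0;
}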

A Thread-Block Programming Example

#include <stdio.h>

__global__ void hello()
{
    printf("Hello world! I'm a thread in block %d\n", blockIdx.x);
}

int main(int argc, char **argv)
{
    // launch the kernel
    hello<<<16, 1>>>();
    // !!! force the printf()s to flush
    cudaDeviceSynchronize();
    printf("That's all!\n");
    return 0;
}
  • Without cudaDeviceSynchronize(), the CPU launches the kernel asynchronously, immediately goes on to print "That's all!", and the host program may exit before the GPU has finished

  • The execution order is not fixed: whichever block finishes first prints first, so there are 16! possible output orderings

#include <stdio.h>

__global__ void hello(float f)
{
    printf("Hello world! I'm thread %d, f=%f\n", threadIdx.x, f);
}

int main(int argc, char **argv)
{
    // launch the kernel
    hello<<<1, 5>>>(1.2345f);
    // force the printf()s to flush
    cudaDeviceSynchronize();
    printf("That's all!\n");
    return 0;
}
  • In the output, the execution order differs between warps, but every thread within a warp executes the same code and finishes at essentially the same time

  • Threads, thread blocks, and warps can all be regarded as running in parallel

What does CUDA guarantee

  • CUDA makes few guarantees about when and where thread blocks will run; different blocks / threads run in parallel

  • All threads in a block run on the same SM at the same time

  • All blocks in a kernel finish before any blocks from the next kernel run
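
A minimal sketch of the last guarantee on the default stream (the kernel names stepA/stepB and the buffer d_array are illustrative assumptions, not from the lecture):

stepA<<<64, 256>>>(d_array);   // all blocks of stepA complete first...
stepB<<<64, 256>>>(d_array);   // ...before any block of stepB starts, so stepB may read stepA's results
cudaDeviceSynchronize();       // the host waits here for both kernels to finish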

SIMT: Single-Instruction, Multiple-Thread

  • The multiprocessor of GPU creates, manages, schedules, and executes threads in groups of 32 parallel threads, called warps

  • A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path

  • If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths

Thread Divergence

 2024-09-23 192811.png

 2024-09-23 192825.png

  • Parallel programming tip: minimize data-dependent branches and for loops to reduce divergence (see the sketch below)
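
A minimal divergence sketch (a hypothetical kernel, not from the lecture): threads within one warp take different branches, so the warp executes the two paths one after the other with part of its threads disabled each time.

__global__ void divergent(float* data)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        data[i] = 2.0f * data[i];    // even lanes run this while odd lanes are disabled
    else
        data[i] = data[i] + 1.0f;    // then odd lanes run this while even lanes are disabled
}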

Synchronization and Parallel Patterns

Synchronization and Parallel Patterns

  • Threads can access each other's results through global and shared memory

  • When multiple threads write (modify) the same data, synchronization is required

  • warning:

    1. A thread reads a result before another thread writes
    2. Multiple threads write into the same memory location (e.g., sum)
  • Threads need to synchronize to collaborate

  • Barrier: A point in the program where threads stop and wait until all threads have reached the barrier; then threads proceed

Use Barriers to Avoid Data Races

  • When multiple threads write concurrently, one thread may have already modified a value by the time a later thread reads it
const int N = 128;

__global__ void shift_sum(float* array)
{
    // do the "shift and sum": each element adds its right neighbor
    int idx = threadIdx.x;
    if (idx < N-1)
    {
        array[idx] = array[idx] + array[idx+1];
    }
}

// . . .
shift_sum<<<1, N>>>(array);
  • Using a barrier to resolve the race:
__global__ void shift_sum(float* array)
{
    // do the "shift and sum"
    int idx = threadIdx.x;
    float tmp = 0.0f;
    if (idx < N-1)
    {
        tmp = array[idx] + array[idx+1];
    }
    // every thread in the block must reach the barrier,
    // so it must not sit inside the divergent branch
    __syncthreads();
    if (idx < N-1)
    {
        array[idx] = tmp;
    }
}
  • Use shared memory to reduce accesses to global memory
__global__ void shift_sum(float* array)
{
    // shared memory can be accessed
    // by all threads in the block.
    __shared__ float shared[N];

    // fill the shared memory
    int idx = threadIdx.x;
    shared[idx] = array[idx];
    __syncthreads();
    // the lines above are a template that can be reused directly;
    // always synchronize before operating on the shared data

    // do the "shift and sum"
    if (idx < N-1)
    {
        array[idx] = shared[idx] + shared[idx+1];
    }

    // the following code has NO EFFECT:
    // results are only visible if written back to global memory
    // (shared memory is discarded when the block finishes)
    shared[idx] = 3.14f;
}

Another Example on Thread Collaboration

  • Lots of threads reading and writing the same memory locations; for example, in the code below, 10^6 threads increment a 100-element array
const int kNumThreads = 1000000;
const int kArraySize  = 100;
const int kBlockWidth = 1000;

__global__ void increment_naive(int *g)
{
    // thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread to increment consecutive elements
    i = i % kArraySize;
    g[i] = g[i] + 1;
}

increment_naive<<<kNumThreads / kBlockWidth, kBlockWidth>>>(d_array);

 2024-09-23 202433.png

  • Many threads, a small array, repeated writes: __syncthreads() does not help here, because it only synchronizes threads within one block

  • We have to use atomicAdd() instead

const int kNumThreads = 1000000;
const int kArraySize  = 100;
const int kBlockWidth = 1000;

__global__ void increment_atomic(int *g)
{
    // thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread to increment consecutive elements
    i = i % kArraySize;
    atomicAdd(&g[i], 1);
}

increment_atomic<<<kNumThreads / kBlockWidth, kBlockWidth>>>(d_array);

Atomic Memory Operations

  • Atomic operation: only one thread at a time can modify the memory location

  • Atomic memory operations: perform read-modify-write operations on a memory location in a thread-safe manner

  • atomicCAS (compare and swap): general atomic operations can be constructed on top of atomicCAS (see the sketch after this list)

  • Limitations of Atomic Memory Operations:

    1. The results are not fully reproducible (floating-point results can fluctuate from run to run because the order of operations changes)
    2. Greatly slow down the program
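
As a sketch of building a general atomic operation from atomicCAS, here is the well-known retry loop for an atomic add on double (essentially the recipe in the CUDA programming guide; the helper name atomicAddDouble is our own):

__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // swap in (assumed + val) only if the location still holds `assumed`
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // another thread intervened: retry
    return __longlong_as_double(old);
}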

Measure Speed

Measure Speed On CPU

  • A CPU timer cannot directly time a CUDA kernel, because CPU and GPU execution are asynchronous (the kernel launch returns before the kernel finishes)
// A naïve method for measuring time on CPU
#include <ctime>
std::clock_t start = std::clock();
////////////////////////////
// put your C++ code here //
////////////////////////////
std::clock_t end = std::clock();
double time_elapsed = double(end - start) / CLOCKS_PER_SEC;   // seconds
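
A possible alternative sketch using std::chrono (measures wall-clock time rather than the process CPU time that std::clock reports):

#include <chrono>
auto t0 = std::chrono::high_resolution_clock::now();
////////////////////////////
// put your C++ code here //
////////////////////////////
auto t1 = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();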

Measure Speed On GPU

  • Use the CUDA event APIs; a small C++ class can wrap them to simplify usage (a possible wrapper is sketched after the code below)
// Measure time on GPU
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
///////////////////////////////
// put your CUDA kernel here //
///////////////////////////////
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsed;
cudaEventElapsedTime(&elapsed, start, stop);   // milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);
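
A possible wrapper class, sketched here as one way to simplify the calls above (the name GpuTimer and its interface are our own assumptions, not the course's code):

struct GpuTimer
{
    cudaEvent_t start_, stop_;
    GpuTimer()  { cudaEventCreate(&start_); cudaEventCreate(&stop_); }
    ~GpuTimer() { cudaEventDestroy(start_); cudaEventDestroy(stop_); }
    void start() { cudaEventRecord(start_, 0); }
    void stop()  { cudaEventRecord(stop_, 0); }
    float elapsed_ms()
    {
        cudaEventSynchronize(stop_);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start_, stop_);
        return ms;
    }
};
// usage: GpuTimer t; t.start(); my_kernel<<<grid, block>>>(...); t.stop(); float ms = t.elapsed_ms();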

PyTorch Wrapper around the CUDA Event

import torch
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
# whatever you are timing goes here
end.record()
# Waits for everything to finish running
torch.cuda.synchronize()
print(start.elapsed_time(end))

Communication Patterns

Map: One-to-One

  • Read from and write to specific memory locations; after reading, the mapping is one-to-one, e.g., an activation function (a kernel sketch follows the figure below)

 2024-09-23 203531.png
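
A minimal map sketch (a hypothetical element-wise ReLU kernel; the names are our own):

__global__ void relu_map(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaxf(in[i], 0.0f);   // one-to-one: input i -> output i
}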

Gather: Many-to-One

  • Read then write, many-to-one: each output gathers several inputs, e.g., convolution is a 2D gather (second figure); a 1D sketch follows the figures

 2024-09-23 203641.png

 2024-09-23 203659.png
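
A minimal gather sketch (hypothetical kernel; idx holds, for each output position, the input position to read from, so several outputs may read the same input):

__global__ void gather(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];   // each thread writes its own slot but may read anywhere
}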

Scatter: One-to-Many

  • Scatter is the reverse of Gather and often appears paired with it (a sketch follows the figures)

 2024-09-23 203811.png

 2024-09-23 203822.png
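
A minimal scatter sketch (hypothetical kernel; because several threads may target the same output location, the write uses atomicAdd):

__global__ void scatter_add(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&out[idx[i]], in[i]);   // each thread reads its own slot but may write anywhere
}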

Stencil

  • Read input from a fixed neighborhood in an array

  • Stencil is a special kind of gather

  • The backward pass of stencil is a special kind of scatter

 2024-09-23 204001.png
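
A minimal 1D three-point stencil sketch (hypothetical kernel): every output reads the same fixed neighborhood of the input.

__global__ void stencil3(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0f;   // fixed neighborhood {i-1, i, i+1}
}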

Transpose

  • Transpose: a special kind of map, still a one-to-one mapping, but each element moves to a different location

 2024-09-23 204200.png

 2024-09-23 204217.png

  • For example, a 4D tensor can be reordered from N*(HWC) layout to N*(CHW) layout; a 2D matrix-transpose sketch follows
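
A minimal (naive) 2D transpose sketch (hypothetical kernel; in is rows×cols in row-major order, out is cols×rows):

__global__ void transpose_naive(const float* in, float* out, int rows, int cols)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in the input
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        out[x * rows + y] = in[y * cols + x];        // one-to-one, but to a different location
}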