Introduction to Parallel Programming :

CPU vs GPU :

  • CPU: complex control circuitry

    • good: flexibility + performance
    • bad: expensive in terms of power
  • GPU: simple control circuitry

    • good: more circuitry devoted to parallel computation
    • good: more power efficient
    • bad: less flexibility, more restrictive programming models
  • The CPU is optimized for latency, the GPU is optimized for throughput

CUDA is Written in C with Extensions :

  • The CUDA compiler compiles a CUDA program into two parts: one part runs on the CPU and the other on the GPU

    1. step 1: cudaMalloc (allocate memory on the GPU for storage and computation)
    2. step 2: copy data from CPU to GPU
    3. step 3: launch kernels on the GPU
    4. step 4: copy results from GPU back to CPU
  • Memory on the CPU side is called host (main) memory, while memory on the GPU side is called device (video) memory

A Typical CUDA Program :

  • The CPU and GPU run asynchronously (see the sketch after this list)

  • What CPU does

    1. CPU allocates a block of memory on GPU – cudaMalloc
    2. CPU copies data from CPU to GPU – cudaMemcpy
    3. CPU initiates launching kernels on GPU – Launch kernels <<<>>>
    4. CPU copies results back from GPU to CPU – cudaMemcpy
  • What GPU does

    1. The GPU efficiently launches a large number of kernel instances (the instances run in parallel with each other, while the code inside each instance executes sequentially)
    2. GPU runs kernels in parallel
    3. A kernel looks like a serial C program for a thread
    4. The GPU will run the kernel for many threads in parallel
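
The asynchrony between CPU and GPU can be seen at the launch site. A minimal sketch (the kernel name my_kernel, the pointer d_data, and n are illustrative placeholders, not part of the original example): the launch call returns to the CPU immediately, and the CPU has to synchronize before it may rely on the results.

// hypothetical kernel used only to illustrate asynchrony
__global__ void my_kernel(float* data)
{
    int i = threadIdx.x;
    data[i] = data[i] * 2.0f;   // some per-element work
}

my_kernel<<<1, n>>>(d_data);    // the launch returns immediately; the GPU works in the background
// ... the CPU is free to do other work here ...
cudaDeviceSynchronize();        // block the CPU until all launched kernels have finished
// (a cudaMemcpy on the default stream also waits for preceding kernels before copying)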

The First CUDA Program :

Review: Implement LeNet

import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

The first CUDA program

  • Let’s implement torch.nn.functional.relu
    // ReLU on CPU
    float relu_cpu(float x)
    {
        return x > 0 ? x : 0;
    }

    for (int i = 0; i < N; ++i)
    {
        h_out[i] = relu_cpu(h_in[i]);  // host-side (CPU) pointers are prefixed with h_
    }
// ReLU on GPU. Define a kernel with "__global__"; launch kernels with <<<, >>>
__global__ void relu_gpu(float* in, float* out)
{
    int i = threadIdx.x;             // index of this thread (which instance of the kernel this is)
    out[i] = in[i] > 0 ? in[i] : 0;  // the ReLU computation
}

// the kernel is declared with the __global__ keyword
relu_gpu<<<1, N>>>(d_in, d_out);
// <<<1, N>>> controls how many threads run the kernel
const int N = 64;
const int size = N * sizeof(float);
// allocate memory on CPU
float* h_in = (float*) malloc(size);
float* h_out = (float*) malloc(size);
// initialize input array
for (int i = 0; i < N; ++i) {
    h_in[i] = (i - 32) * 0.1;
}
// relu on CPU
for (int i = 0; i < N; ++i) {
    h_out[i] = relu_cpu(h_in[i]);
}
// free memory ...

// 1. allocate memory on GPU
float* d_in = nullptr;
float* d_out = nullptr;
cudaMalloc(&d_in, size);
cudaMalloc(&d_out, size);
// 2. copy data from CPU to GPU
cudaMemcpy(d_in, h_in, size,
           cudaMemcpyHostToDevice);
// 3. launch the kernel
relu_gpu<<<1, N>>>(d_in, d_out);
// 4. copy data from GPU to CPU
cudaMemcpy(h_out, d_out, size,
           cudaMemcpyDeviceToHost);
// free memory ...

Configure the Kernel Launch

  • Work on the GPU is organized into grids; how a grid is scheduled onto the hardware is handled automatically by the system

  • A grid contains many blocks, and each block contains many threads; the numbers of blocks and threads are specified in code

  • Kernel <<< number of blocks, number of threads per block >>> (…)

  • Launch many blocks at once

  • Maximum number of threads per block: 256 / 512 / 1024 (1024 is less common)

  • Both the number of blocks and the number of threads per block can be multi-dimensional: relu_gpu<<<1, N>>>(…) → relu_gpu <<< dim3(1, 1, 1), dim3(N, 1, 1) >>> (…)

    // How many threads in total ?
    relu_gpu<<<dim3(4, 6, 8), dim3(16, 16)>>>(d_in, d_out);
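    // Answer: (4 * 6 * 8) blocks * (16 * 16) threads per block = 192 * 256 = 49,152 threads in total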
    __global__ void relu_gpu(float* in, float* out)
    {
        int i = threadIdx.x;
        out[i] = in[i] > 0 ? in[i] : 0;
    }

    // Built-in index variables available inside a kernel
    int i = threadIdx.x;  // thread index within the block (x dimension)
    int j = threadIdx.y;  // thread index within the block (y dimension)
    int m = blockIdx.x;   // block index within the grid (x dimension)
    int n = blockIdx.y;   // block index within the grid (y dimension)
    int w = blockDim.x;   // number of threads per block (x dimension)
    int k = gridDim.x;    // number of blocks in the grid (x dimension)

    // Use 512 or 256 threads per block
    const int kCudaThreadsNum = 512;
    inline int CudaGetBlocks(const int N)
    {
        // round up: enough blocks to cover all N elements
        return (N + kCudaThreadsNum - 1) / kCudaThreadsNum;
    }
    // the helpers above are typically used verbatim in real programs

    // Define the grid-stride loop
    // A macro for parallel looping inside CUDA kernels
    // i is the loop variable; n is the length of the input array, i.e. the total number of elements to process
    // The loop structure ensures that work is distributed correctly across threads
    // blockIdx.x is the index of the current block
    // blockDim.x is the number of threads per block; threadIdx.x is the index of the current thread within its block
    // Together these indices and sizes give each thread a globally unique ID
    #define CUDA_KERNEL_LOOP(i, n) for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); i += blockDim.x * gridDim.x)

    // Define the CUDA kernel that performs ReLU on the GPU
    // The __global__ keyword marks this as a GPU kernel that runs in parallel on the device
    // in is the input array, out is the output array, n is the array length
    // Each thread processes a subset of the array elements
    __global__ void relu_gpu(float* in, float* out, int n)
    {
        // CUDA_KERNEL_LOOP uses the macro defined above for the parallel loop.
        // Through CUDA_KERNEL_LOOP, each thread handles a different set of elements.
        CUDA_KERNEL_LOOP(i, n)
        {
            out[i] = in[i] > 0 ? in[i] : 0;
        }
    }

    // Launch the relu_gpu kernel
    // <<<CudaGetBlocks(N), kCudaThreadsNum>>> specifies the kernel's execution configuration
    // CudaGetBlocks(N) gives the number of blocks to launch; kCudaThreadsNum is the number of threads per block
    // d_in is the input array on the device, d_out is the output array on the device, N is the array length
    // With this configuration, each thread processes part of d_in and d_out, so the work is parallelized
    relu_gpu <<<CudaGetBlocks(N), kCudaThreadsNum>>> (d_in, d_out, N);
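    // Worked example (assuming a hypothetical N = 100000):
    //   CudaGetBlocks(100000) = ceil(100000 / 512) = 196 blocks, i.e. 196 * 512 = 100352 threads.
    //   The i < n test in CUDA_KERNEL_LOOP makes the 352 surplus threads do nothing; if fewer threads
    //   than elements were launched, the stride blockDim.x * gridDim.x would let each thread handle several elements.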

GPU Memory and Hardware :

Implement a class Tensor to circumvent this issue (managing raw cudaMalloc / cudaMemcpy / free calls by hand); a minimal sketch follows the code below.

// allocate memory in the constructor
// free memory in the destructor
Tensor tensor(shape, device);
// device decides whether the data lives on the CPU or the GPU

// copy data between cpu and gpu if needed
Tensor c = tensor.cpu();
Tensor g = tensor.gpu();
// note: these should be deep copies

// 1. allocate memory on GPU
float* d_in = nullptr;
float* d_out = nullptr;
cudaMalloc(&d_in, size);
cudaMalloc(&d_out, size);
// 2. copy data from CPU to GPU
cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);
// 3. launch the kernel
relu_gpu<<<1, N>>>(d_in, d_out);
// 4. copy data from GPU to CPU
cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
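
A minimal sketch of such a Tensor class, assuming a flat element count instead of a full shape and float data only; all names here (Device, data_, size_, bytes) are illustrative, not a fixed API:

#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

enum class Device { CPU, GPU };

class Tensor {
public:
    // allocate memory in the constructor, on the requested device
    Tensor(size_t size, Device device) : size_(size), device_(device) {
        if (device_ == Device::GPU) cudaMalloc(&data_, bytes());
        else                        data_ = (float*) std::malloc(bytes());
    }

    // free memory in the destructor
    ~Tensor() {
        if (device_ == Device::GPU) cudaFree(data_);
        else                        std::free(data_);
    }

    // movable but not copyable, so returning by value cannot double-free
    Tensor(Tensor&& o) noexcept : data_(o.data_), size_(o.size_), device_(o.device_) { o.data_ = nullptr; }
    Tensor(const Tensor&) = delete;
    Tensor& operator=(const Tensor&) = delete;

    // deep copy to CPU memory
    Tensor cpu() const {
        Tensor t(size_, Device::CPU);
        if (device_ == Device::GPU) cudaMemcpy(t.data_, data_, bytes(), cudaMemcpyDeviceToHost);
        else                        std::memcpy(t.data_, data_, bytes());
        return t;
    }

    // deep copy to GPU memory
    Tensor gpu() const {
        Tensor t(size_, Device::GPU);
        cudaMemcpy(t.data_, data_, bytes(),
                   device_ == Device::GPU ? cudaMemcpyDeviceToDevice : cudaMemcpyHostToDevice);
        return t;
    }

    size_t bytes() const { return size_ * sizeof(float); }

    float* data_ = nullptr;
    size_t size_;
    Device device_;
};

The constructor/destructor pair gives RAII semantics, and deleting the copy constructor while providing a move constructor keeps tensor.cpu() / tensor.gpu() returning deep copies without double frees.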

Tensor Operations

- Common Tensor attributes include:
1. size: (C, H, W)
2. strides (the step per dimension: how many elements to move in memory when that index increases by 1): (H*W, W, 1)
3. dtype: float
4. device: cuda:0 or cpu
5. offset (how many elements from the base address to the first element of this tensor)
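
These attributes determine where element (c, h, w) of such a tensor sits in memory; a small illustrative helper (the function name at is hypothetical) makes the arithmetic explicit:

// For size (C, H, W), strides (H*W, W, 1) and an element offset,
// element (c, h, w) lives at: base + offset + c * (H*W) + h * W + w * 1
float at(const float* base, int offset, int H, int W, int c, int h, int w)
{
    const int strides[3] = { H * W, W, 1 };
    return base[offset + c * strides[0] + h * strides[1] + w * strides[2]];
}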


GPU Memory Model

  • Access speed: local > shared >> global >> CPU memory

  • Capacity / scope (how widely it is shared): CPU memory > global > shared > local

  • Define shared memory with the keyword __shared__

  • Highlight: We can fetch data into shared memory for data reuse, to reduce accesses to global memory (see the sketch after the code below).

    __global__ void foo(float* x, float* y)
    {
        int i = threadIdx.x;       // local memory (registers)
        float t;                   // local memory (registers)
        __shared__ float s[128];   // shared memory
        __shared__ float a, b, c;  // shared memory
        // which of the following is the fastest?
        t = *x;                    // global -> local
        b = a;                     // shared -> shared
        s[i] = t;                  // local -> shared: the fastest
        *y = *x;                   // global -> global
    }
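
To illustrate the highlighted data-reuse idea, here is a minimal sketch (the kernel smooth_gpu and BLOCK_SIZE are illustrative, not from the original notes): each block loads its tile of the input into shared memory once, and every thread then reads its neighbours from shared memory instead of issuing extra global-memory loads.

    #define BLOCK_SIZE 128

    // 3-point moving average: each input element is read from global memory once,
    // then reused by up to three threads through shared memory
    __global__ void smooth_gpu(const float* in, float* out, int n)
    {
        __shared__ float tile[BLOCK_SIZE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
        int t = threadIdx.x;                             // index within the block

        if (i < n) tile[t] = in[i];                      // one global read per element
        __syncthreads();                                 // wait until the whole tile is loaded

        if (i < n) {
            // neighbours come from shared memory; clamp at the edges of the tile
            float left  = (t > 0) ? tile[t - 1] : tile[t];
            float right = (t < blockDim.x - 1 && i + 1 < n) ? tile[t + 1] : tile[t];
            out[i] = (left + tile[t] + right) / 3.0f;
        }
    }

    // launch: smooth_gpu<<<(N + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, N);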

Highlight: Coalesced Global Memory Access

  • The CUDA program is efficient when threads read / write contiguous memory locations
    __global__ void foo(float* x) {
        int i = threadIdx.x;  // local memory
        float s, t;           // local memory
        // which of the following is coalesced?
        t = x[i];             // coalesced: consecutive threads read consecutive addresses
        x[i*2] = t;           // strided: consecutive threads write every other element, not coalesced
        x[i+1] = s;           // contiguous (coalesced), just shifted by one element
    }