Lecture 7

Two Types of Parallelism

Task parallelism (different tasks run concurrently) and data parallelism (the same operation is applied across many data elements); GPUs are built for the latter.

Latency vs. Throughput

CPUs are latency-oriented (minimise the time to complete a single task); GPUs are throughput-oriented (maximise the total work completed per unit time).

Fig. 1. Von Neumann Architecture
Fig. 2. Von Neumann Architecture for GPU

Compiling for CUDA
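CUDA source files (.cu) are compiled with nvcc, which separates the device code from the host code (the latter is passed to the system C/C++ compiler). A minimal sketch, assuming the source file is named vecAdd.cu:

nvcc -o vecAdd vecAdd.cu
./vecAdd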

VectorAdd - trivial example

\vec{C} = \vec{A} + \vec{B}

The sequential code for this might look as follows:

void vecAdd(float* h_A, float* h_B, float* h_C, int n){
    for (int i = 0; i < n; i++){
        h_C[i] = h_A[i] + h_B[i];
    }
}

int main(){
    // declarations
    vecAdd(h_A, h_B, h_C, N);
    ...
}

To write this for CUDA we must identify whether each function runs on the host, the device, or both.
We also have to work out where it should be callable from, as the sketch below illustrates:
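In CUDA this is expressed with function-type qualifiers. A minimal sketch (the function names here are illustrative, not from the lecture):

__global__ void kernelFn();                 // runs on the device, callable from the host (a kernel); must return void
__device__ float deviceFn(float x);         // runs on the device, callable only from device code
__host__ float hostFn(float x);             // runs on the host, callable from the host (the default)
__host__ __device__ float bothFn(float x);  // compiled for both host and device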

CPU computational unit structure

CUDA thread issues

CUDA thread organisation

Outside the Grid/Block/Thread hierarchy, there is the concept of a warp: a group of 32 consecutive threads that the hardware schedules and executes together.
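As an illustration, inside a kernel a thread can work out where it sits within its warp from the built-in warpSize variable (the names lane and warpInBlock are mine):

int lane        = threadIdx.x % warpSize;  // position within the warp (0..31 on current hardware)
int warpInBlock = threadIdx.x / warpSize;  // which warp of the block this thread belongs to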

Invoking Kernel Functions

...
int threadsPerBlock = 256;
// round up: ceil(numElements / threadsPerBlock), so every element is covered
int blocksPerGrid = 1 + (numElements - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
...

Note: the <<<blocksPerGrid, threadsPerBlock>>> launch syntax is not standard C/C++; it is handled by nvcc.

Inside Kernel Functions

Each thread needs to know which part of the data to work on.
CUDA provides predefined variables for this purpose:

__global__ void
vectorAdd(const float *A, const float *B, float *C, int n){
    // global index of this thread across the whole grid
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // guard: the last block may extend past the end of the arrays
    if (i < n){
        C[i] = A[i] + B[i];
    }
}

Grid and Block Dimensionalities
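Grids and blocks need not be one-dimensional: both can have up to three dimensions, specified with the built-in dim3 type. A sketch for a 2D problem such as an image (the kernel name and sizes are illustrative):

dim3 threadsPerBlock(16, 16);              // 256 threads arranged as a 16x16 tile
dim3 blocksPerGrid((width  + 15) / 16,     // enough tiles to cover the whole
                   (height + 15) / 16);    // width x height image
kernel2D<<<blocksPerGrid, threadsPerBlock>>>(d_img, width, height);

Inside the kernel, the x index is then recovered as blockIdx.x * blockDim.x + threadIdx.x, and likewise for .y.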

Device Global Memory

cudaError_t err;
float *h_A = (float *)malloc(size);
float *h_B = (float *)malloc(size);

float *d_A = NULL;
err = cudaMalloc((void **)&d_A, size);
float *d_C = NULL;
err = cudaMalloc((void **)&d_C, size);
...
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
... // Invoke kernel
err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
...
err = cudaFree(d_A);
err = cudaFree(d_C);

Note: this code is for a slightly different example; use it only as an illustration of the cuda* runtime calls (cudaMalloc, cudaMemcpy, cudaFree).

CUDA Error Handling

The only way to check that things are working correctly on the GPU is to check the error return values. Check them every time.

if (err != cudaSuccess){
    // handler
}
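cudaGetErrorString turns an error code into a readable message. Note that a kernel launch itself returns nothing, so launch errors must be picked up afterwards with cudaGetLastError. A sketch of both patterns:

err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess){
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
err = cudaGetLastError();   // reports any error from the launch itself
if (err != cudaSuccess){
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}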

Timing Host Code with Host Timers

The sdk* timer helpers below come from helper_functions.h in the NVIDIA CUDA samples.

#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_functions.h>

StopWatchInterface *timer = NULL;
sdkCreateTimer(&timer);
sdkStartTimer(&timer);

/* The Host code that is to be timed*/

sdkStopTimer(&timer);
double h_msecs = sdkGetTimerValue(&timer);   // elapsed time in milliseconds
sdkDeleteTimer(&timer);

To time a section that includes GPU work with a host timer, the host must first wait for the device to finish; kernel launches are asynchronous, so otherwise the timer stops while kernels are still running:

#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_functions.h>

StopWatchInterface *timer = NULL;
sdkCreateTimer(&timer);
sdkStartTimer(&timer);

/* Host + GPU code that is to be timed */

cudaDeviceSynchronize();   // block until all device work has completed

sdkStopTimer(&timer);
double h_msecs = sdkGetTimerValue(&timer);   // elapsed time in milliseconds
sdkDeleteTimer(&timer);

Alternatively, device execution can be timed with CUDA events, which are recorded on the GPU itself:

cudaEvent_t start, stop;
float d_msecs;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);

/* Call GPU kernel(s) */

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the stop event has actually been recorded

cudaEventElapsedTime(&d_msecs, start, stop);   // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);