Latency-oriented processors
Throughput-oriented processors
nvcc
takes C/C++ code with Nvidia extensions, separates out and compiles the GPU code, and passes the CPU code on to the host compiler
The sequential code for this may look like this:
void vecAdd(float* h_A, float* h_B, float* h_C, int n){
    for (int i = 0; i < n; i++){
        h_C[i] = h_A[i] + h_B[i];
    }
}

int main(){
    // declarations
    vecAdd(h_A, h_B, h_C, N);
    ...
}
To write this for CUDA we must identify whether a function runs on the host, the device or both.
We also have to work out where it should be callable from:
__host__ void f(...)            // runs on the host, callable from the host (the default)
__global__ void f(...)          // a kernel: runs on the device, launched from the host
__device__ void f(...)          // runs on the device, callable only from the device
__host__ __device__ void f(...) // compiled for both host and device
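As a sketch of how these qualifiers combine (the helper names `add`, `vecAddKernel` and `square` are illustrative assumptions, not from the notes):

```cuda
// Runs on the device, callable only from device code.
__device__ float add(float a, float b){
    return a + b;
}

// A kernel: runs on the device, launched from the host with <<<...>>>.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n){
        C[i] = add(A[i], B[i]);
    }
}

// Compiled twice: once for the host and once for the device,
// so it can be called from either side.
__host__ __device__ float square(float x){
    return x * x;
}
```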
Outside the Grid/Block/Thread hierarchy, there is the concept of a warp.
A warp is a set of tightly coupled threads that must execute fully in lock step with each other.
Warps are not part of CUDA but exist on all modern Nvidia GPUs, dictated by low-level hardware design
The number of threads in a warp is a feature of a particular GPU, but is most commonly 32
Warps are the low-level basis of thread scheduling on a GPU: if a thread is scheduled to execute, so are all the other threads in its warp
As they execute the same instructions in lock step, all threads in a warp will have the same instruction timing
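One consequence of lock-step execution (a sketch; the kernel name and condition are assumptions, not from the notes): if threads in the same warp take different branches, the warp executes both paths in turn with the non-participating threads masked off, so divergent branches within a warp cost roughly the sum of both paths.

```cuda
// Hypothetical kernel illustrating warp divergence.
__global__ void divergent(float *data){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // Threads in the same warp disagree on this condition, so the warp
    // runs BOTH branches one after the other, masking threads off.
    if (i % 2 == 0){
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }
    // By contrast, a warp-aligned condition such as (i / 32) % 2 == 0
    // keeps each warp on a single path, avoiding the serialization.
}
```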
A block can have a size between 1 and the maximum number of threads per block on the GPU (typically 1024) and is the high-level basis of thread scheduling
Because of the nature of warps, the block size should be a multiple of the warp size
Grids can have large numbers of blocks, many more than can be concurrently executed.
...
int threadsPerBlock = 256;
int blocksPerGrid = 1 + (numElements - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
...
Note: <<<blocksPerGrid, threadsPerBlock>>> is not standard C/C++ and is handled by nvcc
Each thread needs to know which part of the data to work on.
CUDA provides predefined variables for this purpose:
blockIdx.x
    the unique identifier for this block in this grid
blockDim.x
    the number of threads in a block for this grid
threadIdx.x
    the unique identifier of this thread in this block

__global__ void vectorAdd(const float *A, const float *B, float *C, int n){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n){
        C[i] = A[i] + B[i];
    }
}
blockIdx.x, blockDim.x and threadIdx.x have .y and .z variants
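The .y components give a natural row/column decomposition for 2D data. A sketch (the kernel name `matrixAdd` and the row-major layout are assumptions, not from the notes):

```cuda
// Hypothetical 2D kernel: each thread handles one element of a
// width x height matrix stored in row-major order.
__global__ void matrixAdd(const float *A, const float *B, float *C,
                          int width, int height){
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    if (col < width && row < height){
        int i = row * width + col;
        C[i] = A[i] + B[i];
    }
}

// Launched with 2D blocks and a 2D grid, e.g.:
// dim3 threads(16, 16);
// dim3 blocks(1 + (width - 1) / 16, 1 + (height - 1) / 16);
// matrixAdd<<<blocks, threads>>>(d_A, d_B, d_C, width, height);
```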
The host manages its own memory with malloc and free.
CUDA provides analogous functions for device memory: cudaMalloc, cudaMemcpy and cudaFree.
The return value of cudaMalloc is an error number.

float *h_A = (float *)malloc(size);
float *h_B = (float *)malloc(size);

float *d_A = NULL;
err = cudaMalloc((void **)&d_A, size);
float *d_C = NULL;
err = cudaMalloc((void **)&d_C, size);
...
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
...
// Invoke kernel
err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
...
err = cudaFree(d_A);
err = cudaFree(d_C);
Note: this code is for a slightly different example; use it only as an illustration of the cuda<method> functions
The only way to check that things are working correctly on the GPU is to check the error return values. Check them every time
if (err != cudaSuccess){ // handler }
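A common way to check every call without cluttering the code (a stylistic sketch, not from the notes; the macro name `CUDA_CHECK` is an assumption) is a wrapper macro that reports the failing file and line:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: aborts with a message if a CUDA call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess){                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **)&d_A, size));
// CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
```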
Kernel launches do not return an error value directly; use
err = cudaGetLastError();
to get the error number if one occurred.
Beware: because kernels run asynchronously, cudaGetLastError() can run before the kernel function finishes, so the error may only occur after you requested it. Call cudaDeviceSynchronize() first to wait for the kernel to complete.
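Putting this together, a sketch of checking a kernel launch (variable names follow the earlier vector-add example):

```cuda
// Launch the kernel; the launch itself returns no error code.
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

// Catch launch-configuration errors (e.g. an invalid block size).
cudaError_t err = cudaGetLastError();

// Wait for the kernel to finish so execution errors become visible.
if (err == cudaSuccess){
    err = cudaDeviceSynchronize();
}
if (err != cudaSuccess){
    fprintf(stderr, "vectorAdd failed: %s\n", cudaGetErrorString(err));
}
```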
Typically we want to time both the sequential code (on the host) and the parallel code (on the GPU).
The general approach to time the Host code is:
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_functions.h>

StopWatchInterface *timer = NULL;
sdkCreateTimer(&timer);
sdkStartTimer(&timer);

/* The host code that is to be timed */

sdkStopTimer(&timer);
double h_msecs = sdkGetTimerValue(&timer);
sdkDeleteTimer(&timer);
To time code that includes work on the GPU, synchronize before stopping the timer:

#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_functions.h>

StopWatchInterface *timer = NULL;
sdkCreateTimer(&timer);
sdkStartTimer(&timer);

/* Host + GPU code that is to be timed */

cudaDeviceSynchronize();
sdkStopTimer(&timer);
double h_msecs = sdkGetTimerValue(&timer);
sdkDeleteTimer(&timer);
To time just the GPU kernels, use CUDA events:

cudaEvent_t start, stop;
float d_msecs;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

/* Call GPU kernel(s) */

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&d_msecs, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);