*-cpu
description: CPU
product: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
bus info: cpu@0
version: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
capabilities: ...
../DeviceQuery/Debug/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "GeForce RTX 2070"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7982 MBytes (8370061312 bytes)
(36) Multiprocessors, ( 64) CUDA Cores/MP: 2304 CUDA Cores
GPU Max Clock rate: 1710 MHz (1.71 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 11 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce GTX 1050"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 4042 MBytes (4238737408 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1493 MHz (1.49 GHz)
Memory Clock rate: 3504 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce RTX 2070 (GPU0) -> GeForce GTX 1050 (GPU1) : No
> Peer access from GeForce GTX 1050 (GPU1) -> GeForce RTX 2070 (GPU0) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS
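- For reference, the fields above map directly onto `cudaDeviceProp`; a minimal sketch of querying them through the runtime API (this is not the SDK's actual deviceQuery source):

```cuda
// Minimal sketch: read the same device properties deviceQuery reports.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Detected %d CUDA Capable device(s)\n", count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: \"%s\"\n", dev, prop.name);
        printf("  Capability %d.%d, %d SMs, warp size %d\n",
               prop.major, prop.minor, prop.multiProcessorCount, prop.warpSize);
        printf("  Global memory: %zu bytes, max threads/block: %d\n",
               prop.totalGlobalMem, prop.maxThreadsPerBlock);
    }
    return 0;
}
```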
[Vector addition of 50000 elements]
Executed vector add of 50000 elements on the Host in = 0.28000mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 50000 elements on the Device in a SINGLE THREAD in = 36.34509mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 196 blocks of 256 threads
Executed vector add of 50000 elements on the Device in 196 blocks of 256 threads in = 0.02950mSecs
Speedup relative to Host 9.49024
Speedup relative to single threaded device 1231.86987
Copy output data from the CUDA device to the host memory
Test PASSED
Done
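- The runs in this section all follow the same pattern: a one-element-per-thread kernel launched on a grid rounded up to cover the whole vector. A minimal sketch with assumed names; only the grid-size formula is confirmed by the logged block counts, e.g. (50000 + 255) / 256 = 196:

```cuda
// Sketch of the element-wise kernel and launch geometry implied by the logs;
// names are illustrative, not the lab's actual source.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // the last block may extend past n
        c[i] = a[i] + b[i];
}

// Round the grid size up so every element gets a thread: for n = 50000 and
// threadsPerBlock = 256 this gives (50000 + 255) / 256 = 196 blocks, as logged.
void launchVectorAdd(const float *dA, const float *dB, float *dC,
                     int n, int threadsPerBlock) {
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, n);
}
```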
[Vector addition of 50000 elements]
Executed vector add of 50000 elements on the Host in = 0.23800mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 50000 elements on the Device in a SINGLE THREAD in = 36.34224mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 358 blocks of 140 threads
Executed vector add of 50000 elements on the Device in 358 blocks of 140 threads in = 0.02099mSecs
Speedup relative to Host 11.33765
Speedup relative to single threaded device 1731.24243
Copy output data from the CUDA device to the host memory
Test PASSED
Done
[Vector addition of 50000 elements]
Executed vector add of 50000 elements on the Host in = 0.24000mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 50000 elements on the Device in a SINGLE THREAD in = 36.34560mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 49 blocks of 1024 threads
Executed vector add of 50000 elements on the Device in 49 blocks of 1024 threads in = 0.03059mSecs
Speedup relative to Host 7.84519
Speedup relative to single threaded device 1188.07532
Copy output data from the CUDA device to the host memory
Test PASSED
Done
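- The SINGLE THREAD baseline above presumably launches one block of one thread that loops over the whole vector; a minimal sketch (the kernel name and exact form are assumptions):

```cuda
// Single-thread device baseline implied by the logs: one block of one thread
// loops over the entire vector (name and exact form are assumptions).
__global__ void vectorAddSingleThread(const float *a, const float *b,
                                      float *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
// Launched as: vectorAddSingleThread<<<1, 1>>>(dA, dB, dC, n);
```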
- We see that for 50000 elements the smaller block sizes are generally preferable: 140 Threads per Block gives the best speedup, though dropping as low as 64 starts to hurt again (see the table below).
| Threads Per Block | Speedup vs Host | Speedup vs Single Thread |
| ----------------- | --------------- | ------------------------ |
| 1024              | 7.85            | 1188.08                  |
| 140               | 11.33           | 1731.24                  |
| 256               | 9.49            | 1231.86                  |
| 512               | 7.50            | 1069.56                  |
| 64                | 7.11            | 1159.10                  |
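- A harness along these lines reproduces the sweep by timing the kernel at each block size with CUDA events (hypothetical; the lab's actual timing code may differ, and `vectorAdd` is the kernel sketched earlier):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, int n);

// Time the same kernel at each candidate block size using CUDA events.
void sweepBlockSizes(const float *dA, const float *dB, float *dC, int n) {
    const int sizes[] = {64, 140, 256, 512, 1024};
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int tpb : sizes) {
        int blocks = (n + tpb - 1) / tpb;
        cudaEventRecord(start);
        vectorAdd<<<blocks, tpb>>>(dA, dB, dC, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d threads/block -> %8.5f ms\n", tpb, ms);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```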
- The range of acceptable `Threads Per Block` values for my system is 1 -> 1024
- Through experimentation and looking at device memory utilisation, I worked out that the max value for `numElements` on my system is 686722110 (a rough justification follows below)
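- As a rough sanity check: three float buffers cost 12 bytes per element, and 8370061312 / 12 ≈ 697505109 elements, so the observed ceiling of 686722110 sits just under the theoretical one, the gap being consistent with memory the CUDA context itself reserves. A small sketch, assuming the program holds exactly three device buffers:

```cuda
// Rough upper bound on numElements, assuming three float device buffers
// (A, B, C) and nothing else on the card.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);        // bytes free / total on current device
    size_t perElement = 3 * sizeof(float);  // buffers A, B and C, 4 bytes each
    printf("total: %zu bytes -> at most ~%zu elements\n", totalB, totalB / perElement);
    printf("free : %zu bytes -> at most ~%zu elements\n", freeB, freeB / perElement);
    // 8370061312 / 12 ~= 697.5M; the observed 686722110 is a little lower,
    // consistent with memory reserved by the CUDA context.
    return 0;
}
```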
[Vector addition of 686722110 elements]
Executed vector add of 686722110 elements on the Host in = 3439.10498mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 686722110 elements on the Device in a SINGLE THREAD in = 381607.06250mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 2682509 blocks of 256 threads
Executed vector add of 686722110 elements on the Device in 2682509 blocks of 256 threads in = 38.81689mSecs
Speedup relative to Host 88.59815
Speedup relative to single threaded device 9830.95312
Copy output data from the CUDA device to the host memory
Test PASSED
Done
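- The "Copy input data" / "Copy output data" lines in each run correspond to the usual `cudaMalloc`/`cudaMemcpy` round trip; a minimal sketch with assumed variable names (error checks omitted, `vectorAdd` as sketched earlier):

```cuda
// Assumes the vectorAdd kernel sketched earlier; hA/hB/hC are host arrays.
void runVectorAdd(const float *hA, const float *hB, float *hC,
                  int numElements, int threadsPerBlock) {
    size_t bytes = (size_t)numElements * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);                             // error checks omitted
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // "Copy input data ..."
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, numElements);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // "Copy output data ..."
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```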
~/git-repos/Parallel+Distributed/labs/CUDA master* 4m 9s
λ ./VectorAdd/Debug/VectorAdd
[Vector addition of 686722110 elements]
Executed vector add of 686722110 elements on the Host in = 3376.98291mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 686722110 elements on the Device in a SINGLE THREAD in = 383470.78125mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 4905158 blocks of 140 threads
Executed vector add of 686722110 elements on the Device in 4905158 blocks of 140 threads in = 47.15405mSecs
Speedup relative to Host 71.61597
Speedup relative to single threaded device 8132.29785
Copy output data from the CUDA device to the host memory
Test PASSED
Done
- As we can see from the above results, for 686722110 elements the 256 Threads per Block case achieves a higher speedup than the 140 case, both relative to the host and relative to a single thread on the device.
λ ./VectorAdd/Debug/VectorAdd 686722110 1024
You have entered 3 arguments:
./VectorAdd/Debug/VectorAdd
686722110
1024
[Vector addition of 686722110 elements]
Executed vector add of 686722110 elements on the Host in = 3434.57690mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 686722110 elements on the Device in a SINGLE THREAD in = 379501.43750mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 670628 blocks of 1024 threads
Executed vector add of 686722110 elements on the Device in 670628 blocks of 1024 threads in = 42.28717mSecs
Speedup relative to Host 81.22031
Speedup relative to single threaded device 8974.38770
Copy output data from the CUDA device to the host memory
Test PASSED
Done
- Here we can see that 256 Threads per Block performs best for the large vector, both in speedup and in overall execution time (see the note below for one possible reason).
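- One plausible (unmeasured) contributor: 256 is an exact multiple of the warp size (32), while a 140-thread block still occupies 5 full warps and leaves 20 of its 160 scheduled lanes idle, as the arithmetic below shows:

```cuda
// Warp-granularity arithmetic: a plausible explanation, not a measurement.
#include <cstdio>

int main() {
    const int warp = 32;                      // warp size from deviceQuery above
    const int tpb = 140;
    int warps = (tpb + warp - 1) / warp;      // 5 warps scheduled per block
    int lanes = warps * warp;                 // 160 hardware lanes occupied
    printf("%d threads/block -> %d warps, %d of %d lanes idle (%.1f%%)\n",
           tpb, warps, lanes - tpb, lanes, 100.0 * (lanes - tpb) / lanes);
    return 0;
}
```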