*-cpu
description: CPU
product: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
bus info: cpu@0
version: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
capabilities: ...
../DeviceQuery/Debug/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "GeForce RTX 2070"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7982 MBytes (8370061312 bytes)
(36) Multiprocessors, ( 64) CUDA Cores/MP: 2304 CUDA Cores
GPU Max Clock rate: 1710 MHz (1.71 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 11 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce GTX 1050"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 4042 MBytes (4238737408 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1493 MHz (1.49 GHz)
Memory Clock rate: 3504 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce RTX 2070 (GPU0) -> GeForce GTX 1050 (GPU1) : No
> Peer access from GeForce GTX 1050 (GPU1) -> GeForce RTX 2070 (GPU0) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS
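- For reference, the fields above map directly onto `cudaDeviceProp`; a minimal sketch of querying them through the runtime API (this is not the SDK's actual deviceQuery source):

```cuda
// Minimal sketch: read the same device properties deviceQuery reports.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Detected %d CUDA Capable device(s)\n", count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: \"%s\"\n", dev, prop.name);
        printf("  Capability %d.%d, %d SMs, warp size %d\n",
               prop.major, prop.minor, prop.multiProcessorCount, prop.warpSize);
        printf("  Global memory: %zu bytes, max threads/block: %d\n",
               prop.totalGlobalMem, prop.maxThreadsPerBlock);
    }
    return 0;
}
```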
[Vector addition of 50000 elements]
Executed vector add of 50000 elements on the Host in = 0.28000mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 50000 elements on the Device in a SINGLE THREAD in = 36.34509mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 196 blocks of 256 threads
Executed vector add of 50000 elements on the Device in 196 blocks of 256 threads in = 0.02950mSecs
Speedup relative to Host 9.49024
Speedup relative to single threaded device 1231.86987
Copy output data from the CUDA device to the host memory
Test PASSED
Done
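- The runs in this section all follow the same pattern: a one-element-per-thread kernel launched on a grid rounded up to cover the whole vector. A minimal sketch with assumed names; only the grid-size formula is confirmed by the logged block counts, e.g. (50000 + 255) / 256 = 196:

```cuda
// Sketch of the element-wise kernel and launch geometry implied by the logs;
// names are illustrative, not the lab's actual source.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // the last block may extend past n
        c[i] = a[i] + b[i];
}

// Round the grid size up so every element gets a thread: for n = 50000 and
// threadsPerBlock = 256 this gives (50000 + 255) / 256 = 196 blocks, as logged.
void launchVectorAdd(const float *dA, const float *dB, float *dC,
                     int n, int threadsPerBlock) {
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, n);
}
```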
[Vector addition of 50000 elements]
Executed vector add of 50000 elements on the Host in = 0.23800mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 50000 elements on the Device in a SINGLE THREAD in = 36.34224mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 358 blocks of 140 threads
Executed vector add of 50000 elements on the Device in 358 blocks of 140 threads in = 0.02099mSecs
Speedup relative to Host 11.33765
Speedup relative to single threaded device 1731.24243
Copy output data from the CUDA device to the host memory
Test PASSED
Done
[Vector addition of 50000 elements]
Executed vector add of 50000 elements on the Host in = 0.24000mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 50000 elements on the Device in a SINGLE THREAD in = 36.34560mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 49 blocks of 1024 threads
Executed vector add of 50000 elements on the Device in 49 blocks of 1024 threads in = 0.03059mSecs
Speedup relative to Host 7.84519
Speedup relative to single threaded device 1188.07532
Copy output data from the CUDA device to the host memory
Test PASSED
Done
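- The SINGLE THREAD baseline above presumably launches one block of one thread that loops over the whole vector; a minimal sketch (the kernel name and exact form are assumptions):

```cuda
// Single-thread device baseline implied by the logs: one block of one thread
// loops over the entire vector (name and exact form are assumptions).
__global__ void vectorAddSingleThread(const float *a, const float *b,
                                      float *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
// Launched as: vectorAddSingleThread<<<1, 1>>>(dA, dB, dC, n);
```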
- We see that for 50000 elements the smaller block sizes are generally preferable: 140 Threads per Block gives the best speedup, though dropping as low as 64 starts to hurt again (see the table below).
| Threads Per Block | Speedup vs Host | Speedup vs Single Thread |
| ----------------- | --------------- | ------------------------ |
| 1024              | 7.85            | 1188.08                  |
| 140               | 11.33           | 1731.24                  |
| 256               | 9.49            | 1231.86                  |
| 512               | 7.50            | 1069.56                  |
| 64                | 7.11            | 1159.10                  |
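- A harness along these lines reproduces the sweep by timing the kernel at each block size with CUDA events (hypothetical; the lab's actual timing code may differ, and `vectorAdd` is the kernel sketched earlier):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, int n);

// Time the same kernel at each candidate block size using CUDA events.
void sweepBlockSizes(const float *dA, const float *dB, float *dC, int n) {
    const int sizes[] = {64, 140, 256, 512, 1024};
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int tpb : sizes) {
        int blocks = (n + tpb - 1) / tpb;
        cudaEventRecord(start);
        vectorAdd<<<blocks, tpb>>>(dA, dB, dC, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d threads/block -> %8.5f ms\n", tpb, ms);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```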
- The range of acceptable `Threads Per Block` values for my system is 1 -> 1024
- Through experimentation and looking at device memory utilisation, I worked out that the max value for `numElements` on my system is 686722110 (a rough justification follows below)
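- As a rough sanity check: three float buffers cost 12 bytes per element, and 8370061312 / 12 ≈ 697505109 elements, so the observed ceiling of 686722110 sits just under the theoretical one, the gap being consistent with memory the CUDA context itself reserves. A small sketch, assuming the program holds exactly three device buffers:

```cuda
// Rough upper bound on numElements, assuming three float device buffers
// (A, B, C) and nothing else on the card.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);        // bytes free / total on current device
    size_t perElement = 3 * sizeof(float);  // buffers A, B and C, 4 bytes each
    printf("total: %zu bytes -> at most ~%zu elements\n", totalB, totalB / perElement);
    printf("free : %zu bytes -> at most ~%zu elements\n", freeB, freeB / perElement);
    // 8370061312 / 12 ~= 697.5M; the observed 686722110 is a little lower,
    // consistent with memory reserved by the CUDA context.
    return 0;
}
```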
[Vector addition of 686722110 elements]
Executed vector add of 686722110 elements on the Host in = 3439.10498mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 686722110 elements on the Device in a SINGLE THREAD in = 381607.06250mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 2682509 blocks of 256 threads
Executed vector add of 686722110 elements on the Device in 2682509 blocks of 256 threads in = 38.81689mSecs
Speedup relative to Host 88.59815
Speedup relative to single threaded device 9830.95312
Copy output data from the CUDA device to the host memory
Test PASSED
Done
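- The "Copy input data" / "Copy output data" lines in each run correspond to the usual `cudaMalloc`/`cudaMemcpy` round trip; a minimal sketch with assumed variable names (error checks omitted, `vectorAdd` as sketched earlier):

```cuda
// Assumes the vectorAdd kernel sketched earlier; hA/hB/hC are host arrays.
void runVectorAdd(const float *hA, const float *hB, float *hC,
                  int numElements, int threadsPerBlock) {
    size_t bytes = (size_t)numElements * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);                             // error checks omitted
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // "Copy input data ..."
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, numElements);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // "Copy output data ..."
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```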
~/git-repos/Parallel+Distributed/labs/CUDA master* 4m 9s
λ ./VectorAdd/Debug/VectorAdd
[Vector addition of 686722110 elements]
Executed vector add of 686722110 elements on the Host in = 3376.98291mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 686722110 elements on the Device in a SINGLE THREAD in = 383470.78125mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 4905158 blocks of 140 threads
Executed vector add of 686722110 elements on the Device in 4905158 blocks of 140 threads in = 47.15405mSecs
Speedup relative to Host 71.61597
Speedup relative to single threaded device 8132.29785
Copy output data from the CUDA device to the host memory
Test PASSED
Done
- As we can see from the above results, for 686722110 elements the 256 Threads per Block case achieves a higher speedup than the 140 case, both relative to the host and relative to a single thread on the device.
λ ./VectorAdd/Debug/VectorAdd 686722110 1024
You have entered 3 arguments:
./VectorAdd/Debug/VectorAdd
686722110
1024
[Vector addition of 686722110 elements]
Executed vector add of 686722110 elements on the Host in = 3434.57690mSecs
Copy input data from the host memory to the CUDA device
Executed vector add of 686722110 elements on the Device in a SINGLE THREAD in = 379501.43750mSecs
Copy output data from the CUDA device to the host memory
Test PASSED
Launching the CUDA kernel with 670628 blocks of 1024 threads
Executed vector add of 686722110 elements on the Device in 670628 blocks of 1024 threads in = 42.28717mSecs
Speedup relative to Host 81.22031
Speedup relative to single threaded device 8974.38770
Copy output data from the CUDA device to the host memory
Test PASSED
Done
- Here we can see that 256 Threads per Block performs best for the large vector, both in speedup and in overall execution time (see the note below for one possible reason).
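- One plausible (unmeasured) contributor: 256 is an exact multiple of the warp size (32), while a 140-thread block still occupies 5 full warps and leaves 20 of its 160 scheduled lanes idle, as the arithmetic below shows:

```cuda
// Warp-granularity arithmetic: a plausible explanation, not a measurement.
#include <cstdio>

int main() {
    const int warp = 32;                      // warp size from deviceQuery above
    const int tpb = 140;
    int warps = (tpb + warp - 1) / warp;      // 5 warps scheduled per block
    int lanes = warps * warp;                 // 160 hardware lanes occupied
    printf("%d threads/block -> %d warps, %d of %d lanes idle (%.1f%%)\n",
           tpb, warps, lanes - tpb, lanes, 100.0 * (lanes - tpb) / lanes);
    return 0;
}
```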