Environment: Windows 7 x64, GTX 970, CUDA 7.5 RC, driver 353.30.
Measured sgemm performance is 2.93 TFlop/s (matrixMulCUBLAS sample, -sizemult=10).
[Related pages]
[GTX 970][CUDA 7] sgemm performance
http://cuda-memo.blogspot.jp/2015/07/gtx-970cuda-7-sgemm.html
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>nvprof matrixMulCUBLAS.exe -sizemult=10
[Matrix Multiply CUBLAS] - Starting...
==4876== NVPROF is profiling process 4876, command: matrixMulCUBLAS.exe -sizemult=10
GPU Device 0: "GeForce GTX 970" with compute capability 5.2
MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
Computing result using CUBLAS...done.
Performance= 2931.65 GFlop/s, Time= 0.358 msec, Size= 1048576000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==4876== Profiling application: matrixMulCUBLAS.exe -sizemult=10
==4876== Profiling result:
Time(%) Time Calls Avg Min Max Name
84.06% 10.850ms 31 349.99us 346.34us 355.75us maxwell_sgemm_128x64_nn
9.67% 1.2485ms 3 416.16us 1.3120us 806.45us [CUDA memcpy HtoD]
6.26% 808.59us 1 808.59us 808.59us 808.59us [CUDA memcpy DtoH]
==4876== API calls:
Time(%) Time Calls Avg Min Max Name
67.27% 311.92ms 7 44.559ms 18.432us 311.27ms cudaFree
29.27% 135.70ms 6 22.617ms 10.649us 134.12ms cudaMalloc
2.30% 10.664ms 1 10.664ms 10.664ms 10.664ms cudaEventSynchronize
0.57% 2.6268ms 4 656.69us 23.348us 1.1702ms cudaMemcpy
0.24% 1.1256ms 2 562.79us 478.41us 647.17us cudaGetDeviceProperties
0.17% 806.91us 166 4.8600us 0ns 212.58us cuDeviceGetAttribute
0.08% 370.28us 31 11.944us 9.8300us 35.635us cudaLaunch
0.05% 228.15us 2 114.07us 84.377us 143.77us cuDeviceGetName
0.01% 69.225us 372 186ns 0ns 819ns cudaSetupArgument
0.01% 27.851us 16 1.7400us 819ns 11.059us cudaEventCreateWithFlags
0.00% 19.251us 2 9.6250us 9.0110us 10.240us cudaThreadSynchronize
0.00% 17.203us 1 17.203us 17.203us 17.203us cudaEventElapsedTime
0.00% 14.745us 16 921ns 409ns 2.0480us cudaEventDestroy
0.00% 13.517us 2 6.7580us 5.3250us 8.1920us cudaEventRecord
0.00% 9.0120us 2 4.5060us 3.6870us 5.3250us cuDeviceTotalMem
0.00% 8.5970us 31 277ns 0ns 1.6380us cudaConfigureCall
0.00% 7.3730us 2 3.6860us 1.6390us 5.7340us cudaGetDevice
0.00% 6.9660us 31 224ns 0ns 410ns cudaGetLastError
0.00% 4.5060us 10 450ns 409ns 819ns cudaDeviceGetAttribute
0.00% 4.0960us 2 2.0480us 1.2290us 2.8670us cudaEventCreate
0.00% 2.0490us 3 683ns 0ns 1.6390us cuDeviceGetCount
0.00% 410ns 1 410ns 410ns 410ns cuDriverGetVersion
0.00% 409ns 1 409ns 409ns 409ns cuInit
0.00% 409ns 3 136ns 0ns 409ns cuDeviceGet
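
As a rough cross-check of the numbers in the log above, the following sketch recomputes the throughput from the printed operation count and compares it with the per-kernel time that nvprof reports. All constants are copied from the output; the helper name gflops and the reading of the operation count as 2*M*N*K with dimensions 640, 1280, 640 are my own assumptions, not something the sample states.

```python
# Figures copied from the matrixMulCUBLAS / nvprof output above.
ops_per_gemm = 1_048_576_000   # "Size= 1048576000 Ops"
reported_time_ms = 0.358       # "Time= 0.358 msec" (rounded by the sample)
kernel_avg_us = 349.99         # nvprof Avg for maxwell_sgemm_128x64_nn
kernel_calls = 31              # nvprof Calls for the same kernel
kernel_total_ms = 10.850       # nvprof Time for the same kernel


def gflops(ops, seconds):
    """Hypothetical helper: convert an op count and a duration to GFlop/s."""
    return ops / seconds / 1e9


# Assumption: the op count is 2*M*N*K for the dimensions printed in the log.
assumed_ops = 2 * 640 * 1280 * 640

# Throughput from the sample's own (rounded) per-iteration time.
from_sample_time = gflops(ops_per_gemm, reported_time_ms * 1e-3)

# Throughput from nvprof's average kernel duration, i.e. kernel time only.
from_kernel_avg = gflops(ops_per_gemm, kernel_avg_us * 1e-6)

# Total kernel time reconstructed from calls * average.
reconstructed_total_ms = kernel_calls * kernel_avg_us / 1000.0

print(f"op count matches 2*M*N*K: {assumed_ops == ops_per_gemm}")
print(f"from sample time : {from_sample_time:.1f} GFlop/s")
print(f"from kernel avg  : {from_kernel_avg:.1f} GFlop/s")
print(f"calls * avg      : {reconstructed_total_ms:.3f} ms")
```

Under these assumptions the sample's rounded time gives about 2.93 TFlop/s, in line with the reported 2931.65 GFlop/s, while the kernel-only average from nvprof gives about 3.00 TFlop/s. The small gap is presumably launch and event-timing overhead included in the sample's measurement.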