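Environment: Windows 7 x64, GTX 970, CUDA 7, driver 353.30.
Measured performance with the matrixMulCUBLAS sample (-sizemult=10): 2.92 TFlop/s.
According to the reference page below, the GTX 970 reportedly reaches 2.95 TFlop/s, the GTX 980 3.87 TFlop/s, and the GTX 780 Ti 3.06 TFlop/s.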
[Reference page]
http://www.comphys.las.shibaura-it.ac.jp/matrixMulCUBLAS_2015
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\bin\win64\Release>nvprof matrixMulCUBLAS.exe -sizemult=10
[Matrix Multiply CUBLAS] - Starting...
==1864== NVPROF is profiling process 1864, command: matrixMulCUBLAS.exe -sizemult=10
GPU Device 0: "GeForce GTX 970" with compute capability 5.2
MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
Computing result using CUBLAS...done.
Performance= 2919.77 GFlop/s, Time= 0.359 msec, Size= 1048576000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==1864== Profiling application: matrixMulCUBLAS.exe -sizemult=10
==1864== Profiling result:
Time(%) Time Calls Avg Min Max Name
81.92% 10.896ms 31 351.50us 348.19us 355.84us maxwell_sgemm_128x64_nn
12.14% 1.6145ms 3 538.15us 1.3120us 1.0752ms [CUDA memcpy HtoD]
5.94% 789.79us 1 789.79us 789.79us 789.79us [CUDA memcpy DtoH]
==1864== API calls:
Time(%) Time Calls Avg Min Max Name
66.45% 302.18ms 7 43.168ms 19.660us 301.50ms cudaFree
29.95% 136.21ms 6 22.702ms 11.469us 134.27ms cudaMalloc
2.36% 10.715ms 1 10.715ms 10.715ms 10.715ms cudaEventSynchronize
0.61% 2.7778ms 4 694.46us 23.347us 1.1751ms cudaMemcpy
0.26% 1.1714ms 2 585.71us 583.26us 588.17us cudaGetDeviceProperties
0.20% 925.67us 166 5.5760us 0ns 295.31us cuDeviceGetAttribute
0.08% 343.65us 31 11.085us 9.4200us 32.767us cudaLaunch
0.05% 219.54us 2 109.77us 78.232us 141.31us cuDeviceGetName
0.01% 58.161us 372 156ns 0ns 820ns cudaSetupArgument
0.01% 29.082us 2 14.541us 9.4210us 19.661us cudaThreadSynchronize
0.01% 26.623us 16 1.6630us 409ns 9.8300us cudaEventCreateWithFlags
0.00% 15.565us 1 15.565us 15.565us 15.565us cudaEventElapsedTime
0.00% 15.155us 2 7.5770us 6.1440us 9.0110us cuDeviceTotalMem
0.00% 14.336us 2 7.1680us 4.9150us 9.4210us cudaEventRecord
0.00% 13.517us 16 844ns 409ns 2.4570us cudaEventDestroy
0.00% 8.6020us 2 4.3010us 2.8670us 5.7350us cudaGetDevice
0.00% 6.9630us 31 224ns 0ns 1.2290us cudaConfigureCall
0.00% 4.5050us 10 450ns 409ns 819ns cudaDeviceGetAttribute
0.00% 4.0960us 2 2.0480us 1.2290us 2.8670us cudaEventCreate
0.00% 3.2760us 31 105ns 0ns 410ns cudaGetLastError
0.00% 820ns 3 273ns 0ns 410ns cuDeviceGetCount
0.00% 819ns 1 819ns 819ns 819ns cuDriverGetVersion
0.00% 819ns 1 819ns 819ns 819ns cuInit
0.00% 818ns 3 272ns 0ns 409ns cuDeviceGet
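
As a rough cross-check of the figures in the log above, the following Python sketch recomputes the throughput from the printed operation count and timings. All numbers are copied from the log. Reading "Ops" as 2*M*N*K is an assumption (it happens to match 2 x 640 x 1280 x 640), and the variable and function names are made up for illustration.

```python
# Values copied from the log above.
ops = 1048576000            # "Size= 1048576000 Ops"
reported_gflops = 2919.77   # "Performance= 2919.77 GFlop/s"
reported_ms = 0.359         # "Time= 0.359 msec" (rounded in the log)
kernel_avg_us = 351.50      # nvprof "Avg" for maxwell_sgemm_128x64_nn

# Assumption: the op count is 2*M*N*K using the 640 and 1280 dimensions
# printed for the matrices; this reproduces the logged value exactly.
assert 2 * 640 * 1280 * 640 == ops


def gflops(op_count, seconds):
    """Throughput in GFlop/s for op_count operations taking the given time."""
    return op_count / seconds / 1e9


# Throughput recomputed from the rounded time printed by the sample.
from_reported_time = gflops(ops, reported_ms * 1e-3)

# Throughput based only on the average kernel time measured by nvprof.
from_kernel_avg = gflops(ops, kernel_avg_us * 1e-6)

# Unrounded per-iteration time implied by the reported GFlop/s figure.
implied_ms = ops / (reported_gflops * 1e9) * 1e3

print(f"from reported time : {from_reported_time:.2f} GFlop/s")
print(f"from kernel average: {from_kernel_avg:.2f} GFlop/s")
print(f"implied time       : {implied_ms:.4f} ms")
```

The kernel-only figure comes out slightly higher than the value the sample reports, presumably because the sample's own timing also includes launch overhead between kernels. Either way the result stays close to the 2.92 TFlop/s quoted at the top.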