2015-07-13

[GTX 970][CUDA 7] sgemm performance

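I measured sgemm throughput with the CUDA sample program matrixMulCUBLAS.
Environment: Windows 7 x64, GTX 970, CUDA 7, driver 353.30.
Result: 2.92 TFlop/s (2919.77 GFlop/s as reported by the sample with -sizemult=10).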
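
As a rough cross-check, the reported figure can be recomputed from the "Size" and "Time" values that the sample prints in the log further down. This is only a sketch: the helper name `gflops` is made up for illustration, and the reading of the operation count as 2 x 640 x 1280 x 640 is an assumption that happens to match the printed value, not something taken from the sample's source.

```python
# Values copied from the matrixMulCUBLAS "Performance=" line in the log.
ops = 1_048_576_000   # "Size= 1048576000 Ops"
time_ms = 0.359       # "Time= 0.359 msec" (already rounded by the sample)


def gflops(op_count: float, elapsed_ms: float) -> float:
    """Throughput in GFlop/s for op_count operations taking elapsed_ms milliseconds."""
    return op_count / (elapsed_ms * 1e-3) / 1e9


# Assumption: a GEMM costs 2*M*N*K operations; 640, 1280 and 640 reproduce
# the printed operation count exactly.
assert 2 * 640 * 1280 * 640 == ops

print(f"{gflops(ops, time_ms):.1f} GFlop/s")
```

The result is close to, but not exactly, the 2919.77 GFlop/s in the log, because the printed time is rounded to three decimal places.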

According to the reference page below, the GTX 970 reportedly reaches 2.95 TFlop/s, the GTX 980 3.87 TFlop/s, and the GTX 780 Ti 3.06 TFlop/s.

[Reference page]
http://www.comphys.las.shibaura-it.ac.jp/matrixMulCUBLAS_2015

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\bin\win64\Release>nvprof matrixMulCUBLAS.exe -sizemult=10
[Matrix Multiply CUBLAS] - Starting...
==1864== NVPROF is profiling process 1864, command: matrixMulCUBLAS.exe -sizemult=10
GPU Device 0: "GeForce GTX 970" with compute capability 5.2

MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
Computing result using CUBLAS...done.
Performance= 2919.77 GFlop/s, Time= 0.359 msec, Size= 1048576000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==1864== Profiling application: matrixMulCUBLAS.exe -sizemult=10
==1864== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 81.92%  10.896ms        31  351.50us  348.19us  355.84us  maxwell_sgemm_128x64_nn
 12.14%  1.6145ms         3  538.15us  1.3120us  1.0752ms  [CUDA memcpy HtoD]
  5.94%  789.79us         1  789.79us  789.79us  789.79us  [CUDA memcpy DtoH]

==1864== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 66.45%  302.18ms         7  43.168ms  19.660us  301.50ms  cudaFree
 29.95%  136.21ms         6  22.702ms  11.469us  134.27ms  cudaMalloc
  2.36%  10.715ms         1  10.715ms  10.715ms  10.715ms  cudaEventSynchronize
  0.61%  2.7778ms         4  694.46us  23.347us  1.1751ms  cudaMemcpy
  0.26%  1.1714ms         2  585.71us  583.26us  588.17us  cudaGetDeviceProperties
  0.20%  925.67us       166  5.5760us       0ns  295.31us  cuDeviceGetAttribute
  0.08%  343.65us        31  11.085us  9.4200us  32.767us  cudaLaunch
  0.05%  219.54us         2  109.77us  78.232us  141.31us  cuDeviceGetName
  0.01%  58.161us       372     156ns       0ns     820ns  cudaSetupArgument
  0.01%  29.082us         2  14.541us  9.4210us  19.661us  cudaThreadSynchronize
  0.01%  26.623us        16  1.6630us     409ns  9.8300us  cudaEventCreateWithFlags
  0.00%  15.565us         1  15.565us  15.565us  15.565us  cudaEventElapsedTime
  0.00%  15.155us         2  7.5770us  6.1440us  9.0110us  cuDeviceTotalMem
  0.00%  14.336us         2  7.1680us  4.9150us  9.4210us  cudaEventRecord
  0.00%  13.517us        16     844ns     409ns  2.4570us  cudaEventDestroy
  0.00%  8.6020us         2  4.3010us  2.8670us  5.7350us  cudaGetDevice
  0.00%  6.9630us        31     224ns       0ns  1.2290us  cudaConfigureCall
  0.00%  4.5050us        10     450ns     409ns     819ns  cudaDeviceGetAttribute
  0.00%  4.0960us         2  2.0480us  1.2290us  2.8670us  cudaEventCreate
  0.00%  3.2760us        31     105ns       0ns     410ns  cudaGetLastError
  0.00%     820ns         3     273ns       0ns     410ns  cuDeviceGetCount
  0.00%     819ns         1     819ns     819ns     819ns  cuDriverGetVersion
  0.00%     819ns         1     819ns     819ns     819ns  cuInit
  0.00%     818ns         3     272ns       0ns     409ns  cuDeviceGet
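
The profile above also allows a kernel-only estimate, independent of the sample's own timer. The sketch below divides the operation count printed by the sample by the average duration of maxwell_sgemm_128x64_nn reported by nvprof. It assumes that every one of the 31 launches performs the same 1048576000 operations; the variable names are illustrative only.

```python
# Values copied from the log above.
ops = 1_048_576_000        # "Size= 1048576000 Ops" from the sample
kernel_avg_us = 351.50     # Avg column for maxwell_sgemm_128x64_nn
kernel_total_ms = 10.896   # Time column for the same kernel
kernel_calls = 31          # Calls column for the same kernel

# Sanity check: total time / calls should agree with the reported average.
avg_from_total_us = kernel_total_ms * 1e3 / kernel_calls

# Assumption: each launch does the same `ops` operations.
kernel_gflops = ops / (kernel_avg_us * 1e-6) / 1e9

print(f"average from total: {avg_from_total_us:.2f} us")
print(f"kernel-only throughput: {kernel_gflops:.1f} GFlop/s")
```

Under these assumptions the kernel alone comes out somewhat above the 2919.77 GFlop/s printed by the sample, which would be consistent with the sample's timer including some launch overhead.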