2015-07-14

[GTX 970][CUDA 7.5 RC] sgemm performance

Measured with the CUDA sample program matrixMulCUBLAS.
Environment: Windows 7 x64, GTX 970, CUDA 7.5 RC, driver 353.30.
Result: 2.93 TFlop/s (2931.65 GFlop/s as reported by the sample with -sizemult=10).
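
As a sanity check on that figure, the short sketch below recomputes the rate from the numbers the sample prints in the log further down. It assumes the sample counts a GEMM as 2 * m * n * k operations. The log does not say which dimension plays which role, so the tuple `dims` is only illustrative. All variable names are hypothetical.

```python
# Values copied from the matrixMulCUBLAS output in the log.
reported_ops = 1048576000   # "Size= 1048576000 Ops"
reported_time_ms = 0.358    # "Time= 0.358 msec" (printed with 3 decimals)
reported_gflops = 2931.65   # "Performance= 2931.65 GFlop/s"

# Assumption: one multiply and one add per inner-product term, i.e. 2*m*n*k.
# The log prints 640 and 1280 as matrix sizes; two 640s and one 1280
# reproduce the reported operation count.
dims = (640, 640, 1280)
ops = 2 * dims[0] * dims[1] * dims[2]

# Rate recomputed from the rounded time that the sample prints.
gflops_from_rounded_time = ops / (reported_time_ms * 1e-3) / 1e9

# Time per sgemm call implied by the reported rate (before rounding).
time_implied_ms = ops / (reported_gflops * 1e9) * 1e3

print("op count matches log:", ops == reported_ops)
print(f"{gflops_from_rounded_time:.1f} GFlop/s from the rounded time")
print(f"{time_implied_ms:.4f} ms per call implied by the reported rate")
```

The small gap between the recomputed rate and the reported 2931.65 GFlop/s is consistent with the time being rounded to three decimals before printing.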

[Related pages]
[GTX 970][CUDA 7] sgemm performance
http://cuda-memo.blogspot.jp/2015/07/gtx-970cuda-7-sgemm.html

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>nvprof matrixMulCUBLAS.exe -sizemult=10
[Matrix Multiply CUBLAS] - Starting...
==4876== NVPROF is profiling process 4876, command: matrixMulCUBLAS.exe -sizemult=10
GPU Device 0: "GeForce GTX 970" with compute capability 5.2

MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
Computing result using CUBLAS...done.
Performance= 2931.65 GFlop/s, Time= 0.358 msec, Size= 1048576000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==4876== Profiling application: matrixMulCUBLAS.exe -sizemult=10
==4876== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 84.06%  10.850ms        31  349.99us  346.34us  355.75us  maxwell_sgemm_128x64_nn
  9.67%  1.2485ms         3  416.16us  1.3120us  806.45us  [CUDA memcpy HtoD]
  6.26%  808.59us         1  808.59us  808.59us  808.59us  [CUDA memcpy DtoH]

==4876== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 67.27%  311.92ms         7  44.559ms  18.432us  311.27ms  cudaFree
 29.27%  135.70ms         6  22.617ms  10.649us  134.12ms  cudaMalloc
  2.30%  10.664ms         1  10.664ms  10.664ms  10.664ms  cudaEventSynchronize
  0.57%  2.6268ms         4  656.69us  23.348us  1.1702ms  cudaMemcpy
  0.24%  1.1256ms         2  562.79us  478.41us  647.17us  cudaGetDeviceProperties
  0.17%  806.91us       166  4.8600us       0ns  212.58us  cuDeviceGetAttribute
  0.08%  370.28us        31  11.944us  9.8300us  35.635us  cudaLaunch
  0.05%  228.15us         2  114.07us  84.377us  143.77us  cuDeviceGetName
  0.01%  69.225us       372     186ns       0ns     819ns  cudaSetupArgument
  0.01%  27.851us        16  1.7400us     819ns  11.059us  cudaEventCreateWithFlags
  0.00%  19.251us         2  9.6250us  9.0110us  10.240us  cudaThreadSynchronize
  0.00%  17.203us         1  17.203us  17.203us  17.203us  cudaEventElapsedTime
  0.00%  14.745us        16     921ns     409ns  2.0480us  cudaEventDestroy
  0.00%  13.517us         2  6.7580us  5.3250us  8.1920us  cudaEventRecord
  0.00%  9.0120us         2  4.5060us  3.6870us  5.3250us  cuDeviceTotalMem
  0.00%  8.5970us        31     277ns       0ns  1.6380us  cudaConfigureCall
  0.00%  7.3730us         2  3.6860us  1.6390us  5.7340us  cudaGetDevice
  0.00%  6.9660us        31     224ns       0ns     410ns  cudaGetLastError
  0.00%  4.5060us        10     450ns     409ns     819ns  cudaDeviceGetAttribute
  0.00%  4.0960us         2  2.0480us  1.2290us  2.8670us  cudaEventCreate
  0.00%  2.0490us         3     683ns       0ns  1.6390us  cuDeviceGetCount
  0.00%     410ns         1     410ns     410ns     410ns  cuDriverGetVersion
  0.00%     409ns         1     409ns     409ns     409ns  cuInit
  0.00%     409ns         3     136ns       0ns     409ns  cuDeviceGet
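
The kernel timings in the nvprof profiling result above allow a second, kernel-only estimate. The sketch below converts the Avg/Min/Max times of maxwell_sgemm_128x64_nn into GFlop/s, using the operation count printed by the sample. This is a rough cross-check, not the sample's own method, and the helper name `gflops` is hypothetical.

```python
# Operation count printed by the sample ("Size= 1048576000 Ops").
ops = 1048576000

# maxwell_sgemm_128x64_nn row from the nvprof profiling result.
kernel_calls = 31
kernel_total_ms = 10.850
kernel_avg_us = 349.99
kernel_min_us = 346.34
kernel_max_us = 355.75


def gflops(op_count, time_us):
    """Convert an operation count and a time in microseconds to GFlop/s."""
    return op_count / (time_us * 1e-6) / 1e9


# The average should agree with total time / number of calls.
avg_from_total_us = kernel_total_ms * 1e3 / kernel_calls

print(f"avg from total: {avg_from_total_us:.2f} us")
print(f"kernel avg: {gflops(ops, kernel_avg_us):.1f} GFlop/s")
print(f"kernel min time: {gflops(ops, kernel_min_us):.1f} GFlop/s")
print(f"kernel max time: {gflops(ops, kernel_max_us):.1f} GFlop/s")
```

The kernel-only average comes out slightly higher than the figure the sample reports. One plausible reason, stated here as an assumption, is that the sample's event-based timing also includes launch overhead between calls.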