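I ran a lightly modified version of the matrixMulCUBLAS sample. With cublasSgemm, the kernel launched internally was maxwell_sgemm_128x64_nn; with cublasSgemmEx (specifying CUBLAS_DATA_HALF), it is maxwell_fp16_sgemm_fp16_128x64_nn instead. The arithmetic itself still appears to be done in float, so performance seems to be almost unchanged (though the result might differ once memory transfers and cache hits or misses are taken into account).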
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>nvprof matrixMulCUBLAS_half.exe -sizemult=10
[Matrix Multiply CUBLAS] - Starting...
==6140== NVPROF is profiling process 6140, command: matrixMulCUBLAS_half.exe -sizemult=10
GPU Device 0: "GeForce GTX 970" with compute capability 5.2
MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
Computing result using CUBLAS...done.
Performance= 2953.19 GFlop/s, Time= 0.355 msec, Size= 1048576000 Ops
Computing result using host CPU...done.
Listing first 100 Differences > 0.000010...
Row 0:
Loc(0,0) CPU=160.44701 GPU=160.50000 Diff=0.052994
Loc(1,0) CPU=167.72478 GPU=167.75000 Diff=0.025223
Loc(2,0) CPU=171.19669 GPU=171.25000 Diff=0.053314
(omitted)
Total Errors = 819097
Comparing CUBLAS Matrix Multiply with CPU results: FAIL
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==6140== Profiling application: matrixMulCUBLAS_half.exe -sizemult=10
==6140== Profiling result:
Time(%) Time Calls Avg Min Max Name
77.29% 10.771ms 31 347.46us 344.64us 352.74us maxwell_fp16_sgemm_fp16_128x64_nn
9.45% 1.3174ms 3 439.14us 1.3120us 864.16us [CUDA memcpy HtoD]
6.19% 862.27us 1 862.27us 862.27us 862.27us [CUDA memcpy DtoH]
4.70% 655.58us 2 327.79us 327.17us 328.42us copy_array_from_float_to_half(float const *, __half*, __int64)
2.36% 329.02us 1 329.02us 329.02us 329.02us copy_array_from_half_to_float(__half const *, float*, __int64)
==6140== API calls:
Time(%) Time Calls Avg Min Max Name
69.23% 331.75ms 10 33.175ms 22.937us 330.90ms cudaFree
27.33% 130.98ms 9 14.554ms 9.4200us 128.45ms cudaMalloc
2.22% 10.629ms 1 10.629ms 10.629ms 10.629ms cudaEventSynchronize
0.65% 3.1297ms 4 782.42us 22.937us 1.6253ms cudaMemcpy
0.21% 999.81us 2 499.91us 427.21us 572.61us cudaGetDeviceProperties
0.17% 809.76us 166 4.8780us 0ns 258.04us cuDeviceGetAttribute
0.09% 420.24us 34 12.360us 9.8300us 31.539us cudaLaunch
0.05% 235.52us 2 117.76us 88.882us 146.63us cuDeviceGetName
0.01% 71.272us 381 187ns 0ns 1.2290us cudaSetupArgument
0.01% 26.214us 16 1.6380us 409ns 9.8310us cudaEventCreateWithFlags
0.01% 25.804us 2 12.902us 4.9150us 20.889us cudaEventRecord
0.00% 21.708us 2 10.854us 9.8300us 11.878us cudaThreadSynchronize
0.00% 15.974us 1 15.974us 15.974us 15.974us cudaEventElapsedTime
0.00% 14.335us 16 895ns 409ns 2.8670us cudaEventDestroy
0.00% 10.649us 34 313ns 0ns 2.0480us cudaConfigureCall
0.00% 9.4200us 2 4.7100us 4.0960us 5.3240us cuDeviceTotalMem
0.00% 6.9640us 2 3.4820us 1.2290us 5.7350us cudaGetDevice
0.00% 5.7330us 31 184ns 0ns 410ns cudaGetLastError
0.00% 4.9150us 10 491ns 409ns 1.2280us cudaDeviceGetAttribute
0.00% 4.5060us 2 2.2530us 819ns 3.6870us cudaEventCreate
0.00% 2.8680us 3 956ns 410ns 2.0480us cuDeviceGetCount
0.00% 819ns 3 273ns 0ns 410ns cuDeviceGet
0.00% 819ns 1 819ns 819ns 819ns cuInit
0.00% 409ns 1 409ns 409ns 409ns cuDriverGetVersion
======== Error: Application returned non-zero code 1