CUDA Notes: 2016

2016-10-29

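Converting between float (FP32) and half (FP16) with the CPU's FP16C instructions

If the CPU supports FP16C (reported as the F16C feature flag in CPUID), the following intrinsics convert between float (FP32) and half (FP16). Each call converts eight values at once.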

__m256 _mm256_cvtph_ps (__m128i a)
__m128i _mm256_cvtps_ph (__m256 a, int rounding)

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=FP16C
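
As a rough sketch of how the two intrinsics fit together, the following C++ converts eight floats to half precision and back. The function names floats_to_halves8 and halves_to_floats8 and the FP16C_TARGET macro are illustrative and not part of any library. The rounding argument used here (round to nearest, exceptions suppressed) is one assumed choice among the modes listed in the Intrinsics Guide. The target attribute is there so the file builds with GCC or Clang without extra flags; dropping it and compiling with -mavx -mf16c should work as well. The code assumes the CPU really supports these instructions at run time.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Assumption: on GCC/Clang, enable AVX + F16C for these functions only,
// so no -mavx/-mf16c flags are needed. MSVC needs no attribute.
#if defined(__GNUC__) || defined(__clang__)
#define FP16C_TARGET __attribute__((target("avx,f16c")))
#else
#define FP16C_TARGET
#endif

// Convert 8 floats to 8 halves, stored as raw 16-bit patterns.
// Rounding mode: round to nearest (even), floating-point exceptions suppressed.
FP16C_TARGET
void floats_to_halves8(const float* src, std::uint16_t* dst) {
    __m256 f = _mm256_loadu_ps(src);  // unaligned load of 8 floats
    __m128i h = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), h);  // 8 x 16 bits
}

// Convert 8 halves (raw 16-bit patterns) back to 8 floats. This direction is exact.
FP16C_TARGET
void halves_to_floats8(const std::uint16_t* src, float* dst) {
    __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    __m256 f = _mm256_cvtph_ps(h);
    _mm256_storeu_ps(dst, f);
}
```

Values that are exactly representable in FP16 (for example 1.0f or 0.5f) should survive the round trip unchanged; other values are rounded when converted to half.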

Related

How to convert from float (FP32) to half (FP16) on the CPU (and back)

2016-09-29

CUDA 8.0 appears to have been released

CUDA 8.0 Downloads | NVIDIA Developer:
https://developer.nvidia.com/cuda-downloads

CUDA 8 Features Revealed | Parallel Forall:
https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/

Among other things, it adds support for Microsoft Visual Studio 2015 (updates 2 and 3) and speeds up the NVCC compiler.

2016-08-13

cuDNN v5.1 (August 10, 2016), for CUDA 8.0 RC

cuDNN v5.1 (August 10, 2016), for CUDA 8.0 RC and cuDNN v5.1 (August 10, 2016), for CUDA 7.5 appear to be available.

https://developer.nvidia.com/rdp/cudnn-download

2016-06-28

cuDNN v5.1 RC (June 16, 2016), for CUDA 7.5

cuDNN v5.1 RC (June 16, 2016), for CUDA 7.5 appears to be available.

https://developer.nvidia.com/cudnn

2016-05-29

cuDNN v5 (May 27, 2016), for CUDA 8.0 RC

cuDNN v5 (May 27, 2016), for CUDA 8.0 RC appears to be available.

Trying out CUDA 8 RC

CUDA 8 RC (8.0.27) was out, so I installed it.

Since it supports Visual Studio 2015, I updated Visual Studio as well.

2016-04-07

cuDNN v5 Release Candidate (RC) (April, 2016), for CUDA 7.5 and later

cuDNN v5 Release Candidate (RC) (April, 2016) appears to be available.

https://developer.nvidia.com/cudnn

2016-02-17

Running word2vec_cbow on Windows

I built word2vec_cbow with Visual Studio 2013 and ran it on Windows.

The steps to create the project are as follows.

Create a CUDA 7.5 Runtime project in Visual Studio 2013.
Download the two files word2vec.cu and cbow.cu from https://github.com/ChenglongChen/word2vec_cbow and put them in the project folder.
Download the file win32-port.h from https://github.com/zhangyafeikimi/word2vec-win32 and put it in the project folder.
Open word2vec.cu and change #include <pthread.h> to #include "win32-port.h".
Add word2vec.cu to the project.
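
If the same word2vec.cu should still build on non-Windows systems, the include change in the steps above could be made conditional instead of replacing the line outright. This is only a sketch, not what the original repositories do: it assumes win32-port.h sits in the project folder as described and supplies the pthread declarations that word2vec.cu needs. _WIN32 is the macro that MSVC predefines for Windows targets.

```cpp
// Assumption: win32-port.h provides the pthread-style declarations on Windows.
#ifdef _WIN32
#include "win32-port.h"
#else
#include <pthread.h>
#endif
```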

My machine has a GeForce GTX 970, so I set [Configuration Properties]-[CUDA C/C++]-[Device]-[Code Generation] to "compute_52,sm_52".

I built with [Solution Platforms] set to "x64".

The output of the run was as follows.

>word2vec_cbow.exe -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
vocab size = 71290
Alpha: 0.000009  Progress: 99.99%  Words/thread/sec: 1528.12k

GPU load during the run was around 50%.

Note that I have not verified whether the results are correct, so treat them with caution.

2016-02-13

cuDNN v4 (Feb 10, 2016)

The final release of cuDNN v4 appears to be available.

https://developer.nvidia.com/cudnn

The following is quoted from the release notes.

UPDATES SINCE RELEASE CANDIDATE

The API of cudnnBatchNormalizationBackward has been changed to include an additional set of scaling parameters (alphaParamsDiff and betaParamsDiff) applied to the dBnScaleResult and dBnBiasResult outputs of the function.

The prior restriction of batch size 512 in all Batch Normalization routines has been removed.

Numerical stability and performance of cudnnBatchNormalizationBackward in some cases has been improved.

Performance of cudnnConvolutionBackwardFilter when using Algo 1 has been improved for some cases. This code path now also requires a workspace.

2016-01-26

cuDNN v4 Release Candidate (December 10, 2015)

cuDNN v4 Release Candidate (December 10, 2015) appears to have been released.

For the release announcement in Japanese, see the following link.
https://ja-jp.facebook.com/NVIDIAGPUComputing/posts/458731480981793

The following is quoted from the release notes.

NEW FEATURES

Batch Normalization routines have been added.

Convolution forward and backward now supports NHWC tensor format.

FFT Tiling algorithm has been added for cudnnConvolutionForward and cudnnConvolutionBackwardData routines

cudnnConvolutionForward now supports computation in FP16 when run on GPU with a compute capability >= 5.3

cudnnConvolutionForward has been optimized for batch size = 1

Pooling and activation routines have a descriptor option to propagate NaN numbers.
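
The FP16 item in the list above is conditional on compute capability. A minimal sketch of that check in C++ follows. The helper name is hypothetical, and the major/minor pair is assumed to come from somewhere like the major and minor fields of the cudaDeviceProp structure filled in by cudaGetDeviceProperties.

```cpp
#include <cassert>

// Hypothetical helper: true when a device's compute capability is at least 5.3,
// the threshold quoted in the release notes for FP16 computation in
// cudnnConvolutionForward.
constexpr bool meets_fp16_conv_requirement(int major, int minor) {
    return major > 5 || (major == 5 && minor >= 3);
}
```

By this threshold, a device built for compute_52/sm_52, such as the GeForce GTX 970 in the 2016-02-17 entry, would not qualify.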