NVIDIA CUDA User Manual

NVIDIA CUDA 编程指南
NVIDIA
Jan. 2008 Version 1.1
GPU
.....................................................................................................................1
NVIDIA CUDA
Chapter1
1.1
1.2 CUDA
Chapter2
2.1
线
2.2
线
2.2.1
2.2.2
2.3
线 线
Chapter3
3.1
3.2
3.3
3.4
3.5
CUDA…….....................................................................................................................11
.........................................................................................................................1
GPU
............................................................................................................................... 15
.....................................................................................................................15
.......................................................................................................................................15
..........................................................................................................................................16
...................................................................................................................................16
...........................................................................................................................................17
on-chip
................................................................................................................................18
SIMD
...........................................................................................................................................19
........................................................................................................................................20
...............................................................................................................................................20
...........................................................................................................................................20
………………………….............................................11
..............................................................................................12
....................................................................................18
Chapter4
4.1
C
4.2
4.2.1
API ...................................................................................................21
..............................................................................................................................21
...........................................................................................................................................21
......................................................................................................................... 22
4.2.1.1 __device__..................................................................................................................22
4.2.1.2 __global__..........................................................................................................22
4.2.1.3 __host__............................................................................................................22
4.2.1.4
4.2.2
....................................................................................................................22
........................................................................................................................23
4.2.2.1 __device__..................................................................................................23
4.2.2.2 __constant__ .......................................................................................................23
4.2.2.3 __shared__............................................................................................................23
4.2.2.4
4.2.3
4.2.4
.............................................................................................................................24
...........................................................................................................................25
..............................................................................................................................26
4.2.4.1 gridDim.............................................................................................................................. 26
4.2.4.2 blockIdx...........................................................................................................................26
4.2.4.3 blockDim............................................................................................................................26
- 2 -
4.2.4.4 threadIdx......................................................................................................................... 26
4.2.4.5
4.2.5 NVCC
..................................................................................................................................... 26
.................................................................................................................................. 26
4.2.5.1 __noinline__ ...................................................................................................................27
4.2.5.2 #pragmaunroll .................................................................................................................27
4.3
4.3.1
Runtime
........................................................................................................................... 28
............................................................................................................................. 28
4.3.1.1
char1,uchar1,char2,uchar2,char3,uchar3,char4,uchar4,short1,ushort1,short2,us
hort2,short3,ushort3,short4,ushort4,int1,uint1,int2,uint2,int3,uint3,int4,ui
nt4,long1,ulong1,long2,ulong2,long3,ulong3,long4,ulong4,float1,float2,float3
,float4.......................................................................................................................................... 28
4.3.1.2 dim3
4.3.2
4.3.3
4.3.4
4.3.4.1 Texture Reference
4.3.4.2 RuntimeTexture Reference
........................................................................................................................... 28
..................................................................................................................................... 28
..................................................................................................................................... 28
..................................................................................................................................... 29
.......................................................................................................29
.........................................................................................30
4.3.4.3
4.4
4.4.1
4.4.2
4.4.3
4.4.4 TypeCasting
4.4.5
4.4.5.1
4.4.5.2 CUDA
4.4.6
4.5
4.5.1
4.5.1.1
4.5.1.2
线
Runtime
........................................................................................................................... 31
..................................................................................................................................... 31
..................................................................................................................................... 31
............................................................................................................................. 32
..................................................................................................................................... 33
Runtime
...................................................................................................................................... 34
........................................................................................................................... 34
..................................................................................................................................... 35
..................................................................................................................................... 35
..................................................................................................................................... 35
CUDA
.......................................................................................31
........................................................................................................................32
................................................................................................................33
............................................................................................................33
4.5.1.3 OpenGL Interoperability ...................................................................................................... 36
4.5.1.4 Direct3D Interoperability ..................................................................................................... 36
4.5.1.5
...................................................................................................................37
- 3 -
4.5.2 RuntimeAPI.................................................................................................................................... 38
4.5.2.1
4.5.2.2
4.5.2.3
4.5.2.4
4.5.2.5
4.5.2.6 Texture Reference
..................................................................................................................................... 38
..................................................................................................................................... 40
.................................................................................................................................. 38
.................................................................................................................................. 39
.................................................................................................................................. 41
...........................................................................................................42
4.5.2.7 OpenGL Interoperability .......................................................................................................... 44
4.5.2.8 Direct3D Interoperability ......................................................................................................... 44
4.5.2.9
4.5.3
4.5.3.1
4.5.3.2
4.5.3.3 Context
4.5.3.4
4.5.3.5
4.5.3.6
4.5.3.7
使仿
API ........................................................................................................................................ 47
..................................................................................................................................... 47
................................................................................................................................. 47
............................................................................................................................. 47
................................................................................................................................. 48
................................................................................................................................. 49
................................................................................................................................. 49
..................................................................................................................................... 51
.............................................................................................................45
4.5.3.8
4.5.3.9 Texture Reference
................................................................................................................................. 51
..........................................................................................................52
4.5.3.10 OpenGL Interoperability ...................................................................................................... 53
4.5.3.11 Direct3D Interoperability ...................................................................................................... 53
Chapter5
5.1
5.1.1
5.1.1.1
5.1.1.2
5.1.1.3
5.1.1.4
5.1.2
5.1.2.1
5.1.2.2
5.1.2.3
5.1.2.4
5.1.2.5
................................................................................................................................ 54
........................................................................................................................................... 54
................................................................................................................................. 54
............................................................................................................................. 54
.......................................................................................................................... 55
............................................................................................................................. 56
............................................................................................................................. 56
.................................................................................................................................... 56
............................................................................................................................. 57
............................................................................................................................. 62
............................................................................................................................. 63
............................................................................................................................. 63
................................................................................................................................. 70
- 4 -
5.2
线
5.3
.......................................................................................................................... 70
....................................................................................................................71
5.4 Texture Fetch
5.5
Chapter6
6.1
6.2
6.3
............................................................................................................................................... 74
........................................................................................................................................... 76
.........................................................................................71
.......................................................................................................................... 72
....................................................................................................................74
.................................................................................................................................... 78
6.3.1 Mul().................................................................................................................................. 78
6.3.2 Muld()................................................................................................................................ 79
A
A.1
A.2
B
B.1
runtime
B.2
runtime
C
C.1
................................................................................................................................... 80
....................................................................................................................................... 81
.................................................................................................................................... 82
................................................................................................................................... 83
............................................................................................................................. 83
........................................................................................................................... 86
................................................................................................................................... 88
....................................................................................................................................... 88
C.1.1 atomicAdd() ................................................................................................................... 88
C.1.2 atomicSub() ................................................................................................................... 88
C.1.3 atomicExch() ................................................................................................................. 88
C.1.4 atomicMin() ................................................................................................................... 88
C.1.5 atomicMax() ................................................................................................................... 89
C.1.6 atomicInc() ....................................................................................................................89
C.1.7 atomicDec() ................................................................................................................... 89
C.1.8 atomicCAS() ................................................................................................................... 89
C.2
.................................................................................................................................... 90
C.2.1 atomicAnd() ................................................................................................................... 90
C.2.2 atomicOr()..................................................................................................................….. 90
C.2.3 atomicXor() ................................................................................................................... 90
D Runtime API Reference ............................................................................................................ 91
D.1
........................................................................................................................................ 91
D.1.1 cudaGetDeviceCount().......................................................................................................91
D.1.2 cudaSetDevice()..................................................................................................................91
D.1.3 cudaGetDevice()..................................................................................................................91
D.1.4 cudaGetDeviceProperties() .............................................................................91
D.1.5 cudaChooseDevice() .......................................................................................................93
- 5 -
D.2
线
...................................................................................................................................... 93
D.2.1 cudaThreadSynchronize()..................................................................................................93
D.2.2 cudaThreadExit()..................................................................................................................93
D.3
........................................................................................................................................... 93
D.3.1 cudaStreamCreate() ...........................................................................................................93
D.3.2 cudaStreamQuery()...............................................................................................................93
D.3.3 cudaStreamSyncronize().....................................................................................................93
D.3.4 cudaStreamDestroy() .........................................................................................................94
D.4
....................................................................................................................................... 94
D.4.1 cudaEventCreate()...............................................................................................................94
D.4.2 cudaEventRecord()................................................................................................................94
D.4.3 cudaEventQuery()...................................................................................................................94
D.4.4 cudaEventSyncronize()........................................................................................................94
D.4.5 cudaEventDestroy() .............................................................................................................95
D.4.6 cudaEventElapsedTime()......................................................................................................95
D.5
........................................................................................................................................ 95
D.5.1 cudaMalloc() ..........................................................................................................................95
D.5.2 cudaMallocPitch().................................................................................................................95
D.5.3 cudaFree()................................................................................................................................96
D.5.4 cudaMallocArray()................................................................................................................96
D.5.5 cudaFreeArray().....................................................................................................................96
D.5.6 cudaMallocHost()...................................................................................................................96
D.5.7 cudaFreeHost().......................................................................................................................96
D.5.8 cudaMemSet() ..........................................................................................................................97
D.5.9 cudaMemSet2D().......................................................................................................................97
D.5.10 cudaMemcpy() .........................................................................................................................97
D.5.11 cudaMemcpy2D() ...................................................................................................................98
D.5.12 cudaMemcpyToArray() .........................................................................................................98
D.5.13 cudaMemcpy2DToArray().....................................................................................................99
D.5.14 cudaMemcpyFromArray().....................................................................................................99
D.5.15 cudaMemcpy2DFromArray()................................................................................................100
D.5.16 cudaMemcpyArrayToArray()..............................................................................................100
D.5.17 cudaMemcpy2DArrayToArray() ........................................................................................101
D.5.18 cudaMemcpyToSymbol().....................................................................................................101
D.5.19 cudaMemcpyFromSymbol()..................................................................................................101
D.5.20 cudaGetSymbolAddress()..................................................................................................102
D.5.21 cudaGetSymbolSize() .......................................................................................................102
- 6 -
D.6 Texture Reference
D.6.1
API ................................................................................................................................... 102
................................................................................................................102
D.6.1.1 cudaCreateChannelDesc() ............................................................................................102
D.6.1.2 cudaGetChannelDesc()...................................................................................................102
D.6.1.3 cudaGetTextureReference().........................................................................................103
D.6.1.4 cudaBindTexture().........................................................................................................103
D.6.1.5 cudaBindTextureToArray()..........................................................................................103
D.6.1.6 cudaUnBindTexture()......................................................................................................103
D.6.1.7 cudaGetTextureAlignmentOffset()............................................................................104
D.6.2
API .....................................................................................................................................104
D.6.2.1 cudaCreateChannelDesc()...............................................................................................104
D.6.2.2 cudaBindTexture()............................................................................................................104
D.6.2.3 cudaBindTextureToArray().............................................................................................105
D.6.2.4 cudaUnBindTexture() ......................................................................................................105
D.7
...................................................................................................................................... 105
D.7.1 cudaConfigureCall() .........................................................................................................105
D.7.2 cudaLaunch() ........................................................................................................................105
D.7.3 cudaSetupArgument() .........................................................................................................106
D.8 OpenGL Interoperability ............................................................................................................... 106
D.8.1 cudaGLRegisterBufferObject().......................................................................................106
D.8.2 cudaGLMapBufferObject()..................................................................................................106
D.8.3 cudaGLUnMapBufferObject() ............................................................................................106
D.8.4 cudaGLUnRegisterBufferObject()...................................................................................106
D.9 Direct3D Interoperability .............................................................................................................. 107
D.9.1 cudaD3D9Begin()...................................................................................................................107
D.9.2 cudaD3D9End().......................................................................................................................107
D.9.3 cudaD3D9RegisterVertexBuffer()...................................................................................107
D.9.4 cudaD3D9MapVertexBuffer() ............................................................................................107
D.9.5 cudaD3D9UnMapVertexBuffer().........................................................................................107
D.9.6 cudaD3D9UnRegisterVertexBuffer() ............................................................................107
D.9.7 cudaD3D9GetDevice() ........................................................................................................108
D.10
......................................................................................................................................108
D.10.1 cudaGetLastError() .........................................................................................................108
D.10.2 cudaGetErrorString()......................................................................................................108
E DriverAPIReference ................................................................................................................ 109
E.1
........................................................................................................................................... 109
E.1.1 cuInit()................................................................................................................................. 109
- 7 -
E.2
....................................................................................................................................... 109
E.2.1 cuGetDeviceCount().............................................................................................................109
E.2.2 cuDeviceGet() ......................................................................................................................109
E.2.3 cuDeviceGetName()...............................................................................................................109
E.2.4 cuDeviceTotalMem().............................................................................................................109
E.2.5 cuDeviceComputeCapability() .......................................................................................110
E.2.6 cuDeviceGetAttribute()....................................................................................................110
E.2.7 cuDeviceGetProperties()..................................................................................................111
E.3 Context
...................................................................................................................................111
E.3.1 cuCtxCreate() .....................................................................................................................111
E.3.2 cuCtxAttach() .....................................................................................................................112
E.3.3 cuCtxDetach() .....................................................................................................................112
E.3.4 cuCtxGetDevice().................................................................................................................112
E.3.5 cuCtxSynchronize().............................................................................................................112
E.4
.......................................................................................................................................112
E.4.1 cuModuleLoad().....................................................................................................................112
E.4.2 cuModuleLoadData()............................................................................................................113
E.4.3 cuModuleLoadFatBinary().................................................................................................113
E.4.4 cuModuleUnload()................................................................................................................113
E.4.5 cuModuleGetFunction().....................................................................................................113
E.4.6 cuModuleGetGlobal() .......................................................................................................113
E.4.7 cuModuleGetTexRef() .......................................................................................................114
E.5
.........................................................................................................................................114
E.5.1 cuStreamCreate()...............................................................................................................114
E.5.2 cuStreamQuery().................................................................................................................114
E.5.3 cuStreamSynchronize()....................................................................................................114
E.5.4 cuStreamDestroy().............................................................................................................114
E.6
.....................................................................................................................................114
E.6.1 cuEventCreate().................................................................................................................114
E.6.2 cuEventRecord().................................................................................................................115
E.6.3 cuEventQuery()...................................................................................................................115
E.6.4 cuEventSynchronize() ....................................................................................................115
E.6.5 cuEventDestroy()...............................................................................................................115
E.6.6 cuEventElapsedTime() ....................................................................................................115
E.7
.....................................................................................................................................116
E.7.1 cuFuncSetBlockShape()....................................................................................................116
E.7.2 cuFuncSetSharedSize().....................................................................................................116
- 8 -
E.7.3 cuParamSetSize()................................................................................................................116
E.7.4 cuParamSeti() .....................................................................................................................116
E.7.5 cuParamSetf() .....................................................................................................................116
E.7.6 cuParamSetv() .....................................................................................................................116
E.7.7 cuParamSetTexRef()............................................................................................................117
E.7.8 cuLaunch().............................................................................................................................117
E.7.9 cuLaunchGrid().....................................................................................................................117
E.8 Memory
..................................................................................................................................117
E.8.1 cuMemGetInfo().....................................................................................................................117
E.8.2 cuMemAlloc() .......................................................................................................................118
E.8.3 cuMemAllocPitch()...............................................................................................................118
E.8.4 cuMemFree()............................................................................................................................118
E.8.5 cuMemAllocHost().................................................................................................................118
E.8.6 cuMemFreeHost()...................................................................................................................119
E.8.7 cuMemGetAddressRange()....................................................................................................119
E.8.8 cuArrayCreate()...................................................................................................................119
E.8.9 cuArrayGetDescriptor()....................................................................................................120
E.8.10 cuArrayDestroy()...............................................................................................................121
E.8.11 cuMemset()........................................................................................................................... 121
E.8.12 cuMemset2D() .......................................................................................................................121
E.8.13 cuMemcpyHtoD()...................................................................................................................121
E.8.14 cuMemcpyDtoH()..................................................................................................................122
E.8.15 cuMemcpyDtoD()...................................................................................................................122
E.8.16 cuMemcpyDtoA()...................................................................................................................122
E.8.17 cuMemcpyAtoD()...................................................................................................................123
E.8.18 cuMemcpyAtoH()...................................................................................................................123
E.8.19 cuMemcpyHtoA()...................................................................................................................123
E.8.20 cuMemcpyAtoA()...................................................................................................................124
E.8.21 cuMemcpy2D() ......................................................................................................................124
E.9 Texture Reference
.................................................................................................................126
E.9.1 cuTexRefCreate().................................................................................................................126
E.9.2 cuTexRefDestroy()...............................................................................................................127
E.9.3 cuTexRefSetArray().............................................................................................................127
E.9.4 cuTexRefSetAddress() .......................................................................................................127
E.9.5 cuTexRefSetFormat() ..........................................................................................................128
E.9.6 cuTexRefSetAddressMode()................................................................................................128
E.9.7 cuTexRefSetFilterMode()..................................................................................................128
- 9 -
E.9.8 cuTexRefSetFlags().............................................................................................................129
E.9.9 cuTexRefGetAddress() ......................................................................................................129
E.9.10 cuTexRefGetArray()...........................................................................................................129
E.9.11 cuTexRefGetAddressMode()..............................................................................................129
E.9.12 cuTexRefGetFilterMode()................................................................................................129
E.9.13 cuTexRefGetFormat() .......................................................................................................130
E.9.14 cuTexRefGetFlags()...........................................................................................................130
E.10 OpenGLInteroperability .............................................................................................................. 130
E.10.1 cuGLInit()........................................................................................................................... 130
E.10.2 cuGLRegisterBufferObject() ........................................................................................130
E.10.3 cuGLMapBufferObject()....................................................................................................130
E.10.4 cuGLUnmapBufferObject()................................................................................................131
E.10.5 cuGLUnregisterBufferObject().....................................................................................131
E.11 Direct3DInteroperability ............................................................................................................. 131
E.11.1 cuD3D9Begin().....................................................................................................................131
E.11.2 cuD3D9End()..........................................................................................................................131
E.11.3 cuD3D9RegisterVertexBuffer() ...................................................................................131
E.11.4 cuD3D9MapVertexBuffer()................................................................................................131
E.11.5 cuD3D9UnmapVertexBuffer()...........................................................................................132
E.11.6 cuD3D9UnregisterVertexBuffer()................................................................................132
E.11.7 cuD3D9GetDevice()............................................................................................................132
F TextureFetching .................................................................................................................... 133
F.1 Nearest-Point
F.2
线
F.3
........................................................................................................................................... 136
..................................................................................................................... 134
..................................................................................................................................... 135
- 10 -
Chapter 1 介绍CUDA
1.1
1-1
GPU
1-1CPU GPU
GPU
1-2 GPU
1-2
- 11 -
更加具体地看,GPU 是特别适合于并行数据运算的问题-同一个程序在许多并行数据元素,
并带有高运算密度(算术运算与内存操作的比例)。由于同一个程序要执行每个数据元素,
降低了对复杂的流量控制要求; 并且,因为它执行许多数据元素并且据有高运算密度,内存
访问的延迟可以被忽略。
并行数据处理,意味着数据元素以并行线程处理。许多处理大量数据集,例如数组的应用程
序可以使用一个并行数据的编程模型来加速计算。在3D 渲染上,大的像素集和顶点被映射
到并行线程。同样,图像和媒体处理的应用程序例如着色的图像后处理,录像编码和解码,
图像缩放比例,立体视觉,以及图像识别也可以映射图像块和像素到并行处理线程。实际上,
在图像着色和处理领域外的许多算法同样可以通过并行数据处理得到加速,从一般信号处理
或物理模拟到金融计算或者生物计算。
然而直到今天,尽管强大的计算能力包装进了GPU,而它对非图形应用的有效支持依然有限:
GPU 只能通过图型API 来编程,导致新手很难学习和非图形API 上很不充分的应用。
GPU DRAM 可以用一般方式下读取,GPU 程序可以从任何DRAM 部分收集数据元素。
但不可写,在一般方式下的GPU 程序不能写入信息到DRAM 的任何部分,相比CPU 丧失
了很多编程的灵活性。
有些应用是由于DRAM 内存带宽而形成的瓶颈,未能充分利用GPU 的计算能力。
本文描述的是一个崭新的硬件和编程模型,它直接答复了这些问题并且展示了GPU 如何成
为一个真正的通用并行数据计算设备。
1.2 CUDA
GPU
GPU
API
便
CUDACompute Unified Device Architecture
GPU
CUDA
G80
访
GPU
CUDA 软件堆栈由几层组成,如图1-3 所示:一个硬件驱动程序,一个应用程序编程接口(API)
和它的Runtime, 还有二个高级的通用数学库,CUFFT CUBLAS。硬件被设计成支持轻
量级的驱动和
Runtime 层面,因而提高性能。
- 12 -
CUDA API
C 便
1-3
1-4
CPU
CUDA
DRAM
DRAM
1-4
- 13 -
CUDA
1-5
On-chip
线
DRAM overfetch round-trips
DRAM
1-5。共享内存使数据更接近ALU
- 14 -
Chapter 2 编程模型
2.1
线
CUDA
GPU
线
使
API
DRAM
DRAM
DRAM
2.2
线
线线
kernel
kernel
2-1
CPU
线
(DMA)
kernel
2-1
kernel
线线
线
- 15 -
2.2.1
线
线线
Kernel
线线
线
(
D
x
ID
D
(
D
y
D
z
)
线
ID
线线
x
D
y
)
2.2.2
线
线
线
(
xyz
(x, y)
线
)
线
ID
ID (x +
ID (x +
2 -3-
y D
y D
访
x
)
x
+
z DxD
y
)
线线
线
kernel
线
kernel
线
线
kernel
.
每个块是由它的块ID 确定的,块的ID 是在栅格之内的块编号。根据块ID 可以帮助进行
ٛ
复杂寻址,一个应用程序可也以指定一个栅格作为任意大小的一个二维数组,并且通过一个
2-组件索引替换来制定每个块。对于一个大小为 (
块的
ID (x +
y D
x
)
D
x
D
y
)
块,这个块的索引是(x,y),
- 16 -
2.3
线
使
DRAM On-Chip
线
线
kernel
2-2
线
使
2-2
DRAM On-chip
- 17 -
Chapter 3 硬件实
3.1
on-chip
SIMD
使
32
访
3-1
使
on-chip
(SIMD)
3-1
- 18 -
3.2
线
线
on-chip
线
线
warps;
; 线
warp
线
warp
active
warp
warp
active
使
warp
0
;
warp
线线
访
SIMD
SIMD
线
warp
线
线
使线线
线
在一个块内的warp 序是未定的,但通过协调全局或者共享内存的存取,它可以同
的执行。如一个通过warp 线程执行的指写入全局或共享内存的同一位置,写的序是
未定的。
线
线
- 19 -
3.3
A
3.4
A
1.x
1
为一个应用程序使用多GPU 作为CUDA 设备,必须保证这些GPU 是一样的型。如果系
统工作在SLI 模式下,那么只有一个GPU 可以作为CUDA 设备,由于所有的GPU 在驱动
堆栈层的融合了。SLI 模式要在控制面板中关闭,这样多个GPU 作为CUDA
设备。
3.5
GPU 指定一些DRAM 来存primary surface 的内,这些内被用于输出。如
户改变显示的分辨率或者色那么primary surface 的存储需求量将改变。比如,如
户将显示分辨率1280x1024x32bit1600x1200x32bit系统必须指定7.68MB
primary surface 而不在是5.24MB。(使全屏抗锯齿的应用程序要更多的primary surface
空间)。另外,比如在Windows 中使用Alt+Tab切换,或者Ctrl+Alt+Del 的操作同样需要
额外的primary surface 空间。
模式切换增加了primary surface 的内存空间系统将占CUDA 指定的内存空间,导
致程序崩溃
- 20 -
Chapter 4 应用程序编程接口(API
4.1
C
CUDA
C
C
4.2
;
一个runtime 库分成:
一个主机组件,在4.5 部分描述,它在主机上运行并且提供函数来控制和访问一个
或多个计算设备;
一个设备组件,在4.4 部分描述,它在设备运行并且提特定设备的数;
一个共的组件,在第4.3 部分描述,它提置矢型和主机与设备编码支持
C 标准库的一个集。
的是,只有来自C 标准库的数支持在设备上运行,是由Runtime 的组件提
数。
4.2 语言扩展
C 语言的展是四重的:
函数型限定指定一个数是执行在主机或者执行在设备,和是主机或者从设
备上用(4.2.1 部分);
型限定指定设备上一个量的内存位置(第4.2.2 部分);
一个新的指指定一个来自主机的kernel 如何在设备上执行 (4.2.3 部分);
个内量指定栅格和块的维数,还有块和线程的指标 (第4.2.4 部分)
每个包CUDA 语言扩展的文件必须通过CUDA 译器nvcc ,在4.2.5 部分
地加以描述。 nvcc 的一个详细的描述可以在一单独的文件中找到。
nvcc
- 21 -
4.2.1
4.2.1.1 __device__
__device__限定词声明一个数是:
在设备上执行的,
可从设备用。
4.2.1.2 __global__
__global__
4.2.1.3 __host__
__host__
__host__限定词声明数是:
__host____host__
__host__
__host____host__
kernel
主机上执行的,
可从主机调用。
声明一个带有__host__
__global__
__global__限定词;其他情况下这个主机
__global____global__
然而,__host__
__host__ 限定也可以用于与__device__
__host__ __host__
__host__限定或者声明有任何__ho
__host____host__
__device__限定的组合,这情况下,这个数是
__device____device__
主机和设备方编
__host__
st__, __device__
__ho__ho
st__st__
__device__,或
__device____device__
4.2.1.4
__device____global__
__device____global__
__device____global__
__device__
不能一使用__global____host__限定
;
- 22 -
__global__
__global__
4.2.3
__global__
4.2.2
4.2.2.1 __device__
__device__
线
void
__global__
runtime
访
__global__
__device__
256
4.2.2.2 __constant__
__constant__
线
4.2.2.3 __shared__
__shared__
__shared__限定,与__device__
__shared____shared__
__shared__
__shared__ __shared__
__device__起选择使用,声明一个量:
__device____device__
__device__
runtime
使
访
驻留在线程块的共享内存空间中
具有块的生存
只有块之内的所有线程是可访问的。
在线程共享的量有完全序一致性。只有执行过一个__syncthreads
__syncthreads()数,从其他
__syncthreads__syncthreads
线程的写才保证量被定为可挥发的,否则只要一个状态,编译器将自
由的优化共享内存的读写。
- 23 -
当声明一个在共享内存的量作为一个外部数组,例如
extern_shared_float shared[]
(
4.2.3
)
offset(
)
short array0[128];
float array1[64];
int array2[256];
extern __shared__ short array[];
__device__ void func() //__device__or__global__function
{
Short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
4.2.2.4
__shared____constant__
__device__,__shared____constant__
__device____constant__
__constant__
struct union
extern
runtime
4.5.3.6
__shared__
ld.local st.local
lmem
ptx
(
- ptx -keep
使
访
使
使
--ptxas-option=-v
.local
使
使
4.5.2.3
)
使
- 24 -
只能通过设备__device__,__shared____constant__变量来取地址。
__device____constant__变量的地址只能通过主机代得,cudaGetSymbolAddress()
参见4.5.2.3 部分。
4.2.3
__global__
stream
4.5.1.5
streams
S>>>
Dg
Dg dim3
Dg Dg
dim3 (参见4.3.1.2 部分)并且指定栅格的维数和大小,这样Dg.x * Dg.y
dim3dim3
发送的块的数量;
Db
Db dim3
Db Db
Db.z
Db.z 于每个块的线程数量;
Db.z Db.z
Ns
S
__global__ void Func(float* parameter);
;
dim3 (参见4.3.1.2 部分)并且指定每个块的维数和大小,这样Db.x * Db.y *
dim3 dim3
size_t
cudaStream_t
; Ns
0
streamS
使
0
<<< Dg, Db, Ns,
Dg.x * Dg.y
Dg.x * Dg.y Dg.x * Dg.y
Db.x * Db.y *
Db.x * Db.y * Db.x * Db.y *
4.2.2.3
Func<<< Dg, Db, Ns >>>(parameter);
- 25 -
Dg Db
4.2.4
4.2.4.1 gridDim
dim3 (
4.2.4.2 blockIdx
uint3 (
4.2.4.3 blockDim
dim3 (
4.2.4.4 threadIdx
uint3 (
A.1
Ns
4.3.1.2
4.3.1.1
4.3.1.2
4.3.1.1
)
)
)
)
线
4.2.4.5
4.2.5 NVCC
nvcc
CUDA
nvcc
cubin
使
使
CUDA
C
API
cubin
(
- 26 -
4.5.3
)
CUDA Runtime
4.2.3
cubin
Kernel(
4.5.2
)
C++
使
non-void
CUDA
C
C++
void
C++
nvcc
4.2.5.1 __noinline__
__device__
inline
C++
malloc()
__noinline__
classes, inheritance
使
NVCC
C++
typecast
inline
__noinline__
4.2.5.2 #pragma unroll
:
#pragma unroll
#pragma unroll 5
For (int i = 0; i < n; ++i)
循环将5 次。请自行确定展动作不会影响到程序的正确性。
#pragma unroll 后面附值行程计数为循环完全否则
- 27 -
4.3
Runtime
4.3.1
Runtime
使
4.3.1.1 char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1,
short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3,
uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4,
float1, float2, float3, float4
y, z, w
访
1
make_<type name>
2
3
4
;
x,
int2 make_int2(int x, int y);
(xy)
int2
4.3.1.2 dim3
uint3
dim3
1
4.3.2
使
B-1
C/C++
C runtime
4.3.3
clock_t clock();
kernel
线线
线
线
- 28 -
4.3.4
CUDA
Texture reference
Texture reference
texture fetches
texture referece
使
4.5.2.6 4.5.3.9
5.4
使
GPU
使
fetch
kernel
kernel
4.4.5
使
texture reference
Texture fetch
texelstexture elements
runtime
4.3.4.1 Texture Reference
texture reference
fetch
texture
texture reference
Texrure<Type, Dim, ReadMode> texRef;
此时:
Type
Type 指定的数据型是在时返回的; Type
Type Type
4.3.1.1 部分定的所有的型;
DDDDim
im 指定texture reference 的维数,它12; Dim
im im
ReadMode
ReadMode cudaReadModeNormalizedFloat
ReadMode ReadMode
cudaReadModeNormalizedFloat
cudaReadModeNormalizedFloat cudaReadModeElementType
cudaReadModeNormalizedFloat cudaReadModeNormalizedFloat
Type
Type 被限定在本的型和型和
Type Type
Dim 的是默认1 的一个可选自变量;
Dim Dim
cudaReadModeElementType;它是
cudaReadModeElementTypecudaReadModeElementType
16-bit 8-bit
texture reference
- 29 -
[-1.01.0]
unsigned
cudaReadModeElementType
ReadMode
cudaReadModeElementType
4.3.4.2 Runtime Texture Reference
runtime API 4.5.3.9 driver API
texture reference
)
[0N)
x [063]y [031]
64x32
[0N)N
0xff
Normalized
normalized
[0.01.0]signed
8-bit
1
runtime
normalized
x [
0.01.0
(
[
0.01.0
)
y [
4.5.2.6
64x32
)
0.01.0
)
Normalized
[0 N)
0 0N N-1使
0.75
[
0.01.0
使
)
线
texel
bilinear
F
unnormalized
normalized
texel
normalized
"
warp
"
Warp
1.25
texel
0.25
-
1.25
- 30 -
4.3.4.3
线
CUDA
一个理可以被在线性的内存或者一个CUDA 数组参见部分4.5.1.2)。
理分在线性内存
只有维数为1 时;
不支持理过滤;
只能使用non-normalized 坐标寻址
不能支持不同的寻址模式:超出范围理访问返回0
4.4
Runtime
runtime
4.4.1
B-1
Runtime
;
__
(
4.4.2
void __syncthreads();
线线
__syncthreads()
__syncthreads()
__sin(x)
线
)
(
-use_fast_math
线线
访
B-2
)
read-after-write, write-after-read,
访
线
write-after-write
穿
访
- 31 -
4.4.3
rn
rz
ru
rd
IEEE-754
int __float2int_[rn,rz,ru,rd](float);
Unsignde int __float2unit_[rn,rz,ru,zd](float);
float __int2float_[rn,rz,ru,rd](int);
float __int2float_[rn,rz,ru,rd](unsigned int);
用指定的入模式转换无符数到点数。
4.4.4 Type Casting
float __int_as_float(int);
type cast
-2
__int_as_float(0xC0000000)
int __float_as_int(float);
0x3f800000
type cast
__float_as_int (1.0f)
- 32 -
4.4.5
4.4.5.1
tex1Dfetch()
访
template<class Type>
Type tex1Dfetch(
texture<Type, 1, cudaReadModeElementType> texRef,
int x);
float tex1Dfetch(
texture<unsigned char, 1, cudaReadModeNormalizedFloat> texRef,
int x);
float tex1Dfetch(
texture<signed char, 1, cudaReadModeNormalizedFloat> texRef,
int x);
float tex1Dfetch(
texture<unsigned short, 1, cudaReadModeNormalizedFloat> texRef,
int x);
float tex1Dfetch(
texture<signed short, 1, cudaReadModeNormalizedFloat> texRef,
int x);
x
2-4-
线
texture reference texRef
32-bit
float4 tex1Dfetch(
texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef,
int x);
x
线
texture reference texRef
4.4.5.2 CUDA
CUDA
tex1D()tex2D()
访
:
template<class Type, enum cudaTextureReadMode readMode>
Type tex1D(texture<Type, 1, readMode> texRef, float x);
template<class Type, enum cudaTextureReadMode readMode>
Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
x y
CUDA
texture reference texRef
Texture
reference
的编译时定的)和运行(可的)的定了,坐标如何被解时将些处理生,和返回参见部分4.3.4.1 4.3.4.2)。
- 33 -
4.4.6
1.1
使C
线
32-bit
32-bit
32-bit
4.5
Runtime
Runtime
使
Context
--
线
访
atomicAdd()
Texture reference
OpenGL Direct3D
API
API
API
API
CUDA
API
CUDA runtime API
CUDA
API
这些API 排斥:一个应用程序应该选择其中之一来使用。
CUDA runtime
C
API
CUDA runtime(
context
4.2.5
)
Nvcc
使
CUDA runtime
- 34 -
CUDA
API
使
Kernel
4.2.3
仿(
CUDA
API
CUDA runtime API
cuda
4.5.1
4.5.1.1
kernel
API
4.5.2.2
cubin
cudart
cuda
(
kernel
4.2.5
4.5.2.9
)
)
使
CUDA
cu
使
4.5.3.2
CUDA
API
API
线线
线
CUDA
线使
线
runtime
4.5.1.2
线
线使
CUDA
32-bit
CUDA
4
API
)
32
CUDA
8-
16- 32-bit
kernel
1
16-
(
CUDA
线
CUDA
4.5.2.3 4.5.3.6
2
- 35 -
使
page-locked
4.5.1.3 OpenGL Interoperability
malloc()
page-locked
pageable
D.5.6 D.5.7E.8.5 E.8.6
runtime
page-locked
page-locked
OpenGL 缓冲器可以被映射到CUDA 地址空间,使CUDA读取被OpenGL 写入的数据,
或者使CUDA写入被OpenGL 消耗的数据。 4.5.2.7 部分描述了在runtime API 下如何使
用,4.5.3.10 部分描述了驱动API 下如何使用。
4.5.1.4 Direct3D Interoperability
Direct3D 9.0
CUDA
使
CUDA
Direct3D
使
CUDA
API
Direct3D
使
4.5.2.8
runtime API
使
4.5.3.11
一个CUDA context 每次只可以互用一个Direct3D 设备,通过把begin/end 函数括起来调用。
参见4.5.2.84.5.3.11 部分的描述。
CUDA context Direct3D
Direct3D
API
使
cuD3D9GetDevice()
runtime API
使
E.11.7
GPU
cudaD3D9GetDevice()
CUDA
D.9.7
使
Direct3D
使
D3DCREATE_HARDWARE_VERTEXPROCESSING
CUDA 不支持:
Direct3D 9.0 之外的本,
顶点缓冲器之外的
Direct3D
- 36 -
cudaD3D9GetDevice()cuD3D9GetDevice()
Direct3D loading balance CUDA over interoperability
4.5.1.5
使
Kernel
CU_DEVICE_ATTRIBUTE_GPU_OVERLAP
__global__
page-locked
runtime
cuGridLaunch()cuGridLaunchAsync()
Async
E.2.6
Direct3D
cuDeviceGetAttribute()
cudaMallocPitch()
CUDA context
4.5.2.3
cuMemAllocPitch()
4.5.3.6
stream
4.5.2.4
runtime API
使
kernel
cudaStreamQuery()cuStreamQuery()
D.3.2 E.5.2
cudaStreamSynchronize() cuStreamSynchronize()
CUDA
2D
kernel
4.5.3.7
API
使
使
E.5.2 E.5.3
- 37 -
cudaThreadSynchronize()cuThreadSynchronize()
D.2.1 E.3.5
runtime
4.5.3.8
API
使
page-locked
CUDA
CUDA_LAUNCH_BLOCKING
debug
4.5.2.5
1
runtime API
runtime
使
4.5.2 Runtime API
4.5.2 Runtime API
4.5.2 Runtime API 4.5.2 Runtime API
4.5.2.1
Runtime
RuntimeAPI
;
Runtime
Runtime
4.5.2.2
cudaGetDeviceCount()cudaGetDeviceProperties()
D.1
int deviceCount;
cudaGetDeviceCount(&deviceCount);
Int device;
for (device = 0; device < deviceCount; ++device) {
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp,device);
}
- 38 -
cudaSetDevice()
cudaSetDevice(device);
线
一个设备必须在任何__global__数或者所有来自附D 的任何用之前选择;
device 0 动地被选择,并且所有后的设备选择将效的。
4.5.2.3
D.5 数用来分放设备内存,访问在全局内存任意声明量分的内存, 和从主机内存到设备内存之的数据传输
线
线
float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(float));
2D
cudaMalloc()cudaMallocPitch()
使
cudaMallocPitch()
pitch
访
256
访
x
cudaFree()
2D
2D
// host code
float* devPtr;
int pitch;
cudaMallocPitch((void**)&devPtr, &pitch,
width*sizeof(float),height);
myKernel<<<100,512>>>(devPtr,pitch);
// device code
__global__ void myKernel(float* devPtr, int pitch)
{
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
float element = row[c];
}
}
}
CUDA
cudaMallocArray()
cudaCreateChannelDesc()
cudaFreeArray()
- 39 -
cudaMallocArray()
x
cudaChannelFormatDescchannelDesc=
cudaCreateChannelDesc<float>();
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
32-bit
2D
cudaGetSymbolAddress()用来得指向全局内存一个声明量分的内存地址。分内 存的大小用cudaGetSymbolSize()取得。
D.5 列出了不同的数用来拷贝内存,包用cudaMalloc()的线性内存,用
cudaMallocPitch() 的线性内存,CUDA 数组,全局变量分的内存或常驻内存。
cudaMemcpy2DToArray(cuArray, 0, 0, devPtr, pitch,
2D
CUDA
width * sizeof(float),height,
cudaMemcpyDeviceToDevice);
float data[256];
int size = sizeof(data);
float* devPtr;
cudaMalloc((void**)&devPtr, size);
cudaMemcpy(devPtr, data, size, cudaMemcpyHostToDevice);
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
4.5.2.4
D.3
cudaStream_t stream[2];
for(inti=0;i<2;++i)
cudaStreamCreate(&stream[i]);
- 40 -
for (int i = 0; i < 2; ++i)
for (int i = 0; i < 2; ++i)
for (int i = 0; i < 2; ++i)
cudaThreadSynchronize();
hostPtr
float*hostPtr;
cudaMallocHost((void**)&hostPtr,2*size;
cudaThreadSynchronize()
4.5.2.5
cudaEvent_tstart,sto;
cudaEventCreate(&stat);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
for (int i = 0; i < 2; ++i)
for (int i = 0; i < 2; ++i)
cudaMemcpyAsync(inputDevPtr+i*size,hostPtr+i*size,
myKernel<<<100, 512, 0, stream[i]>>>
(outputDevPtr + i * size, inputDevPtr + i * size, size);
cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
size, cudaMemcpyDeviceToHost, stream[i]);
inputDevPtr
D.4
cudaMemcpyAsync(inputDev + i * size, inputHost + i * size,
myKernel<<<100, 512, 0, stream[i]>>>
(outputDev + i * size, inputDev + i * size, size);
size,cudaMemcpyHostToDevice,stream[i]);
hostPtr
size, cudaMemcpyHostToDevice, stream[i]);
outputDevPtr hostPtr
inputDevPtr
hostPtr
cudaMemcpyAsync(outputHost + i * size, outputDev + i * size,,,,
size, cudaMemcpyDeviceToHost, stream[i]); cudaEventRecord(stop, 0); cudaEventSynchronize(stop); float elapsedTime; cudaEventElapsedTime(&elapsedTime, start, stop);
kernel
myKernel()
使
page-locked
- 41 -
4.5.2.6 Texture Reference
D.6
API
texture reference
texture Reference
struct textureReference
{
int normalized;
enum cudaTextureFilterMode filterMode;
enum cudaTextureAddressMode addressMode[2];
struct cudaChannelFormatDesc channelDesc;
}
normalized
[0,1]
normalized
[0,width-1][0,height-1]
filterMode
API
width height
cudaFilterModePoint cudaFilterModeLinear
cudaFilterModeLinear
texel
texel
cudaFilterModeLinear
addressMode
cudaAddressModeClamp cudaAddressModeWrap
cudaAddressModeWrap
channelDesc
normalized
struct cudaChannelFormatDesc {
int x, y, z, w;
cudaFilterModePoint
texel
线
addressMode
cudaAddressModeClamp
cudaAddressModeWrap
enum cudaChannelFormatKind f;
};
- 42 -
x,y,z,w
f
cudaChannelFormatKindSigned
cudaChannelFormatKindUnsigned
cudaChannelFormatKindFloat
normalized
normalizedaddressMode
normalizednormalized
addressMode,和filterMode
addressModeaddressMode
filterMode 可以在主机代直接修改。它只应用在
filterMode filterMode
CUDA 数组的texture reference
cudaBindTexture()cudaBindTextureToArray()
kernel
使
texture<float, 1, cudaReadModeElementType> texRef;
textureReference* texRefPtr;
cudaGetTextureReference(&texRefPtr, “texRef”);
cudaChannelFormatDesc channelDesc =
texture reference
API
texture reference
texture reference
devPtr
线
使
cudaCreateChannelDesc<float>();
cudaBindTexture(0, texRefPtr, devPtr, &channelDesc, size);
使
texture<float, 1, cudaReadModeElementType> texRef;
cudaBindTexture(0, texRef, devPtr, size);
使
texture<float, 2, cudaReadModeElementType> texRef;
textureReference* texRefPtr;
cudaGetTextureReference(&texRefPtr, “texRef”);
cudaChannelFormatDesc channelDesc;
cudaGetChannelDesc(&channelDesc, cuArray);
cudaBindTextureToArray(texRef, cuArray, &channelDesc);
使
texture<float, 2, cudaReadModeElementType> texRef;
API
API
API
texture reference CUDA
cuArray
cudaBindTextureToArray(texRef, cuArray);
- 43 -
texture reference
texture reference
cudaUnbindTexture()
4.5.2.7 OpenGL Interoperability
D.8
GLuint bufferObj;
cudaGLRegisterBufferObject(bufferObj);
OpenGL
texture reference
CUDA
使
cudaGLRegisterBufferObject()
注册以后,缓冲可以通过kenrel 使用设备内存读取或写入,内存设备的地址通过
cudaGLMapBufferObject()返回
GLuint bufferObj;
float* devPtr;
cudaGLMapBufferObject((void**)&devPtr, bufferObj);
4.5.2.8 Direct3D Interoperability
D.9
Direct3D
LPDIRECT3DVERTEXBUFFER9 vertexBuffer;
cudaD3D9RegisterVertexBuffer(vertexBuffer);
cudaD3D9MapVertexBuffer()
LPDIRECT3DVERTEXBUFFER9 vertexBuffer;
float* devPtr;
cudaD3D9MapVertexBuffer((void**)&devPtr, vertexBuffer);
cudaGLUnmapBufferObject()cudaGLUnregisterBufferObject()
Direct3D
cudaD3D9Begin()
cudaD3D9UnmapVertexBuffer()
kenrel
CUDA
使
使
cudaD3D9RegisterVertexBuffer()
cudaD3D9End()
cudaD3D9UnregisterVertexBuffer()
- 44 -
4.5.2.9
(使
使
仿
线
仿使
使仿
-deviceemu
仿
线
线线
)
__DEVICE_EMULATION__
线
使
runtime
256 KB
仿线线
仿
cudaErrorMixedDeviceExxcution
仿
runtime
使
由于设备码被编主机上运行,这个码可以在设备不能运行的充进来,
使
像是对文件或者屏幕(printf()等)入和输出操作。
使
runtime
仿仿仿
当内存元在同一时间被栅格之内的多线程访问,运行在设备仿真方式下
与运行在设备上然不同,因为在仿真模式下线程是序的执行。
一个指向主机上的全局内存或者指设备上的主机内存的引用,设备执行几
定在一些未定的方式上失之,设备仿真可以生正确的
- 45 -
仿
特别是,一些主机平台度的确度点计算的中间往往造
成设备仿真模式下,度上的很大不同。当出现这些情况时开发人员尝试以下方法,但
任何一方法不能完全保证工作:
声明一些量因为不定而强制确度存
使用gcc 译器–ffloat-storegcc
使用Visual C++译器/Op /fp
对于Linux 使用_FPU_GETCW()_FPU_SETCW(),对于Windows 使用_controlfp()
数来强制一部分点计算
unsigned int originalCW;
_FPU_GETCW(originalCW);
unsigned int cw = (originalCW & ~0x300) | 0x000;
_FPU_SETCW(cw);
unsigned int originalCW = _controlfp(0, 0);
_controlfp(_PC_24, _MCW_PC);
_FPU_SETCW(originalCW);
_controlfp(originalCW, 0xfffff);
24
(
A)
仿
- 46 -
4.5.3
API
4-1
Object Handle Description
Device CUdevice CUDA-capable device
Context N/A Roughly equivalent to a CPU process
Module CUmodule Roughly equivalent to a dynamic library
Function CUfunction Kernel
Heap memory CUdeviceptr Pointer to device memory
CUDA Array CUarray Opaque container for 1D or 2D data on the device, readable via
Texture reference CUtexref Object that describes how to interpret texture memory data
4.5.3.1
API
CUDA
CUDA
4-1
API
API
texture references
其他数(E)被用之要使用cuInit()初始
4.5.3.2 设备管理
E.2 数用来管理当前系统中的设备。
cuDeviceGetCount()cuDeviceGet()用来枚举这些设备,E.2 中的其他函数用来
性:
int deviceCount;
cuDeviceGetCount(&deviceCount);
int device;
for (int device = 0; device < deviceCount; ++device) {
CUdevice cuDevice;
cuDeviceGet(&cuDevice, device)
int major, minor;
cuDeviceComputeCapability(&major, &minor, cuDevice);
}
4.5.3.3 Context管理
E.3 数用来创建和分CUDA context 一个CUDA context 于一个CPU 处理。在计算API 之内执行的所有和行为被缩 在
CUDA context
,并且context 销毁时系统自动地理这些源。了对例如模块和理引用以外,
- 47 -
每个context 有它立的32bit 地址空间。因,来不同的CUDA context
CUdeviceptr 值引用不同的内存元。
Context 主机线程有一个一一对应的制。一个主机线程当前只可以有一个设备
context一个context 创建cuCtxCreate(),它成为了当前调用的主机线程。
以一个context 操作的CUDA (大多数数不包设备枚举或者
context 管理) 将返回CUDA_ERROR_INVALID_CONTEXT,如果当前的线
程不是一个合法的context
为了进运行在同一个context 授权码之用性,驱动API 一个由 每个确定定的context 的使用量计数。例如,如个库被加使用同一个CUDA
context ,每个库必须调cuCtxAttach()加使用量计数,而且已经context
使用调用cuCtxDetach()减少使用量计数器。当使用量计数器为0 context 就被销毁
了。对大多数库来,应用程序应在加初始库之创建一个CUDA context; 样, 应用程序可以创建一个用于它context,并且库仅简单的操作context 交给它的任
4.5.3.4 模块管理 E.4 数用来加卸载模块,并得模块柄,量的指数的定
模块是动地可加的包设备码和数据的缩包,如同Windows 中的DLL, 它通过
nvcc 输出。名对于所有,包括函数,全局变量和理引用,在模块范围内提,以
便方编写的模块在同一CUDA context 可以用。
下面的示,加一个模块并为kernel 取得一个句柄
CUmodule cuModule;
cuModuleLoad(&cuModule, "myModule.cubin");
CUfunction cuFunction;
cuModuleGetFunction(&cuFunction, cuModule, “myKernel”);
- 48 -
4.5.3.5
线线
cuParam*()
cuLanuch()
E.7
ID
kernel
kernel
cuFuncSetSharedSize()
cuFuncSetBlockShape()
kernel
culanuchGrid()
cuFuncSetBlockShape(cuFunction, blockWidth, blockHeight, 1); int offset = 0; int i; cuParamSeti(cuFunction, offset, i); Offset += sizeof(i); float f; cuParamSetf(cuFunction, offset, f); offset += sizeof(f); char data[32]; cuParamSetv(cuFunction, offset, (void*)data, sizeof(data)); offset += sizeof(data); cuParamSetSize(cuFunction, offset); cuFuncSetSharedSize(cuFunction, numElements * sizeof(float)); cuLaunchGrid(cuFuntion, gridWidth, gridHeight);
4.5.3.6 内存管理
E.8 数用来分放设备内存,并从主机和设备内存之前传输数据。
线性内存通过cuMemAlloc()cuMemAllocPitch()来分cuMemFree()放。
下面的示,在线性内存一个256 点元素的数组:
CUdeviceptr devPtr; cuMemAlloc(&devPtr, 256 * sizeof(float));
2D 数组建议使用cuMemMallocPitch(),从而保证访问行地址,或拷贝2D 数组到设
备内存的域的最佳性能。返回的pitch 必须用来访问数组元素。下面的示,分 一个宽x 高带有点数的2D 数组,和在设备如何循环数组元素:
// host code CUdeviceptr devPtr; int pitch; cuMemAllocPitch(&devPtr, &pitch,
width * sizeof(float), height, 4); cuModuleGetFunction(&cuFunction,cuModule,"myKernel"); cuFuncSetBlockShape(cuFunction, 512, 1, 1); cuParamSeti(cuFunction, 0, devPtr); cuParamSetSize(cuFunction, sizeof(devPtr)); cuLaunchGrid(cuFunction, 100, 1);
- 49 -
// device code __global__ void myKernel(float* devPtr) {
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch); for (int c = 0; c < width; ++c) {
float element = row[c];
}
}
}
CUDA 数组通过cuArrayCreate(),通过cuArrayDestroy()放。
下面的示,分一个宽x 高带有一个32-bit 点数的CUDA 数组:
CUDA_ARRAY_DESCRIPTOR desc; desc.Format = CU_AD_T_FLOAT; desc.NumChannels = 1; 1; desc.Width = width; desc.Height = height; CUarray cuArray; cuArrayCreate(&cuArray, &desc);
E.8 列出了不同的数用来拷贝内存,包用cuMemMalloc()的线性内存,用
cuMemMallocPitch()的线性内存,CUDA 数组。
下面的示,拷贝一个2D 数组到之子中CUDA 数组:
CUDA_MEMCPY2D copyParam; memset(©Param, 0, sizeof(copyParam)); copyParam.dstMemoryType = CU_MEMORYTYPE_ARRAY; copyParam.dstArray = cuArray; copyParam.srcMemoryType = CU_MEMORYTYPE_DEVICE; copyParam.srcDevice = devPtr; copyParam.srcPitch = pitch; copyParam.WidthInBytes = width * sizeof(float); copyParam.Height = height; cuMemcpy2D(©Param);
下面的示,拷贝一些主机内存数组到设备内存:
float data[256]; int size = sizeof(data); CUdeviceptr devPtr; cuMemMalloc(&devPtr, size); cuMemcpyHtoD(devPtr, data, size);
- 50 -
4.5.3.7 流管理
E.5 数用来创建销毁流,并且定一个流的所有操作是否完成。 下面的示,创建两个流:
CUStream stream[2]; for (int i = 0; i < 2; ++i)
cuStreamCreate(&stream[i], 0);
下面的示,每一个流被依执行一:从主机到设备的内存拷贝kernel 动, 从设备到主机的内存拷贝
for (int i = 0; i < 2; ++i)
cuMemcpyHtoDAsync(inputDevPtr + i * size, hostPtr + i * size,
size, stream[i]);
for (int i = 0; i < 2; ++i) {
cuFuncSetBlockShape(cuFunction, 512, 1, 1); int offset = 0;
cuParamSeti(cuFunction, offset, outputDevPtr); offset += sizeof(int); cuParamSeti(cuFunction, offset, inputDevPtr);
offset += sizeof(int); cuParamSeti(cuFunction, offset, size); offset += sizeof(int);
cuParamSetSize(cuFunction, offset);
cuLaunchGridAsync(cuFunction, 100, 1, stream[i]
}
for (int i = 0; i < 2; ++i)
cuMemcpyDtoHAsync(hostPtr + i * size, outputDevPtr + i * size,
size, stream[i]);
cuCtxSynchronize();
hostPtr
float* hostPtr;
cuMemMallocHost((void**)&hostPtr, 2 * size);
cuCtxSynchronize()
4.5.3.8
CUEvent start,
stop;cuEventCreate(&start);
cuEventCreate(&stop);
inputDevPtr
E.6
hostPtr
outputDevPtr hostPtr
inputDevPtr
hostPtr
cuFunction
使
page-locked
- 51 -
cuEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
cuMemcpyHtoDAsync(inputDevPtr + i * size, hostPtr + i * size,
size, stream[i]);
for (int i = 0; i < 2; ++i) {
cuFuncSetBlockShape(cuFunction, 512, 1, 1);
int offset = 0;
cuParamSeti(cuFunction, offset, outputDevPtr);
offset += sizeof(int);
cuParamSeti(cuFunction, offset, inputDevPtr);
offset += sizeof(int);
cuParamSeti(cuFunction, offset, size);
offset += sizeof(int);
cuParamSetSize(cuFunction, offset);
cuLaunchGridAsync(cuFunction, 100, 1, stream[i]
}
for (int i = 0; i < 2; ++i)
cuMemcpyDtoHAsync(hostPtr + i * size, outputDevPtr + i * size,
size, stream[i]);
cuEventRecord(stop, 0);
cuEventSynchronize(stop);
float elapsedTime;
cuEventElapsedTime(&elapsedTime, start, stop);
4.5.3.9 Texture Reference
E.9
kernel
texture reference
cuTexRefSetAddress()cuTexRefSetArray()
cuModule
texture reference
texture reference texRef
texture reference
使
texture<float, 2, cudaReadModeElementType> texRef;
texRef
CUtexref cuTexRef;
cuModuleGetTexRef(&cuTexRef, cuModule, “texRef”);
texture reference
devPtr
线
cuTexRefSetAddress(Null, cuTexRef, devPtr, size);
texture reference CUDA
cuArray
cuTexRefSetArray(cuTexRef, cuArrary, CU_TRSA_OVERRIDE_FORMAT);
E.9 列出种函数用于,设定texture reference 的寻址模式,过模式,格式,和其 他标。绑定一个理到texture reference 的格式必须配声明texture reference 否则取的果将是未定的。
- 52 -
4.5.3.10 OpenGL Interoperability
E.10
OpenGL
OpenGL
cuGLInit()
CUDA
使
cuGLRegisterBufferObject()
GLuint bufferObj;
cuGLRegisterBufferObject(bufferObj);
cuGLMapBufferObject()
kenrel
使
GLuint bufferObj;
CUdeviceptr devPtr;
Int size;
cuGLMapBufferObject(&devPtr, &size, bufferObj);
cuGLUnmapBufferObject()cuGLUnregisterBufferObject()
4.5.3.11 Direct3D Interoperability
E.11
Direct3D
Direct3D
cuD3D9Begin()
CUDA
使
cuD3D9End()
cuD3D9RegisterVertexBuffer()
LPDIRECT3DVERTEXBUFFER9 vertexBuffer;
cuD3D9RegisterVertexBuffer(vertexBuffer);
cuD3D9MapVertexBuffer()
kenrel
使
LPDIRECT3DVERTEXBUFFER9 vertexBuffer;
CUdeviceptr devPtr;
Int size;
cuD3D9MapVertexBuffer(&devPtr, &size, vertexBuffer);
cuD3D9UnmapVertexBuffer()cuD3D9UnregisterVertexBuffer()
- 53 -
Chapter 5 性能指导
5.1
5.1.1
warp
线
warp
线
warp
线
使
使(
线
线;
线
(
5.1.1.1 算术指
5.1.1
5.2
)
5.1.2
)
一个warp ,一个多处理器需要:
4 时钟周执行点相加,点相乘点乘加,数相加,逐字节操作,比小,
大,变换令;
16 钟周执行数,方根,__log(x)(参见B-2)
32 位整16 钟周,但__mul24 __umul24 (参见附录B)提在4 个
有符号的和号的24 位整法。对于将来的架构__[u]mul24 将会32
位整,因建议保留32 位整法以便以后使用。
(i/n)
(i>>log2(n))(i%n)
(i&(n-1))
n
n 2
- 54 -
warp
32
__sin(x)__cos(x)__exp(x)
36
char
short
(
__fdividef(xy)
32
)使
B-1
情况可以通过使用下面方法避免,:
f
1.x
f
使
3.141592653589793f 1.0f 0.5f
20
(
int
使
sinf() logf() expf()
B)
5.1.1.2
(ifswitch, do, for, while)
warp
线
ID
warp
(threadIdx / WSIZE)
warp
使
线
WSIZE warp
- 55 -
3.2
warp
线
warp
warp
4.2.5.2
warp
#pragma unroll
if
switch
使
线
warp
7
4
5.1.1.3
4
warp
访
__shared__ float shared[32];
__device__ float device[32];
shared[threadIdx.x] = device[threadIdx.x];
4
4
400 600
线
5.1.1.4
线
线
__syncthreads
4
400
600
warp
5.1.2
每个内存空间的有效带宽大的取于内存存取模式,在以下分述。
- 56 -
线便线线
线
5.1.2.1
使
32-bit64-bit 128-bit
__device__ type device[32];
type data = device[tid];
type
(
type
4.3.1.1
使
struct __align(8)__ {
float a;
float b;
};
struct __align(16)__ {
float a;
float b;
sizeof(type)
sizeof(type)
float2 float4
4
8,
16
)
__align__(8)__align__(16)
float c;
};
- 57 -
__align__ (16)
16
struct __align(16)__ {
float a;
float b;
float c
float d
float e;
};
128-bit
32-bit
线
half-warp
访
N 线访
HalfWarpBaseAddress + N
half-warp
便
half-warp
HalfWarpBaseAddress
type*
HalfWarpBaseAddress
BaseAddress
256
16*sizeof(type)
HalfWarpBaseAddress-BaseAddress
线
half-warp
访
线
warp
5-1
访
5-2
type
D.5 E.8
16*sizeof(type)
访使
5-3
16*sizeof(type)
half-warp
访
64-bit
访
32-bit
访
128-bit
访
32-bit
访
- 58 -
5-1
- 59 -
访
5-2
访
- 60 -
5-3
- 61 -
访
访
线
type* BaseAddress
BaseAddress + tid
type*
type
BaseAddress
访
width 2D
BaseAddress + width * ty + tx
线
half-warp
warp
线
ID tid
线
访
使
(txty)
线
访
使
;
type
type*
16
width 16
16
E.8
16
cudaMAllocPitch()cuMemAllocPitch()
使
访
5.1.2.2
线
half-warp
线线
线
warp
线
D.5
half-warp
线
- 62 -
5.1.2.3
便
2D
warp
5.1.2.4
访
线
bank
n
n
bank
bank
线
warp
线
访
访
访
n
bank
2
n
bank
2
bank
1.x
warp
bank
32
bank
32-bit
bank
n
bank
bank
16 (
5.1
访
使
32 bits
)
warp
的共享内存求被分成:一个求于warp 部分,而一个求于同warp 的后部分。
- 63 -
,在warp 部分的一个线程和于同一个warp 的后部的线程之分不存在
bank
__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];
线
访
32-bit
线
ID tid
s
在这种情况下,每当s*n bank m 数量的数或着相,每当n 是m/d 数,这里d
m 和s ,线程tid 和tid+n 访问同一个库。因,只有warp 大小的一
是小于或等m/d 时,这里没bank 冲。对于计算兼容1.x 的设备,只有d 等1
换句话说,只有s 数时,因为m 是二的次方,这个转换没有bank 冲
5-4
5-5
线
访
访
5-6
32 bits
bank
访
char
__shared__ char shared[32];
char data = shared[BaseIndex + tid];
shared[0], shared[1], shared[2], shared[3]
访
char data = shared[BaseIndex + 4 * tid];
__shared__ struct type shared[32];
struct type data = shared[BaseIndex + tid];
struct type {
};
访
float x, y, z;
bank
bank
type
32-bit
访
- 64 -
二个单独的内存读操作带有bank ,如type 被定成为,
struct type {
float x, y;
};
因为每被一个度为二个32-bit 访问。
二个单独的内存读操作带有bank ,如typ
type
typtyp
struct type {
float f;
char c;
};
线
广
half-warp
线
访
2
广
32-bit
使
e 被定成为,
e e
32-bit
广
bank
广
5-7
广
广
bank
half-warp
线
访
bank
- 65 -
32-bit
5-4
bank
- 66 -
访
5-5
bank
访
- 67 -
5-6
bank
访
- 68 -
5-7 带有广播制的共享内存访问模式
- 69 -
5.1.2.5
bank
访
read-after-write
线
int4
5.2
线
线线
bank
bank
192
read-after-write
线
线
64
float4
线
线
线
5.1.2.5
线
kernel
kernel
warp
线
--ptxas-options=-v
1.x
线
使
64
kernel
- 70 -
R A
5.3
ceil(T,32) T
occupancy
B
64
线
warp
32
使
warp
T 线
100 ; 1000
192
256
occupancy
使
线
线
CUDA
使
使
使
page-locked
使
5.4 Texture Fetch 对比全局常驻内存读取
通过Texture fetch 的内存读取相比从全局常驻内存读取有几个
是被存的,如texture fetch 中将更高的带宽
全局常驻内存读取内存访问模式的约束参见部分5.1.2.1 5.1.2.2),
从而得更高的性能
寻址计算的延迟更低,从而提高随机访问数据的性能
- 71 -
在一个操作,包装的数据可以通过广播到不同的中;
8-bit 16-bit 入数据可以被转换成在范围[0.0, 1.0][-1.0, 1.0]点数(参见
部分4.3.4.1)。
果纹理是一个CUDA 数组(参见4.3.4.2),硬件提供其它有用的能力对于不同的应用程序,
特别是图像处理:
kernel
5.5
使
使
kernel
kernel
CUDA
线
线
kernel
线
线
__syncthreads()
使
线
kernel
线
- 72 -
kernel
使
使
5.1.2
5.3
5.2
5.1.2.15.1.2.25.1.2.3 5.1.2.4
访
使
使
访
使
访
bank
访
B-2
SIMD
- 73 -
Chapter 6 矩阵乘法的例
6.1
线
线
C
sub
线(A)
(wAhA)(wBwA)
C
block_size
AB
C
sub
16 线
C
sub
C
warp
如图6-1 所示,
C
于二个矩阵乘积:维度(wAblock_size)的子矩阵A具有指向
sub
相同的行,维度(block_sizewA)矩阵B具有指
(
C
5.2
)
相同的。为了适应设备的
sub
C
sub
这二个矩阵分成维度block_size 的许多个正方形矩阵,并且
C
是计算来的
sub
这些方矩阵乘积和。每一个乘积是通过,加二个对应的正方形矩阵全局内存到共享
内存,一线程每每个矩阵的一个元素,然后通过每个线程计算乘积的一个元素
的。每线程收集这些乘积的每个进入一个最后这些结果写到全局内存完成。
(wA / block_size)
AB
- 74 -
6-1
线C
C
线
sub
C
sub
- 75 -
6.2
// Thread block size
#define BLOCK_SIZE 16
// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);
// Host multiplication function // Compute C = A * B // hA is the height of A // wA is the width of A // wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
int size;
// Load A and B to the device
float* Ad;
size = hA * wA * sizeof(float); cudaMalloc((void**)&Ad, size); cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
float* Bd;
size = wA * wB * sizeof(float); cudaMalloc((void**)&Bd, size); cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);
// Allocate C on the device
float* Cd;
size = hA * wB * sizeof(float); cudaMalloc((void**)&Cd, size);
// Compute the execution configuration
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);
// Launch the device computation
Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);
// Read C from the device
cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(Ad); cudaFree(Bd); cudaFree(Cd); }
- 76 -
// Device multiplication function called by Mul() // Compute C = A * B // wA is the width of A // wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
// Block index
int bx = blockIdx.x; int by = blockIdx.y;
// Thread index
int tx = threadIdx.x; int ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
int aBegin = wA * BLOCK_SIZE * by;
// Index of the last sub-matrix of A processed by the block
int aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
int aStep = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
int bBegin = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
int bStep = BLOCK_SIZE * wB;
// The element of the block sub-matrix that is computed // by the thread
float Csub = 0;
// Loop over all the sub-matrices of A and B required // to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
a < aEnd; a += aStep, b += bStep) {
// Shared memory for the sub-matrix of A
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
// Shared memory for the sub-matrix of B
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load the matrices from global memory to shared memory; // each thread loads one element of each matrix
As[ty][tx] = A[a + wA * ty + tx]; Bs[ty][tx] = B[b + wB * ty + tx];
// Synchronize to make sure the matrices are loaded
__syncthreads();
- 77 -
// Multiply the two matrices together;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += As[ty][k] * Bs[k][tx];
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write the block sub-matrix to global memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;
}
6.3
Mul()
Muld()
6.3.1 Mul()
Mul()
A
Mul()
A B
B
使
warpper
C
cudaMalloc()
kernel
Muld()
ABC
使
cudaMemcpy()
Muld()
使
cudaMemcpy()
使
cudaFree()
AB
C
C
ABC
- 78 -
6.3.2 Muld()
Muld()Mul()
使线
Muld()
AB
AB
C
sub
Muld()
C
sub
5.1.2.1 5.1.2.4
Muld()
如在5.1.2.1 部分建议的,wAwB16 数,全局内存确定是集的,因为ab
c BLOCK_SIZE数,BLOCK_SIZE16
15
half-warp
bank
访
访
As[ty][k]
bank
As[ty][tx]Bs[ty][tx]Bs[k][tx]
线
访
ty k
线
线
bank
tx 0
访
- 79 -
A
1.1
CUDA
3.31.x
4.4.6
runtime
- 80 -
4.5.2.2 4.5.3.2
A.1
线x-y-z-
线
线
512
65535
512512
64
Warp
32
线
64KB
8
线
CUDA
warp
8192
16KB
8KB
8KB
24
768
texture reference
16 bank
13
2
15
16
线
Kernel
8
warp
2
CUDA
texture reference
texture reference
2
27
2
2
- 81 -
32
线
A.2 浮点数标准
计算设备循单度的二进制点数IEEE-754 标准,不同的是:
加法和法通被合并成一个乘-加指FMAD
法通过非标准数实现;
方根通过非标准方根数实现;
对于加法和法,只支持通过态舍入模式实入到数和入到不支持
直接入到正/
有动态配置入模式
不支持未的数;浮点算法和比令转换带有级的未操作数到
点操作
Underflow
监测制,是被mask
不支持signaling NaN
一个操作的结果一个或多个NaNNaN 模式是0x7fffffff 。根据IEEE-754R 标
,一个入到
fminf()fmin()fmaxf()
x86
IEEE-754
fmax()
NaN
NaN
- 82 -
B 数学
B.1
B.1
runtime
B-1
B.2
CUDA runtime
IEEE
rintf()
使
CUDA runtime
truncf()ceilf()
min()max()
rintf()
0.5 ulp
-
roundf()
floorf()
roundf()
FMAD
8
- 83 -
- 84 -
- 85 -
B.2
B-2
runtime
GPU
B-1
__
__sinf(x)
__fadd_rz(x,y)
使x y
__fmul_rz(x,y)
使x y
法和__fdividef(x,y)有同样的度,但对于
126
2
< y < 2
__fdividef(x,y) ,而法可以得到正确的。同样的,对于
128
2
,如x 是大,__fdividef(x,y) 的NaN(),规的
128
126
2
< y <
返回无大。
__[u]mul24(x,y)计算24 低有效x y 的乘积,并且32 低有
x 和y 的8 高有效被忽略。
__[u]mulhi(x,y)
x y
64
__[u]mul64hi(x,y)
64
x y
__saturate(x)
x
0
0x
1 1x [01]
__[u]sad(x,y,z)Sum of Absolute Difference
z x y
__clz(x)
32
x
__clzll(x)
64
x 1 1
linux
ffs
__ffsll(x)
__ffsll()
64
x 1 1
0
Linux
ffsll
32
128
64
x
x
x 0
__ffs()
0
x 0
__ffs(x)
- 86 -
- 87 -
C 子函
使
C.1
C.1.1 atomicAdd()
int atomicAdd(int* address, int val);
unsigned int atomicAdd(unsigned int* address,
unsigned int val);
全局内存读取地址为address 32-bit old,计算(old + val),果返回全
内存的同一地址。这个操作由一个操作执行。返回old
C.1.2 atomicSub()
int atomicSub(int* address, int val);
unsigned int atomicSub(unsigned int* address,
unsigned int val);
全局内存读取地址为address 32-bit old,计算(old - val),果返回全
内存的同一地址。这个操作由一个操作执行。返回old
C.1.3 atomicExch()
int atomicExch(int* address, int val);
unsigned int atomicExch(unsigned int* address,
unsigned int val);
float atomicExch(float* address, float val);
全局内存读取地址为address 的32-bit old,存val 返回全局内存中的同一地址。
这二个操作由一个操作执行。返回old
C.1.4 atomicMin()
int atomicMin(int* address, int val);
unsigned int atomicMin(unsigned int* address,
unsigned int val);
全局内存读取地址为address 32-bit old,计算old val 果返
回全局内存的同一地址。这个操作由一个操作执行。返回
- 88 -
old
C.1.5 atomicMax()
int atomicMax(int* address, int val);
unsigned int atomicMax(unsigned int* address,
unsigned int val);
全局内存读取地址为address 32-bit old,计算old val 果返
回全局内存的同一地址。这个操作由一个操作执行。返回old
C.1.6 atomicInc()
unsigned int atomicInc(unsigned int* address,
unsigned int val);
全局内存读取地址为address 32-bit 字old,计算((old >= val) ? 0 :
(old+1))果返回全局内存的同一地址。这个操作由一个操作执行。
old
C.1.7 atomicDec()
unsigned int atomicDec(unsigned int* address,
unsigned int val);
全局内存读取地址为address32-bit 字old,计算(((old ==0)| (old > val)) ?
val : (old-1))果返回全局内存的同一地址。这个操作由一个操作执行。
返回old
C.1.8 atomicCAS()
int atomicCAS(int* address, int compare, int val); unsigned int atomicCAS(unsignedint*address,
unsigned int compare,
unsigned int val);
全局内存读取地址为address 32-bit 字old,计算(old == compare ? val :
old)果返回全局内存的同一地址。这个操作由一个操作执行。返回old
(比置换)。
- 89 -
C.2
C.2.1 atomicAnd()
int atomicAnd(int* address, int val);
unsigned int atomicAnd(unsigned int* address,
unsigned int val);
全局内存读取地址为address 32-bit old,计算(old & val),果返回全
内存的同一地址。这个操作由一个操作执行。返回old
C.2.2 atomicOr()
int atomicOr(int* address, int val);
unsigned int atomicOr(unsigned int* address,
unsigned int val);
全局内存读取地址为address 32-bit old,计算(old | val),果返回全
内存的同一地址。这个操作由一个操作执行。返回old
C.2.3 atomicXor()
int atomicXor(int* address, int val);
unsigned int atomicXor(unsigned int* address,
unsigned int val);
全局内存读取地址为address 32-bit old,计算(old ^ val),果返回全
内存的同一地址。这个操作由一个操作执行。返回
- 90 -
old
D Runtime API Reference
Runtime API
APIcuda_runtime_api.h
APIcuda_runtime.h
C++
D.1
API
C
C++
CUDA
nvcc
API
nvcc
使
C++
D.1.1 cudaGetDeviceCount()
cudaError_t cudaGetDeviceCount(int* count);
返 回 计 算 性 大 于 1.0 的 设 备 数 量 到 指 针 *count 。 如 果 没 有 相 设 备 ,
cudaGetDeviceCount()返回1device 0 支持设备仿真模式。
D.1.2 cudaSetDevice()
cudaError_t cudaSetDevice(int dev);
dev
线
D.1.3 cudaGetDevice()
cudaError_t cudaGetDevice(int* dev);
线
*dev
D.1.4 cudaGetDeviceProperties()
cudaError_t cudaGetDeviceProperties(struct cudaDeviceProp*
prop, int dev);
dev
*prop
- 91 -
cudaGetDeviceProp
struct cudaDeviceProp {
char name[256];
size_t totalGlobalMem;
size_t sharedMemPerBlock;
int regsPerBlock;
int warpSize;
size_t memPitch;
int maxThreadsPerBlock;
int maxThreadsDim[3];
int maxGridSize[3];
size_t totalConstMem;
int major;
int minor;
int clockRate;
size_t textureAlignment;
};
name
totalGlobalMem
sharedMemPerBlock
regsPerBlock
warpSize warp
memPitch
maxThreadsPerBlock
maxThreadsDim[3]
maxGridSize[3]
totalConstMem
major minor
clockRate
cudaMallocPitch()
KHz
byte
byte
线
byte
pitch
textureAlignment
textureAlignment 是对;纹理地址对textureAlignment
textureAlignment textureAlignment
texture fetch 应用offset
D.1.5 cudaChooseDevice()
- 92 -
textureAlignment 字节的,不要在
textureAlignment textureAlignment
cudaError_t cudaChooseDevice(int* dev,
const struct cudaDeviceProp* prop);
*prop
*dev
D.2
线
D.2.1 cudaThreadSynchronize()
cudaError_t cudaThreadSynchronize(void);
阻止直到设备上所有求的任执行cudaThreadSynchronize()返回一个错误
果其中的一个任
D.2.2 cudaThreadExit()
cudaError_t cudaThreadExit(void);
线
runtime
API
D.3
D.3.1 cudaStreamCreate()
cudaError_t cudaStreamCreate(cudaStream_t* stream);
D.3.2 cudaStreamQuery()
cudaError_t cudaStreamQuery(cudaStream_t stream);
runtime
cudasuccess
cudaErrorNotReady
D.3.3 cudaStreamSyncronize()
cudaError_t cudaStreamSyncronize(cudaStream_t stream);
- 93 -
D.3.4 cudaStreamDestroy()
cudaError_t cudaStreamDestroy(cudaStream_t stream);
D.4
D.4.1 cudaEventCreate()
cudaError_t cudaEventCreate(cudaEvent_t* event);
D.4.2 cudaEventRecord()
cudaError_t cudaEventRecord(cudaEvent_t event, CUstream stream);
录一个件。如stream 是非的,所有的操作件被;否则当CUDA
context 所有的操作件被录。由于这个操作是的,必须使用
cudaEventQuery /或cudaEventSyncronize 定何时事件被真的录了。如
cudaEventRecord 用了,并且件还有被录,返回
cudaErrorInvalidValue
D.4.3 cudaEventQuery()
cudaError_t cudaEventQuery(cudaEvent_t event);
cudasuccess
cudaEventQuery()
cudaErrorNotReady
cudaErrorInvalidValue
D.4.4 cudaEventSyncronize()
cudaError_t cudaEventSyncronize(cudaEvent_t event);
cudaErrorInvalidValue
cudaEventRecord()
- 94 -
D.4.5 cudaEventDestroy()
cudaError_t cudaEventDestroy(cudaEvent_t event);
D.4.6 cudaEventElapsedTime()
cudaError_t cudaEventElapsedTime(float* time,
cudaEvent_t start,
cudaEvent_t end);
D.5
D.5.1 cudaMalloc()
cudaError_t cudaMalloc(void** devPtr, size_t count);
stream
millisecond
cudaErrorInvalidValue
在设备上分count 字节的线性内存,并返回内存的指针*devPtr。分的内存适合
任何型的量。如cudaMalloc()返回cudaErrorAlloction
D.5.2 cudaMallocPitch()
cudaError_t cudaMallocPitch(void** devPtr,
size_t* pitch,
size_t widthInBytes,
size_t height);
T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
widthInBytes*height
线
pitch
2D
T
*pitch
*devPtr
Pitch
对于2D 数组的分建议使用cudaMallocPitch()内存。由于pitch 限制限于
硬件,特别是应用程序从设备内存的不同域执行一个2D 的内存拷贝是线性内存
还是
CUDA 数组)。
- 95 -
D.5.3 cudaFree()
cudaError_t cudaFree(void* devPtr);
放被devPtr的内存空间,devPtr 必须带有cudaMalloc()cudaMallocPitch()返回
否则将返回一个错误,或者cudaFree(devPtr)已经在之被使用过。如果devPtr0
不执行任何操作。如果cudaFree()调用失将返回cudaErrorInvalidDevicePointer
D.5.4 cudaMallocArray()
cudaError_t cudaMallocArray(struct cudaArray** array,
const struct cudaChannelFormatDesc* desc,
size_t width, size_t height);
根据cudaChannelFormatDesc
cudaChannelFormatDesc 结构desc
cudaChannelFormatDesc cudaChannelFormatDesc
CUDA 数组的
D.5.5 cudaFreeArray()
cudaError_t cudaFreeArray(struct cudaArray* array);
CUDA
array
desc 一个CUDA 数组,并返回一个在*array
desc desc
array 0
*array 的新
*array *array
D.5.6 cudaMallocHost()
cudaError_t cudaMallocHost(void** hostPtr, size_t size);
一个size 字节可用于设备访问的page-locked 主机内存。驱动程序追踪由这个
拟内存的范围,并动加速数的用,例如cudaMemcpy*()。由于内存可以被
设备直接访问,相比由例如malloc()pagable 内存,很多的读写速度。
但是,分过多的page-locked 内存减少系统可用物理内存的大小,从而降低系统整体的
性能。
D.5.7 cudaFreeHost()
cudaError_t cudaFreeHost(void* hostPtr);
hostPtr
hostPtr
cudaMallocHost()
- 96 -
D.5.8 cudaMemSet()
cudaError_t cudaMemset(void* devPtr, int value, size_t count);
使
D.5.9 cudaMemSet2D()
cudaError_t cudaMemset2D(void* dstPtr, size_t pitch,
value
devPtr
int value, size_t width, size_t height);
count
设定由dstPtr 指的一个矩阵vaulepitch 是由dstPtr 指2D 数组
内存宽度字节2D 数组每行的后包含自充的数(为保证列需求)。
pitch 是由cudaMallocPitch() 返回的值时,执行速度最快
D.5.10 cudaMemcpy()
cudaError_t cudaMemcpy(void* dst, const void* src,
size_t count,
enum cudaMemcpyKind kind);
cudaError_t cudaMemcpyAsync(void* dst,constvoid*src,
size_t count,
enum cudaMemcpyKind kind,
cudaStream_t stream);
拷贝count 字节,从src的内存域到dst的内存域,kind 可以是
cudaMemcpyHostToHostcudaMemcpyHostToDevicecudaMemcpyDeviceToHost,或
cudaMemcpyDeviceToDevice拷贝。内存域可能不会重cudaMemcpy() 使用
匹配拷贝的指针srcdst 将导致结果是未定的。
cudaMemcpyAsync()异步的,并且可以作为一个可选参数通过流使用。它只能应用于
page-locked 主机内存。如使用一个指pagable 的内存指作为入,将返回
错误
- 97 -
D.5.11 cudaMemcpy2D()
cudaError_t cudaMemcpy2D(void* dst, size_t dpitch,
const void* src, size_t spitch,
size_t width, size_t height,
enum cudaMemcpyKind kind);
cudaError_t cudaMemcpy2DAsync(void* dst, size_t dpitch,
const void* src, size_t spitch,
size_t width, size_t height,
enum cudaMemcpyKind kind,
cudaStream_t stream);
拷贝一个矩阵,从src的内存域到dst的内存域,kind 可以是 cudaMemcpyHostToHostcudaMemcpyHostToDevicecudaMemcpyDeviceToHost,或 cudaMemcpyDeviceToDevice拷贝。dpitchspitch 是由dstsrc2D 数组的内存宽度字节。内存域可能不会重cudaMemcpy2D()使用不配拷贝的指 srcdst 导致是未定的。如果dpitchspitch 大于许的大值参见附 D.1.4 memPitch),cudaMemcpy2D()将返回一个错误
cudaMemcpy2DAsync()
使
D.5.12 cudaMemcpyToArray()
cudaError_t cudaMemcpyToArray(struct cudaArray* dstArray,
cudaError_t cudaMemcpyToArrayAsync(struct cudaArray* dstArray,
pagable
size_t dstX, size_t dstY,
const void* src, size_t count,
enum cudaMemcpyKind kind);
size_t dstX, size_t dstY,
const void* src, size_t count,
enum cudaMemcpyKind kind,
cudaStream_t stream);
使
page-locked
拷贝count 字节,从src的内存域到dstArrayCUDA 数组,从数组的dstX, dstYkind 可以是cudaMemcpyHostToHostcudaMemcpyHostToDevicecudaMemcpyDeviceToHost,或cudaMemcpyDeviceToDevice拷贝
cudaMemcpyToArrayAsync()
page-locked
使
pagable
- 98 -
使
D.5.13 cudaMemcpy2DToArray()
cudaError_t cudaMemcpy2DToArray(struct cudaArray* dstArray,
size_t dstX, size_t dstY,
const void* src, size_t spitch,
size_t width, size_t height,
enum cudaMemcpyKind kind);
cudaError_t cudaMemcpy2DToArrayAsync(struct cudaArray* dstArray,
size_t dstX, size_t dstY,
const void* src, size_t spitch,
size_t width, size_t height,
enum cudaMemcpyKind kind,
cudaStream_t stream);
拷贝一个矩阵,从src的内存域到dstArrayCUDA 数组,从数组的dstX, dstYkind 可以是cudaMemcpyHostToHostcudaMemcpyHostToDevice
cudaMemcpyDeviceToHost,或cudaMemcpyDeviceToDevice拷贝spitch 是由src2D 数组的内存宽度字节2D 数组每行的后包含自充的数( 为保证队列需求)。如果spitch 大于许的参见附D.1.4 中memPitch), cudaMemcpy2DToArray()将返回一个错误
cudaMemcpy2DToArrayAsync()的,并且可以作为一个可选参数通过流使用。它只
能应用于page-locked 主机内存。如使用一个指pagable 的内存指作为入,将返回一个错误
D.5.14 cudaMemcpyFromArray()
cudaError_t cudaMemcpyFromArray(void* dst,
cudaError_t cudaMemcpyFromArrayAsync(void* dst,
const struct cudaArray* srcArray,
size_t srcX, size_t srcY,
size_t count,
enum cudaMemcpyKind kind);
const struct cudaArray* srcArray,
size_t srcX, size_t srcY,
size_t count,
enum cudaMemcpyKind kind,
cudaStream_t stream);
拷贝count 字节,从srcArrayCUDA 数组,从数组的srcX, srcY)始,dst的内存域,kind 可以是cudaMemcpyHostToHostcudaMemcpyHostToDevicecudaMemcpyDeviceToHost,或cudaMemcpyDeviceToDevice拷贝
- 99 -
cudaMemcpy2DToArrayAsync()
page-locked
D.5.15 cudaMemcpy2DFromArray()
cudaError_t cudaMemcpy2DFromArray(void* dst, size_t dpitch,
cudaError_t cudaMemcpy2DFromArrayAsync(void* dst, size_t dpitch,
使
pagable
const struct cudaArray* srcArray,
size_t srcX, size_t srcY,
size_t width, size_t height,
enum cudaMemcpyKind kind);
const struct cudaArray* srcArray,
size_t srcX, size_t srcY,
size_t width, size_t height,
enum cudaMemcpyKind kind,
cudaStream_t stream);
使
拷贝一个矩阵,从srcArrayCUDA 数组,从数组的srcX, srcY,到 dst的内存域,kind 可以是cudaMemcpyHostToHostcudaMemcpyHostToDevice cudaMemcpyDeviceToHost,或cudaMemcpyDeviceToDevice拷贝dpitch 是由dst2D 数组的内存宽度字节2D 数组每行的后包含自充的数( 为保证队列需求)。如果dpitch 大于许的参见附D.1.4 中memPitch), cudaMemcpy2DFromArray()将返回一个错误
cudaMemcpy2DFromArrayAsync()
page-locked
D.5.16 cudaMemcpyArrayToArray()
cudaError_t cudaMemcpyArrayToArray(struct cudaArray* dstArray,
拷贝count
count 字节,从srcArray
count count
dstArray
dstArray CUDA 数组,从数组的dstX, dstY),kind 可以是
dstArray dstArray
使
srcArray CUDA 数组,从数组的srcX
srcArray srcArray
pagable
size_t dstX, size_t dstY,
const struct cudaArray* srcArray,
size_t srcX, size_t srcY,
size_t count,
enum cudaMemcpyKind kind);
使
srcX, srcY
srcXsrcX
srcY
srcYsrcY
cudaMemcpyHostToHost
cudaMemcpyDeviceToDevice
cudaMemcpyHostToDevicecudaMemcpyDeviceToHost
- 100 -
Loading...