NVIDIA CUDA User Manual

NVIDIA CUDA 编程指南
NVIDIA
Jan. 2008 Version 1.1
GPU
.....................................................................................................................1
NVIDIA CUDA
Chapter1
1.1
1.2 CUDA
Chapter2
2.1
线
2.2
线
2.2.1
2.2.2
2.3
线 线
Chapter3
3.1
3.2
3.3
3.4
3.5
CUDA…….....................................................................................................................11
.........................................................................................................................1
GPU
............................................................................................................................... 15
.....................................................................................................................15
.......................................................................................................................................15
..........................................................................................................................................16
...................................................................................................................................16
...........................................................................................................................................17
on-chip
................................................................................................................................18
SIMD
...........................................................................................................................................19
........................................................................................................................................20
...............................................................................................................................................20
...........................................................................................................................................20
………………………….............................................11
..............................................................................................12
....................................................................................18
Chapter4
4.1
C
4.2
4.2.1
API ...................................................................................................21
..............................................................................................................................21
...........................................................................................................................................21
......................................................................................................................... 22
4.2.1.1 __device__..................................................................................................................22
4.2.1.2 __global__..........................................................................................................22
4.2.1.3 __host__............................................................................................................22
4.2.1.4
4.2.2
....................................................................................................................22
........................................................................................................................23
4.2.2.1 __device__..................................................................................................23
4.2.2.2 __constant__ .......................................................................................................23
4.2.2.3 __shared__............................................................................................................23
4.2.2.4
4.2.3
4.2.4
.............................................................................................................................24
...........................................................................................................................25
..............................................................................................................................26
4.2.4.1 gridDim.............................................................................................................................. 26
4.2.4.2 blockIdx...........................................................................................................................26
4.2.4.3 blockDim............................................................................................................................26
- 2 -
4.2.4.4 threadIdx......................................................................................................................... 26
4.2.4.5
4.2.5 NVCC
..................................................................................................................................... 26
.................................................................................................................................. 26
4.2.5.1 __noinline__ ...................................................................................................................27
4.2.5.2 #pragmaunroll .................................................................................................................27
4.3
4.3.1
Runtime
........................................................................................................................... 28
............................................................................................................................. 28
4.3.1.1
char1,uchar1,char2,uchar2,char3,uchar3,char4,uchar4,short1,ushort1,short2,us
hort2,short3,ushort3,short4,ushort4,int1,uint1,int2,uint2,int3,uint3,int4,ui
nt4,long1,ulong1,long2,ulong2,long3,ulong3,long4,ulong4,float1,float2,float3
,float4.......................................................................................................................................... 28
4.3.1.2 dim3
4.3.2
4.3.3
4.3.4
4.3.4.1 Texture Reference
4.3.4.2 RuntimeTexture Reference
........................................................................................................................... 28
..................................................................................................................................... 28
..................................................................................................................................... 28
..................................................................................................................................... 29
.......................................................................................................29
.........................................................................................30
4.3.4.3
4.4
4.4.1
4.4.2
4.4.3
4.4.4 TypeCasting
4.4.5
4.4.5.1
4.4.5.2 CUDA
4.4.6
4.5
4.5.1
4.5.1.1
4.5.1.2
线
Runtime
........................................................................................................................... 31
..................................................................................................................................... 31
..................................................................................................................................... 31
............................................................................................................................. 32
..................................................................................................................................... 33
Runtime
...................................................................................................................................... 34
........................................................................................................................... 34
..................................................................................................................................... 35
..................................................................................................................................... 35
..................................................................................................................................... 35
CUDA
.......................................................................................31
........................................................................................................................32
................................................................................................................33
............................................................................................................33
4.5.1.3 OpenGL Interoperability ...................................................................................................... 36
4.5.1.4 Direct3D Interoperability ..................................................................................................... 36
4.5.1.5
...................................................................................................................37
- 3 -
4.5.2 RuntimeAPI.................................................................................................................................... 38
4.5.2.1
4.5.2.2
4.5.2.3
4.5.2.4
4.5.2.5
4.5.2.6 Texture Reference
..................................................................................................................................... 38
..................................................................................................................................... 40
.................................................................................................................................. 38
.................................................................................................................................. 39
.................................................................................................................................. 41
...........................................................................................................42
4.5.2.7 OpenGL Interoperability .......................................................................................................... 44
4.5.2.8 Direct3D Interoperability ......................................................................................................... 44
4.5.2.9
4.5.3
4.5.3.1
4.5.3.2
4.5.3.3 Context
4.5.3.4
4.5.3.5
4.5.3.6
4.5.3.7
使仿
API ........................................................................................................................................ 47
..................................................................................................................................... 47
................................................................................................................................. 47
............................................................................................................................. 47
................................................................................................................................. 48
................................................................................................................................. 49
................................................................................................................................. 49
..................................................................................................................................... 51
.............................................................................................................45
4.5.3.8
4.5.3.9 Texture Reference
................................................................................................................................. 51
..........................................................................................................52
4.5.3.10 OpenGL Interoperability ...................................................................................................... 53
4.5.3.11 Direct3D Interoperability ...................................................................................................... 53
Chapter5
5.1
5.1.1
5.1.1.1
5.1.1.2
5.1.1.3
5.1.1.4
5.1.2
5.1.2.1
5.1.2.2
5.1.2.3
5.1.2.4
5.1.2.5
................................................................................................................................ 54
........................................................................................................................................... 54
................................................................................................................................. 54
............................................................................................................................. 54
.......................................................................................................................... 55
............................................................................................................................. 56
............................................................................................................................. 56
.................................................................................................................................... 56
............................................................................................................................. 57
............................................................................................................................. 62
............................................................................................................................. 63
............................................................................................................................. 63
................................................................................................................................. 70
- 4 -
5.2
线
5.3
.......................................................................................................................... 70
....................................................................................................................71
5.4 Texture Fetch
5.5
Chapter6
6.1
6.2
6.3
............................................................................................................................................... 74
........................................................................................................................................... 76
.........................................................................................71
.......................................................................................................................... 72
....................................................................................................................74
.................................................................................................................................... 78
6.3.1 Mul().................................................................................................................................. 78
6.3.2 Muld()................................................................................................................................ 79
A
A.1
A.2
B
B.1
runtime
B.2
runtime
C
C.1
................................................................................................................................... 80
....................................................................................................................................... 81
.................................................................................................................................... 82
................................................................................................................................... 83
............................................................................................................................. 83
........................................................................................................................... 86
................................................................................................................................... 88
....................................................................................................................................... 88
C.1.1 atomicAdd() ................................................................................................................... 88
C.1.2 atomicSub() ................................................................................................................... 88
C.1.3 atomicExch() ................................................................................................................. 88
C.1.4 atomicMin() ................................................................................................................... 88
C.1.5 atomicMax() ................................................................................................................... 89
C.1.6 atomicInc() ....................................................................................................................89
C.1.7 atomicDec() ................................................................................................................... 89
C.1.8 atomicCAS() ................................................................................................................... 89
C.2
.................................................................................................................................... 90
C.2.1 atomicAnd() ................................................................................................................... 90
C.2.2 atomicOr()..................................................................................................................….. 90
C.2.3 atomicXor() ................................................................................................................... 90
D Runtime API Reference ............................................................................................................ 91
D.1
........................................................................................................................................ 91
D.1.1 cudaGetDeviceCount().......................................................................................................91
D.1.2 cudaSetDevice()..................................................................................................................91
D.1.3 cudaGetDevice()..................................................................................................................91
D.1.4 cudaGetDeviceProperties() .............................................................................91
D.1.5 cudaChooseDevice() .......................................................................................................93
- 5 -
D.2
线
...................................................................................................................................... 93
D.2.1 cudaThreadSynchronize()..................................................................................................93
D.2.2 cudaThreadExit()..................................................................................................................93
D.3
........................................................................................................................................... 93
D.3.1 cudaStreamCreate() ...........................................................................................................93
D.3.2 cudaStreamQuery()...............................................................................................................93
D.3.3 cudaStreamSyncronize().....................................................................................................93
D.3.4 cudaStreamDestroy() .........................................................................................................94
D.4
....................................................................................................................................... 94
D.4.1 cudaEventCreate()...............................................................................................................94
D.4.2 cudaEventRecord()................................................................................................................94
D.4.3 cudaEventQuery()...................................................................................................................94
D.4.4 cudaEventSyncronize()........................................................................................................94
D.4.5 cudaEventDestroy() .............................................................................................................95
D.4.6 cudaEventElapsedTime()......................................................................................................95
D.5
........................................................................................................................................ 95
D.5.1 cudaMalloc() ..........................................................................................................................95
D.5.2 cudaMallocPitch().................................................................................................................95
D.5.3 cudaFree()................................................................................................................................96
D.5.4 cudaMallocArray()................................................................................................................96
D.5.5 cudaFreeArray().....................................................................................................................96
D.5.6 cudaMallocHost()...................................................................................................................96
D.5.7 cudaFreeHost().......................................................................................................................96
D.5.8 cudaMemSet() ..........................................................................................................................97
D.5.9 cudaMemSet2D().......................................................................................................................97
D.5.10 cudaMemcpy() .........................................................................................................................97
D.5.11 cudaMemcpy2D() ...................................................................................................................98
D.5.12 cudaMemcpyToArray() .........................................................................................................98
D.5.13 cudaMemcpy2DToArray().....................................................................................................99
D.5.14 cudaMemcpyFromArray().....................................................................................................99
D.5.15 cudaMemcpy2DFromArray()................................................................................................100
D.5.16 cudaMemcpyArrayToArray()..............................................................................................100
D.5.17 cudaMemcpy2DArrayToArray() ........................................................................................101
D.5.18 cudaMemcpyToSymbol().....................................................................................................101
D.5.19 cudaMemcpyFromSymbol()..................................................................................................101
D.5.20 cudaGetSymbolAddress()..................................................................................................102
D.5.21 cudaGetSymbolSize() .......................................................................................................102
- 6 -
D.6 Texture Reference
D.6.1
API ................................................................................................................................... 102
................................................................................................................102
D.6.1.1 cudaCreateChannelDesc() ............................................................................................102
D.6.1.2 cudaGetChannelDesc()...................................................................................................102
D.6.1.3 cudaGetTextureReference().........................................................................................103
D.6.1.4 cudaBindTexture().........................................................................................................103
D.6.1.5 cudaBindTextureToArray()..........................................................................................103
D.6.1.6 cudaUnBindTexture()......................................................................................................103
D.6.1.7 cudaGetTextureAlignmentOffset()............................................................................104
D.6.2
API .....................................................................................................................................104
D.6.2.1 cudaCreateChannelDesc()...............................................................................................104
D.6.2.2 cudaBindTexture()............................................................................................................104
D.6.2.3 cudaBindTextureToArray().............................................................................................105
D.6.2.4 cudaUnBindTexture() ......................................................................................................105
D.7
...................................................................................................................................... 105
D.7.1 cudaConfigureCall() .........................................................................................................105
D.7.2 cudaLaunch() ........................................................................................................................105
D.7.3 cudaSetupArgument() .........................................................................................................106
D.8 OpenGL Interoperability ............................................................................................................... 106
D.8.1 cudaGLRegisterBufferObject().......................................................................................106
D.8.2 cudaGLMapBufferObject()..................................................................................................106
D.8.3 cudaGLUnMapBufferObject() ............................................................................................106
D.8.4 cudaGLUnRegisterBufferObject()...................................................................................106
D.9 Direct3D Interoperability .............................................................................................................. 107
D.9.1 cudaD3D9Begin()...................................................................................................................107
D.9.2 cudaD3D9End().......................................................................................................................107
D.9.3 cudaD3D9RegisterVertexBuffer()...................................................................................107
D.9.4 cudaD3D9MapVertexBuffer() ............................................................................................107
D.9.5 cudaD3D9UnMapVertexBuffer().........................................................................................107
D.9.6 cudaD3D9UnRegisterVertexBuffer() ............................................................................107
D.9.7 cudaD3D9GetDevice() ........................................................................................................108
D.10
......................................................................................................................................108
D.10.1 cudaGetLastError() .........................................................................................................108
D.10.2 cudaGetErrorString()......................................................................................................108
E DriverAPIReference ................................................................................................................ 109
E.1
........................................................................................................................................... 109
E.1.1 cuInit()................................................................................................................................. 109
- 7 -
E.2
....................................................................................................................................... 109
E.2.1 cuGetDeviceCount().............................................................................................................109
E.2.2 cuDeviceGet() ......................................................................................................................109
E.2.3 cuDeviceGetName()...............................................................................................................109
E.2.4 cuDeviceTotalMem().............................................................................................................109
E.2.5 cuDeviceComputeCapability() .......................................................................................110
E.2.6 cuDeviceGetAttribute()....................................................................................................110
E.2.7 cuDeviceGetProperties()..................................................................................................111
E.3 Context
...................................................................................................................................111
E.3.1 cuCtxCreate() .....................................................................................................................111
E.3.2 cuCtxAttach() .....................................................................................................................112
E.3.3 cuCtxDetach() .....................................................................................................................112
E.3.4 cuCtxGetDevice().................................................................................................................112
E.3.5 cuCtxSynchronize().............................................................................................................112
E.4
.......................................................................................................................................112
E.4.1 cuModuleLoad().....................................................................................................................112
E.4.2 cuModuleLoadData()............................................................................................................113
E.4.3 cuModuleLoadFatBinary().................................................................................................113
E.4.4 cuModuleUnload()................................................................................................................113
E.4.5 cuModuleGetFunction().....................................................................................................113
E.4.6 cuModuleGetGlobal() .......................................................................................................113
E.4.7 cuModuleGetTexRef() .......................................................................................................114
E.5
.........................................................................................................................................114
E.5.1 cuStreamCreate()...............................................................................................................114
E.5.2 cuStreamQuery().................................................................................................................114
E.5.3 cuStreamSynchronize()....................................................................................................114
E.5.4 cuStreamDestroy().............................................................................................................114
E.6
.....................................................................................................................................114
E.6.1 cuEventCreate().................................................................................................................114
E.6.2 cuEventRecord().................................................................................................................115
E.6.3 cuEventQuery()...................................................................................................................115
E.6.4 cuEventSynchronize() ....................................................................................................115
E.6.5 cuEventDestroy()...............................................................................................................115
E.6.6 cuEventElapsedTime() ....................................................................................................115
E.7
.....................................................................................................................................116
E.7.1 cuFuncSetBlockShape()....................................................................................................116
E.7.2 cuFuncSetSharedSize().....................................................................................................116
- 8 -
E.7.3 cuParamSetSize()................................................................................................................116
E.7.4 cuParamSeti() .....................................................................................................................116
E.7.5 cuParamSetf() .....................................................................................................................116
E.7.6 cuParamSetv() .....................................................................................................................116
E.7.7 cuParamSetTexRef()............................................................................................................117
E.7.8 cuLaunch().............................................................................................................................117
E.7.9 cuLaunchGrid().....................................................................................................................117
E.8 Memory
..................................................................................................................................117
E.8.1 cuMemGetInfo().....................................................................................................................117
E.8.2 cuMemAlloc() .......................................................................................................................118
E.8.3 cuMemAllocPitch()...............................................................................................................118
E.8.4 cuMemFree()............................................................................................................................118
E.8.5 cuMemAllocHost().................................................................................................................118
E.8.6 cuMemFreeHost()...................................................................................................................119
E.8.7 cuMemGetAddressRange()....................................................................................................119
E.8.8 cuArrayCreate()...................................................................................................................119
E.8.9 cuArrayGetDescriptor()....................................................................................................120
E.8.10 cuArrayDestroy()...............................................................................................................121
E.8.11 cuMemset()........................................................................................................................... 121
E.8.12 cuMemset2D() .......................................................................................................................121
E.8.13 cuMemcpyHtoD()...................................................................................................................121
E.8.14 cuMemcpyDtoH()..................................................................................................................122
E.8.15 cuMemcpyDtoD()...................................................................................................................122
E.8.16 cuMemcpyDtoA()...................................................................................................................122
E.8.17 cuMemcpyAtoD()...................................................................................................................123
E.8.18 cuMemcpyAtoH()...................................................................................................................123
E.8.19 cuMemcpyHtoA()...................................................................................................................123
E.8.20 cuMemcpyAtoA()...................................................................................................................124
E.8.21 cuMemcpy2D() ......................................................................................................................124
E.9 Texture Reference
.................................................................................................................126
E.9.1 cuTexRefCreate().................................................................................................................126
E.9.2 cuTexRefDestroy()...............................................................................................................127
E.9.3 cuTexRefSetArray().............................................................................................................127
E.9.4 cuTexRefSetAddress() .......................................................................................................127
E.9.5 cuTexRefSetFormat() ..........................................................................................................128
E.9.6 cuTexRefSetAddressMode()................................................................................................128
E.9.7 cuTexRefSetFilterMode()..................................................................................................128
- 9 -
E.9.8 cuTexRefSetFlags().............................................................................................................129
E.9.9 cuTexRefGetAddress() ......................................................................................................129
E.9.10 cuTexRefGetArray()...........................................................................................................129
E.9.11 cuTexRefGetAddressMode()..............................................................................................129
E.9.12 cuTexRefGetFilterMode()................................................................................................129
E.9.13 cuTexRefGetFormat() .......................................................................................................130
E.9.14 cuTexRefGetFlags()...........................................................................................................130
E.10 OpenGLInteroperability .............................................................................................................. 130
E.10.1 cuGLInit()........................................................................................................................... 130
E.10.2 cuGLRegisterBufferObject() ........................................................................................130
E.10.3 cuGLMapBufferObject()....................................................................................................130
E.10.4 cuGLUnmapBufferObject()................................................................................................131
E.10.5 cuGLUnregisterBufferObject().....................................................................................131
E.11 Direct3DInteroperability ............................................................................................................. 131
E.11.1 cuD3D9Begin().....................................................................................................................131
E.11.2 cuD3D9End()..........................................................................................................................131
E.11.3 cuD3D9RegisterVertexBuffer() ...................................................................................131
E.11.4 cuD3D9MapVertexBuffer()................................................................................................131
E.11.5 cuD3D9UnmapVertexBuffer()...........................................................................................132
E.11.6 cuD3D9UnregisterVertexBuffer()................................................................................132
E.11.7 cuD3D9GetDevice()............................................................................................................132
F TextureFetching .................................................................................................................... 133
F.1 Nearest-Point
F.2
线
F.3
........................................................................................................................................... 136
..................................................................................................................... 134
..................................................................................................................................... 135
- 10 -
Chapter 1 介绍CUDA
1.1
1-1
GPU
1-1CPU GPU
GPU
1-2 GPU
1-2
- 11 -
更加具体地看,GPU 是特别适合于并行数据运算的问题-同一个程序在许多并行数据元素,
并带有高运算密度(算术运算与内存操作的比例)。由于同一个程序要执行每个数据元素,
降低了对复杂的流量控制要求; 并且,因为它执行许多数据元素并且据有高运算密度,内存
访问的延迟可以被忽略。
并行数据处理,意味着数据元素以并行线程处理。许多处理大量数据集,例如数组的应用程
序可以使用一个并行数据的编程模型来加速计算。在3D 渲染上,大的像素集和顶点被映射
到并行线程。同样,图像和媒体处理的应用程序例如着色的图像后处理,录像编码和解码,
图像缩放比例,立体视觉,以及图像识别也可以映射图像块和像素到并行处理线程。实际上,
在图像着色和处理领域外的许多算法同样可以通过并行数据处理得到加速,从一般信号处理
或物理模拟到金融计算或者生物计算。
然而直到今天,尽管强大的计算能力包装进了GPU,而它对非图形应用的有效支持依然有限:
GPU 只能通过图型API 来编程,导致新手很难学习和非图形API 上很不充分的应用。
GPU DRAM 可以用一般方式下读取,GPU 程序可以从任何DRAM 部分收集数据元素。
但不可写,在一般方式下的GPU 程序不能写入信息到DRAM 的任何部分,相比CPU 丧失
了很多编程的灵活性。
有些应用是由于DRAM 内存带宽而形成的瓶颈,未能充分利用GPU 的计算能力。
本文描述的是一个崭新的硬件和编程模型,它直接答复了这些问题并且展示了GPU 如何成
为一个真正的通用并行数据计算设备。
1.2 CUDA
GPU
GPU
API
便
CUDACompute Unified Device Architecture
GPU
CUDA
G80
访
GPU
CUDA 软件堆栈由几层组成,如图1-3 所示:一个硬件驱动程序,一个应用程序编程接口(API)
和它的Runtime, 还有二个高级的通用数学库,CUFFT CUBLAS。硬件被设计成支持轻
量级的驱动和
Runtime 层面,因而提高性能。
- 12 -
CUDA API
C 便
1-3
1-4
CPU
CUDA
DRAM
DRAM
1-4
- 13 -
CUDA
1-5
On-chip
线
DRAM overfetch round-trips
DRAM
1-5。共享内存使数据更接近ALU
- 14 -
Chapter 2 编程模型
2.1
线
CUDA
GPU
线
使
API
DRAM
DRAM
DRAM
2.2
线
线线
kernel
kernel
2-1
CPU
线
(DMA)
kernel
2-1
kernel
线线
线
- 15 -
2.2.1
线
线线
Kernel
线线
线
(
D
x
ID
D
(
D
y
D
z
)
线
ID
线线
x
D
y
)
2.2.2
线
线
线
(
xyz
(x, y)
线
)
线
ID
ID (x +
ID (x +
2 -3-
y D
y D
访
x
)
x
+
z DxD
y
)
线线
线
kernel
线
kernel
线
线
kernel
.
每个块是由它的块ID 确定的,块的ID 是在栅格之内的块编号。根据块ID 可以帮助进行
ٛ
复杂寻址,一个应用程序可也以指定一个栅格作为任意大小的一个二维数组,并且通过一个
2-组件索引替换来制定每个块。对于一个大小为 (
块的
ID (x +
y D
x
)
D
x
D
y
)
块,这个块的索引是(x,y),
- 16 -
2.3
线
使
DRAM On-Chip
线
线
kernel
2-2
线
使
2-2
DRAM On-chip
- 17 -
Chapter 3 硬件实
3.1
on-chip
SIMD
使
32
访
3-1
使
on-chip
(SIMD)
3-1
- 18 -
3.2
线
线
on-chip
线
线
warps;
; 线
warp
线
warp
active
warp
warp
active
使
warp
0
;
warp
线线
访
SIMD
SIMD
线
warp
线
线
使线线
线
在一个块内的warp 序是未定的,但通过协调全局或者共享内存的存取,它可以同
的执行。如一个通过warp 线程执行的指写入全局或共享内存的同一位置,写的序是
未定的。
线
线
- 19 -
3.3
A
3.4
A
1.x
1
为一个应用程序使用多GPU 作为CUDA 设备,必须保证这些GPU 是一样的型。如果系
统工作在SLI 模式下,那么只有一个GPU 可以作为CUDA 设备,由于所有的GPU 在驱动
堆栈层的融合了。SLI 模式要在控制面板中关闭,这样多个GPU 作为CUDA
设备。
3.5
GPU 指定一些DRAM 来存primary surface 的内,这些内被用于输出。如
户改变显示的分辨率或者色那么primary surface 的存储需求量将改变。比如,如
户将显示分辨率1280x1024x32bit1600x1200x32bit系统必须指定7.68MB
primary surface 而不在是5.24MB。(使全屏抗锯齿的应用程序要更多的primary surface
空间)。另外,比如在Windows 中使用Alt+Tab切换,或者Ctrl+Alt+Del 的操作同样需要
额外的primary surface 空间。
模式切换增加了primary surface 的内存空间系统将占CUDA 指定的内存空间,导
致程序崩溃
- 20 -
Chapter 4 应用程序编程接口(API
4.1
C
CUDA
C
C
4.2
;
一个runtime 库分成:
一个主机组件,在4.5 部分描述,它在主机上运行并且提供函数来控制和访问一个
或多个计算设备;
一个设备组件,在4.4 部分描述,它在设备运行并且提特定设备的数;
一个共的组件,在第4.3 部分描述,它提置矢型和主机与设备编码支持
C 标准库的一个集。
的是,只有来自C 标准库的数支持在设备上运行,是由Runtime 的组件提
数。
4.2 语言扩展
C 语言的展是四重的:
函数型限定指定一个数是执行在主机或者执行在设备,和是主机或者从设
备上用(4.2.1 部分);
型限定指定设备上一个量的内存位置(第4.2.2 部分);
一个新的指指定一个来自主机的kernel 如何在设备上执行 (4.2.3 部分);
个内量指定栅格和块的维数,还有块和线程的指标 (第4.2.4 部分)
每个包CUDA 语言扩展的文件必须通过CUDA 译器nvcc ,在4.2.5 部分
地加以描述。 nvcc 的一个详细的描述可以在一单独的文件中找到。
nvcc
- 21 -
4.2.1
4.2.1.1 __device__
__device__限定词声明一个数是:
在设备上执行的,
可从设备用。
4.2.1.2 __global__
__global__
4.2.1.3 __host__
__host__
__host__限定词声明数是:
__host____host__
__host__
__host____host__
kernel
主机上执行的,
可从主机调用。
声明一个带有__host__
__global__
__global__限定词;其他情况下这个主机
__global____global__
然而,__host__
__host__ 限定也可以用于与__device__
__host__ __host__
__host__限定或者声明有任何__ho
__host____host__
__device__限定的组合,这情况下,这个数是
__device____device__
主机和设备方编
__host__
st__, __device__
__ho__ho
st__st__
__device__,或
__device____device__
4.2.1.4
__device____global__
__device____global__
__device____global__
__device__
不能一使用__global____host__限定
;
- 22 -
__global__
__global__
4.2.3
__global__
4.2.2
4.2.2.1 __device__
__device__
线
void
__global__
runtime
访
__global__
__device__
256
4.2.2.2 __constant__
__constant__
线
4.2.2.3 __shared__
__shared__
__shared__限定,与__device__
__shared____shared__
__shared__
__shared__ __shared__
__device__起选择使用,声明一个量:
__device____device__
__device__
runtime
使
访
驻留在线程块的共享内存空间中
具有块的生存
只有块之内的所有线程是可访问的。
在线程共享的量有完全序一致性。只有执行过一个__syncthreads
__syncthreads()数,从其他
__syncthreads__syncthreads
线程的写才保证量被定为可挥发的,否则只要一个状态,编译器将自
由的优化共享内存的读写。
- 23 -
当声明一个在共享内存的量作为一个外部数组,例如
extern_shared_float shared[]
(
4.2.3
)
offset(
)
short array0[128];
float array1[64];
int array2[256];
extern __shared__ short array[];
__device__ void func() //__device__or__global__function
{
Short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
4.2.2.4
__shared____constant__
__device__,__shared____constant__
__device____constant__
__constant__
struct union
extern
runtime
4.5.3.6
__shared__
ld.local st.local
lmem
ptx
(
- ptx -keep
使
访
使
使
--ptxas-option=-v
.local
使
使
4.5.2.3
)
使
- 24 -
只能通过设备__device__,__shared____constant__变量来取地址。
__device____constant__变量的地址只能通过主机代得,cudaGetSymbolAddress()
参见4.5.2.3 部分。
4.2.3
__global__
stream
4.5.1.5
streams
S>>>
Dg
Dg dim3
Dg Dg
dim3 (参见4.3.1.2 部分)并且指定栅格的维数和大小,这样Dg.x * Dg.y
dim3dim3
发送的块的数量;
Db
Db dim3
Db Db
Db.z
Db.z 于每个块的线程数量;
Db.z Db.z
Ns
S
__global__ void Func(float* parameter);
;
dim3 (参见4.3.1.2 部分)并且指定每个块的维数和大小,这样Db.x * Db.y *
dim3 dim3
size_t
cudaStream_t
; Ns
0
streamS
使
0
<<< Dg, Db, Ns,
Dg.x * Dg.y
Dg.x * Dg.y Dg.x * Dg.y
Db.x * Db.y *
Db.x * Db.y * Db.x * Db.y *
4.2.2.3
Func<<< Dg, Db, Ns >>>(parameter);
- 25 -
Dg Db
4.2.4
4.2.4.1 gridDim
dim3 (
4.2.4.2 blockIdx
uint3 (
4.2.4.3 blockDim
dim3 (
4.2.4.4 threadIdx
uint3 (
A.1
Ns
4.3.1.2
4.3.1.1
4.3.1.2
4.3.1.1
)
)
)
)
线
4.2.4.5
4.2.5 NVCC
nvcc
CUDA
nvcc
cubin
使
使
CUDA
C
API
cubin
(
- 26 -
4.5.3
)
CUDA Runtime
4.2.3
cubin
Kernel(
4.5.2
)
C++
使
non-void
CUDA
C
C++
void
C++
nvcc
4.2.5.1 __noinline__
__device__
inline
C++
malloc()
__noinline__
classes, inheritance
使
NVCC
C++
typecast
inline
__noinline__
4.2.5.2 #pragma unroll
:
#pragma unroll
#pragma unroll 5
For (int i = 0; i < n; ++i)
循环将5 次。请自行确定展动作不会影响到程序的正确性。
#pragma unroll 后面附值行程计数为循环完全否则
- 27 -
4.3
Runtime
4.3.1
Runtime
使
4.3.1.1 char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1,
short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3,
uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4,
float1, float2, float3, float4
y, z, w
访
1
make_<type name>
2
3
4
;
x,
int2 make_int2(int x, int y);
(xy)
int2
4.3.1.2 dim3
uint3
dim3
1
4.3.2
使
B-1
C/C++
C runtime
4.3.3
clock_t clock();
kernel
线线
线
线
- 28 -
4.3.4
CUDA
Texture reference
Texture reference
texture fetches
texture referece
使
4.5.2.6 4.5.3.9
5.4
使
GPU
使
fetch
kernel
kernel
4.4.5
使
texture reference
Texture fetch
texelstexture elements
runtime
4.3.4.1 Texture Reference
texture reference
fetch
texture
texture reference
Texrure<Type, Dim, ReadMode> texRef;
此时:
Type
Type 指定的数据型是在时返回的; Type
Type Type
4.3.1.1 部分定的所有的型;
DDDDim
im 指定texture reference 的维数,它12; Dim
im im
ReadMode
ReadMode cudaReadModeNormalizedFloat
ReadMode ReadMode
cudaReadModeNormalizedFloat
cudaReadModeNormalizedFloat cudaReadModeElementType
cudaReadModeNormalizedFloat cudaReadModeNormalizedFloat
Type
Type 被限定在本的型和型和
Type Type
Dim 的是默认1 的一个可选自变量;
Dim Dim
cudaReadModeElementType;它是
cudaReadModeElementTypecudaReadModeElementType
16-bit 8-bit
texture reference
- 29 -
[-1.01.0]
unsigned
cudaReadModeElementType
ReadMode
cudaReadModeElementType
4.3.4.2 Runtime Texture Reference
runtime API 4.5.3.9 driver API
texture reference
)
[0N)
x [063]y [031]
64x32
[0N)N
0xff
Normalized
normalized
[0.01.0]signed
8-bit
1
runtime
normalized
x [
0.01.0
(
[
0.01.0
)
y [
4.5.2.6
64x32
)
0.01.0
)
Normalized
[0 N)
0 0N N-1使
0.75
[
0.01.0
使
)
线
texel
bilinear
F
unnormalized
normalized
texel
normalized
"
warp
"
Warp
1.25
texel
0.25
-
1.25
- 30 -
Loading...
+ 106 hidden pages