【CUDA编程】系列博客参考NVIDIA官方文档“CUDA C++ Programming Guide(v12.6)”。
本文为原创文章,未经本人允许,禁止转载。转载请注明出处。
1.Graphics Interoperability
OpenGL和Direct3D的一些资源可以映射到CUDA的地址空间中,这样做的目的是:1)使CUDA能够读取由OpenGL或Direct3D写入的数据;2)使CUDA能够写入数据,以供OpenGL或Direct3D使用。
在使用第2部分和第3部分中提到的函数进行映射之前,资源必须先注册到CUDA中。这些函数会返回一个指向CUDA图形资源的指针,其类型为struct cudaGraphicsResource
。注册资源可能会有较高的开销,因此通常每个资源只注册一次。可以使用cudaGraphicsUnregisterResource()
来取消注册CUDA图形资源。每个打算使用该资源的CUDA上下文都需要单独注册它。
一旦资源注册到CUDA,就可以通过cudaGraphicsMapResources()
和cudaGraphicsUnmapResources()
根据需要多次进行映射和取消映射。通过调用cudaGraphicsResourceSetMapFlags()
来指定使用方式(只写或只读),CUDA驱动程序会利用这些使用方式优化资源管理。
映射后的资源可以通过cudaGraphicsResourceGetMappedPointer()
(针对buffers)和cudaGraphicsSubResourceGetMappedArray()
(针对CUDA array)返回的device内存地址在CUDA kernel函数中进行读写操作。
当一个资源被映射时,通过OpenGL、Direct3D或其他CUDA context访问该资源会导致结果未定义。第2部分和第3部分提供了针对每种图形API的详细信息以及一些代码示例。第4部分提供了系统处于SLI模式时的详细信息。
2.OpenGL Interoperability
可以被映射到CUDA地址空间的OpenGL资源包括OpenGL buffer、texture和renderbuffer对象。
buffer对象通过调用cudaGraphicsGLRegisterBuffer()
注册。在CUDA中,它表现为一个device指针,因此可以被kernel函数读取和写入,也可以通过cudaMemcpy()
进行数据传输。
texture或renderbuffer对象通过cudaGraphicsGLRegisterImage()
函数注册。在CUDA中,它表现为一个CUDA array。kernel可以通过将该array绑定到纹理引用或表面引用来读取数据。如果资源在注册时使用了cudaGraphicsRegisterFlagsSurfaceLoadStore
标志,kernel也可以通过表面写函数将数据写入其中。该数组也可以通过调用cudaMemcpy2D()
来读写。cudaGraphicsGLRegisterImage()
支持所有具有1个、2个或4个分量的纹理格式,且内部数据类型可以是浮点型(如GL_RGBA_FLOAT32
)、归一化整型(如GL_RGBA8
、GL_INTENSITY16
)、非归一化整型(如GL_RGBA8UI
)。
正在共享资源的OpenGL context对于进行任何OpenGL互操作性API调用的主机线程来说必须是最新的。
需要注意的是,当一个OpenGL texture被设置为无绑定状态(bindless,例如通过调用glGetTextureHandle*
或glGetImageHandle*
API请求图像或纹理句柄)时,它将无法再注册到CUDA。应用程序需要在请求图像或纹理句柄之前,先将纹理注册为CUDA互操作资源。
以下代码示例是使用一个kernel动态修改存储在VBO(vertex buffer object)中的一个大小为width
$\times$height
的二维网格的顶点数据:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
GLuint positionsVBO;
struct cudaGraphicsResource* positionsVBO_CUDA;
int main()
{
// Initialize OpenGL and GLUT for device 0
// and make the OpenGL context current
...
glutDisplayFunc(display);
// Explicitly set device 0
cudaSetDevice(0);
// Create buffer object and register it with CUDA
glGenBuffers(1, &positionsVBO);
glBindBuffer(GL_ARRAY_BUFFER, positionsVBO);
unsigned int size = width * height * 4 * sizeof(float);
glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);
cudaGraphicsGLRegisterBuffer(&positionsVBO_CUDA,
positionsVBO,
cudaGraphicsMapFlagsWriteDiscard);
// Launch rendering loop
glutMainLoop();
...
}
void display()
{
// Map buffer object for writing from CUDA
float4* positions;
cudaGraphicsMapResources(1, &positionsVBO_CUDA, 0);
size_t num_bytes;
cudaGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVBO_CUDA));
// Execute kernel
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
createVertices<<<dimGrid, dimBlock>>>(positions, time,
width, height);
// Unmap buffer object
cudaGraphicsUnmapResources(1, &positionsVBO_CUDA, 0);
// Render from buffer object
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
glBindBuffer(GL_ARRAY_BUFFER, positionsVBO);
glVertexPointer(4, GL_FLOAT, 0, 0);
glEnableClientState(GL_VERTEX_ARRAY);
glDrawArrays(GL_POINTS, 0, width * height);
glDisableClientState(GL_VERTEX_ARRAY);
// Swap buffers
glutSwapBuffers();
glutPostRedisplay();
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
void deleteVBO()
{
cudaGraphicsUnregisterResource(positionsVBO_CUDA);
glDeleteBuffers(1, &positionsVBO);
}
__global__ void createVertices(float4* positions, float time,
unsigned int width, unsigned int height)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
// Calculate uv coordinates
float u = x / (float)width;
float v = y / (float)height;
u = u * 2.0f - 1.0f;
v = v * 2.0f - 1.0f;
// calculate simple sine wave pattern
float freq = 4.0f;
float w = sinf(u * freq + time)
* cosf(v * freq + time) * 0.5f;
// Write positions
positions[y * width + x] = make_float4(u, w, v, 1.0f);
}
在Windows平台上,对于Quadro GPU,可以使用cudaWGLGetDevice()
函数来获取与wglEnumGpusNV()
返回句柄关联的CUDA device。在多GPU配置中,当OpenGL渲染运行在Quadro GPU上,而CUDA计算运行在系统中其他GPU上时,Quadro GPU提供比GeForce GPU和Tesla GPU更高性能的OpenGL互操作性。
3.Direct3D Interoperability
Direct3D互操作性支持Direct3D 9Ex、Direct3D 10和Direct3D 11。
CUDA context只能与满足以下条件的Direct3D device进行互操作:
- Direct3D 9Ex device必须满足:
DeviceType
设置为D3DDEVTYPE_HAL
,BehaviorFlags
设置为D3DCREATE_HARDWARE_VERTEXPROCESSING
标志。 - Direct3D 10和Direct3D 11 device必须满足:
DriverType
设置为D3D_DRIVER_TYPE_HARDWARE
。
Direct3D资源可以映射到CUDA的地址空间,包括Direct3D buffers、纹理和表面。这些资源使用cudaGraphicsD3D9RegisterResource()
、cudaGraphicsD3D10RegisterResource()
和cudaGraphicsD3D11RegisterResource()
进行注册。
以下代码示例是使用一个kernel动态修改存储在VBO(vertex buffer object)中的一个大小为width
$\times$height
的二维网格的顶点数据。
3.1.Direct3D 9 Version
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
IDirect3D9* D3D;
IDirect3DDevice9* device;
struct CUSTOMVERTEX {
FLOAT x, y, z;
DWORD color;
};
IDirect3DVertexBuffer9* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;
int main()
{
int dev;
// Initialize Direct3D
D3D = Direct3DCreate9Ex(D3D_SDK_VERSION);
// Get a CUDA-enabled adapter
unsigned int adapter = 0;
for (; adapter < g_pD3D->GetAdapterCount(); adapter++) {
D3DADAPTER_IDENTIFIER9 adapterId;
g_pD3D->GetAdapterIdentifier(adapter, 0, &adapterId);
if (cudaD3D9GetDevice(&dev, adapterId.DeviceName)
== cudaSuccess)
break;
}
// Create device
...
D3D->CreateDeviceEx(adapter, D3DDEVTYPE_HAL, hWnd,
D3DCREATE_HARDWARE_VERTEXPROCESSING,
¶ms, NULL, &device);
// Use the same device
cudaSetDevice(dev);
// Create vertex buffer and register it with CUDA
unsigned int size = width * height * sizeof(CUSTOMVERTEX);
device->CreateVertexBuffer(size, 0, D3DFVF_CUSTOMVERTEX,
D3DPOOL_DEFAULT, &positionsVB, 0);
cudaGraphicsD3D9RegisterResource(&positionsVB_CUDA,
positionsVB,
cudaGraphicsRegisterFlagsNone);
cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
cudaGraphicsMapFlagsWriteDiscard);
// Launch rendering loop
while (...) {
...
Render();
...
}
...
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
void Render()
{
// Map vertex buffer for writing from CUDA
float4* positions;
cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
size_t num_bytes;
cudaGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVB_CUDA));
// Execute kernel
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
createVertices<<<dimGrid, dimBlock>>>(positions, time,
width, height);
// Unmap vertex buffer
cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);
// Draw and present
...
}
void releaseVB()
{
cudaGraphicsUnregisterResource(positionsVB_CUDA);
positionsVB->Release();
}
__global__ void createVertices(float4* positions, float time,
unsigned int width, unsigned int height)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
// Calculate uv coordinates
float u = x / (float)width;
float v = y / (float)height;
u = u * 2.0f - 1.0f;
v = v * 2.0f - 1.0f;
// Calculate simple sine wave pattern
float freq = 4.0f;
float w = sinf(u * freq + time)
* cosf(v * freq + time) * 0.5f;
// Write positions
positions[y * width + x] =
make_float4(u, w, v, __int_as_float(0xff00ff00));
}
3.2.Direct3D 10 Version
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
ID3D10Device* device;
struct CUSTOMVERTEX {
FLOAT x, y, z;
DWORD color;
};
ID3D10Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;
int main()
{
int dev;
// Get a CUDA-enabled adapter
IDXGIFactory* factory;
CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
IDXGIAdapter* adapter = 0;
for (unsigned int i = 0; !adapter; ++i) {
if (FAILED(factory->EnumAdapters(i, &adapter))
break;
if (cudaD3D10GetDevice(&dev, adapter) == cudaSuccess)
break;
adapter->Release();
}
factory->Release();
// Create swap chain and device
...
D3D10CreateDeviceAndSwapChain(adapter,
D3D10_DRIVER_TYPE_HARDWARE, 0,
D3D10_CREATE_DEVICE_DEBUG,
D3D10_SDK_VERSION,
&swapChainDesc, &swapChain,
&device);
adapter->Release();
// Use the same device
cudaSetDevice(dev);
// Create vertex buffer and register it with CUDA
unsigned int size = width * height * sizeof(CUSTOMVERTEX);
D3D10_BUFFER_DESC bufferDesc;
bufferDesc.Usage = D3D10_USAGE_DEFAULT;
bufferDesc.ByteWidth = size;
bufferDesc.BindFlags = D3D10_BIND_VERTEX_BUFFER;
bufferDesc.CPUAccessFlags = 0;
bufferDesc.MiscFlags = 0;
device->CreateBuffer(&bufferDesc, 0, &positionsVB);
cudaGraphicsD3D10RegisterResource(&positionsVB_CUDA,
positionsVB,
cudaGraphicsRegisterFlagsNone);
cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
cudaGraphicsMapFlagsWriteDiscard);
// Launch rendering loop
while (...) {
...
Render();
...
}
...
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
void Render()
{
// Map vertex buffer for writing from CUDA
float4* positions;
cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
size_t num_bytes;
cudaGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVB_CUDA));
// Execute kernel
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
createVertices<<<dimGrid, dimBlock>>>(positions, time,
width, height);
// Unmap vertex buffer
cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);
// Draw and present
...
}
void releaseVB()
{
cudaGraphicsUnregisterResource(positionsVB_CUDA);
positionsVB->Release();
}
__global__ void createVertices(float4* positions, float time,
unsigned int width, unsigned int height)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
// Calculate uv coordinates
float u = x / (float)width;
float v = y / (float)height;
u = u * 2.0f - 1.0f;
v = v * 2.0f - 1.0f;
// Calculate simple sine wave pattern
float freq = 4.0f;
float w = sinf(u * freq + time)
* cosf(v * freq + time) * 0.5f;
// Write positions
positions[y * width + x] =
make_float4(u, w, v, __int_as_float(0xff00ff00));
}
3.3.Direct3D 11 Version
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
ID3D11Device* device;
struct CUSTOMVERTEX {
FLOAT x, y, z;
DWORD color;
};
ID3D11Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;
int main()
{
int dev;
// Get a CUDA-enabled adapter
IDXGIFactory* factory;
CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
IDXGIAdapter* adapter = 0;
for (unsigned int i = 0; !adapter; ++i) {
if (FAILED(factory->EnumAdapters(i, &adapter))
break;
if (cudaD3D11GetDevice(&dev, adapter) == cudaSuccess)
break;
adapter->Release();
}
factory->Release();
// Create swap chain and device
...
sFnPtr_D3D11CreateDeviceAndSwapChain(adapter,
D3D11_DRIVER_TYPE_HARDWARE,
0,
D3D11_CREATE_DEVICE_DEBUG,
featureLevels, 3,
D3D11_SDK_VERSION,
&swapChainDesc, &swapChain,
&device,
&featureLevel,
&deviceContext);
adapter->Release();
// Use the same device
cudaSetDevice(dev);
// Create vertex buffer and register it with CUDA
unsigned int size = width * height * sizeof(CUSTOMVERTEX);
D3D11_BUFFER_DESC bufferDesc;
bufferDesc.Usage = D3D11_USAGE_DEFAULT;
bufferDesc.ByteWidth = size;
bufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
bufferDesc.CPUAccessFlags = 0;
bufferDesc.MiscFlags = 0;
device->CreateBuffer(&bufferDesc, 0, &positionsVB);
cudaGraphicsD3D11RegisterResource(&positionsVB_CUDA,
positionsVB,
cudaGraphicsRegisterFlagsNone);
cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
cudaGraphicsMapFlagsWriteDiscard);
// Launch rendering loop
while (...) {
...
Render();
...
}
...
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
void Render()
{
// Map vertex buffer for writing from CUDA
float4* positions;
cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
size_t num_bytes;
cudaGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVB_CUDA));
// Execute kernel
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
createVertices<<<dimGrid, dimBlock>>>(positions, time,
width, height);
// Unmap vertex buffer
cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);
// Draw and present
...
}
void releaseVB()
{
cudaGraphicsUnregisterResource(positionsVB_CUDA);
positionsVB->Release();
}
__global__ void createVertices(float4* positions, float time,
unsigned int width, unsigned int height)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
// Calculate uv coordinates
float u = x / (float)width;
float v = y / (float)height;
u = u * 2.0f - 1.0f;
v = v * 2.0f - 1.0f;
// Calculate simple sine wave pattern
float freq = 4.0f;
float w = sinf(u * freq + time)
* cosf(v * freq + time) * 0.5f;
// Write positions
positions[y * width + x] =
make_float4(u, w, v, __int_as_float(0xff00ff00));
}
4.SLI Interoperability
在多GPU系统中,所有支持CUDA的GPU都可以通过CUDA驱动程序和运行时作为独立设备进行访问。然而,当系统处于SLI模式(Scalable Link Interface,是一种由NVIDIA开发的技术,主要用于将多个GPU连接起来协同处理图形渲染任务)时,存在一些特殊的注意事项,如下所述。
在SLI模式下,多个GPU会被绑定为一个逻辑单元来协同工作,如果在一块GPU上进行CUDA内存分配时,也会消耗SLI配置中其他GPU的内存,如果这些GPU与Direct3D或OpenGL设备相关联,则可能会导致内存分配比预期更早失败。
因此,应用程序应该为SLI配置中的每个GPU创建多个CUDA context。虽然这并非严格要求,但这样做可以避免device之间不必要的数据传输。应用程序可以使用cudaD3D[9|10|11]GetDevices()
(针对Direct3D)和cudaGLGetDevices()
(针对OpenGL)来获取当前和下一帧中执行渲染的CUDA device的句柄。根据这些信息,应用程序可以选择合适的device,并将Direct3D或OpenGL资源映射到由cudaD3D[9|10|11]GetDevices()
或cudaGLGetDevices()
返回的CUDA device上,前提是deviceList
参数被设置为cudaD3D[9|10|11]DeviceListCurrentFrame
或cudaGLDeviceListCurrentFrame
。
需要注意的是,通过cudaGraphicsD9D[9|10|11]RegisterResource
和cudaGraphicsGLRegister[Buffer|Image]
返回的资源只能在进行注册的device上使用。因此,在SLI配置中,当不同帧的数据在不同的CUDA device上计算时,必须为每个device单独注册资源。