【CUDA编程】系列博客参考NVIDIA官方文档“CUDA C++ Programming Guide（v12.6）”。
本文为原创文章，未经本人允许，禁止转载。转载请注明出处。

1.Graphics Interoperability

OpenGL和Direct3D的一些资源可以映射到CUDA的地址空间中，这样做的目的是：1）使CUDA能够读取由OpenGL或Direct3D写入的数据；2）使CUDA能够写入数据，以供OpenGL或Direct3D使用。

在使用第2部分和第3部分中提到的函数进行映射之前，资源必须先注册到CUDA中。这些函数会返回一个指向CUDA图形资源的指针，其类型为struct cudaGraphicsResource。注册资源可能会有较高的开销，因此通常每个资源只注册一次。可以使用cudaGraphicsUnregisterResource()来取消注册CUDA图形资源。每个打算使用该资源的CUDA上下文都需要单独注册它。

一旦资源注册到CUDA，就可以通过cudaGraphicsMapResources()和cudaGraphicsUnmapResources()根据需要多次进行映射和取消映射。通过调用cudaGraphicsResourceSetMapFlags()来指定使用方式（只写或只读），CUDA驱动程序会利用这些使用方式优化资源管理。

映射后的资源可以通过cudaGraphicsResourceGetMappedPointer()（针对buffers）和cudaGraphicsSubResourceGetMappedArray()（针对CUDA array）返回的device内存地址在CUDA kernel函数中进行读写操作。

当一个资源被映射时，通过OpenGL、Direct3D或其他CUDA context访问该资源会导致结果未定义。第2部分和第3部分提供了针对每种图形API的详细信息以及一些代码示例。第4部分提供了系统处于SLI模式时的详细信息。

2.OpenGL Interoperability

可以被映射到CUDA地址空间的OpenGL资源包括OpenGL buffer、texture和renderbuffer对象。

buffer对象通过调用cudaGraphicsGLRegisterBuffer()注册。在CUDA中，它表现为一个device指针，因此可以被kernel函数读取和写入，也可以通过cudaMemcpy()进行数据传输。

texture或renderbuffer对象通过cudaGraphicsGLRegisterImage()函数注册。在CUDA中，它表现为一个CUDA array。kernel可以通过将该array绑定到纹理引用或表面引用来读取数据。如果资源在注册时使用了cudaGraphicsRegisterFlagsSurfaceLoadStore标志，kernel也可以通过表面写函数将数据写入其中。该数组也可以通过调用cudaMemcpy2D()来读写。cudaGraphicsGLRegisterImage()支持所有具有1个、2个或4个分量的纹理格式，且内部数据类型可以是浮点型（如GL_RGBA_FLOAT32）、归一化整型（如GL_RGBA8、GL_INTENSITY16）、非归一化整型（如GL_RGBA8UI）。

正在共享资源的OpenGL context对于进行任何OpenGL互操作性API调用的主机线程来说必须是最新的。

需要注意的是，当一个OpenGL texture被设置为无绑定状态（bindless，例如通过调用glGetTextureHandle*或glGetImageHandle* API请求图像或纹理句柄）时，它将无法再注册到CUDA。应用程序需要在请求图像或纹理句柄之前，先将纹理注册为CUDA互操作资源。

以下代码示例是使用一个kernel动态修改存储在VBO（vertex buffer object）中的一个大小为width$\times$height的二维网格的顶点数据：

GLuint positionsVBO;
struct cudaGraphicsResource* positionsVBO_CUDA;

int main()
{
    // Initialize OpenGL and GLUT for device 0
    // and make the OpenGL context current
    ...
    glutDisplayFunc(display);

    // Explicitly set device 0
    cudaSetDevice(0);

    // Create buffer object and register it with CUDA
    glGenBuffers(1, &positionsVBO);
    glBindBuffer(GL_ARRAY_BUFFER, positionsVBO);
    unsigned int size = width * height * 4 * sizeof(float);
    glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    cudaGraphicsGLRegisterBuffer(&positionsVBO_CUDA,
                                 positionsVBO,
                                 cudaGraphicsMapFlagsWriteDiscard);

    // Launch rendering loop
    glutMainLoop();

    ...
}

void display()
{
    // Map buffer object for writing from CUDA
    float4* positions;
    cudaGraphicsMapResources(1, &positionsVBO_CUDA, 0);
    size_t num_bytes;
    cudaGraphicsResourceGetMappedPointer((void**)&positions,
                                         &num_bytes,
                                         positionsVBO_CUDA));

    // Execute kernel
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time,
                                          width, height);

    // Unmap buffer object
    cudaGraphicsUnmapResources(1, &positionsVBO_CUDA, 0);

    // Render from buffer object
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glBindBuffer(GL_ARRAY_BUFFER, positionsVBO);
    glVertexPointer(4, GL_FLOAT, 0, 0);
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawArrays(GL_POINTS, 0, width * height);
    glDisableClientState(GL_VERTEX_ARRAY);

    // Swap buffers
    glutSwapBuffers();
    glutPostRedisplay();
}

void deleteVBO()
{
    cudaGraphicsUnregisterResource(positionsVBO_CUDA);
    glDeleteBuffers(1, &positionsVBO);
}

__global__ void createVertices(float4* positions, float time,
                               unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Calculate uv coordinates
    float u = x / (float)width;
    float v = y / (float)height;
    u = u * 2.0f - 1.0f;
    v = v * 2.0f - 1.0f;

    // calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u * freq + time)
            * cosf(v * freq + time) * 0.5f;

    // Write positions
    positions[y * width + x] = make_float4(u, w, v, 1.0f);
}

在Windows平台上，对于Quadro GPU，可以使用cudaWGLGetDevice()函数来获取与wglEnumGpusNV()返回句柄关联的CUDA device。在多GPU配置中，当OpenGL渲染运行在Quadro GPU上，而CUDA计算运行在系统中其他GPU上时，Quadro GPU提供比GeForce GPU和Tesla GPU更高性能的OpenGL互操作性。

3.Direct3D Interoperability

Direct3D互操作性支持Direct3D 9Ex、Direct3D 10和Direct3D 11。

CUDA context只能与满足以下条件的Direct3D device进行互操作：

Direct3D 9Ex device必须满足：DeviceType设置为D3DDEVTYPE_HAL，BehaviorFlags设置为D3DCREATE_HARDWARE_VERTEXPROCESSING标志。
Direct3D 10和Direct3D 11 device必须满足：DriverType设置为D3D_DRIVER_TYPE_HARDWARE。

Direct3D资源可以映射到CUDA的地址空间，包括Direct3D buffers、纹理和表面。这些资源使用cudaGraphicsD3D9RegisterResource()、cudaGraphicsD3D10RegisterResource()和cudaGraphicsD3D11RegisterResource()进行注册。

以下代码示例是使用一个kernel动态修改存储在VBO（vertex buffer object）中的一个大小为width$\times$height的二维网格的顶点数据。

3.1.Direct3D 9 Version

IDirect3D9* D3D;
IDirect3DDevice9* device;
struct CUSTOMVERTEX {
    FLOAT x, y, z;
    DWORD color;
};
IDirect3DVertexBuffer9* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;

int main()
{
    int dev;
    // Initialize Direct3D
    D3D = Direct3DCreate9Ex(D3D_SDK_VERSION);

    // Get a CUDA-enabled adapter
    unsigned int adapter = 0;
    for (; adapter < g_pD3D->GetAdapterCount(); adapter++) {
        D3DADAPTER_IDENTIFIER9 adapterId;
        g_pD3D->GetAdapterIdentifier(adapter, 0, &adapterId);
        if (cudaD3D9GetDevice(&dev, adapterId.DeviceName)
            == cudaSuccess)
            break;
    }

     // Create device
    ...
    D3D->CreateDeviceEx(adapter, D3DDEVTYPE_HAL, hWnd,
                        D3DCREATE_HARDWARE_VERTEXPROCESSING,
                        &params, NULL, &device);

    // Use the same device
    cudaSetDevice(dev);

    // Create vertex buffer and register it with CUDA
    unsigned int size = width * height * sizeof(CUSTOMVERTEX);
    device->CreateVertexBuffer(size, 0, D3DFVF_CUSTOMVERTEX,
                               D3DPOOL_DEFAULT, &positionsVB, 0);
    cudaGraphicsD3D9RegisterResource(&positionsVB_CUDA,
                                     positionsVB,
                                     cudaGraphicsRegisterFlagsNone);
    cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
                                    cudaGraphicsMapFlagsWriteDiscard);

    // Launch rendering loop
    while (...) {
        ...
        Render();
        ...
    }
    ...
}

void Render()
{
    // Map vertex buffer for writing from CUDA
    float4* positions;
    cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
    size_t num_bytes;
    cudaGraphicsResourceGetMappedPointer((void**)&positions,
                                         &num_bytes,
                                         positionsVB_CUDA));

    // Execute kernel
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time,
                                          width, height);

    // Unmap vertex buffer
    cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);

    // Draw and present
    ...
}

void releaseVB()
{
    cudaGraphicsUnregisterResource(positionsVB_CUDA);
    positionsVB->Release();
}

__global__ void createVertices(float4* positions, float time,
                               unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Calculate uv coordinates
    float u = x / (float)width;
    float v = y / (float)height;
    u = u * 2.0f - 1.0f;
    v = v * 2.0f - 1.0f;

    // Calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u * freq + time)
            * cosf(v * freq + time) * 0.5f;

    // Write positions
    positions[y * width + x] =
                make_float4(u, w, v, __int_as_float(0xff00ff00));
}

3.2.Direct3D 10 Version

ID3D10Device* device;
struct CUSTOMVERTEX {
    FLOAT x, y, z;
    DWORD color;
};
ID3D10Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;

int main()
{
    int dev;
    // Get a CUDA-enabled adapter
    IDXGIFactory* factory;
    CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
    IDXGIAdapter* adapter = 0;
    for (unsigned int i = 0; !adapter; ++i) {
        if (FAILED(factory->EnumAdapters(i, &adapter))
            break;
        if (cudaD3D10GetDevice(&dev, adapter) == cudaSuccess)
            break;
        adapter->Release();
    }
    factory->Release();

    // Create swap chain and device
    ...
    D3D10CreateDeviceAndSwapChain(adapter,
                                  D3D10_DRIVER_TYPE_HARDWARE, 0,
                                  D3D10_CREATE_DEVICE_DEBUG,
                                  D3D10_SDK_VERSION,
                                  &swapChainDesc, &swapChain,
                                  &device);
    adapter->Release();

    // Use the same device
    cudaSetDevice(dev);

    // Create vertex buffer and register it with CUDA
    unsigned int size = width * height * sizeof(CUSTOMVERTEX);
    D3D10_BUFFER_DESC bufferDesc;
    bufferDesc.Usage          = D3D10_USAGE_DEFAULT;
    bufferDesc.ByteWidth      = size;
    bufferDesc.BindFlags      = D3D10_BIND_VERTEX_BUFFER;
    bufferDesc.CPUAccessFlags = 0;
    bufferDesc.MiscFlags      = 0;
    device->CreateBuffer(&bufferDesc, 0, &positionsVB);
    cudaGraphicsD3D10RegisterResource(&positionsVB_CUDA,
                                      positionsVB,
                                      cudaGraphicsRegisterFlagsNone);
                                      cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
                                      cudaGraphicsMapFlagsWriteDiscard);

    // Launch rendering loop
    while (...) {
        ...
        Render();
        ...
    }
    ...
}

void Render()
{
    // Map vertex buffer for writing from CUDA
    float4* positions;
    cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
    size_t num_bytes;
    cudaGraphicsResourceGetMappedPointer((void**)&positions,
                                         &num_bytes,
                                         positionsVB_CUDA));

    // Execute kernel
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time,
                                          width, height);

    // Unmap vertex buffer
    cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);

    // Draw and present
    ...
}

void releaseVB()
{
    cudaGraphicsUnregisterResource(positionsVB_CUDA);
    positionsVB->Release();
}

__global__ void createVertices(float4* positions, float time,
                               unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Calculate uv coordinates
    float u = x / (float)width;
    float v = y / (float)height;
    u = u * 2.0f - 1.0f;
    v = v * 2.0f - 1.0f;

    // Calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u * freq + time)
            * cosf(v * freq + time) * 0.5f;

    // Write positions
    positions[y * width + x] =
                make_float4(u, w, v, __int_as_float(0xff00ff00));
}

3.3.Direct3D 11 Version

ID3D11Device* device;
struct CUSTOMVERTEX {
    FLOAT x, y, z;
    DWORD color;
};
ID3D11Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;

int main()
{
    int dev;
    // Get a CUDA-enabled adapter
    IDXGIFactory* factory;
    CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
    IDXGIAdapter* adapter = 0;
    for (unsigned int i = 0; !adapter; ++i) {
        if (FAILED(factory->EnumAdapters(i, &adapter))
            break;
        if (cudaD3D11GetDevice(&dev, adapter) == cudaSuccess)
            break;
        adapter->Release();
    }
    factory->Release();

    // Create swap chain and device
    ...
    sFnPtr_D3D11CreateDeviceAndSwapChain(adapter,
                                         D3D11_DRIVER_TYPE_HARDWARE,
                                         0,
                                         D3D11_CREATE_DEVICE_DEBUG,
                                         featureLevels, 3,
                                         D3D11_SDK_VERSION,
                                         &swapChainDesc, &swapChain,
                                         &device,
                                         &featureLevel,
                                         &deviceContext);
    adapter->Release();

    // Use the same device
    cudaSetDevice(dev);

    // Create vertex buffer and register it with CUDA
    unsigned int size = width * height * sizeof(CUSTOMVERTEX);
    D3D11_BUFFER_DESC bufferDesc;
    bufferDesc.Usage          = D3D11_USAGE_DEFAULT;
    bufferDesc.ByteWidth      = size;
    bufferDesc.BindFlags      = D3D11_BIND_VERTEX_BUFFER;
    bufferDesc.CPUAccessFlags = 0;
    bufferDesc.MiscFlags      = 0;
    device->CreateBuffer(&bufferDesc, 0, &positionsVB);
    cudaGraphicsD3D11RegisterResource(&positionsVB_CUDA,
                                      positionsVB,
                                      cudaGraphicsRegisterFlagsNone);
    cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
                                    cudaGraphicsMapFlagsWriteDiscard);

    // Launch rendering loop
    while (...) {
        ...
        Render();
        ...
    }
    ...
}

void Render()
{
    // Map vertex buffer for writing from CUDA
    float4* positions;
    cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
    size_t num_bytes;
    cudaGraphicsResourceGetMappedPointer((void**)&positions,
                                         &num_bytes,
                                         positionsVB_CUDA));

    // Execute kernel
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time,
                                          width, height);

    // Unmap vertex buffer
    cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);

    // Draw and present
    ...
}

void releaseVB()
{
    cudaGraphicsUnregisterResource(positionsVB_CUDA);
    positionsVB->Release();
}

    __global__ void createVertices(float4* positions, float time,
                          unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

// Calculate uv coordinates
    float u = x / (float)width;
    float v = y / (float)height;
    u = u * 2.0f - 1.0f;
    v = v * 2.0f - 1.0f;

    // Calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u * freq + time)
            * cosf(v * freq + time) * 0.5f;

    // Write positions
    positions[y * width + x] =
                make_float4(u, w, v, __int_as_float(0xff00ff00));
}

4.SLI Interoperability

在多GPU系统中，所有支持CUDA的GPU都可以通过CUDA驱动程序和运行时作为独立设备进行访问。然而，当系统处于SLI模式（Scalable Link Interface，是一种由NVIDIA开发的技术，主要用于将多个GPU连接起来协同处理图形渲染任务）时，存在一些特殊的注意事项，如下所述。

在SLI模式下，多个GPU会被绑定为一个逻辑单元来协同工作，如果在一块GPU上进行CUDA内存分配时，也会消耗SLI配置中其他GPU的内存，如果这些GPU与Direct3D或OpenGL设备相关联，则可能会导致内存分配比预期更早失败。

因此，应用程序应该为SLI配置中的每个GPU创建多个CUDA context。虽然这并非严格要求，但这样做可以避免device之间不必要的数据传输。应用程序可以使用cudaD3D[9|10|11]GetDevices()（针对Direct3D）和cudaGLGetDevices()（针对OpenGL）来获取当前和下一帧中执行渲染的CUDA device的句柄。根据这些信息，应用程序可以选择合适的device，并将Direct3D或OpenGL资源映射到由cudaD3D[9|10|11]GetDevices()或cudaGLGetDevices()返回的CUDA device上，前提是deviceList参数被设置为cudaD3D[9|10|11]DeviceListCurrentFrame或cudaGLDeviceListCurrentFrame。

需要注意的是，通过cudaGraphicsD9D[9|10|11]RegisterResource和cudaGraphicsGLRegister[Buffer|Image]返回的资源只能在进行注册的device上使用。因此，在SLI配置中，当不同帧的数据在不同的CUDA device上计算时，必须为每个device单独注册资源。

【CUDA编程】【18】【3.Programming Interface】【3.2.CUDA Runtime】【3.2.15.Graphics Interoperability】

OpenGL Interoperability，Direct3D Interoperability，SLI Interoperability