【CUDA Programming】【19】【3.Programming Interface】【3.2.CUDA Runtime】【3.2.16.External Resource Interoperability】

Vulkan Interoperability, OpenGL Interoperability, Direct3D 12 Interoperability, Direct3D 11 Interoperability, NVIDIA Software Communication Interface Interoperability (NVSCI)

Posted by x-jeff on December 19, 2024

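The 【CUDA Programming】 blog series follows NVIDIA's official documentation, the "CUDA C++ Programming Guide (v12.6)".
This is an original article. Reposting without the author's permission is prohibited; please credit the source when reposting.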

1.External Resource Interoperability

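External resource interoperability allows CUDA to import certain resources that have been explicitly exported by other APIs. These objects are typically exported by the other API through handles native to the operating system, such as file descriptors on Linux or NT handles on Windows. Resources can also be exported through other unified interfaces, such as the NVIDIA Software Communication Interface. Two types of resources can be imported: memory objects and synchronization objects.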

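A memory object can be imported into CUDA with cudaImportExternalMemory(). An imported memory object can be accessed from within kernels through a device pointer mapped with cudaExternalMemoryGetMappedBuffer() or a CUDA mipmapped array mapped with cudaExternalMemoryGetMappedMipmappedArray(). Depending on the type of memory object, more than one mapping may be set up on a single memory object. The mappings must match the mappings set up in the exporting API; any mismatched mapping results in undefined behavior. An imported memory object must be freed with cudaDestroyExternalMemory(). Freeing a memory object does not free any mappings onto it. Therefore, every device pointer mapped onto the object must be explicitly freed with cudaFree(), and every CUDA mipmapped array mapped onto it must be explicitly freed with cudaFreeMipmappedArray(). Accessing a mapping after its memory object has been destroyed is illegal.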
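The teardown order implied by these rules (release every mapping first, destroy the memory object last) can be made harder to get wrong by tying each resource to a scope. The following is only a minimal sketch of that idea in plain C++, not CUDA-specific code: ScopedResource and ImportedBuffer are hypothetical names, and the logging callbacks are stand-ins for calls that, in a real program, would be cudaFree() and cudaDestroyExternalMemory().

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: runs a release callback once when it goes out of scope.
class ScopedResource {
public:
    explicit ScopedResource(std::function<void()> release) : release_(std::move(release)) {}
    ~ScopedResource() { if (release_) release_(); }
    ScopedResource(const ScopedResource &) = delete;
    ScopedResource &operator=(const ScopedResource &) = delete;
private:
    std::function<void()> release_;
};

// Hypothetical aggregate. C++ destroys members in reverse declaration order,
// so 'mapping' is released before 'memory'.
struct ImportedBuffer {
    ImportedBuffer(std::function<void()> destroyMemory, std::function<void()> freeMapping)
        : memory(std::move(destroyMemory)), mapping(std::move(freeMapping)) {}
    ScopedResource memory;   // stand-in for cudaDestroyExternalMemory(extMem)
    ScopedResource mapping;  // stand-in for cudaFree(ptr)
};

// Creates an ImportedBuffer whose callbacks only log, lets it go out of scope,
// and returns the order in which the callbacks ran.
std::vector<std::string> teardownOrder() {
    std::vector<std::string> log;
    {
        ImportedBuffer buf([&log] { log.push_back("destroy external memory"); },
                           [&log] { log.push_back("free mapped pointer"); });
    }
    return log;
}
```

Declaring the memory object before its mappings is the design choice that matters here: the language then guarantees that the mapping is released first.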

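A synchronization object can be imported into CUDA with cudaImportExternalSemaphore(). An imported synchronization object can be signaled with cudaSignalExternalSemaphoresAsync() and waited on with cudaWaitExternalSemaphoresAsync(). Issuing a wait before the corresponding signal has been issued is illegal. In addition, depending on the type of the imported synchronization object, there may be further constraints on how it can be signaled and waited on, as described in the following sections. An imported semaphore object must be freed with cudaDestroyExternalSemaphore(). All outstanding signal and wait operations must have completed before the semaphore object is destroyed.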

2.Vulkan Interoperability

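Vulkan is a cross-platform, low-overhead graphics and compute API developed by the Khronos Group. It was designed as a successor to OpenGL, giving applications more explicit control over the GPU with less driver overhead, and it is suited to both graphics rendering and general-purpose compute workloads.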

2.1.Matching device UUIDs

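When importing memory and synchronization objects exported by Vulkan, they must be imported and mapped on the same device on which they were created. Every physical device has a unique identifier (UUID), so the CUDA device that corresponds to a given Vulkan physical device can be found by comparing the UUIDs of the two, as the following code does. In addition, the Vulkan physical device must not belong to a multi-GPU device group: the device group returned by vkEnumeratePhysicalDeviceGroups that contains the given physical device must have a physical device count of 1.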

int getCudaDeviceForVulkanPhysicalDevice(VkPhysicalDevice vkPhysicalDevice) {
    VkPhysicalDeviceIDProperties vkPhysicalDeviceIDProperties = {};
    vkPhysicalDeviceIDProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES;
    vkPhysicalDeviceIDProperties.pNext = NULL;

    VkPhysicalDeviceProperties2 vkPhysicalDeviceProperties2 = {};
    vkPhysicalDeviceProperties2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    vkPhysicalDeviceProperties2.pNext = &vkPhysicalDeviceIDProperties;

    vkGetPhysicalDeviceProperties2(vkPhysicalDevice, &vkPhysicalDeviceProperties2);

    int cudaDeviceCount;
    cudaGetDeviceCount(&cudaDeviceCount);

    for (int cudaDevice = 0; cudaDevice < cudaDeviceCount; cudaDevice++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, cudaDevice);
        if (!memcmp(&deviceProp.uuid, vkPhysicalDeviceIDProperties.deviceUUID, VK_UUID_SIZE)) {
            return cudaDevice;
        }
    }
    return cudaInvalidDeviceId;
}

2.2.Importing Memory Objects

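On Linux and Windows 10, both dedicated and non-dedicated memory objects exported by Vulkan can be imported into CUDA. On Windows 7, only dedicated memory objects can be imported. When importing a Vulkan dedicated memory object, the flag cudaExternalMemoryDedicated must be set.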

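A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT can be imported into CUDA using the file descriptor associated with that object, as shown below. Note that CUDA assumes ownership of the file descriptor once it is imported. Using the file descriptor after a successful import results in undefined behavior.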

cudaExternalMemory_t importVulkanMemoryObjectFromFileDescriptor(int fd, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeOpaqueFd;
    desc.handle.fd = fd;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'fd' should not be used beyond this point as CUDA has assumed ownership of it

    return extMem;
}

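A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT can be imported into CUDA using the NT handle associated with that object, as shown below. Note that CUDA does not assume ownership of the NT handle; the application is responsible for closing the handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly closed before the underlying memory can be freed.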

cudaExternalMemory_t importVulkanMemoryObjectFromNTHandle(HANDLE handle, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeOpaqueWin32;
    desc.handle.win32.handle = handle;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extMem;
}

A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT can also be imported using a named handle, if one exists, as shown below.

cudaExternalMemory_t importVulkanMemoryObjectFromNamedNTHandle(LPCWSTR name, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeOpaqueWin32;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}

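A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT can be imported into CUDA using the globally shared D3DKMT handle associated with that object, as shown below. Because a globally shared D3DKMT handle does not hold a reference to the underlying memory, it is automatically destroyed when all other references to the resource are destroyed.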

cudaExternalMemory_t importVulkanMemoryObjectFromKMTHandle(HANDLE handle, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeOpaqueWin32Kmt;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}

2.3.Mapping Buffers onto Imported Memory Objects

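A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match the values specified when the mapping was created with the corresponding Vulkan API. All mapped device pointers must be freed with cudaFree().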

void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.size = size;

    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);

    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}

2.4.Mapping Mipmapped Arrays onto Imported Memory Objects

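A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format, and number of mip levels must match the values specified when the mapping was created with the corresponding Vulkan API. Additionally, if the mipmapped array is bound as a color target in Vulkan, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed with cudaFreeMipmappedArray(). The following code sample shows how to convert Vulkan parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.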

cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;

    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);

    return mipmap;
}

cudaChannelFormatDesc getCudaChannelFormatDescForVulkanFormat(VkFormat format)
{
    cudaChannelFormatDesc d;

    memset(&d, 0, sizeof(d));

    switch (format) {
    case VK_FORMAT_R8_UINT:             d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R8_SINT:             d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R8G8_UINT:           d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R8G8_SINT:           d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R8G8B8A8_UINT:       d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R8G8B8A8_SINT:       d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R16_UINT:            d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R16_SINT:            d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R16G16_UINT:         d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R16G16_SINT:         d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R16G16B16A16_UINT:   d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R16G16B16A16_SINT:   d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R32_UINT:            d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R32_SINT:            d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R32_SFLOAT:          d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case VK_FORMAT_R32G32_UINT:         d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R32G32_SINT:         d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R32G32_SFLOAT:       d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case VK_FORMAT_R32G32B32A32_UINT:   d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R32G32B32A32_SINT:   d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R32G32B32A32_SFLOAT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindFloat;    break;
    default: assert(0);
    }
    return d;
}

cudaExtent getCudaExtentForVulkanExtent(VkExtent3D vkExt, uint32_t arrayLayers, VkImageViewType vkImageViewType) {
    cudaExtent e = { 0, 0, 0 };

    switch (vkImageViewType) {
    case VK_IMAGE_VIEW_TYPE_1D:         e.width = vkExt.width; e.height = 0;            e.depth = 0;           break;
    case VK_IMAGE_VIEW_TYPE_2D:         e.width = vkExt.width; e.height = vkExt.height; e.depth = 0;           break;
    case VK_IMAGE_VIEW_TYPE_3D:         e.width = vkExt.width; e.height = vkExt.height; e.depth = vkExt.depth; break;
    case VK_IMAGE_VIEW_TYPE_CUBE:       e.width = vkExt.width; e.height = vkExt.height; e.depth = arrayLayers; break;
    case VK_IMAGE_VIEW_TYPE_1D_ARRAY:   e.width = vkExt.width; e.height = 0;            e.depth = arrayLayers; break;
    case VK_IMAGE_VIEW_TYPE_2D_ARRAY:   e.width = vkExt.width; e.height = vkExt.height; e.depth = arrayLayers; break;
    case VK_IMAGE_VIEW_TYPE_CUBE_ARRAY: e.width = vkExt.width; e.height = vkExt.height; e.depth = arrayLayers; break;
    default: assert(0);
    }

    return e;
}

unsigned int getCudaMipmappedArrayFlagsForVulkanImage(VkImageViewType vkImageViewType, VkImageUsageFlags vkImageUsageFlags, bool allowSurfaceLoadStore) {
    unsigned int flags = 0;

    switch (vkImageViewType) {
    case VK_IMAGE_VIEW_TYPE_CUBE:       flags |= cudaArrayCubemap;                    break;
    case VK_IMAGE_VIEW_TYPE_CUBE_ARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
    case VK_IMAGE_VIEW_TYPE_1D_ARRAY:   flags |= cudaArrayLayered;                    break;
    case VK_IMAGE_VIEW_TYPE_2D_ARRAY:   flags |= cudaArrayLayered;                    break;
    default: break;
    }

    if (vkImageUsageFlags & VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT) {
        flags |= cudaArrayColorAttachment;
    }

    if (allowSurfaceLoadStore) {
        flags |= cudaArraySurfaceLoadStore;
    }
    return flags;
}
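
The numLevels argument passed to mapMipmappedArrayOntoExternalMemory has to equal the mip level count the Vulkan image was created with. As a minimal sketch, assuming the image was created with a complete mip chain (an assumption, not something stated above), that count can be derived from the base extent as floor(log2(max(width, height, depth))) + 1. The helper name fullMipChainLevels is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: number of levels in a complete mip chain for a base extent.
// Unused dimensions (reported as 0, as in the extent conversion above) count as 1.
// For layered or cubemap images, pass 0 for 'depth': array layers are not mipmapped.
unsigned int fullMipChainLevels(std::size_t width, std::size_t height, std::size_t depth) {
    std::size_t largest = std::max({width, height, depth, static_cast<std::size_t>(1)});
    unsigned int levels = 1;
    while (largest > 1) {  // each level halves the largest dimension, rounding down
        largest >>= 1;
        ++levels;
    }
    return levels;
}
```

If the Vulkan image was created with fewer levels than a full chain, the value from VkImageCreateInfo::mipLevels should be used instead of this computation.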

2.5.Importing Synchronization Objects

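A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT can be imported into CUDA using the file descriptor associated with that object, as shown below. Note that CUDA assumes ownership of the file descriptor once it is imported. Using the file descriptor after a successful import results in undefined behavior.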

cudaExternalSemaphore_t importVulkanSemaphoreObjectFromFileDescriptor(int fd) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeOpaqueFd;
    desc.handle.fd = fd;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'fd' should not be used beyond this point as CUDA has assumed ownership of it

    return extSem;
}

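A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT can be imported into CUDA using the NT handle associated with that object, as shown below. Note that CUDA does not assume ownership of the NT handle; the application is responsible for closing the handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly closed before the underlying semaphore can be freed.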

cudaExternalSemaphore_t importVulkanSemaphoreObjectFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extSem;
}

A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT can also be imported using a named handle, if one exists, as shown below.

cudaExternalSemaphore_t importVulkanSemaphoreObjectFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32;
    desc.handle.win32.name = (void *)name;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}

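A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT can be imported into CUDA using the globally shared D3DKMT handle associated with that object, as shown below. Because a globally shared D3DKMT handle does not hold a reference to the underlying semaphore, it is automatically destroyed when all other references to the resource are destroyed.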

cudaExternalSemaphore_t importVulkanSemaphoreObjectFromKMTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32Kmt;
    desc.handle.win32.handle = (void *)handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}

2.6.Signaling/Waiting on Imported Synchronization Objects

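An imported Vulkan semaphore object can be signaled as shown below. Signaling such a semaphore object sets it to the signaled state. The corresponding wait that waits on this signal must be issued in Vulkan. Additionally, that wait must be issued after this signal has been issued.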

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};

    memset(&params, 0, sizeof(params));

    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

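An imported Vulkan semaphore object can be waited on as shown below. Waiting on such a semaphore object waits until it reaches the signaled state and then resets it back to the unsignaled state. The corresponding signal that this wait is waiting on must be issued in Vulkan. Additionally, the signal must be issued before this wait is issued.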

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};

    memset(&params, 0, sizeof(params));

    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
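
The signal and wait contract described in this section can be summarised with a small host-side state model. This is only an illustrative sketch in plain C++, not how the driver implements semaphores: BinarySemaphoreModel and its methods are hypothetical names, and no CUDA or Vulkan call is involved.

```cpp
// Hypothetical model of the binary semaphore state machine described in this section.
class BinarySemaphoreModel {
public:
    // Signaling moves the semaphore to the signaled state.
    void signal() { signaled_ = true; }

    // A wait is only legal once the matching signal has been issued.
    // On success the semaphore is reset to the unsignaled state.
    // Returns false when the wait would have been issued too early.
    bool wait() {
        if (!signaled_) return false;
        signaled_ = false;
        return true;
    }

    bool isSignaled() const { return signaled_; }

private:
    bool signaled_ = false;
};
```

In the model, a wait issued before any signal is rejected, and a second wait after a single signal is rejected as well, because the first wait already consumed the signaled state.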

3.OpenGL Interoperability

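Traditional OpenGL-CUDA interop (see OpenGL Interoperability) works by CUDA directly consuming handles created in OpenGL. However, because OpenGL can also consume memory and synchronization objects created in Vulkan, there is an alternative approach to OpenGL-CUDA interop. Essentially, memory and synchronization objects exported by Vulkan can be imported into both OpenGL and CUDA and then used to coordinate memory accesses between OpenGL and CUDA. For details on how to import memory and synchronization objects exported by Vulkan, refer to the following OpenGL extensions: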

  • GL_EXT_memory_object
  • GL_EXT_memory_object_fd
  • GL_EXT_memory_object_win32
  • GL_EXT_semaphore
  • GL_EXT_semaphore_fd
  • GL_EXT_semaphore_win32
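
Before taking this route, it is worth confirming that the OpenGL context actually exposes the extensions listed above. The sketch below shows only the membership check in plain C++. It assumes the application has already collected the extension names into a vector (for example by looping over glGetStringi(GL_EXTENSIONS, i)); the function name missingInteropExtensions and the useWin32Handles parameter are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the interop extensions from the list above that are absent from 'available'.
// 'useWin32Handles' selects the *_win32 variants (Windows) instead of the *_fd variants (Linux).
std::vector<std::string> missingInteropExtensions(const std::vector<std::string> &available,
                                                  bool useWin32Handles) {
    std::vector<std::string> required = {"GL_EXT_memory_object", "GL_EXT_semaphore"};
    if (useWin32Handles) {
        required.push_back("GL_EXT_memory_object_win32");
        required.push_back("GL_EXT_semaphore_win32");
    } else {
        required.push_back("GL_EXT_memory_object_fd");
        required.push_back("GL_EXT_semaphore_fd");
    }

    std::vector<std::string> missing;
    for (const std::string &name : required) {
        if (std::find(available.begin(), available.end(), name) == available.end()) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

An empty result means every extension needed for the chosen handle type is present.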

4.Direct3D 12 Interoperability

4.1.Matching Device LUIDs

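When importing memory and synchronization objects exported by Direct3D 12, they must be imported and mapped on the same device on which they were created. The CUDA device that corresponds to a given Direct3D 12 device can be determined by comparing the LUID of the CUDA device with that of the Direct3D 12 device, as shown in the code sample below. Note that the Direct3D 12 device must not be created on a linked node adapter; that is, the node count returned by ID3D12Device::GetNodeCount must be 1.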

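A brief explanation of UUID and LUID follows.

UUID stands for Universally Unique IDentifier. It is a 128-bit value, and every physical device has its own UUID that is unique worldwide.

LUID stands for Locally Unique IDentifier. It is a 64-bit value that is unique only within a single local system, and it is specific to the Windows operating system. Because a LUID is half the size of a UUID, it is cheaper to store and compare.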

int getCudaDeviceForD3D12Device(ID3D12Device *d3d12Device) {
    LUID d3d12Luid = d3d12Device->GetAdapterLuid();

    int cudaDeviceCount;
    cudaGetDeviceCount(&cudaDeviceCount);

    for (int cudaDevice = 0; cudaDevice < cudaDeviceCount; cudaDevice++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, cudaDevice);
        char *cudaLuid = deviceProp.luid;

        if (!memcmp(&d3d12Luid.LowPart, cudaLuid, sizeof(d3d12Luid.LowPart)) &&
            !memcmp(&d3d12Luid.HighPart, cudaLuid + sizeof(d3d12Luid.LowPart), sizeof(d3d12Luid.HighPart))) {
            return cudaDevice;
        }
    }
    return cudaInvalidDeviceId;
}
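
The two memcmp calls above compare the 4-byte LowPart and the 4-byte HighPart of the Windows LUID against the first and second halves of the 8-byte luid array reported by CUDA. The following portable sketch reproduces that comparison without Windows or CUDA headers. LuidStandIn is a stand-in that mirrors the layout of the Windows LUID structure (a 32-bit unsigned low part followed by a 32-bit signed high part), and luidMatches is a hypothetical helper name.

```cpp
#include <cstdint>
#include <cstring>

// Stand-in for the Windows LUID structure (DWORD LowPart; LONG HighPart).
struct LuidStandIn {
    std::uint32_t LowPart;
    std::int32_t HighPart;
};

// Compares a LUID against an 8-byte identifier laid out like cudaDeviceProp::luid:
// bytes [0, 4) hold the low part and bytes [4, 8) hold the high part.
bool luidMatches(const LuidStandIn &luid, const char cudaLuid[8]) {
    return std::memcmp(&luid.LowPart, cudaLuid, sizeof(luid.LowPart)) == 0 &&
           std::memcmp(&luid.HighPart, cudaLuid + sizeof(luid.LowPart), sizeof(luid.HighPart)) == 0;
}
```

Comparing the two parts separately, rather than the structure as a whole, keeps the check independent of any padding the compiler might add to the structure.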

4.2.Importing Memory Objects

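A shareable Direct3D 12 heap memory object, created by setting the flag D3D12_HEAP_FLAG_SHARED in the call to ID3D12Device::CreateHeap, can be imported into CUDA using the NT handle associated with that object, as shown below. Note that the application is responsible for closing the NT handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly closed before the underlying memory can be freed.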

cudaExternalMemory_t importD3D12HeapFromNTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D12Heap;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extMem;
}

A shareable Direct3D 12 heap memory object can also be imported using a named handle, if one exists, as shown below.

cudaExternalMemory_t importD3D12HeapFromNamedNTHandle(LPCWSTR name, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D12Heap;
    desc.handle.win32.name = (void *)name;
    desc.size = size;

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}

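A shareable Direct3D 12 committed resource, created by setting the flag D3D12_HEAP_FLAG_SHARED in the call to ID3D12Device::CreateCommittedResource, can be imported into CUDA using the NT handle associated with that object, as shown below. When importing a Direct3D 12 committed resource, the flag cudaExternalMemoryDedicated must be set. Note that the application is responsible for closing the NT handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly closed before the underlying memory can be freed.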

cudaExternalMemory_t importD3D12CommittedResourceFromNTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extMem;
}

A shareable Direct3D 12 committed resource can also be imported using a named handle, if one exists, as shown below.

cudaExternalMemory_t importD3D12CommittedResourceFromNamedNTHandle(LPCWSTR name, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}

4.3.Mapping Buffers onto Imported Memory Objects

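A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match the values specified when the mapping was created with the corresponding Direct3D 12 API. All mapped device pointers must be freed with cudaFree().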

void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.size = size;

    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);

    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}

4.4.Mapping Mipmapped Arrays onto Imported Memory Objects

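A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format, and number of mip levels must match the values specified when the mapping was created with the corresponding Direct3D 12 API. Additionally, if the mipmapped array can be bound as a render target in Direct3D 12, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed with cudaFreeMipmappedArray(). The following code sample shows how to convert Direct3D 12 parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.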

cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;

    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);

    return mipmap;
}

cudaChannelFormatDesc getCudaChannelFormatDescForDxgiFormat(DXGI_FORMAT dxgiFormat)
{
    cudaChannelFormatDesc d;

    memset(&d, 0, sizeof(d));

    switch (dxgiFormat) {
    case DXGI_FORMAT_R8_UINT:            d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8_SINT:            d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R8G8_UINT:          d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8G8_SINT:          d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R8G8B8A8_UINT:      d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8G8B8A8_SINT:      d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16_UINT:           d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16_SINT:           d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16G16_UINT:        d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16G16_SINT:        d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16G16B16A16_UINT:  d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16G16B16A16_SINT:  d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32_UINT:           d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32_SINT:           d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32_FLOAT:          d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case DXGI_FORMAT_R32G32_UINT:        d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32G32_SINT:        d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32G32_FLOAT:       d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case DXGI_FORMAT_R32G32B32A32_UINT:  d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32G32B32A32_SINT:  d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32G32B32A32_FLOAT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindFloat;    break;
    default: assert(0);
    }

    return d;
}

cudaExtent getCudaExtentForD3D12Extent(UINT64 width, UINT height, UINT16 depthOrArraySize, D3D12_SRV_DIMENSION d3d12SRVDimension) {
    cudaExtent e = { 0, 0, 0 };

    switch (d3d12SRVDimension) {
    case D3D12_SRV_DIMENSION_TEXTURE1D:        e.width = width; e.height = 0;      e.depth = 0;                break;
    case D3D12_SRV_DIMENSION_TEXTURE2D:        e.width = width; e.height = height; e.depth = 0;                break;
    case D3D12_SRV_DIMENSION_TEXTURE3D:        e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D12_SRV_DIMENSION_TEXTURECUBE:      e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D12_SRV_DIMENSION_TEXTURE1DARRAY:   e.width = width; e.height = 0;      e.depth = depthOrArraySize; break;
    case D3D12_SRV_DIMENSION_TEXTURE2DARRAY:   e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D12_SRV_DIMENSION_TEXTURECUBEARRAY: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    default: assert(0);
    }

    return e;
}

unsigned int getCudaMipmappedArrayFlagsForD3D12Resource(D3D12_SRV_DIMENSION d3d12SRVDimension, D3D12_RESOURCE_FLAGS d3d12ResourceFlags, bool allowSurfaceLoadStore) {
    unsigned int flags = 0;

    switch (d3d12SRVDimension) {
    case D3D12_SRV_DIMENSION_TEXTURECUBE:      flags |= cudaArrayCubemap;                    break;
    case D3D12_SRV_DIMENSION_TEXTURECUBEARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
    case D3D12_SRV_DIMENSION_TEXTURE1DARRAY:   flags |= cudaArrayLayered;                    break;
    case D3D12_SRV_DIMENSION_TEXTURE2DARRAY:   flags |= cudaArrayLayered;                    break;
    default: break;
    }

    if (d3d12ResourceFlags & D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET) {
        flags |= cudaArrayColorAttachment;
    }
    if (allowSurfaceLoadStore) {
        flags |= cudaArraySurfaceLoadStore;
    }

    return flags;
}

4.5.Importing Synchronization Objects

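A shareable Direct3D 12 fence object, created by setting the flag D3D12_FENCE_FLAG_SHARED in the call to ID3D12Device::CreateFence, can be imported into CUDA using the NT handle associated with that object, as shown below. Note that the application is responsible for closing the NT handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly closed before the underlying semaphore can be freed.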

cudaExternalSemaphore_t importD3D12FenceFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeD3D12Fence;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extSem;
}

A shareable Direct3D 12 fence object can also be imported using a named handle, if one exists, as shown below.

cudaExternalSemaphore_t importD3D12FenceFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeD3D12Fence;
    desc.handle.win32.name = (void *)name;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}

4.6.Signaling/Waiting on Imported Synchronization Objects

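An imported Direct3D 12 fence object can be signaled as shown below. Signaling such a fence object sets its value to the one specified. The corresponding wait that waits on this signal must be issued in Direct3D 12. Additionally, that wait must be issued after this signal has been issued.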

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.fence.value = value;

    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

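An imported Direct3D 12 fence object can be waited on as shown below. Waiting on such a fence object waits until its value becomes greater than or equal to the specified value. The corresponding signal that this wait is waiting on must be issued in Direct3D 12. Additionally, the signal must be issued before this wait is issued.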

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.fence.value = value;

    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
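
Unlike the binary Vulkan semaphore, a Direct3D 12 fence carries a 64-bit value, and a wait is satisfied once that value is greater than or equal to the requested one. The sketch below models only this rule on the host in plain C++. FenceModel and its methods are hypothetical names, and nothing here calls into CUDA or Direct3D 12.

```cpp
#include <cstdint>

// Hypothetical host-side model of the fence semantics described in this section.
class FenceModel {
public:
    // Signaling sets the fence to the specified value.
    void signal(std::uint64_t value) { value_ = value; }

    // A wait for 'target' is satisfied once the fence value is >= target.
    bool isWaitSatisfied(std::uint64_t target) const { return value_ >= target; }

    std::uint64_t value() const { return value_; }

private:
    std::uint64_t value_ = 0;
};
```

In this model a single signal with value 3 satisfies waits for 1, 2, and 3 but not a wait for 4, which is why applications typically use a monotonically increasing counter for the values they signal.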

5.Direct3D 11 Interoperability

5.1.Matching Device LUIDs

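When importing memory and synchronization objects exported by Direct3D 11, they must be imported and mapped on the same device on which they were created. The CUDA device that corresponds to a given Direct3D 11 device can be determined by comparing the LUID of the CUDA device with that of the Direct3D 11 device, as shown in the code sample below.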

int getCudaDeviceForD3D11Device(ID3D11Device *d3d11Device) {
    IDXGIDevice *dxgiDevice;
    d3d11Device->QueryInterface(__uuidof(IDXGIDevice), (void **)&dxgiDevice);

    IDXGIAdapter *dxgiAdapter;
    dxgiDevice->GetAdapter(&dxgiAdapter);

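    DXGI_ADAPTER_DESC dxgiAdapterDesc;
    dxgiAdapter->GetDesc(&dxgiAdapterDesc);

    // Release the COM references acquired above; only the adapter description is needed from here on
    dxgiAdapter->Release();
    dxgiDevice->Release();

    LUID d3d11Luid = dxgiAdapterDesc.AdapterLuid;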

    int cudaDeviceCount;
    cudaGetDeviceCount(&cudaDeviceCount);

    for (int cudaDevice = 0; cudaDevice < cudaDeviceCount; cudaDevice++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, cudaDevice);
        char *cudaLuid = deviceProp.luid;

        if (!memcmp(&d3d11Luid.LowPart, cudaLuid, sizeof(d3d11Luid.LowPart)) &&
            !memcmp(&d3d11Luid.HighPart, cudaLuid + sizeof(d3d11Luid.LowPart), sizeof(d3d11Luid.HighPart))) {
            return cudaDevice;
        }
    }
    return cudaInvalidDeviceId;
}

5.2.Importing Memory Objects

A shareable Direct3D 11 texture resource, that is, an ID3D11Texture1D, ID3D11Texture2D, or ID3D11Texture3D, can be created by setting one of the following flags in the call to ID3D11Device::CreateTexture1D, ID3D11Device::CreateTexture2D, or ID3D11Device::CreateTexture3D, respectively:

  • D3D11_RESOURCE_MISC_SHARED or D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX (on Windows 7).
  • D3D11_RESOURCE_MISC_SHARED_NTHANDLE (on Windows 10).

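A shareable Direct3D 11 buffer resource, that is, an ID3D11Buffer, can be created by specifying either of the above flags in the call to ID3D11Device::CreateBuffer. A shareable resource created by specifying D3D11_RESOURCE_MISC_SHARED_NTHANDLE can be imported into CUDA using the NT handle associated with that object, as shown below. Note that the application is responsible for closing the NT handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly closed before the underlying memory can be freed. When importing a Direct3D 11 resource, the flag cudaExternalMemoryDedicated must be set.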

cudaExternalMemory_t importD3D11ResourceFromNTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D11Resource;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extMem;
}

A shareable Direct3D 11 resource can also be imported using a named handle, if one exists, as shown below.

cudaExternalMemory_t importD3D11ResourceFromNamedNTHandle(LPCWSTR name, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D11Resource;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}

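A shareable Direct3D 11 resource created by specifying D3D11_RESOURCE_MISC_SHARED or D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX can be imported into CUDA using the globally shared D3DKMT handle associated with that object, as shown below. Because a globally shared D3DKMT handle does not hold a reference to the underlying memory, it is automatically destroyed when all other references to the resource are destroyed.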

cudaExternalMemory_t importD3D11ResourceFromKMTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D11ResourceKmt;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}

5.3.Mapping Buffers onto Imported Memory Objects

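A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match those specified when the mapping was created with the corresponding Direct3D 11 API. All mapped device pointers must be freed using cudaFree().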

void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.size = size;

    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);

    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}
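Since a mismatched offset or size leads to undefined behavior, it can help to sanity-check the requested range on the host before calling cudaExternalMemoryGetMappedBuffer(). The sketch below is one possible check; isMappableRange is a hypothetical helper, not a CUDA function, and it only verifies that the range fits inside the imported size. It cannot verify that the range matches what Direct3D 11 actually uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the CUDA API): returns true when the byte
// range [offset, offset + size) lies entirely inside an imported memory object
// of importedSize bytes. Written so that offset + size is never computed and
// therefore cannot overflow.
bool isMappableRange(std::uint64_t offset, std::uint64_t size, std::uint64_t importedSize) {
    if (size == 0) return false;              // an empty mapping is treated as invalid here
    if (offset > importedSize) return false;  // the start lies past the end of the object
    return size <= importedSize - offset;     // safe: importedSize - offset cannot underflow
}
```

A caller could evaluate this with the same size that was passed to cudaImportExternalMemory() and skip the mapping when it returns false.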

5.4.Mapping Mipmapped Arrays onto Imported Memory Objects

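A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format, and number of mip levels must match those specified when the mapping was created with the corresponding Direct3D 11 API. Additionally, if the mipmapped array can be bound as a render target in Direct3D 11, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert Direct3D 11 parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.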

cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;

    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);

    return mipmap;
}

cudaChannelFormatDesc getCudaChannelFormatDescForDxgiFormat(DXGI_FORMAT dxgiFormat)
{
    cudaChannelFormatDesc d;
    memset(&d, 0, sizeof(d));
    switch (dxgiFormat) {
    case DXGI_FORMAT_R8_UINT:            d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8_SINT:            d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R8G8_UINT:          d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8G8_SINT:          d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R8G8B8A8_UINT:      d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8G8B8A8_SINT:      d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16_UINT:           d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16_SINT:           d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16G16_UINT:        d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16G16_SINT:        d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16G16B16A16_UINT:  d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16G16B16A16_SINT:  d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32_UINT:           d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32_SINT:           d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32_FLOAT:          d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case DXGI_FORMAT_R32G32_UINT:        d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32G32_SINT:        d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32G32_FLOAT:       d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case DXGI_FORMAT_R32G32B32A32_UINT:  d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32G32B32A32_SINT:  d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32G32B32A32_FLOAT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindFloat;    break;
    default: assert(0);
    }
    return d;
}

cudaExtent getCudaExtentForD3D11Extent(UINT64 width, UINT height, UINT16 depthOrArraySize, D3D11_SRV_DIMENSION d3d11SRVDimension) {
    cudaExtent e = { 0, 0, 0 };

    switch (d3d11SRVDimension) {
    case D3D11_SRV_DIMENSION_TEXTURE1D:        e.width = width; e.height = 0;      e.depth = 0;                break;
    case D3D11_SRV_DIMENSION_TEXTURE2D:        e.width = width; e.height = height; e.depth = 0;                break;
    case D3D11_SRV_DIMENSION_TEXTURE3D:        e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D11_SRV_DIMENSION_TEXTURECUBE:      e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D11_SRV_DIMENSION_TEXTURE1DARRAY:   e.width = width; e.height = 0;      e.depth = depthOrArraySize; break;
    case D3D11_SRV_DIMENSION_TEXTURE2DARRAY:   e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D11_SRV_DIMENSION_TEXTURECUBEARRAY: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    default: assert(0);
    }
    return e;
}

unsigned int getCudaMipmappedArrayFlagsForD3D11Resource(D3D11_SRV_DIMENSION d3d11SRVDimension, D3D11_BIND_FLAG d3d11BindFlags, bool allowSurfaceLoadStore) {
    unsigned int flags = 0;

    switch (d3d11SRVDimension) {
    case D3D11_SRV_DIMENSION_TEXTURECUBE:      flags |= cudaArrayCubemap;                    break;
    case D3D11_SRV_DIMENSION_TEXTURECUBEARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
    case D3D11_SRV_DIMENSION_TEXTURE1DARRAY:   flags |= cudaArrayLayered;                    break;
    case D3D11_SRV_DIMENSION_TEXTURE2DARRAY:   flags |= cudaArrayLayered;                    break;
    default: break;
    }

    if (d3d11BindFlags & D3D11_BIND_RENDER_TARGET) {
        flags |= cudaArrayColorAttachment;
    }

    if (allowSurfaceLoadStore) {
        flags |= cudaArraySurfaceLoadStore;
    }

    return flags;
}
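The number of mip levels passed to CUDA has to equal the number of levels the Direct3D 11 texture really has. When a texture is created with MipLevels set to 0, Direct3D 11 generates the full mip chain, so the explicit count needed for desc.numLevels has to be computed. The sketch below shows one way to do that; fullMipChainLevels is a hypothetical helper, and it assumes each dimension is halved (rounding down, with a minimum of 1) from one level to the next. For layered or cubemap arrays the layer count does not shrink across levels, so 0 should be passed for depth in that case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of levels in a full mip chain for a texture of
// the given base dimensions. Unused dimensions may be passed as 0, following
// the cudaExtent convention used in the code above.
unsigned int fullMipChainLevels(std::size_t width, std::size_t height, std::size_t depth) {
    assert(width > 0 && "the base level must have a non-zero width");
    std::size_t largest = std::max(width, std::max(height, depth));
    unsigned int levels = 1;   // the base level itself
    while (largest > 1) {      // one more level per halving of the largest dimension
        largest /= 2;
        ++levels;
    }
    return levels;
}
```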

5.5.Importing Synchronization Objects

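A shareable Direct3D 11 fence object, created by setting the flag D3D11_FENCE_FLAG_SHARED when calling ID3D11Device5::CreateFence, can be imported into CUDA using the NT handle associated with that object, as shown below. Note that it is the application's responsibility to close the NT handle when it is no longer required. The NT handle holds a reference to the resource, so the handle must be explicitly closed before the underlying semaphore can be freed.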

cudaExternalSemaphore_t importD3D11FenceFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeD3D11Fence;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extSem;
}

A shareable Direct3D 11 fence object can also be imported using a named handle, if one exists, as shown below.

cudaExternalSemaphore_t importD3D11FenceFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeD3D11Fence;
    desc.handle.win32.name = (void *)name;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}

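A shareable Direct3D 11 keyed mutex object associated with a shareable Direct3D 11 resource, IDXGIKeyedMutex, is created by setting the flag D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX when the resource is created, and can be imported into CUDA using the NT handle associated with that object, as shown below. Note that it is the application's responsibility to close the NT handle when it is no longer required. The NT handle holds a reference to the resource, so the handle must be explicitly closed before the underlying semaphore can be freed.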

cudaExternalSemaphore_t importD3D11KeyedMutexFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutex;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extSem;
}

A shareable Direct3D 11 keyed mutex object can also be imported using a named handle, if one exists, as shown below.

cudaExternalSemaphore_t importD3D11KeyedMutexFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutex;
    desc.handle.win32.name = (void *)name;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}

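A shareable Direct3D 11 keyed mutex object can be imported into CUDA using the globally shared D3DKMT handle associated with that object, as shown below. Because a globally shared D3DKMT handle does not hold a reference to the underlying memory, it is automatically destroyed when all other references to the resource are destroyed.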

cudaExternalSemaphore_t importD3D11KeyedMutexFromKMTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutexKmt;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    // A globally shared D3DKMT handle is not an NT handle and holds no reference
    // to the resource, so it is not closed here.

    return extSem;
}

5.6.Signaling/Waiting on Imported Synchronization Objects

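An imported Direct3D 11 fence object can be signaled as shown below. Signaling such a fence object sets its value to the one specified. The corresponding wait that waits on this signal must be issued in Direct3D 11. Additionally, the wait that waits on this signal must be issued after this signal has been issued.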

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.fence.value = value;

    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

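An imported Direct3D 11 fence object can be waited on as shown below. Waiting on such a fence object blocks until its value becomes greater than or equal to the specified value. The corresponding signal that this wait is waiting on must be issued in Direct3D 11. Additionally, the signal must be issued before this wait can be issued.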

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.fence.value = value;

    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
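The fence rules above (a signal sets the value, a wait is satisfied once the value is greater than or equal to the target) are easier to reason about with a concrete numbering scheme. The sketch below is a host-only model, not CUDA or Direct3D code: FenceModel, FrameFenceValues, and fenceValuesForFrame are hypothetical names, and the convention of using two consecutive values per frame (one signaled by Direct3D 11, one by CUDA) is just one possible choice.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of a fence, for illustration only.
struct FenceModel {
    std::uint64_t value = 0;

    // Models a signal: the fence takes the specified value.
    void signal(std::uint64_t v) { value = v; }

    // Models the wait condition: satisfied once value >= v.
    bool waitSatisfied(std::uint64_t v) const { return value >= v; }
};

// Fence values used for one frame under the assumed convention.
struct FrameFenceValues {
    std::uint64_t cudaWait;    // signaled by Direct3D 11 when the frame is ready for CUDA
    std::uint64_t cudaSignal;  // signaled by CUDA when it has finished with the frame
};

// Assumed convention: frame n uses the values 2n + 1 and 2n + 2, so the
// sequence of signaled values only ever increases.
FrameFenceValues fenceValuesForFrame(std::uint64_t frameIndex) {
    FrameFenceValues v;
    v.cudaWait = 2 * frameIndex + 1;
    v.cudaSignal = 2 * frameIndex + 2;
    return v;
}
```

With such a scheme, cudaWait would be the value passed to the waitExternalSemaphore() helper above and cudaSignal the value passed to signalExternalSemaphore().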

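An imported Direct3D 11 keyed mutex object can be signaled as shown below. Signaling such a keyed mutex object with a key value releases the keyed mutex for that value. The corresponding wait that waits on this signal must be issued in Direct3D 11 with the same key value. Additionally, the Direct3D 11 wait must be issued after this signal has been issued.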

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long key, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.keyedmutex.key = key;

    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

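An imported Direct3D 11 keyed mutex object can be waited on as shown below. A timeout value in milliseconds is required when waiting on such a keyed mutex. The wait operation blocks until the keyed mutex value equals the specified key value or until the timeout has elapsed. The timeout interval can also be infinite, in which case the wait never times out; the Windows INFINITE macro must be used to specify an infinite timeout. The corresponding signal that this wait is waiting on must be issued in Direct3D 11. Additionally, the Direct3D 11 signal must be issued before this wait can be issued.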

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long key, unsigned int timeoutMs, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.keyedmutex.key = key;
    params.params.keyedmutex.timeoutMs = timeoutMs;

    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
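The key-based handoff described above can be modeled on the host to make the ordering rules concrete. The sketch below is not CUDA or Direct3D code: KeyedMutexModel is a hypothetical class, it is single-threaded and never blocks (tryAcquire simply reports whether a wait with that key would succeed right now), and it assumes the mutex starts out released with key 0.

```cpp
#include <cassert>
#include <cstdint>

// Host-side, non-blocking model of keyed mutex semantics (illustration only).
class KeyedMutexModel {
public:
    // Models a wait: succeeds only when the mutex is free and was last
    // released with the same key.
    bool tryAcquire(std::uint64_t key) {
        if (held_ || key != releasedKey_) return false;
        held_ = true;
        return true;
    }

    // Models a signal: releases the mutex and records the key that the next
    // acquirer has to present.
    void release(std::uint64_t key) {
        assert(held_ && "release without a matching acquire");
        releasedKey_ = key;
        held_ = false;
    }

private:
    std::uint64_t releasedKey_ = 0;  // assumption: initially released with key 0
    bool held_ = false;
};
```

A typical ping-pong would have Direct3D 11 acquire with key 0 and release with key 1, while CUDA waits with key 1 and signals with key 0.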

6.NVIDIA Software Communication Interface Interoperability (NVSCI)

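NvSciBuf and NvSciSync are interfaces developed for the following purposes:

  • NvSciBuf: allows applications to allocate and exchange buffers in memory. It is used to share memory buffers between different components or processes while preserving memory safety and consistency.
  • NvSciSync: allows applications to manage synchronization objects at operation boundaries. It provides synchronization across devices or processing units so that operations execute in the intended order.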

6.1.Importing Memory Objects

To allocate an NvSciBuf object compatible with a given CUDA device, the corresponding GPU ID must be set with NvSciBufGeneralAttrKey_GpuId in the NvSciBuf attribute list. Applications can optionally specify the following attributes as well:

  • NvSciBufGeneralAttrKey_NeedCpuAccess: specifies whether CPU access to the buffer is required.
  • NvSciBufRawBufferAttrKey_Align: specifies the alignment requirement of an NvSciBufType_RawBuffer.
  • NvSciBufGeneralAttrKey_RequiredPerm: different access permissions can be configured for each instance of an NvSciBuf memory object. For example, to give the GPU read-only access to a buffer, create a duplicate NvSciBuf object by calling NvSciBufObjDupWithReducePerm() with NvSciBufAccessPerm_Readonly as the input parameter, then import the newly created duplicate into CUDA; its permissions are reduced accordingly.
  • NvSciBufGeneralAttrKey_EnableGpuCache: controls GPU L2 cacheability.
  • NvSciBufGeneralAttrKey_EnableGpuCompression: specifies whether GPU compression is enabled.

// Assumes SIZE (buffer size in bytes), dev (a CUdevice) and bufModule (an opened
// NvSciBufModule) are defined elsewhere by the application.
NvSciBufObj createNvSciBufObject() {
    // Raw buffer attributes for CUDA
    NvSciBufType bufType = NvSciBufType_RawBuffer;
    uint64_t rawsize = SIZE;
    uint64_t align = 0;
    bool cpuaccess_flag = true;
    NvSciBufAttrValAccessPerm perm = NvSciBufAccessPerm_ReadWrite;

    // Identify the target GPU by the UUID of the CUDA device
    NvSciRmGpuId gpuid[1] = {};
    CUuuid uuid;
    cuDeviceGetUuid(&uuid, dev);
    memcpy(&gpuid[0].bytes, &uuid.bytes, sizeof(uuid.bytes));

    // Disable cache on dev
    NvSciBufAttrValGpuCache gpuCache[] = {{gpuid[0], false}};
    // Request generic compression on dev
    NvSciBufAttrValGpuCompression gpuCompression[] = {{gpuid[0], NvSciBufCompressionType_GenericCompressible}};

    // Fill in values
    NvSciBufAttrKeyValuePair rawbuffattrs[] = {
         { NvSciBufGeneralAttrKey_Types, &bufType, sizeof(bufType) },
         { NvSciBufRawBufferAttrKey_Size, &rawsize, sizeof(rawsize) },
         { NvSciBufRawBufferAttrKey_Align, &align, sizeof(align) },
         { NvSciBufGeneralAttrKey_NeedCpuAccess, &cpuaccess_flag, sizeof(cpuaccess_flag) },
         { NvSciBufGeneralAttrKey_RequiredPerm, &perm, sizeof(perm) },
         { NvSciBufGeneralAttrKey_GpuId, &gpuid, sizeof(gpuid) },
         { NvSciBufGeneralAttrKey_EnableGpuCache, &gpuCache, sizeof(gpuCache) },
         { NvSciBufGeneralAttrKey_EnableGpuCompression, &gpuCompression, sizeof(gpuCompression) }
    };

    NvSciBufAttrList attrListBuffer = NULL;
    NvSciBufAttrList attrListReconciledBuffer = NULL;
    NvSciBufAttrList attrListConflictBuffer = NULL;
    NvSciBufObj bufferObjRaw = NULL;

    // Create the attribute list first, then set the attributes on it
    NvSciBufAttrListCreate(bufModule, &attrListBuffer);
    NvSciBufAttrListSetAttrs(attrListBuffer, rawbuffattrs,
            sizeof(rawbuffattrs)/sizeof(NvSciBufAttrKeyValuePair));

    // Reconcile and allocate
    NvSciBufAttrListReconcile(&attrListBuffer, 1, &attrListReconciledBuffer,
                       &attrListConflictBuffer);
    NvSciBufObjAlloc(attrListReconciledBuffer, &bufferObjRaw);
    return bufferObjRaw;
}

NvSciBufObj bufferObjRo; // Read-only NvSciBuf memory object
// Create a duplicate handle to the same memory buffer with reduced permissions
NvSciBufObjDupWithReducePerm(bufferObjRaw, NvSciBufAccessPerm_Readonly, &bufferObjRo);
return bufferObjRo;

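The allocated NvSciBuf memory object can be imported into CUDA using the NvSciBufObj handle, as shown below. The application should query the allocated NvSciBufObj for the attributes required to fill in the CUDA external memory descriptor. Note that the attribute list and the NvSciBuf objects should be maintained by the application. If the NvSciBuf object imported into CUDA is also mapped by other drivers, then, depending on the value of the output attribute NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency, the application must use NvSciSync objects (see Section 6.4) as appropriate barriers to maintain coherence between CUDA and the other drivers.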

// Assumes gpuid is the NvSciRmGpuId array that was filled in when the buffer was allocated.
cudaExternalMemory_t importNvSciBufObject(NvSciBufObj bufferObjRaw) {

    /*************** Query NvSciBuf Object **************/
    NvSciBufAttrKeyValuePair bufattrs[] = {
                { NvSciBufRawBufferAttrKey_Size, NULL, 0 },
                { NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency, NULL, 0 },
                { NvSciBufGeneralAttrKey_EnableGpuCompression, NULL, 0 }
    };
    // Retrieve the reconciled attribute list attached to the object, then query it
    NvSciBufAttrList retList = NULL;
    NvSciBufObjGetAttrList(bufferObjRaw, &retList);
    NvSciBufAttrListGetAttrs(retList, bufattrs,
        sizeof(bufattrs)/sizeof(NvSciBufAttrKeyValuePair));
    uint64_t ret_size = *(static_cast<const uint64_t*>(bufattrs[0].value));

    // Note cache and compression are per GPU attributes, so read values for specific gpu by comparing UUID
    // Read cacheability granted by NvSciBuf
    int numGpus = bufattrs[1].len / sizeof(NvSciBufAttrValGpuCache);
    const NvSciBufAttrValGpuCache *cacheVal = (const NvSciBufAttrValGpuCache *)bufattrs[1].value;
    bool ret_cacheVal = false;
    for (int i = 0; i < numGpus; i++) {
        if (memcmp(gpuid[0].bytes, cacheVal[i].gpuId.bytes, sizeof(CUuuid)) == 0) {
            ret_cacheVal = cacheVal[i].cacheability;
        }
    }

    // Read compression granted by NvSciBuf
    numGpus = bufattrs[2].len / sizeof(NvSciBufAttrValGpuCompression);
    const NvSciBufAttrValGpuCompression *compVal = (const NvSciBufAttrValGpuCompression *)bufattrs[2].value;
    NvSciBufCompressionType ret_compVal = NvSciBufCompressionType_None;
    for (int i = 0; i < numGpus; i++) {
        if (memcmp(gpuid[0].bytes, compVal[i].gpuId.bytes, sizeof(CUuuid)) == 0) {
            ret_compVal = compVal[i].compressionType;
        }
    }

    /*************** NvSciBuf Registration With CUDA **************/

    // Fill up CUDA_EXTERNAL_MEMORY_HANDLE_DESC
    cudaExternalMemory_t extMemBuffer = NULL;
    cudaExternalMemoryHandleDesc memHandleDesc;
    memset(&memHandleDesc, 0, sizeof(memHandleDesc));
    memHandleDesc.type = cudaExternalMemoryHandleTypeNvSciBuf;
    // Set the NvSciBuf object with the required access permissions in this step.
    // To import with reduced permissions, pass the duplicated read-only object
    // (bufferObjRo in the snippet above) here instead of bufferObjRaw.
    memHandleDesc.handle.nvSciBufObject = bufferObjRaw;
    memHandleDesc.size = ret_size;
    cudaImportExternalMemory(&extMemBuffer, &memHandleDesc);
    return extMemBuffer;
}
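Because cacheability and compression are reported per GPU, the lookup pattern in the code above (scan an array of per-GPU entries and compare 16-byte UUIDs with memcmp) is worth isolating. The sketch below shows that pattern on its own. GpuUuid and PerGpuCacheValue are hypothetical stand-ins that only mirror the shape of the NvSciBuf types, so the example compiles without the NvSci headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins mirroring the shape of a GPU UUID and of a per-GPU
// attribute entry; the real types come from the NvSciBuf headers.
struct GpuUuid {
    unsigned char bytes[16];
};

struct PerGpuCacheValue {
    GpuUuid gpuId;
    bool cacheability;
};

// Returns the index of the entry whose gpuId matches target byte for byte,
// or -1 when no entry belongs to that GPU.
int findGpuEntry(const PerGpuCacheValue* entries, std::size_t count, const GpuUuid& target) {
    for (std::size_t i = 0; i < count; ++i) {
        if (std::memcmp(entries[i].gpuId.bytes, target.bytes, sizeof(target.bytes)) == 0) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Returning -1 for a missing GPU makes the "no matching entry" case explicit, which the loops above leave to the initial value of the result variable.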

6.2.Mapping Buffers onto Imported Memory Objects

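A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping can be filled in from the attributes of the allocated NvSciBufObj. All mapped device pointers must be freed using cudaFree().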

void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.size = size;

    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);

    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}

6.3.Mapping Mipmapped Arrays onto Imported Memory Objects

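A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, and format can be filled in from the attributes of the allocated NvSciBufObj. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert NvSciBuf attributes into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.

Note:

The number of mip levels must be 1.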

cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;

    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);

    return mipmap;
}

6.4.Importing Synchronization Objects

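NvSciSync attributes that are compatible with a given CUDA device can be generated using cudaDeviceGetNvSciSyncAttributes(). The returned attribute list can be used to create an NvSciSyncObj that is compatible with that CUDA device.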

NvSciSyncObj createNvSciSyncObject() {
    NvSciSyncObj nvSciSyncObj;
    int cudaDev0 = 0;
    int cudaDev1 = 1;
    NvSciSyncAttrList signalerAttrList = NULL;
    NvSciSyncAttrList waiterAttrList = NULL;
    NvSciSyncAttrList reconciledList = NULL;
    NvSciSyncAttrList newConflictList = NULL;

    NvSciSyncAttrListCreate(module, &signalerAttrList);
    NvSciSyncAttrListCreate(module, &waiterAttrList);
    NvSciSyncAttrList unreconciledList[2] = {NULL, NULL};
    unreconciledList[0] = signalerAttrList;
    unreconciledList[1] = waiterAttrList;

    cudaDeviceGetNvSciSyncAttributes(signalerAttrList, cudaDev0, CUDA_NVSCISYNC_ATTR_SIGNAL);
    cudaDeviceGetNvSciSyncAttributes(waiterAttrList, cudaDev1, CUDA_NVSCISYNC_ATTR_WAIT);

    NvSciSyncAttrListReconcile(unreconciledList, 2, &reconciledList, &newConflictList);

    NvSciSyncObjAlloc(reconciledList, &nvSciSyncObj);

    return nvSciSyncObj;
}

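An NvSciSync object created as described above can be imported into CUDA using the NvSciSyncObj handle, as shown below. Note that ownership of the NvSciSyncObj handle remains with the application even after it has been imported.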

cudaExternalSemaphore_t importNvSciSyncObject(void* nvSciSyncObj) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeNvSciSync;
    desc.handle.nvSciSyncObj = nvSciSyncObj;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Deleting/Freeing the nvSciSyncObj beyond this point will lead to undefined behavior in CUDA

    return extSem;
}

6.5.Signaling/Waiting on Imported Synchronization Objects

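An imported NvSciSyncObj object can be signaled as shown below. Signaling an NvSciSync-backed semaphore object initializes the fence parameter passed as input. This fence parameter is then waited on by the wait operation that corresponds to this signal. Additionally, the wait that waits on this signal must be issued after this signal has been issued. If the flags are set to cudaExternalSemaphoreSignalSkipNvSciBufMemSync, the memory synchronization operations (over all the NvSciBuf objects imported in this process) that are executed by default as part of the signal operation are skipped. This flag should be set when NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency is FALSE.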

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream, void *fence) {
    cudaExternalSemaphoreSignalParams signalParams = {};

    memset(&signalParams, 0, sizeof(signalParams));

    signalParams.params.nvSciSync.fence = (void*)fence;
    signalParams.flags = 0; //OR cudaExternalSemaphoreSignalSkipNvSciBufMemSync

    cudaSignalExternalSemaphoresAsync(&extSem, &signalParams, 1, stream);

}

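An imported NvSciSyncObj object can be waited on as shown below. Waiting on an NvSciSync-backed semaphore object waits until the input fence parameter has been signaled by the corresponding signaler. Additionally, the signal must be issued before the wait can be issued. If the flags are set to cudaExternalSemaphoreWaitSkipNvSciBufMemSync, the memory synchronization operations (over all the NvSciBuf objects imported in this process) that are executed by default as part of the wait operation are skipped. This flag should be set when NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency is FALSE.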

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream, void *fence) {
     cudaExternalSemaphoreWaitParams waitParams = {};

    memset(&waitParams, 0, sizeof(waitParams));

    waitParams.params.nvSciSync.fence = (void*)fence;
    waitParams.flags = 0; //OR cudaExternalSemaphoreWaitSkipNvSciBufMemSync

    cudaWaitExternalSemaphoresAsync(&extSem, &waitParams, 1, stream);
}
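The rule for the skip flags (set them when NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency is FALSE, leave flags at 0 otherwise) can be captured in a one-line helper. The sketch below is only an illustration: nvSciSyncMemSyncFlags is a hypothetical name, and the skip flag is passed in as a parameter so that the example does not depend on the CUDA headers.

```cpp
#include <cassert>

// Hypothetical helper: chooses the flags value for a signal or wait on an
// NvSciSync-backed semaphore. skipMemSyncFlag is the relevant skip constant
// (cudaExternalSemaphoreSignalSkipNvSciBufMemSync for a signal,
// cudaExternalSemaphoreWaitSkipNvSciBufMemSync for a wait), supplied by the caller.
unsigned int nvSciSyncMemSyncFlags(bool needCacheCoherency, unsigned int skipMemSyncFlag) {
    // Memory synchronization is only needed when software cache coherency is
    // required; otherwise it can be skipped.
    return needCacheCoherency ? 0u : skipMemSyncFlag;
}
```

The result would be assigned to signalParams.flags or waitParams.flags in the helpers above, with needCacheCoherency taken from the attribute queried on the NvSciBuf object.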