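The "CUDA Programming" blog series follows NVIDIA's official "CUDA C++ Programming Guide (v12.6)".
This is an original article. Do not repost it without the author's permission, and credit the source when reposting.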
1.External Resource Interoperability
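External resource interoperability allows CUDA to import certain resources that other APIs explicitly export. These objects are typically exported through handles native to the operating system, such as file descriptors on Linux or NT handles on Windows. They can also be exported through other unified interfaces, such as the NVIDIA Software Communication Interface. Two types of resources can be imported: memory objects and synchronization objects.
Memory objects are imported into CUDA with cudaImportExternalMemory(). An imported memory object can be accessed from within kernels through device pointers mapped onto it with cudaExternalMemoryGetMappedBuffer(), or through CUDA mipmapped arrays mapped onto it with cudaExternalMemoryGetMappedMipmappedArray(). Depending on the type of memory object, more than one mapping may be set up on a single memory object. The mappings must match the mappings set up in the exporting API; any mismatched mapping results in undefined behavior. Imported memory objects must be freed with cudaDestroyExternalMemory(). Freeing a memory object does not free any mappings onto it, so every device pointer mapped onto the object must be explicitly freed with cudaFree(), and every CUDA mipmapped array mapped onto it must be explicitly freed with cudaFreeMipmappedArray(). Accessing a mapping after its memory object has been destroyed is illegal.
Synchronization objects are imported into CUDA with cudaImportExternalSemaphore(). An imported synchronization object can be signaled with cudaSignalExternalSemaphoresAsync() and waited on with cudaWaitExternalSemaphoresAsync(). It is illegal to issue a wait before the corresponding signal has been issued. Depending on the type of the imported synchronization object, additional constraints may apply to how it can be signaled and waited on, as described in later sections. Imported semaphore objects must be freed with cudaDestroyExternalSemaphore(). All outstanding signals and waits must have completed before the semaphore object is destroyed.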
2.Vulkan Interoperability
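Vulkan is a cross-platform, low-overhead graphics and compute API developed by the Khronos Group. Designed as a successor to OpenGL, it gives applications more direct and efficient control over the GPU and is suited to both graphics rendering and general-purpose compute.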
2.1.Matching device UUIDs
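When importing memory and synchronization objects exported by Vulkan, they must be imported and mapped on the same device on which they were created. Every physical device has a unique identifier (UUID), so the CUDA device that corresponds to a Vulkan physical device can be found by comparing the UUID of each CUDA device with that of the Vulkan physical device, as shown in the following code. In addition, the Vulkan physical device must not be part of a multi-GPU device group: the device group reported by vkEnumeratePhysicalDeviceGroups that contains the physical device must have a physical device count of 1.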
int getCudaDeviceForVulkanPhysicalDevice(VkPhysicalDevice vkPhysicalDevice) {
    // Query the UUID of the Vulkan physical device through the properties2 chain
    VkPhysicalDeviceIDProperties vkPhysicalDeviceIDProperties = {};
    vkPhysicalDeviceIDProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES;
    vkPhysicalDeviceIDProperties.pNext = NULL;
    VkPhysicalDeviceProperties2 vkPhysicalDeviceProperties2 = {};
    vkPhysicalDeviceProperties2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    vkPhysicalDeviceProperties2.pNext = &vkPhysicalDeviceIDProperties;
    vkGetPhysicalDeviceProperties2(vkPhysicalDevice, &vkPhysicalDeviceProperties2);
    // Return the first CUDA device whose UUID matches
    int cudaDeviceCount;
    cudaGetDeviceCount(&cudaDeviceCount);
    for (int cudaDevice = 0; cudaDevice < cudaDeviceCount; cudaDevice++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, cudaDevice);
        if (!memcmp(&deviceProp.uuid, vkPhysicalDeviceIDProperties.deviceUUID, VK_UUID_SIZE)) {
            return cudaDevice;
        }
    }
    return cudaInvalidDeviceId;
}
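When a UUID lookup like the one above fails, it can help to print both UUIDs and compare them by eye. The following is a minimal sketch in plain C++ that assumes nothing beyond the fact that a device UUID is 16 raw bytes. The names kUuidSize, formatUuid, and sameUuid are hypothetical helpers introduced here for illustration; they are not part of CUDA or Vulkan, and the byte arrays stand in for cudaDeviceProp::uuid and VkPhysicalDeviceIDProperties::deviceUUID.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Stand-in for VK_UUID_SIZE: a device UUID is 16 raw bytes.
constexpr std::size_t kUuidSize = 16;

// Hypothetical helper: formats 16 raw bytes as 8-4-4-4-12 lowercase hex.
std::string formatUuid(const std::uint8_t (&uuid)[kUuidSize]) {
    std::string out;
    char byteText[3];
    for (std::size_t i = 0; i < kUuidSize; ++i) {
        if (i == 4 || i == 6 || i == 8 || i == 10) {
            out += '-';
        }
        std::snprintf(byteText, sizeof(byteText), "%02x", static_cast<unsigned>(uuid[i]));
        out += byteText;
    }
    return out;
}

// Hypothetical helper: the same byte-wise equality test that memcmp performs
// in the device lookup.
bool sameUuid(const std::uint8_t (&a)[kUuidSize], const std::uint8_t (&b)[kUuidSize]) {
    return std::memcmp(a, b, kUuidSize) == 0;
}
```

In a real application the two arrays would be filled from the CUDA and Vulkan property queries; logging formatUuid for each device makes a mismatch (for example, a different GPU being enumerated first) easy to spot.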
2.2.Importing Memory Objects
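On Linux and Windows 10, both dedicated and non-dedicated memory objects exported by Vulkan can be imported into CUDA. On Windows 7, only dedicated memory objects can be imported. When importing a Vulkan dedicated memory object, the flag cudaExternalMemoryDedicated must be set.
A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT can be imported into CUDA using the file descriptor associated with that object, as shown below. Note that CUDA assumes ownership of the file descriptor once it is imported. Using the file descriptor after a successful import results in undefined behavior.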
cudaExternalMemory_t importVulkanMemoryObjectFromFileDescriptor(int fd, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeOpaqueFd;
    desc.handle.fd = fd;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }
    cudaImportExternalMemory(&extMem, &desc);
    // Input parameter 'fd' should not be used beyond this point as CUDA has assumed ownership of it
    return extMem;
}
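A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT can be imported into CUDA using the NT handle associated with that object, as shown below. Note that CUDA does not assume ownership of the NT handle; the application must close the handle explicitly when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed.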
cudaExternalMemory_t importVulkanMemoryObjectFromNTHandle(HANDLE handle, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeOpaqueWin32;
    desc.handle.win32.handle = handle;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }
    cudaImportExternalMemory(&extMem, &desc);
    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);
    return extMem;
}
A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT can also be imported using a named handle, if one exists, as shown below.
cudaExternalMemory_t importVulkanMemoryObjectFromNamedNTHandle(LPCWSTR name, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeOpaqueWin32;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }
    cudaImportExternalMemory(&extMem, &desc);
    return extMem;
}
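A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT can be imported into CUDA using the globally shared D3DKMT handle associated with that object, as shown below. Because a globally shared D3DKMT handle does not hold a reference to the underlying memory, it is automatically destroyed when all other references to the resource are destroyed.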
cudaExternalMemory_t importVulkanMemoryObjectFromKMTHandle(HANDLE handle, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeOpaqueWin32Kmt;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }
    cudaImportExternalMemory(&extMem, &desc);
    return extMem;
}
2.3.Mapping Buffers onto Imported Memory Objects
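A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match those specified when the mapping was created with the corresponding Vulkan API. All mapped device pointers must be freed with cudaFree().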
void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.offset = offset;
    desc.size = size;
    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);
    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}
2.4.Mapping Mipmapped Arrays onto Imported Memory Objects
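A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format, and number of mip levels must match those specified when the mapping was created with the corresponding Vulkan API. In addition, if the mipmapped array is bound as a color target in Vulkan, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed with cudaFreeMipmappedArray(). The following code sample shows how to convert Vulkan parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.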
cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;
    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);
    return mipmap;
}
cudaChannelFormatDesc getCudaChannelFormatDescForVulkanFormat(VkFormat format)
{
    cudaChannelFormatDesc d;
    memset(&d, 0, sizeof(d));
    // x, y, z, w hold the bit width of each channel; f holds the channel data kind
    switch (format) {
        case VK_FORMAT_R8_UINT: d.x = 8; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case VK_FORMAT_R8_SINT: d.x = 8; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case VK_FORMAT_R8G8_UINT: d.x = 8; d.y = 8; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case VK_FORMAT_R8G8_SINT: d.x = 8; d.y = 8; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case VK_FORMAT_R8G8B8A8_UINT: d.x = 8; d.y = 8; d.z = 8; d.w = 8; d.f = cudaChannelFormatKindUnsigned; break;
        case VK_FORMAT_R8G8B8A8_SINT: d.x = 8; d.y = 8; d.z = 8; d.w = 8; d.f = cudaChannelFormatKindSigned; break;
        case VK_FORMAT_R16_UINT: d.x = 16; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case VK_FORMAT_R16_SINT: d.x = 16; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case VK_FORMAT_R16G16_UINT: d.x = 16; d.y = 16; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case VK_FORMAT_R16G16_SINT: d.x = 16; d.y = 16; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case VK_FORMAT_R16G16B16A16_UINT: d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindUnsigned; break;
        case VK_FORMAT_R16G16B16A16_SINT: d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindSigned; break;
        case VK_FORMAT_R32_UINT: d.x = 32; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case VK_FORMAT_R32_SINT: d.x = 32; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case VK_FORMAT_R32_SFLOAT: d.x = 32; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindFloat; break;
        case VK_FORMAT_R32G32_UINT: d.x = 32; d.y = 32; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case VK_FORMAT_R32G32_SINT: d.x = 32; d.y = 32; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case VK_FORMAT_R32G32_SFLOAT: d.x = 32; d.y = 32; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindFloat; break;
        case VK_FORMAT_R32G32B32A32_UINT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindUnsigned; break;
        case VK_FORMAT_R32G32B32A32_SINT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindSigned; break;
        case VK_FORMAT_R32G32B32A32_SFLOAT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindFloat; break;
        default: assert(0);
    }
    return d;
}
cudaExtent getCudaExtentForVulkanExtent(VkExtent3D vkExt, uint32_t arrayLayers, VkImageViewType vkImageViewType) {
    cudaExtent e = { 0, 0, 0 };
    // For layered and cubemap views, the CUDA depth carries the number of array layers
    switch (vkImageViewType) {
        case VK_IMAGE_VIEW_TYPE_1D: e.width = vkExt.width; e.height = 0; e.depth = 0; break;
        case VK_IMAGE_VIEW_TYPE_2D: e.width = vkExt.width; e.height = vkExt.height; e.depth = 0; break;
        case VK_IMAGE_VIEW_TYPE_3D: e.width = vkExt.width; e.height = vkExt.height; e.depth = vkExt.depth; break;
        case VK_IMAGE_VIEW_TYPE_CUBE: e.width = vkExt.width; e.height = vkExt.height; e.depth = arrayLayers; break;
        case VK_IMAGE_VIEW_TYPE_1D_ARRAY: e.width = vkExt.width; e.height = 0; e.depth = arrayLayers; break;
        case VK_IMAGE_VIEW_TYPE_2D_ARRAY: e.width = vkExt.width; e.height = vkExt.height; e.depth = arrayLayers; break;
        case VK_IMAGE_VIEW_TYPE_CUBE_ARRAY: e.width = vkExt.width; e.height = vkExt.height; e.depth = arrayLayers; break;
        default: assert(0);
    }
    return e;
}
unsigned int getCudaMipmappedArrayFlagsForVulkanImage(VkImageViewType vkImageViewType, VkImageUsageFlags vkImageUsageFlags, bool allowSurfaceLoadStore) {
    unsigned int flags = 0;
    switch (vkImageViewType) {
        case VK_IMAGE_VIEW_TYPE_CUBE: flags |= cudaArrayCubemap; break;
        case VK_IMAGE_VIEW_TYPE_CUBE_ARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
        case VK_IMAGE_VIEW_TYPE_1D_ARRAY: flags |= cudaArrayLayered; break;
        case VK_IMAGE_VIEW_TYPE_2D_ARRAY: flags |= cudaArrayLayered; break;
        default: break;
    }
    if (vkImageUsageFlags & VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT) {
        flags |= cudaArrayColorAttachment;
    }
    if (allowSurfaceLoadStore) {
        flags |= cudaArraySurfaceLoadStore;
    }
    return flags;
}
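The numLevels value passed to the mapping function above has to equal the mip level count the image was created with in Vulkan. As an illustration only, the sketch below computes the level count of a full mip chain, floor(log2(largest dimension)) + 1. It assumes the Vulkan image was created with a complete mip chain; an application that uses a partial chain must pass whatever count it actually specified. fullMipChainLevels is a hypothetical helper name, not a CUDA or Vulkan function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of levels in a full mip chain for the given
// base dimensions, i.e. floor(log2(largest dimension)) + 1.
std::uint32_t fullMipChainLevels(std::uint32_t width, std::uint32_t height, std::uint32_t depth) {
    // Treat a zero dimension as 1 so that 1D and 2D extents are handled uniformly.
    std::uint32_t largest = std::max({width, height, depth, std::uint32_t{1}});
    std::uint32_t levels = 1;
    // Each level halves the largest dimension (rounding down) until it reaches 1.
    while (largest > 1) {
        largest >>= 1;
        ++levels;
    }
    return levels;
}
```

For a 256 x 256 2D image this gives 9 levels (256, 128, ..., 1), and for a non-power-of-two 5 x 3 image it gives 3 levels (5, 2, 1).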
2.5.Importing Synchronization Objects
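A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT can be imported into CUDA using the file descriptor associated with that object, as shown below. Note that CUDA assumes ownership of the file descriptor once it is imported. Using the file descriptor after a successful import results in undefined behavior.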
cudaExternalSemaphore_t importVulkanSemaphoreObjectFromFileDescriptor(int fd) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalSemaphoreHandleTypeOpaqueFd;
    desc.handle.fd = fd;
    cudaImportExternalSemaphore(&extSem, &desc);
    // Input parameter 'fd' should not be used beyond this point as CUDA has assumed ownership of it
    return extSem;
}
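A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT can be imported into CUDA using the NT handle associated with that object, as shown below. Note that CUDA does not assume ownership of the NT handle; the application must close the handle explicitly when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed.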
cudaExternalSemaphore_t importVulkanSemaphoreObjectFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32;
    desc.handle.win32.handle = handle;
    cudaImportExternalSemaphore(&extSem, &desc);
    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);
    return extSem;
}
A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT can also be imported using a named handle, if one exists, as shown below.
cudaExternalSemaphore_t importVulkanSemaphoreObjectFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32;
    desc.handle.win32.name = (void *)name;
    cudaImportExternalSemaphore(&extSem, &desc);
    return extSem;
}
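A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT can be imported into CUDA using the globally shared D3DKMT handle associated with that object, as shown below. Because a globally shared D3DKMT handle does not hold a reference to the underlying semaphore, it is automatically destroyed when all other references to the resource are destroyed.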
cudaExternalSemaphore_t importVulkanSemaphoreObjectFromKMTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32Kmt;
    desc.handle.win32.handle = (void *)handle;
    cudaImportExternalSemaphore(&extSem, &desc);
    return extSem;
}
2.6.Signaling/Waiting on Imported Synchronization Objects
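An imported Vulkan semaphore object can be signaled as shown below. Signaling such a semaphore object sets it to the signaled state. The corresponding wait on this signal must be issued in Vulkan, and it must be issued after the signal has been issued.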
void signalExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};
    memset(&params, 0, sizeof(params));
    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
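An imported Vulkan semaphore object can be waited on as shown below. Waiting on such a semaphore object waits until it reaches the signaled state and then resets it back to the unsignaled state. The corresponding signal that this wait depends on must be issued in Vulkan, and it must be issued before the wait is issued.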
void waitExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};
    memset(&params, 0, sizeof(params));
    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
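The signal and wait rules described in this section can be pictured as a two-state machine. The following is a host-side toy model only; it does not call CUDA or Vulkan, and BinarySemaphoreModel is a hypothetical class introduced here for illustration. It encodes two rules from the text: a signal moves the semaphore to the signaled state, and a wait is only valid once its signal has been issued, after which the semaphore returns to the unsignaled state.

```cpp
#include <cassert>

// Hypothetical host-side model of a binary semaphore's state.
// It mirrors the ordering rules in the text and performs no GPU work.
class BinarySemaphoreModel {
public:
    // A signal operation moves the semaphore to the signaled state.
    void signal() { signaled_ = true; }

    // A wait may only be issued after its corresponding signal.
    // Returns false when that ordering rule is violated; otherwise it
    // consumes the signal by resetting the state to unsignaled.
    bool wait() {
        if (!signaled_) {
            return false;
        }
        signaled_ = false;
        return true;
    }

    bool isSignaled() const { return signaled_; }

private:
    bool signaled_ = false;
};
```

In the real API the two halves of each pair live in different APIs: a signal issued from CUDA is consumed by a wait issued in Vulkan, and vice versa. The model only captures the ordering and the reset-on-wait behavior.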
3.OpenGL Interoperability
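Traditional OpenGL-CUDA interop (see: OpenGL Interoperability) works by CUDA directly consuming handles created in OpenGL. However, because OpenGL can also consume memory and synchronization objects created in Vulkan, there is an alternative approach to OpenGL-CUDA interop. Essentially, memory and synchronization objects exported by Vulkan can be imported into both OpenGL and CUDA and then used to coordinate memory accesses between the two. For details on importing memory and synchronization objects exported by Vulkan, refer to the following OpenGL extensions: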
GL_EXT_memory_object
GL_EXT_memory_object_fd
GL_EXT_memory_object_win32
GL_EXT_semaphore
GL_EXT_semaphore_fd
GL_EXT_semaphore_win32
4.Direct3D 12 Interoperability
4.1.Matching Device LUIDs
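When importing memory and synchronization objects exported by Direct3D 12, they must be imported and mapped on the same device on which they were created. The CUDA device that corresponds to a Direct3D 12 device can be determined by comparing the LUID of each CUDA device with that of the Direct3D 12 device, as shown in the following code sample. Note that the Direct3D 12 device must not be created on a linked node adapter; that is, the node count returned by ID3D12Device::GetNodeCount must be 1.
A brief note on UUIDs and LUIDs:
UUID stands for Universally Unique IDentifier. It is a 128-bit value, and each physical device has its own UUID that is unique worldwide.
LUID stands for Locally Unique IDentifier. It is a 64-bit value that is unique only within a single local system, and it is specific to Windows. Being half the size of a UUID, a LUID is cheaper to store and compare.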
int getCudaDeviceForD3D12Device(ID3D12Device *d3d12Device) {
    LUID d3d12Luid = d3d12Device->GetAdapterLuid();
    int cudaDeviceCount;
    cudaGetDeviceCount(&cudaDeviceCount);
    for (int cudaDevice = 0; cudaDevice < cudaDeviceCount; cudaDevice++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, cudaDevice);
        // deviceProp.luid is 8 bytes: LowPart followed by HighPart
        char *cudaLuid = deviceProp.luid;
        if (!memcmp(&d3d12Luid.LowPart, cudaLuid, sizeof(d3d12Luid.LowPart)) &&
            !memcmp(&d3d12Luid.HighPart, cudaLuid + sizeof(d3d12Luid.LowPart), sizeof(d3d12Luid.HighPart))) {
            return cudaDevice;
        }
    }
    return cudaInvalidDeviceId;
}
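The two memcmp calls above compare the 64-bit LUID in two 4-byte halves. The sketch below isolates that comparison so it can be built without the Windows or CUDA headers. LuidStandIn is a stand-in for the Windows LUID struct (a 32-bit unsigned LowPart followed by a 32-bit signed HighPart), the 8-byte array stands in for cudaDeviceProp::luid, and luidMatches is a hypothetical helper name; all three are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for the Windows LUID struct so the sketch builds without <windows.h>.
struct LuidStandIn {
    std::uint32_t LowPart;
    std::int32_t HighPart;
};

// Hypothetical helper: compares a LUID against 8 raw bytes laid out as
// LowPart (4 bytes) followed by HighPart (4 bytes).
bool luidMatches(const LuidStandIn &luid, const char (&cudaLuid)[8]) {
    return std::memcmp(&luid.LowPart, cudaLuid, sizeof(luid.LowPart)) == 0 &&
           std::memcmp(&luid.HighPart, cudaLuid + sizeof(luid.LowPart), sizeof(luid.HighPart)) == 0;
}
```

Comparing the halves separately, rather than comparing the whole struct in one call, keeps the check independent of how the struct is laid out in memory.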
4.2.Importing Memory Objects
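A shareable Direct3D 12 heap memory object, created by setting the flag D3D12_HEAP_FLAG_SHARED in the call to ID3D12Device::CreateHeap, can be imported into CUDA using the NT handle associated with that object, as shown below. Note that the application is responsible for closing the NT handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed.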
cudaExternalMemory_t importD3D12HeapFromNTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeD3D12Heap;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    cudaImportExternalMemory(&extMem, &desc);
    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);
    return extMem;
}
A shareable Direct3D 12 heap memory object can also be imported using a named handle, if one exists, as shown below.
cudaExternalMemory_t importD3D12HeapFromNamedNTHandle(LPCWSTR name, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeD3D12Heap;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    cudaImportExternalMemory(&extMem, &desc);
    return extMem;
}
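A shareable Direct3D 12 committed resource, created by setting the flag D3D12_HEAP_FLAG_SHARED in the call to ID3D12Device::CreateCommittedResource, can be imported into CUDA using the NT handle associated with that object, as shown below. When importing a Direct3D 12 committed resource, the flag cudaExternalMemoryDedicated must be set. Note that the application is responsible for closing the NT handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed.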
cudaExternalMemory_t importD3D12CommittedResourceFromNTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;
    cudaImportExternalMemory(&extMem, &desc);
    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);
    return extMem;
}
A shareable Direct3D 12 committed resource can also be imported using a named handle, if one exists, as shown below.
cudaExternalMemory_t importD3D12CommittedResourceFromNamedNTHandle(LPCWSTR name, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;
    cudaImportExternalMemory(&extMem, &desc);
    return extMem;
}
4.3.Mapping Buffers onto Imported Memory Objects
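A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match those specified when the mapping was created with the corresponding Direct3D 12 API. All mapped device pointers must be freed with cudaFree().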
void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.offset = offset;
    desc.size = size;
    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);
    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}
4.4.Mapping Mipmapped Arrays onto Imported Memory Objects
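A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format, and number of mip levels must match those specified when the mapping was created with the corresponding Direct3D 12 API. In addition, if the mipmapped array can be bound as a render target in Direct3D 12, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed with cudaFreeMipmappedArray(). The following code sample shows how to convert Direct3D 12 parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.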
cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;
    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);
    return mipmap;
}
cudaChannelFormatDesc getCudaChannelFormatDescForDxgiFormat(DXGI_FORMAT dxgiFormat)
{
    cudaChannelFormatDesc d;
    memset(&d, 0, sizeof(d));
    // x, y, z, w hold the bit width of each channel; f holds the channel data kind
    switch (dxgiFormat) {
        case DXGI_FORMAT_R8_UINT: d.x = 8; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case DXGI_FORMAT_R8_SINT: d.x = 8; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case DXGI_FORMAT_R8G8_UINT: d.x = 8; d.y = 8; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case DXGI_FORMAT_R8G8_SINT: d.x = 8; d.y = 8; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case DXGI_FORMAT_R8G8B8A8_UINT: d.x = 8; d.y = 8; d.z = 8; d.w = 8; d.f = cudaChannelFormatKindUnsigned; break;
        case DXGI_FORMAT_R8G8B8A8_SINT: d.x = 8; d.y = 8; d.z = 8; d.w = 8; d.f = cudaChannelFormatKindSigned; break;
        case DXGI_FORMAT_R16_UINT: d.x = 16; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case DXGI_FORMAT_R16_SINT: d.x = 16; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case DXGI_FORMAT_R16G16_UINT: d.x = 16; d.y = 16; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case DXGI_FORMAT_R16G16_SINT: d.x = 16; d.y = 16; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case DXGI_FORMAT_R16G16B16A16_UINT: d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindUnsigned; break;
        case DXGI_FORMAT_R16G16B16A16_SINT: d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindSigned; break;
        case DXGI_FORMAT_R32_UINT: d.x = 32; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case DXGI_FORMAT_R32_SINT: d.x = 32; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case DXGI_FORMAT_R32_FLOAT: d.x = 32; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindFloat; break;
        case DXGI_FORMAT_R32G32_UINT: d.x = 32; d.y = 32; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
        case DXGI_FORMAT_R32G32_SINT: d.x = 32; d.y = 32; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
        case DXGI_FORMAT_R32G32_FLOAT: d.x = 32; d.y = 32; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindFloat; break;
        case DXGI_FORMAT_R32G32B32A32_UINT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindUnsigned; break;
        case DXGI_FORMAT_R32G32B32A32_SINT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindSigned; break;
        case DXGI_FORMAT_R32G32B32A32_FLOAT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindFloat; break;
        default: assert(0);
    }
    return d;
}
cudaExtent getCudaExtentForD3D12Extent(UINT64 width, UINT height, UINT16 depthOrArraySize, D3D12_SRV_DIMENSION d3d12SRVDimension) {
    cudaExtent e = { 0, 0, 0 };
    // For layered and cubemap views, the CUDA depth carries the array size
    switch (d3d12SRVDimension) {
        case D3D12_SRV_DIMENSION_TEXTURE1D: e.width = width; e.height = 0; e.depth = 0; break;
        case D3D12_SRV_DIMENSION_TEXTURE2D: e.width = width; e.height = height; e.depth = 0; break;
        case D3D12_SRV_DIMENSION_TEXTURE3D: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
        case D3D12_SRV_DIMENSION_TEXTURECUBE: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
        case D3D12_SRV_DIMENSION_TEXTURE1DARRAY: e.width = width; e.height = 0; e.depth = depthOrArraySize; break;
        case D3D12_SRV_DIMENSION_TEXTURE2DARRAY: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
        case D3D12_SRV_DIMENSION_TEXTURECUBEARRAY: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
        default: assert(0);
    }
    return e;
}
unsigned int getCudaMipmappedArrayFlagsForD3D12Resource(D3D12_SRV_DIMENSION d3d12SRVDimension, D3D12_RESOURCE_FLAGS d3d12ResourceFlags, bool allowSurfaceLoadStore) {
    unsigned int flags = 0;
    switch (d3d12SRVDimension) {
        case D3D12_SRV_DIMENSION_TEXTURECUBE: flags |= cudaArrayCubemap; break;
        case D3D12_SRV_DIMENSION_TEXTURECUBEARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
        case D3D12_SRV_DIMENSION_TEXTURE1DARRAY: flags |= cudaArrayLayered; break;
        case D3D12_SRV_DIMENSION_TEXTURE2DARRAY: flags |= cudaArrayLayered; break;
        default: break;
    }
    if (d3d12ResourceFlags & D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET) {
        flags |= cudaArrayColorAttachment;
    }
    if (allowSurfaceLoadStore) {
        flags |= cudaArraySurfaceLoadStore;
    }
    return flags;
}
4.5.Importing Synchronization Objects
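A shareable Direct3D 12 fence object, created by setting the flag D3D12_FENCE_FLAG_SHARED in the call to ID3D12Device::CreateFence, can be imported into CUDA using the NT handle associated with that object, as shown below. Note that the application is responsible for closing the NT handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed.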
cudaExternalSemaphore_t importD3D12FenceFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalSemaphoreHandleTypeD3D12Fence;
    desc.handle.win32.handle = handle;
    cudaImportExternalSemaphore(&extSem, &desc);
    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);
    return extSem;
}
A shareable Direct3D 12 fence object can also be imported using a named handle, if one exists, as shown below.
cudaExternalSemaphore_t importD3D12FenceFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalSemaphoreHandleTypeD3D12Fence;
    desc.handle.win32.name = (void *)name;
    cudaImportExternalSemaphore(&extSem, &desc);
    return extSem;
}
4.6.Signaling/Waiting on Imported Synchronization Objects
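An imported Direct3D 12 fence object can be signaled as shown below. Signaling such a fence object sets its value to the one specified. The corresponding wait on this signal must be issued in Direct3D 12, and it must be issued after the signal has been issued.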
void signalExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};
    memset(&params, 0, sizeof(params));
    params.params.fence.value = value;
    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
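An imported Direct3D 12 fence object can be waited on as shown below. Waiting on such a fence object waits until the value of the fence becomes greater than or equal to the specified value. The corresponding signal that this wait depends on must be issued in Direct3D 12, and it must be issued before the wait is issued.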
void waitExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};
    memset(&params, 0, sizeof(params));
    params.params.fence.value = value;
    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
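Unlike the binary Vulkan semaphore, a fence carries a 64-bit value, and a wait is satisfied once that value is greater than or equal to the awaited one. The following host-side toy model illustrates only that comparison rule; it does not call CUDA or Direct3D 12. FenceModel is a hypothetical class, and the initial value of 0 is an assumption (a real fence starts at whatever initial value the application passed when creating it).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of a fence's 64-bit value.
// It mirrors the comparison rule in the text and performs no GPU work.
class FenceModel {
public:
    // Signaling sets the fence to the specified value.
    void signal(std::uint64_t value) { value_ = value; }

    // A wait is satisfied once the fence value is >= the awaited value.
    bool isWaitSatisfied(std::uint64_t awaited) const { return value_ >= awaited; }

private:
    std::uint64_t value_ = 0;  // assumed initial value for this sketch
};
```

Because the comparison is "greater than or equal", a single signal to a larger value satisfies every wait on a smaller value, which is why applications commonly use a monotonically increasing counter (for example, a frame number) as the fence value.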
5.Direct3D 11 Interoperability
5.1.Matching Device LUIDs
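When importing memory and synchronization objects exported by Direct3D 11, they must be imported and mapped on the same device on which they were created. The CUDA device that corresponds to a Direct3D 11 device can be determined by comparing the LUID of each CUDA device with that of the Direct3D 11 device, as shown in the following code sample.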
int getCudaDeviceForD3D11Device(ID3D11Device *d3d11Device) {
    // Walk from the D3D11 device to its DXGI adapter to read the adapter LUID
    IDXGIDevice *dxgiDevice;
    d3d11Device->QueryInterface(__uuidof(IDXGIDevice), (void **)&dxgiDevice);
    IDXGIAdapter *dxgiAdapter;
    dxgiDevice->GetAdapter(&dxgiAdapter);
    DXGI_ADAPTER_DESC dxgiAdapterDesc;
    dxgiAdapter->GetDesc(&dxgiAdapterDesc);
    LUID d3d11Luid = dxgiAdapterDesc.AdapterLuid;
    // The LUID has been copied out, so release the COM references acquired above
    dxgiAdapter->Release();
    dxgiDevice->Release();
    int cudaDeviceCount;
    cudaGetDeviceCount(&cudaDeviceCount);
    for (int cudaDevice = 0; cudaDevice < cudaDeviceCount; cudaDevice++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, cudaDevice);
        char *cudaLuid = deviceProp.luid;
        if (!memcmp(&d3d11Luid.LowPart, cudaLuid, sizeof(d3d11Luid.LowPart)) &&
            !memcmp(&d3d11Luid.HighPart, cudaLuid + sizeof(d3d11Luid.LowPart), sizeof(d3d11Luid.HighPart))) {
            return cudaDevice;
        }
    }
    return cudaInvalidDeviceId;
}
5.2.Importing Memory Objects
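A shareable Direct3D 11 texture resource, namely ID3D11Texture1D, ID3D11Texture2D, or ID3D11Texture3D, can be created by setting either D3D11_RESOURCE_MISC_SHARED or D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX (on Windows 7), or D3D11_RESOURCE_MISC_SHARED_NTHANDLE (on Windows 10), in the call to ID3D11Device::CreateTexture1D, ID3D11Device::CreateTexture2D, or ID3D11Device::CreateTexture3D, respectively.
A shareable Direct3D 11 buffer resource, ID3D11Buffer, can be created by specifying any of the above flags in the call to ID3D11Device::CreateBuffer. A shareable resource created with D3D11_RESOURCE_MISC_SHARED_NTHANDLE can be imported into CUDA using the NT handle associated with that object, as shown below. Note that the application is responsible for closing the NT handle when it is no longer needed. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed. When importing a Direct3D 11 resource, the flag cudaExternalMemoryDedicated must be set.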
cudaExternalMemory_t importD3D11ResourceFromNTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeD3D11Resource;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;
    cudaImportExternalMemory(&extMem, &desc);
    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);
    return extMem;
}
A shareable Direct3D 11 resource can also be imported using a named handle, if one exists, as shown below.
cudaExternalMemory_t importD3D11ResourceFromNamedNTHandle(LPCWSTR name, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeD3D11Resource;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;
    cudaImportExternalMemory(&extMem, &desc);
    return extMem;
}
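A shareable Direct3D 11 resource created with D3D11_RESOURCE_MISC_SHARED or D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX can be imported into CUDA using the globally shared D3DKMT handle associated with that object, as shown below. Because a globally shared D3DKMT handle does not hold a reference to the underlying memory, it is automatically destroyed when all other references to the resource are destroyed.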
cudaExternalMemory_t importD3D11ResourceFromKMTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalMemoryHandleTypeD3D11ResourceKmt;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;
    cudaImportExternalMemory(&extMem, &desc);
    return extMem;
}
5.3.Mapping Buffers onto Imported Memory Objects
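A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match those specified when the mapping was created with the corresponding Direct3D 11 API. All mapped device pointers must be freed with cudaFree().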
void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.offset = offset;
    desc.size = size;
    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);
    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}
5.4.Mapping Mipmapped Arrays onto Imported Memory Objects
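A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format, and number of mip levels must match those specified when the mapping was created with the corresponding Direct3D 11 API. In addition, if the mipmapped array can be bound as a render target in Direct3D 11, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed with cudaFreeMipmappedArray(). The following code sample shows how to convert Direct3D 11 parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.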
cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;
    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);
    return mipmap;
}
cudaChannelFormatDesc getCudaChannelFormatDescForDxgiFormat(DXGI_FORMAT dxgiFormat)
{
cudaChannelFormatDesc d;
memset(&d, 0, sizeof(d));
switch (dxgiFormat) {
case DXGI_FORMAT_R8_UINT: d.x = 8; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
case DXGI_FORMAT_R8_SINT: d.x = 8; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
case DXGI_FORMAT_R8G8_UINT: d.x = 8; d.y = 8; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
case DXGI_FORMAT_R8G8_SINT: d.x = 8; d.y = 8; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
case DXGI_FORMAT_R8G8B8A8_UINT: d.x = 8; d.y = 8; d.z = 8; d.w = 8; d.f = cudaChannelFormatKindUnsigned; break;
case DXGI_FORMAT_R8G8B8A8_SINT: d.x = 8; d.y = 8; d.z = 8; d.w = 8; d.f = cudaChannelFormatKindSigned; break;
case DXGI_FORMAT_R16_UINT: d.x = 16; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
case DXGI_FORMAT_R16_SINT: d.x = 16; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
case DXGI_FORMAT_R16G16_UINT: d.x = 16; d.y = 16; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
case DXGI_FORMAT_R16G16_SINT: d.x = 16; d.y = 16; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
case DXGI_FORMAT_R16G16B16A16_UINT: d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindUnsigned; break;
case DXGI_FORMAT_R16G16B16A16_SINT: d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindSigned; break;
case DXGI_FORMAT_R32_UINT: d.x = 32; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
case DXGI_FORMAT_R32_SINT: d.x = 32; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
case DXGI_FORMAT_R32_FLOAT: d.x = 32; d.y = 0; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindFloat; break;
case DXGI_FORMAT_R32G32_UINT: d.x = 32; d.y = 32; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindUnsigned; break;
case DXGI_FORMAT_R32G32_SINT: d.x = 32; d.y = 32; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindSigned; break;
case DXGI_FORMAT_R32G32_FLOAT: d.x = 32; d.y = 32; d.z = 0; d.w = 0; d.f = cudaChannelFormatKindFloat; break;
case DXGI_FORMAT_R32G32B32A32_UINT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindUnsigned; break;
case DXGI_FORMAT_R32G32B32A32_SINT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindSigned; break;
case DXGI_FORMAT_R32G32B32A32_FLOAT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindFloat; break;
default: assert(0);
}
return d;
}
cudaExtent getCudaExtentForD3D11Extent(UINT64 width, UINT height, UINT16 depthOrArraySize, D3D11_SRV_DIMENSION d3d11SRVDimension) {
cudaExtent e = { 0, 0, 0 };
switch (d3d11SRVDimension) {
case D3D11_SRV_DIMENSION_TEXTURE1D: e.width = width; e.height = 0; e.depth = 0; break;
case D3D11_SRV_DIMENSION_TEXTURE2D: e.width = width; e.height = height; e.depth = 0; break;
case D3D11_SRV_DIMENSION_TEXTURE3D: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
case D3D11_SRV_DIMENSION_TEXTURECUBE: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
case D3D11_SRV_DIMENSION_TEXTURE1DARRAY: e.width = width; e.height = 0; e.depth = depthOrArraySize; break;
case D3D11_SRV_DIMENSION_TEXTURE2DARRAY: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
case D3D11_SRV_DIMENSION_TEXTURECUBEARRAY: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
default: assert(0);
}
return e;
}
unsigned int getCudaMipmappedArrayFlagsForD3D11Resource(D3D11_SRV_DIMENSION d3d11SRVDimension, D3D11_BIND_FLAG d3d11BindFlags, bool allowSurfaceLoadStore) {
unsigned int flags = 0;
switch (d3d11SRVDimension) {
case D3D11_SRV_DIMENSION_TEXTURECUBE: flags |= cudaArrayCubemap; break;
case D3D11_SRV_DIMENSION_TEXTURECUBEARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
case D3D11_SRV_DIMENSION_TEXTURE1DARRAY: flags |= cudaArrayLayered; break;
case D3D11_SRV_DIMENSION_TEXTURE2DARRAY: flags |= cudaArrayLayered; break;
default: break;
}
if (d3d11BindFlags & D3D11_BIND_RENDER_TARGET) {
flags |= cudaArrayColorAttachment;
}
if (allowSurfaceLoadStore) {
flags |= cudaArraySurfaceLoadStore;
}
return flags;
}
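The numLevels argument passed to mapMipmappedArrayOntoExternalMemory() has to equal the mip count of the Direct3D 11 texture. As a rough sketch, assuming the texture was created with a full mip chain (an assumption, since the text above does not say how the texture was created), the count can be derived from the largest dimension. fullMipChainLevels is a hypothetical helper, not part of CUDA or Direct3D.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not a CUDA or Direct3D API): number of levels in a
// full mip chain, where each level halves the largest dimension (rounding
// down) until it reaches 1. Pass depth = 1 for 1D/2D textures and texture
// arrays, because array slices are not reduced between mip levels.
unsigned int fullMipChainLevels(uint64_t width, uint64_t height, uint64_t depth) {
    uint64_t largest = std::max({width, height, depth, uint64_t{1}});
    unsigned int levels = 1;
    while (largest > 1) {
        largest >>= 1;
        ++levels;
    }
    return levels;
}
```

For a 256 x 256 2D texture this gives 9 levels (256, 128, ..., 1). If the texture was created with an explicit mip count instead, that count should be passed through unchanged.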
5.5.Importing Synchronization Objects
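A shareable Direct3D 11 fence object, created by setting the flag D3D11_FENCE_FLAG_SHARED in the call to ID3D11Device5::CreateFence, can be imported into CUDA using the NT handle associated with that object, as shown below. Note that it is the application's responsibility to close the NT handle when it is no longer required. The NT handle holds a reference to the resource, so it must be explicitly closed before the underlying semaphore can be freed.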
cudaExternalSemaphore_t importD3D11FenceFromNTHandle(HANDLE handle) {
cudaExternalSemaphore_t extSem = NULL;
cudaExternalSemaphoreHandleDesc desc = {};
memset(&desc, 0, sizeof(desc));
desc.type = cudaExternalSemaphoreHandleTypeD3D11Fence;
desc.handle.win32.handle = handle;
cudaImportExternalSemaphore(&extSem, &desc);
// Input parameter 'handle' should be closed if it's not needed anymore
CloseHandle(handle);
return extSem;
}
A shareable Direct3D 11 fence object can also be imported using a named handle, if one exists, as shown below.
cudaExternalSemaphore_t importD3D11FenceFromNamedNTHandle(LPCWSTR name) {
cudaExternalSemaphore_t extSem = NULL;
cudaExternalSemaphoreHandleDesc desc = {};
memset(&desc, 0, sizeof(desc));
desc.type = cudaExternalSemaphoreHandleTypeD3D11Fence;
desc.handle.win32.name = (void *)name;
cudaImportExternalSemaphore(&extSem, &desc);
return extSem;
}
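A Direct3D 11 keyed mutex object associated with a shareable Direct3D 11 resource, namely IDXGIKeyedMutex, is created by setting the flag D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX when the resource is created. It can be imported into CUDA using the NT handle associated with that object, as shown below. Note that it is the application's responsibility to close the NT handle when it is no longer required. The NT handle holds a reference to the resource, so it must be explicitly closed before the underlying semaphore can be freed.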
cudaExternalSemaphore_t importD3D11KeyedMutexFromNTHandle(HANDLE handle) {
cudaExternalSemaphore_t extSem = NULL;
cudaExternalSemaphoreHandleDesc desc = {};
memset(&desc, 0, sizeof(desc));
desc.type = cudaExternalSemaphoreHandleTypeKeyedMutex;
desc.handle.win32.handle = handle;
cudaImportExternalSemaphore(&extSem, &desc);
// Input parameter 'handle' should be closed if it's not needed anymore
CloseHandle(handle);
return extSem;
}
A shareable Direct3D 11 keyed mutex object can also be imported using a named handle, if one exists, as shown below.
cudaExternalSemaphore_t importD3D11KeyedMutexFromNamedNTHandle(LPCWSTR name) {
cudaExternalSemaphore_t extSem = NULL;
cudaExternalSemaphoreHandleDesc desc = {};
memset(&desc, 0, sizeof(desc));
desc.type = cudaExternalSemaphoreHandleTypeKeyedMutex;
desc.handle.win32.name = (void *)name;
cudaImportExternalSemaphore(&extSem, &desc);
return extSem;
}
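A shareable Direct3D 11 keyed mutex object can be imported into CUDA using the globally shared D3DKMT handle associated with that object, as shown below. Because a globally shared D3DKMT handle does not hold a reference to the underlying memory, it is destroyed automatically when all other references to the resource are destroyed.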
cudaExternalSemaphore_t importD3D11KeyedMutexFromKMTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};
    memset(&desc, 0, sizeof(desc));
    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutexKmt;
    desc.handle.win32.handle = handle;
    cudaImportExternalSemaphore(&extSem, &desc);
    // A D3DKMT handle holds no reference to the resource and is not an NT handle,
    // so it is not closed with CloseHandle() here
    return extSem;
}
5.6.Signaling/Waiting on Imported Synchronization Objects
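An imported Direct3D 11 fence object can be signaled as shown below. Signaling such a fence object sets its value to the one specified. The corresponding wait that depends on this signal must be issued in Direct3D 11. Additionally, that wait must be issued after this signal has been issued.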
void signalExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
cudaExternalSemaphoreSignalParams params = {};
memset(&params, 0, sizeof(params));
params.params.fence.value = value;
cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
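An imported Direct3D 11 fence object can be waited on as shown below. Waiting on such a fence object blocks until its value is greater than or equal to the specified value. The corresponding signal that this wait depends on must be issued in Direct3D 11. Additionally, that signal must be issued before this wait is issued.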
void waitExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
cudaExternalSemaphoreWaitParams params = {};
memset(&params, 0, sizeof(params));
params.params.fence.value = value;
cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
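Because a fence wait passes once the fence value is greater than or equal to the requested value, applications typically use values that only grow. The sketch below shows one possible per-frame numbering scheme for a single fence shared between Direct3D 11 and CUDA. The scheme itself and the names FrameFenceValues, fenceValuesForFrame, and waitWouldPass are assumptions made for illustration; neither CUDA nor Direct3D 11 prescribes them.

```cpp
#include <cstdint>

// Hypothetical convention (an assumption, not mandated by CUDA or
// Direct3D 11): one shared fence, two values per frame.
struct FrameFenceValues {
    uint64_t d3dDone;   // Direct3D 11 signals this value; CUDA waits on it
    uint64_t cudaDone;  // CUDA signals this value; Direct3D 11 waits on it
};

// Values grow strictly with the frame index, so no value is ever reused.
FrameFenceValues fenceValuesForFrame(uint64_t frameIndex) {
    return { 2 * frameIndex + 1, 2 * frameIndex + 2 };
}

// The rule from the text: a wait for waitValue passes once the current
// fence value is greater than or equal to it.
bool waitWouldPass(uint64_t currentFenceValue, uint64_t waitValue) {
    return currentFenceValue >= waitValue;
}
```

Under this scheme, the CUDA side of frame n would call waitExternalSemaphore(extSem, v.d3dDone, stream), enqueue its kernels, and then call signalExternalSemaphore(extSem, v.cudaDone, stream), where v = fenceValuesForFrame(n).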
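An imported Direct3D 11 keyed mutex object can be signaled as shown below. Signaling such a keyed mutex object with a key value releases the mutex for that key. The corresponding wait that depends on this signal must be issued in Direct3D 11 with the same key value. Additionally, the Direct3D 11 wait must be issued after this signal has been issued.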
void signalExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long key, cudaStream_t stream) {
cudaExternalSemaphoreSignalParams params = {};
memset(&params, 0, sizeof(params));
params.params.keyedmutex.key = key;
cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
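An imported Direct3D 11 keyed mutex object can be waited on as shown below. A timeout value in milliseconds is required when waiting on such a keyed mutex. The wait blocks until the keyed mutex value equals the specified key value or until the timeout elapses. The timeout can also be infinite, in which case the wait never times out; the Windows INFINITE macro must be used to specify it. The corresponding signal that this wait depends on must be issued in Direct3D 11. Additionally, the Direct3D 11 signal must be issued before this wait is issued.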
void waitExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long key, unsigned int timeoutMs, cudaStream_t stream) {
cudaExternalSemaphoreWaitParams params = {};
memset(&params, 0, sizeof(params));
params.params.keyedmutex.key = key;
params.params.keyedmutex.timeoutMs = timeoutMs;
cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
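The key handoff described above can be pictured with a small CPU-side state model. This is only a toy sketch of the ownership rule (a wait succeeds when the mutex is free and was last released with the same key); it is not the DXGI or CUDA implementation, it never blocks (so it corresponds to a wait with a timeout of 0), and ModelKeyedMutex is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical single-threaded model of keyed-mutex ownership. It tracks
// state only and never blocks, which corresponds to a wait with timeout 0.
class ModelKeyedMutex {
public:
    explicit ModelKeyedMutex(uint64_t initialKey) : key_(initialKey) {}

    // "Wait": succeeds only if the mutex is free and was last released
    // with the same key.
    bool tryAcquire(uint64_t key) {
        if (held_ || key_ != key) {
            return false;
        }
        held_ = true;
        return true;
    }

    // "Signal": releases the mutex so that the next acquire must use key.
    void release(uint64_t key) {
        held_ = false;
        key_ = key;
    }

private:
    uint64_t key_;
    bool held_ = false;
};
```

With two keys chosen by the application, for example 0 for Direct3D 11 and 1 for CUDA (an arbitrary convention), each side acquires with its own key and releases with the other side's key, so ownership alternates between the two APIs.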
6.NVIDIA Software Communication Interface Interoperability (NVSCI)
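NvSciBuf and NvSciSync are interfaces developed for the following purposes:
- NvSciBuf: allows applications to allocate and exchange buffers in memory. It is used to share memory buffers between components or processes while keeping the memory safe and consistent.
- NvSciSync: allows applications to manage synchronization objects at operation boundaries. It provides synchronization across devices or processing units so that operations execute in the intended order.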
6.1.Importing Memory Objects
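To allocate an NvSciBuf object compatible with a given CUDA device, the corresponding GPU ID must be set in the NvSciBuf attribute list with NvSciBufGeneralAttrKey_GpuId. The application can optionally specify the following attributes:
- NvSciBufGeneralAttrKey_NeedCpuAccess: specifies whether CPU access to the buffer is required.
- NvSciBufRawBufferAttrKey_Align: specifies the alignment requirement of NvSciBufType_RawBuffer.
- NvSciBufGeneralAttrKey_RequiredPerm: different access permissions can be configured for each NvSciBuf memory object instance. For example, to give the GPU read-only access to the buffer, create a duplicate NvSciBuf object by calling NvSciBufObjDupWithReducePerm() with NvSciBufAccessPerm_Readonly as the input parameter, then import the duplicate into CUDA with the reduced permission.
- NvSciBufGeneralAttrKey_EnableGpuCache: controls GPU L2 cacheability.
- NvSciBufGeneralAttrKey_EnableGpuCompression: specifies whether GPU compression is enabled.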
NvSciBufObj createNvSciBufObject() {
    // Assumed to be declared and initialized elsewhere: dev (CUdevice), SIZE,
    // NvSciBufModule, attrListBuffer, attrListReconciledBuffer,
    // attrListConflictBuffer, bufferObjRaw
    // Raw Buffer Attributes for CUDA
    NvSciBufType bufType = NvSciBufType_RawBuffer;
    uint64_t rawsize = SIZE;
    uint64_t align = 0;
    bool cpuaccess_flag = true;
    NvSciBufAttrValAccessPerm perm = NvSciBufAccessPerm_ReadWrite;
    NvSciRmGpuId gpuid[1] = {};
    CUuuid uuid;
    cuDeviceGetUuid(&uuid, dev);
    memcpy(&gpuid[0].bytes, &uuid.bytes, sizeof(uuid.bytes));
    // Disable cache on dev
    NvSciBufAttrValGpuCache gpuCache[] = {{gpuid[0], false}};
    NvSciBufAttrValGpuCompression gpuCompression[] = {{gpuid[0], NvSciBufCompressionType_GenericCompressible}};
    // Fill in values
    NvSciBufAttrKeyValuePair rawbuffattrs[] = {
        { NvSciBufGeneralAttrKey_Types, &bufType, sizeof(bufType) },
        { NvSciBufRawBufferAttrKey_Size, &rawsize, sizeof(rawsize) },
        { NvSciBufRawBufferAttrKey_Align, &align, sizeof(align) },
        { NvSciBufGeneralAttrKey_NeedCpuAccess, &cpuaccess_flag, sizeof(cpuaccess_flag) },
        { NvSciBufGeneralAttrKey_RequiredPerm, &perm, sizeof(perm) },
        { NvSciBufGeneralAttrKey_GpuId, &gpuid, sizeof(gpuid) },
        { NvSciBufGeneralAttrKey_EnableGpuCache, &gpuCache, sizeof(gpuCache) },
        { NvSciBufGeneralAttrKey_EnableGpuCompression, &gpuCompression, sizeof(gpuCompression) }
    };
    // Create the attribute list first, then set the attributes on it
    NvSciBufAttrListCreate(NvSciBufModule, &attrListBuffer);
    NvSciBufAttrListSetAttrs(attrListBuffer, rawbuffattrs,
        sizeof(rawbuffattrs)/sizeof(NvSciBufAttrKeyValuePair));
    // Reconcile And Allocate
    NvSciBufAttrListReconcile(&attrListBuffer, 1, &attrListReconciledBuffer,
        &attrListConflictBuffer);
    NvSciBufObjAlloc(attrListReconciledBuffer, &bufferObjRaw);
    return bufferObjRaw;
}
NvSciBufObj bufferObjRo; // Readonly NvSciBuf memory obj
// Create a duplicate handle to the same memory buffer with reduced permissions
NvSciBufObjDupWithReducePerm(bufferObjRaw, NvSciBufAccessPerm_Readonly, &bufferObjRo);
return bufferObjRo;
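The allocated NvSciBuf memory object can be imported into CUDA using the NvSciBufObj handle, as shown below. The application should query the allocated NvSciBufObj for the attributes required to fill the CUDA external memory descriptor. Note that the attribute list and the NvSciBuf objects should be maintained by the application. If the NvSciBuf object imported into CUDA is also mapped by other drivers, then, depending on the value of the output attribute NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency, the application must use NvSciSync objects (see Section 6.4) as appropriate barriers to maintain coherence between CUDA and the other drivers.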
cudaExternalMemory_t importNvSciBufObject(NvSciBufObj bufferObjRaw) {
    // Assumed to be declared elsewhere: gpuid (the NvSciRmGpuId array filled at allocation time)
    /*************** Query NvSciBuf Object **************/
    NvSciBufAttrList retList = NULL;
    NvSciBufObjGetAttrList(bufferObjRaw, &retList);
    NvSciBufAttrKeyValuePair bufattrs[] = {
        { NvSciBufRawBufferAttrKey_Size, NULL, 0 },
        { NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency, NULL, 0 },
        { NvSciBufGeneralAttrKey_EnableGpuCompression, NULL, 0 }
    };
    NvSciBufAttrListGetAttrs(retList, bufattrs,
        sizeof(bufattrs)/sizeof(NvSciBufAttrKeyValuePair));
    uint64_t ret_size = *(static_cast<const uint64_t*>(bufattrs[0].value));
    // Note cache and compression are per GPU attributes, so read values for specific gpu by comparing UUID
    // Read cacheability granted by NvSciBuf
    int numGpus = bufattrs[1].len / sizeof(NvSciBufAttrValGpuCache);
    const NvSciBufAttrValGpuCache *cacheVal = (const NvSciBufAttrValGpuCache *)bufattrs[1].value;
    bool ret_cacheVal = false;
    for (int i = 0; i < numGpus; i++) {
        if (memcmp(gpuid[0].bytes, cacheVal[i].gpuId.bytes, sizeof(CUuuid)) == 0) {
            ret_cacheVal = cacheVal[i].cacheability;
        }
    }
    // Read compression granted by NvSciBuf
    numGpus = bufattrs[2].len / sizeof(NvSciBufAttrValGpuCompression);
    const NvSciBufAttrValGpuCompression *compVal = (const NvSciBufAttrValGpuCompression *)bufattrs[2].value;
    NvSciBufCompressionType ret_compVal;
    for (int i = 0; i < numGpus; i++) {
        if (memcmp(gpuid[0].bytes, compVal[i].gpuId.bytes, sizeof(CUuuid)) == 0) {
            ret_compVal = compVal[i].compressionType;
        }
    }
    /*************** NvSciBuf Registration With CUDA **************/
    // Fill up CUDA_EXTERNAL_MEMORY_HANDLE_DESC
    cudaExternalMemory_t extMemBuffer = NULL;
    cudaExternalMemoryHandleDesc memHandleDesc;
    memset(&memHandleDesc, 0, sizeof(memHandleDesc));
    memHandleDesc.type = cudaExternalMemoryHandleTypeNvSciBuf;
    // Set the NvSciBuf object with the required access permissions in this step.
    // To import with reduced permissions, pass the duplicate returned by
    // NvSciBufObjDupWithReducePerm() (bufferObjRo above) instead of bufferObjRaw.
    memHandleDesc.handle.nvSciBufObject = bufferObjRaw;
    memHandleDesc.size = ret_size;
    cudaImportExternalMemory(&extMemBuffer, &memHandleDesc);
    return extMemBuffer;
}
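The per-GPU lookup in the sample leaves its result unset when no entry matches the device UUID. The sketch below isolates that lookup and reports whether a match was found. GpuId and GpuCacheEntry are stand-in types defined here so the example is self-contained (assumptions that only mirror the fields used above: a 16-byte ID plus a cacheability flag); real code would use NvSciRmGpuId and NvSciBufAttrValGpuCache, and findCacheability is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in types (assumptions); real code would use the NvSciBuf types.
struct GpuId { uint8_t bytes[16]; };
struct GpuCacheEntry { GpuId gpuId; bool cacheability; };

// Hypothetical helper: scans a per-GPU attribute array of lenBytes bytes for
// the entry whose ID matches target. Returns true and writes *out on a match;
// returns false and leaves *out untouched otherwise.
bool findCacheability(const GpuCacheEntry *entries, size_t lenBytes,
                      const GpuId &target, bool *out) {
    const size_t count = lenBytes / sizeof(GpuCacheEntry);
    for (size_t i = 0; i < count; ++i) {
        if (std::memcmp(entries[i].gpuId.bytes, target.bytes,
                        sizeof(target.bytes)) == 0) {
            *out = entries[i].cacheability;
            return true;
        }
    }
    return false;
}
```

Returning a found/not-found status lets the caller treat a missing entry as an error instead of reading an uninitialized value.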
6.2.Mapping Buffers onto Imported Memory Objects
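A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping can be filled in from the attributes of the allocated NvSciBufObj. All mapped device pointers must be freed using cudaFree().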
void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
void *ptr = NULL;
cudaExternalMemoryBufferDesc desc = {};
memset(&desc, 0, sizeof(desc));
desc.offset = offset;
desc.size = size;
cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);
// Note: 'ptr' must eventually be freed using cudaFree()
return ptr;
}
6.3.Mapping Mipmapped Arrays onto Imported Memory Objects
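A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, and format can be filled in from the attributes of the allocated NvSciBufObj. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert NvSciBuf attributes into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.
Note: The number of mip levels must be 1.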
cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
cudaMipmappedArray_t mipmap = NULL;
cudaExternalMemoryMipmappedArrayDesc desc = {};
memset(&desc, 0, sizeof(desc));
desc.offset = offset;
desc.formatDesc = *formatDesc;
desc.extent = *extent;
desc.flags = flags;
desc.numLevels = numLevels;
// Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);
return mipmap;
}
6.4.Importing Synchronization Objects
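NvSciSync attributes compatible with a given CUDA device can be generated using cudaDeviceGetNvSciSyncAttributes(). The returned attribute list can be used to create an NvSciSyncObj that is compatible with that CUDA device.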
NvSciSyncObj createNvSciSyncObject() {
NvSciSyncObj nvSciSyncObj;
int cudaDev0 = 0;
int cudaDev1 = 1;
NvSciSyncAttrList signalerAttrList = NULL;
NvSciSyncAttrList waiterAttrList = NULL;
NvSciSyncAttrList reconciledList = NULL;
NvSciSyncAttrList newConflictList = NULL;
NvSciSyncAttrListCreate(module, &signalerAttrList);
NvSciSyncAttrListCreate(module, &waiterAttrList);
NvSciSyncAttrList unreconciledList[2] = {NULL, NULL};
unreconciledList[0] = signalerAttrList;
unreconciledList[1] = waiterAttrList;
cudaDeviceGetNvSciSyncAttributes(signalerAttrList, cudaDev0, CUDA_NVSCISYNC_ATTR_SIGNAL);
cudaDeviceGetNvSciSyncAttributes(waiterAttrList, cudaDev1, CUDA_NVSCISYNC_ATTR_WAIT);
NvSciSyncAttrListReconcile(unreconciledList, 2, &reconciledList, &newConflictList);
NvSciSyncObjAlloc(reconciledList, &nvSciSyncObj);
return nvSciSyncObj;
}
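An NvSciSync object created as described above can be imported into CUDA using the NvSciSyncObj handle, as shown below. Note that ownership of the NvSciSyncObj handle remains with the application even after it is imported.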
cudaExternalSemaphore_t importNvSciSyncObject(void* nvSciSyncObj) {
cudaExternalSemaphore_t extSem = NULL;
cudaExternalSemaphoreHandleDesc desc = {};
memset(&desc, 0, sizeof(desc));
desc.type = cudaExternalSemaphoreHandleTypeNvSciSync;
desc.handle.nvSciSyncObj = nvSciSyncObj;
cudaImportExternalSemaphore(&extSem, &desc);
// Deleting/Freeing the nvSciSyncObj beyond this point will lead to undefined behavior in CUDA
return extSem;
}
6.5.Signaling/Waiting on Imported Synchronization Objects
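An imported NvSciSyncObj object can be signaled as outlined below. Signaling an NvSciSync-backed semaphore object initializes the fence parameter passed as input. This fence is then waited on by the wait operation that corresponds to the signal. Additionally, that wait must be issued after the signal has been issued. If the flags are set to cudaExternalSemaphoreSignalSkipNvSciBufMemSync, the memory synchronization operations (over all NvSciBuf objects imported in this process) that are performed by default as part of the signal operation are skipped. This flag should be set when NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency is FALSE.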
void signalExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream, void *fence) {
cudaExternalSemaphoreSignalParams signalParams = {};
memset(&signalParams, 0, sizeof(signalParams));
signalParams.params.nvSciSync.fence = (void*)fence;
signalParams.flags = 0; //OR cudaExternalSemaphoreSignalSkipNvSciBufMemSync
cudaSignalExternalSemaphoresAsync(&extSem, &signalParams, 1, stream);
}
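An imported NvSciSyncObj object can be waited on as outlined below. Waiting on an NvSciSync-backed semaphore object waits until the input fence parameter has been signaled by the corresponding signaler. Additionally, the signal must be issued before the wait is issued. If the flags are set to cudaExternalSemaphoreWaitSkipNvSciBufMemSync, the memory synchronization operations (over all NvSciBuf objects imported in this process) that are performed by default as part of the wait operation are skipped. This flag should be set when NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency is FALSE.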
void waitExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream, void *fence) {
cudaExternalSemaphoreWaitParams waitParams = {};
memset(&waitParams, 0, sizeof(waitParams));
waitParams.params.nvSciSync.fence = (void*)fence;
waitParams.flags = 0; //OR cudaExternalSemaphoreWaitSkipNvSciBufMemSync
cudaWaitExternalSemaphoresAsync(&extSem, &waitParams, 1, stream);
}