【CUDA编程】系列博客参考NVIDIA官方文档“CUDA C++ Programming Guide（v12.6）”。
本文为原创文章，未经本人允许，禁止转载。转载请注明出处。

1.Device Memory L2 Access Management

当CUDA kernel反复访问global内存中的某个数据区域时，这类数据访问可以被认为是持久的（persisting）。另一方面，如果数据只被访问一次，那么这种数据访问可以被认为是流式的（streaming）。

从CUDA 11.0开始，针对计算能力在8.0及以上的device，开发者可以控制哪些数据能长时间的保留在L2 cache中，避免重复访问global内存的同一块区域，提高内存访问效率。

2.L2 cache Set-Aside for Persisting Accesses

L2 cache的一部分可以预留出来，用于global内存中被持久访问的数据。持久访问可以优先使用这部分预留的L2 cache，只有当没有持久访问使用时，常规访问或流式访问才能使用这部分预留的L2 cache。

可以在一定范围内调整预留L2 cache的大小：

cudaGetDeviceProperties(&prop, device_id);
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size); /* set-aside 3/4 of L2 cache for persisting accesses or the max allowed*/

当GPU被配置为MIG（Multi-Instance GPU）模式时，L2 cache预留功能被禁用。

当使用MPS（Multi-Process Service）时，不能通过cudaDeviceSetLimit修改L2 cache的预留大小。此时，只能在MPS服务器启动时通过环境变量CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT指定预留大小。

3.L2 Policy for Persisting Accesses

在global内存的连续区域划定一个访问策略窗口（access policy window），在窗口内的数据有一定概率会被保留在预留的L2 cache中以被持久访问。L2 cache预留部分的大小是有限的，所以不能保证窗口中的所有数据都保留在预留的L2 cache中，因此会设置一个保留比例。

下面是在CUDA Stream中设置窗口的例子：

cudaStreamAttrValue stream_attribute;                                         // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                              // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss.

//Set the attributes to a CUDA stream of type cudaStream_t
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

base_ptr是窗口在global内存中的起始位置，num_bytes是窗口的大小。窗口范围为[ptr..ptr+num_bytes)。当然，窗口之外的数据也有可能会被保留在L2 cache中，但只能是非预留部分。

hitRatio是cache命中率，如果要访问的数据已经存在于L2 cache中了，就称作一次cache hit；如果要访问的数据不在L2 cache中，需要从global内存中读取，则称作一次cache miss。也就是说，窗口内的数据有60%的概率（以hitRatio=0.6为例）被保留在预留的L2 cache中，下次再访问这些数据就可以cache hit了，至于哪些数据会被保留则是随机的，只保证概率大约是60%。被命中的数据被持久访问（cudaAccessPropertyPersisting），而没被命中的数据则被流式访问（cudaAccessPropertyStreaming）。

下面是在CUDA GraphKernelNode中设置窗口的例子：

cudaKernelNodeAttrValue node_attribute;                                     // Kernel level attributes data structure
node_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
node_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                            // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
node_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
node_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
node_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss.

//Set the attributes to a CUDA Graph Kernel node of type cudaGraphNode_t
cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute);

假设预留的L2 cache大小为16KB，而窗口大小为32KB：

当hitRatio等于0.5时，硬件会从窗口中随机选择16KB的数据，指定为持久性数据并保存在预留的L2 cache中。
当hitRatio等于1.0时，硬件会尝试将窗口内所有的32KB数据都存放到预留的L2 cache中。但由于预留的大小小于窗口，缓存行会被逐出，以便在预留的L2 cache中只保留最近使用的16KB数据。

假设预留的L2 cache大小为16KB，在两个不同的CUDA stream中的两个并发的kernel，每个kernel都有一个accessPolicyWindow，如果两个窗口的hitRatio都是1.0，则它们在竞争预留的L2 cache时可能会逐出对方的缓存行。然而，如果两个窗口的hitRatio都是0.5，则竞争的可能性会大大降低。

4.L2 Access Properties

global内存中的数据有3种类型的访问属性（枚举值，属于cudaAccessProperty类型的一部分）：

cudaAccessPropertyStreaming：具有流式访问属性的数据在L2 cache中持久保存的可能性较小，因为这些数据会被优先逐出L2 cache。此处说的是可能性比较小，但其可能性并不是0，比如当L2 cache中还有多余空间时，流式访问的数据也可能被暂时保留下来。
cudaAccessPropertyPersisting：具有持久访问属性的数据在L2 cache中持久保存的可能性较大，这些数据会被优先保留在L2 cache中。但这里说的是可能性较大，并不是一定能保留在L2 cache中，比如当出现第3部分说的竞争时，这些数据有可能会被逐出缓存行。
cudaAccessPropertyNormal：可以将具有持久访问属性的数据重置为正常状态，即移除其持久访问的属性。这有助于释放L2 cache空间。

通常与cudaAccessPolicyWindow配合使用。

在L2 cache的非预留区域，数据并不会受到cudaAccessPropertyStreaming、cudaAccessPropertyPersisting或cudaAccessPropertyNormal等访问属性的直接控制。这些访问属性是通过cudaAccessPolicyWindow机制专门为预留的L2 cache设计的，用于细粒度地管理特定内存区域的数据缓存行为。

5.L2 Persistence Example

下面是一个设置、使用并重置预留的L2 cache的例子：

cudaStream_t stream;
cudaStreamCreate(&stream);                                                                  // Create CUDA stream

cudaDeviceProp prop;                                                                        // CUDA device properties variable
cudaGetDeviceProperties( &prop, device_id);                                                 // Query GPU properties
size_t size = min( int(prop.l2CacheSize * 0.75) , prop.persistingL2CacheMaxSize );
cudaDeviceSetLimit( cudaLimitPersistingL2CacheSize, size);                                  // set-aside 3/4 of L2 cache for persisting accesses or the max allowed

size_t window_size = min(prop.accessPolicyMaxWindowSize, num_bytes);                        // Select minimum of user defined num_bytes and max window size.

cudaStreamAttrValue stream_attribute;                                                       // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data1);               // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = window_size;                                // Number of bytes for persistence access
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                                        // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;               // Persistence Property
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;                // Type of access property on cache miss

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Set the attributes to a CUDA Stream

for(int i = 0; i < 10; i++) {
    cuda_kernelA<<<grid_size,block_size,0,stream>>>(data1);                                 // This data1 is used by a kernel multiple times
}                                                                                           // [data1 + num_bytes) benefits from L2 persistence
cuda_kernelB<<<grid_size,block_size,0,stream>>>(data1);                                     // A different kernel in the same stream can also benefit
                                                                                            // from the persistence of data1

stream_attribute.accessPolicyWindow.num_bytes = 0;                                          // Setting the window size to 0 disable it
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Overwrite the access policy attribute to a CUDA Stream
cudaCtxResetPersistingL2Cache();                                                            // Remove any persistent lines in L2

cuda_kernelC<<<grid_size,block_size,0,stream>>>(data2);                                     // data2 can now benefit from full L2 in normal mode

在上述代码中，窗口的范围是[data1,data1+num_bytes)。data1被cuda_kernelA使用多次，因此data1被设置为持久访问属性，被保留在预留的L2 cache中。同一个stream的不同kernel函数cuda_kernelB在使用data1时，也可以直接从预留的L2 cache中访问到。随后将窗口的大小设置为0以禁掉窗口。最后调用cudaCtxResetPersistingL2Cache()重置整个L2 cache中（不仅仅是预留部分）具有持久访问的数据。

6.Reset L2 Access to Normal

有3种方式将持久访问重置为正常状态：

针对特定内存区域，使用cudaAccessPropertyNormal。
使用cudaCtxResetPersistingL2Cache()将所有持久属性的L2缓存行重置为正常状态。
如果某些持久属性的缓存行在一段时间内没有被访问，硬件最终会自动将它们重置为正常状态。但这个时间是不确定的。因此强烈建议不要依赖自动重置。

7.Manage Utilization of L2 set-aside cache

在CUDA并行计算环境中，多个kernel可以同时运行并且分布在不同的CUDA stream中。每个kernel都可能有自己的访问策略窗口，但它们都共享一个预留的L2 cache。这意味着，所有并发的kernel在使用预留的L2 cache时，会竞争同一个资源。

为了管理预留的L2 cache，应用程序必须考虑以下几点：

预留的L2 cache的大小。
可能同时执行的CUDA kernel。
所有可能同时执行的kernel的访问策略窗口。
何时以及如何重置L2。

8.Query L2 cache Properties

cudaGetDeviceProperties函数：

__host__cudaError_t cudaGetDeviceProperties
(cudaDeviceProp *prop, int device)

其会返回一个指向cudaDeviceProp的指针，而L2 cache的相关属性是结构体cudaDeviceProp的成员，包括：

int cudaDeviceProp::l2CacheSize：L2 cache的大小（单位是字节）。
int cudaDeviceProp::persistingL2CacheMaxSize：预留的L2 cache的最大容量（单位为字节）。
int cudaDeviceProp::accessPolicyMaxWindowSize：访问策略窗口的最大大小（单位为字节）。

9.Control L2 Cache Set-Aside Size for Persisting Memory Access

预留的L2 cache大小可以通过cudaDeviceGetLimit查询，并通过cudaDeviceSetLimit来设置，其最大值由cudaDeviceProp::persistingL2CacheMaxSize决定。

__host____device__cudaError_t cudaDeviceGetLimit
(size_t *pValue, cudaLimit limit)

其中cudaLimit是一个枚举类型：

enum cudaLimit {
    /* other fields not shown */
    cudaLimitPersistingL2CacheSize
};

【CUDA编程】【6】【3.Programming Interface】【3.2.CUDA Runtime】【3.2.3.Device Memory L2 Access Management】

L2 cache Set-Aside for Persisting Accesses，L2 Policy for Persisting Accesses，L2 Access Properties，L2 Persistence Example，Reset L2 Access to Normal，Manage Utilization of L2 set-aside cache，Query L2 cache Properties，Control L2 Cache Set-Aside Size for Persisting Memory Access