【CUDA编程】系列博客参考NVIDIA官方文档“CUDA C++ Programming Guide（v12.6）”。
本文为原创文章，未经本人允许，禁止转载。转载请注明出处。

1.Synchronization Functions

void __syncthreads();

__syncthreads()会等待线程块中的所有线程到达这个同步点。__syncthreads()可以在条件语句中使用，但只有当条件在整个线程块中计算结果相同时才可以。否则，代码执行可能会挂起或产生意外的副作用。

计算能力2.x及以上的device支持__syncthreads()的三种变体，如下所示。

int __syncthreads_count(int predicate);

predicate是一个布尔表达式或逻辑判断，用于检查某个线程是否满足特定的条件。__syncthreads_count()返回满足条件的线程数量。比如int count = __syncthreads_count(threadIdx.x % 2 == 0);返回线程块中线程索引为偶数的线程数量。

int __syncthreads_and(int predicate);

当所有线程都满足条件时（即所有线程的predicate都为非零值时），__syncthreads_and()返回非零值。

int __syncthreads_or(int predicate);

当至少有一个线程满足条件时，__syncthreads_or()就返回非零值。

void __syncwarp(unsigned mask=0xffffffff);

__syncwarp()是一种轻量级的同步机制，专门用于同步warp内的线程。通过指定mask，可以选择性的同步warp内的某些线程，而非同步整个warp。其中，mask是一个32位的整数，每一位对应warp中的一个线程。若mask的某一位为1，则表示对应线程需要参与同步。默认值0xffffffff表示warp中所有线程都需要同步。所有参与同步的线程必须调用相同的__syncwarp()，并使用相同的mask，否则结果是未定义的。

执行__syncwarp()能够保证参与屏障的线程之间的内存顺序。因此，warp中的线程若需要通过内存进行通信，可以先写入内存，然后执行__syncwarp()，接着安全地读取其他线程存储的值。

2.Texture Functions

2.1.Texture Object API

参考：Texture Object API。

2.1.1.tex1Dfetch()

template<class T>
T tex1Dfetch(cudaTextureObject_t texObj, int x);

tex1Dfetch()从一维纹理对象texObj指定的线性内存区域中，根据整数纹理坐标x获取数据。tex1Dfetch()仅支持非归一化坐标，因此仅支持clamp mode和border mode作为寻址模式。此函数不执行任何纹理过滤。对于整数类型的纹理数据，它可能会选择性地将整数提升为单精度浮点数。

2.1.2.tex1D()

template<class T>
T tex1D(cudaTextureObject_t texObj, float x);

tex1D()从一维纹理对象texObj指定的CUDA array中，根据纹理坐标x获取数据。与tex1Dfetch()不同，tex1D()使用的是归一化坐标。

2.1.3.tex1DLod()

template<class T>
T tex1DLod(cudaTextureObject_t texObj, float x, float level);

为了解释参数level，我们先来解释mipmap。mipmap（也称为MIP maps或pyramids）是计算机图形学中用于纹理采样和优化的一种技术。它通过存储纹理的多个分辨率版本，解决了纹理放缩时的性能和质量问题。比如原始纹理的分辨率为$1024 \times 1024$（第0层），通常来说，mipmap中每一层的分辨率是上一层分辨率的一半，那么第1层纹理的分辨率就是$512 \times 512$，第2层纹理的分辨率就是$256 \times 256$，直至分辨率降为$1 \times 1$。下面是一个mipmap如何存储的例子，左边的主图伴有一系列逐层缩小的备份小图。

而我们可以使用LOD（level-of-detail，即参数level）的值决定从哪个分辨率层级中采样。假设mipmap一共有$n$层，那么LOD的取值范围就是$[0,n]$，非整数的LOD值会进行插值，使用相邻两个层级的纹理数据混合得到采样值。

tex1DLod()从一维纹理对象texObj指定的CUDA array中，根据特定level层级中的纹理坐标x来获取数据。

2.1.4.tex1DGrad()

template<class T>
T tex1DGrad(cudaTextureObject_t texObj, float x, float dx, float dy);

与tex1DLod()不同之处在于，tex1DGrad()所用的LOD值是通过X方向梯度dx和Y方向梯度dy推导得到的。

2.1.5.tex2D()

template<class T>
T tex2D(cudaTextureObject_t texObj, float x, float y);

tex2D()从二维纹理对象texObj指定的CUDA array或线性内存区域中，根据二维纹理坐标(x,y)获取数据。

2.1.6.tex2D() for sparse CUDA arrays

                template<class T>
T tex2D(cudaTextureObject_t texObj, float x, float y, bool* isResident);

从二维纹理对象texObj指定的稀疏CUDA array中，根据二维纹理坐标(x,y)获取数据。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.7.tex2Dgather()

template<class T>
T tex2Dgather(cudaTextureObject_t texObj,
              float x, float y, int comp = 0);

从二维纹理对象texObj指定的CUDA array中，根据二维纹理坐标(x,y)以及参数comp（详见：Texture Gather）获取数据。

2.1.8.tex2Dgather() for sparse CUDA arrays

                template<class T>
T tex2Dgather(cudaTextureObject_t texObj,
            float x, float y, bool* isResident, int comp = 0);

从二维纹理对象texObj指定的稀疏CUDA array中，根据二维纹理坐标(x,y)以及参数comp获取数据。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.9.tex2DGrad()

template<class T>
T tex2DGrad(cudaTextureObject_t texObj, float x, float y,
            float2 dx, float2 dy);

从二维纹理对象texObj指定的CUDA array中，根据二维纹理坐标(x,y)获取数据。LOD值通过梯度dx和dy计算得到。

2.1.10.tex2DGrad() for sparse CUDA arrays

                template<class T>
T tex2DGrad(cudaTextureObject_t texObj, float x, float y,
        float2 dx, float2 dy, bool* isResident);

从二维纹理对象texObj指定的稀疏CUDA array中，根据二维纹理坐标(x,y)获取数据。LOD值通过梯度dx和dy计算得到。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.11.tex2DLod()

template<class T>
tex2DLod(cudaTextureObject_t texObj, float x, float y, float level);

从二维纹理对象texObj指定的CUDA array或线性内存区域中，在特定level层级（即LOD值）中，根据二维纹理坐标(x,y)获取数据。

2.1.12.tex2DLod() for sparse CUDA arrays

        template<class T>
tex2DLod(cudaTextureObject_t texObj, float x, float y, float level, bool* isResident);

从二维纹理对象texObj指定的稀疏CUDA array中，在特定level层级（即LOD值）中，根据二维纹理坐标(x,y)获取数据。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.13.tex3D()

template<class T>
T tex3D(cudaTextureObject_t texObj, float x, float y, float z);

从三维纹理对象texObj指定的CUDA array中，根据三维纹理坐标(x,y,z)获取数据。

2.1.14.tex3D() for sparse CUDA arrays

                template<class T>
T tex3D(cudaTextureObject_t texObj, float x, float y, float z, bool* isResident);

从三维纹理对象texObj指定的稀疏CUDA array中，根据三维纹理坐标(x,y,z)获取数据。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.15.tex3DLod()

template<class T>
T tex3DLod(cudaTextureObject_t texObj, float x, float y, float z, float level);

从三维纹理对象texObj指定的CUDA array或线性内存区域中，在特定level层级（即LOD值）中，根据三维纹理坐标(x,y,z)获取数据。

2.1.16.tex3DLod() for sparse CUDA arrays

                template<class T>
T tex3DLod(cudaTextureObject_t texObj, float x, float y, float z, float level, bool* isResident);

从三维纹理对象texObj指定的稀疏CUDA array或线性内存区域中，在特定level层级（即LOD值）中，根据三维纹理坐标(x,y,z)获取数据。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.17.tex3DGrad()

template<class T>
T tex3DGrad(cudaTextureObject_t texObj, float x, float y, float z,
            float4 dx, float4 dy);

从三维纹理对象texObj指定的CUDA array中，在特定LOD值下，根据三维纹理坐标(x,y,z)获取数据。其中，LOD值通过X方向梯度dx和Y方向梯度dy计算得到。

2.1.18.tex3DGrad() for sparse CUDA arrays

                template<class T>
T tex3DGrad(cudaTextureObject_t texObj, float x, float y, float z,
        float4 dx, float4 dy, bool* isResident);

从三维纹理对象texObj指定的稀疏CUDA array中，在特定LOD值下，根据三维纹理坐标(x,y,z)获取数据。其中，LOD值通过X方向梯度dx和Y方向梯度dy计算得到。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.19.tex1DLayered()

template<class T>
T tex1DLayered(cudaTextureObject_t texObj, float x, int layer);

从一维纹理对象texObj指定的CUDA array中，根据一维纹理坐标x和层索引layer获取数据。

2.1.20.tex1DLayeredLod()

template<class T>
T tex1DLayeredLod(cudaTextureObject_t texObj, float x, int layer, float level);

从一维纹理对象texObj指定的CUDA array中，在特定level层级（即LOD值）中，根据一维纹理坐标x和层索引layer获取数据。

2.1.21.tex1DLayeredGrad()

template<class T>
T tex1DLayeredGrad(cudaTextureObject_t texObj, float x, int layer,
                   float dx, float dy);

从一维纹理对象texObj指定的CUDA array中，在特定LOD值下，根据一维纹理坐标x和层索引layer获取数据。其中，LOD值通过X方向梯度dx和Y方向梯度dy计算得到。

2.1.22.tex2DLayered()

template<class T>
T tex2DLayered(cudaTextureObject_t texObj,
               float x, float y, int layer);

从二维纹理对象texObj指定的CUDA array中，根据二维纹理坐标(x,y)和层索引layer获取数据。

2.1.23.tex2DLayered() for Sparse CUDA Arrays

                template<class T>
T tex2DLayered(cudaTextureObject_t texObj,
            float x, float y, int layer, bool* isResident);

从二维纹理对象texObj指定的稀疏CUDA array中，根据二维纹理坐标(x,y)和层索引layer获取数据。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.24.tex2DLayeredLod()

template<class T>
T tex2DLayeredLod(cudaTextureObject_t texObj, float x, float y, int layer,
                  float level);

从二维纹理对象texObj指定的CUDA array中，在特定level层级（即LOD值）中，根据二维纹理坐标(x,y)和层索引layer获取数据。

2.1.25.tex2DLayeredLod() for sparse CUDA arrays

                template<class T>
T tex2DLayeredLod(cudaTextureObject_t texObj, float x, float y, int layer,
                float level, bool* isResident);

从二维纹理对象texObj指定的稀疏CUDA array中，在特定level层级（即LOD值）中，根据二维纹理坐标(x,y)和层索引layer获取数据。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.26.tex2DLayeredGrad()

template<class T>
T tex2DLayeredGrad(cudaTextureObject_t texObj, float x, float y, int layer,
                   float2 dx, float2 dy);

从二维纹理对象texObj指定的CUDA array中，在特定LOD值下，根据二维纹理坐标(x,y)和层索引layer获取数据。其中，LOD值通过X方向梯度dx和Y方向梯度dy计算得到。

2.1.27.tex2DLayeredGrad() for sparse CUDA arrays

                template<class T>
T tex2DLayeredGrad(cudaTextureObject_t texObj, float x, float y, int layer,
                float2 dx, float2 dy, bool* isResident);

从二维纹理对象texObj指定的稀疏CUDA array中，在特定LOD值下，根据二维纹理坐标(x,y)和层索引layer获取数据。其中，LOD值通过X方向梯度dx和Y方向梯度dy计算得到。同时，通过指针isResident返回该texel是否驻留在显存中。如果texel不在显存中，则函数返回值为零。

2.1.28.texCubemap()

template<class T>
T texCubemap(cudaTextureObject_t texObj, float x, float y, float z);

从cubemap纹理对象texObj指定的CUDA array中，根据纹理坐标(x,y,z)获取数据。

2.1.29.texCubemapGrad()

template<class T>
T texCubemapGrad(cudaTextureObject_t texObj, float x, float, y, float z,
                float4 dx, float4 dy);

从cubemap纹理对象texObj指定的CUDA array中，根据纹理坐标(x,y,z)获取数据。梯度dx和dy用于确定LOD值。

2.1.30.texCubemapLod()

template<class T>
T texCubemapLod(cudaTextureObject_t texObj, float x, float, y, float z,
                float level);

从cubemap纹理对象texObj指定的CUDA array中，根据纹理坐标(x,y,z)获取数据。参数level用于指定LOD值。

2.1.31.texCubemapLayered()

template<class T>
T texCubemapLayered(cudaTextureObject_t texObj,
                    float x, float y, float z, int layer);

从cubemap分层纹理对象texObj指定的CUDA array中，根据纹理坐标(x,y,z)和层索引layer获取数据。

2.1.32.texCubemapLayeredGrad()

template<class T>
T texCubemapLayeredGrad(cudaTextureObject_t texObj, float x, float y, float z,
                       int layer, float4 dx, float4 dy);

从cubemap分层纹理对象texObj指定的CUDA array中，根据纹理坐标(x,y,z)和层索引layer获取数据。梯度dx和dy用于确定LOD值。

2.1.33.texCubemapLayeredLod()

template<class T>
T texCubemapLayeredLod(cudaTextureObject_t texObj, float x, float y, float z,
                       int layer, float level);

从cubemap分层纹理对象texObj指定的CUDA array中，根据纹理坐标(x,y,z)和层索引layer获取数据。参数level用于指定LOD值。

3.Surface Functions

参考：Surface Memory。

表面函数（surface functions）仅支持计算能力2.0及以上的device。

在第3.1部分中，boundaryMode指定了边界模式，即如何处理超出范围的表面坐标。当边界模式设置为cudaBoundaryModeClamp时，超出范围的坐标会被限制在有效范围内。当边界模式设置为cudaBoundaryModeZero时，超出范围的读操作返回零，写操作会被忽略。当边界模式设置为cudaBoundaryModeTrap时，超出范围的访问会导致kernel执行失败。

3.1.Surface Object API

3.1.1.surf1Dread()

template<class T>
T surf1Dread(cudaSurfaceObject_t surfObj, int x,
               boundaryMode = cudaBoundaryModeTrap);

从一维表面对象surfObj指定的CUDA array中，使用字节坐标x读取数据。

3.1.2.surf1Dwrite

template<class T>
void surf1Dwrite(T data,
                  cudaSurfaceObject_t surfObj,
                  int x,
                  boundaryMode = cudaBoundaryModeTrap);

将值data写入到由一维表面对象surfObj指定的CUDA array中，写入的位置由字节坐标x指定。

3.1.3.surf2Dread()

template<class T>
T surf2Dread(cudaSurfaceObject_t surfObj,
              int x, int y,
              boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surf2Dread(T* data,
                 cudaSurfaceObject_t surfObj,
                 int x, int y,
                 boundaryMode = cudaBoundaryModeTrap);

从二维表面对象surfObj指定的CUDA array中，使用字节坐标x和y读取数据。

3.1.4.surf2Dwrite()

template<class T>
void surf2Dwrite(T data,
                  cudaSurfaceObject_t surfObj,
                  int x, int y,
                  boundaryMode = cudaBoundaryModeTrap);

将值data写入到由二维表面对象surfObj指定的CUDA array中，写入的位置由字节坐标x和y指定。

3.1.5.surf3Dread()

template<class T>
T surf3Dread(cudaSurfaceObject_t surfObj,
              int x, int y, int z,
              boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surf3Dread(T* data,
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int z,
                 boundaryMode = cudaBoundaryModeTrap);

从三维表面对象surfObj指定的CUDA array中，使用字节坐标x、y、z读取数据。

3.1.6.surf3Dwrite()

template<class T>
void surf3Dwrite(T data,
                  cudaSurfaceObject_t surfObj,
                  int x, int y, int z,
                  boundaryMode = cudaBoundaryModeTrap);

将值data写入到由三维表面对象surfObj指定的CUDA array中，写入的位置由字节坐标x、y、z指定。

3.1.7.surf1DLayeredread()

template<class T>
T surf1DLayeredread(
                 cudaSurfaceObject_t surfObj,
                 int x, int layer,
                 boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surf1DLayeredread(T data,
                 cudaSurfaceObject_t surfObj,
                 int x, int layer,
                 boundaryMode = cudaBoundaryModeTrap);

从一维分层表面对象surfObj指定的CUDA array中，使用字节坐标x和层索引layer读取数据。

3.1.8.surf1DLayeredwrite()

template<class Type>
void surf1DLayeredwrite(T data,
                 cudaSurfaceObject_t surfObj,
                 int x, int layer,
                 boundaryMode = cudaBoundaryModeTrap);

将值data写入到由一维分层表面对象surfObj指定的CUDA array中，写入的位置由字节坐标x和层索引layer指定。

3.1.9.surf2DLayeredread()

template<class T>
T surf2DLayeredread(
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int layer,
                 boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surf2DLayeredread(T data,
                         cudaSurfaceObject_t surfObj,
                         int x, int y, int layer,
                         boundaryMode = cudaBoundaryModeTrap);

从二维分层表面对象surfObj指定的CUDA array中，使用字节坐标x、y和层索引layer读取数据。

3.1.10.surf2DLayeredwrite()

template<class T>
void surf2DLayeredwrite(T data,
                          cudaSurfaceObject_t surfObj,
                          int x, int y, int layer,
                          boundaryMode = cudaBoundaryModeTrap);

将值data写入到由二维分层表面对象surfObj指定的CUDA array中，写入的位置由字节坐标x、y和层索引layer指定。

3.1.11.surfCubemapread()

template<class T>
T surfCubemapread(
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int face,
                 boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surfCubemapread(T data,
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int face,
                 boundaryMode = cudaBoundaryModeTrap);

从cubemap表面对象surfObj指定的CUDA array中，使用字节坐标x、y和面索引face读取数据。

3.1.12.surfCubemapwrite()

template<class T>
void surfCubemapwrite(T data,
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int face,
                 boundaryMode = cudaBoundaryModeTrap);

将值data写入到由cubemap表面对象surfObj指定的CUDA array中，写入的位置由字节坐标x、y和面索引face指定。

3.1.13.surfCubemapLayeredread()

template<class T>
T surfCubemapLayeredread(
             cudaSurfaceObject_t surfObj,
             int x, int y, int layerFace,
             boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surfCubemapLayeredread(T data,
             cudaSurfaceObject_t surfObj,
             int x, int y, int layerFace,
             boundaryMode = cudaBoundaryModeTrap);

从cubemap分层表面对象surfObj指定的CUDA array中，使用字节坐标x、y和索引layerFace读取数据。

3.1.14.surfCubemapLayeredwrite()

template<class T>
void surfCubemapLayeredwrite(T data,
             cudaSurfaceObject_t surfObj,
             int x, int y, int layerFace,
             boundaryMode = cudaBoundaryModeTrap);

将值data写入到由cubemap分层表面对象surfObj指定的CUDA array中，写入的位置由字节坐标x、y和索引layerFace指定。

4.Read-Only Data Cache Load Function

只读数据缓存加载函数（read-only data cache load function）仅支持计算能力为5.0及以上的device。

T __ldg(const T* address);

__ldg()用于将全局内存中的数据放到只读数据缓存中，从而提高后续访问速度。适用于kernel中频繁读取、不会被修改的数据。

类型T可以是：

char、signed char、short、int、long、long long、unsigned char、unsigned short、unsigned int、unsigned long、unsigned long long。
char2、char4、short2、short4、int2、int4、longlong2、uchar2、uchar4、ushort2、ushort4、uint2、uint4、ulonglong2。
float、float2、float4、double、double2。
如果包含了cuda_fp16.h头文件，类型T还可以是__half或__half2。
如果包含了cuda_bf16.h头文件，类型T还可以是__nv_bfloat16或__nv_bfloat162。