
Bilinear interpolation in C/C++ and CUDA

I want to emulate the behavior of CUDA bilinear interpolation on the CPU, but I found that the values returned by tex2D do not seem to fit the bilinear formula.

My guess is that casting the interpolation coefficients from float to the 9-bit fixed-point format with 8 bits of fractional value leads to different values.

According to the conversion formula [2, line 106], the result of the conversion should be identical to the input float when the coefficient is 1/2^n, with n = 0, 1, ..., 8, but I still (though not always) get weird values.

Below I report an example of the weird values. In this case, the weird values always occur at id = 2*n + 1. Can anyone tell me why?

Src array:

Src[0][0] = 38; 
Src[1][0] = 39; 
Src[0][1] = 118; 
Src[1][1] = 13; 

Texture definition:

static texture<float4, 2, cudaReadModeElementType> texElnt; 
texElnt.addressMode[0] = cudaAddressModeClamp; 
texElnt.addressMode[1] = cudaAddressModeClamp; 
texElnt.filterMode = cudaFilterModeLinear; 
texElnt.normalized = false; 

Kernel function:

static __global__ void kernel_texElnt(float* pdata, int w, int h, int c, float stride/*0.03125f*/) {
    const int gx  = blockIdx.x*blockDim.x + threadIdx.x;
    const int gy  = blockIdx.y*blockDim.y + threadIdx.y;
    const int gw  = gridDim.x * blockDim.x;
    const int gid = gy*gw + gx;
    if (gx >= w || gy >= h) {
        return;
    }

    float2 pnt;
    pnt.x = (gx)*(stride)/*1/32*/;
    pnt.y = 0.0625f/*1/16*/;

    float4 result = tex2D(texElnt, pnt.x + 0.5f, pnt.y + 0.5f);
    pdata[gid*3 + 0] = pnt.x;
    pdata[gid*3 + 1] = pnt.y;
    pdata[gid*3 + 2] = result.x;
}

CUDA bilinear results:

id pnt.x pnt.y tex2D 
0 0.00000 0.0625 43.0000000 
1 0.03125 0.0625 42.6171875 
2 0.06250 0.0625 42.6484375 
3 0.09375 0.0625 42.2656250 
4 0.12500 0.0625 42.2968750 
5 0.15625 0.0625 41.9140625 
6 0.18750 0.0625 41.9453125 
7 0.21875 0.0625 41.5625000 
8 0.25000 0.0625 41.5937500 
9 0.28125 0.0625 41.2109375 
10 0.31250 0.0625 41.2421875 
11 0.34375 0.0625 40.8593750 
12 0.37500 0.0625 40.8906250 
13 0.40625 0.0625 40.5078125 
14 0.43750 0.0625 40.5390625 
15 0.46875 0.0625 40.1562500 
16 0.50000 0.0625 40.1875000 
17 0.53125 0.0625 39.8046875 
18 0.56250 0.0625 39.8359375 
19 0.59375 0.0625 39.4531250 
20 0.62500 0.0625 39.4843750 
21 0.65625 0.0625 39.1015625 
22 0.68750 0.0625 39.1328125 
23 0.71875 0.0625 38.7500000 
24 0.75000 0.0625 38.7812500 
25 0.78125 0.0625 38.3984375 
26 0.81250 0.0625 38.4296875 
27 0.84375 0.0625 38.0468750 
28 0.87500 0.0625 38.0781250 
29 0.90625 0.0625 37.6953125 
30 0.93750 0.0625 37.7265625 
31 0.96875 0.0625 37.3437500 
32 1.00000 0.0625 37.3750000 

CPU results:

// convert coefficients ((1-α)*(1-β)), (α*(1-β)), ((1-α)*β), (α*β) to fixed-point format 

id pnt.x pnt.y CPU result 
0 0.00000 0.0625 43.00000000 
1 0.03125 0.0625 43.23046875 
2 0.06250 0.0625 42.64843750 
3 0.09375 0.0625 42.87890625 
4 0.12500 0.0625 42.29687500 
5 0.15625 0.0625 42.52734375 
6 0.18750 0.0625 41.94531250 
7 0.21875 0.0625 42.17578125 
8 0.25000 0.0625 41.59375000 
9 0.28125 0.0625 41.82421875 
10 0.31250 0.0625 41.24218750 
11 0.34375 0.0625 41.47265625 
12 0.37500 0.0625 40.89062500 
13 0.40625 0.0625 41.12109375 
14 0.43750 0.0625 40.53906250 
15 0.46875 0.0625 40.76953125 
16 0.50000 0.0625 40.18750000 
17 0.53125 0.0625 40.41796875 
18 0.56250 0.0625 39.83593750 
19 0.59375 0.0625 40.06640625 
20 0.62500 0.0625 39.48437500 
21 0.65625 0.0625 39.71484375 
22 0.68750 0.0625 39.13281250 
23 0.71875 0.0625 39.36328125 
24 0.75000 0.0625 38.78125000 
25 0.78125 0.0625 39.01171875 
26 0.81250 0.0625 38.42968750 
27 0.84375 0.0625 38.66015625 
28 0.87500 0.0625 38.07812500 
29 0.90625 0.0625 38.30859375 
30 0.93750 0.0625 37.72656250 
31 0.96875 0.0625 37.95703125 
32 1.00000 0.0625 37.37500000 

I have left a simple piece of code on my github [3]; after running the program you will find two files in D:\.

Edit 2014-01-20

I ran the project with different increments and found that whenever alpha multiplied by beta is smaller than 0.00390625 (= 1/256, the resolution of the 8-bit fractional part), the tex2D return does not match the bilinear interpolation formula.


Could you please add the shortest complete example that others can compile and run? – talonmies


Thanks for the suggestion @talonmies, I have provided a link to the sample code. – user1995868

Answers

8

Satisfactory answers have already been given to this question, so here I just want to provide a hopefully useful compilation of information on bilinear interpolation, how it can be implemented in C++, and the different ways it can be done in CUDA.

The math behind bilinear interpolation

Assume that the original function T(x, y) is sampled at the points (i, j) of a Cartesian regular grid, with 0 <= i < M1, 0 <= j < M2, and i and j integers. For each value of y, one can first use 0 <= a < 1 to represent an arbitrary point i + a located between i and i + 1. Then, a linear interpolation along the y = j axis (which is parallel to the x axis) can be performed at that point, obtaining

r(i + a, j) = (1 - a) * T[i, j] + a * T[i + 1, j]

where r(x, y) is the function interpolating the samples of T(x, y). The same can be done for the line y = j + 1, obtaining

r(i + a, j + 1) = (1 - a) * T[i, j + 1] + a * T[i + 1, j + 1]

Now, for each i + a, an interpolation along the y axis can be performed on the samples r(i+a,j) and r(i+a,j+1). Accordingly, if one uses 0 <= b < 1 to represent an arbitrary point j + b located between j and j + 1, then a linear interpolation along the x = i + a axis (which is parallel to the y axis) can be worked out, so obtaining the final result

r(i + a, j + b) = (1 - b) * r(i + a, j) + b * r(i + a, j + 1)

Note that the relationships between i, j, a, b, x and y are the following:

i = floor(x),  a = x - i,  j = floor(y),  b = y - j

C/C++ implementation

Let me stress that this implementation, as well as the CUDA ones below, assumes, as stated at the beginning, that the samples of T are located at the points (i, j) of a Cartesian regular grid with 0 <= i < M1, 0 <= j < M2, and i and j integers (unit spacing). Also, the routine is provided in single-precision, complex (float2) arithmetic, but it can be easily cast to other arithmetics of interest.

void bilinear_interpolation_function_CPU(float2 * __restrict__ h_result, float2 * __restrict__ h_data,
                                         float * __restrict__ h_xout, float * __restrict__ h_yout,
                                         const int M1, const int M2, const int N1, const int N2) {

    float2 result_temp1, result_temp2;
    for (int k = 0; k < N2; k++) {
        for (int l = 0; l < N1; l++) {

            const int   ind_x = floor(h_xout[k*N1+l]);
            const float a     = h_xout[k*N1+l] - ind_x;

            const int   ind_y = floor(h_yout[k*N1+l]);
            const float b     = h_yout[k*N1+l] - ind_y;

            float2 h00, h01, h10, h11;
            if (((ind_x)   < M1) && ((ind_y)   < M2)) h00 = h_data[ind_y*M1+ind_x];       else h00 = make_float2(0.f, 0.f);
            if (((ind_x+1) < M1) && ((ind_y)   < M2)) h10 = h_data[ind_y*M1+ind_x+1];     else h10 = make_float2(0.f, 0.f);
            if (((ind_x)   < M1) && ((ind_y+1) < M2)) h01 = h_data[(ind_y+1)*M1+ind_x];   else h01 = make_float2(0.f, 0.f);
            if (((ind_x+1) < M1) && ((ind_y+1) < M2)) h11 = h_data[(ind_y+1)*M1+ind_x+1]; else h11 = make_float2(0.f, 0.f);

            result_temp1.x = a * h10.x + (-h00.x * a + h00.x);
            result_temp1.y = a * h10.y + (-h00.y * a + h00.y);

            result_temp2.x = a * h11.x + (-h01.x * a + h01.x);
            result_temp2.y = a * h11.y + (-h01.y * a + h01.y);

            h_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
            h_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);
        }
    }
}

The if/else statements in the above code are just boundary checks: samples falling outside the [0, M1-1] x [0, M2-1] region are set to 0.
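For illustration only, here is a minimal usage sketch of the routine above; the grid sizes, sample values and output sampling points are hypothetical and are not part of the original answer (it assumes the routine is compiled together with the CUDA headers that define float2 and make_float2).

// --- Hypothetical usage sketch: interpolate an M1 x M2 grid onto N1 x N2 output points.
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const int M1 = 4, M2 = 4, N1 = 8, N2 = 8;
    float2 *h_data   = (float2 *)malloc(M1 * M2 * sizeof(float2));
    float2 *h_result = (float2 *)malloc(N1 * N2 * sizeof(float2));
    float  *h_xout   = (float *)malloc(N1 * N2 * sizeof(float));
    float  *h_yout   = (float *)malloc(N1 * N2 * sizeof(float));

    // --- Fill the input grid with some sample values
    for (int j = 0; j < M2; j++)
        for (int i = 0; i < M1; i++)
            h_data[j * M1 + i] = make_float2((float)(i + j), 0.f);

    // --- Output sampling points, expressed in grid units (unit spacing)
    for (int k = 0; k < N2; k++)
        for (int l = 0; l < N1; l++) {
            h_xout[k * N1 + l] = l * (M1 - 1.f) / (N1 - 1.f);
            h_yout[k * N1 + l] = k * (M2 - 1.f) / (N2 - 1.f);
        }

    bilinear_interpolation_function_CPU(h_result, h_data, h_xout, h_yout, M1, M2, N1, N2);

    free(h_data); free(h_result); free(h_xout); free(h_yout);
    return 0;
}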

Standard CUDA implementation

This is a "standard" CUDA implementation tracing the CPU one above. No use of texture memory.

__global__ void bilinear_interpolation_kernel_GPU(float2 * __restrict__ d_result, const float2 * __restrict__ d_data,
                                                  const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                  const int M1, const int M2, const int N1, const int N2)
{
    const int l = threadIdx.x + blockDim.x * blockIdx.x;
    const int k = threadIdx.y + blockDim.y * blockIdx.y;

    if ((l < N1) && (k < N2)) {

        float2 result_temp1, result_temp2;

        const int   ind_x = floor(d_xout[k*N1+l]);
        const float a     = d_xout[k*N1+l] - ind_x;

        const int   ind_y = floor(d_yout[k*N1+l]);
        const float b     = d_yout[k*N1+l] - ind_y;

        float2 d00, d01, d10, d11;
        if (((ind_x)   < M1) && ((ind_y)   < M2)) d00 = d_data[ind_y*M1+ind_x];       else d00 = make_float2(0.f, 0.f);
        if (((ind_x+1) < M1) && ((ind_y)   < M2)) d10 = d_data[ind_y*M1+ind_x+1];     else d10 = make_float2(0.f, 0.f);
        if (((ind_x)   < M1) && ((ind_y+1) < M2)) d01 = d_data[(ind_y+1)*M1+ind_x];   else d01 = make_float2(0.f, 0.f);
        if (((ind_x+1) < M1) && ((ind_y+1) < M2)) d11 = d_data[(ind_y+1)*M1+ind_x+1]; else d11 = make_float2(0.f, 0.f);

        result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
        result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

        result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
        result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

        d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
        d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);
    }
}
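For completeness, a possible launch of the kernel above is sketched here; the block size is an assumption of mine and is not taken from the original answer.

// --- Hypothetical launch sketch: one thread per output point (block size is an assumption)
dim3 dimBlock(16, 16);
dim3 dimGrid((N1 + dimBlock.x - 1) / dimBlock.x, (N2 + dimBlock.y - 1) / dimBlock.y);
bilinear_interpolation_kernel_GPU<<<dimGrid, dimBlock>>>(d_result, d_data, d_xout, d_yout, M1, M2, N1, N2);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());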

CUDA implementation with texture fetch

This is the same implementation as above, but global memory is now accessed through the texture cache. For example, T[i,j] is accessed as

tex2D(d_texture_fetch_float,ind_x,ind_y); 

(where, of course, ind_x = i and ind_y = j, and d_texture_fetch_float is assumed to be a global-scope variable) instead of

d_data[ind_y*M1+ind_x]; 
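For completeness, the global texture reference assumed here could be declared as follows; this declaration is my own addition (it is not shown in the original answer) and is chosen to be consistent with the float2 channel description used in the binding routine further below.

// --- Assumed global-scope texture reference (not shown in the original answer)
texture<float2, 2, cudaReadModeElementType> d_texture_fetch_float;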

Note that the hard-wired texture filtering capabilities are not exploited here. The routine below has the same precision as the one above and could turn out to be somewhat faster on old CUDA architectures.

__global__ void bilinear_interpolation_kernel_GPU_texture_fetch(float2 * __restrict__ d_result,
                                                                const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                                const int M1, const int M2, const int N1, const int N2)
{
    const int l = threadIdx.x + blockDim.x * blockIdx.x;
    const int k = threadIdx.y + blockDim.y * blockIdx.y;

    if ((l < N1) && (k < N2)) {

        float2 result_temp1, result_temp2;

        const int   ind_x = floor(d_xout[k*N1+l]);
        const float a     = d_xout[k*N1+l] - ind_x;

        const int   ind_y = floor(d_yout[k*N1+l]);
        const float b     = d_yout[k*N1+l] - ind_y;

        const float2 d00 = tex2D(d_texture_fetch_float, ind_x,     ind_y);
        const float2 d10 = tex2D(d_texture_fetch_float, ind_x + 1, ind_y);
        const float2 d11 = tex2D(d_texture_fetch_float, ind_x + 1, ind_y + 1);
        const float2 d01 = tex2D(d_texture_fetch_float, ind_x,     ind_y + 1);

        result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
        result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

        result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
        result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

        d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
        d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);
    }
}

The texture binding can be done according to

void TextureBindingBilinearFetch(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch;
    float* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d, &pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    gpuErrchk(cudaBindTexture2D(0, &d_texture_fetch_float, data_d, &desc, M1, M2, pitch));
    d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
    d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
    gpuErrchk(cudaMemcpy2D(data_d, pitch, data, sizeof(float2)*M1, sizeof(float2)*M1, M2, cudaMemcpyHostToDevice));
}

Note that the if/else boundary check is no longer needed, because the texture hardware handles out-of-range accesses automatically: coordinates falling outside the [0, M1-1] x [0, M2-1] sampling region are clamped to the valid range, thanks to the instructions

d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
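As a side note of mine (not part of the original answer): clamping replicates the border texels rather than returning zero, so if the out-of-range samples should read as exactly 0, like in the CPU routine above, the border address mode could be used instead, for example

// --- Hypothetical alternative: out-of-range fetches return zero instead of the border texel
d_texture_fetch_float.addressMode[0] = cudaAddressModeBorder;
d_texture_fetch_float.addressMode[1] = cudaAddressModeBorder;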

CUDA implementation with texture interpolation

This is the last implementation and uses the hard-wired texture filtering capabilities.

__global__ void bilinear_interpolation_kernel_GPU_texture_interp(float2 * __restrict__ d_result,
                                                                 const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                                 const int M1, const int M2, const int N1, const int N2)
{
    const int l = threadIdx.x + blockDim.x * blockIdx.x;
    const int k = threadIdx.y + blockDim.y * blockIdx.y;

    if ((l < N1) && (k < N2)) {
        d_result[k*N1+l] = tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f);
    }
}

Note that the interpolation formula implemented by this feature is the same as the one derived above, but now

i = floor(x_B),  a = frac(x_B),  j = floor(y_B),  b = frac(y_B)

where x_B = x - 0.5 and y_B = y - 0.5. This explains the 0.5 offset in the instruction

tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f) 

In this case, the texture binding should be done as follows:

void TextureBindingBilinearInterp(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch;
    float* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d, &pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    gpuErrchk(cudaBindTexture2D(0, &d_texture_interp_float, data_d, &desc, M1, M2, pitch));
    d_texture_interp_float.addressMode[0] = cudaAddressModeClamp;
    d_texture_interp_float.addressMode[1] = cudaAddressModeClamp;
    d_texture_interp_float.filterMode = cudaFilterModeLinear;   // --- Enable linear filtering
    d_texture_interp_float.normalized = false;                  // --- Texture coordinates will NOT be normalized
    gpuErrchk(cudaMemcpy2D(data_d, pitch, data, sizeof(float2)*M1, sizeof(float2)*M1, M2, cudaMemcpyHostToDevice));
}
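Again, the global texture reference for this version is not shown in the original answer; a declaration consistent with the binding routine above could look like this (my assumption):

// --- Assumed global-scope texture reference for the filtering version (not shown in the original answer)
texture<float2, 2, cudaReadModeElementType> d_texture_interp_float;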

It should be noted that, as already mentioned in the other answers, a and b are stored in 9-bit fixed-point format with 8 bits of fractional value, so this approach will be very fast, but less accurate than the ones above.

3

The UV interpolants are truncated to 9 bits, not the participating texel values. Chapter 10 (Texturing) of The CUDA Handbook describes the 1D case in detail (including CPU emulation code). The code is open source and can be found at https://github.com/ArchaeaSoftware/cudahandbook/blob/master/texturing/tex1d_9bit.cu
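To make the statement above concrete, here is a minimal 1D sketch of the idea; it is my own illustration under the stated assumption (only the interpolation weight is rounded to 8 fractional bits, the texels are left untouched) and is not the code from the handbook.

// --- My own 1D illustration (not the handbook code): only the interpolation weight
// --- is quantized to 9-bit fixed point (8 fractional bits), the texels are not.
float tex1D_emulated(const float *tex, int N, float x)
{
    float xB = x - 0.5f;                                    // remove the half-texel offset
    int   i  = (int)floorf(xB);
    float a  = floorf((xB - i) * 256.0f + 0.5f) / 256.0f;   // weight with 8 fractional bits

    // --- Clamp addressing (cudaAddressModeClamp)
    int i0 = i     < 0 ? 0 : (i     > N - 1 ? N - 1 : i);
    int i1 = i + 1 < 0 ? 0 : (i + 1 > N - 1 ? N - 1 : i + 1);

    return (1.0f - a) * tex[i0] + a * tex[i1];
}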


Thanks for the reply, I have read the code (it is the same as link 2 in my post). According to the link, if x = 1/2^n (n = 1, 2, ..., 8), then the quantized value should be equal to x. I got mismatched results with tex2D, so I posted my question here. – user1995868

1

A wrong formula for the bilinear interpolation is what makes the texture fetch results look strange.

Formula 1: you can easily find it in the CUDA Programming Guide appendix or on Wikipedia

tex(x,y)=(1−α)(1−β)T[i,j] + α(1−β)T[i+1,j] + (1−α)βT[i,j+1] + αβT[i+1,j+1] 

Formula 2: the rearranged form with fewer multiplications, to which the 9-bit reduction is applied

tex(x,y)=T[i,j] + α(T[i+1,j]-T[i,j]) + β(T[i,j+1]-T[i,j]) + αβ(T[i,j]+T[i+1,j+1] - T[i+1, j]-T[i,j+1]) 

Converting the coefficients to the fixed-point format and plugging them into Formula 1 gives results that do not match the texture fetch, but Formula 2 works fine. (Formula 2 is just Formula 1 with the terms regrouped: collecting the coefficients of each texel gives back (1−α)(1−β), α(1−β), (1−α)β and αβ, so the two are algebraically identical and only differ once the coefficients are quantized.)

Conclusion:
If you want to emulate the bilinear interpolation performed by CUDA textures, you should use Formula 3. Try it!

Formula 3:

tex(x,y)=T[i,j] + frac(α)(T[i+1,j]-T[i,j]) + frac(β)(T[i,j+1]-T[i,j]) + frac(αβ)(T[i,j]+T[i+1,j+1] - T[i+1, j]-T[i,j+1]) 

// frac(x) converts x to the 9-bit fixed-point format with 8 bits of fractional value
float frac(float x) {
    float tmp = x - (float)(int)(x);                      // fractional part of x
    float frac256 = (float)(int)(tmp * 256.0f + 0.5f);    // round to the nearest multiple of 1/256
    return frac256 / 256.0f;
}
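Putting the pieces together, a CPU emulation of the 2D fetch could look like the sketch below; this is my own illustration of how frac() plugs into Formula 3 (for a scalar, row-major array, with clamp addressing and with the 0.5 texel offset already removed from x and y), not code from the original post.

// --- My own sketch: emulate tex2D() with cudaFilterModeLinear and cudaAddressModeClamp
// --- for a scalar, row-major W x H array T, using Formula 3 and the frac() above.
// --- x and y are unnormalized coordinates with the 0.5 texel offset already removed.
float tex2D_emulated(const float *T, int W, int H, float x, float y)
{
    int   i = (int)floorf(x), j = (int)floorf(y);
    float alpha = x - i,      beta = y - j;

    // --- Clamp addressing
    int i0 = i     < 0 ? 0 : (i     > W - 1 ? W - 1 : i);
    int i1 = i + 1 < 0 ? 0 : (i + 1 > W - 1 ? W - 1 : i + 1);
    int j0 = j     < 0 ? 0 : (j     > H - 1 ? H - 1 : j);
    int j1 = j + 1 < 0 ? 0 : (j + 1 > H - 1 ? H - 1 : j + 1);

    float T00 = T[j0*W + i0], T10 = T[j0*W + i1];
    float T01 = T[j1*W + i0], T11 = T[j1*W + i1];

    // --- Formula 3: the weights (and their product) are quantized to 8 fractional bits
    return T00 + frac(alpha) * (T10 - T00)
               + frac(beta)  * (T01 - T00)
               + frac(alpha * beta) * (T00 + T11 - T10 - T01);
}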