SSE optimization

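I'm working on a school project in which I have to optimize part of some code with SSE, but I have been stuck on one part for several days now.

I don't see any smart way to use vector SSE instructions (inline assembly / intrinsics) in this code (it is part of a Gaussian blur algorithm). I would be glad if somebody could give me just a small hint.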

for (int x = x_start; x < x_end; ++x)  // vertical blur... 
    { 
     float sum = image[x + (y_start - radius - 1)*image_w]; 
     float dif = -sum; 

     for (int y = y_start - 2*radius - 1; y < y_end; ++y) 
     {             // inner vertical Radius loop   
      float p = (float)image[x + (y + radius)*image_w]; // next pixel 
      buffer[y + radius] = p;       // buffer pixel 
      sum += dif + fRadius*p; 
      dif += p;          // accumulate pixel blur 

      if (y >= y_start) 
      { 
       float s = 0, w = 0;       // border blur correction 
       sum -= buffer[y - radius - 1]*fRadius;  // addition for fraction blur 
       dif += buffer[y - radius] - 2*buffer[y]; // sum up differences: +1, -2, +1 

       // cut off accumulated blur area of pixel beyond the border 
       // assume: added pixel values beyond border = value at border 
       p = (float)(radius - y);     // top part to cut off 
       if (p > 0) 
       { 
        p = p*(p-1)/2 + fRadius*p; 
        s += buffer[0]*p; 
        w += p; 
       } 
       p = (float)(y + radius - image_h + 1);    // bottom part to cut off 
       if (p > 0) 
       { 
        p = p*(p-1)/2 + fRadius*p; 
        s += buffer[image_h - 1]*p; 
        w += p; 
       } 
       new_image[x + y*image_w] = (unsigned char)((sum - s)/(weight - w)); // set blurred pixel 
      } 
      else if (y + radius >= y_start) 
      { 
       dif -= 2*buffer[y]; 
      } 
     } // for y 
    } // for x

Source

2013-12-18 Smarty77

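Are you learning SSE at school? That's cool. – Simple

Yeah :), it's an optional course on advanced assembly programming, but the deadline is approaching and I have been stuck on this for a long time :/ – Smarty77

Unfortunately, I think that if you want to use SSE you will have to reimplement this completely. You should precompute a 1D kernel of coefficients and then use SSE to perform the convolution along each axis. –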
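A minimal sketch of what this comment suggests, not the asker's code: build a normalized 1D Gaussian kernel once, then run the vertical pass four columns at a time. The names `make_gaussian_kernel` and `convolve_vertical_sse`, the `sigma` parameter, and the use of a `float` image (the question's image is cast from another type) are assumptions made for illustration. Rows beyond the border are clamped to the border row, matching the assumption stated in the question's comments.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cmath>
#include <vector>

// Hypothetical helper: normalized 1D Gaussian kernel with 2*radius + 1 taps.
std::vector<float> make_gaussian_kernel(int radius, float sigma) {
    std::vector<float> k(2 * radius + 1);
    float total = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        k[i + radius] = std::exp(-(i * i) / (2.0f * sigma * sigma));
        total += k[i + radius];
    }
    for (float &v : k) v /= total;  // weights sum to 1
    return k;
}

// Hypothetical vertical pass over a float image (assumed layout: row-major,
// w pixels per row). Four adjacent columns are processed per SSE iteration.
void convolve_vertical_sse(const float *src, float *dst, int w, int h,
                           const std::vector<float> &kernel) {
    const int radius = static_cast<int>(kernel.size() / 2);
    for (int y = 0; y < h; ++y) {
        int x = 0;
        for (; x + 4 <= w; x += 4) {
            __m128 acc = _mm_setzero_ps();
            for (int k = -radius; k <= radius; ++k) {
                int yy = y + k;
                if (yy < 0) yy = 0;          // clamp above the top border
                if (yy > h - 1) yy = h - 1;  // clamp below the bottom border
                const __m128 px = _mm_loadu_ps(src + yy * w + x);
                const __m128 wk = _mm_set1_ps(kernel[k + radius]);
                acc = _mm_add_ps(acc, _mm_mul_ps(px, wk));
            }
            _mm_storeu_ps(dst + y * w + x, acc);
        }
        for (; x < w; ++x) {  // scalar tail for leftover columns
            float acc = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                int yy = y + k;
                if (yy < 0) yy = 0;
                if (yy > h - 1) yy = h - 1;
                acc += src[yy * w + x] * kernel[k + radius];
            }
            dst[y * w + x] = acc;
        }
    }
}
```

The horizontal pass would follow the same pattern with the kernel applied along x. Unlike the running-sum scheme in the question, this costs 2*radius + 1 multiplications per pixel, so it trades arithmetic for a loop without carried dependencies.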

You can use logical operations and masks:

For example, instead of:

  // process only 1 
     if (p > 0) 
      p = p*(p-1)/2 + fRadius*p;

you can write

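  // processes 4 floats; p and fRadius are __m128 here (fRadius broadcast with _mm_set1_ps)
     const __m128 mask = _mm_cmpgt_ps(p, _mm_setzero_ps());           // all-ones in lanes where p > 0
     const __m128 p_tmp = _mm_add_ps(
         _mm_mul_ps(_mm_mul_ps(p, _mm_sub_ps(p, _mm_set1_ps(1.0f))), _mm_set1_ps(0.5f)),
         _mm_mul_ps(fRadius, p));                                     // p*(p-1)/2 + fRadius*p
     p = _mm_or_ps(_mm_and_ps(mask, p_tmp), _mm_andnot_ps(mask, p));  // = (p_tmp & mask) | (p & ~mask)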
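In the question's scalar code, a lane with `p <= 0` contributes nothing to `s` and `w`, so for that use it may be simpler to zero those lanes instead of keeping `p`. A small sketch under that assumption (the function name `cutoff_weight4` is made up for illustration):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: for each of four lanes, returns
// p*(p-1)/2 + fRadius*p where p > 0, and 0 where p <= 0.
__m128 cutoff_weight4(__m128 p, float fRadius) {
    const __m128 r = _mm_set1_ps(fRadius);
    const __m128 mask = _mm_cmpgt_ps(p, _mm_setzero_ps());  // all-ones where p > 0
    const __m128 tri = _mm_mul_ps(
        _mm_mul_ps(p, _mm_sub_ps(p, _mm_set1_ps(1.0f))), _mm_set1_ps(0.5f));
    const __m128 weight = _mm_add_ps(tri, _mm_mul_ps(r, p));
    return _mm_and_ps(mask, weight);  // ANDing with zero bits yields 0.0f
}
```

The result can then be multiplied by the border pixel and subtracted unconditionally, with no branch in the loop.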

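I can also recommend using a dedicated library that overloads the operators for the vector instructions. For example: http://code.compeng.uni-frankfurt.de/projects/vc
The dif variable makes each iteration of the inner loop depend on the previous one. You should try to parallelize the outer loop instead. But without operator overloading for the instructions, the code will become unmanageable.
Also consider rethinking the whole algorithm. The current one does not look parallel. Maybe you can give up some precision, or accept a slightly longer scalar run time?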
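To illustrate the outer-loop idea, here is a simplified sketch rather than the full blur: only the `sum`/`dif` recurrence from the question is kept, and the image is assumed to be stored as `float` (the function names are hypothetical). Each SSE lane carries the state of one column, so the dependency between rows stays sequential while four values of x advance together.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Scalar reference: the loop-carried recurrence for a single column x.
void recurrence_scalar(const float *img, float *out, int w, int h, int x,
                       float fRadius) {
    float sum = 0.0f, dif = 0.0f;
    for (int y = 0; y < h; ++y) {
        const float p = img[x + y * w];
        sum += dif + fRadius * p;  // needs dif from the previous row
        dif += p;
        out[x + y * w] = sum;
    }
}

// SSE sketch: columns x..x+3 are processed together. sum and dif each hold
// four independent per-column states; rows are still visited in order.
void recurrence_sse4(const float *img, float *out, int w, int h, int x,
                     float fRadius) {
    const __m128 r = _mm_set1_ps(fRadius);
    __m128 sum = _mm_setzero_ps();
    __m128 dif = _mm_setzero_ps();
    for (int y = 0; y < h; ++y) {
        const __m128 p = _mm_loadu_ps(img + x + y * w);  // 4 adjacent pixels
        sum = _mm_add_ps(sum, _mm_add_ps(dif, _mm_mul_ps(r, p)));
        dif = _mm_add_ps(dif, p);
        _mm_storeu_ps(out + x + y * w, sum);
    }
}
```

The border terms depend only on y, so they would be the same for all four lanes, and the per-column `buffer` would need four entries per row. A scalar loop is still needed for the columns left over when the width is not a multiple of 4.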

Source

2013-12-18 19:08:15 klm123

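I hope I understand you correctly, mate. I have tried that and it works, but the problem is in lines like sum += dif + fRadius*p; where I need dif from the previous iteration, and I can't get it when I try to compute 4 iterations at once. – Smarty77

@user2174310, I see. You should try to parallelize the outer loop. But without operator overloading for the instructions, the code will become unmanageable. – klm123

About 4.7x faster with the outer loop parallelized, thanks for the advice guys – Smarty77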
