2010-07-09 40 views
7

通过SSE指令执行复数乘法和除法是否有利? 我知道使用SSE时,加法和减法效果会更好。有人能告诉我如何使用SSE执行复杂的乘法以获得更好的性能吗?复杂Mul和Div使用Sse指令

回答

9

那么复数乘法的定义是:

((c1a * c2a) - (c1b * c2b)) + ((c1b * c2a) + (c1a * c2b))i 

所以在复数的两种成分是

((c1a * c2a) - (c1b * c2b)) and ((c1b * c2a) + (c1a * c2b))i 

所以假设你使用8辆彩车代表定义了4个复数如下:

c1a, c1b, c2a, c2b 
c3a, c3b, c4a, c4b 

而你想同时做(c1 * c3)和(c2 * c4)你的SSE代码看上去像下面这样的“东西”:

(注意我在windows下使用MSVC,但原理相同)。

__declspec(align(16)) float c1c2[]  = { 1.0f, 2.0f, 3.0f, 4.0f }; 
__declspec(align(16)) float c3c4[]   = { 4.0f, 3.0f, 2.0f, 1.0f }; 
__declspec(align(16)) float mulfactors[] = { -1.0f, 1.0f, -1.0f, 1.0f }; 
__declspec(align(16)) float res[]   = { 0.0f, 0.0f, 0.0f, 0.0f }; 

__asm 
{ 
    movaps xmm0, xmmword ptr [c1c2]   // Load c1 and c2 into xmm0. 
    movaps xmm1, xmmword ptr [c3c4]   // Load c3 and c4 into xmm1. 
    movaps xmm4, xmmword ptr [mulfactors] // load multiplication factors into xmm4 

    movaps xmm2, xmm1      
    movaps xmm3, xmm0      
    shufps xmm2, xmm1, 0xA0     // Change order to c3a c3a c4a c4a and store in xmm2 
    shufps xmm1, xmm1, 0xF5     // Change order to c3b c3b c4b c4b and store in xmm1 
    shufps xmm3, xmm0, 0xB1     // change order to c1b c1a c2b c2a abd store in xmm3 

    mulps xmm0, xmm2       
    mulps xmm3, xmm1      
    mulps xmm3, xmm4      // Flip the signs of the 'a's so the add works correctly. 

    addps xmm0, xmm3      // Add together 

    movaps xmmword ptr [res], xmm0   // Store back out 
}; 

float res1a = (c1c2[0] * c3c4[0]) - (c1c2[1] * c3c4[1]); 
float res1b = (c1c2[1] * c3c4[0]) + (c1c2[0] * c3c4[1]); 

float res2a = (c1c2[2] * c3c4[2]) - (c1c2[3] * c3c4[3]); 
float res2b = (c1c2[3] * c3c4[2]) + (c1c2[2] * c3c4[3]); 

if (res1a != res[0] || 
    res1b != res[1] || 
    res2a != res[2] || 
    res2b != res[3]) 
{ 
    _exit(1); 
} 

我上面所做的是我简化了一些数学。假设以下几点:

c1a c1b c2a c2b 
c3a c3b c4a c4b 

通过重新排列我结束了以下载体

0 => c1a c1b c2a c2b 
1 => c3b c3b c4b c4b 
2 => c3a c3a c4a c4a 
3 => c1b c1a c2b c2a 

我然后乘以0和2合力得到:

0 => c1a * c3a, c1b * c3a, c2a * c4a, c2b * c4a 

接着我乘3和1个一起得到:

3 => c1b * c3b, c1a * c3b, c2b * c4b, c2a * c4b 

最后,我翻了几个浮体的迹象,3

3 => -(c1b * c3b), c1a * c3b, -(c2b * c4b), c2a * c4b 

这样我就可以把它们相加,并得到

(c1a * c3a) - (c1b * c3b), (c1b * c3a) + (c1a * c3b), (c2a * c4a) - (c2b * c4b), (c2b * c4a) + (c2a * c4b) 

这就是我们是:)

+0

参见HTTP://软件.intel.com/file/1000,它似乎有更快的算法。 – MSalters 2010-07-09 12:04:14

+1

对我来说是类似的设置,但是他们需要SSE3 ...在我这个时代,99%的时间都可以,我承认。 – Goz 2010-07-09 13:00:59

+0

这增加了子指令看起来很方便。唉,我通常不以上述SSE2为目标兼容性:( – Goz 2010-07-09 13:02:50

8

只为后完整性,可下载的英特尔®64和IA-32体系结构优化参考手册here包含用于复数乘法(例6-9)和复数除法(例6-10)的程序集。

下面是例如乘法的代码:

// Multiplication of (ak + i bk) * (ck + i dk) 
// a + i b can be stored as a data structure 
movsldup xmm0, src1; load real parts into the destination, a1, a1, a0, a0 
movaps xmm1, src2; load the 2nd pair of complex values, i.e. d1, c1, d0, c0 
mulps xmm0, xmm1; temporary results, a1d1, a1c1, a0d0, a0c0 
shufps xmm1, xmm1, b1; reorder the real and imaginary parts, c1, d1, c0, d0 
movshdup xmm2, src1; load imaginary parts into the destination, b1, b1, b0, b0 
mulps xmm2, xmm1; temporary results, b1c1, b1d1, b0c0, b0d0 
addsubps xmm0, xmm2; b1c1+a1d1, a1c1 -b1d1, b0c0+a0d0, ; a0c0-b0d0 

的组件直接映射到gccs X86 intrinsics(只是谓词每个指令与__builtin_ia32_)。

4

intel优化参考中的算法不能正确处理输入中的溢出和NaN

单个NaN中的数字的实部或虚部将错误地传播到其他部分。

由于几个无限运算(例如无限* 0)在NaN处结束,溢出可能导致NaN出现在您的其他行为良好的数据中。

如果溢出NaN s为罕见的,一个简单的方法来避免这种情况是在结果只检查NaN并与编译器重新计算它符合IEEE执行:

float complex a[2], b[2]; 
__m128 res = simd_fast_multiply(a, b); 

/* store unconditionally, can be executed in parallel with the check 
* making it almost free if there is no NaN in data */ 
_mm_store_ps(dest, res); 

/* check for NaN */ 
__m128 n = _mm_cmpneq_ps(res, res); 
int have_nan = _mm_movemask_ps(n); 
if (have_nan != 0) { 
    /* do it again unvectorized */ 
    dest[0] = a[0] * b[0]; 
    dest[1] = a[1] * b[1]; 
}