PC-relative jump in gcc inline assembly

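I have an asm loop that is guaranteed never to run more than 128 iterations, and I want to unroll it using a PC-relative jump. The idea is to unroll the iterations in reverse order and then jump as far into the unrolled sequence as needed. The code would look like this: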

#define __mul(i) \
    "movq -"#i"(%3,%5,8),%%rax;" \
    "mulq "#i"(%4,%6,8);" \
    "addq %%rax,%0;" \
    "adcq %%rdx,%1;" \
    "adcq $0,%2;"

asm("jmp (128-count)*size_of_one_iteration" // I need to figure this jump out 
    __mul(127) 
    __mul(126) 
    __mul(125) 
    ... 
    __mul(1) 
    __mul(0) 
    : "+r"(lo),"+r"(hi),"+r"(overflow) 
    : "r"(a.data),"r"(b.data),"r"(i-k),"r"(k) 
    : "%rax","%rdx");

Is this possible with gcc inline assembly?


2011-02-05 Chris

In gcc inline assembly you can use labels and let the assembler work out the jump target for you. Something like this (contrived example):

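int max(int a, int b)
{
    int result;
    __asm__ __volatile__(
     "movl %1, %0\n"
     "cmpl %2, %0\n"      /* sets flags from result - b, i.e. a - b */
     "jge 1f\n"           /* signed a >= b: keep a */
     "movl %2, %0\n"
     "1:\n"               /* numeric local label, safe if the asm gets duplicated */
     : "=&r"(result)      /* early clobber: written before b is read */
     : "r"(a), "r"(b)
     : "cc");
    return (result);
}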

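That's one thing. The other thing you can do, to avoid the multiplication, is to have the assembler align the blocks for you, say at a multiple of 32 bytes (I don't think the instruction sequence fits into 16 bytes), like this: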

#define __mul(i)                      \
    ".align 32\n"                     \
    ".Lmul" #i ":\n"                  \
    "movq -" #i "(%3,%5,8),%%rax\n"   \
    "mulq " #i "(%4,%6,8)\n"          \
    "addq %%rax,%0\n"                 \
    "adcq %%rdx,%1\n"                 \
    "adcq $0,%2\n"

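This simply pads the instruction stream with nops. If you choose not to align the blocks, you can still use the generated local labels in the main asm statement to work out the size of one assembled block: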

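#ifdef UNALIGNED
__asm__ ("imul $(.Lmul0-.Lmul1), %[label]\n"   /* scale by the size of one block */
#else
__asm__ ("shlq $5, %[label]\n"                  /* blocks are 32 bytes apart */
#endif
    "leaq .Lmulblkstart(%%rip), %[dummy]\n"     /* RIP-relative address of the first block */
    "addq %[label], %[dummy]\n"                 /* address of the block to enter */
    "jmp *%[dummy]\n"                           /* indirect jump through the register */
    ".align 32\n"
    ".Lmulblkstart:\n"
    __mul(127)
    ...
    __mul(0)
    /* %[label] is overwritten above, so strictly it should be a "+r" operand bound to a variable */
    : ... [dummy]"=r"(dummy) : [label]"r"((128-count)))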

And for the case where count is a compile-time constant, you can even do:

__asm__("jmp .Lmul" #count "\n" ...);

One small note to finish:

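Aligning the blocks is a good idea if the autogenerated __mul() code can produce sequences of different lengths. For the constants 0..127 used here that is not the case, since they all fit into a byte, but if you scale them up the displacements become 16- or 32-bit values and the instruction blocks grow with them. By padding the instruction stream, the jump-table technique can still be used.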
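The aligned-block jump described in this answer can be tried out on a much smaller scale. The following is only a sketch under stated assumptions: the function name add_n_times, the 4-way unrolling, the 16-byte block spacing and the portable fallback loop are all illustrative and not part of the original code. Each "block" is a single addq, padded to 16 bytes with .p2align, and the entry point is computed as start + (4 - count) * 16.

```c
#include <stdint.h>

/* Hypothetical helper: adds `step` to an accumulator `count` times by
   jumping into a 4-times unrolled sequence of 16-byte-aligned blocks.
   Precondition: 0 <= count <= 4. */
uint64_t add_n_times(uint64_t step, unsigned count)
{
    uint64_t acc = 0;
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t offset = 4 - count;   /* number of blocks to skip */
    uint64_t target;
    __asm__ __volatile__(
        "shlq $4, %[off]\n\t"            /* blocks are 16 bytes apart */
        "leaq 1f(%%rip), %[tgt]\n\t"     /* RIP-relative address of block 0 */
        "addq %[off], %[tgt]\n\t"
        "jmp *%[tgt]\n\t"                /* indirect jump through a register */
        ".p2align 4\n"
        "1:\n\t"
        "addq %[step], %[acc]\n\t"
        ".p2align 4\n\t"                 /* pad with nops to the next block */
        "addq %[step], %[acc]\n\t"
        ".p2align 4\n\t"
        "addq %[step], %[acc]\n\t"
        ".p2align 4\n\t"
        "addq %[step], %[acc]\n\t"
        ".p2align 4\n"                   /* landing point for count == 0 */
        : [acc] "+r"(acc), [off] "+r"(offset), [tgt] "=&r"(target)
        : [step] "r"(step)
        : "cc");
#else
    /* Portable fallback so the sketch still builds on other targets. */
    while (count--)
        acc += step;
#endif
    return acc;
}
```

Calling add_n_times(7, 3) enters at the second block and executes three additions. Numeric local labels (1:) are used instead of named ones so that the asm can be inlined more than once without duplicate symbols.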


2011-02-18 16:34:36

Sorry, I can't provide the answer in AT&T syntax; I hope you can easily do the translation yourself.

If you have the count in RCX and can place a label just after __mul(0), then you could do this:

; rcx must be in [0..128] range. 
    imul ecx, ecx, -size_of_one_iteration ; Notice the multiplier is negative (using ecx is faster, the upper half of RCX will be automatically cleared by CPU) 
    lea rcx, [rcx + the_label] ; There is no memory read here 
    jmp rcx

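Hope this helps. EDIT: I made a mistake yesterday. I had assumed that referencing the label in [rcx + the_label] would be resolved as [rcx + rip + disp], but it isn't, because no such addressing mode exists (only [rip + disp32] does).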

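This code should work. In addition, it leaves RCX untouched and clobbers RAX and RDX instead (but your code appears to write them before ever reading them):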

; rcx must be in [0..128] range.
    imul rdx, rcx, -size_of_one_iteration ; Notice the multiplier is negative (the 64-bit form is needed: a negative result written to edx would be zero-extended into rdx)
    lea rax, [the_label] ; PC-relative addressing (There is no memory read here)
    add rax, rdx
    jmp rax
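For readers who want the AT&T form, here is one possible translation into gcc extended asm, scaled down to four iterations. It is a sketch under assumptions, not the answerer's code: the function name add_from_end, the single-instruction "iteration" and the fallback loop are illustrative. The size of one iteration is taken from the difference of two local labels (8f-9f, which is negative), so no alignment is needed as long as every block assembles to the same length.

```c
#include <stdint.h>

/* Hypothetical helper: adds `step` to an accumulator `count` times by
   jumping backwards from a label placed just after the last block.
   Precondition: 0 <= count <= 4. */
uint64_t add_from_end(uint64_t step, unsigned count)
{
    uint64_t acc = 0;
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t off = count;
    uint64_t target;
    __asm__ __volatile__(
        /* 8f-9f is minus the size of one block, so off = -count * size */
        "imulq $(8f-9f), %[off], %[off]\n\t"
        "leaq 9f(%%rip), %[tgt]\n\t"     /* address just past the last block */
        "addq %[off], %[tgt]\n\t"
        "jmp *%[tgt]\n\t"
        "addq %[step], %[acc]\n\t"       /* four equally sized blocks */
        "addq %[step], %[acc]\n\t"
        "addq %[step], %[acc]\n"
        "8:\n\t"
        "addq %[step], %[acc]\n"
        "9:\n"
        : [acc] "+r"(acc), [off] "+r"(off), [tgt] "=&r"(target)
        : [step] "r"(step)
        : "cc");
#else
    /* Portable fallback so the sketch still builds on other targets. */
    while (count--)
        acc += step;
#endif
    return acc;
}
```

The multiplication is done in 64 bits on purpose: the product is negative, and a 32-bit result would be zero-extended rather than sign-extended when used as a 64-bit offset.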


2011-02-05 23:34:12 LocoDelAssembly

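This isn't a direct answer, but have you considered using a variant of Duff's Device instead of inline assembly? It would take the form of a switch statement: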

switch(iterations) { 
    case 128: /* code for i=128 here */ 
    case 127: /* code for i=127 here */ 
    case 126: /* code for i=126 here */ 
    /* ... */ 
    case 1: /* code for i=1 here*/ 
    break; 
    default: die("too many cases"); 
}
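As a concrete, scaled-down illustration of the switch form, here is a sketch in plain C. The function name dot_prefix, the limit of four iterations and the error return value are assumptions made for the example; the real loop body from the question would replace the multiply-accumulate lines.

```c
#include <stdint.h>

/* Hypothetical helper: computes a[0]*b[0] + ... + a[n-1]*b[n-1] for
   n in [0, 4] by falling through the cases of a switch.
   Returns 0 on success, -1 if n is out of range. */
int dot_prefix(const uint32_t *a, const uint32_t *b, unsigned n, uint64_t *out)
{
    uint64_t acc = 0;
    switch (n) {
    case 4: acc += (uint64_t)a[3] * b[3]; /* fall through */
    case 3: acc += (uint64_t)a[2] * b[2]; /* fall through */
    case 2: acc += (uint64_t)a[1] * b[1]; /* fall through */
    case 1: acc += (uint64_t)a[0] * b[0]; /* fall through */
    case 0: break;
    default: return -1;                   /* too many iterations requested */
    }
    *out = acc;
    return 0;
}
```

The compiler is free to implement the switch as a jump table or a computed jump, which gives much the same effect as the hand-written asm without depending on instruction sizes.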


2011-02-06 01:36:02 nelhage

I'm currently using a variant of Duff's device, but I posted this question because I want to switch to an asm-only approach. – Chris 2011-02-07 19:59:35
