在软件中实现SSE 4.2的CRC32C

所以我有一个包含CRC32C校验和的设计，以确保数据没有被损坏。我决定使用CRC32C，因为如果运行该软件的计算机支持SSE 4.2，我可以同时拥有软件版本和硬件加速版本。我正在参考英特尔开发人员手册（第2A卷），它似乎提供了crc32指令后面的算法。不过，我运气不错。英特尔的开发者指南说以下内容：在软件中实现SSE 4.2的CRC32C

BIT_REFLECT32: DEST[31-0] = SRC[0-31] 
MOD2: Remainder from Polynomial division modulus 2 

TEMP1[31-0] <- BIT_REFLECT(SRC[31-0]) 
TEMP2[31-0] <- BIT_REFLECT(DEST[31-0]) 
TEMP3[63-0] <- TEMP1[31-0] << 32 
TEMP4[63-0] <- TEMP2[31-0] << 32 
TEMP5[63-0] <- TEMP3[63-0] XOR TEMP4[63-0] 
TEMP6[31-0] <- TEMP5[63-0] MOD2 0x11EDC6F41 
DEST[31-0] <- BIT_REFLECT(TEMP6[31-0])

现在，据我所知道的，我所做的一切，以确保其正常启动TEMP6行了，但我想我可以是误会多项式除法，或实施它不正确。如果我的理解是正确的，1/1 mod 2 = 1,0/1 mod 2 = 0和两个除零都是未定义的。

我不明白的是如何与64位和33位操作数的二进制分割工作。如果SRC是0x00000000，并且DEST是0xFFFFFFFF,TEMP5[63-32]将被全部置位，而TEMP5[31-0]将全部为未置位。

如果我是从TEMP5使用比特作为分子，将有30个部分由零作为多项式11EDC6F41只有33位长（因此将其转换为一个64位无符号整数离开顶部30比特未设置），因此分母未被设置为30位。

但是，如果我要使用多项式作为分子，则TEMP5的底部32位是未设置的，导致在那里除以零，并且结果的前30位将为零，因为前30位的分子为零，如0/1 mod 2 = 0。

我误解了这是如何工作的？只是失去了一些东西？或者英特尔在其文档中遗漏了一些关键步骤？

我之所以选择英特尔开发人员指南，是因为他们使用的是33位多项式，而且我想让输出完全相同，而当我使用32位多项式时， bit polynomial 1EDC6F41（如下所示）。

uint32_t poly = 0x1EDC6F41, sres, crcTable[256], data = 0x00000000; 

for (n = 0; n < 256; n++) { 
    sres = n; 
    for (k = 0; k < 8; k++) 
     sres = (sres & 1) == 1 ? poly^(sres >> 1) : (sres >> 1); 
    crcTable[n] = sres; 
} 
sres = 0xFFFFFFFF; 

for (n = 0; n < 4; n++) { 
    sres = crcTable[(sres^data) & 0xFF]^(sres >> 8); 
}

以上代码生成4138093821作为输出，并且crc32操作码使用输入0x00000000产生2346497208。

对不起，如果这是写得很糟糕或难以理解的地方，它对我来说已经很晚了。

来源

2013-07-15 LMS

对于那些使用Delphi，我已经[写了一些开源代码]（http://blog.synopse.info/post/2014/05/25/New-crc32c%28%29-function-using-optimized如果SSE 4.2不可用，则使用新的'crc32'硬件指令（如果可用）以及快速的x86 asm或纯pascal代码（使用预先计算的表），使用-asm-and-SSE-4.2-instruction）。 Naive滚动版本以330 MB/s运行，优化展开的x86 asm以1.7 GB/s的速度运行，SSE 4.2硬件在Win32和Win64平台上以3.7 GB/s的速度提供惊人的速度。 –

这里是CRC-32C的软件和硬件版本。软件版本经过优化，一次处理八个字节。由于该指令的吞吐量为一个周期，但延迟时间为三个周期，所以硬件版本经过优化，可以在单个内核上并行有效地运行三条crc32q指令。

/* crc32c.c -- compute CRC-32C using the Intel crc32 instruction 
* Copyright (C) 2013 Mark Adler 
* Version 1.1 1 Aug 2013 Mark Adler 
*/ 

/* 
    This software is provided 'as-is', without any express or implied 
    warranty. In no event will the author be held liable for any damages 
    arising from the use of this software. 

    Permission is granted to anyone to use this software for any purpose, 
    including commercial applications, and to alter it and redistribute it 
    freely, subject to the following restrictions: 

    1. The origin of this software must not be misrepresented; you must not 
    claim that you wrote the original software. If you use this software 
    in a product, an acknowledgment in the product documentation would be 
    appreciated but is not required. 
    2. Altered source versions must be plainly marked as such, and must not be 
    misrepresented as being the original software. 
    3. This notice may not be removed or altered from any source distribution. 

    Mark Adler 
    [email protected] 
*/ 

/* Use hardware CRC instruction on Intel SSE 4.2 processors. This computes a 
    CRC-32C, *not* the CRC-32 used by Ethernet and zip, gzip, etc. A software 
    version is provided as a fall-back, as well as for speed comparisons. */ 

/* Version history: 
    1.0 10 Feb 2013 First version 
    1.1 1 Aug 2013 Correct comments on why three crc instructions in parallel 
*/ 

#include <stdio.h> 
#include <stdlib.h> 
#include <stdint.h> 
#include <unistd.h> 
#include <pthread.h> 

/* CRC-32C (iSCSI) polynomial in reversed bit order. */ 
#define POLY 0x82f63b78 

/* Table for a quadword-at-a-time software crc. */ 
static pthread_once_t crc32c_once_sw = PTHREAD_ONCE_INIT; 
static uint32_t crc32c_table[8][256]; 

/* Construct table for software CRC-32C calculation. */ 
static void crc32c_init_sw(void) 
{ 
    uint32_t n, crc, k; 

    for (n = 0; n < 256; n++) { 
     crc = n; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc32c_table[0][n] = crc; 
    } 
    for (n = 0; n < 256; n++) { 
     crc = crc32c_table[0][n]; 
     for (k = 1; k < 8; k++) { 
      crc = crc32c_table[0][crc & 0xff]^(crc >> 8); 
      crc32c_table[k][n] = crc; 
     } 
    } 
} 

/* Table-driven software version as a fall-back. This is about 15 times slower 
    than using the hardware instructions. This assumes little-endian integers, 
    as is the case on Intel processors that the assembler code here is for. */ 
static uint32_t crc32c_sw(uint32_t crci, const void *buf, size_t len) 
{ 
    const unsigned char *next = buf; 
    uint64_t crc; 

    pthread_once(&crc32c_once_sw, crc32c_init_sw); 
    crc = crci^0xffffffff; 
    while (len && ((uintptr_t)next & 7) != 0) { 
     crc = crc32c_table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     len--; 
    } 
    while (len >= 8) { 
     crc ^= *(uint64_t *)next; 
     crc = crc32c_table[7][crc & 0xff]^
       crc32c_table[6][(crc >> 8) & 0xff]^
       crc32c_table[5][(crc >> 16) & 0xff]^
       crc32c_table[4][(crc >> 24) & 0xff]^
       crc32c_table[3][(crc >> 32) & 0xff]^
       crc32c_table[2][(crc >> 40) & 0xff]^
       crc32c_table[1][(crc >> 48) & 0xff]^
       crc32c_table[0][crc >> 56]; 
     next += 8; 
     len -= 8; 
    } 
    while (len) { 
     crc = crc32c_table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     len--; 
    } 
    return (uint32_t)crc^0xffffffff; 
} 

/* Multiply a matrix times a vector over the Galois field of two elements, 
    GF(2). Each element is a bit in an unsigned integer. mat must have at 
    least as many entries as the power of two for most significant one bit in 
    vec. */ 
static inline uint32_t gf2_matrix_times(uint32_t *mat, uint32_t vec) 
{ 
    uint32_t sum; 

    sum = 0; 
    while (vec) { 
     if (vec & 1) 
      sum ^= *mat; 
     vec >>= 1; 
     mat++; 
    } 
    return sum; 
} 

/* Multiply a matrix by itself over GF(2). Both mat and square must have 32 
    rows. */ 
static inline void gf2_matrix_square(uint32_t *square, uint32_t *mat) 
{ 
    int n; 

    for (n = 0; n < 32; n++) 
     square[n] = gf2_matrix_times(mat, mat[n]); 
} 

/* Construct an operator to apply len zeros to a crc. len must be a power of 
    two. If len is not a power of two, then the result is the same as for the 
    largest power of two less than len. The result for len == 0 is the same as 
    for len == 1. A version of this routine could be easily written for any 
    len, but that is not needed for this application. */ 
static void crc32c_zeros_op(uint32_t *even, size_t len) 
{ 
    int n; 
    uint32_t row; 
    uint32_t odd[32];  /* odd-power-of-two zeros operator */ 

    /* put operator for one zero bit in odd */ 
    odd[0] = POLY;    /* CRC-32C polynomial */ 
    row = 1; 
    for (n = 1; n < 32; n++) { 
     odd[n] = row; 
     row <<= 1; 
    } 

    /* put operator for two zero bits in even */ 
    gf2_matrix_square(even, odd); 

    /* put operator for four zero bits in odd */ 
    gf2_matrix_square(odd, even); 

    /* first square will put the operator for one zero byte (eight zero bits), 
     in even -- next square puts operator for two zero bytes in odd, and so 
     on, until len has been rotated down to zero */ 
    do { 
     gf2_matrix_square(even, odd); 
     len >>= 1; 
     if (len == 0) 
      return; 
     gf2_matrix_square(odd, even); 
     len >>= 1; 
    } while (len); 

    /* answer ended up in odd -- copy to even */ 
    for (n = 0; n < 32; n++) 
     even[n] = odd[n]; 
} 

/* Take a length and build four lookup tables for applying the zeros operator 
    for that length, byte-by-byte on the operand. */ 
static void crc32c_zeros(uint32_t zeros[][256], size_t len) 
{ 
    uint32_t n; 
    uint32_t op[32]; 

    crc32c_zeros_op(op, len); 
    for (n = 0; n < 256; n++) { 
     zeros[0][n] = gf2_matrix_times(op, n); 
     zeros[1][n] = gf2_matrix_times(op, n << 8); 
     zeros[2][n] = gf2_matrix_times(op, n << 16); 
     zeros[3][n] = gf2_matrix_times(op, n << 24); 
    } 
} 

/* Apply the zeros operator table to crc. */ 
static inline uint32_t crc32c_shift(uint32_t zeros[][256], uint32_t crc) 
{ 
    return zeros[0][crc & 0xff]^zeros[1][(crc >> 8) & 0xff]^
      zeros[2][(crc >> 16) & 0xff]^zeros[3][crc >> 24]; 
} 

/* Block sizes for three-way parallel crc computation. LONG and SHORT must 
    both be powers of two. The associated string constants must be set 
    accordingly, for use in constructing the assembler instructions. */ 
#define LONG 8192 
#define LONGx1 "8192" 
#define LONGx2 "16384" 
#define SHORT 256 
#define SHORTx1 "256" 
#define SHORTx2 "512" 

/* Tables for hardware crc that shift a crc by LONG and SHORT zeros. */ 
static pthread_once_t crc32c_once_hw = PTHREAD_ONCE_INIT; 
static uint32_t crc32c_long[4][256]; 
static uint32_t crc32c_short[4][256]; 

/* Initialize tables for shifting crcs. */ 
static void crc32c_init_hw(void) 
{ 
    crc32c_zeros(crc32c_long, LONG); 
    crc32c_zeros(crc32c_short, SHORT); 
} 

/* Compute CRC-32C using the Intel hardware instruction. */ 
static uint32_t crc32c_hw(uint32_t crc, const void *buf, size_t len) 
{ 
    const unsigned char *next = buf; 
    const unsigned char *end; 
    uint64_t crc0, crc1, crc2;  /* need to be 64 bits for crc32q */ 

    /* populate shift tables the first time through */ 
    pthread_once(&crc32c_once_hw, crc32c_init_hw); 

    /* pre-process the crc */ 
    crc0 = crc^0xffffffff; 

    /* compute the crc for up to seven leading bytes to bring the data pointer 
     to an eight-byte boundary */ 
    while (len && ((uintptr_t)next & 7) != 0) { 
     __asm__("crc32b\t" "(%1), %0" 
       : "=r"(crc0) 
       : "r"(next), "0"(crc0)); 
     next++; 
     len--; 
    } 

    /* compute the crc on sets of LONG*3 bytes, executing three independent crc 
     instructions, each on LONG bytes -- this is optimized for the Nehalem, 
     Westmere, Sandy Bridge, and Ivy Bridge architectures, which have a 
     throughput of one crc per cycle, but a latency of three cycles */ 
    while (len >= LONG*3) { 
     crc1 = 0; 
     crc2 = 0; 
     end = next + LONG; 
     do { 
      __asm__("crc32q\t" "(%3), %0\n\t" 
        "crc32q\t" LONGx1 "(%3), %1\n\t" 
        "crc32q\t" LONGx2 "(%3), %2" 
        : "=r"(crc0), "=r"(crc1), "=r"(crc2) 
        : "r"(next), "0"(crc0), "1"(crc1), "2"(crc2)); 
      next += 8; 
     } while (next < end); 
     crc0 = crc32c_shift(crc32c_long, crc0)^crc1; 
     crc0 = crc32c_shift(crc32c_long, crc0)^crc2; 
     next += LONG*2; 
     len -= LONG*3; 
    } 

    /* do the same thing, but now on SHORT*3 blocks for the remaining data less 
     than a LONG*3 block */ 
    while (len >= SHORT*3) { 
     crc1 = 0; 
     crc2 = 0; 
     end = next + SHORT; 
     do { 
      __asm__("crc32q\t" "(%3), %0\n\t" 
        "crc32q\t" SHORTx1 "(%3), %1\n\t" 
        "crc32q\t" SHORTx2 "(%3), %2" 
        : "=r"(crc0), "=r"(crc1), "=r"(crc2) 
        : "r"(next), "0"(crc0), "1"(crc1), "2"(crc2)); 
      next += 8; 
     } while (next < end); 
     crc0 = crc32c_shift(crc32c_short, crc0)^crc1; 
     crc0 = crc32c_shift(crc32c_short, crc0)^crc2; 
     next += SHORT*2; 
     len -= SHORT*3; 
    } 

    /* compute the crc on the remaining eight-byte units less than a SHORT*3 
     block */ 
    end = next + (len - (len & 7)); 
    while (next < end) { 
     __asm__("crc32q\t" "(%1), %0" 
       : "=r"(crc0) 
       : "r"(next), "0"(crc0)); 
     next += 8; 
    } 
    len &= 7; 

    /* compute the crc for up to seven trailing bytes */ 
    while (len) { 
     __asm__("crc32b\t" "(%1), %0" 
       : "=r"(crc0) 
       : "r"(next), "0"(crc0)); 
     next++; 
     len--; 
    } 

    /* return a post-processed crc */ 
    return (uint32_t)crc0^0xffffffff; 
} 

/* Check for SSE 4.2. SSE 4.2 was first supported in Nehalem processors 
    introduced in November, 2008. This does not check for the existence of the 
    cpuid instruction itself, which was introduced on the 486SL in 1992, so this 
    will fail on earlier x86 processors. cpuid works on all Pentium and later 
    processors. */ 
#define SSE42(have) \ 
    do { \ 
     uint32_t eax, ecx; \ 
     eax = 1; \ 
     __asm__("cpuid" \ 
       : "=c"(ecx) \ 
       : "a"(eax) \ 
       : "%ebx", "%edx"); \ 
     (have) = (ecx >> 20) & 1; \ 
    } while (0) 

/* Compute a CRC-32C. If the crc32 instruction is available, use the hardware 
    version. Otherwise, use the software version. */ 
uint32_t crc32c(uint32_t crc, const void *buf, size_t len) 
{ 
    int sse42; 

    SSE42(sse42); 
    return sse42 ? crc32c_hw(crc, buf, len) : crc32c_sw(crc, buf, len); 
} 

#ifdef TEST 

#define SIZE (262144*3) 
#define CHUNK SIZE 

int main(int argc, char **argv) 
{ 
    char *buf; 
    ssize_t got; 
    size_t off, n; 
    uint32_t crc; 

    (void)argv; 
    crc = 0; 
    buf = malloc(SIZE); 
    if (buf == NULL) { 
     fputs("out of memory", stderr); 
     return 1; 
    } 
    while ((got = read(0, buf, SIZE)) > 0) { 
     off = 0; 
     do { 
      n = (size_t)got - off; 
      if (n > CHUNK) 
       n = CHUNK; 
      crc = argc > 1 ? crc32c_sw(crc, buf + off, n) : 
          crc32c(crc, buf + off, n); 
      off += n; 
     } while (off < (size_t)got); 
    } 
    free(buf); 
    if (got == -1) { 
     fputs("read error\n", stderr); 
     return 1; 
    } 
    printf("%08x\n", crc); 
    return 0; 
} 

#endif /* TEST */

来源

2013-07-15 04:28:34

我似乎无法在任何地方找到“crc32q”指令（只是引用航空），也没有提到在内联程序集中使用的“crc32b”指令，并且MSVC不接受其内联部件。 – LMS

这是为GNU编译器（gcc）编写的，它使用AT＆T语法来编译汇编指令，而不是英特尔语法。由于AT＆T语法不依赖于参数类型（例如dword ptr等），因此AT＆T语法更清楚地说明了生成的指令。您的汇编程序可能使用Intel语法，其中'crc32'“指令”实际上可以生成六条不同指令之一。根据论证的性质，哪一个必须由汇编程序决定，以及由人试图读取代码。 –

'crc32q'每次操作一个“四字”，64位。 'crc32b'每次操作一个字节。 –

马克·阿德勒的答案是正确的，完整的，但CRC-32C集成在他们的应用程序可能会觉得有点难以适应的代码，特别是如果他们使用的是Windows和.NET那些寻求快速简便的方法。

我使用硬件或软件方法创建了library that implements CRC-32C，具体取决于可用的硬件。它可用作C++和.NET的NuGet包。当然是开源的。

除了包装Mark Adler的代码之外，我发现了一种简单的方法来提高软件回退吞吐量50％。在我的电脑上，该软件库现在可以实现2 GB/s的软件和20 GB/s以上的硬件。对于那些好奇，这里的优化软件实现：

static uint32_t append_table(uint32_t crci, buffer input, size_t length) 
{ 
    buffer next = input; 
#ifdef _M_X64 
    uint64_t crc; 
#else 
    uint32_t crc; 
#endif 

    crc = crci^0xffffffff; 
#ifdef _M_X64 
    while (length && ((uintptr_t)next & 7) != 0) 
    { 
     crc = table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     --length; 
    } 
    while (length >= 16) 
    { 
     crc ^= *(uint64_t *)next; 
     uint64_t high = *(uint64_t *)(next + 8); 
     crc = table[15][crc & 0xff] 
      ^table[14][(crc >> 8) & 0xff] 
      ^table[13][(crc >> 16) & 0xff] 
      ^table[12][(crc >> 24) & 0xff] 
      ^table[11][(crc >> 32) & 0xff] 
      ^table[10][(crc >> 40) & 0xff] 
      ^table[9][(crc >> 48) & 0xff] 
      ^table[8][crc >> 56] 
      ^table[7][high & 0xff] 
      ^table[6][(high >> 8) & 0xff] 
      ^table[5][(high >> 16) & 0xff] 
      ^table[4][(high >> 24) & 0xff] 
      ^table[3][(high >> 32) & 0xff] 
      ^table[2][(high >> 40) & 0xff] 
      ^table[1][(high >> 48) & 0xff] 
      ^table[0][high >> 56]; 
     next += 16; 
     length -= 16; 
    } 
#else 
    while (length && ((uintptr_t)next & 3) != 0) 
    { 
     crc = table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     --length; 
    } 
    while (length >= 12) 
    { 
     crc ^= *(uint32_t *)next; 
     uint32_t high = *(uint32_t *)(next + 4); 
     uint32_t high2 = *(uint32_t *)(next + 8); 
     crc = table[11][crc & 0xff] 
      ^table[10][(crc >> 8) & 0xff] 
      ^table[9][(crc >> 16) & 0xff] 
      ^table[8][crc >> 24] 
      ^table[7][high & 0xff] 
      ^table[6][(high >> 8) & 0xff] 
      ^table[5][(high >> 16) & 0xff] 
      ^table[4][high >> 24] 
      ^table[3][high2 & 0xff] 
      ^table[2][(high2 >> 8) & 0xff] 
      ^table[1][(high2 >> 16) & 0xff] 
      ^table[0][high2 >> 24]; 
     next += 12; 
     length -= 12; 
    } 
#endif 
    while (length) 
    { 
     crc = table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     --length; 
    } 
    return (uint32_t)crc^0xffffffff; 
}

正如你所看到的，它只是在一个时间仰卧起坐较大的块。它需要更大的查找表，但它仍然缓存友好。该表以相同的方式生成，只有更多的行。

我研究的另一件事是使用PCLMULQDQ指令在AMD处理器上获得硬件加速。我设法将Intel's CRC patch for zlib（也available on GitHub）连接到CRC-32C多项式~~，除了the magic constant 0x9db42487。如果有人能够破译那个，请让我知道~~。在supersaw7's excellent explanation on reddit之后，我也移植了难以捉摸的0x9db42487常量，我只需要找点时间来擦亮和测试它。

来源

2014-02-20 17:28:15

+1感谢您分享您的代码。将它移植到Delphi时，它可以帮助我很多。 –

我修复了该补丁的链接并添加了一些其他链接。罗伯特，你在这个问题上取得了进展吗？ –

它似乎与支持PCLMULQDQ的cloudflare的zlib不使用常量...也许这对你有用？ –

我在这里比较各种算法：https://github.com/htot/crc32c

最快的算法已经从英特尔crc_iscsi_v_pcl.asm汇编代码采取（这是在Linux内核的修饰形式提供），并使用C包装（crcintelasm.cc ）纳入该项目。

为了能够在32位平台上运行此代码，首先将其移植到C（crc32intelc），并且需要少量内联汇编。代码的某些部分取决于位的大小，crc32q在32位上不可用，也不是movq，这些都放在宏的（crc32intel.h）中，替换为32位平台的代码。

来源

2016-10-04 22:35:16

在软件中实现SSE 4.2的CRC32C

回答

相关问题