2012-06-22 27 views
2

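C regular expressions

I wrote C code that runs a set of regular expressions over enwik8 and enwik9. I also implemented the same algorithm in other languages for benchmarking. The problem is that something is wrong with my C code: it takes 40 seconds, while Python and the other implementations take only 10 seconds.

What am I missing?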

#include <stdio.h> 
#include <regex.h> 

#define size 1024 

int main(int argc, char **argv){ 
    FILE *fp; 
    char line[size]; 
    regex_t re; 
    int x; 
    const char *filename = "enwik8"; 
    const char *strings[] = {"\bhome\b", "\bdear\b", "\bhouse\b", "\bdog\b", "\bcat\b", "\bblue\b", "\bred\b", "\bgreen\b", "\bbox\b", "\bwoman\b", "\bman\b", "\bwomen\b", "\bfull\b", "\bempty\b", "\bleft\b", "\bright\b", "\btop\b", "\bhelp\b", "\bneed\b", "\bwrite\b", "\bread\b", "\btalk\b", "\bgo\b", "\bstay\b", "\bupper\b", "\blower\b", "\bI\b", "\byou\b", "\bhe\b", "\bshe\b", "\bwe\b", "\bthey\b"}; 

    for(x = 0; x < 33; x++){
        if(regcomp(&re, strings[x], REG_EXTENDED) != 0){
            printf("Failed to compile regex '%s'\n", strings[x]);

            return -1;
        }

        fp = fopen(filename, "r");

        if(fp == 0){
            printf("Failed to open file %s\n", filename);

            return -1;
        }

        while((fgets(line, size, fp)) != NULL){
            regexec(&re, line, 0, NULL, 0);
        }
    }

    return 0; 
} 
+0

Are you sure Python and the others use the same regex library? Also, don't forget regfree.

+2

Also, did you intend to open the same file 33 times without ever closing it?

+0

Yeah, you're right, that is probably my performance problem.

Answers

3

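File access and regex compilation are the likely culprits.

  • Compile the regexes once and store them in an array
  • Open the file
  • Read a line
  • Run every compiled regex against it
  • Close the file.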
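
The steps above could be sketched roughly as follows. This is only an illustration, not the asker's final code: the helper name `count_matching_lines`, the `LINE_SIZE` constant and the per-pattern `counts` array are made up for the example, and `REG_NOSUB` is added on the assumption that only match/no-match matters. Two other things in the question's code are worth checking while reworking it: the array holds 32 patterns but the loop runs to 33, which reads past the end of the array, and `"\b"` inside a C string literal is a backspace character, so a word boundary would have to be written `"\\b"` (and `\b` itself is a GNU extension, not part of POSIX regex syntax).

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE_SIZE 1024

/* Hypothetical helper: compiles every pattern once, reads the file once,
 * and runs each compiled regex against each line that fgets returns.
 * counts[i] receives the number of lines matched by patterns[i].
 * Returns 0 on success, -1 on any failure. */
int count_matching_lines(const char *filename, const char *const *patterns,
                         size_t n, long *counts)
{
    regex_t *res = malloc(n * sizeof *res);
    size_t compiled = 0;   /* how many regexes need regfree on cleanup */
    FILE *fp = NULL;
    char line[LINE_SIZE];
    int status = -1;

    if (res == NULL)
        return -1;

    /* Step 1: compile each regex exactly once and keep it in an array. */
    for (; compiled < n; compiled++) {
        if (regcomp(&res[compiled], patterns[compiled],
                    REG_EXTENDED | REG_NOSUB) != 0) {
            fprintf(stderr, "Failed to compile regex '%s'\n",
                    patterns[compiled]);
            goto cleanup;
        }
        counts[compiled] = 0;
    }

    /* Step 2: open the file a single time. */
    fp = fopen(filename, "r");
    if (fp == NULL) {
        fprintf(stderr, "Failed to open file %s\n", filename);
        goto cleanup;
    }

    /* Steps 3 and 4: read a line, then run every compiled regex on it. */
    while (fgets(line, sizeof line, fp) != NULL) {
        for (size_t i = 0; i < n; i++) {
            if (regexec(&res[i], line, 0, NULL, 0) == 0)
                counts[i]++;
        }
    }
    status = 0;

cleanup:
    /* Step 5: close the file and release the compiled regexes. */
    if (fp != NULL)
        fclose(fp);
    for (size_t i = 0; i < compiled; i++)
        regfree(&res[i]);
    free(res);
    return status;
}
```

From `main` this would be called with something like `count_matching_lines("enwik8", strings, 32, counts)`, where `counts` is a `long[32]`. Like the original, it treats each 1023-byte chunk returned by `fgets` as a "line", so longer lines are split and a word straddling the split can be missed.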
+0

+1. Inverting the loops sped the program up by a factor.