2012-07-04 81 views
3

I need to search a (non-text) file for the byte sequence "9µ}Æ" (or "\x39\xb5\x7d\xc6"). Is there a better way to search a file for a byte string?

After 5 hours of searching online, this is the best I could come up with. It works, but I would like to know whether there is a better way:

char buffer; 

int pos=in.tellg(); 

// search file for string 
while(!in.eof()){ 
    in.read(&buffer, 1); 
    pos=in.tellg(); 
    if(buffer=='9'){ 
     in.read(&buffer, 1); 
     pos=in.tellg(); 
     if(buffer=='µ'){ 
      in.read(&buffer, 1); 
      pos=in.tellg(); 
      if(buffer=='}'){ 
       in.read(&buffer, 1); 
       pos=in.tellg(); 
       if(buffer=='Æ'){ 
        cout << "found"; 
       } 
      } 
     } 
    } 

    in.seekg((streampos) pos); 
} 

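Notes:

  • I can't use getline(). This is not a text file, so there may be very few newlines.
  • Before this I tried reading into a multi-character buffer, copying the buffer into a C++ string, and then using string::find(). That did not work because the file contains many '\0' characters, so the sequence in the buffer was cut short when it was copied into the string.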
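
The truncation described in the second note usually comes from building the string from a bare char pointer, which stops at the first '\0'. A minimal sketch, assuming the number of bytes in the buffer is known; the helper name containsSequence is hypothetical and not from the question. Passing the length explicitly keeps every byte, after which string::find() works on binary data:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the 4-byte sequence occurs in the
// first `len` bytes of `buf`, even when `buf` contains '\0' bytes.
bool containsSequence(const char* buf, std::size_t len) {
    // The (pointer, length) constructor copies exactly `len` bytes;
    // std::string(buf) would stop at the first '\0'.
    const std::string data(buf, len);
    // An explicit length is given for the key as well, so the same code
    // would still work for a key that contains '\0'.
    const std::string key("\x39\xb5\x7d\xc6", 4);
    return data.find(key) != std::string::npos;
}
```

With istream::read(), the length to pass would be the value returned by gcount() after the read.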
+0

Why don't you read 4 characters at a time instead of 1? – twain249

+0

You can read a large block of bytes at a time, store it in a byte array, and compare using 'memcmp' or 'std::search'. Repeat until EOF if you like. – jweyrich

+0

Are you on a system where you can spawn a grep process? – Almo

Answers

0

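If you don't mind loading the entire file into an in-memory array (or using mmap() to make the file look as if it were in memory), then you can search for your byte sequence in memory, which is a bit easier to do: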

#include <string.h> // memcmp 

// Works much like strstr(), except it looks for a binary sub-sequence rather than a string sub-sequence 
const char * MemMem(const char * lookIn, int numLookInBytes, const char * lookFor, int numLookForBytes) 
{ 
    if (numLookForBytes == 0) return lookIn; // hmm, existential questions here 

    // Try every possible start offset. Restarting the comparison at each offset 
    // avoids missing a match that overlaps a failed partial match (e.g. "aab" in "aaab"). 
    for (int i = 0; i + numLookForBytes <= numLookInBytes; i++) 
    { 
        if (memcmp(lookIn + i, lookFor, numLookForBytes) == 0) return lookIn + i; 
    } 
    return NULL; 
} 

...then you can call the above function on the in-memory data array:

// MemMem() returns a const pointer, so the result must be const as well 
const char * ret = MemMem(theInMemoryArrayContainingFilesBytes, numBytesInFile, myShortSequence, 4); 
if (ret != NULL) printf("Found it at offset %i\n", (int)(ret-theInMemoryArrayContainingFilesBytes)); 
else printf("It's not there.\n"); 
+2

If you're going to load the file into memory anyway, why not use 'std::search'? – bames53

0

This program loads the entire file into memory and then uses std::search on it.

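#include <algorithm> 
#include <fstream> 
#include <iostream> 
#include <iterator> 
#include <sstream> 
#include <string> 

int main() { 
    std::string filedata; 
    { 
        // Open in binary mode so that no bytes are translated while reading 
        std::ifstream fin("file.dat", std::ios::binary); 
        std::stringstream ss; 
        ss << fin.rdbuf(); 
        filedata = ss.str(); 
    } 

    std::string key = "\x39\xb5\x7d\xc6"; 
    auto result = std::search(std::begin(filedata), std::end(filedata), 
                              std::begin(key), std::end(key)); 
    if (std::end(filedata) != result) { 
        std::cout << "found\n"; 
        // result is an iterator pointing at '\x39' 
    } 
} 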
0
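// Fragment: `is` is an open std::istream; memcmp/memmove need <cstring> 
const char delims[] = { '\x39', '\xb5', '\x7d', '\xc6' }; // char literals avoid narrowing errors 
char buffer[4]; 
const size_t delim_size = 4; 
const size_t last_index = delim_size - 1; 

// Pre-fill the first three bytes of the window 
for (size_t i = 0; i < last_index; ++i) 
{ 
    if (! (is.get(buffer[i]))) 
        return false; // stream too short 
} 

// Read one byte into the last slot, compare, then slide the window by one 
while (is.get(buffer[last_index])) 
{ 
    if (memcmp(buffer, delims, delim_size) == 0) 
        break; // you have arrived 
    memmove(buffer, buffer + 1, last_index); 
} 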

Since you are looking for exactly 4 bytes, you can also compare them as a single integer:

// Assumes a little-endian machine and a 4-byte unsigned int 
unsigned int delim = 0xc67db539; 
unsigned int uibuffer; 
char * buffer = reinterpret_cast<char *>(&uibuffer); 

for (size_t i = 0; i < 3; ++i) 
{ 
    if (! (is.get(buffer[i]))) 
        return false; // stream too short 
} 

while (is.get(buffer[3])) 
{ 
    if (uibuffer == delim) 
        break; // you have arrived 
    uibuffer >>= 8; // drop the oldest byte and move the others down 
} 
5

Similar to what bames53 posted; I used a vector as the buffer:

// Needs <algorithm>, <fstream> and <vector> 
std::ifstream ifs("file.bin", std::ios::binary); 

ifs.seekg(0, std::ios::end); 
std::streamsize f_size = ifs.tellg(); 
ifs.seekg(0, std::ios::beg); 

std::vector<unsigned char> buffer(f_size); 
// read() expects a char*, so the unsigned char buffer needs a cast 
ifs.read(reinterpret_cast<char *>(buffer.data()), f_size); 

std::vector<unsigned char> seq = {0x39, 0xb5, 0x7d, 0xc6}; 

bool found = std::search(buffer.begin(), buffer.end(), seq.begin(), seq.end()) != buffer.end(); 
0

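Since you said you cannot search the whole file because of the string's null terminating characters, here is an alternative that reads the entire file and uses recursion to find the first occurrence of the sequence in it.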

#include <fstream> 
#include <iostream> 
#include <iterator> 
#include <string> 

using namespace std; 

string readFile (const char *fileName) { 
    ifstream fi (fileName, ios::binary); 
    if (!fi) { 
        cerr << "ERROR: Cannot open file" << endl; 
        return string(); // empty string; returning NULL here would be undefined behaviour 
    } 
    return string ((istreambuf_iterator<char>(fi)), istreambuf_iterator<char>()); 
} 

// Recursively checks whether needle matches haystack starting at haystack_pos 
bool findFirstOccurrenceOf_r (const string &haystack, const char *needle, size_t haystack_pos, size_t needle_pos, size_t needle_len) { 
    if (needle_pos == needle_len) 
        return true; 
    if (haystack[haystack_pos] == needle[needle_pos]) 
        return findFirstOccurrenceOf_r (haystack, needle, haystack_pos+1, needle_pos+1, needle_len); 
    return false; 
} 

// Returns the offset of the first match, or -1 if there is none 
int findFirstOccurrenceOf (const string &haystack, const char *needle, size_t length) { 
    if (haystack.length() < length) 
        return -1; // also avoids unsigned underflow in the loop bound 
    for (size_t i = 0; i <= haystack.length() - length; i++) { 
        if (findFirstOccurrenceOf_r (haystack, needle, i, 0, length)) 
            return static_cast<int>(i); 
    } 
    return -1; 
} 

int main() { 
    const char str_to_find[4] = {'\x39', '\xB5', '\x7D', '\xC6'}; 
    string contents = readFile ("input"); 

    int pos = findFirstOccurrenceOf (contents, str_to_find, 4); 

    cout << pos << endl; 
} 

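If the file isn't too big, the best solution is to load the whole file into memory so that you don't need to keep reading from the drive. If the file is too large to load at once, you need to load it one chunk at a time. But if you load it in chunks, make sure you check the edges of the chunks: a chunk boundary may fall right in the middle of the sequence you are searching for.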
