Parsing large files in Python

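How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (or into memory)? Memory-mapped files don't help, because their contents can't be converted into some kind of lazy string. The re module only supports strings as the content argument.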

#include <boost/format.hpp> 
#include <boost/iostreams/device/mapped_file.hpp> 
#include <boost/regex.hpp> 
#include <iostream> 

int main(int argc, char* argv[]) 
{ 
    boost::iostreams::mapped_file fl("BigFile.log"); 
    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl); 
    boost::regex expr("something usefull"); 
    boost::match_flag_type flags = boost::match_default; 
    boost::iostreams::mapped_file::iterator start, end; 
    start = fl.begin(); 
    end = fl.end(); 
    boost::match_results<boost::iostreams::mapped_file::iterator> what; 
    while(boost::regex_search(start, end, what, expr)) 
    { 
        std::cout << what[0].str() << std::endl;
        start = what[0].second;
    } 
    return 0; 
}

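To demonstrate what I'm asking for, I wrote a short example in C++ (with boost) that does the same thing I want to do in Python.

Source

2012-07-26 Alex

Unless you need multi-line regexes, parse the file line by line. – Lenna 2012-07-26 17:06:04

Perhaps if you rephrased the question to say what you have and what you want to achieve, it would give us a better chance to offer suggestions, unless you are set on one particular approach. – 2012-07-26 17:08:28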

It depends on what kind of parsing you are doing.

If the parsing is line-oriented, you can iterate over the file's lines:

with open("/some/path") as f: 
    for line in f: 
        parse(line)

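Otherwise, you'll need something like chunking: read the data one block at a time and parse each block. Obviously this takes more care, because whatever you are trying to match may straddle a chunk boundary.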
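The chunked approach could look roughly like the sketch below. It is not from the original answer: the function name iter_matches and the parameters chunk_size and max_match_len are made up for illustration. It also assumes that no match is longer than max_match_len bytes and that the pattern needs no context outside the match itself (no anchors, \b or lookaround reaching across a cut). A match that starts too close to the end of the buffer is held back until the next chunk has been read, so nothing is cut in half or reported twice.

```python
import io
import re

def iter_matches(f, pattern, chunk_size=1 << 20, max_match_len=4096):
    """Yield (file_offset, matched_bytes) from a binary file object, chunk by chunk.

    Assumptions (hypothetical, not from the thread): no match is longer than
    max_match_len bytes, and the pattern needs no context outside the match.
    """
    regex = re.compile(pattern)
    buf = b""
    base = 0  # file offset of buf[0]
    while True:
        chunk = f.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        # A match starting at or after `limit` could be cut off by the end of
        # the buffer, so it is deferred until more data has been read.
        limit = max(len(buf) - max_match_len, 0)
        keep = limit
        for m in regex.finditer(buf):
            if not at_eof and m.start() >= limit:
                break
            yield base + m.start(), m.group(0)
            keep = max(limit, m.end())
        if at_eof:
            return
        # Drop what has been fully searched; the remaining tail is the overlap.
        buf = buf[keep:]
        base += keep

if __name__ == "__main__":
    # Tiny in-memory demo; a real file would be opened with open(path, "rb").
    sample = io.BytesIO(b"noise\nabc>Time Elapsed 12 ms\nmore noise\n")
    for offset, text in iter_matches(sample, rb"\w+>Time Elapsed .*?\n", chunk_size=8, max_match_len=64):
        print(offset, text)
```

Memory use stays at roughly chunk_size + max_match_len bytes regardless of file size. The price is the upper bound on match length, which is exactly the limitation that a memory-mapped file avoids.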

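Source

2012-07-26 17:06:45 Julian

Thanks. I am searching for a pattern in a stream, without regard to line boundaries. – Alex 2012-07-27 08:52:18

To elaborate on Julian's solution: if you want to use multi-line regexes, you can implement chunking by storing and concatenating consecutive lines, like this: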

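from collections import deque

# Sliding window over the last N lines (f, N and parse are defined elsewhere)
list_prev_lines = deque(maxlen=N)
for line in f:
    list_prev_lines.append(line)  # the oldest line drops out once the window is full
    if len(list_prev_lines) == N:
        parse("".join(list_prev_lines))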

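This keeps a running list of the previous N lines, including the current one, and parses each multi-line group as a single string.

Source

2012-07-26 17:15:48 CosmicComputer

Yes, but I don't know how many lines are needed (in the general case). In practice this amounts to reading the whole file into memory, whereas the general solution is to use a memory-mapped file (easy to use and efficient). – Alex 2012-07-27 08:55:11

Everything works now (the interfaces of Python 3.2.3 and Python 2.7 differ somewhat). To get a working solution in Python 3.2.3, the search pattern just needs to be prefixed with b.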

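import re
import mmap
import pprint

def ParseFile(fileName):
    # Binary mode: the mapping exposes bytes, so the pattern must be bytes too
    with open(fileName, "rb") as f:
        print("File opened successfully")
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            print("File mapped successfully")
            # finditer walks the mapping and produces one match at a time
            for item in re.finditer(b"\\w+>Time Elapsed .*?\n", m):
                pprint.pprint(item.group(0))

if __name__ == "__main__":
    ParseFile("testre")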

Source

2012-07-27 16:44:15 Alex

This is neat, because it allows multi-line regexes to be used. – Rotareti 2017-07-26 11:16:05
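As a rough illustration of that point, the sketch below runs a pattern that spans several lines over a memory-mapped file. The helper name iter_multiline_matches and the BEGIN/END markers are invented for this example and do not come from the thread. It assumes a non-empty file, since mmap refuses to map an empty one.

```python
import mmap
import re

def iter_multiline_matches(path, pattern, flags=0):
    """Yield every match of a bytes pattern found in a memory-mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for match in re.finditer(pattern, m, flags):
                # group(0) copies the matched bytes out of the mapping
                yield match.group(0)

if __name__ == "__main__":
    # Hypothetical log layout: blocks delimited by BEGIN and END lines.
    # re.DOTALL lets "." match newlines, so one match can cover many lines.
    for block in iter_multiline_matches("testre", rb"BEGIN\n.*?END\n", re.DOTALL):
        print(block)
```

Because the regex engine sees the mapping as one contiguous bytes-like object, there are no chunk boundaries to worry about, and the operating system pages the file in on demand.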
