Q

How do I apply a regular expression to the contents of a file?

python
regex

2011-02-07 23 views 2 likes

2

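I want to apply a regular expression to the contents of a file without loading the whole file into a string. RegexObject takes a string or a buffer as its first argument. Is there a way to convert a file into a buffer?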

2011-02-07 Candy Chiu

+0

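Are you trying to apply the regex to the whole file, i.e. match the entire file against your regex, or are you trying to match the file line by line or in chunks of some other size? – 2011-02-07 19:29:05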

A

Answers

2

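Quote:

Buffer objects are not directly supported by Python syntax, but can be created by calling the built-in function buffer().

And some other interesting parts:

buffer(object[, offset[, size]])

The object argument must be an object that supports the buffer call interface (such as strings, arrays, and buffers).

File objects do not implement the buffer interface, so you would have to convert the file's contents either to a string (f.read()) or to an array (use mmap for that).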
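
As a rough illustration of the point above: in Python 3 (an assumption, since the quote describes Python 2) the buffer() built-in no longer exists, but a compiled bytes pattern still accepts objects that expose the buffer protocol. A minimal sketch, where `find_number` and the sample data are made up for the example:

```python
import re

# A bytes pattern, so it can search bytes-like objects (Python 3).
NUMBER = re.compile(rb"sentence (\d\d)")

def find_number(buf):
    """Return the two-digit number found in buf, or None if there is none."""
    match = NUMBER.search(buf)
    return match.group(1) if match else None

data = b"This is a test sentence 07\n"

# bytes, bytearray and memoryview all expose the buffer protocol,
# so each of them can be handed to the compiled pattern directly.
results = [find_number(obj) for obj in (data, bytearray(data), memoryview(data))]
print(results)
```

A plain file object passed to `find_number` would raise a TypeError, which is why the contents have to go through f.read() or mmap first.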

2011-02-07 19:27:52

4

Yes! Try mmap:

You can use the re module to search through a memory-mapped file.
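
A minimal sketch of that idea, assuming Python 3 and a bytes pattern (a memory-mapped file exposes bytes, not text). The helper name `find_in_file` and the temporary file contents are hypothetical:

```python
import mmap
import os
import re
import tempfile

def find_in_file(path, pattern):
    """Return (offset, matched bytes) pairs for a bytes pattern, using mmap."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Build the list inside the with block, while the mapping is open.
            return [(m.start(), m.group(0)) for m in re.finditer(pattern, mm)]

# Demo on a small temporary file (made-up content).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"This is a test sentence 01\nThis is a flop sentence 02\n")

hits = find_in_file(tmp.name, rb"a (\w+) sentence")
os.remove(tmp.name)
print(hits)
```

The operating system pages the file in on demand, so the whole file is never copied into a Python string, and the pattern can still match across line breaks.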

2011-02-07 19:23:31

+1

Wow, imagine what backtracking would do in that situation. – sln 2011-02-07 19:49:55

1

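Read the file in one line at a time and apply the regex to that line. The re module seems to be geared toward handling strings. The Python documentation at http://docs.python.org/library/re.html has more details, but I could not find anything there about buffers.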
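
A minimal sketch of the line-by-line approach, assuming Python 3. The generator name `grep_lines` and the sample text are made up for illustration:

```python
import io
import re

def grep_lines(lines, pattern):
    """Yield (line number, match) for every line the pattern matches."""
    regex = re.compile(pattern)
    # Iterating over a file object yields one line at a time,
    # so only the current line is held in memory.
    for number, line in enumerate(lines, start=1):
        match = regex.search(line)
        if match:
            yield number, match

# io.StringIO stands in for an open text file here.
sample = io.StringIO(
    "This is a test sentence 01\n"
    "no match here\n"
    "This is a pork sentence 03\n"
)
found = [(n, m.group(1)) for n, m in grep_lines(sample, r"a (\w+) sentence")]
print(found)
```

With a real file, pass the object returned by open() instead of the StringIO. Because each line is searched on its own, a pattern that spans a line break cannot match this way.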

2011-02-07 19:25:55 Bassdread

+0

The only problem is if the regex matches across lines (`/foo\nbar/`)... – ircmaxell 2011-02-07 20:00:26

0

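Do the buffering yourself. If the regex matches part of a chunk, remove that part from the chunk, continue with the unused portion, read the next chunk, and repeat.

If the regex is designed around a specific theoretical maximum match length, then when nothing matches and the buffer has grown to that maximum, clear the buffer and read in the next chunk. In general, regular expressions are not meant to process very large blocks of data. The more complex the regex, the more backtracking it does.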
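
One possible reading of this, sketched in Python 3. It assumes the caller knows an upper bound on the match length (`max_match_len`) and that the pattern does not rely on anchors or lookaround at chunk edges. Instead of clearing the buffer completely, it keeps the last `max_match_len - 1` characters so a match straddling two chunks is not lost. The name `iter_matches` and all parameters are hypothetical:

```python
import io
import re

def iter_matches(stream, pattern, max_match_len, chunk_size=65536):
    """Yield (absolute offset, matched text) while reading stream in chunks."""
    regex = re.compile(pattern)
    buffer = ""
    base = 0          # absolute offset of buffer[0] within the stream
    at_eof = False
    while not at_eof:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buffer += chunk
        # A match starting at or before last_safe fits entirely in the buffer,
        # so more data could not change it. At EOF everything is safe.
        last_safe = len(buffer) if at_eof else len(buffer) - max_match_len
        resume = 0
        for m in regex.finditer(buffer):
            if m.start() > last_safe:
                break  # might be incomplete; retry once more data arrives
            yield base + m.start(), m.group(0)
            resume = m.end()
        # Drop the consumed part and keep only the unexamined tail.
        keep_from = max(resume, last_safe + 1, 0)
        buffer = buffer[keep_from:]
        base += keep_from

# Demo: a tiny chunk size forces the match to straddle chunk boundaries.
text = "xxfoo\nbarxx foo\nbar"
matches = list(iter_matches(io.StringIO(text), r"foo\nbar", 7, chunk_size=4))
print(matches)
```

The buffer never grows much beyond `chunk_size + max_match_len` characters, which is the property the answer is after. A pattern with no known upper bound cannot be handled this way.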

2011-02-07 19:56:41 sln

0

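The code below demonstrates:

opening a file
seeking within a file
reading only part of the file
using a regex to match a pattern

Assumption: all the sentences are the same length.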

# import random for randomly choosing in a list 
import random 
# import re for regular expression matching 
import re 

# create (or truncate) the file and open it for reading and writing;
# "r+" would fail if TEST did not already exist
file = open("TEST", "w+")

# some strings to put in the sentence 
typesOfSentences = ["test", "flop", "bork", "flat", "pork"] 
# number of types of sentences 
numTypes = len(typesOfSentences) 

# for i values 0 to 99 
for i in range(100): 
    # Create a random sentence for example 
    # "This is a test sentence 01" 
    sentence = "This is a %s sentence %02d\n" % (random.choice(typesOfSentences), i) 
    # write the sentence to the file 
    file.write(sentence) 

# Go back to beginning of file 
file.seek(0) 

# print out the whole file
for line in file:
    # strip the trailing newline so the output is not double-spaced
    print(line.rstrip("\n"))

# Determine the length of the sentence 
length = len(sentence) 

# go to 20th sentence from the beginning 
file.seek(length * 20) 

# create a regex matching the type and the number at the end 
pathPattern = re.compile(r"This is a (.*?) sentence (\d\d)")

# print the next ten types and numbers 
for i in range(10): 
    # read the next line 
    line = file.readline() 
    # match the regex 
    match = pathPattern.match(line) 
    # if there was a match 
    if match: 
        # NOTE: match.group(0) is always the entire sentence
        # Print the type of sentence it was and its number
        print("Sentence %02d is of type %s" % (int(match.group(2)), match.group(1)))

# close the file when done
file.close()

2011-02-07 20:08:15 manifest
