Parsing items from a text file

I have a text file that contains data in {[]} tags. What would be the suggested way to parse that data so that I can use the data inside the tags?

An example text file would look like this:

"this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it."

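I would like to end up with 'really', 'way', 'get', 'from' in a list. I guess I could use split to do it, but it seems like there might be a better way. I have seen a lot of parsing libraries; is there one that would suit what I want to do?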

2010-06-14 chris

I would use a regular expression. This answer assumes that none of the tag characters {}[] appear inside other tag characters.

import re
text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'

# the non-greedy (.*?) captures the text between each {[ and the nearest ]}
for s in re.findall(r'\{\[(.*?)\]\}', text):
    print(s)

The same pattern in Python's verbose regular expression mode:

re.findall(r'''
    \{      # opening curly brace
    \[      # followed by an opening square bracket
    (       # capture the next pattern
    .*?     # shortest possible sequence of anything
    )       # end of capture
    \]      # followed by a closing square bracket
    \}      # followed by a closing curly brace
    ''', text, re.VERBOSE)
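Since the question is about a text file rather than a string literal, here is a minimal sketch of applying the same pattern to a file's contents. It assumes Python 3 and a file small enough to read in one go; the helper name `tags_from_file` and the temporary demo file are hypothetical, not part of the original answer. Note that `.` does not match newlines by default, so a tag that spans several lines would be skipped.

```python
import os
import re
import tempfile

# Same pattern as above, compiled once so it can be reused.
TAG_PATTERN = re.compile(r'\{\[(.*?)\]\}')

def tags_from_file(path):
    """Return the contents of every {[...]} tag in the file at `path`."""
    with open(path, encoding='utf-8') as handle:
        return TAG_PATTERN.findall(handle.read())

# Demo with a throwaway file; a real script would pass its own path.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('this is a bunch of text that is not {[really]} useful in any '
              '{[way]}. I need to {[get]} some items {[from]} it.')
    demo_path = tmp.name

items = tags_from_file(demo_path)
os.remove(demo_path)
print(items)
```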


2010-06-14 19:11:49

This is a job for regular expressions:

>>> import re 
>>> text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.' 
>>> re.findall(r'\{\[(\w+)\]\}', text) 
['really', 'way', 'get', 'from']


2010-06-14 19:12:48

Wow, that was fast.. and perfect. Thanks! – chris 2010-06-14 19:15:48

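@chris: Careful: it only captures alphanumeric characters between the delimiters. If your data contains other kinds of characters, this will not pick them up. – 2010-06-14 19:22:25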

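To elaborate on Brian's comment, some specific cases: hyphenated words, {[anti-war]}; compound words with whitespace, {[New England]}; names of places or people that use punctuation and whitespace, {[Boston, Massachusetts]}, {[George W. Bush]}. – tgray 2010-06-14 20:59:27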
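The cases raised in the comments can be checked directly. This is a small sketch, not part of the original thread, that runs both patterns from the answers over tags containing hyphens, spaces, and punctuation; the sample string is made up for illustration.

```python
import re

text = ('{[anti-war]} {[New England]} '
        '{[Boston, Massachusetts]} {[George W. Bush]} {[plain]}')

# \w+ only matches letters, digits and underscores, so tags containing
# hyphens, spaces or punctuation are skipped entirely.
word_only = re.findall(r'\{\[(\w+)\]\}', text)

# A non-greedy .*? accepts anything (except newlines) up to the next ]}.
anything = re.findall(r'\{\[(.*?)\]\}', text)

print(word_only)
print(anything)
```

If the tagged data may contain anything other than single words, the `.*?` pattern is the safer choice of the two.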

Slower, bigger, and without regular expressions

The old-school way :P

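def f(s):
    result = []
    stack = []  # currently open '{' and '[' characters
    tmp = ''
    for c in s:
        if c in '{[':
            stack.append(c)
        elif c in ']}':
            stack.pop()
            if c == ']':
                # closing bracket: the collected text is one item
                result.append(tmp)
                tmp = ''
        elif stack and stack[-1] == '[':
            # inside the innermost [...]: collect the character
            tmp += c
    return result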

>>> s 
'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.' 
>>> f(s) 
['really', 'way', 'get', 'from']


2010-06-15 08:18:07 remosu

Another way

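def between_strings(source, start='{[', end=']}'):
    words = []
    while True:
        start_index = source.find(start)
        if start_index == -1:
            break
        # look for the closing delimiter only after the opening one
        end_index = source.find(end, start_index + len(start))
        if end_index == -1:
            break  # no closing delimiter: stop rather than collect garbage
        words.append(source[start_index + len(start):end_index])
        source = source[end_index + len(end):]
    return words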


text = "this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it." 
assert between_strings(text) == ['really', 'way', 'get', 'from']


2010-06-22 03:39:31 Henry
