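How to extract parts of a string from a text file using prefixes

I have a text file in which some regions contain the following string. How can I extract parts of that string using prefixes?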

20170818_141903 Test ! Vdd 3.000000; P: 20.000000;T 20.282000;Part: 0; Baud Rate: 9620.009620; Message: MMS111111110001110100000000000100100000000000000000000000000100010000000000000000000001000000000010000000000001000000100000000010000011000000000000000000000000000000000000000000000000000000000000000000000000000000000000011001001001110001010001000000000111011011001010110000000000000010000001101100000000000000000000011011111010000100111101000000000111111110000111110010110000000010001001101110000101000000000000110010010000000000000000000000000000000000001000000000000000001000000000010000001000000000000000000000000000000000000000000100010000000000000101010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010100101111111010111000000110100000000101000110000100010101010011010000000000000100010001100000000110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000SS

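Unfortunately, it is not comma- or tab-separated; each line is just one big string.

I have read in the whole file and am trying to extract everything that is binary data.

That means I want everything between the following markers: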

MMS ...... SS

I would also like to extract the values that follow, for example, "P:" or "Vdd" in these regions:

Vdd 3.000000; P: 20.000000...........................etc

This is what I have so far:

import re

match = re.search(r'P: (\w+)', LONG_STRING)
if match:
    print(match.group(1))

But this does not extract the complete floating-point number; it drops everything from the decimal point onward.
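For reference, `\w` matches only letters, digits, and the underscore, so `\w+` stops at the decimal point. A minimal sketch, using a shortened stand-in for the real line (the `sample` string is made up for illustration), shows the difference once the character class also allows a dot:

```python
import re

# Shortened, made-up stand-in for one line of the file.
sample = "Test ! Vdd 3.000000; P: 20.000000;T 20.282000;"

# \w+ stops at the first character that is not a letter, digit, or underscore.
truncated = re.search(r'P: (\w+)', sample).group(1)

# [0-9.]+ also accepts the decimal point, so the whole number is captured.
full = re.search(r'P: ([0-9.]+)', sample).group(1)

print(truncated)  # 20
print(full)       # 20.000000
```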

Source

2017-08-23 cc6g11

Answer v2.0. Overall, this code is quite rigid and not the clearest, but right now I cannot offer a better solution for the example you provided.

>>> import re 

>>> that_long_row = "20170818_141903 Test ! Vdd 3.000$000; P: 20.000000;T 20.282000;Part: 0; Baud Rate: 9620.009620; Message: MMS11111111000111010000000000010010000000000000000000000000010001000000000000000000000100000000001000000000000100000010000000001000001100000000000000000000000000000000000000000000000000000000000000000000000000000000000001100100100111000101000100000000011101101100101011000000000000001000000110110000000000000000000001101111101000010011110100000000011111111000011111001011000000001000100110111000010100000000000011001001000000000000000000000000000000000000100000000000000000100000000001000000100000000000000000000000000000000000000000010001000000000000010101000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001010010111111101011100000011010000000010100011000010001010101001101000000000000010001000110000000011000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000SS " 

>>> regex = (r'^'      # start of the string 
     r'.+'       # skip any characters 
     r'Vdd '      # until "Vdd " is reached 
     r'(?P<Vdd>[0-9\.]+)'   # capture a continuous run of digits and dots following that word as group "Vdd" 
     r'.+'       # again, skip some more characters 
     r'P: '       # find the "P: " prefix 
     r'(?P<P>[0-9\.]+)'    # capture a continuous run of digits and dots as group "P" 
     r'.+'       # the same goes for the binary "Message" between the "MMS" and "SS" markers 
     r'MMS' 
     r'(?P<Message>[0-1]+)'   # except that it only matches 0 and 1 
     r'SS' 
     r'.*'       # as @Evan mentioned, this consumes any trailing characters (such as the final space) 
     r'$'       # end of the string 
     ) 

# the same but in a compact form: 
>>> regex = r'^.+Vdd (?P<Vdd>[0-9\.]+).+P: (?P<P>[0-9\.]+).+MMS(?P<Message>[0-1]+)SS.*$' 

>>> match = re.match(regex, that_long_row) 

# the named groups are available as an ordinary dict via groupdict(), 
# so you can access any matched group value by its name 

>>> match.groupdict() 
{'Vdd': '3.000', 'P': '20.000000', 'Message': ...
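Because the single pattern above fails as a whole if any one field is missing, a less rigid variant is to search for each field separately. The following is only a sketch under assumptions: the helper name `extract_fields` and the shortened `sample` line are hypothetical, and the field layout is taken from the example line in the question.

```python
import re

# One pattern per field; each is searched independently of the others.
FIELD_PATTERNS = {
    'Vdd': re.compile(r'Vdd ([0-9.]+)'),
    'P': re.compile(r'P: ([0-9.]+)'),
    'Message': re.compile(r'MMS([01]+)SS'),
}

def extract_fields(line):
    """Return a dict of the fields found in line; missing fields map to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        found = pattern.search(line)
        result[name] = found.group(1) if found else None
    return result

# Shortened, made-up stand-in for one line of the file.
sample = ("20170818_141903 Test ! Vdd 3.000000; P: 20.000000;"
          "T 20.282000; Message: MMS110100SS ")
fields = extract_fields(sample)
print(fields['Vdd'], fields['P'], fields['Message'])
```

With this layout a line that lacks one field still yields the others, at the cost of no longer checking that the fields appear in the expected order.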

Next, if you want to parse the file this way, I would create a simple class that handles all the data, type conversion, validation, and so on:

class Message: 
    def __init__(self, Vdd, P, Message): 
        self.vdd = float(Vdd) 
        self.p = float(P) 
        self.text = Message 

data = [] 

with open('yourfile', 'r') as f: 
    for line in f: 
        match = re.match(regex, line) 
        try: 
            data.append(Message(**match.groupdict())) 
        except (AttributeError, ValueError): 
            # AttributeError: the line did not match, so match is None 
            # ValueError: float() could not convert a captured value 
            data.append('CORRUPTED')

And so on.

Source

2017-08-23 12:54:30

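The string he gave has a space at the end, so if you want to put $ at the end, you may want to account for that in the regex. Also, given that you took the time to write that awesome regex, a nice list comprehension to collect all of these might be useful. I didn't want to post another answer and have him give me the credit when you did all the work. – Evan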
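The list comprehension suggested in the comment above could look roughly like the sketch below. It uses a pattern modeled on the compact one in the answer, ending in `.*` so that lines with or without trailing characters match, and a small in-memory list `lines` of made-up, shortened rows in place of the real file.

```python
import re

regex = re.compile(
    r'^.+Vdd (?P<Vdd>[0-9.]+).+P: (?P<P>[0-9.]+).+MMS(?P<Message>[01]+)SS.*$'
)

# Made-up, shortened rows standing in for the lines of the real file.
lines = [
    "20170818_141903 Test ! Vdd 3.000000; P: 20.000000;T 20.282000; Message: MMS1101SS \n",
    "a line that does not follow the format\n",
    "20170818_141904 Test ! Vdd 3.300000; P: 21.500000;T 20.300000; Message: MMS0010SS\n",
]

# Keep the group dict of every line that matches; skip the others.
matches = (regex.match(line) for line in lines)
records = [m.groupdict() for m in matches if m is not None]

for record in records:
    print(record)
```

To read from a file instead, the `lines` list could be replaced by the open file object, since iterating over a file yields its lines.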

How can I find out what all these regex parameters mean? They look scary. – cc6g11

@cc6g11, they sure do! I tried to make my answer clearer. Hope this helps! –
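One way to make such a pattern less intimidating is Python's `re.VERBOSE` flag, which allows whitespace and `#` comments inside the pattern itself. The sketch below restates the answer's pattern in that style; because whitespace is ignored in verbose mode, literal spaces are written as `[ ]`. The shortened `sample` row is made up for illustration.

```python
import re

regex = re.compile(r'''
    ^ .+                    # anything from the start of the line ...
    Vdd[ ]                  # ... up to the literal text "Vdd "
    (?P<Vdd> [0-9.]+ )      # digits and dots, stored in the group "Vdd"
    .+ P:[ ]                # skip ahead to the literal text "P: "
    (?P<P> [0-9.]+ )        # digits and dots, stored in the group "P"
    .+ MMS                  # skip ahead to the "MMS" marker
    (?P<Message> [01]+ )    # the run of 0s and 1s, stored in "Message"
    SS                      # the closing "SS" marker
    .* $                    # optional trailing characters, then end of line
''', re.VERBOSE)

# Shortened, made-up stand-in for one line of the file.
sample = ("20170818_141903 Test ! Vdd 3.000000; P: 20.000000;"
          "T 20.282000; Message: MMS1101SS ")
match = regex.match(sample)
print(match.groupdict())
```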
