Best method in Python


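I need to be pointed in the right direction on this problem: finding multiple strings in the log records I work with.

Say I'm reading the output of a C program like this: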

while True:
    ln = p.stdout.readline()
    if '' == ln:
        break
    # do stuff here with ln
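For context, `p` in the loop above is presumably a `subprocess.Popen` object. Here is a minimal sketch of such a loop, assuming Python 3 and using a small Python child process as a stand-in for the C program. The command line and the `read_lines` helper are hypothetical and not part of the question.

```python
import subprocess
import sys

# Hypothetical stand-in for the C program: a child process that prints two lines.
CHILD_CMD = [sys.executable, "-c",
             "print('TrnIq: Thread on CPU 37'); print('IP-Thread on CPU 33')"]

def read_lines(cmd):
    """Run cmd and return its stdout lines with trailing whitespace removed."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    lines = []
    # Iterating over p.stdout yields one line at a time until EOF,
    # which is what the readline() loop does by hand.
    for ln in p.stdout:
        lines.append(ln.rstrip())
    p.stdout.close()
    p.wait()
    return lines

lines = read_lines(CHILD_CMD)
```

Stripping the trailing newline up front keeps the later parsing code free of whitespace special cases.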

And my output looks like these lines:

TrnIq: Thread on CPU 37 
TrnIq: Thread on CPU 37 but will be moved to CPU 44 
IP-Thread on CPU 33 
FANOUT Thread on CPU 37 
Filter-Thread on CPU 38 but will be moved to CPU 51 
TRN TMR Test 2 Supervisor Thread on CPU 34 
HomographyWarp Traking Thread[0] on CPU 26

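I want to capture "TrnIq: Thread" and "37" as two separate variables, a string and a number, from the output "TrnIq: Thread on CPU 37".

For other lines it's fairly straightforward, e.g. capturing "HomographyWarp Traking Thread[0] on" and the number "26" from "HomographyWarp Traking Thread[0] on CPU 26".

The only real challenge is lines like "Filter-Thread on CPU 38 but will be moved to CPU 51", where I need "Filter-Thread" and the number "51", not the first number "38".

Python has so many different ways to do this that I don't even know where to start!

Thanks in advance!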

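Source

2012-07-20 NASA Intern

"Thanks" for the unacceptance ... – 2012-07-24 20:15:10

Regular expressions seem a bit overkill to me here. [Disclaimer: I'm not fond of regexes, but I do like Python, so I write in Python whenever I can rather than in regex. For reasons I've never fully understood, this is considered surprising.]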

s = """TrnIq: Thread on CPU 37 
TrnIq: Thread on CPU 37 but will be moved to CPU 44 
IP-Thread on CPU 33 
FANOUT Thread on CPU 37 
Filter-Thread on CPU 38 but will be moved to CPU 51 
TRN TMR Test 2 Supervisor Thread on CPU 34 
HomographyWarp Traking Thread[0] on CPU 26""" 

for line in s.splitlines():
    words = line.split()
    if not ("CPU" in words and "on" in words):
        continue  # skip uninteresting lines
    prefix_words = words[:words.index("on") + 1]
    prefix = ' '.join(prefix_words)
    cpu = int(words[-1])
    print((prefix, cpu))

This gives:

('TrnIq: Thread on', 37) 
('TrnIq: Thread on', 44) 
('IP-Thread on', 33) 
('FANOUT Thread on', 37) 
('Filter-Thread on', 51) 
('TRN TMR Test 2 Supervisor Thread on', 34) 
('HomographyWarp Traking Thread[0] on', 26)

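I don't think I need to translate any of this code into English.

Source

2012-07-20 17:08:10 DSM

Yes! I can understand it just by looking at it! Much simpler than regex! But which way do you think is more efficient? – 2012-07-20 17:22:30

What is efficiency? I think of it as the total time (writing, debugging, running, and modifying to handle cases I didn't anticipate) needed to get the output I need. Regexes will usually (not always) win on run-time performance. I think they only win on code in use cases I seldom find myself in, ones a bit too complicated for the basic tools but not complicated enough to justify a real parser, but opinions and situations vary. – DSM 2012-07-20 17:36:55

I keep getting an error that "on" was not found? – 2012-07-20 17:47:40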
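On the "on" error in the comment above: `list.index` raises `ValueError` when the word is absent, so the membership check has to run before `words.index("on")` on every line, including the thousands of unrelated ones. A minimal sketch of the split-based approach wrapped in a function follows; the name `parse_line` is hypothetical, and returning `None` for uninteresting lines is an assumption about what the caller wants.

```python
def parse_line(line):
    """Return (prefix, cpu) for a thread line, or None for any other line."""
    words = line.split()
    # Skip lines that lack either keyword, so words.index("on") cannot fail.
    if "on" not in words or "CPU" not in words:
        return None
    try:
        cpu = int(words[-1])
    except ValueError:
        # The last word is not a number, so this is not a thread line.
        return None
    prefix = " ".join(words[:words.index("on") + 1])
    return prefix, cpu

print(parse_line("Filter-Thread on CPU 38 but will be moved to CPU 51"))
print(parse_line("trn_filter: total_update_timer = 0.057454 seconds."))
```

The first call prints `('Filter-Thread on', 51)` and the second prints `None`.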

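The following should return the information as a tuple, assuming ln is a single line of your data (edited to include converting the CPU value to an int):

import re

match = re.match(r'(.*?)(?: on CPU.*)?(?: (?:on|to) CPU )(.*)', ln)
if match:
    proc, cpu = match.groups()
    cpu = int(cpu)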

Example:

>>> import re
>>> for ln in lines:
...     print(re.match(r'(.*?)(?: on CPU.*)?(?: (?:on|to) CPU )(.*)', ln).groups())
...
('TrnIq: Thread', '37') 
('TrnIq: Thread', '44') 
('IP-Thread', '33') 
('FANOUT Thread', '37') 
('Filter-Thread', '51') 
('TRN TMR Test 2 Supervisor Thread', '34') 
('HomographyWarp Traking Thread[0]', '26')

Explanation:

(.*?)               # capture zero or more characters at the start of the string,
                    # as few characters as possible
(?: on CPU.*)?      # optionally match ' on CPU' followed by any number of characters,
                    # do not capture this
(?: (?:on|to) CPU ) # match ' on CPU ' or ' to CPU ', but don't capture
(.*)                # capture the rest of the line
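The annotated pattern can also be kept in that commented form inside the code by using `re.VERBOSE`. This is only a sketch: the name `THREAD_RE` is hypothetical, and because verbose mode ignores unescaped whitespace, each literal space is written as `[ ]`.

```python
import re

# The pattern explained above, spelled out with re.VERBOSE.
# Literal spaces are written as "[ ]" because verbose mode skips plain spaces.
THREAD_RE = re.compile(r"""
    (.*?)                       # process name, as few characters as possible
    (?:[ ]on[ ]CPU.*)?          # optional first ' on CPU ...' part, not captured
    (?:[ ](?:on|to)[ ]CPU[ ])   # the last ' on CPU ' or ' to CPU ', not captured
    (.*)                        # the rest of the line: the CPU number
""", re.VERBOSE)

proc, cpu = THREAD_RE.match(
    "Filter-Thread on CPU 38 but will be moved to CPU 51").groups()
print(proc, int(cpu))
```

This prints `Filter-Thread 51`.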

Rubular: http://www.rubular.com/r/HqS9nGdmbM

Source

2012-07-20 16:53:03

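So this returns a tuple of 2 strings for each line? If I want to convert the strings to numbers, does Python have a strToNum function? – 2012-07-20 17:07:35

@NASAIntern - You can change the trailing '.*' to '\d+', so that you capture only the number instead of the rest of the line. Then you can use the built-in 'int()' function to convert the string to a number. – 2012-07-20 17:13:26

Why the downvote? – 2012-07-20 17:27:19

So, use the regular expression ^(.*?)\s+on\s+CPU.*(?<=\sCPU)\s+(\d+)\s*$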
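A minimal sketch of that suggestion, assuming the pattern from this answer with a literal space after `CPU`; the variable names are illustrative only.

```python
import re

# The final ".*" is replaced by "\d+", so only the digits are captured.
pattern = re.compile(r'(.*?)(?: on CPU.*)?(?: (?:on|to) CPU )(\d+)')

m = pattern.match("TrnIq: Thread on CPU 37 but will be moved to CPU 44")
proc, cpu_text = m.groups()
cpu = int(cpu_text)  # int() is the built-in string-to-integer conversion
print(proc, cpu)
```

This prints `TrnIq: Thread 44`, with `cpu` now an `int` rather than a string.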

import sys
import re

for ln in sys.stdin:
    m = re.match(r'^(.*?)\s+on\s+CPU.*(?<=\sCPU)\s+(\d+)\s*$', ln)
    if m is not None:
        print(m.groups())

See and test the example here.

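Source

2012-07-20 16:54:26

I guess I should clarify that these are only the output lines I'm interested in. There are thousands of other lines: _process_trn_ip_rslts: Switching to TRN_FILTER_PROPAGATING state. trn_filter: total_update_timer = 0.057454 seconds. trn_ib->trn_ib_state.ip_part = 1 DISPOSITION ACCEPT VALUE = 2 -a *_000004.pgm DISPOSITION ACCEPT sent to socket 6 TrnIb frame 4: Sending image id = (1003080551, 750074, framecnt 4) via socket IP = IB. open_sock/bind OK, sock = 79 create_cmd_sock/listen OK, erc = 0 trn_filter, instance 3: socket_from_cmd: 35_ – 2012-07-20 18:18:19

@NASAIntern - If you need to print the whole line, just print 'ln' instead of 'm.groups()'. – Ωmega 2012-07-20 18:39:44

In the cases you mention, you always want the last CPU number (the second one, when there are two), so it can be done with a single regular expression: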

# Test program
import re

lns = [
    "TrnIq: Thread on CPU 37",
    "TrnIq: Thread on CPU 37 but will be moved to CPU 44",
    "IP-Thread on CPU 33",
    "FANOUT Thread on CPU 37",
    "Filter-Thread on CPU 38 but will be moved to CPU 51",
    "TRN TMR Test 2 Supervisor Thread on CPU 34",
    "HomographyWarp Traking Thread[0] on CPU 26"
]

for ln in lns:
    test = re.search(r"(?P<process>.*Thread\S* on).* CPU (?P<cpu>\d+)$", ln)
    print("%s: '%s' on CPU #%s" % (ln, test.group('process'), test.group('cpu')))

In the general case, you may want to distinguish between situations (a thread on one CPU, a moved thread, a sub-thread...). To do that, you can use several re.search() calls one after another. For example:

# This search recognizes lines of the form "...Thread on CPU so-and-so", and
# also lines that add "...but will be moved to CPU some-other-cpu".
test = re.search(r"(?P<process>.*Thread) on CPU (?P<cpu1>\d+)(?: but will be moved to CPU (?P<cpu2>\d+))?", ln)
if test:
    # Here we capture Process Thread, both moved and non moved
    if test.group('cpu2'):
        # We have process, cpu1 and cpu2: moved thread
        pass
    else:
        # Nonmoved task, we have test.group('process') and cpu1.
        pass
else:
    # No match, try some other regexp. For example processes with a thread number
    # between square brackets: "Thread[0]", which are not captured by the regex above.
    test = re.search(r"(?P<process>.*) Thread\[(?P<thread>\d+)\] on CPU (?P<cpu1>\d+)", ln)
    if test:
        # Here we have "HomographyWarp Traking" in process, 0 in thread, 26 in cpu1
        pass

For best performance, the tests for the most frequent kinds of lines are best done first.

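Source

2012-07-20 16:55:26 LSerni

I'm not sure I understand the second part, "you can use several re.search() calls, for example:". What would I use? – 2012-07-20 18:07:09

I guess I should clarify that these are only the output lines I'm interested in. There are thousands of other lines: _process_trn_ip_rslts: Switching to TRN_FILTER_PROPAGATING state. trn_filter: total_update_timer = 0.057454 seconds. trn_ib->trn_ib_state.ip_part = 1 DISPOSITION ACCEPT VALUE = 2 -a *_000004.pgm DISPOSITION ACCEPT sent to socket 6 frame 4 TrnIb: Sending image id = (1003080551, 750074, framecnt 4) to IP via socket = 8 from IB. OK, sock = 79 create_cmd_sock/listen OK, erc = 0 trn_filter, instance 3: socket_from_cmd: 35_ – 2012-07-20 18:23:33

Well, then for every line you read, you'd use re.search() to check it against each of several regexes in turn. The first .search I provided recognizes lines like "...Thread...on...CPU". The other searches are more targeted and more efficient. If there is a very common kind of line that you're not interested in, you could also try to recognize it, so as to discard it and save the subsequent comparisons. – LSerni 2012-07-20 21:42:27

This can be done quite simply with two regex searches: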
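The advice in this comment can be sketched as a table of compiled patterns tried in order, with a cheap substring test that discards most unrelated lines first. The names `PATTERNS` and `classify` are hypothetical. In this sketch the "moved" pattern has to come before the "plain" one, because the plain pattern would also match a moved line and report the first CPU.

```python
import re

# Hypothetical pattern table, tried top to bottom; the first match wins.
PATTERNS = [
    ("moved", re.compile(
        r"(?P<process>.*Thread\S*) on CPU \d+ but will be moved to CPU (?P<cpu>\d+)")),
    ("plain", re.compile(r"(?P<process>.*Thread\S*) on CPU (?P<cpu>\d+)")),
]

def classify(ln):
    """Return (kind, process, cpu) for the first matching pattern, else None."""
    if " on CPU " not in ln:  # cheap test that discards most other lines
        return None
    for kind, pattern in PATTERNS:
        m = pattern.match(ln)
        if m:
            return kind, m.group("process"), int(m.group("cpu"))
    return None

print(classify("Filter-Thread on CPU 38 but will be moved to CPU 51"))
print(classify("HomographyWarp Traking Thread[0] on CPU 26"))
print(classify("open_sock/bind OK, sock = 79"))
```

The three calls print `('moved', 'Filter-Thread', 51)`, `('plain', 'HomographyWarp Traking Thread[0]', 26)` and `None`.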

import re

while True:
    ln = p.stdout.readline()
    if '' == ln:
        break

    start_match = re.search(r'^(.*?) on', ln)
    end_match = re.search(r'(\d+)\s*$', ln)
    # group(1) is the captured text; group(0) would also include " on"
    process = start_match and start_match.group(1)
    process_number = end_match and end_match.group(1)

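Source

2012-07-20 17:00:06 mVChr

Can you provide some details? I'm still getting used to this kind of Python syntax. – 2012-07-20 17:05:14

You can read the documentation for Python's regular expression module: http://docs.python.org/library/re.html – mVChr 2012-07-20 17:20:10

I get the regex, but not the "and" and the match.group() function. match.group() returns a string, right? – 2012-07-20 17:32:42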
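To illustrate the question above: `re.search` returns `None` when nothing matches, and `x and y` evaluates to `x` when `x` is falsy, so `m and m.group(1)` yields either `None` or the captured string without raising `AttributeError`. A minimal sketch follows; the helper name `first_group` is hypothetical.

```python
import re

def first_group(pattern, text):
    """Return group 1 of the match as a str, or None when nothing matches."""
    m = re.search(pattern, text)
    # When m is None, "m and ..." stops there and returns None,
    # so .group() is never called on a failed search.
    return m and m.group(1)

name = first_group(r'^(.*?) on', "IP-Thread on CPU 33")
number = first_group(r'(\d+)$', "IP-Thread on CPU 33")
missing = first_group(r'(\d+)$', "open_sock/bind OK")
print(name, number, missing)
```

This prints `IP-Thread 33 None`. Note that `number` is the string `'33'`; it still needs `int()` to become a number.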
