Python: find and delete lines in a file


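OK, I have a file that looks like this:

How can I find and delete only this part of the file?

I have tried a lot of things, but I could not get any of them to work.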

Source

2012-06-12 user1229391

What have you tried? – Ryan

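Read every line into an array of strings. The index is the line number - 1. Before storing a line, check whether it equals "id:2". If it does, stop storing lines until a line equals "id:3". After all lines have been read, clear the file and write the array back to it, from the first element to the last. This may not be the most efficient way, but it should work.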
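The steps above could be sketched roughly as follows. This is only an illustration: the function name `delete_section` and its parameters are made up for this example, and the marker strings "id:2" and "id:3" are taken from the description rather than from the asker's actual file.

```python
def delete_section(path, start_marker="id:2", stop_marker="id:3"):
    """Drop the lines from start_marker up to (not including) stop_marker."""
    # Read the whole file into a list of strings (index == line number - 1).
    with open(path) as f:
        lines = f.read().splitlines()

    kept = []
    skipping = False
    for line in lines:
        stripped = line.strip()
        if stripped == start_marker:
            skipping = True      # start dropping lines at the unwanted id
        elif stripped == stop_marker:
            skipping = False     # the next section begins, keep lines again
        if not skipping:
            kept.append(line)

    # Overwrite the file with the lines that were kept.
    with open(path, "w") as f:
        f.writelines(line + "\n" for line in kept)
```

Note that if the stop marker never appears, everything from the start marker to the end of the file is dropped.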

Source

2012-06-12 19:26:02 Manto

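Is the id what decides which sequence gets deleted, or is the decision based on a list of values?

You can build a dictionary where the id number is the key (converted to int because of the later sorting) and the lines that follow are collected into a list of strings that becomes the value for that key. Then you can delete the item with key 2, iterate through the items sorted by key, and output each id:key line followed by its list of strings.

Or you can build a list of lists, which preserves the order. If you want to preserve the ids of the sequences (i.e. no renumbering), you can also store the id:n line in the inner list.

This works for files of reasonable size. If the file is huge, you should copy the source to the destination and skip the unwanted sequences on the fly. That last approach is quite easy for small files, too.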
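A minimal sketch of the dictionary approach, assuming id lines look like `id:<number>` on a line of their own. The names `ID_LINE`, `parse_sections` and `format_sections` are hypothetical, and the sketch keeps the original ids instead of renumbering them.

```python
import re

# Assumed shape of an id line: "id:" followed by digits, nothing else.
ID_LINE = re.compile(r"^id:(\d+)\s*$")

def parse_sections(lines):
    """Map each numeric id to the list of lines that follow its id line."""
    sections = {}
    current = None
    for line in lines:
        m = ID_LINE.match(line)
        if m:
            current = int(m.group(1))   # int key so sorting is numeric
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        # Lines before the first id line are dropped in this sketch.
    return sections

def format_sections(sections):
    """Return the output lines for all sections, sorted by id."""
    out = []
    for key in sorted(sections):
        out.append("id:%d\n" % key)
        out.extend(sections[key])
    return out
```

Deleting a section is then `sections.pop(2, None)` between the two calls. If the same id occurs twice in the input, the later section replaces the earlier one here, which may or may not be what you want.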

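[Added after the clarification]

I suggest learning the following approach, which is useful in many cases like this one. It uses a so-called finite automaton, which performs actions bound to the transitions from one state to another (see Mealy machine).

The lines of text are the input elements here. The nodes that represent the state of the context are numbered. (My experience is that it is not worth giving them names; just let them be dumb numbers.) Only two states are used here, so status could easily be replaced by a boolean variable. However, if the case becomes more complicated, that leads to introducing another boolean variable, and the code becomes more error prone.

The code may look very complicated at first, but it is fairly easy to understand once you know that you can think about each if status == number separately. It is the context in which the processing described above is done. Do not try to optimize; leave the code that way. It can actually be decoded by a human later, and you can draw a picture similar to the Mealy machine example. If you do, it will be even easier to understand.

The wanted function is a bit generalized: the set of sections to be ignored can be passed as the first argument: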

import re

def filterSections(del_set, fname_in, fname_out):
    '''Filtering out the del_set sections from fname_in. Result in fname_out.'''

    # The regular expression was chosen for detecting and parsing the id-line.
    # It can be done differently, but I consider it just fine and efficient.
    rex_id = re.compile(r'^id:(\d+)\s*$')

    # Let's open the input and output file. The files will be closed
    # automatically.
    with open(fname_in) as fin, open(fname_out, 'w') as fout:
        status = 1     # initial status -- expecting the id line
        for line in fin:
            m = rex_id.match(line)  # get the match object if it is the id-line

            if status == 1:  # skipping the non-id lines
                if m:  # you can also write "if m is not None:"
                    num_id = int(m.group(1))  # get the numeric value of the id
                    if num_id in del_set:     # if this id should be deleted
                        status = 1            # or pass (to stay in this status)
                    else:
                        fout.write(line)      # copy this id-line
                        status = 2            # to copy the following non-id lines
                # else ignore this line (no code needed to ignore it :)

            elif status == 2:  # copy the non-id lines
                if m:  # the id-line found
                    num_id = int(m.group(1))  # get the numeric value of the id
                    if num_id in del_set:     # if this id should be deleted
                        status = 1            # switch to skipping the non-id lines
                    else:
                        fout.write(line)      # copy this id-line
                        status = 2            # or pass (to stay in this status)
                else:
                    fout.write(line)          # copy this non-id line


if __name__ == '__main__':
    filterSections({1, 3}, 'data.txt', 'output.txt')
    # or you can write the older set([1, 3]) for the first argument.

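Here the id lines are written to the output with their original numbers. If you want to renumber the sections, that can be done with a simple modification. Try the code and ask for details.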
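One possible shape of that modification is sketched below. It is not the answer author's code: the generator name `renumber_filtered` is hypothetical, it assumes the kept sections should be numbered consecutively from 1, and it uses a boolean flag for brevity instead of the numbered states.

```python
import re

def renumber_filtered(del_set, lines):
    """Yield lines with the del_set sections removed and the rest renumbered."""
    rex_id = re.compile(r'^id:(\d+)\s*$')
    copying = False   # True while inside a section that is being kept
    next_id = 1       # assumption: new ids start at 1 and increase by 1
    for line in lines:
        m = rex_id.match(line)
        if m:
            # An id line decides whether the following lines are copied.
            copying = int(m.group(1)) not in del_set
            if copying:
                yield 'id:%d\n' % next_id   # replace the original number
                next_id += 1
        elif copying:
            yield line                       # copy a non-id line
```

With two open files `fin` and `fout`, it could be used as `fout.writelines(renumber_filtered({2}, fin))`.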

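Be aware that finite automata have limited power. They cannot be used to parse the usual programming languages, because they cannot capture nested paired structures (such as parentheses).

P.S. 7000 lines is actually a tiny file from a computer's point of view ;)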

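Source

2012-06-12 20:36:18 pepr

The id is used to decide which sequence to delete. The file contains 7,000 lines, so it is large. Sorry for providing so little information. – user1229391

@user1229391: After a sequence is deleted, should the following sequences keep their original numbers, or should their ids be corrected (decreased)? – pepr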

This will not work if there are any interfering values in between....

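import fileinput
...
def deleteIdGroup(number):
    # number must be a string, e.g. "2"
    deleted = False
    # With inplace=True, print() writes into the file being read.
    for line in fileinput.input("testid.txt", inplace=True):
        line = line.strip('\n')
        # Substring test, so "id:2" also matches "id:20".
        if line.count("id:" + number):  # > 0
            deleted = True
        elif line.count("id:"):  # > 0
            deleted = False
        if not deleted:
            print(line)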

Edit:

Sorry, this will delete both id:2 and id:20... You can modify it so that the first if checks line == "id:" + number
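A sketch of that modification, with the substring test replaced by an exact comparison. The name `delete_id_group` and the `path` parameter are assumptions added for illustration; the second test uses `startswith` instead of `count`, which is a slightly stricter choice than the original.

```python
import fileinput

def delete_id_group(number, path="testid.txt"):
    """Remove the section whose id line is exactly "id:<number>", in place."""
    target = "id:" + str(number)
    deleted = False
    # inplace=True redirects print() into the file while it is being read.
    with fileinput.input(path, inplace=True) as stream:
        for line in stream:
            line = line.rstrip("\n")
            if line.strip() == target:      # exact match: "id:2" is not "id:20"
                deleted = True
            elif line.startswith("id:"):    # any other id line ends the deletion
                deleted = False
            if not deleted:
                print(line)
```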

Source

2012-06-13 18:18:31 corn3lius
