Split a text file into chunks, then search those sections for key phrases

I'm new to Python and already a fan of the language. I have a program that does the following:

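1. Open a text file whose sections of text are separated by lines of asterisks (***).
2. Use the split() function to split the file into sections at those asterisk lines. The asterisk lines are uniform throughout the file.
3. Iterate over these sections and do the following:
- I have a dictionary whose keys are the "key phrases". The value for each key is 0.
- The code needs to go through each section created by the split and check whether each key in the dictionary is found in that section. If a key phrase is found, the value for that key increases by 1.
- Once the code has gone through one section, counted how many of the keys appear in it, and updated the values accordingly, it should print the dictionary keys and the counts (values) for that section, set the values back to 0, and move on to the next section of text, starting again from #3.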
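As a minimal sketch of step 2: `str.split` with a fixed delimiter returns every chunk around that delimiter, including whatever text comes before the first one. The sample string and the names `DELIMITER` and `sample` are made up for illustration and are not taken from the real file.

```python
# Hypothetical separator line; the real file may use a different length.
DELIMITER = '*' * 10

# Hypothetical sample text with two separator lines.
sample = 'intro text\n' + DELIMITER + '\nfirst body\n' + DELIMITER + '\nsecond body'

sections = sample.split(DELIMITER)

# Two delimiters produce three chunks; sections[0] is the text before the first delimiter.
print(len(sections))
for index, section in enumerate(sections):
    print(index, repr(section.strip()))
```

Note that slicing with `sections[1:]` would drop the chunk at index 0, so that part of the file would never be counted.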

My code is:

from bs4 import BeautifulSoup
import re
import time
import random
import glob, os
import string


termz = {'does not exceed' : 0, 'shall not exceed' : 0, 'not exceeding' : 0,
    'do not exceed' : 0, 'not to exceed' : 0, 'shall at no time exceed' : 0,
    'shall not be less than' : 0, 'not less than' : 0}
with open('Q:/hello/place/textfile.txt', 'r') as f:
    sections = f.read().split('**************************************************')
    for p in sections[1:]:
        for eachKey in termz.keys():
            if eachKey in p:
                termz[eachKey] = termz.get(eachKey) + 1
                print(termz)


#print(len(sections)) #there are thirty sections  

     #should be if code encounters ***** then it resets the counters and just moves on.... 
     #so far only can count the phrases over the entire text file.... 

#GO BACK TO .SPLIT() 
# termz = dict.fromkeys(termz,0) #resets the counter

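What it spits out is usable, but it is not tracking the first section, the last section, or even the file as a whole. I don't know what it is doing.

The print statement at the end is in the wrong place. The termz = dict.fromkeys(termz,0) line is a way I found to reset the dictionary's values to 0, but it is commented out because I don't know how to work it in. Essentially, I'm struggling with Python control structures. If someone could point me in the right direction, that would be great.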
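For reference, a small sketch of what that reset line does on its own. `dict.fromkeys(keys, value)` builds a new dictionary with the same keys and every value set to `value`. The two-phrase `counts` dictionary is a shortened stand-in for `termz`, used only for illustration.

```python
# Shortened stand-in for termz, with some counts already accumulated.
counts = {'not to exceed': 2, 'not less than': 1}

# Iterating a dict yields its keys, so this builds a NEW dict with every value set to 0.
reset = dict.fromkeys(counts, 0)

print(reset)   # every value is 0
print(counts)  # the original dict is unchanged until the name is rebound
```

Because a new dictionary is returned, the result has to be assigned back (as in `termz = dict.fromkeys(termz, 0)`) for the reset to take effect.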

Source

2017-07-06 Th3SniperSpirit

Your code is very close. See the comments below:

termz = { 
    'does not exceed': 0, 
    'shall not exceed': 0, 
    'not exceeding': 0, 
    'do not exceed': 0, 
    'not to exceed': 0, 
    'shall at no time exceed': 0, 
    'shall not be less than': 0, 
    'not less than': 0 
} 

with open('Q:/hello/place/textfile.txt', 'r') as f:
    sections = f.read().split('**************************************************')

    # Skip the first section. (I assume this is on purpose?)
    for p in sections[1:]:
        for eachKey in termz:
            if eachKey in p:
                # This is simpler than termz[eachKey] = termz.get(eachKey) + 1
                termz[eachKey] += 1

        # Move this outside of the inner loop
        print(termz)

        # After printing the results for that section, reset the counts
        termz = dict.fromkeys(termz, 0)
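An alternative sketch, not part of the code above: build a fresh dictionary for each section instead of resetting a shared one, and use `str.count` if the number of occurrences matters rather than mere presence (the `in` test adds at most 1 per phrase per section). The function name `count_phrases`, the shortened `PHRASES` list, and the sample text are all made up for illustration.

```python
# Shortened, hypothetical phrase list standing in for the keys of termz.
PHRASES = ['shall not exceed', 'not to exceed', 'not less than']


def count_phrases(text, delimiter, phrases=PHRASES):
    """Return one {phrase: occurrences} dict per section of text."""
    results = []
    for section in text.split(delimiter):
        # A new dict per section means there is nothing to reset afterwards.
        results.append({phrase: section.count(phrase) for phrase in phrases})
    return results


# Hypothetical two-section input with a short separator.
sample = 'not to exceed 5; not to exceed 9\n*****\nnot less than 3'
for number, counts in enumerate(count_phrases(sample, '*****'), start=1):
    print(number, counts)
```

Unlike the loop above, this sketch does not skip the first section; slice the result (for example `count_phrases(...)[1:]`) if that section should be ignored.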

Edit

Sample input and output:

input = ''' 
Section 1: 

This section is ignored. 
does not exceed 
************************************************** 
Section 2: 

shall not exceed 
not to exceed 
************************************************** 
Section 3: 

not less than''' 

termz = { 
    'does not exceed': 0, 
    'shall not exceed': 0, 
    'not exceeding': 0, 
    'do not exceed': 0, 
    'not to exceed': 0, 
    'shall at no time exceed': 0, 
    'shall not be less than': 0, 
    'not less than': 0 
} 

sections = input.split('**************************************************') 

# Skip the first section. (I assume this is on purpose?) 
for p in sections[1:]:
    for eachKey in termz:
        if eachKey in p:
            # This is simpler than termz[eachKey] = termz.get(eachKey) + 1
            termz[eachKey] += 1

    # Move this outside of the inner loop
    print(termz)

    # After printing the results for that section, reset the counts
    termz = dict.fromkeys(termz, 0)

# OUTPUT: 
# {'not exceeding': 0, 'shall not exceed': 1, 'not less than': 0, 'shall not be less than': 0, 'shall at no time exceed': 0, 'not to exceed': 1, 'do not exceed': 0, 'does not exceed': 0} 
# {'not exceeding': 0, 'shall not exceed': 0, 'not less than': 1, 'shall not be less than': 0, 'shall at no time exceed': 0, 'not to exceed': 0, 'do not exceed': 0, 'does not exceed': 0}
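A full dictionary printed once per section can be hard to scan. One possible variation, sketched here under assumptions, labels each section with its number and lists only the phrases that were found. The `sections` list is hypothetical and stands in for the result of `split()`; `report` is an illustrative name.

```python
phrases = ['does not exceed', 'shall not exceed', 'not to exceed', 'not less than']

# Hypothetical sections, as if already produced by split().
sections = ['does not exceed', 'shall not exceed\nnot to exceed', 'not less than']

report = []
# start=1 numbers the sections the way a reader would count them.
for number, section in enumerate(sections, start=1):
    found = [phrase for phrase in phrases if phrase in section]
    report.append((number, found))
    print('Section', number, '->', found)
```

This version includes the first section; iterate over `sections[1:]` with `start=2` to keep the original behavior of skipping it.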

Source

2017-07-06 18:38:24 smarx

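Thanks @smarx. It actually outputs the same thing as before... it just prints the dictionary one at a time (which confused me for a while), and above all, the output looks fairly random... it doesn't include the first section, the last section, or anything in order. – Th3SniperSpirit

You may need to share your input. (Perhaps make a short dummy version of the file.) I really don't see how the output could be the same... we moved a `print` statement outside the loop. – smarx

See my edit... I included a sample input and the program's output. It seems to work fine, so I imagine your input is different. – smarx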
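If the real file's separator lines vary in length or carry trailing whitespace, an exact-string `split` would not match them all. One hedged way to check, assuming each separator is a line made up only of asterisks, is to split with a regular expression. The sample text and the threshold of three asterisks are assumptions for illustration.

```python
import re

# Hypothetical input whose separator lines differ in length and trailing spaces.
sample = 'first\n*****\nsecond\n********  \nthird'

# Match a whole line of 3 or more asterisks, allowing trailing spaces or tabs.
separator = re.compile(r'^\*{3,}[ \t]*$', flags=re.MULTILINE)

sections = [part.strip() for part in separator.split(sample)]
print(len(sections), sections)
```

Comparing `len(sections)` from this approach with the count from the exact-string split is a quick way to see whether the separators in the real file are as uniform as expected.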

if eachKey in p:
    termz[eachKey] += 1 # might do it
    print(termz)

Source

2017-07-06 18:37:52

Definitely a simpler version of that line – Th3SniperSpirit
