2016-01-02 31 views
6

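Print the first paragraph in Python

I have a book in a text file and I need to print the first paragraph of each section. I thought that if I found the text between \n\n and \n, I would have my answer. Here is my code, which doesn't work. Can you tell me where I'm going wrong?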

lines = [line.rstrip('\n') for line in open('G:\\aa.txt')] 

check = -1 
first = 0 
last = 0 

for i in range(len(lines)): 
    if lines[i] == "": 
      if lines[i+1]=="": 
       check = 1 
       first = i +2 
    if i+2< len(lines): 
     if lines[i+2] == "" and check == 1: 
      last = i+2 
while (first < last): 
    print(lines[first]) 
    first = first + 1 

I also tried some code I found on Stack Overflow, but it just prints an empty array.

f = open("G:\\aa.txt").readlines() 
flag=False 
for line in f: 
     if line.startswith('\n\n'): 
      flag=False 
     if flag: 
      print(line) 
     elif line.strip().endswith('\n'): 
      flag=True 

I have shared a sample section of the book below.

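THE LAY OF THE LAND

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.

II

WILD ANIMAL TEMPERAMENT & INDIVIDUALITY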

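What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.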

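The output should look like this:

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.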

+0

Can you add the actual input and the expected output? –

Answers

6

If you want to group the text into sections, you can use itertools.groupby with empty lines as the delimiters:

from itertools import groupby

with open("in.txt") as f:
    # Group consecutive lines by whether they are non-blank
    for k, sec in groupby(f, key=lambda x: bool(x.strip())):
        if k:
            print(list(sec))

With a bit more itertools foo, we can get the sections using the uppercase titles as delimiters:

from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: x.isupper())
    for k, sec in grps:
        # if we hit a title line
        if k:
            # pull all paragraphs
            v = next(grps)[1]
            # skip two empty lines after title
            next(v, ""), next(v, "")

            # take all lines up to next empty line/second paragraph
            print(list(takewhile(lambda x: bool(x.strip()), v)))

That would give you:

['There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.\n'] 
['What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.'] 

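Each section starts with an all-uppercase title, so once we hit one we know there are two empty lines, then the first paragraph, and the pattern repeats.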

To break it down using loops:

from itertools import groupby

def parse_sec(bk):
    with open(bk) as f:
        grps = groupby(f, key=lambda x: bool(x.isupper()))
        for k, sec in grps:
            if k:
                # sec holds the title line itself
                print("First paragraph from section titled :{}".format(next(sec).rstrip()))
                # the next group holds everything up to the following title
                v = next(grps)[1]
                # skip the two empty lines after the title
                next(v, ""), next(v, "")
                for line in v:
                    # stop at the empty line that ends the first paragraph
                    if not line.strip():
                        break
                    print(line)

For your text:

In [11]: cat -E in.txt 

THE LAY OF THE LAND$ 
$ 
$ 
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.$ 
$ 
Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.$ 
$ 
$ 
WILD ANIMAL TEMPERAMENT & INDIVIDUALITY$ 
$ 
$ 
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created. 

The dollar signs mark the line endings. Output:

In [12]: parse_sec("in.txt") 
First paragraph from section titled :THE LAY OF THE LAND 
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence. 

First paragraph from section titled :WILD ANIMAL TEMPERAMENT & INDIVIDUALITY 
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created. 
+0

This is cool, I can see each section using this code... but I only want to see their first paragraphs. Can I extract those? –

+0

@TuğcanDemir, what do you want to pull out of the input in your question? –

+0

I edited my question. –

0

Going through the code you found, line by line.

f = open("G:\\aa.txt").readlines() 
flag=False 
for line in f: 
     if line.startswith('\n\n'): 
      flag=True 
     if flag: 
      print(line) 
     elif line.strip().endswith('\n'): 
      flag=True 

It seems to never set the flag variable to true.
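To see why neither test can succeed, it helps to look at what readlines() returns: each element holds at most one trailing newline, so no line starts with '\n\n', and strip() removes the trailing newline before endswith('\n') is checked. The following is a minimal sketch that evaluates both conditions on an in-memory sample instead of the file; the sample text and the helper name flag_conditions are assumptions made for illustration only.

```python
import io

def flag_conditions(text):
    """Return, for every line, the two conditions tested in the loop."""
    # io.StringIO stands in for the opened file (assumption for the demo)
    lines = io.StringIO(text).readlines()
    return [(line.startswith('\n\n'), line.strip().endswith('\n'))
            for line in lines]

# Hypothetical sample: a title, two empty lines, then two paragraphs
sample = "TITLE\n\n\nFirst paragraph.\n\nSecond paragraph.\n"
results = flag_conditions(sample)
print(results)
# True only if some line satisfies either condition
print(any(a or b for a, b in results))
```

Both conditions come out false for every line of this sample, which matches the observation that the flag is never switched on by them.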

If you could share a sample of your book, it would be more helpful for everyone.

+0

I shared the same code that you shared, only with the flag set to true in the first code block. –

+0

When I set the first flag to true, it adds 2 more empty lines after every line. –

0

This should work, as long as there are no paragraphs in all caps:

with open('file.txt') as f:
    for line in f:
        line = line.strip()
        if line:
            for c in line:
                if c < 'A' or c > 'Z':  # check for non-uppercase chars
                    break
            else:  # the line is made of all caps (I, II, etc.), meaning a new section
                f.readline()  # discard the chapter header and empty lines
                f.readline()
                f.readline()
                print(f.readline().rstrip())  # print first paragraph

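If you also want the last paragraph, you can keep track of the last line seen that contains lowercase characters. Then, once you find an all-uppercase line (I, II, etc.), which indicates a new section, print that most recent line, since it will be the last paragraph of the previous section.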
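The last-paragraph idea can be sketched roughly as follows. This is only an illustration under stated assumptions: a new section starts at a line made solely of the letters A-Z (a Roman numeral), every paragraph sits on a single line, and the function name last_paragraphs and the sample lines are made up for the example.

```python
def last_paragraphs(lines):
    """Return the last paragraph of each section.

    Assumes a section starts at a line made only of the letters A-Z
    (for example I or II) and that each paragraph is a single line.
    """
    result = []
    last_seen = None  # most recent line containing a lowercase character
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if all('A' <= c <= 'Z' for c in line):
            # New section: the remembered line closes the previous one
            if last_seen is not None:
                result.append(last_seen)
                last_seen = None
        elif any(c.islower() for c in line):
            last_seen = line
    if last_seen is not None:
        result.append(last_seen)  # last paragraph of the final section
    return result

# Hypothetical sample input, not taken from the book
sample = [
    "I\n", "\n", "THE LAY OF THE LAND\n", "\n",
    "First paragraph of section one.\n", "\n",
    "Last paragraph of section one.\n", "\n",
    "II\n", "\n", "WILD ANIMAL TEMPERAMENT & INDIVIDUALITY\n", "\n",
    "Only paragraph of section two.\n",
]
print(last_paragraphs(sample))
```

Title lines such as THE LAY OF THE LAND contain spaces, so they fail the A-Z test and are simply skipped because they hold no lowercase characters.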

+0

It prints a lot of empty lines between two unrelated sentences... –

+0

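@TuğcanDemir I made some minor changes to remove the empty lines and make the code more readable. This code (and the previous version) works with the sample you provided above. Can you provide the sample section that gave you those results? – TisteAndii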

1

There's always regex....

import re 
with open("in.txt", "r") as fi: 
    data = fi.read() 
paras = re.findall(r""" 
        [IVXLCDM]+\n\n # Line of Roman numeral characters 
        [^a-z]+\n\n  # Line without lower case characters 
        (.*?)\n   # First paragraph line 
        """, data, re.VERBOSE) 
print("\n\n".join(paras))
+0

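This one's pattern keeps growing: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' [Now they have two problems](http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/)." '[IV]+', huh? – msw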

+0

How can I print the first paragraph instead of the first line? –

+0

So, I found my way using your code too... thank you so much :) –

0

TXR solution

 
$ txr firstpar.txr data 
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence. 
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created. 

Code in firstpar.txr

 
@(repeat) 
@num 

@title 

@firstpar 
@ (require (and (< (length num) 5) 
       [some title chr-isupper] 
       (not [some title chr-islower]))) 
@ (do (put-line firstpar)) 
@(end) 

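Basically, we search the input for matches of a three-element multi-line pattern that binds the num, title and firstpar variables. Now, this pattern can match in the wrong places, so some restrictive heuristics are added using a require assertion. The section number must be a short line, and the title line must contain some uppercase letters and no lowercase ones. This expression is written in TXR Lisp.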

If we get a match with this constraint, then we output the string captured in the firstpar variable.
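For readers who do not use TXR, the same heuristic can be approximated in Python. This is a rough analogue rather than an exact port of TXR's matching semantics: it assumes the number, title and first paragraph are each a single non-empty line separated by exactly one empty line, and the function name first_paragraphs and the sample text are invented for the example.

```python
def first_paragraphs(text):
    """Collect the line that follows each 'number, blank, title, blank' run."""
    lines = text.splitlines()
    found = []
    # Slide a five-line window: num, blank, title, blank, firstpar
    for i in range(len(lines) - 4):
        num, gap1, title, gap2, firstpar = lines[i:i + 5]
        if gap1 or gap2 or not num or not title or not firstpar:
            continue
        # Same restrictions as the require clause: a short number line,
        # and a title with some uppercase and no lowercase letters
        if (len(num) < 5
                and any(c.isupper() for c in title)
                and not any(c.islower() for c in title)):
            found.append(firstpar)
    return found

# Hypothetical sample input with numbered sections
sample = (
    "I\n\nTHE LAY OF THE LAND\n\nFirst paragraph.\n\nSecond paragraph.\n\n"
    "II\n\nWILD ANIMAL TEMPERAMENT & INDIVIDUALITY\n\nAnother first paragraph.\n"
)
print(first_paragraphs(sample))
```

The length check on the number line is what stops an ordinary paragraph followed by a Roman numeral from being mistaken for a section start.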