2016-10-22 40 views
-1

好的,所以我试图从文本文件中找到最长的链,其中一行的最后一个单词是下一个词的第一个单词(适用于诗歌)。我所需要的Python脚本运行良好,但仍需要很长时间。我不是编码专家,完全不知道优化。我是否需要更多选择? 如何缩短运行较长文本所需的时间?行最后一个词的最长链/下一个第一个词

#!/usr/bin/python 
# -*- coding: utf-8 -*- 
import re 
import sys 

# Opening the source text 
with open("/text.txt") as g: 
    all_lines = g.readlines() 

def last_word(particular_line): 
    if particular_line != "\n": 
     particular_line = re.sub(ur'^\W*|\W*$', "",particular_line) 
     if len(particular_line) > 1: 
      return particular_line.rsplit(None, 1)[-1].lower() 

def first_word(particular_line): 
    if particular_line != "\n": 
     particular_line = re.sub(ur'^\W*|\W*$', "",particular_line) 
     if len(particular_line) > 1: 
      return particular_line.split(None, 1)[0].lower() 

def chain(start, lines, depth): 
    remaining = list(lines) 
    del remaining[remaining.index(start)] 
    possibles = [x for x in remaining if (len(x.split()) > 2) and (first_word(x) == last_word(start))] 
    maxchain = [] 
    for c in possibles: 
     l = chain(c, remaining, depth) 
     sys.stdout.flush() 
     sys.stdout.write(str(depth) + " of " + str(len(all_lines)) + " \r") 
     sys.stdout.flush() 
     if len(l) > len(maxchain): 
      maxchain = l 
      depth = str(depth) + "." + str(len(maxchain)) 
    return [start] + maxchain 

#Start 
final_output = [] 

#Finding the longest chain 

for i in range (0, len(all_lines)): 
    x = chain(all_lines[i], all_lines, i) 
    if len(x) > 2: 
     final_output.append(x) 
final_output.sort(key = len) 

#Output on screen 
print "\n\n--------------------------------------------" 

if len(final_output) > 1: 
    print final_output[-1] 
else: 
    print "Nothing found" 
+1

你能提供这样的行的例子吗? –

回答

1
import itertools 
def matching_lines(line_pair): 
    return line_pair[0].split()[-1].lower() == line_pair[1].split()[0].lower() 

line_pairs = ((line,next_line) for line,next_line in itertools.izip(all_lines,all_lines[1:])) 
grouped_pairs = itertools.groupby(line_pairs,matching_lines) 
print max([len(list(y))+1 for x,y in grouped_pairs if x]) 

虽然我不确定它会更快(但我认为这将是因为它仅遍历一次,并大都采用建宏)

+1

不错的解决方案。在Python 3.5中,因为y是一个生成器,所以它需要是'len(list(y))'。总数也是1短。 –

+0

哈哈感谢:P ... –

0

是的,这个代码具有的复杂性$ O(N^2)$。这意味着如果你的文件有n行,那么你的代码执行的迭代次数为1 *(n-1)第一行,然后1 *(n-2)第二行等n个这样的元素。对于大n,这相当于$ n^2 $。其实,有一个代码在这一行

del remaining[remaining.index(start)] 

,你可能是指运行这个错误:

del remaining[:remaining.index(start)] 

(注意:在方括号“”),它扩展了运行时(现(n-1)+(n-1)+(n-1)= n *(n-1) -3)..)。
您可以优化代码,如下所示:从maxchainlen = 0开始,curchainlen = 0。现在,遍历行,每次比较当前行的第一个单词和上一行的最后一个单词。如果它们匹配,则将curchainlen加1。如果它们不匹配,则检查是否存在curchainlen,如果是,则分配maxchainlen = curchainlen,并将curchainlen初始化为0.在完成对行的迭代后,再次对maxchainlen进行检查。例如:

lw = last_word(lines[0]) 
curchainlen = 0 
maxchainlen = 0 
for l in lines[2:]: 
    if lw = first_word(l): 
     curchainlen = curchainlen + 1 
    else: 
     maxchainlen = max(maxchainlen, curchainlen) 
     curchainlen = 0 
maxchainlen = max(maxchainlen, curchainlen) 
print(maxchainlen) 
+0

新代码的复杂性为O(n)。所以对于长文件,它将大大提高性能。 – galra

0

我想尝试拆分这个工作分为两个阶段:第一找到链,然后对它们进行比较。这将简化代码很多。由于链是文件中所有行的一小部分,因此首先找到它们然后对它们进行排序会比试图在一个大的过程中处理整个事件更快。

如果您使用python yield关键字,该问题的第一部分更容易,该关键字与return类似,但不会结束函数。这使您可以一次一行地循环播放内容,并且无需在整个内存中始终保存整个内容。

下面是一次抓取一行文件的基本方法。它采用yield拔出链,因为它发现他们

def get_chains(*lines): 
    # these hold the last token and the 
    # members of this chain 
    previous = None 
    accum = [] 

    # walk through the lines, 
    # seeing if they can be added to the existing chain in `accum` 
    for each_line in lines: 
     # split the line into words, ignoring case & whitespace at the ends 
     pieces = each_line.lower().strip().split(" ") 
     if pieces[0] == previous: 
      # match? add to accum 
      accum.append(each_line) 
     else: 
      # no match? yield our chain 
      # if it is not empty 
      if accum: 
       yield accum 
       accum = [] 
     # update our idea of the last, and try the next line 
     previous = pieces[-1] 

    # at the end of the file we need to kick out anything 
    # still in the accumulator 
    if accum: 
     yield accum 

当你给这个函数线的字符串,它会yield出链,如果发现它们,然后继续。 呼叫这个功能可以捕获成功的链,并与他们做事。

一旦你有链子,很容易按长度排序并选择最长的。由于Python具有内置列表排序功能,只需收集一行行长 - >行对并对其进行排序即可。最长的一行将是最后一项:

def longest_chain(filename): 

    with open (filename, 'rt') as file_handle: 
     # if you loop over an open file, you'll get 
     # back the lines in the file one at a time 
     incoming_chains = get_chains(*file_handle) 

     # collect the results into a list, keyed by lengths 
     all_chains = [(len(chain), chain) for chain in incoming_chains] 
     if all_chains: 
      all_chains.sort() 
      length, lines = all_chains[-1] 
      # found the longest chain 
      return "\n".join(lines) 
     else: 
      # for some reason there are no chains of connected lines 
      return [] 
相关问题