2014-01-21 96 views
2
def ngram(n, k, document):
    f = open(document, 'r')
    for i, line in enumerate(f):
        words = line.split() + line.split()
        print(words)
    return {}

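I want to split each line of a file into n-grams of words. For example, for the line "I love the Python programming language" and n = 2, the groups are "I love", "love the", "the Python", "Python programming" and "programming language".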

I want to store them in a list and then compare them to see how many are the same.

+0

What does k do? – inspectorG4dget

+0

What should the output be if `n = 3`? – thefourtheye

+0

Take a look at the `pairwise` recipe in `itertools`: http://docs.python.org/2/library/itertools.html#recipes and the `collections.Counter` data structure: http://docs.python.org/2/library/collections.html#collections.Counter – IceArdor
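The suggestion in the comment above can be sketched roughly as follows. This is only a minimal sketch, assuming that "how many are the same" means counting repeated bigrams. `pairwise` follows the recipe in the itertools documentation, while `text` and `repeated` are made-up names for illustration.

```python
from collections import Counter
from itertools import tee


def pairwise(iterable):
    """itertools recipe: s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second copy by one item
    return zip(a, b)


# Hypothetical sample text, not taken from the question.
text = "I love the Python programming language and I love the name"
bigram_counts = Counter(pairwise(text.split()))

# Keep only the bigrams that occur more than once.
repeated = {gram: c for gram, c in bigram_counts.items() if c > 1}
print(repeated)
```

A `Counter` keyed by tuples avoids comparing every pair of n-grams by hand; each n-gram is hashed once and its count is looked up directly.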

Answers

1

Well, you can achieve this with a list comprehension:

>>> [s1 + " " + s2 for s1, s2 in zip(s.split(), s.split()[1:])] 
['I love', 'love the', 'the Python', 'Python programming', 'programming language'] 

You can also use the str.format function:

>>> ["{} {}".format(s1, s2) for s1, s2 in zip(s.split(), s.split()[1:])] 
['I love', 'love the', 'the Python', 'Python programming', 'programming language'] 

Final version of the function:

from itertools import tee, islice


def ngram(n, s):
    # make n copies of the word list and start the i-th copy at word i
    var = [islice(it, i, None) for i, it in enumerate(tee(s.split(), n))]
    # join with single spaces so there is no trailing space
    return [" ".join(itt) for itt in zip(*var)]

Demo:

>>> from splitting import ngram
>>> thing = 'I love the Python programming language'
>>> ngram(2, thing)
['I love', 'love the', 'the Python', 'Python programming', 'programming language']
>>> ngram(3, thing)
['I love the', 'love the Python', 'the Python programming', 'Python programming language']
>>> ngram(4, thing)
['I love the Python', 'love the Python programming', 'the Python programming language']
>>> ngram(1, thing)
['I', 'love', 'the', 'Python', 'programming', 'language']
+1

It's hard to parameterize n here – inspectorG4dget

+0

@inspectorG4dget Hmm, you're right. I'll fix it as soon as possible. –

+0

What you really want is a solution that uses `itertools.tee`. Take a look at my post – inspectorG4dget

3

It's not entirely clear what you want to return. Suppose a line says:

I love the Python programming language

and you want to process the file line by line.

from collections import deque

def linesplitter(line, n):
    prev = deque(maxlen=n)         # fixed-length window
    for word in line.split():      # iterate through each word
        prev.append(word)          # keep adding to the window
        if len(prev) == n:         # once there are n elements
            print(" ".join(prev))  # start printing
            # the oldest element is removed automatically

with open(document) as f:          # 'r' is implied; document is the file path
    for line in f:
        linesplitter(line, 2)      # or any other length!

Output:

I love 
love the 
the Python 
Python programming 
programming language 
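
If the n-grams need to be stored in a list rather than printed, as the question asks, the same sliding window can yield them instead. The following is a minimal sketch of that variation; the name `iter_ngrams` is an assumption made for illustration and is not part of the original answer.

```python
from collections import deque


def iter_ngrams(line, n):
    """Yield each run of n consecutive words in line as one string."""
    window = deque(maxlen=n)  # the oldest word drops out automatically
    for word in line.split():
        window.append(word)
        if len(window) == n:
            yield " ".join(window)


# Collect the bigrams of one line into a list.
grams = list(iter_ngrams("I love the Python programming language", 2))
print(grams)
```

A line with fewer than n words simply yields nothing, so short lines need no special handling.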
2

You can adapt one of the itertools recipes:

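import itertools

def ngrams(N, k, filepath):
    # k is unused; it is kept only to match the question's signature
    with open(filepath) as infile:
        words = (word for line in infile for word in line.split())
        ts = itertools.tee(words, N)
        # advance the i-th copy by i words
        for i in range(1, len(ts)):
            for t in ts[i:]:
                next(t, None)
        # build the list before the file is closed
        return list(zip(*ts))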

With a test file that looks like this:

I love 
the 
python programming language 

Here's the output:

In [21]: ngrams(2, '', 'blah') 
Out[21]: 
[('I', 'love'), 
('love', 'the'), 
('the', 'python'), 
('python', 'programming'), 
('programming', 'language')] 

In [22]: ngrams(3, '', 'blah') 
Out[22]: 
[('I', 'love', 'the'), 
('love', 'the', 'python'), 
('the', 'python', 'programming'), 
('python', 'programming', 'language')] 
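
Note that this output contains n-grams that span line breaks: `('love', 'the')` joins the first and second lines of the test file. A per-line approach would not produce those. The difference can be sketched as below, using a list of lines in place of an open file. The helper names `_windows`, `ngrams_stream` and `ngrams_per_line` are assumptions made for illustration only.

```python
from itertools import islice, tee


def _windows(words, n):
    """Return a list of n-tuples of consecutive items from words."""
    copies = tee(words, n)
    # Shift the i-th copy forward by i items, then zip the copies together.
    return list(zip(*(islice(c, i, None) for i, c in enumerate(copies))))


def ngrams_stream(lines, n):
    """Treat the whole input as one continuous stream of words."""
    return _windows((w for line in lines for w in line.split()), n)


def ngrams_per_line(lines, n):
    """Restart the window on every line, so no n-gram crosses a line break."""
    return [gram for line in lines for gram in _windows(line.split(), n)]


# Same content as the test file above, as a list of lines.
sample = ["I love\n", "the\n", "python programming language\n"]
print(ngrams_stream(sample, 2))
print(ngrams_per_line(sample, 2))
```

Which behaviour is wanted depends on whether a line is meant to be a unit of its own, which the question leaves open.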
+0

What does the list `answer` do in your code? Also, the parameter `k` isn't used – smac89

+0

@Smac89: Absolutely nothing. It's vestigial code that shouldn't be there. Thanks for catching my mistake. – inspectorG4dget

0

Here is a "one-liner" solution using a list comprehension:

s = "I love the Python programming language"

def ngram(s, n):
    # zip n copies of the word list, the e-th copy starting at word e
    return [" ".join(k) for k in zip(*[s.split()[e:] for e in range(n)])]

# Test
for i in range(1, 7):
    print(ngram(s, i))

Output:

['I', 'love', 'the', 'Python', 'programming', 'language'] 
['I love', 'love the', 'the Python', 'Python programming', 'programming language'] 
['I love the', 'love the Python', 'the Python programming', 'Python programming language'] 
['I love the Python', 'love the Python programming', 'the Python programming language'] 
['I love the Python programming', 'love the Python programming language'] 
['I love the Python programming language'] 

Note that no k parameter is needed.

Edit: adapted to your case:

def ngram(document, n):
    with open(document) as f:
        for line in f:
            # print the list of n-grams for each line
            print([" ".join(k) for k in zip(*[line.split()[e:] for e in range(n)])])