Splitting each line of a file into n-grams

def ngram(n, k, document):
    f = open(document, 'r')
    for i, line in enumerate(f):
        words = line.split() + line.split()
        print(words)
    return {}

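For example, the n-grams for "I love the Python programming language" with n = 2 are "I love", "love the", "the Python", "Python programming", and "programming language".

I want to store them in a list and then compare how many of them are the same.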

Source

2014-01-21 user3195981

What does k do? – inspectorG4dget

What should the output be if 'n = 3'? – thefourtheye

Take a look at the 'itertools pairwise' recipe: http://docs.python.org/2/library/itertools.html#recipes and the 'collections.Counter' data structure: http://docs.python.org/2/library/collections.html#collections.Counter – IceArdor
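
The two suggestions in the comment above can be combined into a minimal sketch: a pairwise-style helper generalized to n words, plus 'collections.Counter' to count how many n-grams two lines share. The names 'word_ngrams' and 'shared_ngram_count' are hypothetical and not from the question, and treating "how many are the same" as a multiset intersection is only one assumed reading of the requirement.

```python
from collections import Counter


def word_ngrams(line, n):
    """Return the word n-grams of `line` as a list of strings (hypothetical helper)."""
    words = line.split()
    # zip over n shifted views of the word list; zip stops at the shortest view
    return [" ".join(gram) for gram in zip(*(words[i:] for i in range(n)))]


def shared_ngram_count(line_a, line_b, n):
    """Count n-grams present in both lines, respecting duplicates (assumed meaning of 'same')."""
    common = Counter(word_ngrams(line_a, n)) & Counter(word_ngrams(line_b, n))
    return sum(common.values())


a = "I love the Python programming language"
b = "I love the Go programming language"
print(word_ngrams(a, 2))
print(shared_ngram_count(a, b, 2))
```

If duplicates should not be counted, comparing 'set(word_ngrams(a, n)) & set(word_ngrams(b, n))' would likely be the simpler choice.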

Well, you can achieve this with a list comprehension:

>>> [s1 + " " + s2 for s1, s2 in zip(s.split(), s.split()[1:])] 
['I love', 'love the', 'the Python', 'Python programming', 'programming language']

You can also use the str.format function:

>>> ["{} {}".format(s1, s2) for s1, s2 in zip(s.split(), s.split()[1:])] 
['I love', 'love the', 'the Python', 'Python programming', 'programming language']

Final version of the function:

from itertools import tee, islice


def ngram(n, s):
    # n independent iterators over the words, the i-th one advanced by i words
    var = [islice(it, i, None) for i, it in enumerate(tee(s.split(), n))]
    # join each window with single spaces (no trailing space)
    return [" ".join(itt) for itt in zip(*var)]

Demo:

>>> from splitting import ngram
>>> thing = 'I love the Python programming language'
>>> ngram(2, thing)
['I love', 'love the', 'the Python', 'Python programming', 'programming language']
>>> ngram(3, thing)
['I love the', 'love the Python', 'the Python programming', 'Python programming language']
>>> ngram(4, thing)
['I love the Python', 'love the Python programming', 'the Python programming language']
>>> ngram(1, thing)
['I', 'love', 'the', 'Python', 'programming', 'language']

Source

2014-01-21 05:32:30

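Parameterizing n is hard here – inspectorG4dget

@inspectorG4dget Hmm, you're right. Will fix it soon. –

What you really want is a solution using 'itertools.tee'. Take a look at my post – inspectorG4dget

It's not entirely clear what you want to return. Suppose a line says: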

I love the Python programming language

and you don't want n-grams to span across lines.

from collections import deque

def linesplitter(line, n):
    prev = deque(maxlen=n)           # fixed-length queue
    for word in line.split():        # iterate through each word
        prev.append(word)            # keep adding to the queue
        if len(prev) == n:           # once there are n elements
            print(" ".join(prev))    # start printing
                                     # (the oldest element is dropped automatically)

with open(document) as f:            # 'r' is implied
    for line in f:
        linesplitter(line, 2)        # or any other length!

Output:

I love 
love the 
the Python 
Python programming 
programming language
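
Since the question asks for the n-grams in a list rather than printed, the same sliding-window idea can be sketched as a generator. This is a hypothetical variant, not part of the original answer; the name 'iter_ngrams' is an assumption.

```python
from collections import deque


def iter_ngrams(line, n):
    """Yield each n-word window of `line` as a string (hypothetical variant)."""
    window = deque(maxlen=n)  # the oldest word is dropped automatically
    for word in line.split():
        window.append(word)
        if len(window) == n:
            yield " ".join(window)


grams = list(iter_ngrams("I love the Python programming language", 2))
print(grams)
```

A line with fewer than n words simply yields nothing, so the caller gets an empty list for it.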

Source

2014-01-21 05:33:43 mhlester

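You could adapt one of the itertools recipes:

import itertools

def ngrams(N, k, filepath):
    # k is unused; it is kept only to match the signature in the question
    with open(filepath) as infile:
        words = (word for line in infile for word in line.split())
        ts = itertools.tee(words, N)
        # advance the i-th copy by i words so the copies are staggered
        for i in range(1, len(ts)):
            for t in ts[i:]:
                next(t, None)
        # build the list before the file is closed
        return list(zip(*ts))

With a test file that looks like this: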

I love 
the 
python programming language

Here's the output:

In [21]: ngrams(2, '', 'blah') 
Out[21]: 
[('I', 'love'), 
('love', 'the'), 
('the', 'python'), 
('python', 'programming'), 
('programming', 'language')] 

In [22]: ngrams(3, '', 'blah') 
Out[22]: 
[('I', 'love', 'the'), 
('love', 'the', 'python'), 
('the', 'python', 'programming'), 
('python', 'programming', 'language')]
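
Note that the function above flattens the file into one stream of words, so its n-grams cross line boundaries (for example ('love', 'the') in the output). The sketch below contrasts that behaviour with a per-line variant, using 'io.StringIO' in place of a real file. The names 'stream_ngrams' and 'per_line_ngrams' are hypothetical, and which behaviour the question actually wants is an open assumption.

```python
import io
import itertools


def stream_ngrams(n, lines):
    """n-grams over the words of all lines, crossing line boundaries."""
    words = (word for line in lines for word in line.split())
    iters = itertools.tee(words, n)
    for i, it in enumerate(iters):
        # advance the i-th copy by i words (the itertools "consume" idiom)
        next(itertools.islice(it, i, i), None)
    return list(zip(*iters))


def per_line_ngrams(n, lines):
    """n-grams computed separately for each line, never crossing a line break."""
    result = []
    for line in lines:
        words = line.split()
        result.extend(zip(*(words[i:] for i in range(n))))
    return result


text = "I love\nthe\npython programming language\n"
print(stream_ngrams(2, io.StringIO(text)))
print(per_line_ngrams(2, io.StringIO(text)))
```

With this input the per-line variant produces fewer bigrams, because the one-word line "the" contributes none and no pair straddles a line break.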

Source

2014-01-21 05:38:50 inspectorG4dget

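What does the 'answer' list do in your code? Also, the parameter 'k' is not used – smac89

@Smac89: Absolutely nothing. It's vestigial code that shouldn't be there. Thanks for catching my mistake. – inspectorG4dget

Here is a "one-liner" solution using a list comprehension:

s = "I love the Python programming language"

def ngram(s, n):
    # zip n shifted copies of the word list and join each tuple
    return [" ".join(k) for k in zip(*[s.split()[e:] for e in range(n)])]

# Test
for i in range(1, 7):
    print(ngram(s, i))

Output: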

['I', 'love', 'the', 'Python', 'programming', 'language'] 
['I love', 'love the', 'the Python', 'Python programming', 'programming language'] 
['I love the', 'love the Python', 'the Python programming', 'Python programming language'] 
['I love the Python', 'love the Python programming', 'the Python programming language'] 
['I love the Python programming', 'love the Python programming language'] 
['I love the Python programming language']

Note that no k parameter is needed.

Adapted to your case:

def ngram(document, n):
    with open(document) as f:
        for line in f:
            print([" ".join(k) for k in zip(*[line.split()[e:] for e in range(n)])])

Source

2014-01-21 05:47:42 Christian
