2016-02-29

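Splitting a string into chunks in Python

I wrote code that extracts the most frequent words from a corpus and then checks each sentence for them. The output is a bag of words (1 if the word is in the sentence, 0 if not).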

import nltk 
import numpy as np 
from nltk import FreqDist 
from nltk.corpus import brown 


news = brown.words(categories='news') 
news_sents = brown.sents(categories='news') 

fdist = FreqDist(w.lower() for w in news) 
vocabulary = [word for word, _ in fdist.most_common(100)] 
num_sents = len(news_sents) 

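for i in range(num_sents):
    # 1 if the vocabulary word occurs in sentence i, else 0
    # (the comparison is case-sensitive, while the vocabulary is lowercased)
    features = {}
    for word in vocabulary:
        features[word] = int(word in news_sents[i])

    bow = "".join(str(n) for n in features.values())
    # Append one 100-character line per sentence
    with open("D:\\test\\Vector.txt", "a") as f:
        print(bow, file=f)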

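In this case, each output string is 100 characters long. I want to split it into chunks of an arbitrary length and assign a chunk number to each one. For example: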

print(i+1, chunk_id, bow, sep="\t", end="\n", file=f) 

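where i + 1 is the sentence number. To show what I mean, take two strings of length 12, "110010101111" and "011011000011", split into chunks of 4. The output should look like: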

1 1 1100 
1 2 1010
1 3 1111 
2 1 0110 
2 2 1100 
2 3 0011 

The linked duplicate talks about lists, but its solutions work for strings too. – timgeb
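A minimal sketch of that idea, assuming the slicing approach commonly used for lists. The helper name `chunks` and the chunk size of 4 are illustrative assumptions, not taken from the linked question:

```python
def chunks(seq, size):
    """Yield successive slices of `seq`, each at most `size` items long."""
    # Slicing works the same way on strings and lists.
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


# The two example strings from the question, numbered from 1.
for sent_id, bow in enumerate(["110010101111", "011011000011"], start=1):
    for chunk_id, chunk in enumerate(chunks(bow, 4), start=1):
        print(sent_id, chunk_id, chunk, sep="\t")
```

If the string length is not a multiple of the chunk size, the last chunk is simply shorter.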

Answer


The grouper function from the itertools documentation seems to be what you are looking for:

from itertools import zip_longest  # named izip_longest in Python 2

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
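
Applied to the strings from the question, this could look roughly like the sketch below. It assumes Python 3, where the function is named `zip_longest`. The helper `chunk_rows` and the chunk size of 4 are hypothetical, introduced only for illustration. Since `grouper` yields tuples of characters, each group is joined back into a string, and an empty string is used as the fill value so that a short final chunk is not padded.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def chunk_rows(sent_id, bow, size):
    """Hypothetical helper: return (sentence number, chunk number, chunk) rows."""
    rows = []
    for chunk_id, group in enumerate(grouper(bow, size, fillvalue=""), start=1):
        # Each group is a tuple of single characters; join it back into a string.
        rows.append((sent_id, chunk_id, "".join(group)))
    return rows


for sent_id, bow in enumerate(["110010101111", "011011000011"], start=1):
    for row in chunk_rows(sent_id, bow, 4):
        print(*row, sep="\t")
```

In the question's loop, the same rows could be written with `print(i + 1, chunk_id, chunk, sep="\t", file=f)`.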