Symmetric word matrix in Python using nltk

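I am trying to create a symmetric word matrix from a text document.

For example: text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

I have tokenized the text document using nltk. Now I want to count how many times the other words appear in the same sentence. From the text above, I want to create the matrix below: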

        Barbara good friends Benny bad
Barbara 2       1    1       1     0
good    1       1    0       0     0
friends 1       0    1       1     0
Benny   1       0    1       2     1
bad     0       0    1       1     1

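Note that the diagonal holds the frequency of each word, since Barbara appears with Barbara in a sentence as often as there are sentences containing Barbara. I would prefer not to overcount, but it is not a big issue if avoiding that makes the code too complex.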

Source

2013-07-03 mumpy

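What is the question? –

How do I create the matrix above from the text? – mumpy

First we tokenize the text, loop over each sentence, iterate over all pairwise combinations of the words in each sentence, and store the counts in a nested dict: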

from nltk.tokenize import word_tokenize, sent_tokenize
from collections import defaultdict
import numpy as np
text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

# nested dict: sparse_matrix[word1][word2] -> co-occurrence count
sparse_matrix = defaultdict(lambda: defaultdict(lambda: 0))

for sent in sent_tokenize(text):
    words = word_tokenize(sent)
    # count every ordered pair of tokens in the sentence (a word with itself included)
    for word1 in words:
        for word2 in words:
            sparse_matrix[word1][word2] += 1

print(sparse_matrix)
>> defaultdict(<function <lambda> at 0x7f46bc3587d0>, { 
'good': defaultdict(<function <lambda> at 0x3504320>, 
    {'is': 1, 'good': 1, 'Barbara': 1, '.': 1}), 
'friends': defaultdict(<function <lambda> at 0x3504410>, 
    {'friends': 1, 'is': 1, 'Benny': 1, '.': 1, 'Barbara': 1, 'with': 1}), etc..

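This is essentially like a matrix, since we can index sparse_matrix['good']['Barbara'] and get the number 1, and index sparse_matrix['bad']['Barbara'] and get 0. But we never actually store counts for words that did not co-occur; the 0 is generated by the defaultdict only when you ask for it. This can save a lot of memory for this kind of task. If we need a dense matrix for some kind of linear algebra or for other computational reasons, we can get one like this: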
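One detail worth knowing about the lookup behaviour described above: indexing a defaultdict with a missing key creates that entry as a side effect, so reading many absent pairs slowly fills the structure. A minimal sketch of a read-only lookup follows; the helper name `lookup` and the tiny `counts` dict are made up for illustration and are not part of the answer's code.

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))
counts["good"]["Barbara"] += 1

# Plain indexing returns 0 for an unseen pair, but also inserts the keys.
assert counts["bad"]["Barbara"] == 0
assert "bad" in counts  # the outer key now exists


def lookup(matrix, w1, w2):
    """Read a count without inserting anything (hypothetical helper)."""
    return matrix.get(w1, {}).get(w2, 0)


before = len(counts)
value = lookup(counts, "friends", "Benny")
after = len(counts)
print(value, before == after)  # 0 True
```

dict.get does not trigger the default factory, which is why the size of `counts` is unchanged after the call.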

lexicon_size = len(sparse_matrix)
def mod_hash(x, m):
    # maps a word to a row/column index; note that two words can collide on the
    # same index, and Python 3 randomizes str hashes between runs by default
    return hash(x) % m
dense_matrix = np.zeros((lexicon_size, lexicon_size))

for k in sparse_matrix:
    for k2 in sparse_matrix[k]:
        dense_matrix[mod_hash(k, lexicon_size)][mod_hash(k2, lexicon_size)] = \
            sparse_matrix[k][k2]

print(dense_matrix)
>> 
[[ 0. 0. 0. 0. 0. 0. 0. 0.] 
[ 0. 0. 0. 0. 0. 0. 0. 0.] 
[ 0. 0. 1. 1. 1. 1. 0. 1.] 
[ 0. 0. 1. 1. 1. 0. 0. 1.] 
[ 0. 0. 1. 1. 1. 1. 0. 1.] 
[ 0. 0. 1. 0. 1. 2. 0. 2.] 
[ 0. 0. 0. 0. 0. 0. 0. 0.] 
[ 0. 0. 1. 1. 1. 2. 0. 3.]]
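A caveat on the hash-modulo indexing: two different words can land on the same row, in which case later counts overwrite earlier ones and some rows stay empty (the all-zero rows in the output above are consistent with that). A minimal sketch that gives every word its own index instead; `build_dense` is a hypothetical helper, and a small hand-written dict stands in for sparse_matrix so the example runs without nltk.

```python
def build_dense(sparse):
    """Turn a nested count dict into (words, dense nested list)."""
    # Collect every word used as a row or column key, in sorted order.
    words = sorted(set(sparse) | {w for row in sparse.values() for w in row})
    index = {w: i for i, w in enumerate(words)}
    dense = [[0] * len(words) for _ in words]
    for w1, row in sparse.items():
        for w2, n in row.items():
            dense[index[w1]][index[w2]] = n
    return words, dense


# Stand-in data (assumed), shaped like the nested dict built earlier.
sparse = {
    "Benny": {"Benny": 2, "bad": 1},
    "bad": {"Benny": 1, "bad": 1},
}
words, dense = build_dense(sparse)
print(words)  # ['Benny', 'bad']
print(dense)  # [[2, 1], [1, 1]]
```

The `words` list records which word each row and column refers to, and the nested list can be handed to numpy.array if an array is needed.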

I would suggest looking at http://docs.scipy.org/doc/scipy/reference/sparse.html for other ways of dealing with matrix sparsity.

Source

2013-07-03 23:21:06 qwwqwwq

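Thank you for your time! I also appreciate the link on sparse matrices. Cheers! – mumpy

I would first set up something like the following. You might add some kind of tokenization, though it is not necessary for your example.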

text = """Barbara is good. Barbara is friends with Benny. Benny is bad."""
allwords = text.replace('.','').split(' ')
word_to_index = {}
index_to_word = {}
index = 0
# give each distinct word the next free index, in order of first appearance
for word in allwords:
    if word not in word_to_index:
        word_to_index[word] = index
        index_to_word[index] = word
        index += 1
word_count = index

>>> index_to_word 
{0: 'Barbara', 
1: 'is', 
2: 'good', 
3: 'friends', 
4: 'with', 
5: 'Benny', 
6: 'bad'} 

>>> word_to_index 
{'Barbara': 0, 
'Benny': 5, 
'bad': 6, 
'friends': 3, 
'good': 2, 
'is': 1, 
'with': 4}
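For what it's worth, the same two lookup tables can be built more compactly. This is only a sketch of an equivalent formulation, assuming Python 3.7+ where dicts keep insertion order; the name `unique_words` is introduced here for illustration.

```python
text = "Barbara is good. Barbara is friends with Benny. Benny is bad."
allwords = text.replace(".", "").split(" ")

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+)
unique_words = list(dict.fromkeys(allwords))
word_to_index = {word: i for i, word in enumerate(unique_words)}
index_to_word = dict(enumerate(unique_words))
word_count = len(unique_words)

print(word_to_index["Benny"], index_to_word[3], word_count)  # 5 friends 7
```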

Then declare a matrix of the proper size (word_count x word_count), possibly using numpy like

import numpy 
matrix = numpy.zeros((word_count, word_count))

or just a nested list:

matrix = [None,]*word_count 
for i in range(word_count): 
    matrix[i] = [0,]*word_count

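Note that this is tricky: something like matrix = [[0]*word_count]*word_count will not work, because it builds a list of 7 references to the same inner list (e.g., if you try that code and then do matrix[0][1] = 1, you will find that matrix[1][1], matrix[2][1], etc. have also changed to 1).

Then you just need to iterate through your sentences.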
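The aliasing pitfall is easy to see on a smaller case. A quick sketch, using word_count = 3 instead of 7 to keep the output short; the variable names `aliased` and `independent` are just for illustration.

```python
word_count = 3

# One inner list, referenced three times
aliased = [[0] * word_count] * word_count
aliased[0][1] = 1
print(aliased)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]

# A fresh inner list for every row
independent = [[0] * word_count for _ in range(word_count)]
independent[0][1] = 1
print(independent)  # [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
```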

sentences = text.split('.')
for sent in sentences:
    for word1 in sent.split(' '):
        # skip empty strings left over from splitting
        if word1 not in word_to_index:
            continue
        for word2 in sent.split(' '):
            if word2 not in word_to_index:
                continue
            matrix[word_to_index[word1]][word_to_index[word2]] += 1

And then you get:

>>> matrix 

[[2, 2, 1, 1, 1, 1, 0], 
[2, 3, 1, 1, 1, 2, 1], 
[1, 1, 1, 0, 0, 0, 0], 
[1, 1, 0, 1, 1, 1, 0], 
[1, 1, 0, 1, 1, 1, 0], 
[1, 2, 0, 1, 1, 2, 1], 
[0, 1, 0, 0, 0, 1, 1]]

Or, if you are curious what, say, the frequency of 'Benny' and 'bad' is, you can ask for matrix[word_to_index['Benny']][word_to_index['bad']].
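If overcounting is a concern (a word repeated inside one sentence would be counted several times by the loops above), one option is to reduce each sentence to a set first. The following is only a sketch under some assumptions: it uses a simple regex split instead of nltk, and the `vocab` list of words to keep is chosen by hand to match the table in the question.

```python
import re
from collections import Counter
from itertools import product

text = "Barbara is good. Barbara is friends with Benny. Benny is bad."
vocab = ["Barbara", "good", "friends", "Benny", "bad"]  # words to keep (assumed)

counts = Counter()
for sent in re.split(r"[.!?]", text):
    # A set keeps each word once per sentence, so repeats are not overcounted
    present = set(re.findall(r"\w+", sent)) & set(vocab)
    for pair in product(present, repeat=2):
        counts[pair] += 1

# Print one row per word, columns in the same order as vocab
for w1 in vocab:
    print(w1.ljust(8), *(counts[w1, w2] for w2 in vocab))
```

Because every pair is counted in both orders, the result is symmetric by construction, and the diagonal is the number of sentences containing the word. On this text 'bad' and 'friends' never share a sentence, so that entry comes out as 0.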

Source

2013-07-03 22:56:53

Thank you very much! I appreciate your help. – mumpy

I wish I could select two answers - both of your answers were very helpful for my analysis. Cheers! – mumpy
