Python - data structure of csr_matrix

I am studying TF-IDF. I used tfidf_vectorizer.fit_transform, which returns a csr_matrix, but I don't understand the structure of the result.

Input data:

documents = ("The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun, the bright sun")

Code:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix)

The result is:

(0, 9) 0.34399327143
(0, 7) 0.519713848879
(0, 4) 0.420753151645
(0, 0) 0.659191117868
(1, 9) 0.426858009784
(1, 4) 0.522108621994
(1, 8) 0.522108621994
(1, 1) 0.522108621994
(2, 9) 0.526261040111
(2, 7) 0.397544332095
(2, 4) 0.32184639876
(2, 8) 0.32184639876
(2, 1) 0.32184639876
(2, 3) 0.504234576856
(3, 9) 0.390963088213
(3, 8) 0.47820398015
(3, 1) 0.239101990075
(3, 10) 0.374599471224
(3, 2) 0.374599471224
(3, 5) 0.374599471224
(3, 6) 0.374599471224

tfidf_matrix is a csr_matrix. I looked up the documentation for scipy.sparse.csr_matrix, but I could not find a structure there that matches this result.

What structure does an entry such as (0, 9) 0.34399327143 have?

Source

2017-08-14 Brasc elok

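This looks like a matrix that collects some kind of statistic about the sentences in the list (4 of them) and the unique words (11?). For example, the first row has 4 matrix entries, one for each of its 4 words. 'tfidf_matrix.A' should display it in the traditional matrix form. – hpaulj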

@hpaulj: Could you write out the matrix in more detail for me?

Without the vectorizer, I can recreate the matrix, more or less, with this sequence of operations:

In [703]: documents = ("The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun")

Get a list of lists of words (all lowercase):

In [704]: alist = [l.lower().split() for l in documents]

Get the sorted list of unique words:

In [705]: aset = set() 
In [706]: [aset.update(l) for l in alist] 
Out[706]: [None, None, None, None] 
In [707]: unq = sorted(list(aset)) 
In [708]: unq 
Out[708]: 
['blue', 
'bright', 
'can', 
'in', 
'is', 
'see', 
'shining', 
'sky', 
'sun', 
'the', 
'we']

Go through alist and collect word counts. rows will hold the sentence number, and cols the index of the word in the unique list:

In [709]: rows, cols, data = [],[],[] 
In [710]: for i,row in enumerate(alist): 
    ...:  for c in row: 
    ...:   rows.append(i) 
    ...:   cols.append(unq.index(c)) 
    ...:   data.append(1) 
    ...:

Build a sparse matrix from this data (this needs from scipy import sparse):

In [711]: M = sparse.csr_matrix((data,(rows,cols))) 
In [712]: M 
Out[712]: 
<4x11 sparse matrix of type '<class 'numpy.int32'>' 
    with 21 stored elements in Compressed Sparse Row format> 
In [713]: print(M) 
    (0, 0) 1 
    (0, 4) 1 
    (0, 7) 1 
    (0, 9) 1 
    (1, 1) 1 
    .... 
    (3, 9) 2 
    (3, 10) 1 
In [714]: M.A  # viewed as 2d array 
Out[714]: 
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], 
     [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0], 
     [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0], 
     [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32)
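The entries equal to 2 come from repeated words: 'the' occurs twice in the third sentence, so the same (row, col) pair is appended twice with a value of 1, and the sparse constructor sums such duplicates into one stored element. Below is a minimal pure-Python sketch of that summation, without SciPy; the names counts and dense are only illustrative and not part of any library.

```python
from collections import Counter

documents = ("The sky is blue", "The sun is bright",
             "The sun in the sky is bright",
             "We can see the shining sun the bright sun")

# Same preprocessing as the session above: lowercase, split on whitespace.
alist = [d.lower().split() for d in documents]
unq = sorted({w for words in alist for w in words})

# Sum repeated (row, col) pairs, mimicking how duplicate entries are combined.
counts = Counter((i, unq.index(w))
                 for i, words in enumerate(alist)
                 for w in words)

# Expand to a dense list of lists so it can be compared with M.A.
dense = [[counts.get((i, j), 0) for j in range(len(unq))]
         for i in range(len(alist))]
for row in dense:
    print(row)
```

The number of keys in counts corresponds to the "21 stored elements" reported for M, because only distinct (row, col) pairs are kept.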

Since this uses sklearn, I can reproduce your matrix:

In [717]: from sklearn import feature_extraction 
In [718]: tf = feature_extraction.text.TfidfVectorizer() 
In [719]: tfM = tf.fit_transform(documents) 
In [720]: tfM 
Out[720]: 
<4x11 sparse matrix of type '<class 'numpy.float64'>' 
    with 21 stored elements in Compressed Sparse Row format> 
In [721]: print(tfM) 
    (0, 9) 0.34399327143 
    (0, 7) 0.519713848879 
    (0, 4) 0.420753151645 
    .... 
    (3, 5) 0.374599471224 
    (3, 6) 0.374599471224 
In [722]: tfM.A 
Out[722]: 
array([[ 0.65919112, 0.  , 0.  , 0.  , 0.42075315, 
     0.  , 0.  , 0.51971385, 0.  , 0.34399327, 
     0.  ],.... 
     [ 0.  , 0.23910199, 0.37459947, 0.  , 0.  , 
     0.37459947, 0.37459947, 0.  , 0.47820398, 0.39096309, 
     0.37459947]])
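To see where a value such as 0.34399327 comes from, the first row can be recomputed by hand. This is only a sketch under stated assumptions: it assumes the TfidfVectorizer defaults (smooth_idf=True, norm='l2', raw term counts) and uses plain whitespace splitting, which happens to give the same tokens for these four sentences. The helper tfidf_row is a hypothetical name, not an sklearn function.

```python
import math

documents = ("The sky is blue", "The sun is bright",
             "The sun in the sky is bright",
             "We can see the shining sun the bright sun")

alist = [d.lower().split() for d in documents]
unq = sorted({w for words in alist for w in words})
n_docs = len(alist)

# Document frequency: the number of sentences that contain each word.
df = {w: sum(w in words for words in alist) for w in unq}

# Smoothed idf, assuming the default smooth_idf=True.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in unq}

def tfidf_row(words):
    """Term count times idf, followed by L2 normalisation of the row."""
    raw = [words.count(w) * idf[w] for w in unq]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row0 = tfidf_row(alist[0])
# Compare with the (0, 9) entry printed above; column 9 is 'the'.
print(unq.index("the"), row0[unq.index("the")])
```

Because 'the' appears in every sentence, its idf is the smallest possible, which is why it gets the lowest weight in the first row.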

The actual data is stored in 3 attribute arrays:

In [723]: tfM.indices 
Out[723]: 
array([ 9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 
     10, 2, 5, 6], dtype=int32) 
In [724]: tfM.data 
Out[724]: 
array([ 0.34399327, 0.51971385, 0.42075315, 0.65919112, 0.42685801, 
     ... 
     0.37459947]) 
In [725]: tfM.indptr 
Out[725]: array([ 0, 4, 8, 14, 21], dtype=int32)
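These three arrays are enough to regenerate the printed (row, col) value listing: row i owns the slice indptr[i]:indptr[i+1] of both indices and data. Below is a minimal pure-Python sketch of that decoding, with the arrays copied from the output above (the data values elided there are taken from the printed matrix in the question). csr_triplets is an illustrative helper, not a SciPy function.

```python
indptr = [0, 4, 8, 14, 21]
indices = [9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 10, 2, 5, 6]
data = [0.34399327143, 0.519713848879, 0.420753151645, 0.659191117868,
        0.426858009784, 0.522108621994, 0.522108621994, 0.522108621994,
        0.526261040111, 0.397544332095, 0.32184639876, 0.32184639876,
        0.32184639876, 0.504234576856,
        0.390963088213, 0.47820398015, 0.239101990075, 0.374599471224,
        0.374599471224, 0.374599471224, 0.374599471224]

def csr_triplets(indptr, indices, data):
    """Yield (row, col, value) for every stored element of a CSR matrix."""
    for row in range(len(indptr) - 1):
        # indptr[row]:indptr[row + 1] is the slice belonging to this row.
        for k in range(indptr[row], indptr[row + 1]):
            yield row, indices[k], data[k]

triplets = list(csr_triplets(indptr, indices, data))
for row, col, value in triplets[:4]:
    print((row, col), value)
```

The first triplet is row 0, column 9, i.e. the entry the question asks about.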

The indices values for each row tell us which words appear in that sentence:

In [726]: np.array(unq)[M[0,].indices] 
Out[726]: 
array(['blue', 'is', 'sky', 'the'], 
     dtype='<U7') 
In [727]: np.array(unq)[M[3,].indices] 
Out[727]: 
array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'], 
     dtype='<U7')

Source

2017-08-14 20:14:53 hpaulj

Thank you, that is very detailed and helpful.

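What you see is just the string representation used when calling print(my_csr_mat). It lists (in your case) all the nonzeros in your matrix. (With a very large number of nonzeros, the output may be truncated.)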

Since this is a sparse matrix, it has 2 dimensions, so each entry is addressed by a (row, column) pair.

(0, 9) 0.34399327143

means: the matrix element at position [0, 9] is 0.34399327143.

Small demo:

import numpy as np 
from scipy.sparse import csr_matrix 

matrix_dense = np.arange(20).reshape(4,5) 
zero_out = np.random.choice((0,1), size=(4,5), p=(0.7, 0.3)) 
matrix_dense_mod = matrix_dense * zero_out 

print(matrix_dense_mod) 

sparse_mat = csr_matrix(matrix_dense_mod) 

print(sparse_mat)

Output:

[[ 0 0 2 0 4] 
[ 0 6 0 8 0] 
[ 0 11 0 13 14] 
[15 0 0 18 19]] 
    (0, 2)  2 
    (0, 4)  4 
    (1, 1)  6 
    (1, 3)  8 
    (2, 1)  11 
    (2, 3)  13 
    (2, 4)  14 
    (3, 0)  15 
    (3, 3)  18 
    (3, 4)  19
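Going in the other direction, the sketch below builds the three CSR arrays from the dense matrix shown in this output (the demo itself is random, so the matrix is copied from the printout rather than regenerated). It is a plain-Python illustration of the layout, not SciPy's implementation, and dense_to_csr is a hypothetical helper name.

```python
dense = [[0, 0, 2, 0, 4],
         [0, 6, 0, 8, 0],
         [0, 11, 0, 13, 14],
         [15, 0, 0, 18, 19]]

def dense_to_csr(matrix):
    """Return (data, indices, indptr) for a dense list of lists."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # the nonzero value itself
                indices.append(col)     # its column index
        indptr.append(len(data))        # where the next row starts
    return data, indices, indptr

data, indices, indptr = dense_to_csr(dense)
print(data)
print(indices)
print(indptr)
```

Reading data and indices side by side, split at the positions in indptr, gives exactly the (row, col) value pairs listed by print(sparse_mat).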

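I'm not sure what you mean by not finding a structure that matches the result, but note: most examples in the scipy.sparse documentation call my_mat.toarray(), which builds a dense array from the sparse matrix, and a dense array has a different string representation.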

Source

2017-08-14 16:28:34 sascha

Thanks. I understand now.
