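Clustering using feature hashing

I have to cluster some documents that are in JSON format. I would like to try feature hashing to reduce the dimensionality. To start small, here is my input: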

doc_a = { "category": "election, law, politics, civil, government", 
      "expertise": "political science, civics, republican" 
     } 

doc_b = { "category": "Computers, optimization", 
      "expertise": "computer science, graphs, optimization" 
     } 
doc_c = { "category": "Election, voting", 
      "expertise": "political science, republican" 
     } 
doc_d = { "category": "Engineering, Software, computers", 
      "expertise": "computers, programming, optimization" 
     } 
doc_e = { "category": "International trade, politics", 
      "expertise": "civics, political activist" 
     }

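Now, how do I go about using feature hashing to create a vector for each document, and then compute similarities and create clusters? I am a bit lost after reading http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html. I am not sure whether I have to use 'dict', or convert my data to integers and then use 'pair' as the 'input_type' of my FeatureHasher. How should I interpret the output of FeatureHasher? For example, the example at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html outputs a numpy array.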

In [1]: from sklearn.feature_extraction import FeatureHasher 

In [2]: hasher = FeatureHasher(n_features=10, non_negative=True, input_type='pair') 

In [3]: x_new = hasher.fit_transform([[('a', 1), ('b', 2)], [('a', 0), ('c', 5)]]) 

In [4]: x_new.toarray() 
Out[4]: 
array([[ 1., 2., 0., 0., 0., 0., 0., 0., 0., 0.], 
     [ 0., 0., 0., 0., 0., 0., 0., 5., 0., 0.]]) 

In [5]:

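I think the rows are the documents and the column values are..? Say I want to cluster these vectors or find the similarity between them (using cosine or Jaccard): I am not sure whether I have to do an item-wise comparison?

Expected output: doc_a, doc_c and doc_e should be in one cluster and the rest in another.

Thanks!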


2016-11-17 user1717931

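You will make things easier for yourself if you use HashingVectorizer instead of FeatureHasher for this problem. HashingVectorizer takes care of tokenizing the input data and can accept a list of strings.

The main challenge with the problem is that you actually have two kinds of text features, category and expertise. The trick in this case is to use a separate hashing vectorizer for each of the two features and then combine the outputs: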

from sklearn.feature_extraction.text import HashingVectorizer 
from scipy.sparse import hstack 
from sklearn.cluster import KMeans 

docs = [doc_a,doc_b, doc_c, doc_d, doc_e] 

# vectorize both fields separately 
category_vectorizer = HashingVectorizer() 
Xc = category_vectorizer.fit_transform([doc["category"] for doc in docs]) 

expertise_vectorizer = HashingVectorizer() 
Xe = expertise_vectorizer.fit_transform([doc["expertise"] for doc in docs]) 

# combine the features into a single data set 
X = hstack((Xc,Xe)) 
print("X: %d x %d" % X.shape) 
print("Xc: %d x %d" % Xc.shape) 
print("Xe: %d x %d" % Xe.shape) 

# fit a cluster model 
km = KMeans(n_clusters=2) 

# predict the cluster 
for k,v in zip(["a","b","c","d", "e"], km.fit_predict(X)): 
    print("%s is in cluster %d" % (k,v))
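
For comparison, FeatureHasher can also be used directly, as the question originally asked. The following is only a minimal sketch, not the method used above: it assumes each comma-separated term becomes one token, and the `tokens` helper, the `field=term` naming and the small `n_features=32` are hypothetical choices made for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

docs = [
    {"category": "election, law, politics, civil, government",
     "expertise": "political science, civics, republican"},
    {"category": "Computers, optimization",
     "expertise": "computer science, graphs, optimization"},
]


def tokens(doc):
    # Hypothetical helper: one token per comma-separated term, prefixed
    # with its field name so "category" and "expertise" terms hash apart.
    return ["%s=%s" % (field, term.strip().lower())
            for field, text in doc.items()
            for term in text.split(",")]


# input_type="string" treats every token as a feature with value 1
hasher = FeatureHasher(n_features=32, input_type="string")
X = hasher.transform([tokens(doc) for doc in docs])

print(X.shape)      # (number of documents, n_features)
print(X.toarray())
```

Each row of X is one document and each column is a hash bucket. The values can be negative because FeatureHasher alternates signs by default, and with so few buckets different terms may collide in the same column.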


2016-11-18 03:17:37

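Thanks Ryan. I tried it on 10 documents with 4 text features and it works well. For now I can inspect the cluster members visually and see which features contributed to their "closeness". This becomes a problem when I work with very large data sets, because the documentation says: "there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model." I am looking into other things sklearn can offer. Any ideas? – user1717931

Yes, there is a trade-off with HashingVectorizer: you can fit your model with many more features, but you lose the ability to introspect. The other vectorizers in sklearn (CountVectorizer, TfidfVectorizer) do allow introspection, but they have a larger memory footprint; to make them fit a large data set, you can set max_features to a reasonable number.

If I had to compute the similarity between two hashed vectors manually (feature vectors holding the hashed counts in each feature bucket), I believe I could simply use Jaccard or cosine. Right? – user1717931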
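
As a rough illustration of that trade-off, the sketch below caps the vocabulary with max_features and then reads the column names back. The `expertise` list is copied from the question's documents, and `max_features=8` is an arbitrary value chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

expertise = [
    "political science, civics, republican",
    "computer science, graphs, optimization",
    "political science, republican",
    "computers, programming, optimization",
    "civics, political activist",
]

# Keep only the 8 most frequent terms to bound the memory footprint
vectorizer = TfidfVectorizer(max_features=8)
X = vectorizer.fit_transform(expertise)

# vocabulary_ maps each kept term to its column index, so the columns
# can be traced back to words (not possible with hashing)
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(X.shape)
print(feature_names)
```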
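
A minimal sketch of the cosine case, assuming the same `expertise` strings as in the question and an arbitrary `n_features=2**12`: the hashed rows are ordinary sparse vectors, so scikit-learn's cosine_similarity can be applied to them without any item-wise loop. Jaccard is defined on sets, so it would presumably need binary, non-negative features first; that part is not shown here.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

expertise = [
    "political science, civics, republican",   # doc_a
    "computer science, graphs, optimization",  # doc_b
    "political science, republican",           # doc_c
    "computers, programming, optimization",    # doc_d
    "civics, political activist",              # doc_e
]

# HashingVectorizer is stateless, so transform() can be called directly
vectorizer = HashingVectorizer(n_features=2**12)
X = vectorizer.transform(expertise)

# Pairwise cosine similarity between all documents (5 x 5 matrix)
sim = cosine_similarity(X)
print(sim.round(2))
```

Entry `sim[i, j]` is the similarity between document i and document j, so `sim[0, 2]` compares doc_a with doc_c.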
