How do I cluster a list of lists of (tag, probability) tuples? - python

I have a bunch of texts that are classified into categories, and each document is tagged 0, 1 or 2 with a probability for each tag.

[ "this is a foo bar", 
    "bar bar black sheep", 
    "sheep is an animal", 
    "foo foo bar bar", 
    "bar bar sheep sheep" ]

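A tool earlier in the pipeline returns the tuples as a list of lists like the one below, where each element of the outer list corresponds to a document.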

[ [(0,0.3), (1,0.5), (2,0.1)], 
    [(0,0.5), (1,0.3), (2,0.3)], 
    [(0,0.4), (1,0.4), (2,0.5)], 
    [(0,0.3), (1,0.7), (2,0.2)], 
    [(0,0.2), (1,0.6), (2,0.1)] ]

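I need to see which tag is most probable in each list of tuples, and I can only work with the fact that each document is tagged 0, 1 or 2 with a probability. The result I want to achieve looks like this: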

[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] , 
    [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] , 
    [[(0,0.4), (1,0.4), (2,0.5)]] ]

As another example:

[in]:

[ [(0,0.7), (1,0.2), (2,0.4)], 
    [(0,0.5), (1,0.9), (2,0.3)], 
    [(0,0.3), (1,0.8), (2,0.4)], 
    [(0,0.8), (1,0.2), (2,0.2)], 
    [(0,0.1), (1,0.7), (2,0.5)] ]

[out]:

[[[(0,0.7), (1,0.2), (2,0.4)], 
[(0,0.8), (1,0.2), (2,0.2)]] , 

[[(0,0.5), (1,0.9), (2,0.3)], 
[(0,0.1), (1,0.7), (2,0.5)], 
[(0,0.3), (1,0.8), (2,0.4)]] , 

[]]

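Note: I do not have access to the original text by the time the data reaches my part of the pipeline.

How can I cluster the lists of (tag, probability) tuples? Is there a function for this in numpy, scipy, sklearn or any other ML suite usable from Python? Even NLTK would do.

Assume that the number of clusters is fixed but the cluster sizes are not.

So far I have only tried finding the maximum value as a centroid for each cluster, but that only gives me the first value in each cluster: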

instream = [ [(0,0.3), (1,0.5), (2,0.1)], 
         [(0,0.5), (1,0.3), (2,0.3)], 
         [(0,0.4), (1,0.4), (2,0.5)], 
         [(0,0.3), (1,0.7), (2,0.2)], 
         [(0,0.2), (1,0.6), (2,0.1)] ] 

# Find centroid: the highest-probability (tag, probability) tuple for each tag.
# The tag is constant within each position, so the tuples sort by probability.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]

# Index of the first document that holds each of those tuples.
c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]

print(instream[c1_centroid])
print(instream[c2_centroid])
# NOTE: this prints c2_centroid a second time; c3_centroid was presumably intended.
print(instream[c2_centroid])

[out] (top element in each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)] 
[(0, 0.3), (1, 0.7), (2, 0.2)] 
[(0, 0.3), (1, 0.7), (2, 0.2)]

Source

2014-01-08 alvas

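It would help if you could show some example input and output. Please explain in more detail what exactly you are doing, and also make sure this is not an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem).

@InbarRose, I edited the question to give more context. – alvas

Shouldn't the third line of your output be [(0,0.4), (1,0.4), (2,0.5)]?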

If I understand correctly, this is what you want.

import numpy as np

N_TYPES = 3

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
      [(0,0.5), (1,0.3), (2,0.3)],
      [(0,0.4), (1,0.4), (2,0.5)],
      [(0,0.3), (1,0.7), (2,0.2)],
      [(0,0.2), (1,0.6), (2,0.1)] ]
# shape (5, 3, 2): documents x tags x (tag, probability); the tags become floats
instream = np.array(instream)

# this removes document tags because we only consider probabilities here
values = [[pair[1] for pair in doc] for doc in instream]

# determine the cluster of each document by using maximum probability
# (a list comprehension instead of map(), so this also works on Python 3)
belongs_to = np.array([np.argmax(v) for v in values])

# construct clusters of indices to your instream
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]

# apply the indices to obtain full output
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]

The resulting out:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]], 

[[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]], 
    [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]], 
    [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]], 

[[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]

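I used numpy arrays because they are convenient for searching and indexing. For example, the expression (belongs_to == 1).nonzero()[0] returns an array of the indices into belongs_to at which the value is 1. An example of indexing with such an array is instream[cluster_indices[2]].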
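The indexing idiom described above can be tried in isolation. The following is only a minimal sketch: `belongs_to` is filled in by hand with the assignment the sample data yields, and `docs` is a hypothetical stand-in array introduced just for illustration.

```python
import numpy as np

# Cluster assignment of five documents (typed in by hand for this sketch).
belongs_to = np.array([1, 0, 2, 1, 1])

mask = belongs_to == 1        # boolean array, True where the document is in cluster 1
indices = mask.nonzero()[0]   # positions of the True entries
print(indices)                # [0 3 4]

# Hypothetical stand-in for the documents; an index array selects matching rows.
docs = np.array(["d0", "d1", "d2", "d3", "d4"])
print(docs[indices])          # ['d0' 'd3' 'd4']
```

Indexing with an integer array picks rows along the first axis, so the same pattern applies unchanged to a 3-D array such as instream.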

Source

2014-01-08 16:57:11 islijepcevic

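Why keep the indices in the tuples? The 0, 1 and 2 are redundant and, if I understand correctly, carry no information. Just feed the n_samples x 3 list of probabilities to any scikit-learn algorithm. Or, if you only want the most likely label assignment, use np.argmax(X, axis=1).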
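As a rough sketch of that suggestion, the snippet below strips the tags from the second example in the question, builds the n_samples x 3 matrix and takes the argmax of each row. It assumes every document lists tags 0, 1 and 2 in that order; the names `X`, `labels` and `clusters` are chosen here for illustration.

```python
import numpy as np

instream = [[(0, 0.7), (1, 0.2), (2, 0.4)],
            [(0, 0.5), (1, 0.9), (2, 0.3)],
            [(0, 0.3), (1, 0.8), (2, 0.4)],
            [(0, 0.8), (1, 0.2), (2, 0.2)],
            [(0, 0.1), (1, 0.7), (2, 0.5)]]

# Keep only the probabilities: X has shape (n_samples, 3).
# Assumption: tags appear as 0, 1, 2 in that order in every document.
X = np.array([[p for _tag, p in doc] for doc in instream])

labels = np.argmax(X, axis=1)   # most probable tag per document
print(labels)                   # [0 1 1 0 1]

# Group the original documents by their most probable tag.
n_tags = X.shape[1]
clusters = [[doc for doc, lab in zip(instream, labels) if lab == k]
            for k in range(n_tags)]
```

Within each cluster the documents keep their input order, and a tag that is never the most probable one ends up with an empty list, as in the question's expected output.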

Source

2014-01-09 00:08:10
