2014-01-08 29 views
3

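I have a bunch of texts that have been classified into categories, and each document is tagged 0, 1 or 2 with a probability for each tag. How do I cluster a list of lists of (tag, probability) tuples? - python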

[ "this is a foo bar", 
    "bar bar black sheep", 
    "sheep is an animal", 
    "foo foo bar bar", 
    "bar bar sheep sheep" ] 

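The previous tool in the pipeline returns a list of lists of tuples like the one below, where each element of the outer list corresponds to a document.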

[ [(0,0.3), (1,0.5), (2,0.1)], 
    [(0,0.5), (1,0.3), (2,0.3)], 
    [(0,0.4), (1,0.4), (2,0.5)], 
    [(0,0.3), (1,0.7), (2,0.2)], 
    [(0,0.2), (1,0.6), (2,0.1)] ] 

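All I can work with is the fact that each document is tagged 0, 1 or 2, each with a probability. I need to find which tag is most probable in each list of tuples and produce: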

[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] , 
    [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] , 
    [[(0,0.4), (1,0.4), (2,0.5)]] ] 

As another example:

[in]

[ [(0,0.7), (1,0.2), (2,0.4)], 
    [(0,0.5), (1,0.9), (2,0.3)], 
    [(0,0.3), (1,0.8), (2,0.4)], 
    [(0,0.8), (1,0.2), (2,0.2)], 
    [(0,0.1), (1,0.7), (2,0.5)] ] 

[out]

[[[(0,0.7), (1,0.2), (2,0.4)], 
[(0,0.8), (1,0.2), (2,0.2)]] , 

[[(0,0.5), (1,0.9), (2,0.3)], 
[(0,0.1), (1,0.7), (2,0.5)], 
[(0,0.3), (1,0.8), (2,0.4)]] , 

[]] 

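Note: I do not have access to the original text by the time the data reaches my part of the pipeline.

How do I cluster a list of lists of (tag, probability) tuples? Is there something in numpy, scipy, sklearn or any other Python-able ML suite that does this? Or even NLTK?

Assume that the number of clusters is fixed but the cluster sizes are not.

So far I have only tried taking the maximum value as the centroid of each cluster, but that only gives me the first value in each cluster: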

instream = [ [(0,0.3), (1,0.5), (2,0.1)], 
         [(0,0.5), (1,0.3), (2,0.3)], 
         [(0,0.4), (1,0.4), (2,0.5)], 
         [(0,0.3), (1,0.7), (2,0.2)], 
         [(0,0.2), (1,0.6), (2,0.1)] ] 

# Find centroid: for each tag, take the largest (tag, probability) tuple. 
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0] 
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0] 
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0] 

# Index of the first document that contains that tuple. 
c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0] 
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0] 
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0] 

print(instream[c1_centroid]) 
print(instream[c2_centroid]) 
print(instream[c2_centroid])  # prints c2_centroid a second time; c3_centroid was probably intended 

[out] (top element in each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)] 
[(0, 0.3), (1, 0.7), (2, 0.2)] 
[(0, 0.3), (1, 0.7), (2, 0.2)] 
+0

It would help if you could show some example input/output. Explain a bit more about what exactly you are doing, and make sure it is not an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). –

+0

@InbarRose, I have edited the question to give more context. – alvas

+2

Shouldn't the third line of your out be [(0,0.4),(1,0.4),(2,0.5)]? –

Answers

2

If I understand correctly, this is what you want.

import numpy as np 

N_TYPES = 3 

instream = [ [(0,0.3), (1,0.5), (2,0.1)], 
      [(0,0.5), (1,0.3), (2,0.3)], 
      [(0,0.4), (1,0.4), (2,0.5)], 
      [(0,0.3), (1,0.7), (2,0.2)], 
      [(0,0.2), (1,0.6), (2,0.1)] ] 
instream = np.array(instream)  # shape (n_docs, N_TYPES, 2); the tags become floats 

# this removes document tags because we only consider probabilities here 
values = instream[:, :, 1] 

# determine the cluster of each document by using maximum probability 
# (array operations are used instead of map(), which is lazy on Python 3) 
belongs_to = np.argmax(values, axis=1) 

# construct clusters of indices to your instream 
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)] 

# apply the indices to obtain full output 
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)] 

The resulting out:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]], 

[[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]], 
    [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]], 
    [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]], 

[[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]] 
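Note that the output above holds [0.0, 0.5] lists of floats rather than the original (0, 0.5) tuples, because np.array converts the nested tuples into a float array. If the tuples need to survive, one option is to use numpy only for the argmax and group the plain Python list. This is a minimal sketch, not part of the original answer; the names docs, probs and clusters are made up for illustration, and it assumes every document lists its tags in the order 0, 1, 2.

```python
import numpy as np

N_TYPES = 3

# Same sample data as above, kept as a plain Python list of tuples.
docs = [[(0, 0.3), (1, 0.5), (2, 0.1)],
        [(0, 0.5), (1, 0.3), (2, 0.3)],
        [(0, 0.4), (1, 0.4), (2, 0.5)],
        [(0, 0.3), (1, 0.7), (2, 0.2)],
        [(0, 0.2), (1, 0.6), (2, 0.1)]]

# Use numpy only to find the most probable tag of each document.
# Assumption: tags appear in the order 0, 1, 2 inside every document.
probs = np.array([[p for _, p in doc] for doc in docs])
belongs_to = probs.argmax(axis=1)

# Group the original documents so the (tag, probability) tuples are kept.
clusters = [[doc for doc, k in zip(docs, belongs_to) if k == label]
            for label in range(N_TYPES)]

for label, members in enumerate(clusters):
    print(label, members)
```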

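I used numpy arrays because they support convenient searching and indexing. For example, the expression (belongs_to == 1).nonzero()[0] returns an array of the indices into belongs_to where the value is 1. An example of indexing is instream[cluster_indices[2]].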
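The mask-and-index idiom can be tried in isolation. The sketch below is illustrative only: belongs_to is written out by hand with the labels that the sample data produces, and data is a made-up array standing in for instream.

```python
import numpy as np

# Cluster label of each of the five sample documents, written out by hand.
belongs_to = np.array([1, 0, 2, 1, 1])

# Comparing an array with a scalar gives a boolean mask of the same length.
mask = belongs_to == 1

# nonzero() returns a tuple with one index array per dimension; [0] picks
# the index array for the only dimension here.
indices = mask.nonzero()[0]

# An integer index array selects the matching rows of another array.
data = np.array([10, 20, 30, 40, 50])  # hypothetical stand-in for instream
selected = data[indices]

print(mask)
print(indices)
print(selected)
```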

0

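Why keep the indices in the tuples? The 0, 1, 2 are redundant and, if I understand correctly, carry no information. Just feed the n_samples x 3 list of probabilities to any scikit-learn algorithm. Or, if you only want the most probable label assignment, do np.argmax(X, axis=1)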
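A rough sketch of that suggestion, using the second example from the question. The variable names are made up for illustration, and the code assumes the tags always appear in the order 0, 1, 2 so they can simply be dropped.

```python
import numpy as np

# Second example input from the question.
instream = [[(0, 0.7), (1, 0.2), (2, 0.4)],
            [(0, 0.5), (1, 0.9), (2, 0.3)],
            [(0, 0.3), (1, 0.8), (2, 0.4)],
            [(0, 0.8), (1, 0.2), (2, 0.2)],
            [(0, 0.1), (1, 0.7), (2, 0.5)]]

# Drop the redundant tags and keep an n_samples x 3 probability matrix.
X = np.array([[p for _, p in doc] for doc in instream])

# Most probable label for each document.
labels = np.argmax(X, axis=1)

# Cluster sizes; minlength keeps a slot for labels that never win.
sizes = np.bincount(labels, minlength=X.shape[1])

print(labels)
print(sizes)
```

Here the third cluster comes out empty, which matches the expected output given in the question. An array shaped like X is also the two-dimensional input that scikit-learn estimators generally expect; which estimator to use is left open by the answer.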