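I have a bunch of texts that are categorized into categories, and each document is tagged 0, 1 or 2 with a probability for each tag. How do I cluster a list of lists of (tag, probability) tuples? - python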
[ "this is a foo bar",
"bar bar black sheep",
"sheep is an animal",
"foo foo bar bar",
"bar bar sheep sheep" ]
A tool earlier in the pipeline returns a list of lists of tuples like the following, where each element of the outer list is one document:
[ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
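Most clustering routines expect one numeric vector per document rather than a list of (tag, probability) pairs. A minimal sketch of that conversion, assuming the tags are always the integers 0, 1 and 2 (the helper name `to_vector` is made up for illustration):

```python
def to_vector(doc, n_labels=3):
    """Turn [(tag, prob), ...] into a list of probabilities indexed by tag."""
    vec = [0.0] * n_labels
    for tag, prob in doc:
        vec[tag] = prob
    return vec

tagged = [[(0, 0.3), (1, 0.5), (2, 0.1)],
          [(0, 0.5), (1, 0.3), (2, 0.3)],
          [(0, 0.4), (1, 0.4), (2, 0.5)],
          [(0, 0.3), (1, 0.7), (2, 0.2)],
          [(0, 0.2), (1, 0.6), (2, 0.1)]]

# One row per document, one column per tag.
vectors = [to_vector(doc) for doc in tagged]
print(vectors[0])  # [0.3, 0.5, 0.1]
```

Indexing by tag rather than by position means the result does not depend on the order of the tuples inside a document.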
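I need to see which tag is the most probable in each list of tuples. All I have to work with is the fact that each document is tagged 0, 1 or 2 with these probabilities, and the output I want to achieve is: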
[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] ,
[[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
[[(0,0.4), (1,0.4), (2,0.5)]] ]
As another example:
[in]:
[ [(0,0.7), (1,0.2), (2,0.4)],
[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.3), (1,0.8), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)],
[(0,0.1), (1,0.7), (2,0.5)] ]
[out]:
[[[(0,0.7), (1,0.2), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)]] ,
[[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.1), (1,0.7), (2,0.5)],
[(0,0.3), (1,0.8), (2,0.4)]] ,
[]]
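If "most probable tag" is taken literally, the second example can be reproduced without any clustering library by grouping each document under the tag with the highest probability. This is only a sketch under that assumption: `group_by_top_tag` is a hypothetical helper, ties go to the first tag listed, and documents inside a cluster keep their input order, which differs slightly from the order shown in the example.

```python
def group_by_top_tag(docs, n_labels=3):
    """Put each document into the cluster of its most probable tag."""
    clusters = [[] for _ in range(n_labels)]
    for doc in docs:
        # max() returns the first (tag, prob) pair with the highest probability.
        top_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[top_tag].append(doc)
    return clusters

docs = [[(0, 0.7), (1, 0.2), (2, 0.4)],
        [(0, 0.5), (1, 0.9), (2, 0.3)],
        [(0, 0.3), (1, 0.8), (2, 0.4)],
        [(0, 0.8), (1, 0.2), (2, 0.2)],
        [(0, 0.1), (1, 0.7), (2, 0.5)]]

clusters = group_by_top_tag(docs)
for tag, members in enumerate(clusters):
    print(tag, members)
```

The number of clusters stays fixed at `n_labels`, and a tag that never wins simply ends up with an empty list, as in the third cluster of the example.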
NOTE: I do not have access to the original text by the time the data reaches my part of the pipeline.
How do I cluster a list of lists of tuples of tags and probabilities? Is there something in numpy, scipy, sklearn or any other python-able ML suite that does this? Or even NLTK.
Let's assume that the number of clusters is fixed but the cluster sizes are not.
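A fixed number of clusters with free cluster sizes is the setting that k-means handles. The following is a rough pure-Python sketch of Lloyd's algorithm, not a library API: the function name `kmeans`, the choice to start one centroid at each "pure" tag (probability 1 for that tag, 0 elsewhere), and the rule that an empty cluster keeps its old centroid are all assumptions made for illustration. The input vectors are the probabilities from the second example, listed in tag order.

```python
def kmeans(vectors, centroids, max_iter=20):
    """Lloyd's algorithm; returns (clusters of document indices, centroids)."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for idx, vec in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(idx)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                dims = range(len(old))
                new_centroids.append(
                    [sum(vectors[i][d] for i in members) / len(members) for d in dims])
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return clusters, centroids

vectors = [[0.7, 0.2, 0.4],
           [0.5, 0.9, 0.3],
           [0.3, 0.8, 0.4],
           [0.8, 0.2, 0.2],
           [0.1, 0.7, 0.5]]
# Assumed starting point: one centroid per tag.
initial = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

clusters, centroids = kmeans(vectors, initial)
print(clusters)  # [[0, 3], [1, 2, 4], []]
```

On this data the result matches grouping by the most probable tag, and the third cluster stays empty. With other starting centroids or other data the two approaches can disagree.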
So far I have only tried finding the maximum value as the centroid, but that only gives me the first value in each cluster:
instream = [ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
# Find centroid: the largest (tag, probability) tuple for each tag.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]
# Index of the first document that contains each of those tuples.
c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]
print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c2_centroid])  # prints c2_centroid a second time; c3_centroid was probably intended
[out] (top element in each cluster):
[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
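For comparison, a shorter sketch of the same "top element per tag" idea, written so that each tag is looked up by its own index. The helper name `top_document_per_tag` is made up for illustration, and ties resolve to the earliest document.

```python
def top_document_per_tag(docs, n_labels=3):
    """For each tag, return the document with the highest probability for it."""
    tops = []
    for tag in range(n_labels):
        # dict(doc) maps tag -> probability for one document.
        best = max(docs, key=lambda doc: dict(doc)[tag])
        tops.append(best)
    return tops

instream = [[(0, 0.3), (1, 0.5), (2, 0.1)],
            [(0, 0.5), (1, 0.3), (2, 0.3)],
            [(0, 0.4), (1, 0.4), (2, 0.5)],
            [(0, 0.3), (1, 0.7), (2, 0.2)],
            [(0, 0.2), (1, 0.6), (2, 0.1)]]

for doc in top_document_per_tag(instream):
    print(doc)
```

This only picks one representative per tag; it does not assign the remaining documents to clusters.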
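It would help if you could show some example input and output. Explain a bit more about what exactly you are doing, and also make sure it is not an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). –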
@InbarRose, I have edited the question to give more context. – alvas
Shouldn't the third line of your output be [(0,0.4),(1,0.4),(2,0.5)]? –