
Index out of range in Spark MLlib k-means with TF-IDF text clustering

I want to run k-means using Spark MLlib, but I am getting an index out of range error.

I split my very small sample input file, and the tokenized output looks like this (the split itself is sketched right after the sample):

['hello', 'world', 'this', 'is', 'earth'] 
['what', 'are', 'you', 'trying', 'to', 'do'] 
['trying', 'to', 'learn', 'something'] 
['I', 'am', 'new', 'at', 'this', 'thing'] 
['what', 'about', 'you'] 
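
The split is just a per-line whitespace tokenization, roughly like the following (a minimal sketch, assuming the usual 'sc' SparkContext from the shell and using sample.txt as a placeholder for my input file):

documents = sc.textFile("sample.txt").map(lambda line: line.split()) 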

Now I run the Spark TF-IDF code on these documents, which gives a sparse representation. The output is (a sketch of the TF-IDF step follows the vectors):

(1048576,[50570,432125,629096,921177,928731], [1.09861228867,1.09861228867,0.69314718056,1.09861228867,1.09861228867]) 
(1048576,[110522,521365,697409,725041,749730,962395],[0.69314718056,1.09861228867,1.09861228867,0.69314718056,0.69314718056,0.69314718056]) 
(1048576,[4471,725041,850325,962395],[1.09861228867,0.69314718056,1.09861228867,0.69314718056]) 
(1048576,[36748,36757,84721,167368,629096,704697],[1.09861228867,1.09861228867,1.09861228867,1.09861228867,0.69314718056,1.09861228867]) 
(1048576,[110522,220898,749730],[0.69314718056,1.09861228867,0.69314718056]) 
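
The TF-IDF code itself is not shown above, but something along these lines produces vectors of this shape; note that HashingTF's default dimension is 2**20 = 1048576, which matches the first field of each vector (a minimal sketch, not my exact code):

from pyspark.mllib.feature import HashingTF, IDF 

hashing_tf = HashingTF()                     # default numFeatures = 1 << 20 = 1048576 
tf = hashing_tf.transform(documents)         # term-frequency SparseVectors 
tf.cache()                                   # IDF makes two passes over the data 
tfidf_vectors = IDF().fit(tf).transform(tf)  # the sparse TF-IDF vectors shown above 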

Now I run the k-means algorithm provided by Spark MLlib:

from math import sqrt 
from pyspark.mllib.clustering import KMeans, KMeansModel 

clusters = KMeans.train(tfidf_vectors, 2, maxIterations=10) 

def error(point): 
    center = clusters.centers[clusters.predict(point)] 
    return sqrt(sum([x**2 for x in (point - center)])) 

WSSSE = tfidf_vectors.map(lambda point: error(point)).reduce(lambda x, y: x + y) 
print("Within Set Sum of Squared Error = " + str(WSSSE)) 

clusters.save(sc, "myModelPath") 
sameModel = KMeansModel.load(sc, "myModelPath") 

But I get an index out of range error at the WSSSE step. What am I doing wrong?


What does your 'clusters' output look like? –


I don't know how to view the output. There are a bunch of files in the myModelPath folder that was created after running the program. If you can tell me which file to look at, I can get back to you. clusters is not iterable, so it cannot be printed. – Nicky

Answer


I ran into a similar problem today, and it looks like it is a bug. TF-IDF creates SparseVectors like this:

>>> from pyspark.mllib.linalg import Vectors 
>>> sv = Vectors.sparse(5, {1: 3}) 

and accessing a value at an index greater than the index of the last non-zero entry raises an exception:

>>> sv[0] 
0.0 
>>> sv[1] 
3.0 
>>> sv[2] 
Traceback (most recent call last): 
... 
IndexError: index out of bounds 
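
Converting the same vector to a dense NumPy array with toArray() makes every position accessible (missing entries are simply 0.0):

>>> sv.toArray()[2] 
0.0 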

A quick, though not very efficient, workaround is to convert the SparseVector to a NumPy array:

def error(point):               
    center = clusters.centers[clusters.predict(point)] 
    return sqrt(sum([x**2 for x in (point.toArray() - center)]))
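
With that change the WSSSE line from the question should run as-is, and the learned centers can be inspected directly, since clusters.centers is just a Python list of dense center vectors (a usage sketch):

WSSSE = tfidf_vectors.map(lambda point: error(point)).reduce(lambda x, y: x + y) 
print("Within Set Sum of Squared Error = " + str(WSSSE)) 

for i, center in enumerate(clusters.centers): 
    print(str(i) + ": " + str(center[:10]))  # first few components of each dense center 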