Index out of range in Spark MLlib k-means with TF-IDF for text clustering
I am trying to run Spark MLlib k-means, but I get an index out of range error.
I split my very small sample input file into tokens, and the output looks like this:
['hello', 'world', 'this', 'is', 'earth']
['what', 'are', 'you', 'trying', 'to', 'do']
['trying', 'to', 'learn', 'something']
['I', 'am', 'new', 'at', 'this', 'thing']
['what', 'about', 'you']
Now I use the TF-IDF code provided by Spark to get a sparse representation. The output is:
(1048576,[50570,432125,629096,921177,928731], [1.09861228867,1.09861228867,0.69314718056,1.09861228867,1.09861228867])
(1048576,[110522,521365,697409,725041,749730,962395],[0.69314718056,1.09861228867,1.09861228867,0.69314718056,0.69314718056,0.69314718056])
(1048576,[4471,725041,850325,962395],[1.09861228867,0.69314718056,1.09861228867,0.69314718056])
(1048576,[36748,36757,84721,167368,629096,704697],[1.09861228867,1.09861228867,1.09861228867,1.09861228867,0.69314718056,1.09861228867])
(1048576,[110522,220898,749730],[0.69314718056,1.09861228867,0.69314718056])
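For readers wondering where the two distinct weights come from: they are consistent with the smoothed IDF formula documented for Spark MLlib, log((m + 1) / (d(t) + 1)), where m is the number of documents and d(t) is the number of documents containing term t. Below is a minimal plain-Python sketch that recomputes the weights for the five token lists above. It assumes every term occurs once per document (true here) and does not reproduce the hashed indices, since those depend on the hashing function Spark uses.

```python
from math import log

# The five tokenized documents from the question
docs = [
    ['hello', 'world', 'this', 'is', 'earth'],
    ['what', 'are', 'you', 'trying', 'to', 'do'],
    ['trying', 'to', 'learn', 'something'],
    ['I', 'am', 'new', 'at', 'this', 'thing'],
    ['what', 'about', 'you'],
]

m = len(docs)  # number of documents

# Document frequency: in how many documents each term appears
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Smoothed IDF, as documented for Spark MLlib: log((m + 1) / (d(t) + 1))
idf = {term: log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

print(round(idf['hello'], 11))  # term in 1 document  -> 1.09861228867
print(round(idf['this'], 11))   # term in 2 documents -> 0.69314718056
```

Terms that occur in a single document get ln 3, and terms shared by two documents ('this', 'what', 'you', 'trying', 'to') get ln 2, which matches the values in the vectors above.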
Now I run the k-means algorithm provided by Spark MLlib:
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel

clusters = KMeans.train(tfidf_vectors, 2, maxIterations=10)

def error(point):
    # Euclidean distance from a point to its assigned cluster center
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = tfidf_vectors.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
clusters.save(sc, "myModelPath")
sameModel = KMeansModel.load(sc, "myModelPath")
But I get an index out of range error at the WSSSE step. What am I doing wrong?
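One thing that may be worth checking (an assumption, not something confirmed in this thread): the TF-IDF vectors are sparse, while the cluster centers are dense, and the expression `point - center` mixes the two. Below is a minimal plain-Python sketch of the same per-point error computed on densified vectors. The helpers `densify` and `nearest_center` and the toy 6-dimensional data are hypothetical stand-ins, not MLlib APIs. In PySpark the analogous step would be converting the point with `point.toArray()` before subtracting.

```python
from math import sqrt

def densify(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def squared_distance(a, b):
    """Squared Euclidean distance between two dense vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    """Index of the closest center (hypothetical stand-in for clusters.predict)."""
    return min(range(len(centers)),
               key=lambda k: squared_distance(point, centers[k]))

def error(point, centers):
    """Euclidean distance from a point to its nearest center."""
    center = centers[nearest_center(point, centers)]
    return sqrt(squared_distance(point, center))

# Toy data (hypothetical): three 6-dimensional points and two centers
points = [
    densify(6, [0, 2], [1.0, 2.0]),
    densify(6, [0, 2], [1.0, 4.0]),
    densify(6, [5], [3.0]),
]
centers = [densify(6, [0, 2], [1.0, 3.0]), densify(6, [5], [3.0])]

# Same aggregation as the WSSSE line in the question: sum of per-point errors
wssse = sum(error(p, centers) for p in points)
print("Within Set Sum of Squared Error = " + str(wssse))  # 2.0
```

Like the code in the question, this sums Euclidean distances rather than squared distances. Dropping the `sqrt` would give the sum of squared errors that the name WSSSE suggests.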
What does the output of your 'clusters' look like? –
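I don't know how to view the output. There are a bunch of files in the myModelPath folder that was created after running the program. If you can tell me which file to look at, I can get back to you. clusters is not iterable, so it can't be printed. – Nicky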