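Clustering histograms using Earth Mover's Distance in R
My data is represented as many different histograms of a single variable. I want to use unsupervised clustering to determine which histograms are similar. I would also like to know the optimal number of clusters to use.
I have read about the Earth Mover's Distance (EMD) as a measure of the distance between histograms, but I do not know how to use that distance measure in common clustering algorithms (e.g., k-means).
Primary: Which R packages and functions can I use to cluster histograms?
Secondary: How can I determine the "best" number of clusters?
Example dataset 1 (3 unimodal clusters):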
v1 <- rnorm(n=100, mean = 10, sd = 1) # cluster 1 (around 10)
v2 <- rnorm(n=100, mean = 50, sd = 5) # cluster 2 (around 50)
v3 <- rnorm(n=100, mean = 100, sd = 10) # cluster 3 (around 100)
v4 <- rnorm(n=100, mean = 12, sd = 2) # cluster 1
v5 <- rnorm(n=100, mean = 45, sd = 6) # cluster 2
v6 <- rnorm(n=100, mean = 95, sd = 6) # cluster 3
Example dataset 2 (3 bimodal clusters):
b1 <- c(rnorm(n=100, mean=9, sd=2) , rnorm(n=100, mean=200, sd=20)) # cluster 1 (around 10 and 200)
b2 <- c(rnorm(n=100, mean=50, sd=5), rnorm(n=100, mean=100, sd=10)) # cluster 2 (around 50 and 100)
b3 <- c(rnorm(n=100, mean=99, sd=8), rnorm(n=100, mean=175, sd=17)) # cluster 3 (around 100 and 175)
b4 <- c(rnorm(n=100, mean=12, sd=2), rnorm(n=100, mean=180, sd=40)) # cluster 1
b5 <- c(rnorm(n=100, mean=45, sd=6), rnorm(n=100, mean=80, sd=30)) # cluster 2
b6 <- c(rnorm(n=100, mean=95, sd=6), rnorm(n=100, mean=170, sd=25)) # cluster 3
b7 <- c(rnorm(n=100, mean=10, sd=1), rnorm(n=100, mean=210, sd=30)) # cluster 1 (around 10 and 200)
b8 <- c(rnorm(n=100, mean=55, sd=5), rnorm(n=100, mean=90, sd=15)) # cluster 2 (around 50 and 100)
b9 <- c(rnorm(n=100, mean=89, sd=9), rnorm(n=100, mean=165, sd=20)) # cluster 3 (around 100 and 175)
b10 <- c(rnorm(n=100, mean=8, sd=2), rnorm(n=100, mean=160, sd=30)) # cluster 1
b11 <- c(rnorm(n=100, mean=55, sd=6), rnorm(n=100, mean=110, sd=10)) # cluster 2
b12 <- c(rnorm(n=100, mean=105, sd=6), rnorm(n=100, mean=185, sd=21)) # cluster 3
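EMD is quite expensive to compute, so you will want to use lower bounds and indexing to speed up your clustering. K-means only works with Bregman divergences, and I don't think EMD is one of them.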
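Since k-means is not a natural fit here, one option is a method that works directly from a pairwise distance matrix. The following is only a minimal sketch, not a definitive answer. It assumes the raw samples are available (not just binned counts) and that every sample has the same size, in which case the one-dimensional EMD between two empirical distributions reduces to the mean absolute difference of the sorted values. The helpers `emd_1d()` and `emd_matrix()` are hypothetical names made up for illustration; `pam()` comes from the `cluster` package, and the average silhouette width is used as one possible criterion for choosing the number of clusters.

```r
# Sketch only: emd_1d() and emd_matrix() are hypothetical helpers, not package functions.
library(cluster)  # provides pam() (k-medoids) and silhouette information

set.seed(42)  # fixed seed so the sketch is reproducible
samples <- list(
  v1 = rnorm(100, mean = 10,  sd = 1),
  v2 = rnorm(100, mean = 50,  sd = 5),
  v3 = rnorm(100, mean = 100, sd = 10),
  v4 = rnorm(100, mean = 12,  sd = 2),
  v5 = rnorm(100, mean = 45,  sd = 6),
  v6 = rnorm(100, mean = 95,  sd = 6)
)

# 1-D EMD between two equal-size samples:
# mean absolute difference of the order statistics.
emd_1d <- function(x, y) {
  stopifnot(length(x) == length(y))  # assumption: equal sample sizes
  mean(abs(sort(x) - sort(y)))
}

# Pairwise EMD for a named list of samples, returned as a "dist" object.
emd_matrix <- function(xs) {
  n <- length(xs)
  d <- matrix(0, n, n, dimnames = list(names(xs), names(xs)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d[i, j] <- d[j, i] <- emd_1d(xs[[i]], xs[[j]])
    }
  }
  as.dist(d)
}

d <- emd_matrix(samples)

# PAM clusters from dissimilarities, so no centroid (mean histogram) is needed.
# Try each candidate k and keep the one with the largest average silhouette width.
ks <- 2:(length(samples) - 1)
avg_sil <- sapply(ks, function(k) pam(d, k = k)$silinfo$avg.width)
best_k <- ks[which.max(avg_sil)]

fit <- pam(d, k = best_k)
print(best_k)
print(fit$clustering)
```

The same `dist` object can also be passed to `hclust()` if a dendrogram is preferred. For many histograms the O(n^2) distance matrix becomes the bottleneck, which is where the lower bounds and indexing mentioned above would come in.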