Selecting the most dissimilar individuals using cluster analysis

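I want to cluster my data into 5 clusters and then select, from all the data, the 50 individuals that are most dissimilar to one another. The selection should be proportional to cluster size: if the first cluster contains 100 individuals, the second 200, the third 400, the fourth 200 and the fifth 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.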

Example data:

 mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))

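What I have done so far is cluster the data and rank the individuals within each cluster, then export the result to Excel and continue from there. This has become a problem now that my data has grown very large.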

I would appreciate any help or suggestions on how to do the above in R.

Source

2013-10-07 hema

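Do you need help w/ the R *commands* to get this working, or w/ *understanding* the procedure to use? This sounds like a conceptual question about statistics rather than a programming question about R. If so, this Q would be better migrated to [Cross Validated](http://stats.stackexchange.com/) (i.e., stats.SE). – gung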

Statistically it is very clear. I need help on how to do this in R. – hema

What R code do you have so far? –

I'm not sure whether this is what you are looking for, but maybe it helps:

mydata<-matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1)) 
rownames(mydata) <- paste0("id", 1:100) # some id for identification 


# calculate the dissimilarity matrix, then cluster and cut the tree into 5 groups
sim <- dist(mydata, diag = TRUE, upper = TRUE)
cl <- cutree(hclust(sim), k = 5)

# combine results, take sum to aggregate dissimilarity 
res <- data.frame(id=rownames(mydata), 
        cluster=cl, dis_sim=rowSums(as.matrix(sim))) 
# order, lowest overall dissimilarity will be first 
res <- res[order(res$dis_sim), ] 


# split object 
reslist <- split(res, f=res$cluster) 


## take the three items with the highest overall dissimilarity per cluster
lapply(reslist, tail, n = 3)

## return the ids with the highest overall dissimilarity, top 20% per cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x) * p)), p = 0.2)
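
The question asks for a fixed total (50 individuals) split across clusters in proportion to their size, rather than a fixed share per cluster. The following is a minimal sketch of that allocation, assuming the same summed-dissimilarity ranking as above. The names `n_total`, `n_per_cluster` and `picked`, the seed, and the target of 20 (chosen to fit the 100-row example data) are illustrative assumptions, not part of the original answer.

```r
# Sketch only: proportional allocation of a fixed sample size across clusters.
set.seed(1)  # assumption: seed added only to make the sketch reproducible
mydata <- matrix(rnorm(1000), nrow = 100, ncol = 10)
rownames(mydata) <- paste0("id", 1:100)

sim <- dist(mydata)
cl  <- cutree(hclust(sim), k = 5)
res <- data.frame(id = rownames(mydata), cluster = cl,
                  dis_sim = rowSums(as.matrix(sim)))

n_total <- 20  # hypothetical target; the question uses 50 out of 1000

# number to draw from each cluster, proportional to cluster size
sizes <- table(res$cluster)
n_per_cluster <- setNames(round(n_total * as.numeric(sizes) / sum(sizes)),
                          names(sizes))

picked <- do.call(rbind, lapply(split(res, res$cluster), function(x) {
  k <- n_per_cluster[[as.character(x$cluster[1])]]
  # sort by summed dissimilarity, largest first, and keep the top k rows
  head(x[order(x$dis_sim, decreasing = TRUE), ], k)
}))
```

Because of rounding, `sum(n_per_cluster)` may differ slightly from `n_total`; if the exact total matters, the counts would need a small manual adjustment.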

Source

2013-10-07 14:27:34 holzben

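Dear Holzben, it really helped, thank you. One more thing: within a cluster, how do I pick the individuals closest to the cluster centroid? Thanks again for your nice code and reply. – hema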

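Regarding your comment, find the code below.

Please note that the code could be improved in terms of elegance and efficiency. I posted it as a second answer because it would otherwise get messy.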

# calculation of centroids based on: 
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html 
cl <- hclust(dist(mydata, diag = TRUE, upper=TRUE)) 
cent <- tapply(mydata, 
     list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean) 
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]]) 


# add up cluster number and data and split by cluster 
newdf <- data.frame(data=mydata, cluster=cutree(cl, k=5)) 
newdfl <- split(newdf, f=newdf$cluster) 

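# prepend each cluster's centroid as the first row and drop the cluster column
# (dropped by name rather than by the hard-coded position 11)
totaldf <- lapply(seq_along(newdfl), 
      function(i, li, cen) rbind(cen[i, ], li[[i]][ , names(li[[i]]) != "cluster"]), 
           li=newdfl, cen=cent) 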


# calculate the distances to the centroids (row 1 of each block) and sort them 
dist_to_cent <- lapply(totaldf, function(x) 
        sort(as.matrix(dist(x, diag=TRUE, upper=TRUE))[1, ])) 
dist_to_cent

For the calculation of centroids from an hclust result, see the R mailing list post linked in the code.
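
As a possible simplification, the distance of each individual to the centroid of its own cluster can also be computed directly, without binding the centroid onto each block and calling `dist` again. This is a sketch under the assumption that plain Euclidean distance is wanted; `grp`, `d_cent` and `closest` are names introduced here for illustration.

```r
# Sketch only: distance of every individual to its own cluster centroid.
set.seed(1)  # assumption: seed added only to make the sketch reproducible
mydata <- matrix(rnorm(1000), nrow = 100, ncol = 10)
rownames(mydata) <- paste0("id", 1:100)

grp <- cutree(hclust(dist(mydata)), k = 5)

# centroid of each cluster: column sums per cluster divided by cluster size
cent <- rowsum(mydata, grp) / as.vector(table(grp))

# Euclidean distance of each row to the centroid of the cluster it belongs to
d_cent <- sqrt(rowSums((mydata - cent[as.character(grp), ])^2))

# per cluster, individuals sorted from closest to farthest from the centroid
closest <- lapply(split(d_cent, grp), sort)
```

`lapply(closest, head, n = 3)` would then list the three individuals nearest to each centroid.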

Source

2013-10-07 19:10:36 holzben

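Thanks for your time. Based on the example data, I think it might be better to use kmeans and cluster the data into 50 clusters (since I want to select 50 individuals), then pick from each cluster the one individual closest to the cluster center. What do you think? Sorry to bother you with so many questions. – hema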

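If you are interested in analyzing centroids, kmeans is obviously a more natural choice than hierarchical clustering... In my example I started with hierarchical clustering, so I did the same in the second example. Your suggestion sounds fine, but I'm not sure what your overall goal is... – holzben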
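
The kmeans idea from the comments could look roughly like the sketch below: cluster into as many groups as individuals are wanted, then keep the one individual nearest to each cluster center. It assumes Euclidean distance; `k` is set to 10 here only to suit the 100-row example data (the comment proposes 50), and `d_center` and `representatives` are illustrative names.

```r
# Sketch only: one representative per k-means cluster.
set.seed(1)  # assumption: seed added because kmeans starts from random centers
mydata <- matrix(rnorm(1000), nrow = 100, ncol = 10)
rownames(mydata) <- paste0("id", 1:100)

k  <- 10  # hypothetical; the comment proposes 50 for the full data set
km <- kmeans(mydata, centers = k, nstart = 10)

# distance of each individual to the center of its assigned cluster
d_center <- sqrt(rowSums((mydata - km$centers[km$cluster, ])^2))

# for every cluster, the id of the individual nearest to its center
representatives <- tapply(seq_len(nrow(mydata)), km$cluster, function(idx) {
  rownames(mydata)[idx[which.min(d_center[idx])]]
})
```

Note that this picks individuals that are typical of their cluster, which is not the same criterion as the summed dissimilarity used in the first answer.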
