2016-10-27 54 views
0

How to select columns dynamically

Suppose I have fit a random forest model and obtained the following variable importance information:

set.seed(121) 
ImpMeasure<-data.frame(mod.varImp$importance) 
ImpMeasure$Vars<-row.names(ImpMeasure) 
ImpMeasure.df<-ImpMeasure[order(-ImpMeasure$Overall),] 
row.names(ImpMeasure.df)<-NULL 
class(ImpMeasure.df) 
ImpMeasure.df<-ImpMeasure.df[,c(2,1)] # so now we have the importance variable info in a data frame 
ImpMeasure.df 

                        Vars    Overall
1            num_voted_users 100.000000
2     num_critic_for_reviews  58.961441
3       num_user_for_reviews  56.500707
4       movie_facebook_likes  50.680318
5  cast_total_facebook_likes  30.012205
6                      gross  27.652559
7     actor_3_facebook_likes  24.094213
8     actor_2_facebook_likes  19.633290
9                 imdb_score  16.063007
10    actor_1_facebook_likes  15.848972
11                  duration  11.886036
12                    budget  11.853066
13                title_year   7.804387
14   director_facebook_likes   7.318787
15      facenumber_in_poster   1.868376
16              aspect_ratio   0.000000

Now, if I decide that I want to do further analysis on only the top 5 variables, I do this:

library(dplyr) 
top.var<-ImpMeasure.df[1:5,] %>% select(Vars) 
top.var 

        Vars 
1   num_voted_users 
2 num_critic_for_reviews 
3  num_user_for_reviews 
4  movie_facebook_likes 
5 cast_total_facebook_likes 

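How can I use this information without spelling out the actual variable names, that is, by using the top.var output directly? How can I use dplyr's select to pick only these variables from the original dataset (shown below)?

My original dataset looks like this: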

num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes 
1    723  178      0     855 
2    302  169      563     1000 
3    602  148      0     161 
4    813  164     22000     23000 
5    255  95      131     782 
6    462  132      475     530 
actor_1_facebook_likes  gross num_voted_users cast_total_facebook_likes 
1    1000  760505847  886204      4834 
2    40000  309404152  471220      48350 
3    11000  200074175  275868      11700 
4    27000  448130642  1144337     106759 
5    131  228830   8      143 
6    640  73058679  212204      1873 
facenumber_in_poster num_user_for_reviews budget title_year 
1     0     3054 237000000  2009 
2     0     1238 300000000  2007 
3     1     994 245000000  2015 
4     0     2701 250000000  2012 
5     0     97 26000000  2002 
6     1     738 263700000  2012 
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster 
1     936  7.9   1.78    33000  2 
2     5000  7.1   2.35     0  3 
3     393  6.8   2.35    85000  2 
4     23000  8.5   2.35    164000  3 
5      12  7.1   1.85     0  1 
6     632  6.6   2.35    24000  2 

Answers

0
# top.vars is assumed to be a character vector of column names, e.g. top.var$Vars
movies.imp <- moviesdf.cluster %>% select(one_of(top.vars), cluster)
head(movies.imp) 
## num_voted_users num_user_for_reviews num_critic_for_reviews 
## 1   886204   3054     723 
## 2   471220   1238     302 
## 3   275868   994     602 
## 4   1144337   2701     813 
## 5    8   127      37 
## 6   212204   738     462 
## movie_facebook_likes cast_total_facebook_likes cluster 
## 1    33000      4834  1 
## 2     0      48350  1 
## 3    85000      11700  1 
## 4    164000     106759  1 
## 5     0      143  2 
## 6    24000      1873  1 

That did it!
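For anyone who wants to try the idea without the movie data, here is a minimal sketch. The data frames `imp` and `dat` and their column names are hypothetical stand-ins for the importance table and the original dataset, not objects from the question. The key point is that `one_of()` expects a character vector, so the names are pulled out of the data frame column first.

```r
library(dplyr)

# Hypothetical toy stand-ins for the importance table and the original data
imp <- data.frame(
  Vars = c("a", "b", "c"),
  Overall = c(100, 60, 10),
  stringsAsFactors = FALSE
)
dat <- data.frame(a = 1:3, b = 4:6, c = 7:9, cluster = c(2, 3, 2))

# Pull the top-2 names out as a plain character vector, not a data frame
top.vars <- imp$Vars[1:2]

# one_of() matches columns by the names stored in the vector;
# cluster is added by name as usual
dat.imp <- dat %>% select(one_of(top.vars), cluster)
names(dat.imp)
```

Changing `1:2` to another range changes which columns are kept, with no variable names typed by hand.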

0

Hadley provided an answer to this question here

select_(df, .dots = top.var)
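As far as I know, the underscored verbs such as `select_()` were deprecated in later dplyr releases, so the call above may warn or fail on a current installation. A rough sketch of two alternatives follows, assuming dplyr 1.0.0 or later for `all_of()`. The data frame `df` and the vector `vars` here are hypothetical toy objects, with `vars` playing the role of the top-variable names.

```r
library(dplyr)

# Hypothetical toy data; vars stands in for the top-variable names
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)
vars <- c("x", "z")

# all_of() selects the named columns and errors if any name is missing from df
sel.dplyr <- df %>% select(all_of(vars))

# Base R equivalent: index columns by a character vector,
# drop = FALSE keeps a data frame even when only one column is selected
sel.base <- df[, vars, drop = FALSE]

names(sel.dplyr)
names(sel.base)
```

If missing names should be skipped silently rather than raise an error, `any_of()` can be used in place of `all_of()`.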