2016-01-08

Convert a binary data frame to strings

I have an R data frame of IMDB movies.

(The data file is here: http://had.co.nz/data/movies/movies.tab.gz)

The genres are defined by a binary table:

$ Action  (int) 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,... 
$ Animation (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 
$ Comedy  (int) 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,... 
$ Drama  (int) 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,... 
$ Documentary (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 
$ Romance  (int) 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,... 
$ Short  (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,... 

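What I would like to know: is there an elegant, R-native way to convert this binary table into strings like "Comedy, Romance" and insert them into the same data frame?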

Thanks a lot for your help!

Please show a small reproducible example and the expected output instead of the '.gz' file. – akrun

Answers

I think this is what you want.

# Create some toy data like yours 
set.seed(1) 
n <- 5 
ds <- as.data.frame(replicate(7, sample(0:1, n, replace = TRUE))) 
names(ds) <- c("Action", "Animation", "Comedy", "Drama", 
       "Documentary", "Romance", "Short") 
print(ds) 
# Action Animation Comedy Drama Documentary Romance Short 
#1  0   1  0  0   1  0  0 
#2  0   1  0  1   0  0  1 
#3  1   1  1  1   1  0  0 
#4  1   1  0  0   0  1  0 
#5  0   0  1  1   0  0  1 

# Use each row as indicator vector 
apply(ds, 1, function(r) paste(names(ds)[as.logical(r)], collapse = ", ")) 
#[1] "Animation, Documentary"      
#[2] "Animation, Drama, Short"      
#[3] "Action, Animation, Comedy, Drama, Documentary" 
#[4] "Action, Animation, Romance"     
#[5] "Comedy, Drama, Short" 
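
To get the strings into the same data frame, as the question asks, the result of apply() can be assigned to a new column. A minimal sketch, assuming the genre columns are known by name; `movies`, `genre_cols` and the new `genres` column are made-up names for illustration, and the toy values are not taken from the IMDB file:

```r
# Toy data: one non-genre column plus two 0/1 genre columns (hypothetical values)
movies <- data.frame(
  title   = c("A", "B", "C"),
  Comedy  = c(1L, 0L, 1L),
  Romance = c(1L, 0L, 0L),
  stringsAsFactors = FALSE
)
genre_cols <- c("Comedy", "Romance")

# Compare only the genre columns against 1 to get a logical matrix,
# then paste the names of the TRUE columns together row by row
movies$genres <- apply(movies[genre_cols] == 1, 1,
                       function(r) paste(genre_cols[r], collapse = ", "))

print(movies)
```

Restricting apply() to the genre columns matters on the real data, where the other columns (title, year, rating and so on) are not 0/1 indicators. A row with no genre set ends up with an empty string.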

Here is an option using data.table:

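library(data.table) 
library(reshape2) 
# Melt the matrix to long form (Var1 = row, Var2 = genre), keep the 1s, 
# then collapse the genres of each row into one comma-separated string 
setDT(melt(as.matrix(ds)))[value != 0][, toString(Var2), by = Var1] 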
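
One thing to keep in mind with the melt approach: a row with no genre set has no nonzero value, so it drops out of the result. A rough sketch of a data.table-only variant that joins the strings back onto the original table, so such rows are kept with NA; the `id`, `genre` and `genres` names and the toy values are assumptions made for illustration:

```r
library(data.table)

# Toy 0/1 genre table (hypothetical values); row 3 has no genre at all
ds <- data.table(Action = c(0L, 1L, 0L), Comedy = c(1L, 1L, 0L))
ds[, id := .I]  # explicit row id to melt and join on

# Long form: one row per (id, genre) pair
long <- melt(ds, id.vars = "id", variable.name = "genre")

# Keep the 1s and collapse the genres of each id into one string
labels <- long[value == 1, .(genres = toString(genre)), keyby = id]

# Update join: ids without any genre stay NA
ds[labels, genres := i.genres, on = "id"]
print(ds)
```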

Another option; I would also go with data.table:

library(readr) 
library(data.table) 
dt <- read_tsv("http://had.co.nz/data/movies/movies.tab.gz") 
dt <- setkey(melt(setDT(dt), id.vars=1:17)[value==1], "title") 
(dt <- unique(dt[dt[, .(categories=list(variable)), by=title]][, c("variable", "value"):=NULL])) 
#       title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 mpaa  categories 
#  1:      $ 1971 121  NA 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5 24.5 14.5 4.5 4.5 NA Comedy,Drama 
#  2:  $1000 a Touchdown 1939  71  NA 6.0 20 0.0 14.5 4.5 24.5 14.5 14.5 14.5 4.5 4.5 14.5 NA   Comedy 
#  3: $21 a Day Once a Month 1941  7  NA 8.2  5 0.0 0.0 0.0 0.0 0.0 24.5 0.0 44.5 24.5 24.5 NA Animation,Short 
#  4:     $40,000 1996  70  NA 8.2  6 14.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 34.5 45.5 NA   Comedy 
#  5:     $pent 2000  91  NA 4.3 45 4.5 4.5 4.5 14.5 14.5 14.5 4.5 4.5 14.5 14.5 NA   Drama 
# ---                                 
# 44177:     sIDney 2002  15  NA 7.0  8 14.5 0.0 0.0 14.5 0.0 0.0 24.5 14.5 14.5 24.5 NA Action,Short 
# 44178:    tom thumb 1958  98  NA 6.5 274 4.5 4.5 4.5 4.5 14.5 14.5 24.5 14.5 4.5 4.5 NA  Animation 
# 44179:    www.XXX.com 2003 105  NA 1.1 12 45.5 0.0 0.0 0.0 0.0 0.0 24.5 0.0 0.0 24.5 NA Drama,Romance 
# 44180:      xXx 2002 132 85000000 5.5 18514 4.5 4.5 4.5 4.5 14.5 14.5 14.5 14.5 4.5 4.5 PG-13   Action 
# 44181: xXx: State of the Union 2005 101 87000000 3.9 1584 24.5 4.5 4.5 4.5 4.5 14.5 4.5 4.5 4.5 14.5 PG-13   Action 

You may want to leave the categories column as a vector or list so that it is easy to work with:

head(dt$categories, 2) 
# [[1]] 
# [1] Comedy Drama 
# Levels: Action Animation Comedy Drama Documentary Romance Short 
# 
# [[2]] 
# [1] Comedy 
# Levels: Action Animation Comedy Drama Documentary Romance Short 
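
To illustrate why the list form is convenient, here is a small base R sketch. The `categories` list below is a hypothetical stand-in for `dt$categories`, with made-up values; the helper names are also just for illustration:

```r
# Hypothetical stand-in for dt$categories: a list of factor vectors
lv <- c("Action", "Comedy", "Drama")
categories <- list(factor(c("Comedy", "Drama"), levels = lv),
                   factor("Comedy", levels = lv),
                   factor("Action", levels = lv))

# Which entries are tagged as Drama?
is_drama <- vapply(categories, function(x) "Drama" %in% x, logical(1))

# How many genres does each entry have?
n_genres <- lengths(categories)

# Collapse to a display string only when one is actually needed
labels <- vapply(categories, function(x) toString(as.character(x)), character(1))
```

Membership tests and counts work directly on the list, whereas a pasted string would first have to be split again.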