2014-12-05 58 views
2

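Splitting each of many columns into two columns in R with a loop

I actually have the same problem as in the question "strsplit one column with exact information into two column".

That question has been solved, but my data looks like this: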

      SNP Geno AlleleA AlleleB AlleleC AlleleD AlleleE
1 marker1   G1      AA      AA      AA      AA      AA
2 marker2   G1      TT      TT      TT      TT      TT
3 marker3   G1      TT      TT      TT      TT      TT
4 marker1   G2      CC      CC      CC      CC      CC
5 marker2   G2      AA      AA      AA      AA      AA
6 marker3   G2      TT      TT      TT      TT      TT
7 marker1   G3      GG      GG      GG      GG      GG
8 marker2   G3      AA      AA      AA      AA      AA
9 marker3   G3      TT      TT      TT      TT      TT

dput output:

structure(list(SNP = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L), .Label = c("marker1", "marker2", "marker3"), class = "factor"), 
    Geno = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("G1", 
    "G2", "G3"), class = "factor"), AlleleA = structure(c(1L, 
    4L, 4L, 2L, 1L, 4L, 3L, 1L, 4L), .Label = c("AA", "CC", "GG", 
    "TT"), class = "factor"), AlleleB = structure(c(1L, 4L, 4L, 
    2L, 1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA", 
    "CC", "GG", "TT")), AlleleC = structure(c(1L, 4L, 4L, 2L, 
    1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA", "CC", 
    "GG", "TT")), AlleleD = structure(c(1L, 4L, 4L, 2L, 1L, 4L, 
    3L, 1L, 4L), class = "factor", .Label = c("AA", "CC", "GG", 
    "TT")), AlleleE = structure(c(1L, 4L, 4L, 2L, 1L, 4L, 3L, 
    1L, 4L), class = "factor", .Label = c("AA", "CC", "GG", "TT" 
    ))), .Names = c("SNP", "Geno", "AlleleA", "AlleleB", "AlleleC", 
"AlleleD", "AlleleE"), row.names = c(NA, -9L), class = "data.frame") 

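In that question, the asker had only one column to split into two. My problem is that I have 5000 columns (AlleleA, AlleleB, and so on) and want to split each of them into two columns.

I tried a loop like this, but it doesn't work: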

for(i in colnames(dat)){ 
    dat1 <- data.frame(do.call(rbind, strsplit(as.vector(sprintf("dat$%s",i)), split = ""))) 
} 

I look forward to your guidance. Thanks.

+0

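How do you want to split the columns? (Each column into exactly two columns? How is the split defined?) tidyr has a separate function that splits one column into multiple columns, and you could apply it to every column you want to split, for example with dplyr's mutate_each function. – 2014-12-05 09:32:06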

+0

@beginneR I have edited my question. – user46543 2014-12-05 09:40:28

+0

@beginneR It works using splitstackshape :) Thanks, Ananda Mahto. – user46543 2014-12-05 09:45:31

Answers

4

You can use cSplit from my "splitstackshape" package with the argument stripWhite = FALSE.

For example, if we wanted to split all the "Allele*" columns, we would do:

library(splitstackshape) 
cSplit(mydf, grep("Allele", names(mydf)), "", stripWhite = FALSE) 
#  SNP Geno AlleleA_1 AlleleA_2 AlleleB_1 AlleleB_2 AlleleC_1 
# 1: marker1 G1   A   A   A   A   A 
# 2: marker2 G1   T   T   T   T   T 
# 3: marker3 G1   T   T   T   T   T 
# 4: marker1 G2   C   C   C   C   C 
# 5: marker2 G2   A   A   A   A   A 
# 6: marker3 G2   T   T   T   T   T 
# 7: marker1 G3   G   G   G   G   G 
# 8: marker2 G3   A   A   A   A   A 
# 9: marker3 G3   T   T   T   T   T 
# AlleleC_2 AlleleD_1 AlleleD_2 AlleleE_1 AlleleE_2 
# 1:   A   A   A   A   A 
# 2:   T   T   T   T   T 
# 3:   T   T   T   T   T 
# 4:   C   C   C   C   C 
# 5:   A   A   A   A   A 
# 6:   T   T   T   T   T 
# 7:   G   G   G   G   G 
# 8:   A   A   A   A   A 
# 9:   T   T   T   T   T 
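
For comparison, here is a minimal base R sketch that needs no extra packages. The loop in the question fails because sprintf("dat$%s", i) only builds the literal string "dat$SNP" rather than fetching the column, and because dat1 is overwritten on every iteration. The sketch assumes every genotype is exactly two characters long; the names allele_cols, pieces and res are illustrative, and the small dat below is a cut-down stand-in for the question's data.

```r
# Cut-down stand-in for the question's data (factors, as in the dput output)
dat <- data.frame(
  SNP     = c("marker1", "marker2", "marker3"),
  Geno    = c("G1", "G1", "G1"),
  AlleleA = c("AA", "TT", "TT"),
  AlleleB = c("AA", "TT", "TT"),
  stringsAsFactors = TRUE
)

# Names of the columns to split
allele_cols <- grep("Allele", names(dat), value = TRUE)

# For each allele column, build a two-column data frame of single characters.
# Assumption: every value has exactly two characters.
pieces <- lapply(allele_cols, function(nm) {
  x <- as.character(dat[[nm]])   # factor -> character; dat[[nm]] fetches the column
  out <- data.frame(substr(x, 1, 1), substr(x, 2, 2), stringsAsFactors = FALSE)
  names(out) <- paste(nm, 1:2, sep = "_")
  out
})

# Keep the untouched columns and append all of the split columns
res <- cbind(dat[setdiff(names(dat), allele_cols)], do.call(cbind, pieces))
```

Collecting the pieces in a list and binding them once at the end avoids growing the data frame inside the loop, which should matter with 5000 columns.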
2

As @beginneR said, you can use tidyr::separate. Here is an example taken from http://blog.rstudio.org/2014/07/22/introducing-tidyr/:

head(tidier, 8) 

#> id  trt  key time 
#> 1 1 treatment work.T1 0.08514 
#> 2 2 control work.T1 0.22544 
#> 3 3 treatment work.T1 0.27453 
#> 4 4 control work.T1 0.27231 
#> 5 1 treatment home.T1 0.61583 
#> 6 2 control home.T1 0.42967 
#> 7 3 treatment home.T1 0.65166 
#> 8 4 control home.T1 0.56774 

tidy <- tidier %>% 
    separate(key, into = c("location", "time"), sep = "\\.") 
tidy %>% head(8) 
#> id  trt location time time 
#> 1 1 treatment  work T1 0.08514 
#> 2 2 control  work T1 0.22544 
#> 3 3 treatment  work T1 0.27453 
#> 4 4 control  work T1 0.27231 
#> 5 1 treatment  home T1 0.61583 
#> 6 2 control  home T1 0.42967 
#> 7 3 treatment  home T1 0.65166 
#> 8 4 control  home T1 0.56774 
+1

I think the question is more about having to do this kind of split on *multiple* columns. – A5C1D2H2I1M1N2O1R2T1 2014-12-05 09:40:19

+0

You're right, I didn't read the question carefully, nor @beginneR's comment. – 2014-12-05 09:41:45

+1

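Actually, I'm not sure whether this can be done with a combination of 'mutate_each' and 'separate', at least not as flexibly as in Ananda's answer, because separate requires you to specify each column you want to split. – 2014-12-05 09:55:05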

3

Another option might be

library(qdap) 
res <- colsplit2df(dat, splitcols=2:ncol(dat),sep='') 
colnames(res)[-1] <- make.names(rep(colnames(dat)[-1],each=2), unique=TRUE) 
res[1:3,1:5] 
#  SNP Geno Geno.1 AlleleA AlleleA.1 
#1 marker1 G  1  A   A 
#2 marker2 G  1  T   T 
#3 marker3 G  1  T   T 

or, for the Allele columns only:

colsplit2df(dat, splitcols=grep('Allele', names(dat)),sep='') 

Edit (Tyler Rinker)

I suggest first editing the column names of the data frame using setNames, as follows:

setNames(dat, gsub("([A-Z]{1}[a-z]+[A-Z])", "\\1.1&\\1.2", names(dat))) %>% 
    colsplit2df(splitcols=3:ncol(dat), sep='')
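
To see what the renaming step does before it reaches colsplit2df, the gsub call can be run by itself on a few of the question's column names. This is only an illustration in base R; cols and new_names are made-up names.

```r
# A few column names from the question's data
cols <- c("SNP", "Geno", "AlleleA", "AlleleB")

# The pattern needs an uppercase letter, one or more lowercase letters,
# then another uppercase letter, so "SNP" and "Geno" are left unchanged.
new_names <- gsub("([A-Z]{1}[a-z]+[A-Z])", "\\1.1&\\1.2", cols)

# new_names is now:
# c("SNP", "Geno", "AlleleA.1&AlleleA.2", "AlleleB.1&AlleleB.2")
```

Each matching name becomes a pair joined by "&", presumably so that colsplit2df can use the two halves as the names of the new columns.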