2011-10-12
2

Splitting one string into multiple strings on separate rows

I have a data frame containing long strings, each associated with a "Sample":

Sample Data 
    1  000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N 
    2  000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N 

I would like a simple way to break each string into 5 pieces, in the following format:

Sample X 
CCT6 - Characters 1-33 
GAT1 - Characters 34-68 
IMD3 - Characters 69-99 
PDR3 - Characters 100-130 
RIM15 - Characters 131-168 

so that the output looks like this for each sample:

Sample 1 
CCT6 - 000000000000000000000000000N01000 
GAT1 - 000000000N0N000000000N00N0000NN00N0 
IMD3 - N000000100000N00N0N0000000NNNN0 
PDR3 - 1111111111111111111111111111111 
RIM15 - 0000000000000000000N000000N0000000000N 

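I have been able to use the substr function to break the long string into individual pieces, but I would like to automate this so that I get all 5 pieces in one output. Ideally, that output would also be a data frame.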

Answers

5

This is what ?read.fwf is for.

First, some data that looks like the data in your question:

x <- data.frame(Sample = c(1, 2), Data = c("000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N", 
"000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"), 
stringsAsFactors = FALSE) 

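Now use read.fwf, specifying the width of each field and the field names, and that all fields should be of mode character. We wrap the text column of the example data in textConnection so that it can be treated as a connection, which read.* and other functions generally understand.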

(strs <- read.fwf(textConnection(x$Data), widths = c(33, 35, 31, 31, 38), colClasses = "character", col.names = c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15"))) 


           CCT6        GAT1       IMD3       PDR3         RIM15 
1 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N 
2 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N 
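The widths passed to read.fwf above follow from the character ranges given in the question. As a minimal sketch (the vector names starts, ends and widths are illustrative and not part of the original answer), they can be computed rather than typed by hand:

```r
# Start/end positions copied from the "Characters" ranges in the question
starts <- c(CCT6 = 1, GAT1 = 34, IMD3 = 69, PDR3 = 100, RIM15 = 131)
ends   <- c(33, 68, 99, 130, 168)

# read.fwf() expects field widths, which are end - start + 1
widths <- unname(ends - starts + 1)
widths

# The widths should add up to the full string length (168 characters)
sum(widths)
```

If the total does not match nchar() of the strings, one of the ranges is probably off by one.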

Now loop over all the rows and print each one as in your example:

for (i in 1:nrow(strs)) { 
    writeLines(paste("Sample", i)) 
    writeLines(paste(names(strs), strs[i, ], sep = " - ")) 
} 

giving, for example:

Sample 2 
CCT6 - 000000000000000000000000000N01000 
GAT1 - 000000000N0N000000000N00N0000NN00N0 
IMD3 - N000000100000N00N0N0000000NNNN0 
PDR3 - 1111111111111111111111111111111 
RIM15 - 0000000000000000000N000000N0000000000N 
+0

This works great! I just don't know how to save the final data so that I can access it again later.

+0

You can open a file connection and use writeLines with the 'con =' argument, or you can use 'save(strs, file = "strpieces.rda")'
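Both suggestions could look roughly like the sketch below. It assumes a small stand-in for strs with made-up values, and the temporary file paths are hypothetical; substitute your own data frame and file names.

```r
# Stand-in for the 'strs' data frame from read.fwf() (made-up short values)
strs <- data.frame(CCT6 = c("00N", "010"), GAT1 = c("N00", "0N0"),
                   stringsAsFactors = FALSE)

# Option 1: write the printed lines to a text file through a connection
txt_file <- tempfile(fileext = ".txt")   # hypothetical file location
con <- file(txt_file, open = "w")
for (i in seq_len(nrow(strs))) {
    writeLines(paste("Sample", i), con = con)
    writeLines(paste(names(strs), unlist(strs[i, ]), sep = " - "), con = con)
}
close(con)

# Option 2: save the R object itself and load it back later
rda_file <- tempfile(fileext = ".rda")   # hypothetical file location
save(strs, file = rda_file)
rm(strs)
load(rda_file)                           # restores an object named 'strs'
```

The text file is convenient for reading by eye, while save/load keeps the data frame itself for further work in R.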

+0

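One problem I have now run into with this code is that it separates the original sample ID numbers from the data in the final structure. In my example, the samples are numbered sequentially starting from 1, but that is not the case in my actual data set. How can I keep the link, so that the final output has whatever sample ID was attached to the split string in the original data table?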
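One possible way to keep the IDs, sketched here under assumptions: read.fwf returns rows in the same order as the input strings, so the original Sample column can be bound back on. The IDs, strings, widths and column names below are made up for illustration.

```r
# Hypothetical input whose Sample IDs are not 1, 2, ... (short made-up strings)
x <- data.frame(Sample = c(17, 42),
                Data = c("000N01N0", "0100N00N"),
                stringsAsFactors = FALSE)

# Split each string into fixed-width fields (illustrative widths and names)
pieces <- read.fwf(textConnection(x$Data), widths = c(3, 5),
                   colClasses = "character", col.names = c("A", "B"))

# Rows come back in input order, so the IDs can be attached as a column
strs <- cbind(Sample = x$Sample, pieces)

# Print using the stored ID instead of the row number
for (i in seq_len(nrow(strs))) {
    writeLines(paste("Sample", strs$Sample[i]))
    writeLines(paste(names(pieces), unlist(pieces[i, ]), sep = " - "))
}
```

The loop then labels each block with the ID stored in the data frame rather than with the loop index.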

1
SampX <- textConnection("CCT6 - Characters 1-33 
GAT1 - Characters 34-68 
IMD3 - Characters 69-99 
PDR3 - Characters 100-130 
RIM15 - Characters 131-168") 
dfSampX <-read.table(SampX, sep="-") 
dfSampX$V4 <- as.numeric(sub("Characters ", "", dfSampX$V2)) 

sampdat <- read.table(textConnection("Sample Data 
    1  000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N 
    2  000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N 
"), header=TRUE,stringsAsFactors=FALSE) 

This code splits the strings into groups:

apply(dfSampX[,c(3,4)], 1, function(x) substr(sampdat[,2], x["V4"], x["V3"])) 
    [,1]        [,2]         
[1,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0" 
[2,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0" 
    [,3]        [,4]        
[1,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111" 
[2,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111" 
    [,5]          
[1,] "0000000000000000000N000000N0000000000N" 
[2,] "0000000000000000000N000000N0000000000N" 

This code delivers the pieces in list format:

res <- lapply(sampdat$Data, function(x) 
      apply(dfSampX[,c(3,4)], 1, function(y) substr(x, y["V4"], y["V3"]))) 

res2 <- lapply(res, function(x){ names(x) <- dfSampX$V1 ; return(x)}) 
res2 

[[1]] 
            CCT6          GAT1 
    "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0" 
            IMD3          PDR3 
     "N000000100000N00N0N0000000NNNN0"  "1111111111111111111111111111111" 
            RIM15 
"0000000000000000000N000000N0000000000N" 

[[2]] 
            CCT6          GAT1 
    "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0" 
            IMD3          PDR3 
     "N000000100000N00N0N0000000NNNN0"  "1111111111111111111111111111111" 
            RIM15 
"0000000000000000000N000000N0000000000N" 

And to get the specified output format:

for (samp in seq_along(res2)) {
    cat("Sample", samp, "\n")
    # trimws() drops the padding that read.table kept around the names
    invisible(sapply(1:5, function(y)
        cat(trimws(as.character(dfSampX$V1[y])), "-", res2[[samp]][y], "\n")))
}
Sample 1 
CCT6 - 000000000000000000000000000N01000 
GAT1 - 000000000N0N000000000N00N0000NN00N0 
IMD3 - N000000100000N00N0N0000000NNNN0 
PDR3 - 1111111111111111111111111111111 
RIM15 - 0000000000000000000N000000N0000000000N 
Sample 2 
CCT6 - 000000000000000000000000000N01000 
GAT1 - 000000000N0N000000000N00N0000NN00N0 
IMD3 - N000000100000N00N0N0000000NNNN0 
PDR3 - 1111111111111111111111111111111 
RIM15 - 0000000000000000000N000000N0000000000N 

The invisible() call is needed to suppress the NULL values that sapply would otherwise return as a list.
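Since the question asks for a data frame, the list could also be stacked into one. This is a sketch only: it assumes res2 is a list of named character vectors as above, and uses made-up short values and hypothetical sample IDs.

```r
# Stand-in for 'res2': one named character vector per sample (made-up values)
res2 <- list(c(CCT6 = "00N", GAT1 = "N00"),
             c(CCT6 = "010", GAT1 = "0N0"))
sample_ids <- c(17, 42)   # hypothetical IDs, in the same order as res2

# Stack the vectors into a character matrix, then convert to a data frame
res_df <- data.frame(Sample = sample_ids,
                     do.call(rbind, res2),
                     stringsAsFactors = FALSE)
res_df
```

Each row of res_df is one sample and each column is one named piece.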

+0

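Hmm... I'm not convinced this is what I'm looking for. I'd like to be able to run the script on a data frame with multiple samples. Above, it looks like you have typed the whole string into the code for each sample. I'd also like my output to look like the example I provided above.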

+0

Have you looked at the "sampdat" object with str()? Is it different from your data? If so, please provide dput() output for your object.

+0

Added a naming step.
