2017-06-05 54 views
2

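R: replace letters in other columns of a data frame based on reference columns

I have a data frame with three reference columns, ref, het and hom. In each row, I want to replace the letters/genotypes in the other columns, where G = C, A = T, AG = TC, or vice versa, based on the reference columns.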

structure(list(SNP = c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", 
"rs7", "rs8", "rs9"), ref = c("GG", "AA", "AA", "GG", "GG", "GG", 
"AA", "CC", "GG"), het = c("AG", "AG", "AG", "AG", "AG", "AG", 
"AG", "AC", "AG"), hom = c("AA", "GG", "GG", "AA", "AA", "AA", 
"GG", "AA", "AA"), A = c("TC", "TC", "CC", "AG", "TT", "TC", 
"AA", "GG", "GG"), B = c("CC", "TT", "CC", "AG", "TT", "CC", 
"AA", "TG", "GG"), C = c("CC", "CC", "CC", "GG", "CC", "TT", 
"AA", "TG", "GG"), D = c("TT", "TC", "CC", "AG", "TT", "TT", 
"AA", "GG", "AG"), E = c("CC", "TT", "CC", "AG", "TC", "TT", 
"AA", "TG", "GG"), F = c("TC", "TT", "TC", "GG", "TC", "TC", 
"AA", "GG", "GG"), G = c("TC", "TC", "CC", "AG", "TC", "TC", 
"AA", "GG", "GG"), H = c("TC", "TC", "TC", "GG", "TC", "TC", 
"AA", "TG", "GG")), .Names = c("SNP", "ref", "het", "hom", "A", 
"B", "C", "D", "E", "F", "G", "H"), class = "data.frame", row.names = 
c(NA, 
-9L)) 

Input: 
SNP ref het hom A B C D E F G H I 
rs1 GG AG AA TC CC CC TT CC TC TC TC … 
rs2 AA AG GG TC TT CC TC TT TT TC TC … 
rs3 AA AG GG CC CC CC CC CC TC CC TC … 
rs4 GG AG AA AG AG GG AG AG GG AG GG … 
rs5 GG AG AA TT TT CC TT TC TC TC TC … 
rs6 GG AG AA TC CC TT TT TT TC TC TC … 
rs7 AA AG GG AA AA AA AA AA AA AA AA … 
rs8 CC AC AA GG TG TG GG TG GG GG TG … 
rs9 GG AG AA GG GG GG AG GG GG GG GG … 

Desired Output: 
SNP ref het hom A B C D E F G H I 
rs1 GG AG AA AG GG GG AA GG AG AG AG … 
rs2 AA AG GG AG AA GG AG AA AA AG AG … 
rs3 AA AG GG GG GG GG GG GG AG GG AG … 
rs4 GG AG AA AG AG GG AG AG GG AG GG … 
rs5 GG AG AA AA AA GG AA AG AG AG AG … 
rs6 GG AG AA AG GG AA AA AA AG AG AG … 
rs7 AA AG GG AA AA AA AA AA AA AA AA … 
rs8 CC AC AA AA AC AC CC AC CC CC AC … 
rs9 GG AG AA GG GG GG AG GG GG GG GG … 

How can I write a function that replaces these letters based on the reference columns? Thanks.

+0

Here, is only the 'ref' column the reference? – akrun

+0

Thanks for the reply. It is not only the ref column but all three of them: the ref, het and hom columns. –

+0

I think the column names in your dput are V1, V2, etc., whereas they should be SNP, ref, etc. – akrun

Answers

2

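We can create a "dictionary" of all possible genotypes and their complements, then go through the SNPs row by row and check the first sample element (column A). If it is not one of ref/het/hom, we assume the elements in that row need to be changed; otherwise we return the row as is.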

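# Lookup table: each genotype maps to its complement (A<->T, G<->C)
key <- list(AA = "TT", TT = "AA",
            GG = "CC", CC = "GG",
            AG = "TC", TC = "AG",
            GA = "CT", CT = "GA",
            AC = "TG", TG = "AC",
            CA = "GT", GT = "CA")

# myrow is one row as a character vector: SNP, ref, het, hom, then the samples
changeAlleles <- function(myrow) {
  # If the first sample genotype is not among ref/het/hom, flip every sample
  if (!(myrow[5] %in% myrow[2:4])) {
    myrow <- c(myrow[1:4],
               sapply(myrow[5:length(myrow)], function(x) key[[x]]))
  }
  return(myrow)
}

# df is the data frame from the question
df2 <- as.data.frame(t(apply(df, 1, changeAlleles)))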

    SNP ref het hom A B C D E F G H 
2 rs1 GG AG AA AG GG GG AA GG AG AG AG 
3 rs2 AA AG GG AG AA GG AG AA AA AG AG 
4 rs3 AA AG GG GG GG GG GG GG AG GG AG 
5 rs4 GG AG AA AG AG GG AG AG GG AG GG 
6 rs5 GG AG AA AA AA GG AA AG AG AG AG 
7 rs6 GG AG AA AG GG AA AA AA AG AG AG 
8 rs7 AA AG GG AA AA AA AA AA AA AA AA 
9 rs8 CC AC AA CC AC AC CC AC CC CC AC 
10 rs9 GG AG AA GG GG GG AG GG GG GG GG 
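
One caveat with the lookup above: key[[x]] returns NULL for a genotype that is not in the list (a missing call, for example), which can break the shape of the row. A minimal sketch of a vectorized lookup that leaves unknown genotypes untouched follows. The helper name flip_genotypes is hypothetical and not part of the answer, and it assumes genotypes are stored as character strings rather than factors.

```r
# Same genotype-to-complement table as above, as a named character vector
key <- c(AA = "TT", TT = "AA", GG = "CC", CC = "GG",
         AG = "TC", TC = "AG", GA = "CT", CT = "GA",
         AC = "TG", TG = "AC", CA = "GT", GT = "CA")

# Hypothetical helper: complement a character vector of genotypes.
# Genotypes that are missing or not in `key` are returned unchanged.
flip_genotypes <- function(x, key) {
  out <- unname(key[x])        # NA wherever the genotype is not in key
  ifelse(is.na(out), x, out)   # fall back to the original value
}

# Example call on a vector containing an unknown code and a missing value
flipped <- flip_genotypes(c("TC", "CC", "NN", NA), key)
```

Inside changeAlleles, the sapply(...) call could be swapped for flip_genotypes(myrow[5:length(myrow)], key) if this behavior is wanted.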
1

We can use chartr

df1[5:12] <- lapply(df1[5:12], function(x) chartr('TC', 'AG', x)) 
df1 
# SNP ref het hom A B C D E F G H I 
#1 rs1 GG AG AA AG GG GG AA GG AG AG AG … 
#2 rs2 AA AG GG AG AA GG AG AA AA AG AG … 
#3 rs3 AA AG GG GG GG GG GG GG AG GG AG … 
#4 rs4 GG AG AA AG AG GG AG AG GG AG GG … 
#5 rs5 GG AG AA AA AA GG AA AG AG AG AG … 
#6 rs6 GG AG AA AG GG AA AA AA AG AG AG … 
#7 rs7 AA AG GG AA AA AA AA AA AA AA AA … 
#8 rs8 CC AC AA GG AG AG GG AG GG GG AG … 
#9 rs9 GG AG AA GG GG GG AG GG GG GG GG … 
+0

Thanks, it almost works. But in row 8 (rs8 CC AC AA AA AC AC CC AC CC CC AC ...) the AG values were not changed to AC. –
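
The row 8 problem arises because chartr('TC', 'AG', x) always maps T to A and C to G, while rs8 needs G to C and T to A. One possible way around it, as a sketch only: take the full complement (A<->T, C<->G) with chartr, but apply it only to rows whose first sample genotype is not one of ref/het/hom, which is the same test the first answer uses. The names sample_cols and needs_flip are illustrative, the data is a three-row subset of the question's table, and the sketch assumes character columns, no missing values, and that heterozygotes are written in the same allele order as the het column.

```r
# Illustrative subset of the question's data (rows rs1, rs4, rs8; samples A, B)
df1 <- data.frame(
  SNP = c("rs1", "rs4", "rs8"),
  ref = c("GG", "GG", "CC"),
  het = c("AG", "AG", "AC"),
  hom = c("AA", "AA", "AA"),
  A   = c("TC", "AG", "GG"),
  B   = c("CC", "AG", "TG"),
  stringsAsFactors = FALSE
)

# Every column that is not an identifier or reference column is a sample
sample_cols <- setdiff(names(df1), c("SNP", "ref", "het", "hom"))

# Flag rows whose first sample genotype matches none of ref/het/hom
first <- df1[[sample_cols[1]]]
needs_flip <- first != df1$ref & first != df1$het & first != df1$hom

# Complement each base, but only in the flagged rows
df1[sample_cols] <- lapply(df1[sample_cols], function(x) {
  x[needs_flip] <- chartr("ACGT", "TGCA", x[needs_flip])
  x
})
df1
```

Rows that already agree with the reference columns, such as rs4 here, are left alone.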
