2017-08-14

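Transpose a data frame with duplicates

I have a data frame with two columns, one for gene symbols and one for functional pathways. The pathway column contains repeated values because each pathway has many genes. I want to reshape this data set so that each column is a single pathway and the rows in each column are the genes that belong to that pathway.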

Starting data frame:

data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"), 
gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10")) 

Desired data frame:

data.frame(p1 = c("G1", "G2", "G3", "G4"), p2 = c("G33", "G43", "G10", 
"")) 

I know that not all columns will have the same length; blank values are preferred over NAs.


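Since the columns will not have the same length, you are really better off creating a standard `list` rather than a `data.frame`, especially since row 1 of column 1 has nothing to do with row 1 of column 2.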
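The list-based layout suggested in this comment could look like the following minimal sketch. The variable names `df` and `genes_by_pathway` are assumptions made for illustration; the question does not assign its data frame to a name.

```r
# Assumption: the question's data frame is stored in a variable named `df`
df <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
                 gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10"),
                 stringsAsFactors = FALSE)

# One character vector per pathway; vectors may differ in length,
# so no padding with NA or "" is needed
genes_by_pathway <- split(df$gene.symbol, df$pathway)

genes_by_pathway$p2        # genes belonging to pathway p2
lengths(genes_by_pathway)  # number of genes per pathway
```

With a list, each pathway keeps exactly its own genes, and there is no implied relationship between elements that happen to share a row index.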

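Answers

Here is another option.

  1. Split the genes into a list, using the pathway as the grouping factor
  2. Get the maximum length across the groups and extend every other group to that length
  3. Convert the list back into a data frame

Here is the code.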

mydf <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"), 
      gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10"), 
      stringsAsFactors = FALSE) 

# function to run over each element in list: 
# extends x to length n, filling the new positions with NA 
set_to_max_length <- function(x, n) { 
    length(x) <- n 
    return(x) 
} 

# 1. split into list 
mydf.split <- split(mydf$gene.symbol, mydf$pathway) 

# 2.a get max length of all list elements 
max.length <- max(lengths(mydf.split)) 

# 2.b set each list element to max length 
mydf.split.2 <- lapply(mydf.split, set_to_max_length, n = max.length) 

# 3. combine back into df 
data.frame(mydf.split.2) 
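The code above fills the shorter columns with NA, whereas the question asks for blanks. A possible variation that pads with empty strings is sketched below; `pad_with_blank`, `genes`, `n_max`, and `wide` are hypothetical names introduced here, not part of the original answer.

```r
mydf <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
                   gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: append "" to a character vector until it has length n
pad_with_blank <- function(x, n) {
    c(x, rep("", n - length(x)))
}

# Split genes by pathway, then pad every group to the longest group's length
genes <- split(mydf$gene.symbol, mydf$pathway)
n_max <- max(lengths(genes))

wide <- data.frame(lapply(genes, pad_with_blank, n = n_max),
                   stringsAsFactors = FALSE)
wide
```

This assumes `gene.symbol` is a character column; if it is a factor, it would need `as.character()` first so that `""` can be appended.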

Edit

Here is another option using the tidyverse, which is somewhat more concise:

library(tidyverse) 
mydf <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"), 
        gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10")) 

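mydf %>% 
    group_by(pathway) %>% 
    mutate(rownum = row_number()) %>%   # position of each gene within its pathway 
    ungroup() %>% 
    spread(pathway, gene.symbol) %>%    # one column per pathway; gaps become NA 
    select(-rownum)                     # drop the helper column 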
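In current tidyr releases `spread()` is superseded by `pivot_wider()`, which can also fill the gaps with blanks instead of NA. The following is a sketch of that variation, assuming dplyr and tidyr (version 1.0.0 or later) are installed; the name `wide` is introduced here for illustration.

```r
library(dplyr)
library(tidyr)

mydf <- data.frame(pathway = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
                   gene.symbol = c("G1", "G2", "G3", "G4", "G33", "G43", "G10"),
                   stringsAsFactors = FALSE)

wide <- mydf %>%
    group_by(pathway) %>%
    mutate(rownum = row_number()) %>%   # position of each gene within its pathway
    ungroup() %>%
    pivot_wider(names_from = pathway,
                values_from = gene.symbol,
                values_fill = list(gene.symbol = "")) %>%  # blanks instead of NA
    select(-rownum)                     # drop the helper column

wide
```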

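This may seem a bit convoluted, but it reaches the desired output by first building a list and then converting back to a data.frame (the question's data frame is assumed to be stored as `df`):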

df$gene.symbol <- as.character(df$gene.symbol) 

pw_list <- list() 
for (pw in unique(df$pathway)) { 
    pw_list[[pw]] <- df[df$pathway == pw, "gene.symbol"] 
} 
pw_list 
$p1 
[1] "G1" "G2" "G3" "G4" 

$p2 
[1] "G33" "G43" "G10" 


reordered <- matrix("", nrow = max(sapply(pw_list, length)), ncol = length(pw_list)) 
colnames(reordered) <- names(pw_list) 

for (pw in names(pw_list)){ 
    n <- length(pw_list[[pw]]) 
    reordered[1:n, pw] <- pw_list[[pw]] 
} 
reordered <- as.data.frame(reordered) 
reordered 
  p1  p2 
1 G1 G33 
2 G2 G43 
3 G3 G10 
4 G4     

Edit

A slightly more concise version:

df$gene.symbol <- as.character(df$gene.symbol) 
pw_list <- list() 
for (pw in unique(df$pathway)) { 
    pw_list[[pw]] <- df[df$pathway == pw, "gene.symbol"] 
} 
reordered <- as.data.frame(sapply(pw_list, "[", i = 1:max(sapply(pw_list, length))), 
          stringsAsFactors = FALSE) 
reordered[is.na(reordered)] <- "" 
names(reordered) <- names(pw_list)