2015-06-22 237 views

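Skip some rows with fread

I would like to skip the rows of my file that come before the header line. How can I do this by scanning for ID_REF and dropping every row before it or, if ID_REF is not present, by looking for the ILMN_ pattern and deleting all rows before the first match (as long as it does not contain #)?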

# GEOarchive matrix file.    
ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS 1688628068_A.BEAD_STDERR 1688628068_A.Detection Pval 
ILMN_1343291 62821.84   135        413.9399      0 
ILMN_1343292 3255.167   131        47.76587      0 
ILMN_1343293 42924.91   152        539.3026      0 
ILMN_1343294 55255.21   100        746.1457      0 

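It looks like you have more column names than columns. Is '1688628068_A.Detection Pval' a single column? If the only lines that need skipping start with '#', read.table('yourfile.txt', header = TRUE, fill = TRUE) should read the file. – akrun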


@akrun Yes, it is a single column – Hashim


One option is to change the column name in the file to '1688628068_A.Detection_Pval' and read it without fill = TRUE – akrun
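The two options from the comments can be compared with a small sketch. It assumes the sample rows from the question are held in a character vector (the name txt is made up for illustration) instead of a file on disk, and it uses base R only.

```r
# Sample text copied from the question (trailing spaces dropped).
txt <- c(
  "# GEOarchive matrix file.",
  "ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS 1688628068_A.BEAD_STDERR 1688628068_A.Detection Pval",
  "ILMN_1343291 62821.84   135        413.9399      0",
  "ILMN_1343292 3255.167   131        47.76587      0"
)

# Option 1: keep the header as is. The '#' line is dropped because
# comment.char defaults to "#", and fill = TRUE pads the short data
# rows with NA, so "Detection" and "Pval" become two separate columns.
filled <- read.table(text = txt, header = TRUE, fill = TRUE, check.names = FALSE)

# Option 2: rename the header so it has five fields, then read without fill.
txt2 <- sub("Detection Pval", "Detection_Pval", txt, fixed = TRUE)
renamed <- read.table(text = txt2, header = TRUE, check.names = FALSE)

str(filled)
str(renamed)
```

With option 1 the last column holds only NA and would still need to be dropped and the fifth column renamed, which is why renaming the header first is the tidier route.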

Answer


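On Linux, you can use awk together with fread, or pipe its output into read.table. Here, awk drops everything before the first line that starts with ID_REF or ILMN and changes the delimiter to a comma.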

pth <- '/home/akrun/file.txt' #change it to your path 
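# Print from the first line starting with ID_REF or ILMN onward; $1=$1 rebuilds each record with OFS=","
v1 <- sprintf("awk '/^(ID_REF|ILMN)/{ matched = 1} matched {$1=$1; print}' OFS=\",\" %s", pth) 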

Using fread

library(data.table) 
fread(cmd = v1) # cmd= is available in recent data.table releases; older ones accept fread(v1) 
#   ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS 
#1: ILMN_1343291    62821.840      135 
#2: ILMN_1343292    3255.167      131 
#3: ILMN_1343293    42924.910      152 
#4: ILMN_1343294    55255.210      100 
# 1688628068_A.BEAD_STDERR 1688628068_A.Detection_Pval 
#1:    413.93990       0 
#2:     47.76587       0 
#3:    539.30260       0 
#4:    746.14570       0 

Or using read.table

read.table(pipe(v1), header=TRUE, sep=',', check.names=FALSE) 
#  ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS 
#1 ILMN_1343291    62821.840      135 
#2 ILMN_1343292    3255.167      131 
#3 ILMN_1343293    42924.910      152 
#4 ILMN_1343294    55255.210      100 
# 1688628068_A.BEAD_STDERR 1688628068_A.Detection_Pval 
#1    413.93990       0 
#2     47.76587       0 
#3    539.30260       0 
#4    746.14570       0 

Note that awk changed the delimiter to a comma, and that I changed the column name in the file from 1688628068_A.Detection Pval to 1688628068_A.Detection_Pval.
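To see why the rename matters once the delimiter becomes a comma, here is a rough base R analogue of what awk does with $1=$1 and OFS=",". The helper names to_csv_line and count_fields are made up for this illustration, and the regular expression only approximates awk's default field splitting.

```r
# Rough analogue of awk's `$1=$1` with OFS=",":
# trim the line, then collapse each run of whitespace into one comma.
to_csv_line <- function(x) gsub("[[:space:]]+", ",", trimws(x))

# Count the comma-separated fields of the rebuilt line.
count_fields <- function(x) length(strsplit(to_csv_line(x), ",", fixed = TRUE)[[1]])

data_row   <- "ILMN_1343291 62821.84   135        413.9399      0 "
header_old <- "ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS 1688628068_A.BEAD_STDERR 1688628068_A.Detection Pval "
header_new <- sub("Detection Pval", "Detection_Pval", header_old, fixed = TRUE)

to_csv_line(data_row)     # "ILMN_1343291,62821.84,135,413.9399,0"
count_fields(data_row)    # 5
count_fields(header_old)  # 6: the space splits "Detection Pval" in two
count_fields(header_new)  # 5: matches the data rows
```

The unrenamed header yields six fields against five in every data row, so the comma-separated output would be misaligned without the rename.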

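For some reason, the extra spaces cause problems for fread. They are not a problem for read.table, so the following, which leaves the delimiter unchanged, also works with read.table: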

v2 <- sprintf("awk '/^(ID_REF|ILMN)/{ matched = 1} matched { print}' %s", pth) 

read.table(pipe(v2), header=TRUE, check.names=FALSE) 
#  ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS 
#1 ILMN_1343291    62821.840      135 
#2 ILMN_1343292    3255.167      131 
#3 ILMN_1343293    42924.910      152 
#4 ILMN_1343294    55255.210      100 
# 1688628068_A.BEAD_STDERR 1688628068_A.Detection_Pval 
#1    413.93990       0 
#2     47.76587       0 
#3    539.30260       0 
#4    746.14570       0 
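Where awk is not available (for example on Windows), the same filtering can be sketched in base R. This is a minimal sketch, not part of the original answer: read_after_header is a hypothetical helper, the temp file stands in for the real path, and the choice to read without a header when only ILMN_ rows are found is an assumption about what the question wants.

```r
# Hypothetical sample mirroring the question's file, written to a temp file.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "# GEOarchive matrix file.",
  "ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS 1688628068_A.BEAD_STDERR 1688628068_A.Detection Pval",
  "ILMN_1343291 62821.84   135        413.9399      0",
  "ILMN_1343292 3255.167   131        47.76587      0"
), tmp)

read_after_header <- function(path, pattern = "^(ID_REF|ILMN)") {
  x <- readLines(path)
  start <- grep(pattern, x)[1]            # first line matching the pattern
  if (is.na(start)) stop("no line matches the header pattern")
  x <- x[start:length(x)]                 # drop everything before it
  has_header <- grepl("^ID_REF", x[1])    # assumption: ILMN_ rows are data, not a header
  if (has_header) {
    # Same rename the answer applies by hand in the file.
    x[1] <- sub("Detection Pval", "Detection_Pval", x[1], fixed = TRUE)
  }
  read.table(text = x, header = has_header, check.names = FALSE)
}

df <- read_after_header(tmp)
str(df)
```

Reading the file with readLines first keeps the pattern search and the header rename in one place, at the cost of holding the whole file in memory, so for very large files the awk pipeline above is likely the better fit.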

Thanks, it works fine – Hashim