2014-01-10
0

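Inserting NA into data read as strings

I have a text file whose rows contain unequal numbers of elements. Sometimes the second column contains data, sometimes it contains NA, and sometimes it has no entry at all. I know that if a row has only 4 elements, I should insert an NA as the second element. However, I do not know how to do that. Here is an example data set: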

abc.def ghi.jkl mno pqr A* 
bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A* 

Here is the desired result:

desired.result <- read.table(text = ' 
    Name1 Name2 Name3 Name4 Status 
abc.def ghi.jkl mno pqr  A* 
bc.def  NA  no qr  A 
c-e.ef  NA non qrr  AE 
    fg.gg  NA  no qr  E 
    aa.bb cc.dd  ee ff  A* 
', header = TRUE) 

I have not gotten far, but I have been able to split the data and put it into a matrix with the following code. Of course, the data are misaligned.

setwd('c:/users/mmiller21/simple R programs') 

my.data <- readLines('name_data.txt') 

matrix(unlist(strsplit(unlist(my.data), " ")), ncol=5, byrow=TRUE) 

#  [,1]  [,2]  [,3] [,4]  [,5]  
# [1,] "abc.def" "ghi.jkl" "mno" "pqr"  "A*"  
# [2,] "bc.def" "NA"  "no" "qr"  "A"  
# [3,] "c-e.ef" "non"  "qrr" "AE"  "fg.gg" 
# [4,] "no"  "qr"  "E" "aa.bb" "cc.dd" 
# [5,] "ee"  "ff"  "A*" "abc.def" "ghi.jkl" 

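Somehow I should count the number of elements in each row after strsplit(unlist(my.data), " "), insert NA as the second element in every row that contains only four elements, and then put the data into a matrix. Thank you for any help. I prefer base R.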

Answers

2

Replace dat with your file name:

dat <- textConnection("abc.def ghi.jkl mno pqr A* 
bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A*") 

my.lines <- readLines(dat) 
my.rows <- strsplit(my.lines, " ") 
adjust <- function(row) { 
    if (length(row) == 4) c(head(row, 1), NA, tail(row, 3)) 
    else row 
} 
my.fixed <- lapply(my.rows, adjust) 

out <- matrix(unlist(my.fixed), ncol = 5, byrow = TRUE) 
out[out == "NA"] <- NA 
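
To get from this matrix to something shaped like desired.result in the question, one option is to convert it to a data frame and attach the column names. The following is only a sketch under a few assumptions: the sample rows are typed in directly instead of read from a file, the split uses "\\s+" so that runs of spaces do not produce empty elements, and the names Name1 to Status are copied from the question's desired output.

```r
# Same sample rows as in the question (assumed here instead of reading a file)
my.lines <- c("abc.def ghi.jkl mno pqr A*",
              "bc.def NA no qr A",
              "c-e.ef non qrr AE",
              "fg.gg no qr E",
              "aa.bb cc.dd ee ff A*")

# Split on runs of whitespace; trimws() drops leading/trailing blanks first
my.rows <- strsplit(trimws(my.lines), "\\s+")

# Rows with only 4 elements get an NA inserted as the second element
my.fixed <- lapply(my.rows, function(row) {
    if (length(row) == 4) append(row, NA, after = 1) else row
})

out <- matrix(unlist(my.fixed), ncol = 5, byrow = TRUE)
out[out == "NA"] <- NA   # turn literal "NA" strings into real missing values

# Column names taken from desired.result in the question
result <- as.data.frame(out, stringsAsFactors = FALSE)
names(result) <- c("Name1", "Name2", "Name3", "Name4", "Status")
result
```

All columns stay character here, which matches data that was read as strings.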
2

You can use the option fill=TRUE and then shift the rows that came up short:

dat <- read.table(text='abc.def ghi.jkl mno pqr A* 
    bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A*',fill=TRUE) 

t(apply(dat, 1, function(x) {
    # a short row leaves column 5 empty: put NA in column 2 and shift the rest right
    if (nchar(x[5]) == 0)
        x <- c(x[1], NA_character_, x[2:4])
    x
}))

    [,1]  [,2]  [,3] [,4] [,5] 
[1,] "abc.def" "ghi.jkl" "mno" "pqr" "A*" 
[2,] "bc.def" NA  "no" "qr" "A" 
[3,] "c-e.ef" NA  "non" "qrr" "AE" 
[4,] "fg.gg" NA  "no" "qr" "E" 
[5,] "aa.bb" "cc.dd" "ee" "ff" "A*" 
3
dat <- read.table(text="abc.def ghi.jkl mno pqr A* 
bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A*", fill=TRUE, stringsAsFactors=FALSE) 
names(dat) <- c('Name1' , 'Name2', 'Name3', 'Name4','Status') 
is.na(dat[[5]]) <- dat[[5]]=="" # set blanks in col 5 to NA 
t(apply(dat, 1, function(r) if(is.na(r[5])) {r[c(1,5,2:4)]}else {r})) 
#--------- 
    [,1]  [,2]  [,3] [,4] [,5] 
[1,] "abc.def" "ghi.jkl" "mno" "pqr" "A*" 
[2,] "bc.def" NA  "no" "qr" "A" 
[3,] "c-e.ef" NA  "non" "qrr" "AE" 
[4,] "fg.gg" NA  "no" "qr" "E" 
[5,] "aa.bb" "cc.dd" "ee" "ff" "A*" 
+1

Magic 'is.na(dat[[5]]) <- dat[[5]]==""'! – agstudy

+0

This is equivalent to @agstudy's, except that his allows the last column to contain "NA". – flodel

+0

DWin, you changed your name! (I haven't been on this site for a while.) –

1

A readLines approach: split on whitespace and insert NA as the second element:

txt <- readLines(file) 
t(sapply(strsplit(txt, "\\s+"), function(x) if(length(x) < 5) append(x, NA, 1) else x)) 
#  [,1]  [,2]  [,3] [,4] [,5] 
# [1,] "abc.def" "ghi.jkl" "mno" "pqr" "A*" 
# [2,] "bc.def" "NA"  "no" "qr" "A" 
# [3,] "c-e.ef" NA  "non" "qrr" "AE" 
# [4,] "fg.gg" NA  "no" "qr" "E" 
# [5,] "aa.bb" "cc.dd" "ee" "ff" "A*" 

Full version, including setting up the data:

file <- tempfile() 
cat("abc.def ghi.jkl mno pqr A* 
bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A*", "\n", sep="", file=file) 
txt <- readLines(file) 
t(sapply(strsplit(txt, "\\s+"), function(x) if(length(x) < 5) append(x, NA, 1) else x)) 
unlink(file) 

Note that this is similar to @flodel's answer.
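
All of the approaches above rely on every row having either 4 or 5 elements. A quick count per row makes that assumption explicit before anything is reshaped. This is a rough sketch, not a replacement for the answers: the variable names are made up for illustration, and the sample rows are typed in directly instead of read from a file.

```r
# Sample rows from the question (assumed here instead of reading a file)
txt <- c("abc.def ghi.jkl mno pqr A*",
         "bc.def NA no qr A",
         "c-e.ef non qrr AE",
         "fg.gg no qr E",
         "aa.bb cc.dd ee ff A*")

parts <- strsplit(trimws(txt), "\\s+")
n <- lengths(parts)   # number of elements in each row

# Stop early if any row has an unexpected number of elements
stopifnot(all(n %in% c(4, 5)))

# Insert NA as the second element of the 4-element rows only
short <- which(n == 4)
parts[short] <- lapply(parts[short], append, values = NA, after = 1)

# Every row now has 5 elements, so the rows can be stacked into a matrix
fixed <- do.call(rbind, parts)
fixed
```

As in the readLines answer, a literal "NA" in the input stays the string "NA" here; it can be converted afterwards with fixed[fixed == "NA"] <- NA if real missing values are wanted.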