How can I skip invalid rows when reading a data frame from a file in R?

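I have a big file containing a lot of data, and I want to read it into a data frame, but it contains some invalid rows. These invalid rows cause read.table to fail. I tried the method below to skip the invalid rows, but it performs very badly.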

counts<-count.fields(textConnection(lines),sep="\001") 
raw_data<-read.table(textConnection(lines[counts == 34]), sep="\001")

Is there a better way to achieve this? Thanks.

Source

2012-05-15 zjffdu

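What is your definition of "bad"?

Is there any reason you aren't using read.table directly? It has plenty of arguments for selecting and ignoring various "bad" characters. If that is the problem you are running into, there is also an argument to fill incomplete rows.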

Using @PaulHiemstra's sample data:

read.table("test.csv", sep = ";", fill=TRUE)

Then you will want to take care of the NAs.
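One way to take care of those NAs, sketched under the assumption that every field of a valid row is non-missing (so any NA marks a short row), is to keep only the complete cases. The temporary file below merely recreates the four-line sample; `tmp`, `filled`, and `clean` are illustrative names.

```r
# Recreate the small sample file in a temporary location (illustrative only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;2;3;4", "1;2;3;4", "1;2;3", "1;2;3;4"), tmp)

# fill = TRUE pads short rows with NA instead of raising an error.
filled <- read.table(tmp, sep = ";", fill = TRUE)

# Keep only the rows that have a value in every column.
clean <- filled[complete.cases(filled), ]
print(clean)

unlink(tmp)
```

Keep in mind that read.table works out the number of columns from the first five lines of input unless col.names is supplied, so this sketch assumes a full-width row appears early in the file.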

Source

2012-05-15 11:49:33 Paolo

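I added your answer as an extra option in the benchmark.

Lazy me - that was my answer in my first comment, but you wrote it out in better detail.

@Carl, +1 for your comment karma. – BenBarnes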

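What you can do is iterate over the lines in the file and only add the lines that have the correct length.

I defined the following test csv file: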

1;2;3;4 
1;2;3;4 
1;2;3 
1;2;3;4

Using read.table fails:

> read.table("test.csv", sep = ";") 
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :                  
    line 3 did not have 4 elements

Now the iterative approach:

library(plyr)
no_lines = 4           # number of lines in the file
correct_length = 4     # expected number of fields per line
file_con = file("test.csv", "r")
# Read one line at a time; keep it only if it has the expected number of fields
result = ldply(1:no_lines, function(line) {
    dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
    if(length(dum) == correct_length) {
        return(dum)
    } else {
        cat(sprintf("Skipped line %s\n", line))
        return(NULL)
    }
})
close(file_con)

> result
  V1 V2 V3 V4
1  1  2  3  4
2  1  2  3  4
3  1  2  3  4

Of course this is a trivial example, as the file is really small. Let's create a more challenging example to act as a benchmark.

# First file with invalid rows 
norow = 10e5 # number of rows 
no_lines = round(runif(norow, min = 3, max = 4)) 
no_lines[1] = correct_length 
file_content = ldply(no_lines, function(line) paste(1:line, collapse = ";")) 
writeLines(paste(file_content[[1]], sep = "\n"), "big_test.csv") 

# Same length with valid rows 
file_content = ldply(rep(4, norow), function(line) paste(1:line, collapse = ";")) 
writeLines(paste(file_content[[1]], sep = "\n"), "big_normal.csv")

Now for the benchmark:

# Iterative approach
system.time({file_con <- file("big_test.csv", "r")
    result_test <- ldply(1:norow, function(line) {
        dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
        if(length(dum) == correct_length) {
            return(dum)
        } else {
            # Commenting this speeds up by 30%
            #cat(sprintf("Skipped line %s\n", line))
            return(NULL)
        }
    })
    close(file_con)})
   user  system elapsed
 20.559   0.047  20.775

# Normal read.table 
system.time(result_normal <- read.table("big_normal.csv", sep = ";")) 
    user system elapsed 
    1.060 0.015 1.079 

# read.table with fill = TRUE 
system.time({result_fill <- read.table("big_test.csv", sep = ";", fill=TRUE)
      # complete.cases() returns a logical vector, so index with it directly
      complete_rows <- complete.cases(result_fill)
      result_fill <- result_fill[complete_rows,]})
    user system elapsed 
    1.161 0.033 1.203 

# Specifying which type the columns are (e.g. character or numeric) 
# using the colClasses argument. 
system.time({result_fill <- read.table("big_test.csv", sep = ";", fill=TRUE,
             colClasses = rep("numeric", 4))
      # complete.cases() returns a logical vector, so index with it directly
      complete_rows <- complete.cases(result_fill)
      result_fill <- result_fill[complete_rows,]})
    user system elapsed 
    0.933 0.064 1.001

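So the iterative approach is quite a bit slower, but 20 seconds for a million rows might be acceptable (although that depends on your definition of acceptable), especially when you only need to do this once and then store the result with save for later retrieval. The solution suggested by @Paolo is almost as fast as a normal call to read.table. Rows that contain the wrong number of columns (and therefore NAs) are excluded using complete.cases. Specifying the classes of the columns improves performance further, and I think this effect will become bigger as the number of columns and rows grows.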
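The "read once, then save" idea could look roughly like this. It is only a sketch: the data frame is a small stand-in for the cleaned result, and the file path is a temporary one chosen for illustration.

```r
# Stand-in for the cleaned data frame produced by one of the approaches above.
result <- data.frame(V1 = c(1, 1, 1), V2 = c(2, 2, 2),
                     V3 = c(3, 3, 3), V4 = c(4, 4, 4))

# Save the object once to a binary .RData file (temporary path, illustrative).
rdata_file <- tempfile(fileext = ".RData")
save(result, file = rdata_file)

# In a later session, load() restores the object under its original name.
rm(result)
load(rdata_file)
print(result)
```

Loading the saved object avoids re-parsing and re-filtering the text file each time the data is needed.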

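So in conclusion, the best option is to use read.table with fill = TRUE while specifying the classes of the columns. The iterative approach using ldply is only a good option if you want more flexibility in choosing how to read the lines, e.g. only reading a line if a certain value is above a threshold. But that can probably be done faster by reading all the data into R and then creating a subset. Only when the data is bigger than your RAM can I imagine the iterative approach having its merits.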
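For the larger-than-RAM case, one possible variation (an assumption, not something benchmarked in this answer) is to read the file in chunks of lines rather than one line at a time, keeping only the valid rows from each chunk. The function name `read_valid_rows`, its arguments, and the chunk size are all hypothetical, and the sketch assumes every field is numeric.

```r
# Hypothetical helper: read `path` in chunks and keep only lines that split
# into exactly `n_fields` fields. Only the kept rows accumulate in memory.
read_valid_rows <- function(path, sep = ";", n_fields = 4, chunk_size = 10000) {
  con <- file(path, "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break            # end of file reached
    fields <- strsplit(chunk, sep, fixed = TRUE)
    ok <- lengths(fields) == n_fields        # TRUE for well-formed lines
    if (any(ok)) {
      # One row per valid line, fields converted to numeric.
      m <- matrix(as.numeric(unlist(fields[ok])), ncol = n_fields, byrow = TRUE)
      pieces[[length(pieces) + 1]] <- as.data.frame(m)
    }
  }
  do.call(rbind, pieces)
}

# Small demonstration on a temporary copy of the sample file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;2;3;4", "1;2;3;4", "1;2;3", "1;2;3;4"), tmp)
valid <- read_valid_rows(tmp, chunk_size = 2)
print(valid)
unlink(tmp)
```

Further row-level conditions, such as a threshold on one field, could be combined into `ok` before the rows are kept.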

Source

2012-05-15 11:32:34
