2014-02-15
1

I'm confused about which of the following I should use to read a text file in R: read.table, read.csv, or scan? (So far every one of them has given me an error.)

> beef = read.csv("beef.txt", header = TRUE) 
Error in read.table(file = file, header = header, sep = sep, quote = quote, : 
    more columns than column names 
> beef = scan("beef.txt") 
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
    scan() expected 'a real', got '%' 
> beef=read.table("beef.txt", header = FALSE, sep = " ") 
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
    line 1 did not have 8 elements 
> beef=read.table("beef.txt", header = TRUE, sep = " ") 
Error in read.table("beef.txt", header = TRUE, sep = " ") : 
    more columns than column names 

Here is the top of the beef.txt file; the rest is very similar.

% http://lib.stat.cmu.edu/DASL/Datafiles/agecondat.html 
% 
% Datafile Name: Agricultural Economics Studies 
% Datafile Subjects: Agriculture , Economics , Consumer 
% Story Names: Agricultural Economics Studies 
% Reference: F.B. Waugh, Graphic Analysis in Agricultural Economics, 
% Agricultural Handbook No. 128, U.S. Department of Agriculture, 1957. 
% Authorization: free use 
% Description: Price and consumption per capita of beef and pork 
% annually from 1925 to 1941 together with other variables relevant to 
% an economic analysis of price and/or consumption of beef and pork 
% over the period. 
% Number of cases: 17 
% Variable Names: 
% 
% PBE = Price of beef (cents/lb) 
% CBE = Consumption of beef per capita (lbs) 
% PPO = Price of pork (cents/lb) 
% CPO = Consumption of pork per capita (lbs) 
% PFO = Retail food price index (1947-1949 = 100) 
% DINC = Disposable income per capita index (1947-1949 = 100) 
% CFO = Food consumption per capita index (1947-1949 = 100) 
% RDINC = Index of real disposable income per capita (1947-1949 = 100) 
% RFP = Retail food price index adjusted by the CPI (1947-1949 = 100) 
% 
% The Data: 
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP 
1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877 
1926 59.7 59.4 63.3 63.3 68 52.6 92.1 69.6 899 
1927 63 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883 
1928 71 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884 
1929 71 49 55 68.7 65.6 55.1 91.1 75.2 895 

When I use fread, the data is stored very strangely, as shown below. Any idea how to get it formatted as expected?

> library(data.table) 
> beef=fread("beef.txt", header = T, sep = " ") 
> beef 
    YEAR V2 V3 V4 
1: 1925 NA NA NA 
2: 1926 NA NA NA 
3: 1927 NA NA NA 
4: 1928 NA NA NA 
5: 1929 NA NA NA 
6: 1930 NA NA NA 
7: 1931 NA NA NA 
8: 1932 NA NA NA 
9: 1933 NA NA NA 
10: 1934 NA NA NA 
11: 1935 NA NA NA 
12: 1936 NA NA NA 
13: 1937 NA NA NA 
14: 1938 NA NA NA 
15: 1939 NA NA NA 
16: 1940 NA NA NA 
17: 1941 NA NA NA 
     PBE\tCBE\tPPO\tCPO\tPFO\tDINC\tCFO\tRDINC\tRFP 
1: 59.7\t58.6\t60.5\t65.8\t65.8\t51.4\t90.9\t68.5\t877 
2: 59.7\t59.4\t63.3\t63.3\t68\t52.6\t92.1\t69.6\t899 
3: 63\t53.7\t59.9\t66.8\t65.5\t52.1\t90.9\t70.2\t883 
4: 71\t48.1\t56.3\t69.9\t64.8\t52.7\t90.9\t71.9\t884 
5:  71\t49\t55\t68.7\t65.6\t55.1\t91.1\t75.2\t895 
6: 74.2\t48.2\t59.6\t66.1\t62.4\t48.8\t90.7\t68.3\t874 
7:  72.1\t47.9\t57\t67.4\t51.4\t41.5\t90\t64\t791 
8:  79\t46\t49.5\t69.7\t42.8\t31.4\t87.8\t53.9\t733 
9: 73.1\t50.8\t47.3\t68.7\t41.6\t29.4\t88\t53.2\t752 
10: 70.2\t55.2\t56.6\t62.2\t46.4\t33.2\t89.1\t58\t811 
11: 82.2\t52.2\t73.9\t47.7\t49.7\t37\t87.3\t63.2\t847 
12: 68.4\t57.3\t64.4\t54.4\t50.1\t41.8\t90.5\t70.5\t845 
13:  73\t54.4\t62.2\t55\t52.1\t44.5\t90.4\t72.5\t849 
14: 70.2\t53.6\t59.9\t57.4\t48.4\t40.8\t90.6\t67.8\t803 
15: 67.8\t53.9\t51\t63.9\t47.1\t43.5\t93.8\t73.2\t793 
16: 63.4\t54.2\t41.5\t72.4\t47.8\t46.5\t95.5\t77.6\t798 
17:  56\t60\t43.9\t67.4\t52.2\t56.3\t97.5\t89.5\t830 

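When I tell read.table about the comment character, as suggested in the comments, I still get odd output (it is not read as neatly as I expected):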

> beef=read.table("beef.txt", header = TRUE, sep = " ", comment.char="%") 
> beef 
    YEAR X X.1 X.2 
1 1925 NA NA NA 
2 1926 NA NA NA 
3 1927 NA NA NA 
4 1928 NA NA NA 
5 1929 NA NA NA 
6 1930 NA NA NA 
7 1931 NA NA NA 
8 1932 NA NA NA 
9 1933 NA NA NA 
10 1934 NA NA NA 
11 1935 NA NA NA 
12 1936 NA NA NA 
13 1937 NA NA NA 
14 1938 NA NA NA 
15 1939 NA NA NA 
16 1940 NA NA NA 
17 1941 NA NA NA 
       PBE.CBE.PPO.CPO.PFO.DINC.CFO.RDINC.RFP 
1 59.7\t58.6\t60.5\t65.8\t65.8\t51.4\t90.9\t68.5\t877 
2 59.7\t59.4\t63.3\t63.3\t68\t52.6\t92.1\t69.6\t899 
3 63\t53.7\t59.9\t66.8\t65.5\t52.1\t90.9\t70.2\t883 
4 71\t48.1\t56.3\t69.9\t64.8\t52.7\t90.9\t71.9\t884 
5  71\t49\t55\t68.7\t65.6\t55.1\t91.1\t75.2\t895 
6 74.2\t48.2\t59.6\t66.1\t62.4\t48.8\t90.7\t68.3\t874 
7  72.1\t47.9\t57\t67.4\t51.4\t41.5\t90\t64\t791 
8  79\t46\t49.5\t69.7\t42.8\t31.4\t87.8\t53.9\t733 
9 73.1\t50.8\t47.3\t68.7\t41.6\t29.4\t88\t53.2\t752 
10 70.2\t55.2\t56.6\t62.2\t46.4\t33.2\t89.1\t58\t811 
11 82.2\t52.2\t73.9\t47.7\t49.7\t37\t87.3\t63.2\t847 
12 68.4\t57.3\t64.4\t54.4\t50.1\t41.8\t90.5\t70.5\t845 
13  73\t54.4\t62.2\t55\t52.1\t44.5\t90.4\t72.5\t849 
14 70.2\t53.6\t59.9\t57.4\t48.4\t40.8\t90.6\t67.8\t803 
15 67.8\t53.9\t51\t63.9\t47.1\t43.5\t93.8\t73.2\t793 
16 63.4\t54.2\t41.5\t72.4\t47.8\t46.5\t95.5\t77.6\t798 
17  56\t60\t43.9\t67.4\t52.2\t56.3\t97.5\t89.5\t830 

So, thanks to the comments, it turns out the separator is not a space but a tab. Here is the correct answer:

> beef=read.table("beef.txt", header = TRUE, sep = "\t", comment.char="%") 
> beef 
    YEAR....PBE CBE PPO CPO PFO DINC CFO RDINC RFP 
1 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877 
2 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899 
3 1927 63 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883 
4 1928 71 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884 
5 1929 71 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895 
6 1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7 68.3 874 
7 1931 72.1 47.9 57.0 67.4 51.4 41.5 90.0 64.0 791 
8 1932 79 46.0 49.5 69.7 42.8 31.4 87.8 53.9 733 
9 1933 73.1 50.8 47.3 68.7 41.6 29.4 88.0 53.2 752 
10 1934 70.2 55.2 56.6 62.2 46.4 33.2 89.1 58.0 811 
11 1935 82.2 52.2 73.9 47.7 49.7 37.0 87.3 63.2 847 
12 1936 68.4 57.3 64.4 54.4 50.1 41.8 90.5 70.5 845 
13 1937 73 54.4 62.2 55.0 52.1 44.5 90.4 72.5 849 
14 1938 70.2 53.6 59.9 57.4 48.4 40.8 90.6 67.8 803 
15 1939 67.8 53.9 51.0 63.9 47.1 43.5 93.8 73.2 793 
16 1940 63.4 54.2 41.5 72.4 47.8 46.5 95.5 77.6 798 
17 1941 56 60.0 43.9 67.4 52.2 56.3 97.5 89.5 830 
+1

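Look up the 'comment.char' argument of 'read.table'. ('read.csv' is just a wrapper around 'read.table'; 'scan' is similar, but you have to specify the field types yourself.) – krlmlr

+1

'read.table' is just a wrapper around 'scan'. – Roland

+1

If you have a well-formed '.csv', don't think: go with 'read.csv()'. If you have a 'fwf' file (fixed-width format, common in meteorology), use 'read.fwf()'. Otherwise, try 'read.table()' or 'scan()' with the right arguments. – Fernando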
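
To make the comments above concrete, here is a minimal sketch of the difference between 'scan' and 'read.table'. The five-line `txt` vector is a made-up miniature of beef.txt, not the real file, and `skip = 3` is tied to that miniature; the real file has more comment lines.

```r
# Hypothetical miniature of beef.txt: "%" comment lines, a header, then data.
txt <- c(
  "% Description: price and consumption of beef",
  "% The Data:",
  "YEAR\tPBE\tCBE",
  "1925\t59.7\t58.6",
  "1926\t59.7\t59.4"
)

# scan() needs the field types spelled out in `what`; here three numeric fields.
# skip = 3 steps over the two comment lines and the header line of this sample.
fields <- scan(text = txt, what = list(YEAR = 0, PBE = 0, CBE = 0),
               skip = 3, quiet = TRUE)

# scan() returns a plain list, so convert it to get a data frame.
beef_scan <- as.data.frame(fields)

# read.table() reads the header and infers the column types on its own.
beef_rt <- read.table(text = txt, header = TRUE, comment.char = "%")
```

Both calls end up with the same numbers; 'read.table' simply does the bookkeeping (header, types, comment lines) that 'scan' leaves to the caller.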

Answers

1

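There is a recent blog post showing that fread is the fastest and that the others perform about the same. Link: http://statcompute.wordpress.com/2014/02/11/efficiency-of-importing-large-csv-files-in-r/

In your case it doesn't matter; use whichever one you are most comfortable with.

An example using fread follows (assuming a TAB separator):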

library(data.table) 
a = fread("data.csv", skip=26) 
a 
    YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP 
1: 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877 
2: 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899 
3: 1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883 
4: 1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884 
5: 1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895 
+0

I used 'fread', but the data was read in strangely! Any hints? –

+1

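I didn't downvote, but I guess someone thought this was too much of a link-only answer, and the blog post is a poor version of [our FAQ](http://stackoverflow.com/q/1727772/1412059). You should improve your answer and show how to handle a file like this with 'fread'. – Roland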

+1

I actually upvoted, because it's good to know there are other, faster methods (even though in my case the read time is effectively zero) –

3
beef=read.table("beef.txt", header = TRUE, sep = " ", comment.char="%") 

beef=read.table("beef.txt", header = TRUE, sep = "\t", comment.char="%") #after update 
+0

Please take a look at the updated question. I used your approach, but the file is not read in neatly! –

+1

Use 'sep = "\t"' – Ananta

+0

So when we read a file with read.table, is the final result a data.frame? If not, how should I convert it to a data.frame? –
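
For what it's worth, the last question can be checked directly in the console. A small sketch, using a made-up two-row sample in place of beef.txt:

```r
# Hypothetical two-row sample standing in for beef.txt.
beef <- read.table(text = "YEAR PBE\n1925 59.7\n1926 59.7", header = TRUE)

class(beef)          # the class of the object read.table() returned
is.data.frame(beef)  # TRUE if it is already a data frame
str(beef)            # compact summary of the columns and their types
```

read.table() returns a data.frame, so no conversion is needed. (fread() returns a data.table, which also inherits from data.frame.)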

1

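Here is an alternative using readLines in base R. This approach is much more complicated, but the numeric data it returns is ready for analysis. However, you have to count the columns in the raw data file by hand and then reassign the column names.

EDIT

At the bottom I have added a general version that does not require counting columns or adding column names by hand.

Note that either version works whether the data are separated by spaces or by tabs.

Here is the code for the original version: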

my.data <- readLines('c:/users/mmiller21/simple R programs/beef.txt') 

ncols <- 10 

header.info <- ifelse(substr(my.data, 1, 1) == '%', 1, 0) 

my.data2 <- my.data[header.info==0] 

my.data3 <- data.frame(matrix(unlist(strsplit(my.data2[-1], "[^0-9,.]+")), ncol=ncols, byrow=TRUE), stringsAsFactors = FALSE) 

my.data4 <- as.data.frame(apply(my.data3, 2, function(x) as.numeric(x))) 

colnames(my.data4) <- c('YEAR', 'PBE', 'CBE', 'PPO', 'CPO', 'PFO', 'DINC', 'CFO', 'RDINC', 'RFP') 

> my.data4 
    YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP 
[1,] 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877 
[2,] 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899 
[3,] 1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883 
[4,] 1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884 
[5,] 1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895 

Here are the contents of the raw data file:

% http://lib.stat.cmu.edu/DASL/Datafiles/agecondat.html 
% 
% Datafile Name: Agricultural Economics Studies 
% Datafile Subjects: Agriculture , Economics , Consumer 
% Story Names: Agricultural Economics Studies 
% Reference: F.B. Waugh, Graphic Analysis in Agricultural Economics, 
% Agricultural Handbook No. 128, U.S. Department of Agriculture, 1957. 
% Authorization: free use 
% Description: Price and consumption per capita of beef and pork 
% annually from 1925 to 1941 together with other variables relevant to 
% an economic analysis of price and/or consumption of beef and pork 
% over the period. 
% Number of cases: 17 
% Variable Names: 
% 
% PBE = Price of beef (cents/lb) 
% CBE = Consumption of beef per capita (lbs) 
% PPO = Price of pork (cents/lb) 
% CPO = Consumption of pork per capita (lbs) 
% PFO = Retail food price index (1947-1949 = 100) 
% DINC = Disposable income per capita index (1947-1949 = 100) 
% CFO = Food consumption per capita index (1947-1949 = 100) 
% RDINC = Index of real disposable income per capita (1947-1949 = 100) 
% RFP = Retail food price index adjusted by the CPI (1947-1949 = 100) 
% 
% The Data: 
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP 
1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877 
1926 59.7 59.4 63.3 63.3 68 52.6 92.1 69.6 899 
1927 63 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883 
1928 71 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884 
1929 71 49 55 68.7 65.6 55.1 91.1 75.2 895 

Here is the code for the general version:

my.data <- readLines('c:/users/mmiller21/simple R programs/beef.txt') 

header.info <- ifelse(substr(my.data, 1, 1) == '%', 1, 0) 

my.data2 <- my.data[header.info==0] 

ncols <- length(read.table(textConnection(my.data2[1]))) 

my.data3 <- data.frame(matrix(unlist(strsplit(my.data2[-1], "[^0-9,.]+")), ncol=ncols, byrow=TRUE), stringsAsFactors = FALSE) 

my.data4 <- as.data.frame(apply(my.data3, 2, function(x) as.numeric(x))) 

#colnames(my.data4) <- c('YEAR', 'PBE', 'CBE', 'PPO', 'CPO', 'PFO', 'DINC', 'CFO', 'RDINC', 'RFP') 
#my.data4 

colnames(my.data4) <- read.table(textConnection(my.data2[1]), colClasses = c('character')) 
my.data4 

colSums(my.data4) 

sum(my.data4$PPO) 
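
A shorter variant of the same idea, offered only as a sketch: once the comment lines have been filtered out, the remaining lines can be handed to read.table() through its `text` argument, which takes care of splitting on whitespace, reading the header, and converting to numeric. The `my.lines` vector below is a made-up stand-in for the result of readLines() on the real file.

```r
# Hypothetical stand-in for readLines("beef.txt"): one comment line,
# a header, and two data rows with mixed space/tab separators.
my.lines <- c(
  "% The Data:",
  "YEAR PBE\tCBE",
  "1925 59.7\t58.6",
  "1926 59.7\t59.4"
)

# Keep only the lines that do not start with "%".
data.lines <- my.lines[substr(my.lines, 1, 1) != "%"]

# The default sep = "" treats any run of spaces or tabs as one separator.
my.data <- read.table(text = data.lines, header = TRUE)

colSums(my.data)
```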
1

The correct answer is as follows:

beef=read.table("beef.txt", header = TRUE, sep = "", comment.char="%")
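
Why `sep = ""` helps here can be seen on a small sample. The sketch below assumes, based on the `YEAR....PBE` column name in the question's output, that the file mixes spaces (between YEAR and PBE) with tabs (between the other fields); the `txt` vector is made up to mimic that.

```r
# Hypothetical sample mimicking the mix seen in the question:
# spaces between YEAR and PBE, tabs between the other fields.
txt <- c(
  "% The Data:",
  "YEAR    PBE\tCBE",
  "1925    59.7\t58.6",
  "1926    59.7\t59.4"
)

# sep = "\t": only tabs split fields, so YEAR and PBE stay glued in one column.
by.tab <- read.table(text = txt, header = TRUE, sep = "\t", comment.char = "%")
names(by.tab)

# sep = "": any run of spaces or tabs separates fields.
by.ws <- read.table(text = txt, header = TRUE, sep = "", comment.char = "%")
names(by.ws)
```

With `sep = ""` every field lands in its own column, regardless of whether spaces or tabs were used between them.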