2017-01-11

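R data.table fread() not bringing in the whole text file correctly

I am currently reading large datasets into R, and I found that fread() from data.table can do it in a reasonable time (read.csv is really slow for me).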

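I am running into a couple of problems that I would like some insight on. There is a "??" marker in front of one column name, which I can quickly fix with a rename statement, but the values in this column are completely different from the original file. The value should be a 16-digit identifier (like "1100110011001100"), but when it is read in, it shows up as "3.598E-310".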

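I don't know whether this is because my data is in UTF-8 format, but I am having trouble figuring out what is going on. There is another variable with similar characteristics (a 12-digit code), and it is also shown in exponential notation. All of my other variables look fine; only these two long numeric codes are brought in wrong.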

Answers

1

You should be getting a helpful warning:

library(data.table) #1.10.0 

DT <- fread("1100110011001100 
     1100110011001100") 
#Warning message: 
#In fread("1100110011001100\n  1100110011001100") : 
# Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again. 

print(DT) 
#    V1 
#1: 5.435266e-309 
#2: 5.435266e-309 
#Warning message: 
#In print.data.table(DT) : 
# Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again. 

library(bit64) 
print(DT) 
#     V1 
#1: 1100110011001100 
#2: 1100110011001100 
1

If I understand correctly, the OP's 16-digit identifier is meant to be of type character.

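However, fread() determines the column types from a sample of rows (see ?fread for details). Apparently, it tries to read the data as integer64. The colClasses parameter can be used to override the guess made by fread():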

DT <- fread("1100110011001100 
     1100110011001100", colClasses = "character") 
DT 
#     V1 
#1: 1100110011001100 
#2: 1100110011001100 
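
In a file with several columns you probably only want to override the identifier column and leave the rest to fread(). A minimal sketch, assuming a hypothetical two-column input (the column names id16 and amount are made up for illustration), passes colClasses as a named vector:

```r
library(data.table)

# Hypothetical input: id16 and amount are illustrative names, not from the OP's file
txt <- "id16,amount\n1100110011001100,10\n1100110011001101,20"

# Only id16 is forced to character; the type of amount is still guessed by fread()
DT <- fread(txt, colClasses = c(id16 = "character"))

sapply(DT, class)  # one class per column
```

The same named vector could list the 12-digit code column as well, so that both identifiers come in as text.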

If the verbose parameter is set to TRUE, fread() reveals some of its inner workings:

DT <- fread("1100110011001100 
     1100110011001100", colClasses = "character", verbose = TRUE) 
Input contains a \n (or is ""). Taking this to be text input (not a filename) 
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. 
Positioned on line 1 after skip or autostart 
This line is the autostart and not blank so searching up for the last non-blank ... line 1 
Detecting sep ... Deducing this is a single column input. 
Starting data input on line 1 (either column names or first row of data). First 10 characters: 1100110011 
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names. 
Count of eol: 2 (including 0 at the end) 
ncol==1 so sep count ignored 
Type codes (point 0): 2 
Column 1 ('V1') was detected as type 'integer64' but bumped to 'character' as requested by colClasses 
Type codes: 4 (after applying colClasses and integer64) 
Type codes: 4 (after applying drop or select (if supplied) 
Allocating 1 column slots (1 - 0 dropped) 
Read 2 rows. Exactly what was estimated and allocated up front 
    0.000s ( 0%) Memory map (rerun may be quicker) 
    0.000s ( 0%) sep and header detection 
    0.000s ( 0%) Count rows (wc -l) 
    0.000s ( 0%) Column type detection (100 rows at 10 points) 
    0.000s ( 0%) Allocation of 2x1 result (xMB) in RAM 
    0.000s ( 0%) Reading data 
    0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 
    0.000s ( 0%) Coercing data already read in type bumps (if any) 
    0.000s ( 0%) Changing na.strings to NA 
    0.001s  Total 

This may help in analysing the problem with the variable that holds the 12-digit code.
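
A 12-digit code is also larger than R's native integer range, so it is presumably detected as integer64 in the same way. The sketch below uses made-up 12-digit values (not the OP's data) to compare against the integer limit and to read the codes as character:

```r
library(data.table)

# Largest value an ordinary R integer can hold (10 digits)
.Machine$integer.max

# Hypothetical 12-digit codes, read as text via colClasses
DT12 <- fread("123456789012\n123456789013", colClasses = "character")

DT12$V1         # the codes as read
nchar(DT12$V1)  # number of characters in each code
```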