
Ignoring bad rows of data in pandas.read_csv() that break the header= keyword

I have a series of very messy *.csv files that are being read in by pandas. An example CSV is:

Instrument 35392 
"Log File Name : station" 
"Setup Date (MMDDYY) : 031114" 
"Setup Time (HHMMSS) : 073648" 
"Starting Date (MMDDYY) : 031114" 
"Starting Time (HHMMSS) : 090000" 
"Stopping Date (MMDDYY) : 031115" 
"Stopping Time (HHMMSS) : 235959" 
"Interval (HHMMSS) : 010000" 
"Sensor warmup (HHMMSS) : 000200" 
"Circltr warmup (HHMMSS) : 000200" 


"Date","Time","","Temp","","SpCond","","Sal","","IBatt","" 
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","Volts","" 

"Random message here 031114 073721 to 031114 083200" 
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,"" 
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,"" 
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,"" 
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,"" 
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,"" 
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,"" 
"message 3" 
"message 4"** 

I have been using this code to import the *.csv files, process the double header rows, pull out the empty columns, and then strip the offending rows with bad data:

import pandas as pd

# combine the date and time columns (0 and 1) into one 'Datetime_(ascii)' column
DF = pd.read_csv(BADFILE, parse_dates={'Datetime_(ascii)': [0, 1]}, sep=",",
                 header=[10, 11], na_values=['', 'na', 'nan nan'],
                 skiprows=[10], encoding='cp1252')

DF = DF.dropna(how="all", axis=1)  # drop the null columns from the "" fields
DF = DF.dropna(thresh=2)           # drop rows without at least 2 real values
droplist = ['message', 'Random']   # strip the offending message rows
DF = DF[~DF['Datetime_(ascii)'].str.contains('|'.join(droplist))]

DF.head()

    Datetime_(ascii)   (Temp, øC)   (SpCond, mS/cm)   (Sal, ppt)   (IBatt, Volts)
0  03/11/14 09:00:00        15.85            1.408         0.74             6.2
1  03/11/14 10:00:00        15.99            1.960         1.05             6.3
2  03/11/14 11:00:00        14.20           40.800        26.12             6.2
3  03/11/14 12:00:01        14.20           41.700        26.77             6.2
4  03/11/14 13:00:00        14.50           41.300        26.52             6.2

This was working fine and dandy until I got a file that had an erroneous one-column line after the header: "Random message here 031114 073721 to 031114 083200"

The error I receive is:

C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
   1554         data, names = _process_date_conversion(
   1555             data, self._date_conv, self.parse_dates, self.index_col,
-> 1556             self.index_names, names, keep_date_col=self.keep_date_col)
   1557
   1558         return names, data

C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   2975     if not keep_date_col:
   2976         for c in list(date_cols):
-> 2977             data_dict.pop(c)
   2978             new_cols.remove(c)
   2979

KeyError: ('Time', 'HHMMSS')

If I remove that one line, the code works fine. Similarly, if I remove the header= lines, the code works fine. But I want to be able to preserve those, since I am reading in hundreds of these files.

Difficulty: I would prefer not to open each file before the call to pandas.read_csv(), as these files can be rather large, so I don't want to read and save them multiple times! Also, I would prefer a real pandas/pythonic solution that doesn't involve first opening the file as a StringIO buffer to remove the offending line.


Can you post the offending lines? In every case where there is an error, does it occur on the same kind of bad line, or could other kinds of problems exist on other lines in some of the files? –


The bad line creating the error is: "Random message here 031114 073721 to 031114 083200". This line may or may not exist in every file, so I can't just add skiprows=index. Also, if I change the actual text of that line, the error persists - it doesn't matter what the text is, only that it is a one-column line after the header. –

Answers


Here's one approach, making use of the fact that skiprows accepts a callable. The function receives only the row index being considered, which is a built-in limitation of that parameter. As such, the callable function skip_test() first checks whether the current index is in the set of known indices to skip. If not, then it opens the actual file and checks the corresponding line to see whether its contents match.

The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only looks as far as the current row index it is evaluating. It also assumes that the bad line always begins with the same string ("foo" in the example), but that seems to be a safe assumption given the OP's description.

# example data
""" foo.csv
uid,a,b,c
0,1,2,3
skip me
1,11,22,33
foo
2,111,222,333
"""

import pandas as pd

def skip_test(r, fn, fail_on, known):
    if r in known:  # we know we always want to skip these
        return True
    # check if the row index matches the problem line in the file;
    # for efficiency, quit once we have passed that row index
    with open(fn, "r") as f:
        for i, line in enumerate(f):
            if i == r and line.startswith(fail_on):
                return True
            elif i > r:
                break
    return False

fname = "foo.csv" 
fail_str = "foo" 
known_skip = [2] 
pd.read_csv(fname, sep=",", header=0, 
      skiprows=lambda x: skip_test(x, fname, fail_str, known_skip)) 
# output 
    uid a b c 
0 0 1 2 3 
1 1 11 22 33 
2 2 111 222 333 

If you know ahead of time which lines the random messages can appear on when they do appear, then this will be much faster, since you can tell it not to inspect the file contents for any index prior to the potential offending lines. A sketch of that shortcut follows.
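For instance, a minimal sketch of that shortcut, assuming the messages can never appear before some known row index (first_possible is a hypothetical extra parameter, not part of the answer above):

def skip_test_fast(r, fn, fail_on, known, first_possible):
    if r in known:  # always skip these known indices
        return True
    if r < first_possible:
        # this row can't be a message line, so skip the file check entirely
        return False
    with open(fn, "r") as f:
        for i, line in enumerate(f):
            if i == r and line.startswith(fail_on):
                return True
            elif i > r:
                break
    return False

# usage mirrors the call above, e.g.:
# pd.read_csv(fname, sep=",", header=0,
#             skiprows=lambda x: skip_test_fast(x, fname, fail_str, known_skip, 4))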


Thanks! Yes, I know what messages will appear throughout my files, so I can parse for them. –


You're welcome! –


After some tinkering yesterday, I found a solution and identified what the potential problem was.

I tried the skip_test() function from the answer above, but I was still getting errors related to the size of the table:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)() 

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)() 

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)() 

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)() 

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)() 

ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 11 

So, after playing around with skiprows=, I discovered that I just wasn't getting the behavior I wanted when using engine='c'. read_csv() was still determining the size of the file from the first few rows, and some of those single-column rows were still being passed through. It may be that I have more single-column bad rows in my csv set than I had planned on.

Instead, I create an arbitrarily sized DataFrame as a template. I pull in the entire .csv file, then use logic to strip out the NaN rows.

For example, I know that the widest table I will encounter will be 10 columns. So my call to pandas is:

# read with 10 fixed column names so the C engine never mis-infers
# the field count from the first few rows
DF = pd.read_csv(csv_file, sep=',',
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 na_values=['', 'na', '999999', '#'], engine='c',
                 encoding='cp1252', names=list(range(0, 10)))

Then I use these two lines to drop the NaN rows and columns from the DataFrame:

# drop the null columns created by the double delimiters
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2)  # drop rows that don't have at least 2 cells with real values
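If any message rows happen to survive the thresh=2 filter, one further cleanup sketch (an extra step, not part of the answer above, assuming the same 'Datetime_(ascii)' column) is to keep only the rows whose combined date/time string actually parses as a datetime:

# errors='coerce' turns unparseable strings (header text, message rows)
# into NaT, so filtering on notnull() keeps only real data rows
parsed = pd.to_datetime(DF['Datetime_(ascii)'], errors='coerce')
DF = DF[parsed.notnull()]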