2016-09-08 39 views
1

ValueError: importing data in chunks with pandas.csv_reader()

I have a large gzip file that I want to import into a pandas DataFrame. Unfortunately, the file has an uneven number of columns. The data has roughly the following format:

.... Col_20: 25 Col_21: 23432 Col22: 639142 
.... Col_20: 25 Col_22: 25134 Col23: 243344 
.... Col_21: 75 Col_23: 79876 Col25: 634534 Col22: 5 Col24: 73453 
.... Col_20: 25 Col_21: 32425 Col23: 989423 
.... Col_20: 25 Col_21: 23424 Col22: 342421 Col23: 7 Col24: 13424 Col 25: 67 
.... Col_20: 95 Col_21: 32121 Col25: 111231 

As a test, I tried the following:

import pandas as pd
filename = 'path/to/filename.gz'

for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python'):
    print(chunk)

This is the error I get back:

Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 795, in __next__ 
    return self.get_chunk() 
    File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 836, in get_chunk 
    return self.read(nrows=size) 
    File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read 
    ret = self._engine.read(nrows) 
    File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 1761, in read 
    alldata = self._rows_to_cols(content) 
    File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 2166, in _rows_to_cols 
    raise ValueError(msg) 
ValueError: Expected 18 fields in line 28, saw 22 

How do you assign a set number of columns to pandas.read_csv()?
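For reference, one way to pre-allocate a fixed number of columns in read_csv is to pass an explicit names list that is at least as wide as the longest row; shorter rows are then padded with NaN. A minimal sketch using made-up inline data (not the asker's file):

```python
import io
import pandas as pd

# Hypothetical ragged data: rows have 3, 5, and 4 tab-separated fields.
raw = "a\tb\tc\na\tb\tc\td\te\na\tb\tc\td\n"

# An explicit `names` list wide enough for the widest row pre-allocates
# the columns; rows with fewer fields are padded with NaN.
df = pd.read_csv(io.StringIO(raw), sep="\t", names=range(5))

print(df.shape)  # (3, 5)
```

This only works if you know (or over-estimate) the maximum width in advance; rows wider than `names` would still fail or be treated as having index columns.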

+1

Your problem is some malformed csv; it has nothing to do with pre-allocating the number of columns. You need to do some additional debugging to find which files and lines specifically are malformed. You should post a link to the csv, or a small sample that reproduces the error – EdChum

+0

@EdChum It's not just one line; the whole file is actually like this. Some lines might have 20 columns, the next 28. What then? – ShanZhengYang

+0

I can't answer hypothetical questions without seeing the concrete data. When you post data it should have a regular separator and form; if it doesn't, then you need to clean the data first – EdChum

Answer

1

You could also try this:

for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python', error_bad_lines=False): 
    print(chunk) 

error_bad_lines will skip the lines it considers bad. I'll see whether a better option can be found.
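A small demonstration of the skipping behavior, using made-up inline data rather than the asker's file. Note that in pandas 1.3+ the `error_bad_lines` flag was replaced by `on_bad_lines='skip'`, which this sketch uses:

```python
import io
import pandas as pd

# Hypothetical data: the second data row has an extra, fourth field.
raw = "a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t10\n"

# 'skip' silently drops any row with too many fields
# (the modern equivalent of error_bad_lines=False).
df = pd.read_csv(io.StringIO(raw), sep="\t", on_bad_lines="skip")

print(len(df))  # 2: the 4-field row was dropped
```

As the comments below note, this discards data rather than recovering it, which may not be acceptable.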

EDIT: To preserve the lines skipped by error_bad_lines, we can catch the error and add them back to the DataFrame:

line     = [] 
expected = [] 
saw      = [] 
cont     = True 

while cont == True: 
    try: 
        data = pd.read_csv('file1.csv', skiprows=line) 
        cont = False 
    except Exception as e: 
        # message looks like: "Error tokenizing data. C error: Expected 18 fields in line 28, saw 22"
        message = str(e)  # Python 3: exceptions no longer have a .message attribute
        errortype = message.split('.')[0].strip() 
        if errortype == 'Error tokenizing data': 
            cerror = message.split(':')[1].strip().replace(',', '') 
            nums = [n for n in cerror.split(' ') if str.isdigit(n)] 
            expected.append(int(nums[0])) 
            saw.append(int(nums[2])) 
            line.append(int(nums[1]) - 1) 
        else: 
            cerror = 'Unknown' 
            print('Unknown Error - 222') 
+0

Thanks for your help. Yes, I think 'error_bad_lines' deletes data that I really need... – ShanZhengYang

+0

I found a dirty hack; let me edit the answer. It may not be ideal, but it should do the trick. – SerialDev

+1

The problem is that this becomes very tricky when dealing with large, GB-scale DataFrames. Thanks though, this is clever. – ShanZhengYang
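Since the sample rows are really "key: value" pairs rather than positional columns, another option (a sketch, assuming the real file consistently follows that "ColX: N" token pattern) is to parse each line into a dict and build the frame from the dicts; missing keys then simply become NaN, so ragged rows cost nothing. For a GB-scale gzip file you would stream it with gzip.open and build the frame in chunks:

```python
import pandas as pd

def parse_line(line):
    """Turn 'Col_20: 25 Col_21: 23432 ...' into {'Col_20': 25, 'Col_21': 23432, ...}.

    Assumes whitespace-separated tokens and that values never contain spaces.
    """
    tokens = line.split()
    return {k.rstrip(":"): int(v) for k, v in zip(tokens[::2], tokens[1::2])}

# Two made-up rows in the question's format, with different column sets.
raw = [
    "Col_20: 25 Col_21: 23432 Col_22: 639142",
    "Col_20: 25 Col_22: 25134 Col_23: 243344",
]
df = pd.DataFrame(parse_line(l) for l in raw)

print(df.shape)  # (2, 4): the union of Col_20, Col_21, Col_22, Col_23
```

Nothing is skipped and nothing needs to be retried; the cost is that you give up the fast C parser for a Python-level loop.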