2017-04-18 35 views

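Python3 working with csv files in tar file

I am trying to work with csv files contained in a tar.gz file, and I am having trouble passing the correct data/object to the csv module.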

Say I have a tar.gz file that contains a number of csv files formatted as follows.

1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30 
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26 
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31 
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38 

I would like to access each csv file in memory, without extracting each file from the tar file and writing it to disk. For example:

import tarfile 
import csv 

tar = tarfile.open("tar-file.tar.gz") 

for member in tar.getmembers(): 
    f = tar.extractfile(member).read() 
    content = csv.reader(f) 
    for row in content: 
        print(row)
tar.close() 

This produces the following error.

for row in content: 
_csv.Error: iterator should return strings, not int (did you open the file in text mode?) 

I also tried parsing f as a string, as described in the csv module documentation.

content = csv.reader([f]) 

The above produces the same error.
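For context, here is a minimal sketch of why the first attempt fails. It assumes a short hypothetical bytes value in place of the archive member. csv.reader iterates over whatever it is given, and iterating over a bytes object yields integers rather than strings, which appears to be what the error message is complaining about.

```python
import csv

# Hypothetical stand-in for the bytes returned by tar.extractfile(member).read().
raw = b"1079,SAMPLE_A,GROUP,001\n"

# Iterating over bytes yields ints (one per byte), not str lines.
first_item = next(iter(raw))

# csv.reader therefore receives an int where it expects a str and raises csv.Error.
try:
    next(csv.reader(raw))
    error_message = None
except csv.Error as exc:
    error_message = str(exc)

print(type(first_item), error_message)
```

Wrapping the bytes in a list does not help either, because the single item is still bytes rather than str.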

I tried decoding the file contents f as ascii.

f = tar.extractfile(member).read().decode('ascii') 

But this iterates over every character of the csv data, rather than over rows that each contain a list of elements.

['1'] 
['0'] 
['7'] 
['9'] 
['', ''] 
['S'] 
['A'] 
['M'] 
['P'] 
['L'] 
['E'] 
['_'] 
['A'] 
['', ''] 
['G'] 
['R'] 

snip...

['2'] 
['0'] 
['1'] 
['7'] 
['/'] 
['0'] 
['2'] 
['/'] 
['1'] 
['5'] 
[' '] 
['2'] 
['2'] 
[':'] 
['5'] 
['7'] 
[':'] 
['3'] 
['8'] 
[] 
[] 

Trying both to decode f as ASCII and to pass it as a string in a list

f = tar.extractfile(member).read().decode('ascii') 
content = csv.reader([f]) 

produces the following output

for row in content: 
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode? 
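The two results above share one cause: csv.reader treats each item of its iterable as one line. A str yields single characters, and a one-element list yields the entire text as a single "line". A minimal sketch of one possible workaround, assuming sample text copied from the data above, is to split the decoded text into lines first:

```python
import csv

# Sample text mirroring the decoded archive member from the question.
text = (
    "1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n"
    "1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n"
)

# splitlines() produces a list with one str per line, which is what csv.reader expects.
rows = list(csv.reader(text.splitlines()))

for row in rows:
    print(row)
```

This is only safe when no quoted field contains an embedded newline, since splitlines() would cut such a field in two.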

To demonstrate the different results, I used the following code.

import tarfile 
import csv 

tar = tarfile.open("tar-file.tar.gz") 

for member in tar.getmembers(): 
    f = tar.extractfile(member).read() 
    print(member.name) 
    print('Raw :', type(f)) 
    print(f) 
    print() 
    f = f.decode('ascii') 
    print('ASCII:', type(f)) 
    print(f) 
tar.close() 

This produces the following output. (Every csv in this example contains the same data.)

./raw_data/csv-file1.csv 
Raw : <class 'bytes'> 
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n' 

ASCII: <class 'str'> 
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30 
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26 
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31 
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38 


./raw_data/csv-file2.csv 
Raw : <class 'bytes'> 
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n' 

ASCII: <class 'str'> 
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30 
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26 
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31 
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38 


./raw_data/csv-file3.csv 
Raw : <class 'bytes'> 
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n' 

ASCII: <class 'str'> 
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30 
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26 
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31 
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38 

How can I get the csv module to correctly read the in-memory files provided by the tarfile module? Thanks.

Answer


You just need to use io.StringIO() to produce a file-like object for the csv library to use. For example:

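import tarfile
import csv
import io

# Open the archive from the question; tarfile detects the gzip compression itself.
with tarfile.open('tar-file.tar.gz') as tar:
    for member in tar:
        if member.isreg():  # Is it a regular file?
            print("{} - {} bytes".format(member.name, member.size))
            # Decode the member's bytes and wrap the text in a file-like object.
            csv_file = io.StringIO(tar.extractfile(member).read().decode('ascii'))

            for row in csv.reader(csv_file):
                print(row)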
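As a possible variation, the member can be decoded while it is read, instead of being read fully and then decoded. This is a sketch and not part of the original answer. It builds a small archive in memory so that it runs on its own, and the helper names build_sample_archive and read_csv_members, along with the sample data, are hypothetical.

```python
import csv
import io
import tarfile


def build_sample_archive():
    """Build a small gzip-compressed tar archive in memory (hypothetical sample data)."""
    payload = (
        b"1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n"
        b"1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n"
    )
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="raw_data/csv-file1.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_csv_members(archive):
    """Return {member name: list of rows} for every regular file in the archive."""
    result = {}
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isreg():
                continue
            binary_file = tar.extractfile(member)
            # TextIOWrapper decodes as it reads; newline="" leaves line endings to csv.
            with io.TextIOWrapper(binary_file, encoding="ascii", newline="") as text_file:
                result[member.name] = list(csv.reader(text_file))
    return result


rows_by_name = read_csv_members(build_sample_archive())
for name, rows in rows_by_name.items():
    print(name, rows)
```

With a real file on disk, tarfile.open("tar-file.tar.gz") could replace the in-memory buffer. Whether this is preferable to io.StringIO likely depends on how large the individual csv files are.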

Thanks Martin, that did the trick nicely. – Pobbel