2013-10-23 54 views
5

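Equivalent of R's read.table in Python

I'm trying to move some processing work from R to Python. In R, I use read.table() to read really messy CSV files, and it automatically splits the records into the correct format. For example: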

391788,"HP Deskjet 3050 scanner always seems to break","<p>I'm running a Windows 7 64 blah blah blah........ake this work permanently?</p> 

<p>Update: It might have something to do with my computer. It seems to work much better on another computer, windows 7 laptop. Not sure exactly what the deal is, but I'm still looking into it...</p> 
","windows-7 printer hp" 

is correctly split into 4 columns. A single record can span many lines, and there are commas all over the place. In R I just do this:

read.table(infile, header = FALSE, nrows=chunksize, sep=",", stringsAsFactors=FALSE) 

Is there anything in Python that can do this equally well?

Thanks!

Answers

3

You can use the csv module.

from csv import reader 

# newline="" lets the csv module handle line breaks inside quoted fields
with open("C:/text.txt", "r", newline="") as f: 
    csv_reader = reader(f, quotechar='"') 
    for row in csv_reader: 
        print(row) 

['391788', 'HP Deskjet 3050 scanner always seems to break', "<p>I'm running a Windows 7 64 blah blah blah........ake this work permanently?</p>\n\n<p>Update: It might have something to do with my computer. It seems to work much better on another computer, windows 7 laptop. Not sure exactly what the deal is, but I'm still looking into it...</p>\n", 'windows-7 printer hp'] 

Length of the output row = 4

+0

But this only returns strings. It doesn't infer the type of each column the way read.table does.
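If you stay with the csv module, one way around this is to convert fields yourself after parsing. The following is only a minimal sketch: it assumes the first column is always an integer ID, and both the `parse_rows` helper and the inline `SAMPLE` record are hypothetical, made up for illustration.

```python
import csv
import io

# Hypothetical sample record: a quoted field containing commas and line breaks.
SAMPLE = (
    '391788,"HP Deskjet 3050 scanner always seems to break",'
    '"<p>line one</p>\n\n<p>line, two</p>\n",'
    '"windows-7 printer hp"\n'
)


def parse_rows(handle):
    """Yield parsed rows with the first column converted to int.

    Assumption: column 0 always holds an integer ID; the other
    columns are left as strings.
    """
    for row in csv.reader(handle, quotechar='"'):
        yield [int(row[0])] + row[1:]


# newline="" keeps embedded line breaks exactly as they appear in the data.
rows = list(parse_rows(io.StringIO(SAMPLE, newline="")))
print(len(rows), len(rows[0]), type(rows[0][0]).__name__)
```

This keeps the dependency footprint small, but every conversion rule has to be written by hand, which is the part read.table does for you.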

2

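The pandas module also provides many R-like functions and data structures, including read_csv. The advantage here is that the data is read in as a pandas DataFrame, which is easier to manipulate than standard Python lists or dictionaries (especially if you are used to R). Here is an example: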

>>> from pandas import read_csv 
>>> ugly = read_csv("ugly.csv",header=None) 
>>> ugly 
     0            1 \ 
0 391788 HP Deskjet 3050 scanner always seems to break 

                2      3 
0 <p>I'm running a Windows 7 64 blah blah blah..... windows-7 printer hp
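To mirror the `nrows=chunksize` pattern from the R call in the question, `read_csv` also accepts a `chunksize` argument, which makes it return an iterator of DataFrames instead of a single one. The sketch below reads from an in-memory string so that it is self-contained; the `SAMPLE` data (including the second record) is invented for illustration, and a chunk size of 1 is used only to keep the example small.

```python
import io

import pandas as pd

# Hypothetical two-record sample; the first record spans several lines.
SAMPLE = (
    '391788,"HP Deskjet 3050 scanner always seems to break",'
    '"<p>first paragraph</p>\n\n<p>second, with commas</p>\n",'
    '"windows-7 printer hp"\n'
    '391789,"Another title","<p>body</p>","tag-a tag-b"\n'
)

# With chunksize set, read_csv yields DataFrames of up to that many records.
chunks = list(pd.read_csv(io.StringIO(SAMPLE), header=None, chunksize=1))

first = chunks[0]
print(len(chunks), first.shape)
print(first.dtypes)  # column types are inferred per column
```

Unlike the csv module, the column types are inferred, so the ID column comes back as an integer type rather than as strings.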