2017-06-15 25 views
1

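Reading improperly formatted CSV into pandas - unescaped quotes

I've inherited a few hundred CSVs that I want to import into pandas DataFrames. They're formatted like this: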

username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink 
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112 
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752 
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281 

To get one into a pandas DataFrame, I tried:

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True)

and got this error:

ParserError: Error tokenizing data. C error: Expected 10 fields in line 1, saw 11

I assumed this is because there is an unescaped quote in the field:

ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...

So I tried:

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE)

and got a new error (I assume because there is a ; in the field):

Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable. http:// tinyurl.com/n8ozeg5

ParserError: Error tokenizing data. C error: Expected 10 fields in line 2, saw 11

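I can't regenerate these CSV files. What I want to know is: how can I preprocess/fix them so that they are properly formatted (i.e., with the quotes inside fields escaped)? Or is there a way to read them into a DataFrame directly, even with the unescaped quotes?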
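One way to read such files directly can be sketched as follows. It assumes that the text column is the only one that may contain stray `;` or `"` characters, that every row has the 10 columns of the header shown above, and that no tweet spans more than one line. The helper names (`unquote`, `split_tweet_line`, `read_rows`) and the UTF-8 encoding are illustrative assumptions, not anything from the original files.

```python
N_LEFT = 4   # fields before the text: username, date, retweets, favorites
N_RIGHT = 5  # fields after the text: geo, mentions, hashtags, id, permalink


def unquote(field):
    """Drop one pair of wrapping double quotes, if present."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]
    return field


def split_tweet_line(line):
    """Split one data line into 10 fields; whatever is left in the middle is the text."""
    # Peel the first four fields off the left ...
    parts = line.strip().split(";", N_LEFT)
    head, rest = parts[:N_LEFT], parts[N_LEFT]
    # ... and the last five off the right, so any ';' or '"' in the text is untouched.
    text, *tail = rest.rsplit(";", N_RIGHT)
    return head + [unquote(text)] + [unquote(f) for f in tail]


def read_rows(path):
    """Return (header, rows) for one file; assumes UTF-8 and one tweet per line."""
    with open(path, encoding="utf-8") as f:
        header = f.readline().strip().split(";")
        rows = [split_tweet_line(line) for line in f if line.strip()]
    return header, rows
```

With `header, rows = read_rows(file)`, the frame could then be built with `pd.DataFrame(rows, columns=header)`. Every value arrives as a string, so dates and counts would still need converting afterwards.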

+0

Which versions of Python and pandas are you using? I got different results with Python 3.6.1 and pandas 0.19.2 –

+0

Python 3.5.3, pandas 0.20.2 - what happens for you? – Libby

+0

For this case I don't need every column, and adding 'usecols' solved my immediate problem. But it doesn't answer my actual question. Here is the line that worked: 'tweets = pd.read_csv(file, header=0, sep=';', parse_dates=True, quoting=csv.QUOTE_NONE, usecols=["date", "hashtags", "permalink"])' – Libby

Answers

-1

I would clean the data before reading it into pandas. Here is my solution to your current problem.

Edit:
This handles the ; inside double quotes (based on this answer):

import re

# Remove every ';' inside each span the pattern matches (quote to quote, with no ']' in between)
with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        o.write(re.sub(r'"[^]]*"', lambda m: m.group(0).replace(';', ''), line))
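Note that the character class `[^]]` in the pattern above excludes only `]`, so on a line without brackets a single match can run from the first quote to the last one and also swallow the delimiters between fields. A more conservative variant, sketched here under the assumption that a quoted field always opens with `"` right after a `;` and closes with `"` right before a `;`, doubles the inner quotes instead of deleting anything, which is the standard CSV way of escaping them. The names `QUOTED_FIELD` and `fix_line` are made up for this illustration.

```python
import re

# A quoted field: '"' right after a delimiter, then the shortest run of text
# that ends in a '"' followed by another delimiter.
QUOTED_FIELD = re.compile(r'(?<=;)"(.*?)"(?=;)')


def fix_line(line):
    """Double every quote found inside a quoted field; leave the ';' characters alone."""
    return QUOTED_FIELD.sub(
        lambda m: '"' + m.group(1).replace('"', '""') + '"',
        line,
    )


def fix_file(src, dst):
    """Write a repaired copy of src to dst (paths are placeholders; UTF-8 is assumed)."""
    with open(src, encoding="utf-8") as f, open(dst, "w", encoding="utf-8") as o:
        for line in f:
            o.write(fix_line(line))
```

Once the inner quotes are doubled, a `;` inside the text is protected by the surrounding quotes, so the repaired file should be readable with the default quoting and `sep=';'`. The sketch will still misjudge a tweet that itself contains `";`, since that sequence is indistinguishable from the end of the field.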

Original (replacing "; "):

# Remove every "; " (semicolon followed by a space) from each line
with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        o.write(line.replace("; ", ""))
+0

The ; in the tweets isn't always followed by a space, so this only works for some of them. For example ';2013-07-15 15:35;1;0;"@CongressionalPhotoADay 15 - Something beautiful: seen from the balcony of the Speaker at the U.S. Capitol;... http:// fb.me/2ZHDzR8XQ";;@CongressionalPhotoADay;;"356874563839201280";https://twitter.com/AustinScottGA08/status/356874563839201280' – Libby

+1

@Libby: In that case, use a regular expression like the one in https://stackoverflow.com/a/11096811/2204131. 're.sub('\"[^]]*\"', lambda x: x.group(0).replace(';', '\;'), lines)' will replace the ';' inside the quotes. – ramesh
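If the `;` inside quotes is escaped as `\;` rather than removed, the reader has to be told about the escape character. The following is a minimal sketch using the standard `csv` module; it swaps in a delimiter-anchored pattern (an assumption of this example, not the pattern from the comment) and the function names are hypothetical. Quoting is switched off, so the wrapping quotes stay in the field values.

```python
import csv
import re

# Assumed shape of a quoted field: '"' after a ';' up to a '"' before a ';'.
QUOTED_FIELD = re.compile(r'(?<=;)"(.*?)"(?=;)')


def escape_semicolons(line):
    """Put a backslash in front of each ';' that sits inside a quoted field."""
    return QUOTED_FIELD.sub(lambda m: m.group(0).replace(";", r"\;"), line)


def parse_escaped(lines):
    """Parse escaped lines with quoting disabled and '\\' as the escape character."""
    reader = csv.reader(
        lines, delimiter=";", quoting=csv.QUOTE_NONE, escapechar="\\"
    )
    return list(reader)
```

`pd.read_csv` also accepts `quoting=csv.QUOTE_NONE` and `escapechar='\\'`, so the same idea should carry over, though that combination has not been checked against these files. Any backslash already present in a tweet would be read as an escape as well.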