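Reading improperly formatted CSV into pandas - unescaped quotes

I inherited a few hundred CSV files that I want to import into pandas DataFrames. They are formatted like this: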
username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281
To read one into a pandas DataFrame, I tried:
tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True)
and got this error:
ParserError: Error tokenizing data. C error: Expected 10 fields in line 1, saw 11
I think this is because there is an unescaped quote inside this field:
ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...
So I tried:
tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE)
and got a new error (I assume because there is a ; inside the field):
Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable. http:// tinyurl.com/n8ozeg5
ParserError: Error tokenizing data. C error: Expected 10 fields in line 2, saw 11
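I cannot regenerate these CSV files. What I would like to know is: how can I preprocess or repair them so that they are properly formatted (i.e., with the quotes inside fields escaped)? Alternatively, is there a way to read them directly into a DataFrame even with the unescaped quotes?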
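
One possible way to preprocess the files, offered as a sketch rather than a definitive fix: because the header shows exactly 10 columns and only the text column appears to contain stray quotes or semicolons, each line can be split into 4 fields from the left and 5 from the right, with whatever remains in the middle treated as the text. The helper names `split_row` and `repair_csv` are hypothetical. The sketch assumes that no tweet spans several lines and that the trailing columns (geo, mentions, hashtags, id, permalink) never contain a semicolon.

```python
import csv
import io

N_LEFT = 4   # username, date, retweets, favorites
N_RIGHT = 5  # geo, mentions, hashtags, id, permalink


def split_row(line):
    """Split one raw line into 10 fields.

    Everything between the 4 leading and the 5 trailing fields is taken
    to be the free-text column, whatever quotes or semicolons it holds.
    """
    line = line.rstrip("\r\n")
    if line.count(";") < N_LEFT + N_RIGHT:
        raise ValueError("too few fields: %r" % line)
    *head, rest = line.split(";", N_LEFT)      # 4 fields + remainder
    text, *tail = rest.rsplit(";", N_RIGHT)    # text + 5 trailing fields
    # Drop the quotes that wrap the text field, keep the inner ones.
    if len(text) >= 2 and text[0] == text[-1] == '"':
        text = text[1:-1]
    tail = [field.strip('"') for field in tail]
    return head + [text] + tail


def repair_csv(src, dst):
    """Copy lines from src to dst, re-quoting each field correctly."""
    writer = csv.writer(dst, delimiter=";", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    for line in src:
        if line.strip():                       # skip blank lines
            writer.writerow(split_row(line))   # inner quotes get doubled


# Hypothetical usage on one file:
# with open("tweets.csv", encoding="utf-8") as src, \
#         open("tweets_fixed.csv", "w", encoding="utf-8", newline="") as dst:
#     repair_csv(src, dst)
# tweets = pd.read_csv("tweets_fixed.csv", sep=";", dtype={"id": str})
```

The header line passes through `split_row` unchanged because it also has 10 fields. Writing with `csv.writer` doubles the embedded quotes, which matches the default `doublequote=True` behaviour of `pd.read_csv`, so the repaired file should load without `quoting=csv.QUOTE_NONE`.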
Which versions of Python and pandas are you using? I get a different result with Python 3.6.1 and pandas 0.19.2 –
Python 3.5.3, pandas 0.20.2 - what happens for you? – Libby
For this case I don't need every column, and adding 'usecols' solved my immediate problem. But it doesn't answer my actual question. Here is the line that works: tweets = pd.read_csv(file, header=0, sep=';', parse_dates=True, quoting=csv.QUOTE_NONE, usecols=["date", "hashtags", "permalink"]) – Libby