Reading incorrectly formatted CSV into pandas - unescaped quotes

I inherited a few hundred CSVs that I want to import into a pandas DataFrame. They are formatted like this:

username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink 
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112 
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752 
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281

To read them into a pandas DataFrame, I tried:

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True)

and got this error:

ParserError: Error tokenizing data. C error: Expected 10 fields in line 1, saw 11

I assume this is because there is an unescaped quote inside the field:

ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...

So I tried:

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE)

and got a new error (I assume because there is a ; inside the field):

Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable. http:// tinyurl.com/n8ozeg5

ParserError: Error tokenizing data. C error: Expected 10 fields in line 2, saw 11
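
The two field counts can be reproduced without pandas. The sketch below is not part of the original post. It uses Python's csv module as a stand-in for the pandas tokenizer (an assumption: the two parsers are not guaranteed to agree in every case) and counts the fields found on each line, first with default quoting and then with csv.QUOTE_NONE. The names SAMPLE and field_counts are made up for illustration.

```python
import csv
import io

# Header plus the three tweets shown above, copied from the question.
SAMPLE = '''username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281
'''


def field_counts(text, quoting):
    """Return the number of fields csv.reader finds on each line."""
    reader = csv.reader(io.StringIO(text), delimiter=';', quoting=quoting)
    return [len(row) for row in reader]


# Default quoting: '"' opens and closes a field.
default_counts = field_counts(SAMPLE, csv.QUOTE_MINIMAL)
# QUOTE_NONE: '"' is an ordinary character, so every ';' splits.
no_quote_counts = field_counts(SAMPLE, csv.QUOTE_NONE)

print(default_counts)
print(no_quote_counts)
```

On these four lines the default-quoting counts should come out as 10, 11, 10, 10 and the QUOTE_NONE counts as 10, 11, 11, 10. That is consistent with the guesses above: the stray quote breaks the first tweet under default quoting, and the semicolons in the text break the first two tweets once quoting is switched off.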

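I cannot regenerate these CSV files. What I want to know is: how can I preprocess/repair them so that they are correctly formatted (i.e., with the quotes inside fields escaped)? Or is there a way to read them directly into a DataFrame even with the unescaped quotes?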

Source

2017-06-15 Libby

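Which versions of Python and pandas are you using? I get a different result with Python 3.6.1 and pandas 0.19.2. –

Python 3.5.3, pandas 0.20.2. What happens for you? – Libby

For this case I don't need every column, and adding 'usecols' solved my immediate problem, but it doesn't answer my actual question. Here is the line that works: 'tweets = pd.read_csv(file, header=0, sep=';', parse_dates=True, quoting=csv.QUOTE_NONE, usecols=["date", "hashtags", "permalink"])' – Libby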

-1

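I would clean the data before reading it into pandas. Here is my solution to your current problem.

Edit:
This replaces the ; inside double-quoted fields (based on this answer):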

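import re

# Assumes a quoted field starts right after a ';' and ends right before one.
quoted_field = re.compile(r'(?<=;)".*?"(?=;)')

with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        # Drop every ';' that falls inside a quoted field.
        o.write(quoted_field.sub(lambda m: m.group(0).replace(";", ""), line))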

Original version, which removes every "; " (a semicolon followed by a space):

with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        o.write(line.replace("; ", ""))

Source

2017-06-16 00:14:21 ramesh

The ; in the tweets is not always followed by a space, so this only works for some of them. For example: ';2013-07-15 15:35;1;0;"@CongressionalPhotoADay 15 - Something beautiful: seen from the Speaker’s balcony of the U.S. Capitol;... http:// fb.me/2ZHDzR8XQ";;@CongressionalPhotoADay;;"356874563839201280";https://twitter.com/AustinScottGA08/status/356874563839201280' – Libby

@Libby: In that case, use a regular expression like the one in https://stackoverflow.com/a/11096811/2204131. 're.sub('\"[^]]*\"', lambda x: x.group(0).replace(';', '\;'), lines)' will replace the ';' inside the quotes. – ramesh
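
A different way to get at the actual question (writing the files back out with the quotes properly escaped) is to split each line by position instead of by pattern. The sketch below is not from the thread. It assumes that only the text column (the 5th of 10) can contain semicolons or stray quotes, and that no tweet spans more than one line. Under those assumptions, the first four fields can be split off from the left and the last five from the right, leaving the text intact. The names N_BEFORE, N_AFTER, unquote, split_row and repair are hypothetical.

```python
import csv
import io

N_BEFORE = 4  # username, date, retweets, favorites
N_AFTER = 5   # geo, mentions, hashtags, id, permalink


def unquote(field):
    """Drop the pair of quotes wrapping a field; inner quotes are kept."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]
    return field


def split_row(line):
    """Split one raw line by position, keeping the text column whole."""
    parts = line.rstrip().split(';', N_BEFORE)
    head, rest = parts[:-1], parts[-1]
    # Assumption: every ';' left over after this split belongs to the text.
    tail = rest.rsplit(';', N_AFTER)
    return [unquote(f) for f in head + tail]


def repair(src, dst):
    """Copy src to dst, letting csv.writer quote and escape each field."""
    writer = csv.writer(dst, delimiter=';', lineterminator='\n')
    for line in src:
        if line.strip():
            writer.writerow(split_row(line))


# Small demonstration on the header and the first tweet from the question.
raw = io.StringIO(
    'username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink \n'
    ';2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common'
    ' goals of working for [the] next generation; that’s why our...";;;;'
    '"42993734165594112";'
    'https://twitter.com/AustinScottGA08/status/42993734165594112 \n'
)
fixed = io.StringIO()
repair(raw, fixed)

# Read the repaired text back with default quoting to check the field counts.
rows = list(csv.reader(io.StringIO(fixed.getvalue()), delimiter=';'))
print([len(row) for row in rows])
```

Because csv.writer doubles any quote inside a field and wraps fields that contain the delimiter, a file repaired this way should be readable with a plain pd.read_csv(path, sep=';') and default quoting. When working with real files rather than StringIO, open the output with newline='' as the csv documentation recommends. Rows that break the assumptions (for example a semicolon in the hashtags column) would still come out misaligned, so a field-count check after the repair is worthwhile.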
