Python: How to handle a CSV with no comma at the end of each line?


I have about 100 CSVs containing data from different sources, so they use different delimiters. Is there a Python library that can guess the structure of a CSV?
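For the delimiter-guessing part, the standard library's `csv.Sniffer` is one option worth trying. The sketch below is only an illustration: the helper name `sniff_and_read`, the candidate delimiter string, and the sample text are all made up for this example, and the sniffer is a heuristic that needs real line breaks in the input, so it will not rescue data that has already been flattened onto one line.

```python
import csv
import io

def sniff_and_read(text, candidates=",;\t|"):
    """Guess the delimiter of CSV text and return (delimiter, rows).

    `candidates` is an assumed set of delimiters to choose from.
    csv.Sniffer raises csv.Error if it cannot decide.
    """
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    reader = csv.reader(io.StringIO(text), dialect)
    return dialect.delimiter, list(reader)

# Hypothetical sample: the same table, but separated by semicolons
sample = "color;shape;avg weight\nred;square;15g\nblue;circle;11g\n"
delimiter, rows = sniff_and_read(sample)
print(repr(delimiter), rows)
```

In practice you would pass it the first few kilobytes of each file rather than the whole thing, and fall back to a default delimiter when `csv.Error` is raised.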

For example, someone had a table like this:

color, shape, avg weight, 
red, square, 15g, 
blue, circle, 11g,

and the CSV they saved looks like:

'color', 'shape', 'avg weight', 'red', 'square', '15g', 'blue', 'circle', '11g'

If I know the number of columns (which I work out using a function), I can build a list of lists and then turn it into a pandas DataFrame.
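A minimal sketch of that list-of-lists step, assuming the cells have already been extracted into one flat list and the column count is known. The helper name `chunk_rows` is invented for this example; it refuses input whose length is not a multiple of the column count rather than silently dropping cells.

```python
def chunk_rows(flat, ncols):
    """Split a flat list of cells into rows of `ncols` cells each."""
    if ncols <= 0:
        raise ValueError("ncols must be positive")
    if len(flat) % ncols:
        raise ValueError(
            f"{len(flat)} cells do not divide evenly into {ncols} columns"
        )
    return [flat[i:i + ncols] for i in range(0, len(flat), ncols)]

flat = ['color', 'shape', 'avg weight',
        'red', 'square', '15g',
        'blue', 'circle', '11g']

# First row is the header, the remaining rows are data
header, *rows = chunk_rows(flat, 3)
print(header)
print(rows)
```

From there, `pandas.DataFrame(rows, columns=header)` gives the DataFrame.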

However, many people have data with no comma at the end of the line, like this:

color, shape, avg weight 
red, square, 15g 
blue, circle, 11g

The CSV they send looks like:

'color', 'shape', 'avg weight' 'red', 'square', '15g' 'blue', 'circle', '11g'

It gets even worse when a value is missing, for example in avg weight:

color, shape, avg weight 
red, square, 
blue, circle, 11g

resulting in a CSV that looks like:

'color', 'shape', 'avg weight' '', 'square', '15g' 'blue', 'circle', '11g'

How should I handle this? Or is there a library I could explore?


2017-05-25 user1367204

Fix your data. You need a consistent structure, or writing a parser will be next to impossible. – gravity

That is not an option for me. – user1367204

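If you can at least rely on the quotes, this approach may work. The idea is to match the quoted expressions with a regular expression and then use what we know about the number of columns to shape the DataFrame. If you don't know the number of columns in advance and you can't rely on the quotes, I don't think there is a sound way to reconstruct data that has lost its line breaks.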

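import re
import pandas

# Flattened CSV: quoted cells, no line breaks
s = "'color', 'shape', 'avg weight' '', 'square', '15g' 'blue', 'circle', '11g'"

Ncols = 3  # number of columns, known in advance
r = re.compile("'([^']*)'")  # capture the text inside each pair of single quotes
items = r.findall(s)  # flat list of every cell, header cells first
# Slice the flat list into rows of Ncols cells; a trailing partial row is dropped
table = [items[i*Ncols:i*Ncols+Ncols] for i in range(len(items)//Ncols)]

# The first row holds the column names, the rest is data
df = pandas.DataFrame(table[1:], columns=table[0])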
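As a variation that is not part of the approach above: for the files without a trailing comma, the row boundary is still visible in the flattened string, because the last cell of a row is the only one not followed by a comma. Under that assumption (and still assuming the quotes are reliable), the column count does not need to be known up front. The helper name `split_rows` is made up for this sketch, and it does not apply to files where every line ends with a comma.

```python
import re

# A quoted cell, optional whitespace, then an optional comma
CELL = re.compile(r"'([^']*)'\s*(,?)")

def split_rows(text):
    """Rebuild rows from flattened, quoted CSV text with no trailing commas."""
    rows, current = [], []
    for value, comma in CELL.findall(text):
        current.append(value)
        if not comma:
            # No comma after this cell: treat it as the end of a row
            rows.append(current)
            current = []
    if current:
        rows.append(current)
    return rows

s = "'color', 'shape', 'avg weight' '', 'square', '15g' 'blue', 'circle', '11g'"
rows = split_rows(s)
print(rows)
```

It is worth checking that every row comes out the same length before building a DataFrame from the result, since a stray or missing comma in the source would shift the boundaries.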


2017-05-25 18:21:01 chthonicdaemon
