2017-05-23 83 views
0

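Handling strings that contain commas in CSV

I am using this nicely formatted dataset to practice some text mining with Python: https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

However, some entries, such as: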

6898,"RAAF Williams, Laverton Base","Laverton","Australia",\N,"YLVT",-37.86360168457031,144.74600219726562,18,10,"O","Australia/Hobart","airport","OurAirports" 
6899,"Nowra Airport","Nowra","Australia","NOA","YSNW",-34.94889831542969,150.53700256347656,400,10,"O","Australia/Sydney","airport","OurAirports" 

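have a comma inside their name. This makes the resulting lists irregular, because a single field (the name) gets split into multiple elements.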

I split each line into a list with this code:

with open(filename) as txt:
    for line in txt:
        linea = line.split(',')
        linea[3] = linea[3].strip('"')

My main problem is that linea[3] should be the country (Australia in this case), but it returns Laverton.
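The symptom can be reproduced with a minimal sketch that hard-codes the first sample row shown above. The variable names (`line`, `naive_fields`, `naive_country`) are illustrative and not taken from the original code:

```python
# First sample row from the question; a raw string keeps \N as two literal characters.
line = (
    r'6898,"RAAF Williams, Laverton Base","Laverton","Australia",\N,"YLVT",'
    r'-37.86360168457031,144.74600219726562,18,10,"O","Australia/Hobart",'
    r'"airport","OurAirports"'
)

# str.split knows nothing about quoting, so the comma inside the name
# is treated as a field separator as well.
naive_fields = line.split(',')

print(len(naive_fields))   # 15 instead of the expected 14
print(naive_fields[1])     # "RAAF Williams
print(naive_fields[2])     #  Laverton Base"

# Every later field is shifted one position to the right,
# so index 3 now holds the city rather than the country.
naive_country = naive_fields[3].strip('"')
print(naive_country)       # Laverton
```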

I also tried the csv library, with almost no difference.

Related to this: my code returns the following for that entry:

['6898', 'RAAF Williams, Laverton Base', 'Laverton', 'Australia', '\\N', 'YLVT', '-37.86360168457031', '144.74600219726562', '18', '10', 'O', 'Australia/Hobart', 'airport', 'OurAirports'] 

Did you try pandas read_csv? 'split(',')' is simply not the right tool here –


Your output does not match your problem description: 'Australia' is at index 3, just as you want. – timgeb

Answers

0

If you can switch to another package, you can read the file using pandas:

import pandas as pd
# header=None because airports.dat has no header row; columns are labelled 0..13
df = pd.read_csv(filename, sep=',', header=None)

print(df)

    0        1   2   3 4  5   6   7 8 9 10    11  12    13 
0 6898 RAAF Williams, Laverton Base Laverton Australia \N YLVT -37.863602 144.746002 18 10 O Australia/Hobart airport OurAirports 
1 6899     Nowra Airport  Nowra Australia NOA YSNW -34.948898 150.537003 400 10 O Australia/Sydney airport  OurAirports 

# this line gives you the rows as a NumPy array, close to the list of lists you get from the csv package
# (to_numpy() replaces as_matrix(), which was removed in pandas 1.0)
df.to_numpy()

[[6898 'RAAF Williams, Laverton Base' 'Laverton' 'Australia' '\\N' 'YLVT' 
    -37.86360168457031 144.74600219726562 18 10 'O' 'Australia/Hobart' 
    'airport' 'OurAirports '] 
[6899 'Nowra Airport' 'Nowra' 'Australia' 'NOA' 'YSNW' -34.948898315429695 
    150.53700256347656 400 10 'O' 'Australia/Sydney' 'airport' 'OurAirports']] 
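As a self-contained sketch of the same idea, the two sample rows from the question can be fed to pandas from memory. It assumes pandas is installed; `SAMPLE`, `countries` and `rows` are illustrative names, and `header=None` is an assumption based on the integer column labels in the output above:

```python
import io

import pandas as pd

# The two sample rows from the question; raw strings keep \N literal.
SAMPLE = "\n".join([
    r'6898,"RAAF Williams, Laverton Base","Laverton","Australia",\N,"YLVT",'
    r'-37.86360168457031,144.74600219726562,18,10,"O","Australia/Hobart",'
    r'"airport","OurAirports"',
    r'6899,"Nowra Airport","Nowra","Australia","NOA","YSNW",'
    r'-34.94889831542969,150.53700256347656,400,10,"O","Australia/Sydney",'
    r'"airport","OurAirports"',
])

# read_csv honours double quotes by default, so the comma inside the
# airport name stays within a single column.
df = pd.read_csv(io.StringIO(SAMPLE), header=None)

countries = df[3].tolist()   # column 3 is the country
rows = df.values.tolist()    # plain list of lists, one per row

print(df.shape)      # (2, 14)
print(countries)     # ['Australia', 'Australia']
print(rows[0][1])    # RAAF Williams, Laverton Base
```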
2

Python has supported CSV parsing for a long time. Refer to this link.

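You need to use quotechar in the parser. Basically, any comma between a pair of quote characters is treated as part of the field rather than as a delimiter.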

For example:

import csv
# newline='' lets the csv module handle line endings itself
with open(filename, newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        # do something with the row
        print(row)
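To check the behaviour without the file, here is a minimal sketch that runs the same reader over the two sample rows from the question. `LINES`, `rows` and `countries` are illustrative names; `csv.reader` accepts any iterable of strings, so a plain list stands in for the open file:

```python
import csv

# The two sample rows from the question; raw strings keep \N literal.
LINES = [
    r'6898,"RAAF Williams, Laverton Base","Laverton","Australia",\N,"YLVT",'
    r'-37.86360168457031,144.74600219726562,18,10,"O","Australia/Hobart",'
    r'"airport","OurAirports"',
    r'6899,"Nowra Airport","Nowra","Australia","NOA","YSNW",'
    r'-34.94889831542969,150.53700256347656,400,10,"O","Australia/Sydney",'
    r'"airport","OurAirports"',
]

# Commas inside the quotechar pair stay part of the field.
rows = list(csv.reader(LINES, delimiter=',', quotechar='"'))

countries = [row[3] for row in rows]

print([len(row) for row in rows])   # [14, 14]
print(rows[0][1])                   # RAAF Williams, Laverton Base
print(countries)                    # ['Australia', 'Australia']
```

Both rows come back with 14 fields and the country at index 3, which matches the list shown in the question.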

I tried that exact code, with the same result – mejillonius


I tried it on the airports.dat file you posted. It works fine. Are you sure you used the *exact code*? – Nosh