2015-10-07 71 views
1

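Python 3 - getting a variable into a dictionary

I am trying (so far without success) to get the output of the print command below into a dictionary, so that I can then export it as CSV.

How can I get parseddata (whose printed output is shown below) into a dictionary?

Sample input file: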

<html> 
<body> 
<p>{ success:true ,results:3,rows:[{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"17-Feb-2015 14:57",SeqNumber:"126171"}]}</p>
</body> 
</html> 

My code:

import requests 
import re 
from bs4 import BeautifulSoup 
url = requests.get("http://. . .") 
soup = BeautifulSoup(url.text, "lxml") 
parseddata = soup.string.split(':[', 1)[1].lstrip(']') 
print(parseddata) 

The output of print(parseddata) is:

{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"17-Feb-2015 14:57",SeqNumber:"126171"}]} 
+0

But what does 'parseddata' look like? – yurib

+0

yurib, I have edited the post to show what parseddata looks like. Thanks –

+0

@zs_python: can you provide a sample input file to work with, so that people can run test cases? –

Answers

0

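This looks like a key-value mapping, with key ISIN and value "INE134E01011". But it is not JSON, because the keys are not quoted, and it is not YAML either, because plain scalar keys (i.e. strings without quotes) have to be followed by colon + space.

If you break up the output string into parts¹: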

test_str = (
    '{ISIN:"INE134E01011",Ind:"-",' 
    'Audited:"Un-Audited",' 
    'Cumulative:"Non-cumulative",' 
    'Consolidated:"Non-Consolidated",' 
    'FilingDate:"14-Aug-2015 15:39",' 
    'SeqNumber:"1001577"},' 
    '{ISIN:"INE134E01011",' # new mapping starts 
    'Ind:"-",' 
    'Audited:"Un-Audited",' 
    'Cumulative:"Non-cumulative",' 
    'Consolidated:"Non-Consolidated",' 
    'FilingDate:"30-May-2015 14:37",' 
    'SeqNumber:"129901"},' 
    '{ISIN:"INE134E01011",' # new mapping starts 
    'Ind:"-",' 
    'Audited:"Un-Audited",' 
    'Cumulative:"Non-cumulative",' 
    'Consolidated:"Non-Consolidated",' 
    'FilingDate:"17-Feb-2015 14:57",' 
    'SeqNumber:"126171"}]}' 
) 

and test that it is equal to your input:

test_org = '{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"17-Feb-2015 14:57",SeqNumber:"126171"}]}' 
assert test_str == test_org 

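Splitting it up like this makes clear that there are actually 3 mappings, followed by a trailing ]}. The ] indicates that there is a list, which is consistent with the 3 mappings being separated by commas. The matching [ is missing because splitting on ':[' consumes it (the lstrip(']') that follows removes nothing, since the remaining string starts with {).

You can easily manipulate the string so that YAML can parse it, but the result is a list²: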
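The effect of the split and lstrip calls can be checked on a small string. The following is only a sketch: the payload is a trimmed-down stand-in modeled on the sample input, and the slicing at the end is one possible way to keep both brackets, not something taken from the question.

```python
# Trimmed-down payload modeled on the sample input (two short rows).
payload = (
    '{ success:true ,results:2,rows:['
    '{ISIN:"INE134E01011",SeqNumber:"1001577"},'
    '{ISIN:"INE134E01011",SeqNumber:"129901"}]}'
)

# What the question's code does: split() consumes the ':[' separator,
# and lstrip(']') is a no-op because the remainder starts with '{'.
tail = payload.split(':[', 1)[1]
unchanged = tail.lstrip(']') == tail

# Alternative: keep both brackets by slicing from the first '[' to the last ']'.
rows_text = payload[payload.index('['):payload.rindex(']') + 1]

print(unchanged)
print(rows_text)
```

With the brackets kept, the text is already a complete list and only the key syntax remains to be dealt with.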

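from ruamel.yaml import YAML

# Re-add the leading '[', put a space after each key's colon, drop the trailing '}'
test_str = '[' + test_str.replace(':"', ': "').rstrip('}')

yaml = YAML(typ='safe')  # the module-level ruamel.yaml.load() is not available in recent releases
data = yaml.load(test_str)
print(type(data))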
print(type(data)) 

This prints:

<class 'list'> 

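And since the list consists of dictionaries with common keys, you cannot just combine them without losing information.

You can either map this list to some key (the colon in your split and the trailing } in the output indicate that the list was the value of a key in the original input), or you can take the key with unique values (SeqNumber) and promote its value to be the key in a dictionary that replaces the list: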

ddata = {} 
for elem in data: 
    k = elem.pop('SeqNumber') 
    ddata[k] = elem 

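However, I don't see a reason to go from the list to a dictionary if your final goal is a CSV file. If you start from the output of the YAML parser, you can do: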

import csv 
with open('output.csv', 'w', newline='') as fp: 
    csvwriter = csv.writer(fp) 
    csvwriter.writerow(data[0].keys()) # header of common dict keys 
    for elem in data: 
        csvwriter.writerow(elem.values()) # values

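to get a CSV file with the following content (SeqNumber is missing here because the earlier loop popped it from every row, and the column order depends on dict ordering in your Python version):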

ISIN,Ind,Consolidated,Cumulative,Audited,FilingDate 
INE134E01011,-,Non-Consolidated,Non-cumulative,Un-Audited,14-Aug-2015 15:39 
INE134E01011,-,Non-Consolidated,Non-cumulative,Un-Audited,30-May-2015 14:37 
INE134E01011,-,Non-Consolidated,Non-cumulative,Un-Audited,17-Feb-2015 14:57 
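If the column order matters, or SeqNumber should stay in the file, csv.DictWriter with an explicit field list is one option. This is a minimal sketch, not the approach used above: the rows are hand-written stand-ins for the parsed list (a subset of the fields from the sample input), and io.StringIO stands in for the output file.

```python
import csv
import io

# Rows shaped like the parsed list (values copied from the sample input,
# with only some of the fields to keep the example short).
data = [
    {"ISIN": "INE134E01011", "Ind": "-", "Audited": "Un-Audited",
     "FilingDate": "14-Aug-2015 15:39", "SeqNumber": "1001577"},
    {"ISIN": "INE134E01011", "Ind": "-", "Audited": "Un-Audited",
     "FilingDate": "30-May-2015 14:37", "SeqNumber": "129901"},
]

# Explicit column order, so the header does not depend on dict ordering.
fieldnames = ["SeqNumber", "ISIN", "Ind", "Audited", "FilingDate"]

buffer = io.StringIO()  # stands in for open('output.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(data)

csv_text = buffer.getvalue()
print(csv_text)
```

Because nothing is popped from the rows, the list stays intact and can still be used afterwards.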

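¹ Instead of escaping the newlines with \, I use parentheses to spread the definition of a single string over multiple lines, which makes it easier to put comments on individual lines
² Instead of re-adding the '[', you should of course not strip it in the first place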

+0

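Thanks Anthon, this is perfect and did the job for me! I really appreciate all the effort you put in, and that you explained it to me as well. Thanks @ShadowRanger, your efforts have added to my Python learning and were very helpful too. This noob is overwhelmed by the effort you guys put in to help me learn. Thank you! –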

+0

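@zs_python If this solved your problem, please consider accepting the answer (by clicking the check mark next to the top of this answer). That indicates to others that your problem is solved (they might not read all the way down to your comment) and marks it as such in the database. – Anthon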

+0

Thanks @anthon for the hand-holding, I have accepted the answer as guided. See you guys soon :) –

2

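Aside from the stray closing bracket/brace, this is valid YAML (my original answer assumed it was valid JSON; JavaScript objects can be declared without quoting the property names, but the portable JSON format does not allow that, whereas YAML does).

Follow the instructions here to parse the data with PyYAML. The manual split-ing and lstrip are hurting you, making this harder than it needs to be. Just get the text, then parse it with yaml (a third-party module that must be installed separately):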

import requests 
import yaml 
from bs4 import BeautifulSoup 

url = requests.get("http://. . .") 
soup = BeautifulSoup(url.text, "lxml") 
# Use safe_load over load to avoid opening security holes; YAML can do 
# a lot of unsafe things if the input isn't trusted, but handling JS 
# object literals can be done safely with safe_load 
response_object = yaml.safe_load(soup.string.strip()) 
data_rows = response_object['rows'] 

for row in data_rows:
    print(row)  # replace with whatever you need to do with each returned row

You can read more in the PyYAML tutorial
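For readers who would rather stay within the standard library, another option is to quote the bare keys with a regular expression and hand the result to json.loads. This is a swapped-in technique, not what the answer above does, and it avoids depending on how a YAML parser treats a colon with no space after it. It is only a sketch under stated assumptions: the payload is a trimmed-down stand-in modeled on the sample input, quote_keys is a hypothetical helper, and the pattern assumes that keys are simple identifiers and that no string value contains a comma or brace directly followed by a word and a colon.

```python
import json
import re

# Trimmed-down payload modeled on the sample input (fewer fields and rows).
payload = (
    '{ success:true ,results:2,rows:['
    '{ISIN:"INE134E01011",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},'
    '{ISIN:"INE134E01011",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"}]}'
)

# A bare key is an identifier that follows '{' or ',' and precedes ':'.
BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)\s*:')


def quote_keys(text):
    """Wrap unquoted object keys in double quotes (hypothetical helper)."""
    return BARE_KEY.sub(r'\1"\2":', text)


response_object = json.loads(quote_keys(payload))
data_rows = response_object['rows']

for row in data_rows:
    print(row['SeqNumber'], row['FilingDate'])
```

The times inside FilingDate are left alone because their colon is not preceded by a brace or comma, but a value that happens to contain such a sequence would be corrupted, so the result is worth checking against real responses.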

+0

Thanks ShadowRanger, I guess the "stray closing brace/bracket at the end" is the problem. How can I get rid of it, please? –

+1

@zs_python: I anticipated that and added an example before you asked. :-) – ShadowRanger

+2

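Odds are the original data is valid 'json', and the object you're interested in is just the only entry in the array property of an object with a single property (holding a one-element array). You could probably just 'json.loads' the whole thing, then access and assign 'data_as_dict = whole_thing_as_dict['name_of_singleton_key'][0]' and avoid the explicit 'split' and 'lstrip'. – ShadowRanger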