2013-10-13

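Parsing XML in Python

I am parsing the following XML: http://charts.realclearpolitics.com/charts/1044.xml. I want to display the result in a DataFrame with three columns: Date, Approve, Disapprove. The XML file is dynamic, because a new date is added every day, so the code should account for that. I have implemented a static solution, in which I have to hard-code the row numbers of the value tags in each loop. I would like to understand how to do this dynamically.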

import numpy as np 
import pandas as pd 
import requests 
from pattern import web 

xml = requests.get('http://charts.realclearpolitics.com/charts/1044.xml').text 
dom = web.Element(xml) 
values = dom.by_tag('value') 

date = [] 
approve = [] 
disapprove = [] 

# The range stops at 1720 rather than 1727 because the last 7 values of the Approve and Disapprove tags are blank.
for i in range(0, 1720):
    date.append(pd.to_datetime(values[i].content))

# The range stops at 3447 rather than 3454 because the last 7 values are blank; going up to 3454 raises an error when converting to float.
for i in range(1727, 3447):
    approve.append(float(values[i].content))

# The range stops at 5174 rather than 5181 because the last 7 values are blank.
for i in range(3454, 5174):
    disapprove.append(float(values[i].content))

finalresult = pd.DataFrame({'date': date, 'Approve': approve, 'Disapprove': disapprove}) 
finalresult 

lxml has XPath support, which seems to be what you want. You can then fetch the elements with XPath expressions, however many of them there are.

Answers

Here is one way to do it with lxml and XPath:

from lxml import etree
import pandas as pd

# Parse the feed directly from its URL.
tree = etree.parse("http://charts.realclearpolitics.com/charts/1044.xml")

# XPath selects every matching <value>, however many the file holds.
date = [s.text for s in tree.xpath("series/value")]
# Blank <value> elements have no text; record them as 0.0.
approve = [float(s.text) if s.text else 0.0
           for s in tree.xpath("graphs/graph[@title='Approve']/value")]
disapprove = [float(s.text) if s.text else 0.0
              for s in tree.xpath("graphs/graph[@title='Disapprove']/value")]

# All three columns must have the same length.
assert len(date) == len(approve) == len(disapprove)

finalresult = pd.DataFrame({'Date': date, 'Approve': approve, 'Disapprove': disapprove})
print(finalresult)

Output:

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 1727 entries, 0 to 1726 
Data columns (total 3 columns): 
Date   1727 non-null values 
Approve  1727 non-null values 
Disapprove 1727 non-null values 
dtypes: float64(2), object(1) 
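
A possible variation, not from the original answer: filling blanks with 0.0 keeps sum() working, but it pulls averages down. The sketch below drops the rows whose readings are blank instead. It uses the standard library's xml.etree.ElementTree rather than lxml, and it parses a small inline sample instead of the live feed. The sample's root element name, dates, and numbers are made up; only the series/value and graphs/graph[@title]/value layout is taken from the XPath expressions above. The helper names to_float and parse_chart are hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical miniature of the feed. Only the element layout follows the
# XPath expressions in the answer; root name, dates, and numbers are invented.
SAMPLE_XML = """\
<chart>
  <series>
    <value>1/27/2009</value>
    <value>1/28/2009</value>
    <value>1/29/2009</value>
  </series>
  <graphs>
    <graph title="Approve">
      <value>63.3</value>
      <value>63.5</value>
      <value></value>
    </graph>
    <graph title="Disapprove">
      <value>20.0</value>
      <value>19.3</value>
      <value></value>
    </graph>
  </graphs>
</chart>
"""


def to_float(text):
    """Convert element text to float; blank or missing text becomes NaN."""
    return float(text) if text and text.strip() else float("nan")


def parse_chart(xml_text):
    """Return (date, approve, disapprove) tuples, skipping blank readings."""
    root = ET.fromstring(xml_text)
    dates = [v.text for v in root.findall("series/value")]
    approve = [to_float(v.text)
               for v in root.findall("graphs/graph[@title='Approve']/value")]
    disapprove = [to_float(v.text)
                  for v in root.findall("graphs/graph[@title='Disapprove']/value")]
    # The three columns are expected to line up one-to-one.
    assert len(dates) == len(approve) == len(disapprove)
    return [(d, a, dis) for d, a, dis in zip(dates, approve, disapprove)
            if not (math.isnan(a) or math.isnan(dis))]


rows = parse_chart(SAMPLE_XML)
print(rows)
```

If pandas is available, pd.DataFrame(rows, columns=['Date', 'Approve', 'Disapprove']) turns these rows into the three-column frame from the question. Nothing here depends on how many value elements the file contains.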

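Thanks for the code. It parses well, and there are 1720 non-null values. But the result contains 7 'None' values at the end, which makes operations like finalresult.Approve.sum() impossible? – PronojitS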

I have updated the answer. – mzjn