2017-01-27 23 views

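Parsing and extracting data into a pandas DataFrame: BeautifulSoup and XML

I hope you can help me. I need to write a function that parses XML text and extracts the data into a pandas DataFrame: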

"""
Function
--------
rcp_poll_data

Extract poll information from an XML string, and convert to a DataFrame 

Parameters 
---------- 
xml : str 
    A string, containing the XML data from a page like 
    get_poll_xml(1044) 

Returns 
------- 
A pandas DataFrame with the following columns: 
    date: The date for each entry 
    title_n: The data value for the gid=n graph (take the column name from the `title` tag) 

This DataFrame should be sorted by date 

Example 
------- 
Consider the following simple xml page: 

<chart> 
<series> 
<value xid="0">1/27/2009</value> 
<value xid="1">1/28/2009</value> 
</series> 
<graphs> 
<graph gid="1" color="#000000" balloon_color="#000000" title="Approve"> 
<value xid="0">63.3</value> 
<value xid="1">63.3</value> 
</graph> 
<graph gid="2" color="#FF0000" balloon_color="#FF0000" title="Disapprove"> 
<value xid="0">20.0</value> 
<value xid="1">20.0</value> 
</graph> 
</graphs> 
</chart> 

Given this string, rcp_poll_data should return 
result = pd.DataFrame({'date': pd.to_datetime(['1/27/2009', '1/28/2009']), 
         'Approve': [63.3, 63.3], 'Disapprove': [20.0, 20.0]}) 

My code:

def rcp_poll_data(xml):
    soup = BeautifulSoup(xml,'xml')
    dates=soup.find("series")
    datesval=soup.findChildren(string=True)
    del datesval[-7:]
    obama=soup.find("graph",gid="1")
    obamaval={"title":obama["title"],"color":obama["color"]}
    romney=soup.find("graph",gid="2")
    romneyval={"title":romney["title"],"color":romney["color"]}
    result = pd.DataFrame({'date': pd.to_datetime(datesval,errors="ignore"), 'GID1':obamaval, 'GID2':romneyval})
    return result

"""

But when I run the program, I always get this error: "Mixing dicts with non-Series may lead to ambiguous ordering."

Please help! PS: the get_poll_xml function looks like this:

def get_poll_xml(poll_id):
    url="http://charts.realclearpolitics.com/charts/"+str(poll_id)+".xml"
    return requests.get(url).content

For example, poll_id = 1044.
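
The error quoted in the question can be reproduced without any XML. In the code above, obamaval and romneyval are dicts of graph attributes (title, color) rather than lists of poll values, and pandas refuses to build a DataFrame from a dict column next to a plain list-like column because it cannot tell how to align them. A minimal sketch, assuming only pandas is installed; the names dates and graph_attrs are made up for illustration:

```python
import pandas as pd

# A plain list: pandas would index it by position (0, 1)
dates = ["1/27/2009", "1/28/2009"]

# A dict: pandas would index it by its keys ("title", "color")
graph_attrs = {"title": "Approve", "color": "#000000"}

try:
    pd.DataFrame({"date": dates, "GID1": graph_attrs})
    message = None
except ValueError as exc:
    # pandas cannot reconcile positional and key-based alignment
    message = str(exc)

print(message)
```

The columns need to be sequences of the same length as the dates, for example the text of each <value> element under a graph, not the graph's attributes.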

Answer


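Consider using the built-in xml.etree.ElementTree rather than BeautifulSoup (which is better suited to HTML web scraping) to parse the XML. It provides methods such as iterfind, findall, and find that locate child nodes with XPath expressions, including predicates such as @gid='1'. And since the <value> elements under the <series> and <graph> parent tags have the same length, you can loop over them together with zip():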

import requests 
import pandas as pd 
import xml.etree.ElementTree as et 

def get_poll_xml(poll_id): 
    url="http://charts.realclearpolitics.com/charts/{}.xml".format(poll_id) 
    return requests.get(url).content 

def rcp_poll_data(xml):
    # Parse the XML string (or bytes) into an element tree
    tree = et.fromstring(xml)

    dates = []; graphlist1 = []; graphlist2 = []

    # Column names come from each graph's title attribute
    g1title = tree.find("./graphs/graph[@gid='1']").get('title')
    g2title = tree.find("./graphs/graph[@gid='2']").get('title')

    # The three <value> sequences have the same length, so walk them in step
    for s, g1, g2 in zip(tree.iterfind("./series/value"),
                         tree.iterfind("./graphs/graph[@gid='1']/value"),
                         tree.iterfind("./graphs/graph[@gid='2']/value")):
        dates.append(s.text)
        graphlist1.append(g1.text)
        graphlist2.append(g2.text)

    return pd.DataFrame({'Date': pd.to_datetime(dates, errors="ignore"),
                         g1title: graphlist1,
                         g2title: graphlist2})

poll_id = 1044 
xml_str = get_poll_xml(poll_id) 
df = rcp_poll_data(xml_str) 
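
To try the XPath lookups used in rcp_poll_data without a network request, the sample XML from the question can be parsed directly. This is only a sketch using the standard library; sample_xml, title, and rows are names chosen here for illustration:

```python
import xml.etree.ElementTree as et

# Sample document copied from the question's docstring
sample_xml = """<chart>
<series>
<value xid="0">1/27/2009</value>
<value xid="1">1/28/2009</value>
</series>
<graphs>
<graph gid="1" color="#000000" balloon_color="#000000" title="Approve">
<value xid="0">63.3</value>
<value xid="1">63.3</value>
</graph>
<graph gid="2" color="#FF0000" balloon_color="#FF0000" title="Disapprove">
<value xid="0">20.0</value>
<value xid="1">20.0</value>
</graph>
</graphs>
</chart>"""

root = et.fromstring(sample_xml)

# find() returns the first match; [@gid='2'] filters on the attribute value
title = root.find("./graphs/graph[@gid='2']").get("title")

# findall() returns every match as a list, in document order
dates = [v.text for v in root.findall("./series/value")]
values = [v.text for v in root.findall("./graphs/graph[@gid='2']/value")]

# Pair each date with the value at the same position
rows = list(zip(dates, values))
print(title, rows)
```

Note that .text always yields strings (or None for an empty element), so the poll values remain text until they are converted explicitly.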

Output for poll_id = 1044:

print(df.head(20)) 

# Approve  Date Disapprove 
# 0  63.3 2009-01-27  20.0 
# 1  63.3 2009-01-28  20.0 
# 2  63.5 2009-01-29  19.3 
# 3  63.5 2009-01-30  19.3 
# 4  61.8 2009-01-31  19.4 
# 5  61.8 2009-02-01  19.4 
# 6  61.8 2009-02-02  19.4 
# 7  61.8 2009-02-03  19.4 
# 8  61.8 2009-02-04  19.4 
# 9  61.8 2009-02-05  19.4 
# 10 61.6 2009-02-06  21.4 
# 11 61.6 2009-02-07  21.4 
# 12 61.6 2009-02-08  21.4 
# 13 65.4 2009-02-09  22.6 
# 14 65.4 2009-02-10  22.6 
# 15 64.2 2009-02-11  23.3 
# 16 64.2 2009-02-12  23.3 
# 17 64.2 2009-02-13  23.3 
# 18 64.8 2009-02-14  25.4 
# 19 65.5 2009-02-15  25.5 
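
The function above hard-codes two graphs and leaves the poll values as strings. A possible variant that follows the docstring in the question more closely (a lowercase date column, numeric values, rows sorted by date, one column per graph) is sketched below. The name rcp_poll_data_general is hypothetical, and the sketch assumes every graph has as many <value> elements as <series> and that graph titles are unique. The sample reuses two rows from the output above, deliberately placed out of order to show the sort.

```python
import xml.etree.ElementTree as et
import pandas as pd

def rcp_poll_data_general(xml):
    """Hypothetical variant: one numeric column per <graph>, sorted by date."""
    root = et.fromstring(xml)
    date_texts = [v.text for v in root.iterfind("./series/value")]
    data = {"date": pd.to_datetime(date_texts)}
    for graph in root.iterfind("./graphs/graph"):
        texts = [v.text for v in graph.iterfind("value")]
        # Empty <value> elements have text None; coerce turns them into NaN
        data[graph.get("title")] = pd.to_numeric(texts, errors="coerce")
    return pd.DataFrame(data).sort_values("date").reset_index(drop=True)

# Two rows taken from the output above, listed in reverse date order
sample_xml = """<chart>
<series>
<value xid="0">1/29/2009</value>
<value xid="1">1/28/2009</value>
</series>
<graphs>
<graph gid="1" title="Approve">
<value xid="0">63.5</value>
<value xid="1">63.3</value>
</graph>
<graph gid="2" title="Disapprove">
<value xid="0">19.3</value>
<value xid="1">20.0</value>
</graph>
</graphs>
</chart>"""

df = rcp_poll_data_general(sample_xml)
print(df)
```

Looping over ./graphs/graph avoids naming each gid, so the same function should handle charts with more than two graphs, under the assumptions stated above.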

Wow, thank you so much. I didn't know about xml.etree.ElementTree, thanks for pointing it out to me!