Python - Web Scraping - BeautifulSoup

I am new to BeautifulSoup and am trying to extract data from the following website: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city

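I am trying to extract the summary percentage for each category (food, housing, clothes, transportation, personal care, and entertainment). So for the link above, I would like to extract the percentages: 48%, 129%, 63%, 43%, 42%, 42%, and 72%.

Unfortunately, my current Python code using BeautifulSoup extracts the following percentages instead: 12%, 85%, 63%, 21%, 42%, and 48%. I do not know why this happens. Any help would be greatly appreciated! Here is my code: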

import urllib2 
from bs4 import BeautifulSoup 
url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city" 
page = urllib2.urlopen(url) 
soup_expatistan = BeautifulSoup(page) 
page.close() 

expatistan_table = soup_expatistan.find("table",class_="comparison") 
expatistan_titles = expatistan_table.find_all("tr",class_="expandable") 

for expatistan_title in expatistan_titles: 
    published_date = expatistan_title.find("th",class_="percent") 
    print(published_date.span.string)


2014-05-03 user3599514

I could not pin down the exact cause, but it appears to be a problem with urllib2. Simply switching to requests made it work. Here is the code:

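import requests
from bs4 import BeautifulSoup

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
# Fetch the page with requests instead of urllib2
page = requests.get(url).text
# Name the parser explicitly so bs4 does not have to guess one
soup_expatistan = BeautifulSoup(page, "html.parser")

# Each expandable row of the comparison table is one category summary
expatistan_table = soup_expatistan.find("table", class_="comparison")
expatistan_titles = expatistan_table.find_all("tr", class_="expandable")

for expatistan_title in expatistan_titles:
    # The th.percent cell holds the summary percentage inside a span
    percent_cell = expatistan_title.find("th", class_="percent")
    print(percent_cell.span.string)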

You can install requests with pip:

$ pip install requests
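
If you want to check the extraction logic without hitting the site, here is a minimal sketch that runs the same find / find_all calls against a small inline HTML string. The markup is hypothetical: only the tag and class names (table.comparison, tr.expandable, th.percent) come from the code above, and the rows and values are made up for illustration. The helper name extract_percentages is also just an assumption of this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: only the tag and class names come from the
# code above; the rows and values are invented for illustration.
SAMPLE_HTML = """
<table class="comparison">
  <tr class="expandable">
    <th class="percent"><span>48%</span></th>
  </tr>
  <tr class="detail">
    <td class="percent"><span>5%</span></td>
  </tr>
  <tr class="expandable">
    <th class="percent"><span>129%</span></th>
  </tr>
</table>
"""


def extract_percentages(html):
    """Return the text of the percent cell in each expandable summary row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="comparison")
    if table is None:
        # No comparison table in this document
        return []
    percentages = []
    for row in table.find_all("tr", class_="expandable"):
        cell = row.find("th", class_="percent")
        # Skip rows that lack the expected th.percent > span structure
        if cell is not None and cell.span is not None:
            percentages.append(cell.span.get_text(strip=True))
    return percentages


print(extract_percentages(SAMPLE_HTML))
```

Only the two expandable rows are picked up; the detail row is ignored because its class does not match. The None checks are there so that a page with a different layout gives an empty list instead of an AttributeError.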

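Edit

The problem does indeed involve urllib2. It looks like the www.expatistan.com server responds differently depending on the User-Agent set in the request. To get the same response with urllib2, you have to do the following: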

import urllib2

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
request = urllib2.Request(url)
# Send a browser-like User-Agent instead of urllib2's default one
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0')
opener = urllib2.build_opener()
page = opener.open(request).read()
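
For anyone on Python 3, where urllib2 became urllib.request, the same idea might look roughly like the sketch below. It is not part of the original answer: the helper names default_user_agent and build_request are made up here, and the actual fetch is left commented out so the snippet runs without network access. It first shows which User-Agent the standard library sends when you do not set one, then builds a request with the browser-like string from the snippet above.

```python
import urllib.request

# Browser-like User-Agent string copied from the snippet above.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) "
              "Gecko/20130406 Firefox/23.0")


def default_user_agent():
    """Return the User-Agent that urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to every request
    return dict(opener.addheaders).get("User-agent")


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
    print(default_user_agent())
    request = build_request(url)
    # urllib stores header names capitalized, hence "User-agent" here
    print(request.get_header("User-agent"))
    # Fetching is left commented out so the sketch runs offline:
    # with urllib.request.urlopen(request) as response:
    #     page = response.read()
```

Printing the default value is a quick way to confirm that a script identifies itself as Python-urllib, which is the kind of difference a server can key its response on.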


2014-05-03 18:14:16 Trein

Thank you very much for your help! – user3599514
