2017-08-30 46 views

I am using Python 3.6 and can scrape text with BeautifulSoup. For practice, I am trying to scrape text from the Walmart website using BeautifulSoup and urllib. This is my code:

from bs4 import BeautifulSoup 
from urllib.request import urlopen 
main_page=urlopen('http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159') 
soup = BeautifulSoup(main_page,"lxml") 
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text() 
price=soup.select_one("span.Price-group").get_text() 
highLights=soup.select_one("div.ProductPage-short-description-body").get_text() 
description=soup.select_one("div.about-desc").get_text() 
print(title,"\n",highLights,"\n",description,"\n",price) 

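In the code above I extract the product title, price, highlights, and description, but I cannot extract the description (the "About This Item" section). Instead of the description, I get something else.

Please help me solve this problem.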

Answer


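The page contains two div elements with class="about-desc". select_one returns only the first match, but the description you want is in the second one. Use select and index into the result instead: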

description=soup.select("div.about-desc")[1].get_text() 
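The difference between select_one and select can be checked offline with a minimal sketch. The HTML string below is made-up markup standing in for the product page (the real page content is an assumption here), and it uses the built-in "html.parser" so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs share the same class, as on the product page.
html = """
<div class="about-desc">Shipping and pickup information</div>
<div class="about-desc">About this item: 32-inch LED TV</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns only the first element matching the selector.
first = soup.select_one("div.about-desc").get_text()

# select returns a list of every match, so the second div is at index 1.
matches = soup.select("div.about-desc")
second = matches[1].get_text() if len(matches) > 1 else None

print(first)
print(second)
```

Checking the length of the list before indexing avoids an IndexError if the page layout changes and only one such div is present.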

Update: the site actually blocks urllib's default user agent, so you should send a browser-like User-Agent header instead.

from bs4 import BeautifulSoup 
import urllib.request 
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'} 
req = urllib.request.Request(url="http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159", headers=user_agent) 
main_page = urllib.request.urlopen(req) 
soup = BeautifulSoup(main_page,"lxml") 
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text() 
price=soup.select_one("span.Price-group").get_text() 
highLights=soup.select_one("div.ProductPage-short-description-body").get_text() 
description=soup.select("div.about-desc")[1].get_text() 
print(title,"\n",highLights,"\n",description,"\n",price) 
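To confirm that the custom header is attached before any network call is made, the request object can be inspected directly. This is only a sketch: build_request is a hypothetical helper name, not part of urllib, and nothing is fetched here.

```python
import urllib.request

# Hypothetical helper: wraps a URL in a Request carrying a browser-like User-Agent.
def build_request(url, user_agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0"):
    return urllib.request.Request(url=url, headers={"User-Agent": user_agent})

req = build_request("http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159")

# urllib normalizes header names with str.capitalize(), so the key is "User-agent".
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen then sends this header in place of the default Python-urllib one.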