2014-12-31 36 views

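Beautifulsoup retrieve href list

Thanks for your attention! I am trying to retrieve the href of each product in a search result page, such as the one at the URL in the code below.

However, when I narrow the search down to the product-image class, the href I retrieve is the image link. Can anyone solve this? Thanks in advance!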

url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5' 
content = urllib2.urlopen(url).read() 
content = preprocess_yelp_page(content) 
soup = BeautifulSoup(content) 

content = soup.findAll('div',{'class':'content dynamic'}) 
draft = str(content) 
soup = BeautifulSoup(draft) 
items = soup.findAll('div',{'class':'cell_section1'}) 
draft = str(items) 
soup = BeautifulSoup(draft) 
content = soup.findAll('div',{'class':'product-image'}) 
draft = str(content) 
soup = BeautifulSoup(draft) 

Answer


You don't need to load each tag's content back into BeautifulSoup over and over again.
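As a rough sketch of why re-parsing is unnecessary: every tag returned by a search can itself be searched, so the nesting from the question can be walked on a single parsed tree. The HTML below is hypothetical markup standing in for the real page (its structure and paths are assumptions), and the snippet is written for Python 3 with bs4 and the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real search page (an assumption).
html = """
<div class="content dynamic">
  <div class="cell_section1">
    <div class="product-image">
      <a href="/p/example-1"><img src="/img/example-1.jpg"></a>
    </div>
  </div>
  <div class="cell_section1">
    <div class="product-image">
      <a href="/p/example-2"><img src="/img/example-2.jpg"></a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # parse exactly once

hrefs = []
# Each result of find_all() is a Tag, and a Tag supports find_all()/find()
# too, so there is no need for str(...) followed by BeautifulSoup(...).
for container in soup.find_all("div", {"class": "content dynamic"}):
    for cell in container.find_all("div", {"class": "cell_section1"}):
        for image_div in cell.find_all("div", {"class": "product-image"}):
            anchor = image_div.find("a")  # first <a> inside the div
            if anchor is not None:
                hrefs.append(anchor.get("href"))

print(hrefs)
```

This keeps the question's step-by-step narrowing but drops the repeated string conversion and re-parsing.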

Use CSS selectors to get all the product links (the a tags under the div with class="product-image"):

import urllib2 
from bs4 import BeautifulSoup 

url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5' 
soup = BeautifulSoup(urllib2.urlopen(url)) 

for link in soup.select('div.product-image > a:nth-of-type(1)'): 
    print link.get('href') 

Prints:

http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371 
http://www.homedepot.com/p/Husky-26-in-6-Drawer-Chest-and-Cabinet-Combo-Black-C-296BF16/203420937 
http://www.homedepot.com/p/Husky-52-in-18-Drawer-Tool-Chest-and-Cabinet-Set-Black-HOTC5218B1QES/204825971 
http://www.homedepot.com/p/Husky-26-in-4-Drawer-All-Black-Tool-Cabinet-H4TR2R/204648170 
... 

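The div.product-image > a:nth-of-type(1) CSS selector matches the first a tag that is a direct child of each div with class product-image.

To save the links into a list, use a list comprehension: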
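To see what the selector does and does not match, here is a small self-contained sketch on hypothetical markup (the extra "Compare" link and the nested link are invented for illustration). It also reads the img src separately, since mixing up the image's src with the anchor's href is one plausible way to end up with image links. It assumes Python 3 and a bs4 version that supports :nth-of-type in select().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> wraps the product image, the second is
# another link, and the nested <a> is not a direct child of the div.
html = """
<div class="product-image">
  <a href="/p/example-1"><img src="/img/example-1.jpg"></a>
  <a href="/compare/example-1">Compare</a>
  <span><a href="/nested/example-1">Nested</a></span>
</div>
<div class="product-image">
  <a href="/p/example-2"><img src="/img/example-2.jpg"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" restricts the match to direct children of the div;
# ":nth-of-type(1)" keeps only the first <a> among its siblings.
selector = "div.product-image > a:nth-of-type(1)"
links = [a.get("href") for a in soup.select(selector)]

# The image URL lives on the <img> tag's src attribute, not on the <a>.
images = [img.get("src") for img in soup.select("div.product-image img")]

print(links)
print(images)
```

Only the first direct-child anchor of each div ends up in links; the second link and the nested one are skipped.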

links = [link.get('href') for link in soup.select('div.product-image > a:nth-of-type(1)')] 

Awesome! Can you tell me how to save them into a list, so I can actually output them? –


@plainvanilla OK, I've updated the answer. – alecxe