2015-12-12 38 views

BeautifulSoup can't get everything

Two weeks ago, I could read everything in the page source at this URL: http://camelcamelcamel.com/Jaybird-Sport-Wireless-Bluetooth-Headphones/product/B013HSW4SM?active=price_amazon

Today, however, when I run the same code again, the historical prices no longer all show up in the soup. Do you know how to fix this?

Here is my Python code (it worked well before!):

import datetime
import re
from urllib2 import urlopen

from bs4 import BeautifulSoup

url = 'http://camelcamelcamel.com/Jaybird-Sport-Wireless-Bluetooth-Headphones/product/B013HSW4SM?active=price_amazon'
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Walk every table body and print the highest, lowest, and average prices.
for tbody in soup.find_all('tbody'):
    for elem in tbody.find_all('tr'):
        tr_class = elem.get('class')
        if tr_class is not None:
            # Rows tagged highest_price / lowest_price carry label, price, date.
            if tr_class[0] in ('highest_price', 'lowest_price'):
                tds = elem.find_all('td')
                td_label = tds[0].get_text().split(' ')[0]
                td_price = tds[1].get_text()
                td_date = tds[2].get_text()
                print td_label, td_price, td_date
        else:
            # The average row has no class, so identify it by its label.
            tds = elem.find_all('td')
            td_label = tds[0].get_text().split(' ')[0]
            if td_label == 'Average':
                td_price = tds[1].get_text()
                print td_label, td_price

# Find the first <p class="smalltext grey"> and print its "since <date>." value.
for p in soup.find_all('p'):
    p_class = p.get('class')
    if p_class is not None and len(p_class) == 2 and p_class[0] == 'smalltext' and p_class[1] == 'grey':
        p_text = p.get_text()
        m = re.search(r'since([\w\d,\s]+)\.', p_text)
        if m:
            date = m.group(1)
            # The captured group starts with a space, hence the leading space in the format.
            dt = datetime.datetime.strptime(date, ' %b %d, %Y')
            print dt.strftime('%Y-%m-%d')
        break

Answers

1

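I don't really know the solution, but in general you should avoid this much list indexing and this many find_all calls. The position or number of elements on a page changes far more easily than attributes such as class or id, so I would recommend using CSS selectors instead.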
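To illustrate the CSS selector suggestion, here is a minimal sketch using BeautifulSoup's select method. The SAMPLE_HTML markup and its prices are made up for the example; only the highest_price and lowest_price class names come from the question's code, and the real page may be structured differently. The extract_price_rows helper is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the class names used in the question;
# the real page layout and values are assumptions.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr class="highest_price"><td>Highest *</td><td>$199.99</td><td>Nov 01, 2015</td></tr>
    <tr class="lowest_price"><td>Lowest *</td><td>$149.99</td><td>Dec 01, 2015</td></tr>
    <tr><td>Average +</td><td>$175.00</td><td>-</td></tr>
  </tbody>
</table>
"""


def extract_price_rows(html):
    """Return (label, price, date) tuples for the highest and lowest price rows."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for css_class in ('highest_price', 'lowest_price'):
        # Select rows by class rather than by their position in the table.
        for tr in soup.select('tr.' + css_class):
            cells = [td.get_text(strip=True) for td in tr.select('td')]
            if len(cells) >= 3:
                rows.append((cells[0].split(' ')[0], cells[1], cells[2]))
    return rows


print(extract_price_rows(SAMPLE_HTML))
```

Selecting by class means the code keeps working if rows are reordered or extra tables are added, as long as the class names stay the same.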

1

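From reading the page source, it looks like the historical price data is loaded via JavaScript. You therefore need some way to simulate a real browser. Personally, I use Selenium for tasks like this.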
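A rough sketch of the Selenium approach might look like the following. It assumes Firefox and its WebDriver are installed, and that the price rows still carry the highest_price and lowest_price classes from the question; fetch_rendered_html and count_price_rows are hypothetical helper names. If the data arrives some time after the initial page load, an explicit wait would likely be needed before reading page_source.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


def count_price_rows(html):
    """Count highest/lowest price rows, as a quick check that the data is present."""
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.select('tr.highest_price')) + len(soup.select('tr.lowest_price'))
```

Calling count_price_rows(fetch_rendered_html(url)) with the URL from the question and comparing the result with the count from the plain urlopen response would show whether the browser-rendered page actually contains rows that the static HTML lacks.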