2017-02-15

Python - display results from all pages, not just the first page (Beautiful Soup)

I have been building a simple scraper that uses Beautiful Soup to fetch food hygiene ratings based on a postcode entered by the user. The code works and correctly pulls the results from the URL.

What I need help with is how to get all of the results to display, not just the results from the first page.

My code is as follows:

import requests
from bs4 import BeautifulSoup

pc = input("Please enter postcode")

url = "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + pc + "&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"
r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")
g_data = soup.findAll("div", {"class": "search-result"})

for item in g_data:
    print(item.find_all("a", {"class": "name"})[0].text)
    try:
        print(item.find_all("span", {"class": "address"})[0].text)
    except IndexError:
        pass
    try:
        print(item.find_all("div", {"class": "rating-image"})[0].text)
    except IndexError:
        pass

I have found by looking at the URL that the page displayed depends on a variable called page, for example:

https://www.scoresonthedoors.org.uk/search.php?award_sort=ALPHA&name=&address=BT147AL&x=0&y=0&page=2#results 

The URL string in the pagination code for the Next Page button is:

<a style="float: right" href="?award_sort=ALPHA&amp;name=&amp;address=BT147AL&amp;x=0&amp;y=0&amp;page=3#results" rel="next " title="Go forward one page">Next <i class="fa fa-arrow-right fa-3"></i></a> 

Is there a way I could get my code to find out how many pages of results there are, and then fetch the results from each of those pages?

Would the best solution be to have the code change the URL string to increment "page=" each time (e.g. with a for loop), or is there a way to find a solution using the information in the pagination link code?
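For reference, the for-loop idea can be sketched as follows. This is a minimal illustration only: the postcode and the page count of 3 are placeholder assumptions, not values taken from the site.

```python
# Sketch: build one URL per results page by appending a "page=" parameter.
# The postcode (BT147AL) and max_page value are placeholders for illustration.
base_url = (
    "https://www.scoresonthedoors.org.uk/search.php"
    "?name=&address=&postcode=BT147AL"
    "&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"
)

max_page = 3  # assumed here; in practice this would be scraped from the paginator

urls = [base_url + "&page={}".format(page) for page in range(1, max_page + 1)]

for url in urls:
    print(url)
```

Each of these URLs could then be fetched and parsed with the same Beautiful Soup code used for the first page.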

Many thanks to anyone who can help or who looks at this question.

Answer

1

You are actually going about this the right way. Generating the paginated URLs to scrape beforehand is a good approach.

I have actually written almost all of the code for you. The part to look at first is the find_max_page() function, which extracts the maximum page number from the pagination string. With that number, you can generate all the URLs that need to be scraped, and then scrape them one by one.
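The parsing step inside find_max_page() boils down to reading the paginator's "Page 1 of N" text. A minimal sketch of just that step (the sample string below is an assumption based on that format, not a live response from the site):

```python
import re

# Sketch: pull the total page count out of the paginator text.
# The sample string mimics the "Page 1 of N" format used by the paginator.
page_text = "Page 1 of 7"

match = re.search(r"Page \d+ of (\d+)", page_text)
max_page = int(match.group(1)) if match else 1
print(max_page)  # prints 7
```

A regex is slightly more forgiving than a plain str.replace() if the current page number is ever something other than 1.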

Check out the code below; it is almost all there.

import requests
from bs4 import BeautifulSoup


class RestaurantScraper(object):

    def __init__(self, pc):
        self.pc = pc  # the input postcode
        self.max_page = self.find_max_page()  # the number of pages available
        self.restaurants = list()  # the final list of restaurants, filled in by the scrape

    def run(self):
        for url in self.generate_pages_to_scrape():
            restaurants_from_url = self.scrape_page(url)
            self.restaurants += restaurants_from_url  # add this page's restaurants to the global list

    def create_url(self):
        """
        Create the core url to scrape.
        :return: A url without pagination (= page 1)
        """
        return "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + self.pc + \
               "&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"

    def create_paginated_url(self, page_number):
        """
        Create a paginated url.
        :param page_number: pagination (integer)
        :return: A paginated url
        """
        return self.create_url() + "&page={}".format(page_number)

    def find_max_page(self):
        """
        Find the number of pages for a specific search.
        :return: The number of pages (integer)
        """
        r = requests.get(self.create_url())
        soup = BeautifulSoup(r.content, "lxml")
        pagination_soup = soup.findAll("div", {"id": "paginator"})
        pagination = pagination_soup[0]
        page_text = pagination("p")[0].text
        return int(page_text.replace('Page 1 of ', ''))

    def generate_pages_to_scrape(self):
        """
        Generate all the paginated urls using the max_page attribute scraped earlier.
        :return: List of urls
        """
        return [self.create_paginated_url(page_number)
                for page_number in range(1, self.max_page + 1)]

    def scrape_page(self, url):
        """
        This comes from your original code snippet. It probably needs a bit of
        work, but you get the idea.
        :param url: Url to scrape and get data from.
        :return: List of restaurant names found on the page
        """
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "lxml")
        g_data = soup.findAll("div", {"class": "search-result"})

        restaurants = list()
        for item in g_data:
            name = item.find_all("a", {"class": "name"})[0].text
            restaurants.append(name)
            try:
                print(item.find_all("span", {"class": "address"})[0].text)
            except IndexError:
                pass
            try:
                print(item.find_all("div", {"class": "rating-image"})[0].text)
            except IndexError:
                pass
        return restaurants


if __name__ == '__main__':
    pc = input('Give your post code')
    scraper = RestaurantScraper(pc)
    scraper.run()
    print("{} restaurants scraped".format(len(scraper.restaurants)))

The scrape_page function is your original code. It could use some work; just make sure that function is properly built out. Everything else is ready to go. Let me know if you have any questions about this code. –


Thanks Philippe, this code works perfectly. –