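Scraping multiple pages from a website with Python

I would like to know how to scrape multiple different pages for one city (e.g. London) from a website using Beautiful Soup, without having to repeat my code over and over.

Ideally, my goal is to first scrape all pages related to one city.

Below is my code: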

session = requests.Session() 
session.cookies.get_dict() 
url = 'http://www.citydis.com' 
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'} 
response = session.get(url, headers=headers) 

soup = BeautifulSoup(response.content, "html.parser") 
metaConfig = soup.find("meta", property="configuration") 


jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=0" 
response = session.get(jsonUrl, headers=headers) 
js_dict = (json.loads(response.content.decode('utf-8'))) 

for item in js_dict: 
    headers = js_dict['searchResults']["tours"] 
    prices = js_dict['searchResults']["tours"] 

for title, price in zip(headers, prices): 
    title_final = title.get("title") 
    price_final = price.get("price")["original"] 

print("Header: " + title_final + " | " + "Price: " + price_final)

The output is the following:

Header: London Travelcard: 1 Tag lang unbegrenzt reisen | Price: 19,44 € 
Header: 105 Minuten London bei Nacht im verdecklosen Bus | Price: 21,21 € 
Header: Ivory House London: 4 Stunden mittelalterliches Bankett| Price: 58,92 € 
Header: London: Themse Dinner Cruise | Price: 96,62 €

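It only gives me back the results of the first page (4 results), but I would like to get all results for London (there must be over 200 of them).

Could you give me any advice? I suppose I have to count up the page number in the jsonUrl, but I don't know how to do it.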

UPDATE

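Thanks to the help, I was able to get one step further.

At the moment I can only scrape one page (page=0), but I would like to scrape the first 10 pages. My approach is the following.

Relevant snippet from the code: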

soup = bs4.BeautifulSoup(response.content, "html.parser") 
metaConfig = soup.find("meta", property="configuration") 

page = 0 
while page <= 11: 
    page += 1 

    jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=" + str(page) 
    response = session.get(jsonUrl, headers=headers) 
    js_dict = (json.loads(response.content.decode('utf-8'))) 


    for item in js_dict: 
     headers = js_dict['searchResults']["tours"] 
     prices = js_dict['searchResults']["tours"] 

     for title, price in zip(headers, prices): 
      title_final = title.get("title") 
      price_final = price.get("price")["original"] 

      print("Header: " + title_final + " | " + "Price: " + price_final)

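I'm getting the results back for one particular page, but not for all of them. In addition, I'm receiving an error message. Is this related to why I'm not getting all results back?

Output: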

Traceback (most recent call last): 
File "C:/Users/Scripts/new.py", line 19, in <module> 
AttributeError: 'list' object has no attribute 'update'

Thanks for your help

Source

2017-04-16 Serious Ruffy

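If you want to scrape web pages the right way, look into 'xpaths'. It will shrink your code a lot, perhaps to at most 5 lines for what you did above. It is the standard way of doing anything related to crawling and scraping. – anekix

Thanks for the information. Will try it out. Nevertheless, could you provide some feedback on how to solve the problem above with the approach above? –

You really should make sure that your code examples are complete (you are missing some imports) and syntactically correct (the code contains indentation problems). In trying to produce a working example, I came up with the following.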

import json

import bs4
import requests

session = requests.Session()
url = 'http://www.getyourguide.de'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = session.get(url, headers=headers)

# The landing page carries a token inside a <meta property="configuration"> tag.
soup = bs4.BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")
metaConfigTxt = metaConfig["content"]
csrf = json.loads(metaConfigTxt)["pageToken"]

# Send the token along with the request for the JSON results.
jsonUrl = "https://www.getyourguide.de/s/results.json?&q=London&customerSearch=1&page=0"
headers.update({'X-Csrf-Token': csrf})
response = session.get(jsonUrl, headers=headers)
js_dict = json.loads(response.content.decode('utf-8'))
print(js_dict.keys())

# Each tour holds both its title and its price, so a single loop is enough.
# The list gets its own name so that the request headers dict is not overwritten.
tours = js_dict['searchResults']["tours"]
for tour in tours:
    title_final = tour.get("title")
    price_final = tour.get("price")["original"]
    print("Header: " + title_final + " | " + "Price: " + price_final)

This gave me rather more than four results.
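As for the follow-up about fetching the first ten pages: the `AttributeError` in the question's traceback is most likely caused by `headers` being rebound to the list of tours inside the loop, so that a later `headers.update(...)` runs on a list instead of a dict. The sketch below is one possible way to loop over page numbers while keeping the request headers under their own name. The function names `extract_tours` and `scrape_pages` are made up for illustration, and stopping at the first empty page is an assumption about how the site behaves past its last page, not something confirmed here. `session` is assumed to be a `requests.Session` (anything with a compatible `get` method works).

```python
BASE_URL = "https://www.getyourguide.de/s/results.json"


def extract_tours(js_dict):
    """Return a list of (title, price) tuples from one decoded results page."""
    tours = js_dict.get("searchResults", {}).get("tours", [])
    return [
        (tour.get("title"), (tour.get("price") or {}).get("original"))
        for tour in tours
    ]


def scrape_pages(session, request_headers, query, num_pages=10):
    """Collect tours from pages 0 .. num_pages - 1 (hypothetical helper)."""
    results = []
    for page in range(num_pages):
        # Let the session build the query string instead of concatenating it.
        params = {"q": query, "customerSearch": 1, "page": page}
        response = session.get(BASE_URL, headers=request_headers, params=params)
        response.raise_for_status()
        tours = extract_tours(response.json())
        if not tours:
            # Assumption: an empty page means there are no further results.
            break
        results.extend(tours)
    return results
```

With `session`, `headers` and the token set up as in the code above, `scrape_pages(session, headers, "London")` would return the collected pairs, which can then be printed in the same "Header: ... | Price: ..." format.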

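In general you will find that many sites that return JSON paginate their responses, offering a fixed number of results per page. In these cases each page except the last will usually contain a key whose value gives you the URL of the next page. It is then a simple matter to loop over the pages and break out of the loop when you detect that the key is absent.

Source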
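The "follow the next-page key" pattern described above can be sketched roughly as follows. The key name `"next"` and the helper `iter_pages` are purely hypothetical, since the key each site uses differs and has to be found by inspecting its JSON. `fetch_json` stands for any callable that takes a URL and returns the decoded JSON dict.

```python
def iter_pages(fetch_json, first_url, next_key="next"):
    """Yield each decoded page, following next-page URLs until none is given."""
    url = first_url
    seen = set()
    # Stop when the key is absent (or empty), or when a URL repeats.
    while url and url not in seen:
        seen.add(url)
        page = fetch_json(url)
        yield page
        url = page.get(next_key)
```

With requests, `fetch_json` could be something like `lambda u: session.get(u, headers=headers).json()`. The `seen` set is only a guard against a site that points a page back at itself.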

2017-04-17 09:26:22 holdenweb

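Thank you very much. I will take your feedback into account. At the moment I can only scrape one page (page=0), but I would like to scrape the first 10 pages. I have posted my approach in my initial post. Hopefully you can guide me to the right solution. And thanks for your patience :) –

A pleasure. I think any further progress will depend on the specifics of the site, and so may well fall outside the scope of Stack Overflow – holdenweb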
