
Any help would be greatly appreciated, as I am new to Python. I created the web crawler below, but it does not crawl all of the pages, only 2 pages. What changes are needed so that it crawls all of the pages? My BeautifulSoup spider only crawls 2 pages, not all of them.

Please see the def trade_spider(max_pages) loop; at the bottom I call trade_spider(18), which should loop over all of the pages.

Thanks for your help.

import csv 
import re 
import requests 
from bs4 import BeautifulSoup 

f = open('dataoutput.csv','w', newline= "") 
writer = csv.writer(f) 

def trade_spider(max_pages): 
    page = 1 
    while page <= max_pages: 
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
            href = "http://www.zoopla.co.uk" + link.get('href')
            title = link.string
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)

    for item_name in soup.findAll('h2', {'itemprop': 'streetAddress'}):
        address = item_name.get_text(strip=True)
        writer.writerow([address])
trade_spider(18) 

Does an error occur, or does it exit cleanly? Does the 'page' variable reach 18 or only 2? –

Answer


Your code works fine, and it does crawl all of the pages (although there are only 14 pages, not 18). It looks like you are trying to scrape the street addresses, in which case the second function is unnecessary and only slows the crawler down by calling requests.get() too many times. I have modified the code a little, and this version is faster.

import csv 
import re 
import requests 
from bs4 import BeautifulSoup 

f = open('dataoutput.csv','w', newline="") 
writer = csv.writer(f) 

def trade_spider(max_pages): 
    page = 1 
    while page <= max_pages: 
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)

        # Changed the class' value
        for link in soup.findAll('a', {'class': 'listing-results-address'}):
            #href = "http://www.zoopla.co.uk" + link.get('href')
            #title = link.string
            #get_single_item_data(href)
            address = link.get_text()
            print(address)    # Just to check it is working fine.
            writer.writerow([address])

        print(page)
        page += 1

# Unnecessary code 

'''def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)

    for item_name in soup.findAll('h2', {'itemprop': 'streetAddress'}):
        address = item_name.get_text(strip=True)
        writer.writerow([address])'''

trade_spider(18) 

Thanks Rajeev, it looks like the code above will get the addresses, but I want more information than just the address, which means going into each link and getting that information there. Even with trade_spider(14) it still only returns 2 pages of results, any ideas? – hello11


I reworked the code and it returned information for all of the pages. Perhaps another part of your code (which you may not have posted) is causing the problem – Rajeev


Thanks Rajeev, a NoneType error occurred. How do I get past the NoneType error? – hello11
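
For anyone hitting the same NoneType error: it often means one of the find/get calls returned None (for example link.get('href') when an anchor has no href attribute, or soup.find() when a tag is missing) and the result was then used as if it were a string or tag. Below is a minimal sketch, not Rajeev's code, of following each listing link to its detail page while guarding against None; the 'listing-results-address' class and the streetAddress markup are taken from the code above and may have changed on the site since.

import csv
import requests
from bs4 import BeautifulSoup

BASE = 'http://www.zoopla.co.uk'

def scrape_listing(url, writer):
    # Fetch one detail page and write its street address, skipping missing tags.
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    item = soup.find('h2', {'itemprop': 'streetAddress'})
    if item is None:    # guard: avoids an error on NoneType when the tag is absent
        return
    writer.writerow([item.get_text(strip=True)])

def trade_spider(max_pages):
    with open('dataoutput.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for page in range(1, max_pages + 1):
            url = (BASE + '/for-sale/property/nottingham/?price_max=200000'
                   '&identifier=nottingham&q=Nottingham&search_source=home'
                   '&radius=0&pn=' + str(page) + '&page_size=100')
            soup = BeautifulSoup(requests.get(url).text, 'html.parser')
            for link in soup.findAll('a', {'class': 'listing-results-address'}):
                href = link.get('href')
                if not href:    # guard: skip anchors that have no href attribute
                    continue
                scrape_listing(BASE + href, writer)

trade_spider(18)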