
Any help would be greatly appreciated, as I am new to Python. I created the web crawler below, but it does not crawl all of the pages, only 2 pages. What changes are needed so that it crawls all of the pages? My BeautifulSoup spider only crawls 2 pages, not all of them.

Please see the def trade_spider(max_pages) loop; at the bottom I call trade_spider(18), which should loop over all of the pages.

Thanks for your help.

import csv 
import re 
import requests 
from bs4 import BeautifulSoup 

f = open('dataoutput.csv','w', newline= "") 
writer = csv.writer(f) 

def trade_spider(max_pages): 
    page = 1 
    while page <= max_pages: 
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
            href = "http://www.zoopla.co.uk" + link.get('href')
            title = link.string
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)

    for item_name in soup.findAll('h2', {'itemprop': 'streetAddress'}):
        address = item_name.get_text(strip=True)
        writer.writerow([address])
trade_spider(18) 

Does an error occur, or does it exit cleanly? Does the 'page' variable reach 18 or only 2? –

Answer


Your code works fine, and it does crawl all of the pages (although there are only 14 pages, not 18). It looks like you are trying to scrape the street addresses, in which case the second function is unnecessary and only slows the crawler down by calling requests.get() too many times. I have modified the code a little, and this version is faster.

import csv 
import re 
import requests 
from bs4 import BeautifulSoup 

f = open('dataoutput.csv','w', newline="") 
writer = csv.writer(f) 

def trade_spider(max_pages): 
    page = 1 
    while page <= max_pages: 
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)

        # Changed the class' value
        for link in soup.findAll('a', {'class': 'listing-results-address'}):
            #href = "http://www.zoopla.co.uk" + link.get('href')
            #title = link.string
            #get_single_item_data(href)
            address = link.get_text()
            print(address)    # Just to check it is working fine.
            writer.writerow([address])

        print(page)
        page += 1

# Unnecessary code 

'''def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)

    for item_name in soup.findAll('h2', {'itemprop': 'streetAddress'}):
        address = item_name.get_text(strip=True)
        writer.writerow([address])'''

trade_spider(18) 

Thanks Rajeev, it looks like the code above will get the addresses, but I want more information than just the address, which means going into each link and getting that information there. Even with trade_spider(14) it still only returns 2 pages of results, any ideas? – hello11


I reworked the code and it returned information for all of the pages. Perhaps another part of your code (which you may not have posted) is causing the problem – Rajeev


Thanks Rajeev, a NoneType error occurred. How do I get past the NoneType error? – hello11
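
For anyone hitting the same NoneType error: it often means one of the find/get calls returned None (for example link.get('href') when an anchor has no href attribute, or soup.find() when a tag is missing) and the result was then used as if it were a string or tag. Below is a minimal sketch, not Rajeev's code, of following each listing link to its detail page while guarding against None; the 'listing-results-address' class and the streetAddress markup are taken from the code above and may have changed on the site since.

import csv
import requests
from bs4 import BeautifulSoup

BASE = 'http://www.zoopla.co.uk'

def scrape_listing(url, writer):
    # Fetch one detail page and write its street address, skipping missing tags.
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    item = soup.find('h2', {'itemprop': 'streetAddress'})
    if item is None:    # guard: avoids an error on NoneType when the tag is absent
        return
    writer.writerow([item.get_text(strip=True)])

def trade_spider(max_pages):
    with open('dataoutput.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for page in range(1, max_pages + 1):
            url = (BASE + '/for-sale/property/nottingham/?price_max=200000'
                   '&identifier=nottingham&q=Nottingham&search_source=home'
                   '&radius=0&pn=' + str(page) + '&page_size=100')
            soup = BeautifulSoup(requests.get(url).text, 'html.parser')
            for link in soup.findAll('a', {'class': 'listing-results-address'}):
                href = link.get('href')
                if not href:    # guard: skip anchors that have no href attribute
                    continue
                scrape_listing(BASE + href, writer)

trade_spider(18)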