I want to scrape the contact details for agent_name by going into each listing page. Sometimes this script returns me one entry, sometimes different entries, and I can't figure out why. What is wrong with my scraper?

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()


class criticspider(CrawlSpider):
    name = "comp"
    allowed_domains = ["iproperty.com.my"]
    start_urls = ["http://www.iproperty.com.my/property/searchresult.aspx?t=S&gpt=AR&st=&ct=&k=&pt=&mp=&xp=&mbr=&xbr=&mbu=&xbu=&lo=&wp=&wv=&wa=&ht=&au=&sby=&ns=1"]

    def parse(self, response):
        sites = response.xpath('.//*[@id="frmSaveListing"]/ul')
        items = []

        for site in sites:
            item = CompItem()
            item['title'] = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/text()').extract()[0]
            item['link'] = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//*[@id="main-content3"]/div[1]/div/table/tbody/tr/td[1]/table/tbody/tr[3]/td/text()').extract()
        yield old_item
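Note that extract() returns a list, so extract()[0] raises an IndexError as soon as the long positional XPath matches nothing, and Scrapy then aborts the callback for that response with a logged error; since the listing markup shifts between refreshes, this alone can make the results look random. A minimal guarded sketch of the loop body, reusing the imports and item class from the code above:

for site in sites:
    item = CompItem()
    # extract() returns a list; check it before indexing
    title = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/text()').extract()
    link = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/@href').extract()
    if not title or not link:
        # skip listings whose markup does not match the positional path
        continue
    item['title'] = title[0]
    # urljoin leaves already-absolute URLs unchanged
    item['link'] = urljoin(response.url, link[0])
    yield scrapy.Request(item['link'],
                         meta={'item': item},
                         callback=self.anchor_page)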

When your code runs and doesn't work, have you looked at whether the web page has changed? – 2015-04-04 14:35:26


I checked the web page, and it does change as new listings appear, but it should still pull the data matching the XPath, shouldn't it? – nik 2015-04-04 14:36:40

Answer


Even if you open the start URL in a browser and refresh the page several times, you will get different search results.

In any case, your spider needs adjusting, since it does not extract all of the agents on the page:

import scrapy
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()


class criticspider(scrapy.Spider):
    name = "comp"

    allowed_domains = ["iproperty.com.my"]
    start_urls = ["http://www.iproperty.com.my/property/searchresult.aspx?t=S&gpt=AR&st=&ct=&k=&pt=&mp=&xp=&mbr=&xbr=&mbu=&xbu=&lo=&wp=&wv=&wa=&ht=&au=&sby=&ns=1"]

    def parse(self, response):
        # one node per agent listing on the search results page
        agents = response.xpath('//li[@class="search-listing"]//div[@class="article-right"]')
        for agent in agents:
            item = CompItem()
            item['title'] = agent.xpath('.//a/text()').extract()[0]
            item['link'] = agent.xpath('.//a/@href').extract()[0]
            # follow the agent link, carrying the partially filled item along
            yield scrapy.Request(urljoin("http://www.iproperty.com.my", item['link']),
                                 meta={'item': item},
                                 callback=self.anchor_page)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        # the agent's self-description/promotion on the detail page
        old_item['data'] = response.xpath('.//*[@id="main-content3"]//table//table//p/text()').extract()
        yield old_item

What I have fixed:

  • Used scrapy.Spider instead of CrawlSpider
  • Fixed the XPath expressions so that they iterate over all the agents on the page, then follow each agent's link and grab the agent's self-description/promotion
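For reference, a spider written this way can be run from a single file without a full Scrapy project; one way to do it and collect the items as JSON (the filename and output path here are arbitrary assumptions):

scrapy runspider comp_spider.py -o agents.json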

Thanks for the help, friend – nik 2015-04-05 15:39:04


Hey, could you suggest how I would write a rule to parse all the pages? – nik 2015-04-05 15:39:48


@nik could you please elaborate on that in a separate question? Thanks – alecxe 2015-04-05 15:46:02
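As a general pointer for the follow-up question above: with a plain scrapy.Spider, pagination is usually handled by extracting the "next page" link at the end of parse() and yielding a Request back into the same callback. A minimal sketch reusing the imports from the answer's spider; the pager XPath is an assumption and would need to be checked against the live markup:

def parse(self, response):
    # ... extract agents and yield the detail-page requests as above ...

    # hypothetical pager link -- verify this selector on the actual page
    next_page = response.xpath('//a[@rel="next"]/@href').extract()
    if next_page:
        yield scrapy.Request(urljoin(response.url, next_page[0]),
                             callback=self.parse)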