I want to scrape the contact details for agent_name by going into each listing page. Sometimes this script returns me one entry, sometimes different entries, and I can't figure out why. What is wrong with my scraper?

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()


class criticspider(CrawlSpider):
    name = "comp"
    allowed_domains = ["iproperty.com.my"]
    start_urls = ["http://www.iproperty.com.my/property/searchresult.aspx?t=S&gpt=AR&st=&ct=&k=&pt=&mp=&xp=&mbr=&xbr=&mbu=&xbu=&lo=&wp=&wv=&wa=&ht=&au=&sby=&ns=1"]

    def parse(self, response):
        sites = response.xpath('.//*[@id="frmSaveListing"]/ul')
        items = []

        for site in sites:
            item = CompItem()
            item['title'] = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/text()').extract()[0]
            item['link'] = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//*[@id="main-content3"]/div[1]/div/table/tbody/tr/td[1]/table/tbody/tr[3]/td/text()').extract()
        yield old_item
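Note that extract() returns a list, so extract()[0] raises an IndexError as soon as the long positional XPath matches nothing, and Scrapy then aborts the callback for that response with a logged error; since the listing markup shifts between refreshes, this alone can make the results look random. A minimal guarded sketch of the loop body, reusing the imports and item class from the code above:

for site in sites:
    item = CompItem()
    # extract() returns a list; check it before indexing
    title = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/text()').extract()
    link = site.xpath('.//li[2]/div[3]/div[1]/div[2]/p[1]/a/@href').extract()
    if not title or not link:
        # skip listings whose markup does not match the positional path
        continue
    item['title'] = title[0]
    # urljoin leaves already-absolute URLs unchanged
    item['link'] = urljoin(response.url, link[0])
    yield scrapy.Request(item['link'],
                         meta={'item': item},
                         callback=self.anchor_page)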

When your code runs and doesn't work, have you looked at whether the web page has changed? – 2015-04-04 14:35:26


I checked the web page, and it does change as new listings appear, but it should still pull the data matching the XPath, shouldn't it? – nik 2015-04-04 14:36:40

Answer


Even if you open the start URL in a browser and refresh the page several times, you will get different search results.

In any case, your spider needs adjusting, since it does not extract all of the agents on the page:

import scrapy
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()


class criticspider(scrapy.Spider):
    name = "comp"

    allowed_domains = ["iproperty.com.my"]
    start_urls = ["http://www.iproperty.com.my/property/searchresult.aspx?t=S&gpt=AR&st=&ct=&k=&pt=&mp=&xp=&mbr=&xbr=&mbu=&xbu=&lo=&wp=&wv=&wa=&ht=&au=&sby=&ns=1"]

    def parse(self, response):
        # one node per agent listing on the search results page
        agents = response.xpath('//li[@class="search-listing"]//div[@class="article-right"]')
        for agent in agents:
            item = CompItem()
            item['title'] = agent.xpath('.//a/text()').extract()[0]
            item['link'] = agent.xpath('.//a/@href').extract()[0]
            # follow the agent link, carrying the partially filled item along
            yield scrapy.Request(urljoin("http://www.iproperty.com.my", item['link']),
                                 meta={'item': item},
                                 callback=self.anchor_page)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        # the agent's self-description/promotion on the detail page
        old_item['data'] = response.xpath('.//*[@id="main-content3"]//table//table//p/text()').extract()
        yield old_item

What I have fixed:

  • Used scrapy.Spider instead of CrawlSpider
  • Fixed the XPath expressions so that they iterate over all the agents on the page, then follow each agent's link and grab the agent's self-description/promotion
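For reference, a spider written this way can be run from a single file without a full Scrapy project; one way to do it and collect the items as JSON (the filename and output path here are arbitrary assumptions):

scrapy runspider comp_spider.py -o agents.json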

Thanks for the help, friend – nik 2015-04-05 15:39:04


Hey, could you suggest how I would write a rule to parse all the pages? – nik 2015-04-05 15:39:48


@nik could you please elaborate on that in a separate question? Thanks – alecxe 2015-04-05 15:46:02
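As a general pointer for the follow-up question above: with a plain scrapy.Spider, pagination is usually handled by extracting the "next page" link at the end of parse() and yielding a Request back into the same callback. A minimal sketch reusing the imports from the answer's spider; the pager XPath is an assumption and would need to be checked against the live markup:

def parse(self, response):
    # ... extract agents and yield the detail-page requests as above ...

    # hypothetical pager link -- verify this selector on the actual page
    next_page = response.xpath('//a[@rel="next"]/@href').extract()
    if next_page:
        yield scrapy.Request(urljoin(response.url, next_page[0]),
                             callback=self.parse)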