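CrawlSpider is not following its rules

I've run into a problem with a spider I'm writing. I'm trying to recursively scrape the courses from my university's website, but I'm having a lot of trouble with Rule and LinkExtractor.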

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import BotItem

class UlsterSpider(CrawlSpider):
    name = "ulster"
    allowed_domains = ["ulster.ac.uk"]
    start_urls = (
        'http://www.ulster.ac.uk/courses/course-finder?query=&f.Year_of_entry|E=2015/16&f.Type|D=Undergraduate',
    )

    rules = (
        Rule(LinkExtractor(allow=("index\.php",)), callback="parse"),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="pagination"]'), follow=True),
    )

    def parse(self, response):
        item = BotItem()

        for title in response.xpath('//html'):
            item['name'] = title.xpath('//*[@id="course_list"]/div/h2/a/text()').extract()
            yield item

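My spider is laid out as above. The rules are on lines 16-18. What I'm trying to do is scrape the course titles by following the pagination below the course list. However, it won't follow the links. If anyone could point me in the right direction, it would be a big help. I tried to replicate the examples that use the SGML extractor, but Scrapy reports that it is deprecated and should not be used.

Disclaimer

Although this is a university's website, this is not homework. It is for fun and learning. I'm really stuck.

Source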

2015-06-20 plotplot

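Homework questions are perfectly acceptable on SO; we even have a [tag:homework] tag, as long as they follow the guidelines [here](http://stackoverflow.com/help/mcve). IMO, being able to ask a question properly on SO is a very valuable skill, since you will most likely be back here after you graduate, so there is no problem with getting a little help (your professor may feel differently, of course). – IanAuld

What are you trying to capture with your first rule? It doesn't seem to capture anything. – tegancp

I don't think you need two rules. You can declare a single rule that both follows the links and parses each page.

In the rule I restrict the XPath to the last link of the pagination list, because otherwise some links could be parsed more than once.

I use parse_start_url as the callback so that the URL in the start_urls variable is parsed as well.

The xpath command returns a list with all the text nodes between the tags, but only the first one is of interest, so take it and strip the whitespace.

With the following items.py: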

import scrapy 

class BotItem(scrapy.Item): 
    name = scrapy.Field()

And the spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from ..items import BotItem
from scrapy.linkextractors import LinkExtractor


class UlsterSpider(CrawlSpider):
    name = "ulster"
    allowed_domains = ["ulster.ac.uk"]
    start_urls = (
        'http://www.ulster.ac.uk/courses/course-finder?query=&f.Year_of_entry|E=2015/16&f.Type|D=Undergraduate',
    )

    rules = (
        # Follow only the last entry of the pagination list and parse every page reached.
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="pagination"]/ul/li[position() = last()]'),
            follow=True,
            callback='parse_start_url'),
    )

    # CrawlSpider also calls parse_start_url for the start_urls responses,
    # so the first page is parsed too.
    def parse_start_url(self, response):
        for title in response.xpath('//*[@id="course_list"]/div/h2/a'):
            # Create a new item per course instead of reusing one mutable item.
            item = BotItem()
            item['name'] = title.xpath('text()')[0].extract().strip()
            yield item

You can run it like this:

scrapy crawl ulster -o titles.json

That yields:

[{"name": "ACCA - Association of Chartered Certified Accountants"}, 
{"name": "Accounting"}, 
{"name": "Accounting"}, 
{"name": "Accounting and Advertising"}, 
{"name": "Accounting and Human Resource Management"}, 
{"name": "Accounting and Law"}, 
{"name": "Accounting and Management"}, 
{"name": "Accounting and Managerial Finance"}, 
{"name": "Accounting and Marketing"}, 
{"name": "Accounting with Finance"}, 
{"name": "Advertising"}, 
{"name": "Advertising and Human Resource Management"}, 
{"name": "Advertising with Computing"}, 
{"name": "Advertising with Drama"}, 
{"name": "Advertising with Human Resource Management"}, 
{"name": "Advertising with Psychology"}, 
...]

UPDATE: Note that I'm using the latest Scrapy version. I don't know whether it matches yours, so you may need to adjust some of the imports.
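To make the "take the first text node and strip it" step concrete, here is a minimal sketch that needs no Scrapy at all. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and the markup in SAMPLE is hypothetical, modelled only on the XPath used above; the real page may well be structured differently. The function name extract_names is likewise made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the XPath in the answer; not the real page.
SAMPLE = """
<div id="course_list">
  <div><h2><a href="/a">
      Accounting
  </a></h2></div>
  <div><h2><a href="/b">  Advertising </a></h2></div>
</div>
"""


def extract_names(markup):
    """Return one record per course link, with surrounding whitespace removed."""
    root = ET.fromstring(markup.strip())
    items = []
    # Relative path from the course_list <div>: one <a> per course.
    for link in root.findall("./div/h2/a"):
        # Build a fresh record per title; link.text is the first text node.
        items.append({"name": (link.text or "").strip()})
    return items


print(extract_names(SAMPLE))
```

The point of the sketch is the shape of the result: one small record per link, rather than a single record holding the whole list of titles.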

Source

2015-06-20 17:59:25 Birei

Thank you very much. Would you mind breaking down the XPath a bit more? '//div[@class="pagination"]/ul/li[position() = last()]' I don't fully understand '[position() = last()]'. – plotplot

@plotplot: It means 'li[-1]', but the XPath way. – Birei
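A minimal sketch of what that predicate selects, assuming a made-up pagination snippet (the real page markup may differ). It uses the standard library's xml.etree.ElementTree, whose limited XPath subset accepts the shorthand li[last()]; in full XPath 1.0 that shorthand is equivalent to li[position() = last()]. The function name last_pagination_hrefs is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, invented for illustration.
SAMPLE = """
<body>
  <div class="pagination">
    <ul>
      <li><a href="?page=1">1</a></li>
      <li><a href="?page=2">2</a></li>
      <li><a href="?page=2">Next</a></li>
    </ul>
  </div>
</body>
"""


def last_pagination_hrefs(markup):
    """Return the hrefs found inside the last <li> of the pagination list."""
    root = ET.fromstring(markup.strip())
    # li[last()] keeps only the final <li> child of the <ul>,
    # which is what li[position() = last()] expresses in full XPath.
    last_items = root.findall(".//div[@class='pagination']/ul/li[last()]")
    return [a.get("href") for li in last_items for a in li.findall("a")]


print(last_pagination_hrefs(SAMPLE))
```

Only the link in the final list entry is returned, so the numbered links that point at the same pages are never extracted a second time.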

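There are a few things you should consider:

Debugging: Scrapy has several ways to help you work out why your spider is not behaving the way you want or expect. Have a look at Debugging Spiders in the Scrapy docs; this may well be the most important page in the documentation.
You are confusing the spider: Referring to the Scrapy docs again, you will find the following

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Use a different name for your non-default callback.

Check your spider's behavior:
You may want to revise your item-loading code; I suspect that the list you are getting is not what you want.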
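To illustrate the warning about parse quoted above, here is a toy model in plain Python. It is emphatically not Scrapy's implementation: ToyCrawlSpider, the (predicate, callback name) rule format and the dict-based response are all assumptions invented for this sketch. It only shows the mechanism, namely that a subclass which overrides the method holding the rule logic silently loses that logic.

```python
class ToyCrawlSpider:
    """Toy stand-in for a rule-driven spider; NOT Scrapy's actual code."""

    rules = ()  # hypothetical format: (link_predicate, callback_name) pairs

    def parse(self, response):
        # In this toy model, parse() is where the rules are applied.
        followed = []
        for link in response["links"]:
            for matches, callback_name in self.rules:
                if matches(link):
                    followed.append((link, callback_name))
                    break
        return followed


class BrokenSpider(ToyCrawlSpider):
    rules = ((lambda link: "page=" in link, "parse"),)

    def parse(self, response):
        # Overriding parse() replaces the rule handling above entirely.
        return [{"name": response["title"]}]


class WorkingSpider(ToyCrawlSpider):
    rules = ((lambda link: "page=" in link, "parse_item"),)

    def parse_item(self, response):
        # A differently named callback leaves the base parse() untouched.
        return [{"name": response["title"]}]


response = {"title": "Accounting", "links": ["/courses?page=2", "/about"]}
print(BrokenSpider().parse(response))   # item only, no links followed
print(WorkingSpider().parse(response))  # the pagination link is scheduled
```

In the toy model the broken spider returns only an item for the page it was given and never schedules the pagination link, which matches the symptom described in the question.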

Source

2015-06-20 18:19:20 tegancp
