How to feed a spider with links crawled within the spider?

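I'm writing a spider (CrawlSpider) for an online store. According to the client's requirements, I need to write two rules: one for determining which pages have items, and another for extracting the items.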

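I already have both rules working independently:

If start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] and I comment out the Rule and the code of parse_category, my parse_item extracts every item.
On the other hand, if start_urls = "http://www.example.com" and I comment out the Rule and the code of parse_item, parse_category returns every link that has items to extract, i.e. parse_category returns www.example.com/books.php and www.example.com/movies.php.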

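My problem is that I don't know how to merge the two parts, so that with start_urls = "http://www.example.com", parse_category extracts www.example.com/books.php and www.example.com/movies.php and feeds those links to parse_item, where I actually extract the information of each item.

I need to find a way to do it like this instead of just using start_urls = ["www.example.com/books.php", "www.example.com/movies.php"], because if a new category is added in the future (e.g. www.example.com/music.php), the spider would not detect it automatically and would have to be edited by hand. Not a big deal, but the client doesn't want this.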

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# StoreCategory and StoreItem are the project's Item classes (not shown here).


class StoreSpider(CrawlSpider):
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"]

    rules = (
        Rule(LinkExtractor(), follow=True, callback="parse_category"),
        Rule(LinkExtractor(), follow=False, callback="parse_item"),
    )

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category
        # or something else; it sets is_category and name
        if is_category:
            category["name"] = name
            category["url"] = response.url
        return category

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
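
Scrapy's documentation states that if several rules match the same link, the first one in rules is used. In the spider above both rules use a bare LinkExtractor(), so every link is claimed by the first rule and parse_item never receives a request. The following is a simplified model of that first-match behaviour using only the standard library; it is not Scrapy's implementation, and the item.php?id=... URL shape is a hypothetical example.

```python
import re

# Hypothetical links found on a page; the item URL shape is an assumption.
LINKS = [
    "http://www.example.com/books.php",
    "http://www.example.com/movies.php",
    "http://www.example.com/item.php?id=1",
]


def dispatch(links, rules):
    """Map each link to the callback of the first rule that matches it.

    `rules` is an ordered list of (regex, callback_name) pairs.
    """
    assigned = {}
    for pattern, callback in rules:
        for link in links:
            # A link already claimed by an earlier rule is skipped.
            if link not in assigned and re.search(pattern, link):
                assigned[link] = callback
    return assigned


# Both rules match everything, as with two bare LinkExtractor() rules.
catch_all = [(r".*", "parse_category"), (r".*", "parse_item")]
# Rules whose patterns do not overlap.
disjoint = [(r"/(books|movies)\.php$", "parse_category"), (r"/item\.php", "parse_item")]

print(dispatch(LINKS, catch_all))
print(dispatch(LINKS, disjoint))
```

With catch_all every link ends up with parse_category; with disjoint patterns the item link reaches parse_item. In a real spider the equivalent would be giving each LinkExtractor its own allow or restrict_css argument so the two rules do not compete for the same links.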

2015-11-02 yzT

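Instead of using parse_category, I used restrict_css in the LinkExtractor to get the links I want, and it seems to feed the second Rule with the extracted links, so my question is answered. It ended up like this: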

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StoreSpider(CrawlSpider):
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # no callback: links inside #movies and #books are only followed
        Rule(LinkExtractor(restrict_css=("#movies", "#books"))),
        Rule(LinkExtractor(), callback="parse_item"),
    )

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item

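It still can't detect newly added categories (and there is no clear pattern to use in restrict_css that wouldn't also pick up other junk), but at least it meets the client's prerequisites: 2 rules, one for extracting the category links and another for extracting the item data.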

2015-11-02 10:43:58 yzT

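CrawlSpider rules don't work the way you want, so you'll need to implement the logic yourself. When you specify follow=True you can't use a callback, because the idea is to keep fetching links (not items) while following the rules; check the documentation.

You could try something like: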

from scrapy import Request, Spider
from scrapy.linkextractors import LinkExtractor


class StoreSpider(Spider):  # no rules, so a plain Spider is enough
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # takes over the job of the category rule
        category_le = LinkExtractor("something for categories")
        for a in category_le.extract_links(response):
            yield Request(a.url, callback=self.parse_category)
        item_le = LinkExtractor("something for items")
        for a in item_le.extract_links(response):
            yield Request(a.url, callback=self.parse_item)

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category
        # or something else; it sets is_category and name
        if is_category:
            category["name"] = name
            category["url"] = response.url
            yield category
        # keep extracting category and item links from this page
        for req in self.parse(response):
            yield req

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
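
To illustrate how this manual approach can pick up a new category without editing start_urls, here is a small model that does not use Scrapy at all. The SITE dictionary is a hypothetical stand-in for HTTP responses, is_item is a placeholder for the item link pattern, and a page is counted as a category when it links to at least one item; all three are assumptions made for the sketch.

```python
from collections import deque

# Hypothetical site map: page URL -> links found on that page.
SITE = {
    "http://www.example.com/": [
        "http://www.example.com/books.php",
        "http://www.example.com/movies.php",
        "http://www.example.com/music.php",  # newly added category
        "http://www.example.com/about.php",
    ],
    "http://www.example.com/books.php": ["http://www.example.com/item.php?id=1"],
    "http://www.example.com/movies.php": ["http://www.example.com/item.php?id=2"],
    "http://www.example.com/music.php": ["http://www.example.com/item.php?id=3"],
}


def is_item(url):
    # Placeholder test; a real spider would use a LinkExtractor pattern here.
    return "/item.php" in url


def crawl(start_url):
    """Return (categories, items) discovered by following links from start_url."""
    categories, items = [], []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        links = SITE.get(page, [])
        # Assumption: a page that links to items is a category page.
        if any(is_item(link) for link in links):
            categories.append(page)
        for link in links:
            if link in seen:
                continue
            seen.add(link)
            if is_item(link):
                items.append(link)   # would go to parse_item
            else:
                queue.append(link)   # would go to parse_category
    return categories, items


categories, items = crawl("http://www.example.com/")
print(categories)
print(items)
```

In this model music.php is found from the start page like the other categories, while about.php is visited but not reported because it links to no items.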

2015-11-02 01:54:57 eLRuLL

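The output of 'scrapy crawl storyder -o output.json -t json' is just the list of categories and some other links, but no items at all. IMO, it never enters 'parse_item', because according to the log, when it crawls an item's link it returns name and URL, which are the fields of StoreCategory. – yzT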
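
One possible explanation, offered as an assumption because the actual extractor patterns are not shown: the category pattern may also match item URLs. Scrapy filters duplicate requests for the same URL by default, so the request yielded first, with parse_category as its callback, would be the one that gets crawled. A quick standard-library check for such an overlap could look like this; the patterns and URLs are hypothetical.

```python
import re


def overlapping(urls, category_pattern, item_pattern):
    """Return the URLs that match both regular expressions."""
    return [
        u for u in urls
        if re.search(category_pattern, u) and re.search(item_pattern, u)
    ]


# Hypothetical URLs and patterns, for illustration only.
URLS = [
    "http://www.example.com/books.php",
    "http://www.example.com/movies.php",
    "http://www.example.com/item.php?id=1",
]

# A category pattern as broad as r"\.php" also matches the item page.
both = overlapping(URLS, r"\.php", r"/item\.php")
print(both)
```

If the returned list is not empty, tightening the category pattern (or excluding item URLs with the extractor's deny argument) would keep the two callbacks from competing for the same pages.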
