2017-10-04 133 views
I'm new to Scrapy, and I'm honestly lost on how to return multiple items from a single callback.

Basically, I have an HTML element containing a quote, with nested tags holding the quote's text, the author's name, and some tags describing the quote.

The code below returns only one quote, and that's it. It never loops to return the rest. I've been searching the web for hours and I'm getting desperate; I just don't understand it. Here is the code I have so far:

Spider.py

import scrapy
from scrapy.loader import ItemLoader
from first_spider.items import FirstSpiderItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        l = ItemLoader(item=FirstSpiderItem(), response=response)

        quotes = response.xpath("//*[@class='quote']")

        for quote in quotes:
            text = quote.xpath(".//span[@class='text']/text()").extract_first()
            author = quote.xpath(".//small[@class='author']/text()").extract_first()
            tags = quote.xpath(".//meta[@class='keywords']/@content").extract_first()

            # removes quotation marks from the text
            for c in ['“', '”']:
                if c in text:
                    text = text.replace(c, "")

            l.add_value('text', text)
            l.add_value('author', author)
            l.add_value('tags', tags)
            return l.load_item()

        next_page_path = response.xpath(".//li[@class='next']/a/@href").extract_first()
        next_page_url = response.urljoin(next_page_path)
        yield scrapy.Request(next_page_url)
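The core problem is visible in the loop above: `return` hands back a single value and ends the callback on the first iteration, while `yield` turns the callback into a generator that can emit one item per quote. A minimal, Scrapy-free sketch of the difference (the function and variable names here are illustrative, not part of the original code):

```python
def parse_with_return(quotes):
    # Mirrors the spider above: return exits on the first iteration,
    # so only one "item" ever comes back, and the pagination code
    # after the loop is never reached.
    for quote in quotes:
        return {"text": quote}

def parse_with_yield(quotes):
    # A generator keeps going: one item per quote.
    for quote in quotes:
        yield {"text": quote}

quotes = ["quote one", "quote two", "quote three"]
print(parse_with_return(quotes))       # only the first quote
print(list(parse_with_yield(quotes)))  # all three quotes
```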

Items.py

import scrapy

class FirstSpiderItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Here is the page I'm trying to scrape:

Link

Answers


Try this. It will give you all the data you want to scrape.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath("//*[@class='quote']"):
            text = quote.xpath(".//span[@class='text']/text()").extract_first()
            author = quote.xpath(".//small[@class='author']/text()").extract_first()
            tags = quote.xpath(".//meta[@class='keywords']/@content").extract_first()
            yield {"Text": text, "Author": author, "Tags": tags}

        next_page = response.xpath(".//li[@class='next']/a/@href").extract_first()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url)

I had already written the spider in that form. I was trying to build it with Items rather than by yielding dicts. Thanks for the answer, though! –


I was looking for a solution to the same problem. Here is the solution I found:

def parse(self, response):
    for selector in response.xpath("//*[@class='quote']"):
        # Pass the selector, not the whole response, so that the
        # XPaths below are evaluated relative to the current quote.
        l = ItemLoader(item=FirstSpiderItem(), selector=selector)
        l.add_xpath('text', './/span[@class="text"]/text()')
        l.add_xpath('author', './/small[@class="author"]/text()')
        l.add_xpath('tags', './/meta[@class="keywords"]/@content')
        yield l.load_item()

    next_page = response.xpath(".//li[@class='next']/a/@href").extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)

To strip the quotation marks from the text, you can use an output processor in items.py:

import scrapy
from scrapy.loader.processors import MapCompose

def replace_quotes(text):
    for c in ['“', '”']:
        if c in text:
            text = text.replace(c, "")
    return text

class FirstSpiderItem(scrapy.Item):
    # the processor belongs on the 'text' field, since that is
    # where the quotation marks appear
    text = scrapy.Field(output_processor=MapCompose(replace_quotes))
    author = scrapy.Field()
    tags = scrapy.Field()
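For intuition, MapCompose applies each function to every collected value in turn, dropping values for which a function returns None. Here is a simplified pure-Python sketch of that behavior (not Scrapy's actual implementation, which also flattens iterable results):

```python
def map_compose(*functions):
    # Simplified sketch of MapCompose semantics: apply every function
    # to each value in the list of collected values, dropping values
    # that a function maps to None.
    def processor(values):
        for f in functions:
            out = []
            for v in values:
                r = f(v)
                if r is not None:
                    out.append(r)
            values = out
        return values
    return processor

def replace_quotes(text):
    for c in ['“', '”']:
        text = text.replace(c, "")
    return text

clean = map_compose(replace_quotes, str.strip)
print(clean([' “Be yourself.” ', '“Simplicity.”']))
# ['Be yourself.', 'Simplicity.']
```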

Please let me know if this helps.