Scrapy spider returns only the last element when given a list of selectors

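I've run into a problem with a spider I put together. I'm trying to pull the individual lines of text, along with their corresponding timestamps, from the transcript on this site. I found what I believe are the right selectors, but when I run the spider, its output is only the last line and timestamp. I've seen a few other people with similar problems, but I haven't found an answer that solves mine.

Here is the spider: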

# -*- coding: utf-8 -*- 
import scrapy 
from this_american_life.items import TalTranscriptItem 

class CrawlSpider(scrapy.Spider):
    name = "transcript2"
    allowed_domains = ["https://www.thisamericanlife.org/radio-archives/episode/1/transcript"]
    start_urls = (
        'https://www.thisamericanlife.org/radio-archives/episode/1/transcript',
    )

    def parse(self, response):
        item = TalTranscriptItem()
        for line in response.xpath('//p'):
            item['begin_timestamp'] = line.xpath('//@begin').extract()
            item['line_text'] = line.xpath('//text()').extract()
        yield item

Here is the code for TalTranscriptItem() in items.py:

# -*- coding: utf-8 -*- 

# Define here the models for your scraped items 
# 
# See documentation in: 
# http://doc.scrapy.org/en/latest/topics/items.html 

import scrapy 


class TalTranscriptItem(scrapy.Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 
    episode_id = scrapy.Field() 
    episode_num_text = scrapy.Field() 
    year = scrapy.Field() 
    radio_date_text = scrapy.Field() 
    radio_date_datetime = scrapy.Field() 
    episode_title = scrapy.Field() 
    episode_hosts = scrapy.Field() 
    act_id = scrapy.Field() 
    line_id = scrapy.Field() 
    begin_timestamp = scrapy.Field() 
    speaker_class = scrapy.Field() 
    speaker_name = scrapy.Field() 
    line_text = scrapy.Field() 
    full_audio_link = scrapy.Field() 
    transcript_url = scrapy.Field()

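When I run this in scrapy shell, it seems to work correctly (it pulls the text of all the lines), but for some reason I haven't been able to get it to work in the spider.

I'm happy to clarify any of this, and would appreciate any help anyone can offer!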

Source

2017-10-19 Chris Jewell

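What type is 'TalTranscriptItem'? – Hackerman

@Hackerman I'll add the code for TalTranscriptItem to the question. It's a class in the items.py file in the Scrapy project directory. –

If I remember correctly, 'scrapy.Field()' is a plain old Python dict, not a list – Hackerman

I don't know what an Item is, but you can do this: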

item = [] 

for line in response.xpath('//p'): 
    dictItem = {'begin_timestamp':line.xpath('//@begin').extract(),'line_text':line.xpath('//text()').extract()} 
    item.append(dictItem) 

print(item)

Source

2017-10-19 20:38:21 Wandrille

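Thanks, this works in the scrapy shell, but for some reason it still only pulls the last element when run in the spider. –

If you want each line as a separate item, which I think is what you want, this should do it (note the indentation of the final yield line):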

for line in response.css('p'): 
    item = TalTranscriptItem() 
    item['begin_timestamp'] = line.xpath('./@begin').extract_first() 
    item['line_text'] = line.xpath('./text()').extract_first() 
    yield item
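
To see why the placement of 'yield' matters, here is a minimal sketch in plain Python, with no Scrapy involved. Plain dicts stand in for TalTranscriptItem, and the function names and sample data are hypothetical. A 'yield' placed after the loop runs once, by which point the single shared item has been overwritten on every pass and holds only the last line's values.

```python
def yield_after_loop(lines):
    """Mimics the question's parse(): one shared item, yield outside the loop."""
    item = {}
    for begin, text in lines:
        item['begin_timestamp'] = begin
        item['line_text'] = text
    yield item  # runs once, after the loop has overwritten both fields


def yield_inside_loop(lines):
    """Mimics the parse() above: a fresh item per line, yielded on each pass."""
    for begin, text in lines:
        item = {'begin_timestamp': begin, 'line_text': text}
        yield item


# Hypothetical sample data, not taken from the real transcript.
LINES = [
    ('00:00:01', 'First line'),
    ('00:00:05', 'Second line'),
    ('00:00:09', 'Third line'),
]

after = list(yield_after_loop(LINES))
inside = list(yield_inside_loop(LINES))

print(len(after), after[0]['line_text'])  # 1 Third line
print(len(inside))                        # 3
```

Creating the item inside the loop also matters: if a single item object were reused and yielded on each pass, every yielded reference would point at the same object.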

Source

2017-10-23 07:50:42 Wilfredo

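Thanks! That seems to make sense, but for some reason it still only returns the last item, even in scrapy shell. Any idea why that might be? Thanks again –

Can you show me how you are testing it? It works fine in my shell – Wilfredo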
