Make Scrapy follow links in order

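I wrote a script that uses Scrapy to find links in a first stage, and then follows those links and extracts content from the destination pages in a second stage. Scrapy does this, but it follows the links in an unordered manner, i.e. I expect output like the following: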

link1 | data_extracted_from_link1_destination_page 
link2 | data_extracted_from_link2_destination_page 
link3 | data_extracted_from_link3_destination_page 
. 
. 
.

but I get

link1 | data_extracted_from_link2_destination_page 
link2 | data_extracted_from_link3_destination_page 
link3 | data_extracted_from_link1_destination_page 
. 
. 
.

Here is my code:

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'

        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())

            yield {"LinkText": LinkText}
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents)

    def parse_contents(self, response):

        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        yield {"LinkContent": sContent}

What is wrong with my code?

Source

2017-05-28 Gmosy Gnaq

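The yields are not synchronized: Scrapy schedules requests asynchronously, so the item yielded in parse and the item yielded later in parse_contents are emitted independently and do not line up. You should use meta to pass the link text along with each request and yield both fields together. Documentation: https://doc.scrapy.org/en/latest/topics/request-response.html
Code: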

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'
        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())
            # Attach the link text to the request so it travels with the response.
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents, meta={"LinkText": LinkText})

    def parse_contents(self, response):
        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        # Read back the link text that parse() stored in meta.
        linkText = response.meta['LinkText']
        yield {"LinkContent": sContent, "LinkText": linkText}
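To see why carrying the link text along with each request keeps the pairs matched, here is a minimal simulation that does not use Scrapy at all. It imitates out-of-order responses with asyncio from the standard library. The names fetch and crawl, the DELAYS table, and the fake page data are all hypothetical and only stand in for real downloads; this is a sketch of the idea, not how Scrapy is implemented.

```python
import asyncio

# Hypothetical delays in seconds: link1 answers slowest,
# so responses come back in a different order than they were requested.
DELAYS = {"link1": 0.03, "link2": 0.01, "link3": 0.02}


async def fetch(link_text, meta):
    """Pretend to download a page and return its data together with the attached meta."""
    await asyncio.sleep(DELAYS[link_text])
    return {"LinkText": meta["LinkText"], "LinkContent": "data_from_" + link_text}


async def crawl(links):
    # Schedule every request up front, each one carrying its own link text.
    tasks = [asyncio.create_task(fetch(link, {"LinkText": link})) for link in links]
    items = []
    # as_completed hands back results in completion order, not request order.
    for finished in asyncio.as_completed(tasks):
        items.append(await finished)
    return items


items = asyncio.run(crawl(["link1", "link2", "link3"]))
for item in items:
    print(item["LinkText"], "|", item["LinkContent"])
```

Each printed line pairs a link with the data from its own destination page, because the text travelled with the request instead of being yielded separately. Note that the lines themselves can still appear in any order; only the pairing is guaranteed.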

Source

2017-05-29 01:12:45
