Absolute vs. relative paths with Scrapy

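I am trying to scrape a forum, ultimately to find posts that contain links. For now I am just trying to scrape the usernames of the posts. But I think the problem is that the URLs are not static.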

spider.py 

from scrapy.spiders import CrawlSpider 
from scrapy.selector import Selector 
from scrapy.item import Item, Field 


class TextPostItem(Item): 
    title = Field() 
    url = Field() 
    submitted = Field() 


class RedditCrawler(CrawlSpider): 
    name = 'post-spider' 
    allowed_domains = ['flashback.org'] 
    start_urls = ['https://www.flashback.org/t2637903'] 


    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="smallfont2"]//@href').extract()[0]
        if len(next_link):
            yield self.make_requests_from_url(next_link)
        posts = Selector(response).xpath('//div[@id="posts"]/div[@class="alignc.p4.post"]')
        for post in posts:
            i = TextPostItem()
            i['title'] = post.xpath('tbody/tr[1]/td/span/text()').extract()[0]
            #i['url'] = post.xpath('div[2]/ul/li[1]/a/@href').extract()[0]
            yield i

This gives me the following error:

raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: /t2637903p2

Any ideas?

Source

2015-10-22 Jomasdf

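The link you extract is relative (/t2637903p2), so it has no scheme or host. You need to "join" response.url with the extracted relative URL using urljoin():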

# Python 2 (as originally posted); on Python 3 use: from urllib.parse import urljoin
from urlparse import urljoin

urljoin(response.url, next_link)
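
To make the effect concrete, here is a minimal standalone sketch using the two URLs from the question: the start URL and the relative link from the error message. It uses the Python 3 import path and does not need Scrapy; the variable names are only illustrative.

```python
# Python 3 location of urljoin; Python 2 used `from urlparse import urljoin`.
from urllib.parse import urljoin

page_url = 'https://www.flashback.org/t2637903'  # start URL from the question
next_link = '/t2637903p2'                        # relative href from the error message

# Resolve the relative link against the page it was found on.
absolute = urljoin(page_url, next_link)
print(absolute)

# A link that is already absolute is returned unchanged.
unchanged = urljoin(page_url, 'https://www.flashback.org/t2637903p2')
print(unchanged)
```

Both calls should print https://www.flashback.org/t2637903p2, which has a scheme and can therefore be requested.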

Also note that there is no need to instantiate a Selector object. You can use the response.xpath() shortcut directly:

def parse(self, response): 
    next_link = response.xpath('//a[@class="smallfont2"]//@href').extract()[0] 
    # ...

Source

2015-10-23 00:31:32 alecxe

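Hi, thank you very much for your answer. I have seen similar solutions using "urljoin" before, but I don't understand how to use it in my code. Where exactly does it go? – Jomasdf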

@Jomasdf OK, use it when you make the request: 'yield self.make_requests_from_url(urljoin(response.url, next_link))'. – alecxe

Ah, I see. Thank you very much! – Jomasdf
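
As a rough sketch of where the join fits, the helper below takes the page URL and the list of extracted hrefs and returns an absolute URL for the next page. The name next_page_url is hypothetical and not part of Scrapy. It also returns None when the list is empty, since indexing an empty extract() result with [0] would raise an IndexError.

```python
from urllib.parse import urljoin  # Python 3; Python 2: from urlparse import urljoin


def next_page_url(page_url, hrefs):
    """Return the first href as an absolute URL, or None if there is no href.

    Hypothetical helper for illustration: `page_url` plays the role of
    response.url and `hrefs` the list returned by .extract().
    """
    if not hrefs:
        # No "next" link was found, e.g. on the last page of a thread.
        return None
    return urljoin(page_url, hrefs[0])


print(next_page_url('https://www.flashback.org/t2637903', ['/t2637903p2']))
print(next_page_url('https://www.flashback.org/t2637903', []))
```

Inside parse(), this would presumably be called with response.url and response.xpath('//a[@class="smallfont2"]//@href').extract(), and a non-None result passed to self.make_requests_from_url() as in the comment above.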
