2015-10-22

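Absolute vs. relative paths with Scrapy

I'm trying to scrape a forum, ultimately to collect the posts that contain links. For now I'm just trying to scrape the usernames of the posts. But I think the problem is that the URLs are not static.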

spider.py 

from scrapy.spiders import CrawlSpider 
from scrapy.selector import Selector 
from scrapy.item import Item, Field 


class TextPostItem(Item): 
    title = Field() 
    url = Field() 
    submitted = Field() 


class RedditCrawler(CrawlSpider): 
    name = 'post-spider' 
    allowed_domains = ['flashback.org'] 
    start_urls = ['https://www.flashback.org/t2637903'] 


    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="smallfont2"]//@href').extract()[0]
        if len(next_link):
            yield self.make_requests_from_url(next_link)
        posts = Selector(response).xpath('//div[@id="posts"]/div[@class="alignc.p4.post"]')
        for post in posts:
            i = TextPostItem()
            i['title'] = post.xpath('tbody/tr[1]/td/span/text()').extract()[0]
            #i['url'] = post.xpath('div[2]/ul/li[1]/a/@href').extract()[0]
            yield i

This gives me the following error:

raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: /t2637903p2 

Any ideas?

Answer

You need to join response.url with the relative URL you extracted, using urljoin():

from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin

urljoin(response.url, next_link)
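
To see what the join produces outside of a spider, here is a minimal sketch using the start URL from the question and the relative path from the error message. It assumes Python 3, where urljoin lives in urllib.parse; the variable names are only illustrative.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL from the question)
page_url = 'https://www.flashback.org/t2637903'
# Relative href taken from the error message in the question
next_link = '/t2637903p2'

# The scheme and host come from page_url, the path from next_link
absolute_url = urljoin(page_url, next_link)
print(absolute_url)  # https://www.flashback.org/t2637903p2
```

The resulting URL carries a scheme, so it no longer triggers the "Missing scheme in request url" error when used for a request.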

Also note that there is no need to instantiate a Selector object - you can use the response.xpath() shortcut directly:

def parse(self, response): 
    next_link = response.xpath('//a[@class="smallfont2"]//@href').extract()[0] 
    # ... 
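
One more thing to watch: extract()[0] raises an IndexError when the XPath matches nothing, which presumably happens on a page without a next link. Below is a possible sketch of a small helper that guards against that case and performs the join. The name absolute_next_url is hypothetical (it is not part of Scrapy), and the sketch assumes Python 3.

```python
from urllib.parse import urljoin


def absolute_next_url(page_url, hrefs):
    """Return the first href as an absolute URL, or None if there is none.

    `hrefs` is the list returned by .extract(). This is a hypothetical
    helper for illustration, not part of Scrapy.
    """
    if not hrefs:
        return None
    return urljoin(page_url, hrefs[0])


# Inside parse() it might be used like this:
#     hrefs = response.xpath('//a[@class="smallfont2"]//@href').extract()
#     next_url = absolute_next_url(response.url, hrefs)
#     if next_url:
#         yield self.make_requests_from_url(next_url)
```

Keeping the join in a plain function makes it easy to check without running a crawl.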

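Hi, thank you very much for your answer. I had seen similar solutions using "urljoin" before, but I don't understand how to use it in my code. Where exactly does it go? – Jomasdf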

@Jomasdf OK, use it when you make the request: 'yield self.make_requests_from_url(urljoin(response.url, next_link))'. – alecxe

Ah, I see. Thank you very much! – Jomasdf