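Scrapy: absolute vs. relative paths
I'm trying to crawl a forum, with the eventual goal of collecting the posts that contain links. For now I'm only trying to scrape the username of each post, but I think the problem is that the URLs aren't static.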
spider.py
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.item import Item, Field


class TextPostItem(Item):
    title = Field()
    url = Field()
    submitted = Field()


class RedditCrawler(CrawlSpider):
    name = 'post-spider'
    allowed_domains = ['flashback.org']
    start_urls = ['https://www.flashback.org/t2637903']

    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="smallfont2"]//@href').extract()[0]
        if len(next_link):
            yield self.make_requests_from_url(next_link)
        posts = Selector(response).xpath('//div[@id="posts"]/div[@class="alignc.p4.post"]')
        for post in posts:
            i = TextPostItem()
            i['title'] = post.xpath('tbody/tr[1]/td/span/text()').extract()[0]
            #i['url'] = post.xpath('div[2]/ul/li[1]/a/@href').extract()[0]
            yield i
This gives me the following error:
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /t2637903p2
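Any ideas?
Hi, thanks a lot for your answer. I've seen similar solutions that use "urljoin" before, but I don't understand how to use it in my code. Where exactly does it go? – Jomasdf
@Jomasdf OK, use it when you make the request: 'yield self.make_requests_from_url(urljoin(response.url, next_link))'. – alecxe
Ah, I see. Thank you very much! – Jomasdf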
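The urljoin suggestion from the comments can be tried outside Scrapy. This is a minimal sketch that assumes Python 3, where urljoin lives in urllib.parse. The helper name absolute_next_url is hypothetical and not part of Scrapy. The two URLs are the start URL and the relative href from the error message above.

```python
from urllib.parse import urljoin


def absolute_next_url(page_url, href):
    """Resolve a possibly relative href against the URL of the page it was found on.

    Hypothetical helper for illustration only; it just wraps urljoin.
    """
    return urljoin(page_url, href)


# The relative link from the traceback has no scheme or host of its own,
# which is why building a request from it fails.
next_link = '/t2637903p2'
page_url = 'https://www.flashback.org/t2637903'

print(absolute_next_url(page_url, next_link))
```

Inside the spider, response.url plays the role of page_url. Scrapy responses also provide response.urljoin(next_link), which performs the same resolution against the response's own URL.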