
Simple Scrapy spider crawls no pages

I'm new to Scrapy and am trying to scrape a website with a simple spider, built on another spider from this tutorial: http://scraping.pro/web-scraping-python-scrapy-blog-series/.

Why does my spider crawl 0 pages (with no errors)?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class TutsPlus(CrawlSpider):
    name = "tutsplus"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = [
        "http://code.tutsplus.com/posts?page="
    ]

    rules = [Rule(LinkExtractor(allow=['/posts?page=\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['title'] = response.xpath("//li[@class='posts__post']/a/text()").extract()
        return story

A very similar spider runs fine:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class BbcSpider(CrawlSpider):
    name = "bbcnews"
    allowed_domains = ["bbc.co.uk"]
    start_urls = [
        "http://www.bbc.co.uk/news/technology/",
    ]

    rules = [Rule(LinkExtractor(allow=['/technology-\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['headline'] = response.xpath("//title/text()").extract()
        story['intro'] = response.css('story-body__introduction::text').extract()
        return story

I think your 'allowed_domains' doesn't allow the start page. – furas


@furas, no, that's not it. I changed allowed_domains to allowed_domains = ["code.tutsplus.com"] and it still crawls 0 pages. – Macro

Answer


It looks like your regex '/posts?page=\d+' is not what you actually want, because it matches URLs like '/postspage=2' or '/postpage=2' (the unescaped ? makes the preceding s optional).

I think you want something like '/posts\?page=\d+', which escapes the ?.
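
For illustration, a quick standalone check with Python's re module (a sketch, not part of either spider) shows why the unescaped pattern never matches the real pagination URLs:

import re

# The unescaped "?" makes the preceding "s" optional instead of matching
# a literal question mark, so the real pagination URLs are never matched.
print(re.search(r'/posts?page=\d+', '/posts?page=2'))    # None
print(re.search(r'/posts?page=\d+', '/postspage=2'))     # match
# Escaping the "?" matches the actual URL format.
print(re.search(r'/posts\?page=\d+', '/posts?page=2'))   # match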


It almost works as expected (unfortunately only "almost"). The spider crawls only 4 pages: http://code.tutsplus.com/posts?page=2, http://code.tutsplus.com/posts?page=3, http://code.tutsplus.com/posts?page=465, and http://code.tutsplus.com/posts?page=466. Any idea why only these pages? – Macro


Because those are the only available URLs matching that regex? Check the site. – eLRuLL


All pages from 1 to 466 are available (http://code.tutsplus.com/posts?page=1, http://code.tutsplus.com/posts?page=2, and so on). – Macro
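
One possible explanation (an assumption, not confirmed in the thread): the Rule defines a callback, and a CrawlSpider Rule with a callback defaults to follow=False, so only links found on the start page itself are followed. A pagination bar typically links only to the first few and last few pages, which would explain pages 2, 3, 465 and 466. A minimal sketch of the rule with following enabled:

# follow=True lets the spider keep extracting pagination links from the
# pages it visits, instead of stopping at the links on the start page.
rules = [Rule(LinkExtractor(allow=[r'/posts\?page=\d+']),
              callback='parse_story', follow=True)]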