2014-07-09

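I wrote a spider in Scrapy that basically works and does what it is supposed to do. The problem shows up in the log when I run scrapy crawl, and it seems to be related to the URL regular expression used for crawling: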

# -*- coding: utf-8 -*- 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 
from ecommerce.items import ArticleItem 


class WikiSpider(CrawlSpider): 
    name = 'wiki' 
    start_urls = (
        'http://www.wiki.tn/index.php',
    )
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'\w+\/\d{1,4}\/\d{1,4}\/\d{1,4}\X+']),
             follow=True, callback='parse_Article_wiki'),
    ]

    def parse_Article_wiki(self, response):
        hxs = HtmlXPathSelector(response)
        item = ArticleItem()

        print '*******************>> ' + response.url

But it does not work. When I run the spider, it shows me:

2014-07-09 15:03:13+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, 
     OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2014-07-09 15:03:13+0100 [scrapy] INFO: Enabled item pipelines: 
2014-07-09 15:03:13+0100 [wiki] INFO: Spider opened 
2014-07-09 15:03:13+0100 [wiki] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0  items/min) 
2014-07-09 15:03:13+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2014-07-09 15:03:13+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080 
2014-07-09 15:03:13+0100 [wiki] DEBUG: Crawled (200) <GET http://www.wiki.tn/index.php> (referer:  None) 
2014-07-09 15:03:13+0100 [wiki] INFO: Closing spider (finished) 
2014-07-09 15:03:13+0100 [wiki] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 219, 
    'downloader/request_count': 1, 
    'downloader/request_method_count/GET': 1, 
    'downloader/response_bytes': 13062, 
    'downloader/response_count': 1, 
    'downloader/response_status_count/200': 1, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2014, 7, 9, 14, 3, 13, 416073), 
    'log_count/DEBUG': 3, 
    'log_count/INFO': 7, 
    'response_received_count': 1, 
    'scheduler/dequeued': 1, 
    'scheduler/dequeued/memory': 1, 
     'scheduler/enqueued': 1, 
     'scheduler/enqueued/memory': 1, 
    'start_time': datetime.datetime(2014, 7, 9, 14, 3, 13, 210430)} 
2014-07-09 15:03:13+0100 [wiki] INFO: Spider closed (finished) 

What exactly is the '\X+' in your 'allow' pattern supposed to do? I don't see it supported in https://docs.python.org/2/library/re.html. Without it, you should be fine.
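To make the comment above concrete, here is a minimal sketch for checking the pattern outside of Scrapy. It is written for Python 3, unlike the Python 2 spider in the question. The sample URL and the trimmed pattern are assumptions for illustration only, since the real link format on the site is not shown in the question. As far as I know, Python 2's re treats an unknown escape such as \X as a literal "X", so the original pattern would only match links containing that letter after the last number, while Python 3.6 and later reject the escape when compiling.

```python
import re

# Pattern from the question, including the unsupported \X escape.
ORIGINAL = r'\w+\/\d{1,4}\/\d{1,4}\/\d{1,4}\X+'

# Assumed fix: drop \X+ (and the unneeded backslashes before "/").
TRIMMED = r'\w+/\d{1,4}/\d{1,4}/\d{1,4}'

# Hypothetical article URL; the real URL layout is not given in the question.
SAMPLE_URL = 'http://www.wiki.tn/article/2014/07/09/some-title'


def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


original_compiles = compiles(ORIGINAL)
trimmed_match = re.search(TRIMMED, SAMPLE_URL)

if __name__ == '__main__':
    print('original pattern compiles:', original_compiles)
    print('trimmed pattern matched:', trimmed_match.group(0) if trimmed_match else None)
```

Running the candidate pattern against a few real links copied from the start page is a quick way to confirm that the link extractor has something to match before starting a crawl.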

Answers


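I'm not sure what your question is. My guess is that you want to crawl the site and the spider is not doing so.

If that is the problem, the cause may be the regular expression you use in the rule definition. What kind of links are you trying to follow?

On a separate note, I also suggest you set the allowed_domains attribute, which in your case should be: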

allowed_domains = ['wiki.tn']
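For readers unsure what that setting restricts, here is a simplified Python 3 sketch of the idea: a URL counts as on-site when its host is one of the allowed domains or a subdomain of one. In Scrapy this filtering is done by the OffsiteMiddleware that appears in the log above. The helper is_onsite is hypothetical and only approximates that behaviour; it is not Scrapy's implementation.

```python
from urllib.parse import urlparse

# Value suggested in the answer above.
ALLOWED_DOMAINS = ['wiki.tn']


def is_onsite(url, allowed_domains=ALLOWED_DOMAINS):
    """Hypothetical helper: True if the URL's host is an allowed domain
    or a subdomain of one. A rough approximation, not Scrapy's own code."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == domain or host.endswith('.' + domain)
               for domain in allowed_domains)


if __name__ == '__main__':
    for candidate in ('http://www.wiki.tn/index.php',
                      'http://example.com/',
                      'http://notwiki.tn/'):
        print(candidate, is_onsite(candidate))
```

Note that the entry is the bare domain, 'wiki.tn', without a scheme or path, so that www.wiki.tn is covered as a subdomain.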