
I have a problem: my CrawlSpider is not crawling the entire website. I am trying to crawl a news site; it collects roughly 5,900 items and then quits with the reason "finished", but there are large date gaps in the scraped items. I am not using any custom middleware or settings. Why isn't Scrapy scraping the whole site? Thanks for any help!

My spider (forgive the messy cleanup code at the bottom), followed by the last few lines of the log file:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from news.items import NewsItem
import re

class CrawlSpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/portal//']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'news/pages/.*|[Gg]et[Pp]age/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        p = re.compile(r"(%\d.+)|(var LEO).*|(createInline).*|(<.*?>|\r+|\n+|\s{2,}|\t|[\'])|(\xa0+|\xe2+|\x80+|\\x9.+)")
        hxs = HtmlXPathSelector(response)
        i = NewsItem()
        i['headline'] = hxs.select('//p[@class = "detailedArticleTitle"]/text()').extract()[0].strip().encode("utf-8")
        i['date'] = hxs.select('//div[@id = "DateTime"]/text()').re('\d+/\d+/[12][09]\d\d')[0].encode("utf-8")
        text = [graf.strip().encode("utf-8") for graf in hxs.select('//div[@id = "article"]//div[@style = "LINE-HEIGHT: 100%"]|//div[@id = "article"]//p//text()').extract()]
        text2 = ' '.join(text)
        text3 = re.sub("'", ' ', p.sub(' ', text2))
        i['text'] = re.sub('"', ' ', text3)
        return i

Log output:

2012-04-19 11:13:57-0700 [crawl] INFO: Closing spider (finished) 
2012-04-19 11:13:57-0700 [crawl] INFO: Stored csv feed (5949 items) in: news.csv 
2012-04-19 11:13:57-0700 [crawl] INFO: Dumping spider stats: 
{'downloader/exception_count': 2, 
'downloader/exception_type_count/twisted.internet.error.ConnectionLost': 2, 
'downloader/request_bytes': 5778930, 
'downloader/request_count': 12380, 
'downloader/request_method_count/GET': 12380, 
'downloader/response_bytes': 635795595, 
'downloader/response_count': 12378, 
'downloader/response_status_count/200': 6081, 
'downloader/response_status_count/302': 6062, 
'downloader/response_status_count/400': 234, 
'downloader/response_status_count/404': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2012, 4, 19, 18, 13, 57, 343594), 
'item_scraped_count': 5949, 
'request_depth_max': 23, 
'scheduler/disk_enqueued': 12380, 
'spider_exceptions/IndexError': 131, 
'start_time': datetime.datetime(2012, 4, 19, 17, 16, 40, 75935)} 
2012-04-19 11:13:57-0700 [crawl] INFO: Spider closed (finished) 
2012-04-19 11:13:57-0700 [scrapy] INFO: Dumping global stats: 
{} 

Answer


The parse_item() method should return the loaded item; see the scrapy docs. Something along these lines (using an item loader, since a plain Item has no add_xpath()):

from scrapy.contrib.loader import XPathItemLoader

class MySpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/portal/']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'news/pages/.*|[Gg]et[Pp]age/.*'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # build the item through an item loader instead of filling the fields by hand
        l = XPathItemLoader(item=NewsItem(), selector=hxs)
        l.add_xpath('headline', '//p[@class = "detailedArticleTitle"]/text()')
        l.add_xpath('date', '//div[@id = "DateTime"]/text()',
                    re='\d+/\d+/[12][09]\d\d')
        # Do something...
        return l.load_item()

Post-processing (like strip() and encode("utf-8")) can be done in a pipeline.
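A minimal sketch of such a pipeline, assuming the three fields the spider fills in (the class name NewsCleanupPipeline and the exact cleanup steps are illustrative, not taken from the original post):

class NewsCleanupPipeline(object):
    """Strip whitespace and encode the extracted fields after the spider returns an item."""

    def process_item(self, item, spider):
        for field in ('headline', 'date', 'text'):
            value = item.get(field)
            if not value:
                continue
            # item loaders hand back lists of strings; join them before cleaning
            if isinstance(value, list):
                value = ' '.join(value)
            item[field] = value.strip().encode("utf-8")
        return item

The pipeline then has to be enabled in settings.py, e.g. with something like ITEM_PIPELINES = ['news.pipelines.NewsCleanupPipeline'] for the Scrapy version used here.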

Update: there are a few inaccuracies in your code:

  • Your custom spider's class name must differ from the class it inherits from (CrawlSpider); rename it (for example, MySpider).
  • start_urls is not quite right: 'http://www.domain.com/portal//' ends in two slashes.
  • It is better style to hand the selector to the item loader when you construct it (l = XPathItemLoader(item=NewsItem(), selector=hxs)), as in the snippet above; a sketch of the NewsItem definition this assumes is given below.
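For reference, the loader code above assumes a NewsItem with the three fields the spider populates. The question does not show news/items.py, so this is only a guess at what it roughly looks like:

from scrapy.item import Item, Field

class NewsItem(Item):
    # fields referenced by the spider and the item loader
    headline = Field()
    date = Field()
    text = Field()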