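Recursively scraping Craigslist with Scrapy

I have been trying to sharpen my Python skills by building scrapers, and I recently switched from bs4 to Scrapy so that I can use its multithreading and download-delay features. I have been able to build a basic scraper and output the data to CSV, but I run into problems when I try to add recursion. I tried following the advice in "Scrapy Recursive download of Content", but I keep getting the following error: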
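DEBUG: Retrying http://medford.craigslist.org%20%5Bu'/cto/4359874426.html'%5D> DNS lookup failed: address not found
This makes me think that the way I am joining the link is not working, since it inserts extra characters into the URL, but I can't figure out how to fix it. Any suggestions?
Here is the code: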
#-------------------------------------------------------------------------------
# Name: module1
# Purpose:
#
# Author: CD
#
# Created: 02/03/2014
# Copyright: (c) CD 2014
# Licence: <your licence>
#-------------------------------------------------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *

class PageSpider(BaseSpider):
    name = "cto"
    start_urls = ["http://medford.craigslist.org/cto/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                               restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        for titles in titles:
            item = CraigslistSampleItem()
            item['title'] = titles.select("a/text()").extract()
            item['link'] = titles.select("a/@href").extract()
            url = "http://medford.craiglist.org %s" % item['link']
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item

`url = "http://medford.craiglist.org %s" % item['link']` can't possibly be right, but I don't know what you actually want. You may want to read up on the standard library [urlparse](http://docs.python.org/2/library/urlparse.html) module. – zwol
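A minimal sketch of what the `urlparse` suggestion could look like, assuming Python 3 (where the module is named `urllib.parse`) and assuming the extracted link is a list of strings, as `extract()` returns. `BASE_URL` and `build_absolute_url` are hypothetical names used only for illustration; they are not part of the question's code.

```python
from urllib.parse import urljoin

# Assumption: the listing page the relative links were scraped from.
BASE_URL = "http://medford.craigslist.org/cto/"


def build_absolute_url(base_url, extracted_links):
    """Join the first extracted href onto the base URL.

    `extracted_links` is expected to be a list of strings.
    Returns None when nothing was extracted.
    """
    if not extracted_links:
        return None
    # Take the first element so that a string, not the list itself,
    # ends up in the URL.
    return urljoin(base_url, extracted_links[0])


print(build_absolute_url(BASE_URL, ["/cto/4359874426.html"]))
```

Because the href starts with `/`, `urljoin` replaces the path of the base URL rather than appending to it, so no manual string concatenation is needed.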
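I pulled that line from the link I posted. What I am trying to do is take each link the spider finds and append it to the Craigslist URL to get the address of each page. – ISuckAtLife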
Insert `import sys` at the top of the file and `sys.stderr.write(repr(item['link']) + "\n")` immediately before the `url =` line. What does it print? – zwol
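The debugging step above can be tried outside Scrapy. The following is a rough sketch, assuming Python 3 and assuming `item['link']` holds a one-element list like the one visible in the error message; `link` is a hypothetical stand-in for that value. The percent-escaping that produces `%20`, `%5B` and `%5D` is approximated here with `urllib.parse.quote`, which is not necessarily the routine Scrapy itself uses.

```python
from urllib.parse import quote

# Hypothetical stand-in for the value of item['link']:
# a list of strings rather than a single string.
link = ["/cto/4359874426.html"]

# Same formatting as in the question, including the space before %s.
url = "http://medford.craiglist.org %s" % link

# Approximate the escaping applied to the request URL: the space and
# the list brackets become %20, %5B and %5D.
escaped = quote(url, safe=":/'")

print(repr(link))
print(url)
print(escaped)
```

Formatting a list with `%s` inserts the list's printed form, brackets and quotes included, which would account for the extra characters in the failing URL. Under Python 2 the printed form would also carry the `u` prefix seen in the log.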