I don't know what's wrong with this spider, but it won't crawl any pages. Scrapy simply isn't crawling:
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from paper_crawler.items import PaperCrawlerItem

class PlosGeneticsSpider(CrawlSpider):
    name = 'plosgenetics'
    allowed_domains = ['plosgenetics.org']
    start_urls = ['http://www.plosgenetics.org/article/browse/volume']
    rules = [
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//ul[@id="journal_slides"]')), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        self.log(response.url)
        print response.url
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[@class="item cf"]')
        items = []
        for title in titles:
            item = PaperCrawlerItem()
            item['title'] = "".join(title.xpath('.//div[@class="header"]//h3//a[contains(@href,"article")]/text()').extract()).strip()
            item['URL'] = title.xpath('.//div[@class="header"]//h3//a[contains(@href,"article")]/@href').extract()
            item['authors'] = "".join(title.xpath('.//div[@class="header"]//div[@class="authors"]/text()').extract()).replace('\n', "")
            items.append(item)
        return items
The syntax is correct, but the log keeps saying INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Any idea what I messed up?
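A common cause of "Crawled 0 pages" with a CrawlSpider is a restrict_xpaths expression that matches nothing on the start page, so no links are ever extracted. One way to sanity-check the XPath offline is to run it against a saved copy of the page. A minimal sketch with the standard library, using a hypothetical HTML snippet standing in for the live page (the real markup may differ):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the start page; fetch and save the real page
# to test against actual markup.
snippet = """<html><body>
<ul id='journal_slides'>
  <li><a href='/article/browse/volume?id=1'>Volume 1</a></li>
</ul>
</body></html>"""

root = ET.fromstring(snippet)

# Same region the spider's Rule restricts link extraction to.
region = root.findall(".//ul[@id='journal_slides']")
print(len(region))  # 0 here would explain why no links are followed

# Links the extractor could find inside that region.
links = [a.get('href') for a in region[0].iter('a')] if region else []
print(links)
```

If the region comes back empty against the real page source, the XPath (or the fact that the content is injected by JavaScript, which Scrapy does not execute) is the likely culprit.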
This doesn't look like the same problem (and I no longer have that problem either). 'rules' just needs to be iterable. If it is a one-element tuple without a comma, it won't be iterable (as in that question), but if it is a list, it will be. This is easy to check with 'object.__iter__': '["test"].__iter__' returns a method-wrapper, but '("test").__iter__' does not. – ohblahitsme
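The pitfall the comment describes can be reproduced with a stand-in Rule class (a sketch for illustration; this is not scrapy's Rule, but the iterability behaviour is the same):

```python
# A one-element "tuple" without the trailing comma is not a tuple at all:
# the parentheses are just grouping, so the value is the bare object.
class Rule:            # hypothetical stand-in for scrapy's Rule
    pass

rules_broken = (Rule())    # just a Rule instance, not a tuple
rules_fixed = (Rule(),)    # trailing comma makes a real one-element tuple
rules_list = [Rule()]      # a list is always iterable, comma or not

print(isinstance(rules_broken, tuple))    # False
print(isinstance(rules_fixed, tuple))     # True
print(hasattr(rules_broken, '__iter__'))  # False: can't be looped over
print(hasattr(rules_list, '__iter__'))    # True
```

Using a list for 'rules', as the spider in the question already does, sidesteps the missing-comma trap entirely.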