
Collect text from links "clicked" by Scrapy?

I want to collect the text of the links that Scrapy "clicks" as it crawls a website.

Consider the following example:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DnsDbSpider(CrawlSpider):
    name = 'dns_db'
    allowed_domains = ['www.iana.org']
    start_urls = ['http://www.iana.org/']

    # Each rule restricts link extraction to one specific element,
    # so the crawl follows exactly one chain of links.
    rules = (
        Rule(LinkExtractor(
            allow_domains='www.iana.org',
            restrict_css='#home-panel-domains > h2'),
            callback='parse_item',
            follow=True),
        Rule(LinkExtractor(
            allow_domains='www.iana.org',
            restrict_css='#main_right > p:nth-child(3)'),
            callback='parse_item',
            follow=True),
        Rule(LinkExtractor(
            allow_domains='www.iana.org',
            restrict_css='#main_right > ul:nth-child(4) > li'),
            callback='parse_item',
            follow=True),
    )

    def parse_item(self, response):
        self.logger.info('## Parsing URL: %s', response.url)
        i = {}  # placeholder item; this is where the path should end up
        return i

Scrapy log:

$ scrapy crawl dns_db 2>&1 | grep 'Parsing URL' 
2017-01-17 22:14:01 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains 
2017-01-17 22:14:02 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains/root 
2017-01-17 22:14:02 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains/root/db 

In this case Scrapy did the following:

  1. Opened "www.iana.org".
    path = []
  2. Clicked the "Domain Names" URL.
    path = ['Domain Names']
  3. On the "Domain Names" page, clicked the "The DNS Root Zone" URL.
    path = ['Domain Names', 'The DNS Root Zone']
  4. On the "The DNS Root Zone" page, clicked the "Root Zone Database" URL.
    path = ['Domain Names', 'The DNS Root Zone', 'Root Zone Database']
  5. On the "Root Zone Database" page I would start scraping data and creating items. The final item should also carry the path attribute:
    path = ['Domain Names', 'The DNS Root Zone', 'Root Zone Database']

A human could navigate the site just by following this path/list.

How can I achieve this?
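One building block that seems relevant here: a Scrapy request can carry arbitrary state in its meta dict, so the path can ride along with each request. Below is a minimal sketch of just that mechanism, assuming a hypothetical spider name and a deliberately broad link selector (neither is from the question):

# -*- coding: utf-8 -*-
import scrapy


class PathSketchSpider(scrapy.Spider):
    # Hypothetical spider, illustrating only the meta mechanism.
    name = 'path_sketch'
    allowed_domains = ['www.iana.org']
    start_urls = ['http://www.iana.org/']

    def parse(self, response):
        # The path accumulated so far rides along in request meta.
        path = response.meta.get('path', [])
        self.logger.info('## Path so far: %s', path)
        for a in response.css('a'):  # deliberately broad; restrict in practice
            text = a.css('::text').extract_first()
            href = a.css('::attr(href)').extract_first()
            if href:
                yield scrapy.Request(
                    response.urljoin(href),
                    callback=self.parse,
                    meta={'path': path + [text]})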

EDIT

Here is a working example:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class DnsDbSpider(scrapy.Spider):
    name = "dns_db"
    allowed_domains = ["www.iana.org"]
    start_urls = ['http://www.iana.org/']

    def parse(self, response):
        if 'req_path' not in response.meta:
            response.meta['req_path'] = []
        self.logger.warn('## Request path: %s', response.meta['req_path'])
        restrict_css = (
            '#home-panel-domains > h2',
            '#main_right > p:nth-child(3)',
            '#main_right > ul:nth-child(4) > li',
        )
        links = [link for css in restrict_css for link in self.links(response, css)]
        for link in links:
            #self.logger.info('## Link: %s', link)
            request = scrapy.Request(url=link.url, callback=self.parse)
            # Copy the path gathered so far, then append this link's text and URL.
            request.meta['req_path'] = response.meta['req_path'].copy()
            request.meta['req_path'].append(dict(text=link.text, url=link.url))
            yield request

    def links(self, response, restrict_css=None):
        lex = LinkExtractor(
            allow_domains=self.allowed_domains,
            restrict_css=restrict_css)
        return lex.extract_links(response)

Command-line output:

$ scrapy crawl -L WARN dns_db 
2017-02-12 00:13:50 [dns_db] WARNING: ## Request path: [] 
2017-02-12 00:13:51 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}] 
2017-02-12 00:13:51 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}, {'text': 'The DNS Root Zone', 'url': 'http://www.iana.org/domains/root'}] 
2017-02-12 00:13:52 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}, {'text': 'The DNS Root Zone', 'url': 'http://www.iana.org/domains/root'}, {'text': 'Root Zone Database', 'url': 'http://www.iana.org/domains/root/db/'}] 
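The example above only logs the path; to actually produce an item, the final callback could detect the target page and yield the accumulated path. A hedged sketch building on the example (the spider name and the endswith() test are assumptions specific to this demo site, not a general rule):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class DnsDbItemSpider(scrapy.Spider):
    name = 'dns_db_items'  # hypothetical name
    allowed_domains = ['www.iana.org']
    start_urls = ['http://www.iana.org/']

    restrict_css = (
        '#home-panel-domains > h2',
        '#main_right > p:nth-child(3)',
        '#main_right > ul:nth-child(4) > li',
    )

    def parse(self, response):
        req_path = response.meta.get('req_path', [])
        if response.url.rstrip('/').endswith('/domains/root/db'):
            # Target page reached: emit one item carrying the whole path.
            yield {'path': req_path, 'url': response.url}
            return
        lex = LinkExtractor(allow_domains=self.allowed_domains,
                            restrict_css=self.restrict_css)
        for link in lex.extract_links(response):
            yield scrapy.Request(
                link.url,
                callback=self.parse,
                meta={'req_path': req_path + [dict(text=link.text, url=link.url)]})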

Why not just start directly from the "Root Zone Database" URL, i.e. http://www.iana.org/domains/root/db? – Granitosaurus


I also need to scrape some data from the intermediate URLs. Note that in this example I deliberately use ':nth-child(3)' and ':nth-child(4)' to restrict where subsequent URLs are searched for; otherwise the spider would crawl the entire site. *iana.org* is just an example site, my real target is different. –

Answer


You can carry the link text along in your requests and keep merging it until you reach the page you want:

from scrapy import Spider, Request
from scrapy.linkextractors import LinkExtractor


class MySpider(Spider):
    name = 'iana'
    start_urls = ['http://iana.org']
    link_extractors = [LinkExtractor()]

    def parse(self, response):
        path = response.meta.get('path', [])  # retrieve the path we have so far, or set the default
        # flatten the links found by every extractor into a single list
        links = [link for lex in self.link_extractors
                 for link in lex.extract_links(response)]
        for link in links:
            url = link.url
            current_path = [link.text]
            yield Request(url, self.parse,
                          meta={'path': path + current_path})
        # now when we reach the last page that we want,
        # we yield an item with all gathered path parts
        last_page = not links  # some condition to determine it's the last page, e.g. no links found
        if last_page:
            item = dict()
            item['path'] = ' > '.join(path)
            # e.g. 'Domain Names > The DNS Root Zone > Root Zone Database'
            yield item

This spider keeps following the links it extracts, saving each link's text to meta['path'] as it goes; when some condition identifying the last page is met, it yields an item containing all of the path parts gathered so far.
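A quick way to inspect the resulting items (assuming the spider sits inside a Scrapy project) is the built-in feed export, for example:

$ scrapy crawl iana -o items.json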


That's not what I'm asking. I need the crawler to build this path, i.e. to collect the text of the URLs it traverses. Initially the list is empty. –


@NarūnasK What is this path? A field in the item? You can simply carry over the current URL or its name by adding it to the request's meta parameter and retrieving it later, similar to my example. – Granitosaurus


I've just edited the question; I hope it makes more sense now. Generally speaking, yes, *path* could be a 'scrapy.Field()'. I created this question because I don't understand the lifetime of an item. If an item is created while the crawler is at "Domain Names", does that item still exist when the crawler reaches "Root Zone Database"? How can I access that item? –
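For what it's worth: a Scrapy item is a plain object; once yielded from a callback it is handed to the item pipeline and the spider never sees it again. So rather than creating the item at "Domain Names" and trying to access it later, the usual pattern is to carry the partial data in request meta, as above, and build the item only in the final callback. A sketch with a hypothetical item class (not from this thread):

import scrapy


class PageItem(scrapy.Item):
    # Hypothetical item declaration.
    path = scrapy.Field()  # the accumulated list of link texts/URLs from meta
    url = scrapy.Field()


# In the final callback, once the target page is reached:
#     yield PageItem(path=response.meta['req_path'], url=response.url)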