Collecting text from links "clicked" by Scrapy?
I want to use Scrapy to collect text by "clicking" through links on a website. Consider the following example:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DnsDbSpider(CrawlSpider):
    name = 'dns_db'
    allowed_domains = ['www.iana.org']
    start_urls = ['http://www.iana.org/']

    rules = (
        Rule(LinkExtractor(
                allow_domains='www.iana.org',
                restrict_css=r'#home-panel-domains > h2'),
            callback='parse_item',
            follow=True),
        Rule(LinkExtractor(
                allow_domains='www.iana.org',
                restrict_css=r'#main_right > p:nth-child(3)'),
            callback='parse_item',
            follow=True),
        Rule(LinkExtractor(
                allow_domains='www.iana.org',
                restrict_css=r'#main_right > ul:nth-child(4) > li'),
            callback='parse_item',
            follow=True),
    )

    def parse_item(self, response):
        self.logger.info('## Parsing URL: %s', response.url)
        i = {}
        return i
Scrapy log:
$ scrapy crawl dns_db 2>&1 | grep 'Parsing URL'
2017-01-17 22:14:01 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains
2017-01-17 22:14:02 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains/root
2017-01-17 22:14:02 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains/root/db
In this case, Scrapy did the following:
- Opened "www.iana.org"
  path = []
- Clicked the "Domain Names" URL.
  path = ['Domain Names']
- On the "Domain Names" page, clicked the "The DNS Root Zone" URL.
  path = ['Domain Names', 'The DNS Root Zone']
- On the "The DNS Root Zone" page, clicked the "Root Zone Database" URL.
  path = ['Domain Names', 'The DNS Root Zone', 'Root Zone Database']
- On the "Root Zone Database" page I would start scraping data and creating items, and each resulting item would also have the path attribute:
  path = ['Domain Names', 'The DNS Root Zone', 'Root Zone Database']
A person could navigate the site just by following this path/list.
How can I achieve this?
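In other words, every item scraped from the final page would carry the click path that led to it; something like this (a hypothetical item, shown only to illustrate the desired 'path' attribute):

# Hypothetical item: 'path' records the links "clicked" to reach the page
{
    'path': ['Domain Names', 'The DNS Root Zone', 'Root Zone Database'],
    # ... plus whatever data is scraped from the page itself
}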
EDIT

Here is a working example:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class DnsDbSpider(scrapy.Spider):
    name = "dns_db"
    allowed_domains = ["www.iana.org"]
    start_urls = ['http://www.iana.org/']

    def parse(self, response):
        if 'req_path' not in response.meta:
            response.meta['req_path'] = []
        self.logger.warn('## Request path: %s', response.meta['req_path'])
        restrict_css = (
            r'#home-panel-domains > h2',
            r'#main_right > p:nth-child(3)',
            r'#main_right > ul:nth-child(4) > li',
        )
        links = [link for css in restrict_css
                      for link in self.links(response, css)]
        for link in links:
            #self.logger.info('## Link: %s', link)
            request = scrapy.Request(
                url=link.url,
                callback=self.parse)
            request.meta['req_path'] = response.meta['req_path'].copy()
            request.meta['req_path'].append(dict(text=link.text, url=link.url))
            yield request

    def links(self, response, restrict_css=None):
        lex = LinkExtractor(
            allow_domains=self.allowed_domains,
            restrict_css=restrict_css)
        return lex.extract_links(response)
Command line output:
$ scrapy crawl -L WARN dns_db
2017-02-12 00:13:50 [dns_db] WARNING: ## Request path: []
2017-02-12 00:13:51 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}]
2017-02-12 00:13:51 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}, {'text': 'The DNS Root Zone', 'url': 'http://www.iana.org/domains/root'}]
2017-02-12 00:13:52 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}, {'text': 'The DNS Root Zone', 'url': 'http://www.iana.org/domains/root'}, {'text': 'Root Zone Database', 'url': 'http://www.iana.org/domains/root/db/'}]
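Note that response.meta['req_path'].copy() gives every outgoing request its own list, so sibling requests cannot mutate each other's path. To actually produce items, parse() could yield one whenever a page has no further matching links. A minimal sketch along those lines; the item fields 'url' and 'req_path' are assumptions, not part of the original spider:

# -*- coding: utf-8 -*-
# Sketch: same crawl as above, but yielding an item on "leaf" pages.
import scrapy
from scrapy.linkextractors import LinkExtractor


class DnsDbItemsSpider(scrapy.Spider):
    name = "dns_db_items"
    allowed_domains = ["www.iana.org"]
    start_urls = ['http://www.iana.org/']

    restrict_css = (
        '#home-panel-domains > h2',
        '#main_right > p:nth-child(3)',
        '#main_right > ul:nth-child(4) > li',
    )

    def parse(self, response):
        req_path = response.meta.get('req_path', [])
        lex = LinkExtractor(allow_domains=self.allowed_domains,
                            restrict_css=self.restrict_css)
        links = lex.extract_links(response)
        if not links:
            # Leaf page: no further links matched, emit the collected path
            yield {'url': response.url, 'req_path': req_path}
        for link in links:
            request = scrapy.Request(url=link.url, callback=self.parse)
            # Concatenation builds a fresh list, so siblings stay independent
            request.meta['req_path'] = req_path + [dict(text=link.text,
                                                        url=link.url)]
            yield request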
Why not just start from the "Root Zone Database" URL, i.e. http://www.iana.org/domains/root/db? – Granitosaurus
I also need to scrape some data from the intermediate URLs. Note that in this example I restrict where follow-up URLs are searched for, using the ':nth-child(3)' and ':nth-child(4)' selectors, otherwise the spider would crawl the whole site. *iana.org* is just an example site; my real target is different. –
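To scrape data from the intermediate pages as well, parse() could yield an item for every visited page before following the restricted links. A sketch under the same assumptions; the 'title' field and the 'h1::text' selector are illustrative, not taken from the original spider:

# -*- coding: utf-8 -*-
# Sketch: emit an item for every visited page, intermediate or final.
import scrapy
from scrapy.linkextractors import LinkExtractor


class IntermediateDataSpider(scrapy.Spider):
    name = "intermediate_data"
    allowed_domains = ["www.iana.org"]
    start_urls = ['http://www.iana.org/']

    def parse(self, response):
        req_path = response.meta.get('req_path', [])
        # Scrape this page, whether it is intermediate or final
        yield {
            'title': response.css('h1::text').extract_first(),  # assumed field
            'url': response.url,
            'req_path': req_path,
        }
        # The restricted selectors keep the crawl on the chosen trail
        # instead of following every link on the site
        lex = LinkExtractor(
            allow_domains=self.allowed_domains,
            restrict_css=('#home-panel-domains > h2',
                          '#main_right > p:nth-child(3)',
                          '#main_right > ul:nth-child(4) > li'))
        for link in lex.extract_links(response):
            request = scrapy.Request(url=link.url, callback=self.parse)
            request.meta['req_path'] = req_path + [dict(text=link.text,
                                                        url=link.url)]
            yield request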