Scrapy broad crawl - allow only internal links when broad crawling, too many domains for allowed_domains

I need to scrape the first 10-20 internal links of each site during a broad crawl, so that I don't hammer the web servers, but there are too many domains to list in allowed_domains. I'm asking here because the Scrapy documentation doesn't cover this and I couldn't find an answer through Google.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field
from urllib.parse import urlparse

class DomainLinks(Item):
    links = Field()

class ScapyProject(CrawlSpider):
    name = 'scapyproject'
    #allowed_domains = []
    start_urls = ['big domains list loaded from database']
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_links', follow=True),)

    def parse_start_url(self, response):
        # Return the item, otherwise results from start URLs are discarded
        return self.parse_links(response)

    def parse_links(self, response):
        item = DomainLinks()
        item['links'] = []
        # urlparse extracts the host cleanly; str.strip() takes a single
        # argument and removes character sets from the ends, not prefixes
        domain = urlparse(response.url).netloc
        for prefix in ("www.", "ww2."):
            if domain.startswith(prefix):
                domain = domain[len(prefix):]
        links = LxmlLinkExtractor(allow=(), deny=()).extract_links(response)
        links = [link for link in links if domain in link.url]
        # Filter duplicates and append to the item
        for link in links:
            if link.url not in item['links']:
                item['links'].append(link.url)
        return item
Is the list comprehension below the best way to filter links, without using the allowed_domains list or the LxmlLinkExtractor allow filter? Both of those appear to use regular expressions, which would hurt performance and limit the size of the allowed-domains list if every scraped link has to be regex-matched against every domain in the list.
links = [link for link in links if domain in link.url]
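The direction I am leaning towards instead is something like the sketch below (the domain set and helper name are made up for illustration): parse each link's host with urlparse and test it against a set of domains, which costs one O(1) lookup per link instead of a regex match per allowed domain. It would also avoid the false positives of a bare substring test, where a domain can match inside a longer host or inside the URL path:

from urllib.parse import urlparse

# Hypothetical domain set; in practice it would be loaded from the database
allowed = {"example.com", "example.org"}

def is_internal(url, domain_set):
    # Reduce the URL to its bare host so the comparison is exact
    host = urlparse(url).netloc.lower()
    for prefix in ("www.", "ww2."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    return host in domain_set  # O(1) set membership, no regex

urls = [
    "http://www.example.com/about",  # kept: host is in the set
    "http://badexample.com/page",    # dropped: a substring test for "example.com" would wrongly keep it
]
internal = [u for u in urls if is_internal(u, allowed)]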
The other problem I'm struggling with: how do I make the spider follow only internal links, without using an allowed_domains list at all? A custom middleware?
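In case it helps, this is the rough shape of the middleware I have in mind (my own untested sketch; the class name is made up): a spider middleware whose process_spider_output drops any request whose host differs from the host of the response that produced it, so each site is only ever crawled internally and no allowed_domains list is needed:

from urllib.parse import urlparse
from scrapy import Request

class InternalLinksMiddleware(object):
    """Drop requests that would leave the site they were found on."""

    @staticmethod
    def _host(url):
        # Normalise the host so www.example.com and example.com compare equal
        host = urlparse(url).netloc.lower()
        return host[4:] if host.startswith("www.") else host

    def process_spider_output(self, response, result, spider):
        origin = self._host(response.url)
        for request_or_item in result:
            # Only filter requests; items pass through untouched
            if isinstance(request_or_item, Request) and self._host(request_or_item.url) != origin:
                continue  # external link: drop it
            yield request_or_item

It would be enabled through the SPIDER_MIDDLEWARES setting, e.g. {'myproject.middlewares.InternalLinksMiddleware': 543} (the path and priority are placeholders). Is that the right mechanism, or is there a built-in way to do this?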
Thanks