Scrapy and web scraping Google

I want to use Scrapy to collect Google search results and store them in MongoDB. However, I am not getting any results back... What am I missing?

It looks simple enough.

# -*- coding: utf-8 -*-
import scrapy


class GoogleSpider(scrapy.Spider):
    name = "google"
    allowed_domains = ["google.com"]
    start_urls = (
        'https://www.google.com/#q=site:www.linkedin.com%2Fpub+intext:(security+or+jsp)+and+(power+or+utility)',
    )

    def parse(self, response):
        # Each result heading is expected inside the element with id "rso".
        for sel in response.xpath('//*[@id="rso"]/div/div[1]/div/h3'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print(title, link, desc)

Source

2015-10-05 Michael Bloom

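What you are missing is that the response does not contain the elements your XPath asks for.

This is because Scrapy receives a different page than the one you see in a browser. Everything after the # in the start URL is a fragment, which is never sent to the server. When the browser requests the start URL, it loads the plain Google page, and JavaScript on that page then reads the fragment and sends an XHR request to run the search.

Scrapy does not send this XHR call, because it is triggered by JavaScript, and Scrapy does not execute JavaScript.

To see what Scrapy actually gets when it requests this URL, and to check whether it contains what you expect, use the Scrapy shell: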
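The role of the `#q=...` part can be checked with the standard library alone. The snippet below is a minimal sketch rather than anything Scrapy-specific: it splits the start URL and shows that the search terms sit in the fragment, so the part of the URL that goes to the server is just `/`. The variable name `request_target` is only an illustration.

```python
from urllib.parse import urlsplit

url = ("https://www.google.com/#q=site:www.linkedin.com%2Fpub"
       "+intext:(security+or+jsp)+and+(power+or+utility)")

parts = urlsplit(url)

# An HTTP client sends the path and the query string, never the fragment.
request_target = parts.path + ("?" + parts.query if parts.query else "")

print("path:          ", parts.path)
print("query:         ", repr(parts.query))
print("fragment:      ", parts.fragment)
print("sent to server:", request_target)
```

The query string is empty and the whole search expression lives in the fragment, which is why the server has no search to run for this request.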

scrapy shell "https://www.google.com/#q=site:www.linkedin.com%2Fpub+intext:(security+or+jsp)+and+(power+or+utility)"

Then, when the prompt appears, you can see why you get no results:

>>> response.xpath('//*[@id="rso"]/div/div[1]/div/h3') 
[] 
>>>

So Scrapy finds nothing for your XPath, because the content it targets is missing from the response.
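One possible direction, sketched under assumptions: put the search terms in a real query string, so they reach the server with the request. The helper `fragment_to_search_url` below is hypothetical (it is not part of Scrapy) and simply moves a `#q=...` fragment into the query of the `/search` path. Whether Google then returns parseable results to a non-browser client, and whether the original XPath matches that markup, is not guaranteed, so it is worth checking the new URL in the Scrapy shell first.

```python
from urllib.parse import urlsplit, urlunsplit


def fragment_to_search_url(url):
    """Hypothetical helper: turn '.../#q=terms' into '.../search?q=terms'."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("q="):
        return url  # no search fragment, leave the URL unchanged
    # Reuse the fragment text as the query string and drop the fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/search", parts.fragment, ""))


original = ("https://www.google.com/#q=site:www.linkedin.com%2Fpub"
            "+intext:(security+or+jsp)+and+(power+or+utility)")
start_url = fragment_to_search_url(original)
print(start_url)
```

The resulting URL could be used in `start_urls` in place of the fragment form.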

Source

2015-10-05 11:00:59 GHajba
