Scrapy/Python: spider crawls but does not scrape data


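As a Scrapy novice, I can't figure out why this spider crawls the website but doesn't scrape any data from it. I searched Stack Overflow for possible answers, but I couldn't find one that addresses this sufficiently. I'm trying to scrape a list of restaurants in a small town from a website. I don't have detailed knowledge of the website's security features. Could the problem be related to the XPath expressions used to select elements? The spider runs fine, except that it doesn't scrape anything. Could you suggest why it doesn't scrape and how to fix the problem? The spider has the code below: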

try:
    from scrapy.spiders import Spider
    from urllib.parse import urljoin
    from scrapy.selector import Selector
    from scrapy.http import Request

except ImportError:
    print("\nERROR IMPORTING THE NECESSARY LIBRARIES\n")

#scrapy.optional_features.remove('boto')


class YelpSpider(Spider):
    name = 'yelp_spider'
    allowed_domains = ["yelp.com"]
    headers = ['venuename', 'services', 'address', 'phone', 'location']

    def __init__(self):
        self.start_urls = ['https://www.yelp.com/springfield-il-us']

    def start_requests(self):
        requests = []
        for item in self.start_urls:
            requests.append(Request(url=item, headers={'Referer': 'http://www.google.com/'}))
            return requests

    def parse(self, response):
        requests = []
        sel = Selector(response)
        restaurants = sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[3]/div[1]/div[1]/h1')
        items = []
        for restaurant in restaurants:
            item = YelpRestaurantItem()
            item['venuename'] = sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[3]/div[1]/div[1]/h1')
            item['services'] = sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[3]/div[1]/div[2]/div[2]/span[2]/a[1]')
            item['address'] = sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[1]/div/strong/address')
            item['phone'] = sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[3]/span[3]')
            item['location'] = sel.xpath('//*[@id="dropperText_Mast"]')
            item['url'] = response.url
            items.append(item)
            yield item

My items.py contains the following code:

import scrapy 

class YelpRestaurantItem(scrapy.Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 
    url=scrapy.Field() 
    venuename = scrapy.Field() 
    services = scrapy.Field() 
    address = scrapy.Field() 
    phone = scrapy.Field() 
    location=scrapy.Field()

Source

2017-04-09 Kaleab

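I assume you have indentation problems; please correct the code in the question. Also, have you tried debugging your code? Maybe print something on each iteration of the "for restaurant in ..." loop? – eLRuLL

What are you trying to scrape? The spider searches for an element whose id attribute is "wrap", but when I open the start URL I can't find anything that matches. – Casper

@Casper, I'm trying to grab the name, services, address, phone, and location. I should also say that this is my first time using XPath and Scrapy. I simply copied the XPath of the highlighted restaurant from Chrome's Developer Tools. What I actually want is the list of restaurant businesses in this town, with name, services, address, phone, and location. – Kaleab

Your imports didn't work that well over here, but that may be a configuration issue on my side. I think the scraper below does what you're looking for: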

import scrapy


class YelpSpider(scrapy.Spider):
    name = 'yelp_spider'
    allowed_domains = ["yelp.com"]
    headers = ['venuename', 'services', 'address', 'phone', 'location']
    # Search results page that lists the businesses in Springfield, IL.
    start_urls = ['https://www.yelp.com/search?find_desc=&find_loc=Springfield%2C+IL&ns=1']

    def start_requests(self):
        # Send a Referer header with every start request.
        for url in self.start_urls:
            yield scrapy.Request(url=url, headers={'Referer': 'http://www.google.com/'})

    def parse(self, response):
        # Each search result sits in its own "biz-listing-large" container.
        for restaurant in response.xpath('//div[@class="biz-listing-large"]'):
            item = {}
            # Paths that start with "." are evaluated relative to this restaurant.
            item['venuename'] = restaurant.xpath('.//h3[@class="search-result-title"]/span/a/span/text()').extract_first()
            item['services'] = u",".join(line.strip() for line in restaurant.xpath('.//span[@class="category-str-list"]/a/text()').extract())
            item['address'] = restaurant.xpath('.//address/text()').extract_first()
            item['phone'] = restaurant.xpath('.//span[@class="biz-phone"]/text()').extract_first()
            # Location and URL are page-level values, so they come from the response.
            item['location'] = response.xpath('//input[@id="dropperText_Mast"]/@value').extract_first()
            item['url'] = response.url
            yield item

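Some explanation:

I changed the start URL. This URL actually gives an overview of all the restaurants, whereas the other one didn't (or at least not when viewed from my location).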
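As a side note on the new start URL: its query string can be assembled with the standard library instead of being typed by hand. The sketch below is only an illustration. The helper name build_search_url is hypothetical, and the meaning of the find_desc, find_loc, and ns parameters is inferred from the URL in the answer, not taken from any Yelp documentation.

```python
from urllib.parse import urlencode


def build_search_url(location, description=""):
    """Build a search URL shaped like the one used in the answer.

    Hypothetical helper: the parameter names are copied from the answer's
    URL, and their meaning is an assumption.
    """
    params = {"find_desc": description, "find_loc": location, "ns": 1}
    # urlencode percent-encodes the comma and turns the space into "+".
    return "https://www.yelp.com/search?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("Springfield, IL"))
```

The returned string could then be placed in start_urls in the spider.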

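I removed the pipeline because it isn't defined on my system, and I can't test code that relies on a pipeline that doesn't exist.

The parse function is where I made the real changes. The XPaths you defined weren't very clear. The code now loops over each listed restaurant.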

response.xpath('//div[@class="biz-listing-large"]')

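This expression selects the data for all restaurants. I use it in the for loop so that we can perform the same actions for each restaurant. Inside the loop, the data for one restaurant is available in the variable restaurant.

So, to extract data from a single restaurant, I use this variable. We also need to start the XPath with a dot (.), because otherwise the expression is evaluated from the beginning of the web page (which is the same as using response).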
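The difference between searching inside one restaurant and searching from the top of the page can be seen in a small offline sketch. It deliberately uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs without a network connection, and the markup is made up for the illustration. Only the class name biz-listing-large and the id dropperText_Mast are taken from the answer. In Scrapy, the corresponding contrast is between restaurant.xpath('.//...') and restaurant.xpath('//...').

```python
import xml.etree.ElementTree as ET

# Made-up markup that imitates a results page with two listings.
PAGE = """
<html><body>
  <input id="dropperText_Mast" value="Springfield, IL"/>
  <div class="biz-listing-large">
    <h3>First Cafe</h3><address>1 Main St</address>
  </div>
  <div class="biz-listing-large">
    <h3>Second Diner</h3><address>2 Oak Ave</address>
  </div>
</body></html>
"""


def scrape(page):
    """Return one dict per listing, using lookups relative to each listing."""
    root = ET.fromstring(page)
    # Page-level value: looked up once, starting from the document root.
    location = root.find(".//input[@id='dropperText_Mast']").get("value")
    items = []
    for restaurant in root.findall(".//div[@class='biz-listing-large']"):
        items.append({
            # Relative lookups: evaluated inside this listing only.
            "venuename": restaurant.findtext(".//h3"),
            "address": restaurant.findtext(".//address"),
            "location": location,
        })
    return items


def scrape_from_root(page):
    """Pitfall: searching from the root inside the loop repeats the first match."""
    root = ET.fromstring(page)
    listings = root.findall(".//div[@class='biz-listing-large']")
    return [root.findtext(".//h3") for _ in listings]


if __name__ == "__main__":
    print(scrape(PAGE))
    print(scrape_from_root(PAGE))
```

Searching from the root returns the first restaurant's name for every listing, which is why the per-restaurant fields in the answer are read from restaurant and not from response.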

I could explain the XPaths in my answer to you, but there is plenty of documentation available that probably explains them better than I can.

Some documentation

And some more

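Note that I use restaurant for most of the item values. The location and URL values aren't really restaurant data; they are located elsewhere on the web page. That's why those values use response instead of restaurant.

Source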

2017-04-10 12:40:45 Casper

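Thanks for the dedicated reply. It works, although it doesn't scrape 'phone' and 'address'. Could it be a syntax error? – Kaleab

Did you make any changes to the code? When I run the crawler here, it returns all the defined attributes of the item. – Casper

I used the code as you suggested; it populates every field except 'phone' and 'address'. – Kaleab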
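The thread does not establish why 'phone' and 'address' came back empty for one user and not the other. One thing that can be checked when a field is missing is whether the first text node that was matched is only whitespace. The sketch below is an assumption-laden illustration and not a confirmed fix: first_clean is a hypothetical helper that picks the first non-blank string from a list of extracted strings.

```python
def first_clean(values, default=None):
    """Return the first non-blank string in values, stripped of whitespace.

    Hypothetical helper: values is expected to be a list of strings such as
    the one returned by a selector's extract() call. None entries are skipped.
    """
    for value in values:
        if value is None:
            continue
        text = value.strip()
        if text:
            return text
    return default


if __name__ == "__main__":
    # A whitespace-only first node no longer hides the real value.
    print(first_clean(["\n    ", "  (217) 555-0100 \n"]))
    # An empty result falls back to the default.
    print(first_clean([], default="n/a"))
```

In the spider it could be used as first_clean(restaurant.xpath('.//address//text()').extract()). If that still yields nothing, the markup served to the spider probably differs from what the XPath expects.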
