
Scrapy conditional crawling

My HTML contains a number of divs that mostly share the same structure. Below is a snippet containing 2 such divs:

<!-- 1st Div start --> 

<div class="outer-container"> 
<div class="inner-container"> 
<a href="www.xxxxxx.com"></a> 
<div class="abc xyz" title="verified"></div> 
<div class="mody"> 
     <div class="row"> 
      <div class="col-md-5 col-xs-12"> 
       <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2> 
       <div class="mvsdfm casmhrn" itemprop="address"> 
        <span itemprop="Address">1223 Industrial Blvd</span><br> 
        <span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span> 
       </div> 
       <div class="hidden-device-xs" itemprop="phone" rel="mainPhone"> 
        (800) 845-0000 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
</div> 

<!-- 2nd Div start --> 

<div class="outer-container"> 
<div class="inner-container"> 
<a href="www.yyyyyy.com"></a> 
<div class="mody"> 
     <div class="row"> 
      <div class="col-md-5 col-xs-12"> 
       <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2> 
       <div class="mvsdfm casmhrn" itemprop="address"> 
        <span itemprop="Address">7890 Business St</span><br> 
        <span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span> 
       </div> 
       <div class="hidden-device-xs" itemprop="phone" rel="mainPhone"> 
        (800) 845-0000 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
</div> 

So here is what I want Scrapy to do with this snippet:

If a div with class="outer-container" contains another div with title="verified", as in the first div above, the spider should follow the URL above it (i.e. www.xxxxxx.com) and fetch some other fields from that page.

If there is no div with title="verified", as in the second div above, it should grab all the data under the div with class="mody", i.e. company name (Fat Dude, LLC), address, city, state, etc., and it should not follow the URL (i.e. www.yyyyyy.com).

So how do I apply this condition/logic in a Scrapy crawler? I was thinking of using BeautifulSoup, but I'm not sure...
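
Conceptually, the per-listing branching I have in mind looks something like the rough, untested BeautifulSoup sketch below (just the idea, not working spider code; it assumes the markup above is available in a string named html):

from bs4 import BeautifulSoup

def handle_listing(outer_div):
    """Rough idea only: decide per listing whether to follow the link or extract data."""
    if outer_div.find("div", title="verified") is not None:
        # verified: follow the link and scrape the detail page instead
        return ("follow", outer_div.find("a")["href"])
    # not verified: pull the company data out of this block
    name = outer_div.find("a", class_="mheading primary h4")
    return ("extract", name.get_text(strip=True) if name else None)

soup = BeautifulSoup(html, "lxml")  # html = the markup shown above
for div in soup.find_all("div", class_="outer-container"):
    print(handle_listing(div))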

Here is what I have tried so far:

from bs4 import BeautifulSoup
from scrapy import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider

from myproject.items import NewsFields

class MySpider(CrawlSpider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        cName = soup.find_all("a", class_="mheading primary h4")
        addrs = soup.find_all("span", itemprop="Address")
        loclity = soup.find_all("span", itemprop="Locality")
        region = soup.find_all("span", itemprop="Region")
        post = soup.find_all("span", itemprop="postalCode")

        nf['companyName'] = cName[0].get_text()
        nf['address'] = addrs[0].get_text()
        nf['locality'] = loclity[0].get_text()
        nf['state'] = region[0].get_text()
        nf['zipcode'] = post[0].get_text()
        yield nf
        # follows every listing URL, with no condition at all
        for url in hxs.xpath('//div[@class="inner-container"]/a/@href').extract():
            yield Request(url, callback=self.parse)

Of course, the above code follows and scrapes all the URLs under div class="inner-container", because there is no conditional crawling in it; I don't know where/how to set the condition.
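
For reference, NewsFields is just a plain scrapy.Item declaring one field per value extracted above, something along these lines:

import scrapy

class NewsFields(scrapy.Item):
    companyName = scrapy.Field()
    address = scrapy.Field()
    locality = scrapy.Field()
    state = scrapy.Field()
    zipcode = scrapy.Field()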

If anyone has done something similar before, please share. Thanks.

Answer


There is no need to use BeautifulSoup; Scrapy has its own selector capabilities (also released separately as parsel). Let's build an example with your HTML:

html = u""" 
<!-- 1st Div start --> 
<div class="outer-container"> 
<div class="inner-container"> 
<a href="www.xxxxxx.com"></a> 
<div class="abc xyz" title="verified"></div> 
<div class="mody"> 
     <div class="row"> 
      <div class="col-md-5 col-xs-12"> 
       <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2> 
       <div class="mvsdfm casmhrn" itemprop="address"> 
        <span itemprop="Address">1223 Industrial Blvd</span><br> 
        <span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span> 
       </div> 
       <div class="hidden-device-xs" itemprop="phone" rel="mainPhone"> 
        (800) 845-0000 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
</div> 
<!-- 2nd Div start --> 
<div class="outer-container"> 
<div class="inner-container"> 
<a href="www.yyyyyy.com"></a> 
<div class="mody"> 
     <div class="row"> 
      <div class="col-md-5 col-xs-12"> 
       <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2> 
       <div class="mvsdfm casmhrn" itemprop="address"> 
        <span itemprop="Address">7890 Business St</span><br> 
        <span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span> 
       </div> 
       <div class="hidden-device-xs" itemprop="phone" rel="mainPhone"> 
        (800) 845-0000 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
</div> 
""" 

from parsel import Selector

sel = Selector(text=html)
for div in sel.css('.outer-container'):
    if div.css('div[title="verified"]'):
        url = div.css('a::attr(href)').extract_first()
        print('verified, follow this URL:', url)
    else:
        nf = {}
        nf['companyName'] = div.xpath('string(.//h2)').extract_first()
        nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
        nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
        nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
        nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
        print('not verified, extracted item is:', nf)

The output of the preceding snippet is:

verified, follow this URL: www.xxxxxx.com
not verified, extracted item is: {'companyName': 'Fat Dude, LLC', 'address': '7890 Business St', 'locality': 'Tokyo', 'state': 'MA', 'zipcode': '987655'}
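
The branching works because div.css(...) returns a SelectorList, which is empty, and therefore falsy, when nothing matches. If you prefer XPath, the same check can be written as in this small equivalent sketch (reusing sel from above):

for div in sel.css('.outer-container'):
    # an empty SelectorList is falsy, so this reads like a plain boolean test
    if div.xpath('.//div[@title="verified"]'):
        print('verified, follow this URL:', div.css('a::attr(href)').extract_first())
    else:
        print('not verified, company:', div.xpath('string(.//h2)').extract_first())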

But in Scrapy you don't even need to instantiate the Selector class yourself: there are shortcuts for it available on the response object passed to your callback. Also, you don't need to subclass CrawlSpider; the plain Spider class is enough. Putting it all together:

from scrapy import Spider, Request
from myproject.items import NewsFields

class MySpider(Spider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        for div in response.selector.css('.outer-container'):
            if div.css('div[title="verified"]'):
                # verified listing: follow its URL (urljoin handles relative hrefs)
                url = response.urljoin(div.css('a::attr(href)').extract_first())
                yield Request(url)
            else:
                # not verified: extract the company data from this block
                nf = NewsFields()
                nf['companyName'] = div.xpath('string(.//h2)').extract_first()
                nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
                nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
                nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
                nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
                yield nf
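
Note that Request(url) without a callback sends the response back through parse(), which expects the listing layout. Since the verified pages are supposed to yield "some other fields", you would probably give them their own callback. Below is a minimal sketch of that variation; the parse_detail selectors are hypothetical placeholders, because the verified page's HTML isn't shown:

from scrapy import Spider, Request
from myproject.items import NewsFields  # same item module as above

class MyDetailSpider(Spider):
    name = 'dknfetch_detail'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']

    def parse(self, response):
        for div in response.css('.outer-container'):
            if div.css('div[title="verified"]'):
                url = response.urljoin(div.css('a::attr(href)').extract_first())
                # route verified listings to a dedicated callback
                yield Request(url, callback=self.parse_detail)
            else:
                nf = NewsFields()
                nf['companyName'] = div.xpath('string(.//h2)').extract_first()
                nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
                yield nf

    def parse_detail(self, response):
        # Hypothetical selectors: adjust them to the real markup of the detail page.
        nf = NewsFields()
        nf['companyName'] = response.xpath('string(//h1)').extract_first()
        yield nf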

I would suggest getting familiar with Parsel's API: https://parsel.readthedocs.io/en/latest/usage.html

Happy scraping!