Scrapy conditional crawling: my HTML contains a number of divs that mostly share the same structure. Below is an excerpt containing two such divs:
<!-- 1st Div start -->
<div class="outer-container">
  <div class="inner-container">
    <a href="www.xxxxxx.com"></a>
    <div class="abc xyz" title="verified"></div>
    <div class="mody">
      <div class="row">
        <div class="col-md-5 col-xs-12">
          <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
          <div class="mvsdfm casmhrn" itemprop="address">
            <span itemprop="Address">1223 Industrial Blvd</span><br>
            <span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
          </div>
          <div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
            (800) 845-0000
          </div>
        </div>
      </div>
    </div>
  </div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
  <div class="inner-container">
    <a href="www.yyyyyy.com"></a>
    <div class="mody">
      <div class="row">
        <div class="col-md-5 col-xs-12">
          <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
          <div class="mvsdfm casmhrn" itemprop="address">
            <span itemprop="Address">7890 Business St</span><br>
            <span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
          </div>
          <div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
            (800) 845-0000
          </div>
        </div>
      </div>
    </div>
  </div>
</div>
Given the snippet above, here is what I want Scrapy to do:
If a div with class="outer-container" contains another div with title="verified", as in the first div above, it should follow the URL (i.e. www.xxxxxx.com) and fetch some other fields from that page.
If there is no div with title="verified", as in the second div above, it should scrape all the data under div class="mody" — i.e. company name (Fat Dude, LLC), address, city, state, etc. — and NOT follow the URL (i.e. www.yyyyyy.com).
So how do I apply this condition/logic in a Scrapy crawler? I was thinking of using BeautifulSoup, but I'm not sure...
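To make the condition concrete, this is roughly the decision I'm after, sketched with BeautifulSoup against a trimmed copy of the two listings above (standalone, using the built-in html.parser; the variable names are just for illustration):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two listings above: the first carries the
# "verified" marker div, the second does not.
html = """
<div class="outer-container"><div class="inner-container">
  <a href="www.xxxxxx.com"></a>
  <div class="abc xyz" title="verified"></div>
  <div class="mody"><h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2></div>
</div></div>
<div class="outer-container"><div class="inner-container">
  <a href="www.yyyyyy.com"></a>
  <div class="mody"><h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2></div>
</div></div>
"""

soup = BeautifulSoup(html, "html.parser")
plan = []
for listing in soup.find_all("div", class_="outer-container"):
    if listing.find("div", title="verified"):
        # verified listing: record the URL to follow instead of scraping here
        plan.append(("follow", listing.find("a")["href"]))
    else:
        # unverified listing: scrape the fields in place, don't follow the link
        name = listing.find("a", class_="mheading").get_text(strip=True)
        plan.append(("scrape", name))

print(plan)  # → [('follow', 'www.xxxxxx.com'), ('scrape', 'Fat Dude, LLC')]
```

The key point is iterating per outer-container and testing for the marker div inside each one, rather than running find_all over the whole page at once.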
Here is what I have tried so far:
from bs4 import BeautifulSoup
from scrapy import Request
from scrapy.spiders import CrawlSpider

from ..items import NewsFields  # wherever the NewsFields item is defined

class MySpider(CrawlSpider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        cName = soup.find_all("a", class_="mheading primary h4")
        # itemprop is an ordinary attribute, so no trailing underscore needed
        addrs = soup.find_all("span", itemprop="Address")
        loclity = soup.find_all("span", itemprop="Locality")
        region = soup.find_all("span", itemprop="Region")
        post = soup.find_all("span", itemprop="postalCode")
        # these tags carry no content attribute; take their text instead
        nf['companyName'] = cName[0].get_text(strip=True)
        nf['address'] = addrs[0].get_text(strip=True)
        nf['locality'] = loclity[0].get_text(strip=True)
        nf['state'] = region[0].get_text(strip=True)
        nf['zipcode'] = post[0].get_text(strip=True)
        yield nf
        # this follows every listing URL, verified or not
        for url in response.xpath('//div[@class="inner-container"]/a/@href').extract():
            yield Request(url, callback=self.parse)
Of course, the code above follows and crawls every URL under div class="inner-container", because no condition is specified anywhere — I just don't know where/how to set one.
If anyone has done something similar before, please share. Thanks.