2014-09-28

I'm new to Scrapy (and Python!), and I'm trying to scrape the commentary from the Cricinfo website. Here is an example page: http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary

What I'm interested in scraping is the over number (e.g. 0.1) and the text next to it.

Using Firebug I can see that the XPath for "0.1" is: /html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[1]/p

and for the text next to it: /html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[2]/p
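One thing worth checking (my own observation, not part of the original post): browsers insert tbody elements into tables while rendering, so an XPath copied out of Firebug often contains tbody nodes that do not exist in the raw HTML that Scrapy actually downloads, and the absolute path then matches nothing. A minimal stdlib sketch of the pitfall, on made-up markup:

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the raw HTML as downloaded: note there is no <tbody>,
# even though Firebug would show one in the browser's rendered DOM.
raw = '<table><tr><td><p>0.1</p></td><td><p>commentary text</p></td></tr></table>'
table = ET.fromstring(raw)

# The Firebug-style path (with tbody) finds nothing in the raw source:
print(table.findall('./tbody/tr/td[1]/p'))                  # []
# Dropping tbody matches the same cell:
print([p.text for p in table.findall('./tr/td[1]/p')])      # ['0.1']
```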

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from crictest.items import CrictestItem


class MySpider(BaseSpider):
    name = "cricinfo"
    allowed_domains = ["espncricinfo.com/"]
    start_urls = ["http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr')
        items = []
        for row in rows:
            item = CrictestItem()
            item['overnum'] = row.select('td[1]/p/text()').extract()
            item['overnumtext'] = row.select('td[2]/p/text()').extract()
            items.append(item)
        return items

I'm trying to loop over the rows (/tr) and then return td[1]/p/text() and td[2]/p/text(). My items.py looks like this:

import scrapy 


class CrictestItem(scrapy.Item): 
    overnum = scrapy.Field() 
    overnumtext = scrapy.Field() 

Running scrapy crawl cricinfo -o items.csv -t csv just gives me an items.csv file with no data in it at all.

Where am I going wrong? Any help would be much appreciated.

Answers


The XPaths you have are incorrect and, moreover, very fragile.

As far as I understand, you need the bold numbers and the text next to them. I would rely on the td elements with the battingComms class:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from crictest.items import CrictestItem


class MySpider(BaseSpider):
    name = "cricinfo"
    allowed_domains = ["espncricinfo.com/"]
    start_urls = ["http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//td[@class="battingComms" and b]')
        for row in rows:
            item = CrictestItem()
            item['overnum'] = row.select('b/text()').extract()[0]
            item['overnumtext'] = row.select('b/following-sibling::text()').extract()[0]
            yield item

Output on the console:

{'overnum': u'0.4', 
'overnumtext': u" bingo! that's a good ol slog from van Wyk right across the line of a good length ball that nips back in. No bat involved, but loads of timber. Lovely bowling from Paris and he knows it "} 
{'overnum': u'1.3', 
'overnumtext': u' and dies by his reputation. Behrendorff is assisted by some swing away, Delport flings his bat at with all his might and only ends up with an outside edge that is pouched behind the wicket. Brilliant catch from Whiteman as he leaps to his left and stretches as high as he could '} 
... 
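The selection above can be mimicked on a toy commentary cell. In stdlib ElementTree terms, b/following-sibling::text() corresponds to the tail text hanging off the b element (a sketch with made-up markup, not the real page HTML; the battingComms class name is taken from the answer above):

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for one commentary cell on the page:
cell = ET.fromstring('<td class="battingComms"><b>0.4</b> bingo! loads of timber</td>')
b = cell.find('b')
overnum = b.text              # the bold over number -> '0.4'
overnumtext = b.tail.strip()  # the text immediately after the <b>
print(overnum, overnumtext)
```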
This looks more like it, but it isn't picking up every number; it only shows 11 records? Also, how should I have discovered the battingComms class? Thanks – Del 2014-09-30 17:28:41

@Del, how would I know what you want to get from the page? – alecxe 2014-09-30 17:34:00

Sorry if I wasn't clear. I want a csv file with 2 columns. One column with the numbers: 0.1, 0.3 ... 19.5, 19.6. The other column with the text next to that number on the web page. – Del 2014-09-30 18:57:29
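The two-column CSV Del describes is exactly what the feed exporter writes once the spider yields items with those two fields. A standalone sketch of that layout with the stdlib csv module (the rows here are made up, standing in for the scraped items):

```python
import csv

# Hypothetical rows standing in for the scraped (overnum, overnumtext) items:
items = [
    {'overnum': '0.1', 'overnumtext': 'text next to 0.1'},
    {'overnum': '0.3', 'overnumtext': 'text next to 0.3'},
]

with open('items.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['overnum', 'overnumtext'])
    writer.writeheader()     # header row: overnum,overnumtext
    writer.writerows(items)  # one row per ball, in page order
```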


You can get exactly that kind of result from the example below.

It uses the XPath following-sibling axis to pick up the appropriate values.

The HTML is:

<div id="provider-region-addresses"> 
<h3>Contact details</h3> 
<h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>More information</dt> 
      <dd>North Shore Hospital</dd><dt>Physical address</dt> 
       <dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt> 
       <dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt> 
       <dd>0740</dd><dt>District/town</dt> 

       <dd> 
       North Shore, Takapuna</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 486 8996</dd><dt>Fax</dt> 
       <dd>(09) 486 8342</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>Physical address</dt> 
       <dd>Helensville</dd><dt>Postal address</dt> 
       <dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt> 
       <dd>0840</dd><dt>District/town</dt> 

       <dd> 
       Rodney, Helensville</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 420 9450</dd><dt>Fax</dt> 
       <dd>(09) 420 7050</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>Physical address</dt> 
       <dd>Warkworth</dd><dt>Postal address</dt> 
       <dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt> 
       <dd>0941</dd><dt>District/town</dt> 

       <dd> 
       Rodney, Warkworth</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 422 2700</dd><dt>Fax</dt> 
       <dd>(09) 422 2709</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>More information</dt> 
      <dd>Waitakere Hospital</dd><dt>Physical address</dt> 
       <dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt> 
       <dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt> 
       <dd>0650</dd><dt>District/town</dt> 

       <dd> 
       Waitakere, Henderson</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 839 0000</dd><dt>Fax</dt> 
       <dd>(09) 837 6634</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>More information</dt> 
      <dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt> 
       <dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt> 
       <dd>0932</dd><dt>District/town</dt> 

       <dd> 
       Rodney, Red Beach</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 427 0300</dd><dt>Fax</dt> 
       <dd>(09) 427 0391</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    </div> 

The spider code is:

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    practice = hxs.select('//h1/text()').extract()
    items1 = []

    results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
    for result in results:
        item = WebhealthItem1()
        # item['url'] = result.select('//dl/a/@href').extract()
        item['practice'] = practice
        item['hours'] = map(unicode.strip,
            result.select('dt[contains(.," Contact hours")]/following-sibling::dd[1]/text()').extract())
        item['more_hours'] = map(unicode.strip,
            result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
        item['physical_address'] = map(unicode.strip,
            result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
        item['postal_address'] = map(unicode.strip,
            result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
        item['postcode'] = map(unicode.strip,
            result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
        item['district_town'] = map(unicode.strip,
            result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
        item['region'] = map(unicode.strip,
            result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
        item['phone'] = map(unicode.strip,
            result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
        item['website'] = map(unicode.strip,
            result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
        item['email'] = map(unicode.strip,
            result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
        items1.append(item)
    return items1