2015-04-05

Scraping table data with Scrapy

This is my first attempt at Scrapy. After doing a bit of research, I have the basics down. Now I am trying to extract the data from a table, but it is not working. Check the source code below.

items.py

from scrapy.item import Item, Field 

class Digi(Item): 

    sl = Field() 
    player_name = Field() 
    dismissal_info = Field() 
    bowler_name = Field() 
    runs_scored = Field() 
    balls_faced = Field() 
    minutes_played = Field() 
    fours = Field() 
    sixes = Field() 
    strike_rate = Field() 

digicric.py

from scrapy.spider import Spider 
from scrapy.selector import Selector 
from crawler01.items import Digi 

class DmozSpider(Spider): 
    name = "digicric" 
    allowed_domains = ["digicricket.marssil.com"] 
    start_urls = ["http://digicricket.marssil.com/match/MatchData.aspx?op=2&match=1250"] 

    def parse(self, response): 

        sel = Selector(response) 
        sites = sel.xpath('//*[@id="ctl00_ContentPlaceHolder1_divData"]/table[3]/tr') 
        items = [] 

        for site in sites: 
            item = Digi() 
            item['sl'] = sel.xpath('td/text()').extract() 
            item['player_name'] = sel.xpath('td/a/text()').extract() 
            item['dismissal_info'] = sel.xpath('td/text()').extract() 
            item['bowler_name'] = sel.xpath('td/text()').extract() 
            item['runs_scored'] = sel.xpath('td/text()').extract() 
            item['balls_faced'] = sel.xpath('td/text()').extract() 
            item['minutes_played'] = sel.xpath('td/text()').extract() 
            item['fours'] = sel.xpath('td/text()').extract() 
            item['sixes'] = sel.xpath('td/text()').extract() 
            item['strike_rate'] = sel.xpath('td/text()').extract() 
            items.append(item) 
        return items 

Answers


The key problem is that you are using `sel` inside the loop. The other key problem is that your XPath expressions all point at the `td` elements generically, while you need to pick each `td` by index and associate it with the corresponding item field.
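In miniature, the difference between querying the whole document and querying each row can be shown with the standard library's `ElementTree` (a sketch on a made-up two-row table, not the real page markup):

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for the scorecard table (not the real page markup).
table = ET.fromstring(
    "<table>"
    "<tr><td>1</td><td><a>Player A</a></td><td>c Keeper b Bowler</td></tr>"
    "<tr><td>2</td><td><a>Player B</a></td><td>not out</td></tr>"
    "</table>"
)

# Wrong: querying the document on every iteration sees the cells of
# every row each time, not just the current row's cells.
all_cells = table.findall('.//td')  # 6 elements, both rows mixed together

# Right: query relative to each row, and pick cells by position.
items = []
for row in table.findall('tr'):
    items.append({
        'sl': row.find('td[1]').text,
        'player_name': row.find('td[2]/a').text,
        'dismissal_info': row.find('td[3]').text,
    })
```

The same idea applies in the spider: call `.xpath()` on `site` (the per-row selector) with `td[1]`, `td[2]`, and so on, rather than on `sel` (the whole response).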

Working solution:

def parse(self, response): 
    sites = response.xpath('//*[@id="ctl00_ContentPlaceHolder1_divData"]/table[3]/tr')[1:-2] 

    for site in sites: 
        item = Digi() 
        item['sl'] = site.xpath('td[1]/text()').extract() 
        item['player_name'] = site.xpath('td[2]/a/text()').extract() 
        item['dismissal_info'] = site.xpath('td[3]/text()').extract() 
        item['bowler_name'] = site.xpath('td[4]/text()').extract() 
        item['runs_scored'] = site.xpath('td[5]/b/text()').extract() 
        item['balls_faced'] = site.xpath('td[6]/text()').extract() 
        item['minutes_played'] = site.xpath('td[7]/text()').extract() 
        item['fours'] = site.xpath('td[8]/text()').extract() 
        item['sixes'] = site.xpath('td[9]/text()').extract() 
        item['strike_rate'] = site.xpath('td[10]/text()').extract() 
        yield item 

It correctly outputs 11 items for me.
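The `[1:-2]` slice is what keeps the output to the 11 data rows: it drops the header row at the front and what appear to be two trailing summary rows (an assumption about this page's table layout). In plain Python:

```python
# Hypothetical row list mirroring the scorecard's structure.
rows = ['header', 'batsman 1', 'batsman 2', 'batsman 3', 'extras', 'total']

# Skip the first row and the last two, keeping only the batsman rows.
data = rows[1:-2]
```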


It shows an error. Here is the error screenshot: [error screenshot](http://i.imgur.com/HPh5lia.png), and here is the code: [link](http://i.imgur.com/InxV60O.png) [link](http://i.imgur.com/XtKyOkr.png) – 2015-04-06 06:17:34


@TanzibHossainNirjhor Strange, it works for me. Which Scrapy version are you using? – alecxe 2015-04-06 09:26:51


[Scrapy 0.24.5] [Python 2.7.9] [PIP 6.0.8] [Windows 8.1] – 2015-04-06 16:41:14


I just ran your code with Scrapy and it works perfectly. What exactly is not working for you?

P.S. This should be a comment, but I don't have enough reputation yet... I will edit/delete this answer as needed.

Edit:

I think you should `yield item` at the end of each loop iteration instead of doing `return items` at the end. The rest of the code should be fine.
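The behavioral difference is easy to see with plain functions (illustrative names only, not part of the original code):

```python
def parse_with_return(rows):
    # Builds the whole list first, then hands it back in one shot.
    items = []
    for row in rows:
        items.append({'value': row})
    return items

def parse_with_yield(rows):
    # Hands each item back as soon as it is ready; Scrapy can start
    # processing and exporting items before the loop finishes.
    for row in rows:
        yield {'value': row}
```

Both produce the same items here; in Scrapy the `yield` form is idiomatic because the framework consumes the callback's result as an iterator.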

Here is an example from the Scrapy documentation:

import scrapy 
from myproject.items import MyItem 

class MySpider(scrapy.Spider): 
    name = 'example.com' 
    allowed_domains = ['example.com'] 
    start_urls = [ 
     'http://www.example.com/1.html', 
     'http://www.example.com/2.html', 
     'http://www.example.com/3.html', 
    ] 

    def parse(self, response): 
     for h3 in response.xpath('//h3').extract(): 
      yield MyItem(title=h3) 

     for url in response.xpath('//a/@href').extract(): 
      yield scrapy.Request(url, callback=self.parse) 

The problem is that the loop runs, but no data gets scraped into the items. – 2015-04-06 06:01:44