Stuck scraping a specific table with Scrapy

The table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters

I'm after the table titled "June Stats".

Here is my spider:

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr'):
            item = TtscrapeItem()
            name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
            url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            won = sel.xpath('td[2]/text()').extract()[0]
            lost = sel.xpath('td[3]/text()').extract()[0]
            void = sel.xpath('td[4]/text()').extract()[0]
            tips = int(won) + int(void) + int(lost)
            item['Tips'] = tips
            strike = Decimal(int(won)/tips) * 100
            strike = str(round(strike,2))
            item['Strike'] = [strike + "%"]
            profit = sel.xpath('//td[5]/text()').extract()[0]
            if profit[0] in ['+']:
                profit = profit[1:]
            item['Profit'] = profit
            yield_str = sel.xpath('//td[6]/text()').extract()[0]
            yield_str = yield_str.replace(' ','')
            if yield_str[0] in ['+']:
                yield_str = yield_str[1:]
            item['Yield'] = '<span style="color: #40AA40">' + yield_str + '%</span>'
            item['Site'] = 'Bet District'
            yield item

This gives me a "list index out of range" error on the first variable (name).

However, when I rewrite my XPath selectors to start with //, e.g.:

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

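the spider runs, but it scrapes the first tipster over and over again.

I think this has something to do with the table not having a thead, but containing th tags in the first tr of the tbody.

Any help is much appreciated.

---------- EDIT ----------

In response to Lars's suggestion:

I tried using what you suggested, but I still get the "list index out of range" error: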

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
            item = TtscrapeItem()
            name = sel.xpath('a/text()').extract()[0]
            url = sel.xpath('a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            yield item

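Also, am I right in assuming that doing it this way requires multiple for loops, since not all the cells have the same class?

I also tried doing this without a for loop, but in that case it again scraped only the first tipster, multiple times.

Thanks

Source

2015-06-10 preach

When you say

name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]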

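the XPath expression starts with td, so it is relative to the context node held in the variable sel (i.e. the current tr element, out of the tr elements that the for loop iterates over).

But when you say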

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

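the XPath expression starts with //td, i.e. it selects td elements anywhere in the document. This is not relative to sel, so the result is the same on every iteration of the for loop. That's why it scrapes the first tipster over and over.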
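To make the difference concrete, here is a minimal sketch. It assumes a made-up two-row table (not the real page's markup) and uses the standard library's ElementTree as a stand-in for Scrapy's selectors, so the `.//` search from the table root plays the role of a leading `//`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE = """
<table>
  <tr><td class="tipst"><a href="/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/bob">Bob</a></td><td>7</td></tr>
</table>
""".strip()

table = ET.fromstring(SAMPLE)

relative_names = []
document_wide_names = []
for row in table.findall("tr"):
    # Relative path: evaluated against the current row only.
    relative_names.append(row.findall("td[@class='tipst']/a")[0].text)
    # Document-wide search (the effect of a leading //): ignores the current row.
    document_wide_names.append(table.findall(".//td[@class='tipst']/a")[0].text)

print(relative_names)
print(document_wide_names)
```

The relative lookup yields a different name per row, while the document-wide lookup returns the first tipster on every pass.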

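Why does the first XPath expression fail with a "list index out of range" error? Try evaluating the XPath expression one location step at a time, printing the result of each step, and you'll soon find the problem. In this case, it appears to be because the first tr child of table[1] has no td children (only th children). So xpath() selects nothing, extract() returns an empty list, and you then try to reference the first item in that empty list, which gives the "list index out of range" error.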
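A rough sketch of that step-by-step check, again on a made-up table (header row with th cells, then one data row) and with ElementTree standing in for Scrapy's selectors. The markup and names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: the first row contains only th cells.
SAMPLE = """
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/alice">Alice</a></td><td>10</td></tr>
</table>
""".strip()

table = ET.fromstring(SAMPLE)
header_row = table.findall("tr")[0]

# Evaluate the path one location step at a time and count the matches.
step_counts = {}
for path in ["th", "td", "td[@class='tipst']", "td[@class='tipst']/a"]:
    step_counts[path] = len(header_row.findall(path))
    print(path, "->", step_counts[path])

# Indexing into the empty result reproduces the error from the question.
matches = header_row.findall("td[@class='tipst']/a")
try:
    matches[0]
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__
print(error_name)
```

The very first td step already matches nothing on the header row, which is where the empty list comes from.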

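To fix this, you could change your for loop's XPath expression so that it loops only over those tr elements that have td children:

for sel in response.xpath('//table[1]/tr[td]'):

You could get fancier and require a td of the right class:

for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):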
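Here is a small sketch of the row filter, assuming a made-up table and using ElementTree in place of Scrapy's selectors. ElementTree only implements a limited XPath subset, so only the simpler tr[td] form is shown; the nested tr[td[@class="tipst"]] form needs a full XPath 1.0 engine such as the one behind Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: one header row, two data rows.
SAMPLE = """
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/bob">Bob</a></td><td>7</td></tr>
</table>
""".strip()

table = ET.fromstring(SAMPLE)

all_rows = table.findall("tr")       # includes the th-only header row
data_rows = table.findall("tr[td]")  # only rows that have a td child

# With the header row filtered out, indexing [0] is safe for every row.
names = [row.findall("td[@class='tipst']/a")[0].text for row in data_rows]

print(len(all_rows), len(data_rows))
print(names)
```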

Source

2015-06-10 16:25:38 LarsH

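Thanks for the reply, Lars. I've added an edit after trying to implement this, but still no luck! – preach

@preach, even though we've changed the XPath expression in the for loop statement, sel still holds the tr element, not the td element. That's because XPath predicates (the parts in square brackets) don't express further location steps; they just filter the 'tr's you've already selected. So you need to change the XPath for 'name' to 'td[@class="tipst"]/a/text()', not just 'a/text()'. – LarsH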
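The point about predicates can be checked with a short sketch. It assumes a made-up table and uses ElementTree as a stand-in for Scrapy's selectors; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup.
SAMPLE = """
<table>
  <tr><th>Tipster</th></tr>
  <tr><td class="tipst"><a href="/alice">Alice</a></td></tr>
  <tr><td class="tipst"><a href="/bob">Bob</a></td></tr>
</table>
""".strip()

table = ET.fromstring(SAMPLE)

row_tags = []
direct_link_counts = []
names = []
for row in table.findall("tr[td]"):
    # The predicate [td] only filtered the rows; the context node is still a tr.
    row_tags.append(row.tag)
    # 'a' looks for a elements that are direct children of the tr: none exist.
    direct_link_counts.append(len(row.findall("a")))
    # The td step has to be spelled out again to reach the link.
    names.append(row.findall("td[@class='tipst']/a")[0].text)

print(row_tags, direct_link_counts, names)
```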
