Scrapy - Scraping multiple items

First, here is my code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from usdirectory.items import UsdirectoryItem
from scrapy.http import Request


class MySpider(BaseSpider):
    name = "usdirectory"
    allowed_domains = ["domain.com"]
    start_urls = ["url_removed_sorry"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//*[@id="holder_result2"]/a[1]/span/span[1]/text()').extract()
        for title in titles:
            item = UsdirectoryItem()
            item["title"] = title
            item

        yield item

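This works... but it only grabs the first item.

I noticed that in the items I am trying to scrape, the XPath changes for each row. For example, the first row is the XPath you see above: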

//*[@id="holder_result2"]/a[1]/span/span[1]/text()

The index then increments by 2, all the way up to 29. So, the second result:

//*[@id="holder_result2"]/a[3]/span/span[1]/text()

The last result:

//*[@id="holder_result2"]/a[29]/span/span[1]/text()

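So my question is how to get the script to grab all of these. I don't care if I have to copy and paste the code for each item. All the other pages are exactly the same. I'm just not sure how to go about it.

Thanks very much.

Edit: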

import scrapy 
from scrapy.item import Item, Field 

class UsdirectoryItem(scrapy.Item): 
    title = scrapy.Field()

Source

2016-02-13 dkeeper09

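This may just be a formatting issue with the code in your question, but one problem is that 'yield item' needs to be inside the 'for title in titles' loop. With only one yield at the end of 'parse', you will only get 1 item. –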
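
To illustrate the point made in this comment, here is a minimal sketch without Scrapy showing how the position of `yield` changes what a generator produces. The function names `yield_after_loop` and `yield_inside_loop` and the sample titles are made up for illustration, and plain dicts stand in for `UsdirectoryItem`.

```python
def yield_after_loop(titles):
    """Mimics the question's parse(): the yield sits after the loop."""
    item = None
    for title in titles:
        item = {"title": title}  # overwritten on every pass
    if item is not None:
        yield item  # runs once, so only the last item survives


def yield_inside_loop(titles):
    """Yield inside the loop: one item per title."""
    for title in titles:
        yield {"title": title}


sample_titles = ["First", "Second", "Third"]  # hypothetical data
after = list(yield_after_loop(sample_titles))
inside = list(yield_inside_loop(sample_titles))
print(after)
print(inside)
```

The first function produces a single item regardless of how many titles it sees, while the second produces one item per title, which is the behaviour a Scrapy `parse` callback needs.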

Let me know if this works for you. Note that we are iterating over [i] instead of [1]. The results are stored in a list (hopefully).

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []  # list of every item built so far

    for i in xrange(15):  # 1 + i*2 gives a[1], a[3], ..., a[29]
        titles = hxs.select('//*[@id="holder_result2"]/a[' + str(1 + i * 2) + ']/span/span[1]/text()').extract()
        for title in titles:
            item = UsdirectoryItem()
            item["title"] = title
            items.append(item)
            yield item  # yield inside the loop so every item is emitted
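
As a quick sanity check of the index arithmetic, the sketch below builds the XPath strings without touching Scrapy. It assumes the 15 odd positions a[1] through a[29] described in the question; the helper name `title_xpaths` is hypothetical.

```python
XPATH_TEMPLATE = '//*[@id="holder_result2"]/a[{}]/span/span[1]/text()'


def title_xpaths(count=15):
    """Return one XPath per odd position: a[1], a[3], ..., a[2*count - 1]."""
    return [XPATH_TEMPLATE.format(1 + i * 2) for i in range(count)]


xpaths = title_xpaths()
print(xpaths[0])
print(xpaths[-1])
```

Printing the first and last entries makes it easy to confirm that the index is actually interpolated into the string and that the range stops at a[29].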

Source

2016-02-13 04:57:12 weezilla

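I'm getting a whole bunch of errors, but I'll play around with the code and see if I can get it to work. – dkeeper09

Please do not submit answers with untested code; an answer is useless if you are not sure what your own code does. 'for i in xrange(15)' does _not_ return '1, 3, 5...', and 'i' is not interpolated inside the XPath string. –

Thanks @Mathias-Müller. I didn't copy over part of my code. In my sleep-deprived state, I also somehow expected 'i' to be interpolated. dkeeper09: did you get something working? – weezilla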

Given that the pattern is exactly as you described, you can use the XPath modulo operator mod on the position index of a to get all the target a elements:

//*[@id="holder_result2"]/a[position() mod 2 = 1]/span/span[1]/text()

For a quick demo, consider the following input XML:

<div> 
<a>1</a> 
<a>2</a> 
<a>3</a> 
<a>4</a> 
<a>5</a> 
</div>

Given the XPath /div/a[position() mod 2 = 1], the following is returned:

<a>1</a> 
<a>3</a> 
<a>5</a>

See the live demo on xpathtester.com.
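
The same demo can be approximated in plain Python. Note the swap: the standard library's `xml.etree.ElementTree` only implements a limited XPath subset and does not handle the `mod` operator, so this sketch selects all `a` children and uses list slicing as a stand-in for the `position() mod 2 = 1` predicate. In Scrapy itself you would pass the XPath above straight to the selector instead.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<div>
<a>1</a>
<a>2</a>
<a>3</a>
<a>4</a>
<a>5</a>
</div>
"""

root = ET.fromstring(SAMPLE_XML.strip())
all_links = root.findall("a")  # direct <a> children, in document order
odd_links = all_links[::2]     # 0-based indexes 0, 2, 4 = XPath positions 1, 3, 5
odd_texts = [a.text for a in odd_links]
print(odd_texts)
```

The slice works here only because XPath positions are 1-based while Python indexes are 0-based, so every second element starting at index 0 corresponds to the odd XPath positions.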

Source

2016-02-13 04:59:33 har07

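OK, when I put in this XPath, it only grabs the last record, and nothing in between. Ideas? – dkeeper09

@dkeeper09 The problem is almost certainly that you are not showing your input document. –

OK, check the original post and see if that is what you are looking for. – dkeeper09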
