Why do I get empty output for my items in Scrapy?

I am new to Python and Scrapy. I am crawling some links to get the data I want, but when I generate the output, the item field I need is empty.

My items.py is as follows:

from scrapy.item import Item, Field


class CinemaItem(Item):
    url = Field()
    name = Field()

My cinema_spider.py is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import Selector 
from scrapy.selector import HtmlXPathSelector 
from cinema.items import CinemaItem 

class CinemaSpider(CrawlSpider): 
    name = "cinema" 
    allowed_domains = ["example.com"] 
    start_urls = [
        "http://www.example.com/?user=artists"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'/\?user=profile&detailid=\d+']), 'parse_cinema')]

    def parse_cinema(self, response):
        hxs = HtmlXPathSelector(response)
        cinema = CinemaItem()
        cinema['url'] = response.url
        cinema['name'] = hxs.select("//html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr/td/text()").extract()
        return cinema

When I run the following command:

scrapy crawl cinema -o scraped_data.json -t json

the output file has this content:

[{"url": "http://www.example.com/?detailid=218&user=profile", "name": []}, 
{"url": "http://www.example.com/?detailid=322&user=profile", "name": []}, 
{"url": "http://www.example.com/?detailid=219&user=profile", "name": []}, 
{"url": "http://www.example.com/?detailid=221&user=profile", "name": []}]

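As you can see, the name field is empty, although the values do exist: when I fetch the pages in the scrapy shell, I can get them. However, because the values are in Persian and are probably Unicode strings, the output in the shell is: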

[u'\u0631\u06cc\u062d\u0627\u0646\u0647 \u0628\u0627\u0642\u0631\u06cc \u0628\u0627\u06cc\u06af\u06cc']

I changed the spider code as follows to change the encoding of the item:

cinema['name'] = hxs.select("//html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr/td/text()").extract()[0].encode('utf-8')

But I got this error:

cinema['name'] = hxs.select("//html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr/td/text()").extract()[0].encode('utf-8') 
exceptions.IndexError: list index out of range

Then I reverted the change to my spider and, following this post, wrote my own pipelines.py to change the default value of ensure_ascii to False:

import json 
import codecs 

class CinemaPipeline(object):

    def __init__(self):
        self.file = codecs.open('scraped_data_utf8.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()

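However, the resulting output file was the same, with empty name fields.

I have read almost every post about this problem but could not solve it. What is wrong?

Edit:

A snippet of the HTML: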

<div class="content font-fa" style="margin-top:10px;"> 
    <div class="content-box"> 
     <div class="content-text" dir="rtl" style="width:240px;min-height:200px;text-align:center"><img src='../images/others/no-photo.jpg' ></div> 
     <div class="content-text" dir="rtl" style="width:450px;float:right;min-height:200px;" > 
      <div class="content-row" style="text-align:right;margin-right:0px;"> 
       <span class="FontsFa"> 
        <span align="right"> 
         <strong class="font-11 "> نام/نام خانوادگی : </strong> 
        </span> 
        <span class="large-title"> 
         <span class="bold font-13" style="color:#900;">ریحانه باقری بایگی 
         </span> 
        </span> 
       </span> 
      </div>

I want the text inside <span class="bold font-13" style="color:#900;">ریحانه باقری بایگی </span>

Asked 2014-02-26 by sepidfekr

It works now. I don't know what happened, but today, without any change to the code, the output is correct. – sepidfekr

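The problem is most likely that your XPath does not match the data you want.

hxs.select(...).extract() gives you an empty list, so when you try to change the encoding you are calling hxs.select(...).extract()[0] on that empty list, which raises an IndexError.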
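The failure mode described above can be reproduced without Scrapy. The following is a minimal sketch in plain Python, where the lists stand in for what extract() returns; first_or_default is a hypothetical helper introduced here for illustration, not part of Scrapy.

```python
def first_or_default(values, default=None):
    """Return the first element of a list, or `default` if the list is empty."""
    return values[0] if values else default


# Stand-in for extract() when the XPath matches nothing: an empty list.
no_match = []
# Stand-in for extract() in the scrapy shell session shown in the question.
match = ['\u0631\u06cc\u062d\u0627\u0646\u0647 \u0628\u0627\u0642\u0631\u06cc \u0628\u0627\u06cc\u06af\u06cc']

try:
    no_match[0]  # same operation as .extract()[0] on an empty result
except IndexError as exc:
    print("indexing an empty result fails:", exc)

# Guarded access never raises; it falls back to the default instead.
print(first_or_default(no_match))               # None
print(first_or_default(match) == match[0])      # True
```

A guard like this only hides the exception. An empty result still means the XPath matched nothing in the response the spider received.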

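How did you find the XPath? Did you test it inside the spider? Note that the HTML shown in a browser and the HTML that Scrapy receives may differ, usually because Scrapy does not execute JavaScript. As a general rule, you should always check that response.body is what you expect.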
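One way to check the raw body is to look for a piece of text you expect on the page. The sketch below uses plain byte strings as stand-ins for response.body; body_contains is a hypothetical helper, and it assumes the page is served as UTF-8, which the question does not confirm.

```python
def body_contains(body, marker, encoding="utf-8"):
    """Return True if the text `marker` occurs in the raw bytes `body`."""
    return marker.encode(encoding) in body


# First word of the name from the question, written with escapes.
expected = "\u0631\u06cc\u062d\u0627\u0646\u0647"

# Stand-ins for response.body: one page that contains the name as served,
# and one where the name would only be filled in later by JavaScript.
served = ('<span class="bold font-13">' + expected + '</span>').encode("utf-8")
js_only = b'<span class="bold font-13"></span><script>loadName()</script>'

print(body_contains(served, expected))   # True
print(body_contains(js_only, expected))  # False
```

Inside a callback such as parse_cinema, the same check would presumably be applied to response.body. If the marker is missing there, no XPath can find it.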

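Also, your XPath is very fragile because it uses absolute positions. Any change anywhere along the path breaks the whole expression. It is usually better to rely on an id or another distinctive feature (//td[@id="foobar"]).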
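The difference between a positional path and an attribute-based one can be sketched with the standard library alone. This example uses xml.etree.ElementTree as a stand-in for Scrapy's selectors so that it runs without Scrapy installed. The fragment is a simplified, well-formed version of the HTML in the question, and the choice of the large-title class as the anchor is an assumption about what is distinctive on the real page.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed fragment modelled on the HTML in the question.
FRAGMENT = """
<div class="content-row">
  <span class="FontsFa">
    <span class="large-title">
      <span class="bold font-13" style="color:#900;">ریحانه باقری بایگی
      </span>
    </span>
  </span>
</div>
"""

root = ET.fromstring(FRAGMENT)

# Relative path keyed on an attribute: it keeps matching no matter how many
# wrapper elements surround the span.
by_attribute = root.findall(".//span[@class='large-title']/span")

# Positional path that assumes table wrappers the fragment does not contain.
by_position = root.findall("./table/tbody/tr/td")

names = [element.text.strip() for element in by_attribute]
print(len(names), len(by_position))  # 1 0
```

In the spider, a comparable expression such as //span[@class="large-title"]/span/text() would presumably be passed to hxs.select(...) in place of the long positional path.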

Could you provide the relevant snippet of the HTML you are trying to parse?

Answered 2014-02-26 22:26:58 by Robin

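Thanks. As I said, this XPath works fine in the 'scrapy shell' and outputs the data I want, but as ASCII escape sequences, although when I 'print' it, it is displayed with the correct encoding. As for the fragility of my XPath, I am aware of it, but since the given HTML is not standard and the IDs are not unique, I ended up with this form of XPath. Anyway, I have added some of the HTML to my question. – sepidfekr

I found this XPath with the Firebug add-on. How can I test it inside my spider? – sepidfekr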

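Well, the IndexError is raised because 'hxs.select(...).extract()' is an empty list... and since there is no data there, changing the pipeline afterwards won't change anything. How did you retrieve this information in the shell? With the same code? By the way, to debug your code easily you should use the Python debugger, [pdb](http://docs.python.org/2/library/pdb.html) (or you can even install [ipdb](https://pypi.python.org/pypi/ipdb) for nice features such as completion) – Robin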
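The point about the pipeline can be illustrated with the json module on its own. This is a minimal sketch in Python 3, using the first word of the name from the question as sample data: ensure_ascii only changes how existing text is written, so it cannot fill in a list that is already empty.

```python
import json

# First word of the name from the question, written with escapes.
name = "\u0631\u06cc\u062d\u0627\u0646\u0647"

escaped = json.dumps({"name": [name]})                        # default: ensure_ascii=True
readable = json.dumps({"name": [name]}, ensure_ascii=False)   # raw Persian characters

# Both strings decode to the same data; only the written form differs.
print(json.loads(escaped) == json.loads(readable))  # True
print(escaped)  # {"name": ["\u0631\u06cc\u062d\u0627\u0646\u0647"]}

# With an empty list there is nothing to encode, so the flag has no effect.
print(json.dumps({"name": []}, ensure_ascii=False))  # {"name": []}
```

The \uXXXX sequences seen in the shell are the same kind of escaped representation. They are not a sign that the data is wrong.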
