2016-08-19

Scrapy: XPath error: Invalid expression in //media:content

I want to extract content from the items of a news site's RSS feed, such as the following:

<item> 
<title>BPS: Kartu Bansos Bantu Turunkan Angka Gini Ratio</title> 
<media:content url="/image.jpg" expression="full" type="image/jpeg"/>
</item>

But an error is raised when I try to parse information from a namespaced tag such as media, using xpath('//media:content'):

Traceback (most recent call last): 
    File "<console>", line 1, in <module> 
    File "/usr/local/lib/python2.7/site-packages/parsel/selector.py", line 183, in xpath 
    six.reraise(ValueError, ValueError(msg), sys.exc_info()[2]) 
    File "/usr/local/lib/python2.7/site-packages/parsel/selector.py", line 179, in xpath 
    smart_strings=self._lxml_smart_strings) 
    File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57923) 
    File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:167084) 
    File "src/lxml/xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:166043) 
ValueError: XPath error: Undefined namespace prefix in //media:content 

Does anyone know how I should use XPath, for example item.xpath, to get at this content? Thanks :)

Answers

4

You need to tell the selector which namespace the media prefix in your XPath maps to, by first calling register_namespace(prefix, namespace) on the selector, for example:

selector.register_namespace('media', 'http://the.namespace.of/media') 

Or, if you only want to match on the local name and ignore the namespace, you can use:

item.xpath("//*[local-name()='content']") 
+0

The `.xpath()` method of Scrapy selectors does not accept a namespaces argument the way `lxml`'s does (but there is an [open PR](https://github.com/scrapy/parsel/pull/45) for this). You have to call [`.register_namespace(prefix, namespace)`](https://parsel.readthedocs.io/en/latest/usage.html#parsel.selector.Selector.register_namespace) on the selector beforehand.
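For comparison, this is roughly what the per-call `namespaces` argument looks like when using `lxml` directly instead of a Scrapy selector. It is a sketch under the same assumptions as the question's fragment: the `<rss>` wrapper and the Media RSS namespace URI are not taken from the original feed.

```python
from lxml import etree

# Hypothetical feed; the namespace URI is assumed, not taken from the question.
FEED = b"""<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <media:content url="/image.jpg" expression="full" type="image/jpeg"/>
    </item>
  </channel>
</rss>"""

root = etree.fromstring(FEED)

# lxml takes the prefix-to-URI mapping on each xpath() call.
lxml_urls = root.xpath(
    "//media:content/@url",
    namespaces={"media": "http://search.yahoo.com/mrss/"},
)

print(lxml_urls)
```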

+0

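@paultrmbrth thx, I didn't realize this wasn't lxml's xpath(); I should have looked more closely at the stack trace... Thanks for the reference, I've corrected my answer – mata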

+0

Thanks @mata, it works~~ – NGloom