2016-05-16 80 views
1

lxml.etree.XPathEvalError: Invalid expression

I'm getting a Python error that I can't make sense of. I've stripped my code down to the bare minimum:

response = requests.get('http://pycoders.com/archive') 
tree = html.fromstring(response.text) 
r = tree.xpath('//divass="campaign"]/a/@href') 
print(r) 

and I still get the error:

Traceback (most recent call last):
  File "ultimate-1.py", line 17, in <module>
    r = tree.xpath('//divass="campaign"]/a/@href')
  File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
  File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
  File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
  File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression

Does anyone have an idea where the problem comes from? Could it be a dependency issue? Thanks.

Answers

1

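The expression '//divass="campaign"]/a/@href' is not syntactically correct (the closing ] has no matching opening [) and doesn't make much sense. You meant to check the class attribute instead: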

//div[@class="campaign"]/a/@href 
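To see the difference without hitting the network, here is a minimal sketch that runs both expressions against a small inline document. The markup and the example.com URL are made up for illustration and are not taken from the real page; the broad etree.XPathError base class is caught so the sketch does not depend on the exact exception subclass.

```python
from lxml import etree, html

# Hypothetical markup standing in for the real page.
doc = html.fromstring(
    '<div><div class="campaign">'
    '<a href="http://example.com/1">Issue 1</a>'
    '</div></div>'
)

# The malformed expression from the question: lxml rejects it.
error_name = None
try:
    doc.xpath('//divass="campaign"]/a/@href')
except etree.XPathError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# The corrected expression uses a predicate on the class attribute.
links = doc.xpath('//div[@class="campaign"]/a/@href')
print(links)
```

Note that [@class="campaign"] only matches when the attribute value is exactly "campaign"; an element with several classes would need contains(@class, "campaign") or similar.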

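That gets rid of the Invalid expression error, but the expression will still match nothing, because the data is not in the response that requests receives. You need to mimic what the browser does to get the desired data: make an additional request to fetch the JavaScript file that contains the campaigns.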

Here is what works for me:

import ast 
import re 

import requests 
from lxml import html 

with requests.Session() as session: 
    # extract script url 
    response = session.get('http://pycoders.com/archive') 
    tree = html.fromstring(response.text) 
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0] 

    # get the script 
    response = session.get(script_url) 
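    # note: on Python 3, response.content is bytes and must be decoded
    # (or use response.text) before matching it against a str pattern
    data = ast.literal_eval(re.match(r'document.write\((.*?)\);$', response.content).group(1)) 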

    # extract the desired data 
    tree = html.fromstring(data) 
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')] 
    print(campaigns) 

Prints:

['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140', 
... 
'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481' 
] 
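The step that unwraps the document.write(...) call can be tried offline. The sketch below runs the same regex-plus-ast.literal_eval idea on a hard-coded string; sample_script is a hypothetical stand-in, and the file the site actually serves may be shaped differently (for instance, it may contain extra backslash escapes, which is presumably why the code above strips backslashes from each href).

```python
import ast
import re

# Hypothetical script body; an assumption, not the real file contents.
sample_script = (
    r'document.write("<div class=\"campaign\">'
    r'<a href=\"http://example.com/1\">Issue 1</a></div>");'
)

# Capture the quoted argument passed to document.write(...).
match = re.match(r'document\.write\((.*?)\);$', sample_script)

# The captured text is a quoted string literal; literal_eval turns it
# into a plain Python string and resolves the \" escapes.
html_fragment = ast.literal_eval(match.group(1))
print(html_fragment)
```

The resulting string is ordinary HTML, so it can be passed to html.fromstring and queried with the class-attribute XPath as in the answer's code.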

Thanks! I had to do response.content.decode('utf-8') to make it work. – Bastien

0

Your XPath is written incorrectly. If you want to get all the hrefs, your XPath should look like this:

hrefs = tree.xpath('//div[@class="campaign"]/a') 
for href in hrefs: 
    print(href.get('href')) 

Or in one line:

hrefs = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]