Python lxml xpath返回无输出

我尝试使用Python中的lxml在网站上刮取特定元素。下面你可以找到我的代码，但没有输出。Python lxml xpath返回无输出

from lxml import html 

    webpage = 'http://www.funda.nl/koop/heel-nederland/' 
    page = requests.get(webpage) 
    tree = html.fromstring(page.content) 

    content = '//*[@id="content"]/form/div[2]/div[5]/div/a[8]/text()' 
    content = str(tree.xpath(content)) 
    print content

来源

2017-04-27 Nweijden

看起来你正试图报废的网站并不喜欢被废弃。他们利用各种技术来检测请求是来自合法用户还是来自bot，并且如果他们认为它来自bot，则阻止访问。这就是为什么你的xpath找不到任何东西，这就是为什么你应该重新考虑你正在做的事情。

如果您决定要继续，那么欺骗这个特定网站的最简单方法似乎是将cookies添加到您的请求中。

首先，使用你真正的浏览器获得cookie字符串：

打开新的标签页
开放开发工具
转到开发工具
如果网络选项卡是空的“网络”选项卡，刷新页面
查找请求到heel-nederland/并点击它
在请求标题中，你会发现cookie字符串 - 它很长，并且包含许多看似随机的字符。它复制

然后，修改程序中使用这些cookie：

import requests 
from lxml import html 

webpage = 'http://www.funda.nl/koop/heel-nederland/' 
headers = { 
     'Cookie': '<string copied from browser>' 
     } 
page = requests.get(webpage, headers=headers) 
tree = html.fromstring(page.content) 

selector = '//*[@id="content"]/form/div[2]/div[5]/div/a[8]/text()' 
content = str(tree.xpath(selector)) 
print content

来源

2017-04-29 21:05:19

Python lxml xpath返回无输出

回答

相关问题