2017-01-09 104 views

How do I get an absolute URL from XPath?

I use the following code to get the URL of an item:

node.xpath('//td/a[starts-with(text(),"itunes")]')[0].attrib['href'] 

It gives me this:

itunes20170107.tbz 

However, I would like to get the full URL, which is:

https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/itunes20170109.tbz 

Is there a simple way to get the full URL from lxml without building it myself?

Answers

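lxml.html returns the href exactly as it is written in the HTML, so a relative link stays relative. If you want an absolute link instead, join it with the URL of the page using urljoin():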

from urllib.parse import urljoin # Python3 
# from urlparse import urljoin # Python2 

# URL of the page that was parsed. The trailing slash matters: without it,
# urljoin() treats "current" as a file name and replaces it.
url = "https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/" 

relative_url = node.xpath('//td/a[starts-with(text(),"itunes")]')[0].attrib['href'] 
absolute_url = urljoin(url, relative_url) 

Demo:

>>> from urllib.parse import urljoin # Python3 
>>> 
>>> url = "https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/" 
>>> 
>>> relative_url = "itunes20170107.tbz" 
>>> absolute_url = urljoin(url, relative_url) 
>>> absolute_url 
'https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/itunes20170107.tbz' 
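
One detail worth checking when using urljoin() is whether the base URL ends with a slash, because that decides whether the last path segment is kept or replaced. The sketch below illustrates this with a made-up base on example.com (the host and paths are assumptions for illustration, not the real feed URL):

```python
from urllib.parse import urljoin

# Hypothetical base URLs, used only to illustrate the behaviour.
base_without_slash = "https://example.com/feeds/current"
base_with_slash = "https://example.com/feeds/current/"
name = "itunes20170107.tbz"

# No trailing slash: "current" is treated as a file and gets replaced.
without = urljoin(base_without_slash, name)

# Trailing slash: "current/" is treated as a directory and is kept.
with_slash = urljoin(base_with_slash, name)

# A reference starting with "/" is resolved against the site root.
rooted = urljoin(base_with_slash, "/other/file.tbz")

print(without)     # https://example.com/feeds/itunes20170107.tbz
print(with_slash)  # https://example.com/feeds/current/itunes20170107.tbz
print(rooted)      # https://example.com/other/file.tbz
```

If the joined URL is missing a directory you expected, a missing trailing slash on the base is a likely cause.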

Another way to do this:

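import requests 
from lxml.html import fromstring 

url = 'http://server.com' 
response = requests.get(url) 

# Parse the page, then rewrite every link in the tree in place,
# resolving each one against the page URL.
doc = fromstring(response.text) 
doc.make_links_absolute(url) 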
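
To try this approach without a network request, here is a minimal sketch that parses an inline HTML string instead of a downloaded page. The HTML snippet is invented for illustration, and the base URL is assumed to be the directory from the question with a trailing slash; it requires lxml to be installed.

```python
from lxml.html import fromstring

# Invented stand-in for the downloaded directory listing.
HTML = """
<html><body><table>
  <tr><td><a href="itunes20170107.tbz">itunes20170107.tbz</a></td></tr>
  <tr><td><a href="readme.txt">readme.txt</a></td></tr>
</table></body></html>
"""

# Assumed base URL: the page the HTML would have been fetched from.
base_url = "https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/"

doc = fromstring(HTML)
doc.make_links_absolute(base_url)  # rewrites href attributes in place

# The XPath from the question now yields absolute URLs.
hrefs = [a.get("href") for a in doc.xpath('//td/a[starts-with(text(),"itunes")]')]
print(hrefs)
```

After make_links_absolute() has run, every later XPath query on the same tree sees absolute links, so there is no need to call urljoin() for each result.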