Mechanize, urllib, BeautifulSoup and relative paths

Do mechanize, urllib, or BeautifulSoup have any built-in way to handle crawling a website that mixes absolute and relative URLs?

One workaround is to handle a lot of special cases by string concatenation:

'http://' + 'www.stackoverflow.com' 
'http://www.stackoverflow.com' + '/questions/ask'

Is there a better option?

Source

2012-06-12 user642897

For the record, this is my solution :)

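import re

# Scheme plus host of an absolute URL, e.g. 'http://www.stackoverflow.com'.
absolute = re.compile(r'(https?://[^/]+)')

# base_url (the page the link was found on) is an assumed name; taking the
# domain from url itself raises AttributeError whenever url is relative.
domain = absolute.match(base_url.strip()).group(1)
url = url.strip()

if url.startswith('mailto:'):
    u = None  # not a crawlable link
elif absolute.match(url):
    u = url  # already absolute
elif url.startswith('/'):
    u = domain + url  # root-relative path
else:
    u = domain + '/' + url  # treated as relative to the site root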
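
As an alternative to the regex approach, the standard library's urllib.parse.urljoin resolves a link against the URL of the page it was found on, which covers absolute, root-relative, and page-relative links in one call. The sketch below is only an illustration, not part of the original solution: resolve_link and the example URLs are hypothetical names, and it assumes Python 3 (in Python 2 the same function lives in the urlparse module).

```python
from urllib.parse import urljoin, urlparse


def resolve_link(base_url, href):
    """Return an absolute http(s) URL for href, or None if it is not crawlable.

    base_url is the URL of the page the link was found on (assumed name).
    """
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    absolute = urljoin(base_url, href.strip())
    # Skip non-web schemes such as mailto: or javascript:.
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


if __name__ == "__main__":
    base = "http://www.stackoverflow.com/questions/ask"
    links = ["/questions", "tagged/python", "http://example.com/a",
             "mailto:someone@example.com"]
    for href in links:
        print(href, "->", resolve_link(base, href))
```

Unlike the concatenation above, urljoin resolves a link such as 'tagged/python' relative to the directory of the current page rather than the site root, which matches how browsers treat it.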

Source

2012-06-13 09:33:13 user642897
