How to recursively find all links from a webpage with BeautifulSoup?

I have been trying to use some code that I found in this answer to recursively find all the links from a given URL:

import urllib2
from bs4 import BeautifulSoup

url = "http://francaisauthentique.libsyn.com/"

def recursiveUrl(url,depth):

    if depth == 5:
        return url
    else:
        page=urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a') #find just the first one
        if len(newlink) == 0:
            return url
        else:
            return url, recursiveUrl(newlink,depth+1)


def getLinks(url):
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(link,0))
    return links

links = getLinks(url)
print(links)

Apart from this warning:

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. 

The code that caused this warning is on line 28 of the file downloader.py. To get rid of this warning, change code that looks like this: 

BeautifulSoup(YOUR_MARKUP}) 

to this: 

BeautifulSoup(YOUR_MARKUP, "lxml")

I get the following error:

Traceback (most recent call last): 
    File "downloader.py", line 28, in <module> 
    links = getLinks(url) 
    File "downloader.py", line 25, in getLinks 
    links.append(recursiveUrl(link,0)) 
    File "downloader.py", line 11, in recursiveUrl 
    page=urllib2.urlopen(url) 
    File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen 
    return _opener.open(url, data, timeout) 
    File "/usr/lib/python2.7/urllib2.py", line 396, in open 
    protocol = req.get_type() 
TypeError: 'NoneType' object is not callable

What is the problem?

Source

2017-10-08 Alex

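I think you are passing a BeautifulSoup object to `urlopen` instead of a URL. Try something like `link['href']`, but be sure to check that it exists first. – Thomas

Thanks Thomas, but now I get the error "ValueError: unknown url type: /webpage/category/general". Maybe because it is a relative link rather than an absolute one? – Alex

@Alex Correct :) –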
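
The suggestion in the comments (read `link['href']` and check that it exists) could look roughly like the sketch below. `soup.find_all('a')` yields Tag objects rather than strings, so the href has to be read from each tag, and tags without one skipped. The markup and the helper name `extract_hrefs` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<a href="/webpage/category/general">General</a>
<a name="top">No href here</a>
<a href="http://example.com/about">About</a>
"""

def extract_hrefs(markup):
    """Return the href strings of all <a> tags that have one."""
    soup = BeautifulSoup(markup, "html.parser")
    hrefs = []
    for tag in soup.find_all("a"):
        href = tag.get("href")  # None when the attribute is missing
        if href:
            hrefs.append(href)
    return hrefs

print(extract_hrefs(html))
```

Passing one of these strings to the downloader, instead of the tag itself, avoids the `TypeError` in the traceback; relative values still need to be turned into absolute URLs first.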

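Your recursiveUrl tries to open an invalid URL such as /webpage/category/general, which is the value extracted from one of the href attributes.

You should append the extracted href value to the site's URL and only then try to open the page. You will need to work on the recursion algorithm yourself, since I don't know what you want to achieve.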
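
As a side note to the advice above, plain string concatenation only works for some hrefs. One possible alternative, not used in this answer, is `urljoin` from the standard library, which resolves an href against the page it was found on. The sketch assumes Python 3 (in Python 2 the function lives in the `urlparse` module), and `example.com` is a placeholder.

```python
from urllib.parse import urljoin

base = "http://francaisauthentique.libsyn.com/"

# Root-relative href: resolved against the site root, no doubled slash.
general = urljoin(base, "/webpage/category/general")
# Already absolute href: returned unchanged (example.com is a placeholder).
external = urljoin(base, "http://example.com/other")
# Page-relative href: resolved against the directory of the base URL.
month = urljoin(base + "webpage/2017/", "10")

print(general)
print(external)
print(month)
```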

Code:

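import requests
from bs4 import BeautifulSoup

def recursiveUrl(url, link, depth):
    # Stop at a fixed depth so the recursion always terminates.
    if depth == 5:
        return url
    else:
        # The href is relative (e.g. /webpage/category/general), so prefix the site URL.
        target = url + link['href']
        print(target)
        page = requests.get(target)
        soup = BeautifulSoup(page.text, 'html.parser')
        newlink = soup.find('a')  # only the first link on the page
        # find() returns None when nothing matches, so test for None instead of len().
        if newlink is None:
            return link
        else:
            return link, recursiveUrl(url, newlink, depth + 1)

def getLinks(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    # Collect results separately; appending to `links` while iterating over it
    # would feed the results back into the loop.
    results = []
    for link in links:
        results.append(recursiveUrl(url, link, 0))
    return results

links = getLinks("http://francaisauthentique.libsyn.com/")
print(links)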

Output:

http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/10 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/09 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/08 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/07 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general
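
The output above shows the same category page being requested again and again, because each call follows only the first link on a page. If the goal is to collect every link, one possible approach is to follow all links on each page and remember which URLs were already visited. This is only a sketch under assumptions: the names `crawl` and `fetch_html`, the `max_depth` default, the 10-second timeout, and the decision to stay on the starting host are all choices made here, not part of the answer. Python 3 is assumed.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML, or '' if the request fails."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return ""


def crawl(url, max_depth=2, fetch=fetch_html, seen=None):
    """Follow same-host links recursively, visiting each URL at most once.

    A page first reached at the depth limit is recorded but not expanded.
    """
    if seen is None:
        seen = set()
    if url in seen:
        return seen
    seen.add(url)
    if max_depth == 0:
        return seen
    soup = BeautifulSoup(fetch(url), "html.parser")
    for tag in soup.find_all("a", href=True):
        # Resolve relative hrefs against the current page and drop #fragments.
        target = urljoin(url, tag["href"]).split("#")[0]
        # Stay on the same host so the crawl does not wander off-site.
        if urlparse(target).netloc == urlparse(url).netloc:
            crawl(target, max_depth - 1, fetch, seen)
    return seen
```

Calling `crawl("http://francaisauthentique.libsyn.com/", max_depth=1)` would return the start page plus the same-host pages it links to. The `fetch` parameter exists so that the download step can be swapped out, for example with a dictionary lookup when trying the function without network access.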

Source

2017-10-08 17:53:40 Ali
