I have been experimenting with some code I found in this answer that recursively finds all links from a given URL: How to recursively find all links from a webpage with BeautifulSoup?
import urllib2
from bs4 import BeautifulSoup

url = "http://francaisauthentique.libsyn.com/"

def recursiveUrl(url, depth):
    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a')  # find just the first one
        if len(newlink) == 0:
            return url
        else:
            return url, recursiveUrl(newlink, depth + 1)

def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(link, 0))
    return links

links = getLinks(url)
print(links)
It also prints this warning:
/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 28 of the file downloader.py. To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "lxml")
And I get the following error:
Traceback (most recent call last):
  File "downloader.py", line 28, in <module>
    links = getLinks(url)
  File "downloader.py", line 25, in getLinks
    links.append(recursiveUrl(link, 0))
  File "downloader.py", line 11, in recursiveUrl
    page = urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 396, in open
    protocol = req.get_type()
TypeError: 'NoneType' object is not callable
What is the problem?
I think you are passing a BeautifulSoup Tag object to 'urlopen' instead of a URL string. Try something like 'link['href']', but be sure to check that it exists first. – Thomas
Thanks Thomas, but now I get the error "ValueError: unknown url type: /webpage/categery/general". Maybe that is because it is a relative link rather than an absolute one? – Alex
@Alex Correct :) –
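Putting both comments together, the two fixes are: pass the string in the tag's href attribute to urlopen (not the Tag object itself), and resolve relative links such as /webpage/categery/general against the page's base URL. A minimal Python 3 sketch of just the link-extraction step, using only the standard library (html.parser stands in for BeautifulSoup here so the snippet is self-contained; the sample HTML is made up for illustration):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://francaisauthentique.libsyn.com/"

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, resolved to absolute URLs."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # only keep the href string, and check it exists first
                if name == "href" and value:
                    # urljoin turns a relative link into an absolute one;
                    # absolute links pass through unchanged
                    self.hrefs.append(urljoin(self.base, value))

parser = LinkCollector(BASE)
parser.feed('<a href="/webpage/categery/general">General</a>'
            '<a href="http://example.com/page">Other</a>')
print(parser.hrefs)
# → ['http://francaisauthentique.libsyn.com/webpage/categery/general',
#    'http://example.com/page']
```

With BeautifulSoup the same idea would be link.get('href') followed by urljoin before calling urlopen; that avoids both the TypeError and the ValueError above.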