Trying to grab absolute links from a webpage using BeautifulSoup

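I am using BeautifulSoup to read the contents of a web page. All I want is to grab the <a href> values that start with http://. I know that in BeautifulSoup you can search by attributes, so I guess I just have a syntax issue. I would imagine it would look something like this.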

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.linkpages.com")
soup = BeautifulSoup(page)
for link in soup.findAll('a'):
    if link['href'].startswith('http://'):
        print link

But that returns:

Traceback (most recent call last): 
    File "<stdin>", line 2, in <module> 
    File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__ 
    return self._getAttrMap()[key] 
KeyError: 'href'

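Any ideas? Thanks in advance.

EDIT: This isn't for any site in particular. The script gets the URL from the user, so internal link targets would be an issue, which is also why I only want the <'a'> from the pages. If I point it at www.reddit.com, it parses the first few links and then gets to this: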

<a href="http://www.reddit.com/top/">top</a> 
<a href="http://www.reddit.com/saved/">saved</a> 
Traceback (most recent call last): 
    File "<stdin>", line 2, in <module> 
    File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__ 
    return self._getAttrMap()[key] 
KeyError: 'href'

Source

2010-03-23 Kevin

reddit.com has this: . So it is not a syntax error, it is the API. – 2010-03-23 18:48:22

from BeautifulSoup import BeautifulSoup 
import re 
import urllib2 

page = urllib2.urlopen("http://www.linkpages.com") 
soup = BeautifulSoup(page) 
for link in soup.findAll('a', attrs={'href': re.compile("^http://")}): 
    print link
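
For readers on Python 3 with the bs4 package, the same attribute filter might look like the sketch below. The inline sample markup and the helper name `absolute_links` are made up for illustration, and the pattern is widened to `^https?://` on the assumption that https links should count as absolute too.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup; the second anchor has no href at all.
SAMPLE_HTML = """
<a href="http://www.reddit.com/top/">top</a>
<a name="content"></a>
<a href="/saved/">saved</a>
<a href="https://example.com/">secure</a>
"""


def absolute_links(html):
    """Return href values that start with http:// or https://."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex as the attribute value matches only tags whose
    # href exists and satisfies the pattern.
    anchors = soup.find_all("a", href=re.compile(r"^https?://"))
    return [a["href"] for a in anchors]


print(absolute_links(SAMPLE_HTML))
```

Because the filter is applied during the search, anchors without an href never reach the loop body, so no KeyError can be raised there.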

Source

2010-03-23 17:38:05

Maybe you have some <a> tags without an href attribute? Internal link targets, perhaps?
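
If that is the cause, one way to keep the original loop is to read the attribute with `get`, which returns None for a tag that lacks it instead of raising KeyError. A minimal sketch, assuming Python 3 and the bs4 package; the sample markup and the function name `http_links` are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup: the middle anchor is a link target without an href.
SAMPLE_HTML = (
    '<a href="http://www.reddit.com/top/">top</a>'
    '<a name="content"></a>'
    '<a href="/saved/">saved</a>'
)


def http_links(html):
    """Collect hrefs starting with http://, skipping anchors with no href."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for link in soup.find_all("a"):
        href = link.get("href")  # None when the attribute is absent
        if href is not None and href.startswith("http://"):
            found.append(href)
    return found


print(http_links(SAMPLE_HTML))
```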

Source

2010-03-23 17:25:14

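Please give us an idea of what you're parsing here - as Andrew points out, it seems likely that there are some anchor tags without associated hrefs.

If you really want to ignore them, you could wrap it in a try block and subsequently catch with

except KeyError: pass

But that has its own issues.

Source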

2010-03-23 17:32:14

import urllib2
from bs4 import BeautifulSoup

f = open('Links.txt', 'w')
url = 'http://www.redit.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
atags = soup.find_all('a')

# Write each anchor's href, followed by a comma, on its own line.
for item in atags:
    for x in item.attrs:
        if x == 'href':
            f.write(item.attrs[x] + ',\n')
        else:
            continue
f.close()

A less efficient solution.
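
A possibly simpler variant, sketched for Python 3 and bs4: passing `href=True` to `find_all` asks for only the anchors that carry the attribute, so the inner loop over `attrs` is not needed. The function name `write_hrefs`, the sample markup and the in-memory file are assumptions for illustration.

```python
import io

from bs4 import BeautifulSoup


def write_hrefs(html, out):
    """Write every href found in html to the file-like object out."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True matches any <a> that has an href, whatever its value.
    for tag in soup.find_all("a", href=True):
        out.write(tag["href"] + ",\n")


# Hypothetical input; io.StringIO stands in for open('Links.txt', 'w').
buffer = io.StringIO()
write_hrefs('<a href="/a">a</a><a name="x"></a><a href="http://b/">b</a>', buffer)
print(buffer.getvalue())
```

Like the code above, this writes every href rather than only the absolute ones, so a prefix check would still be needed to answer the original question.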

Source

2013-02-16 00:16:09 Alex
