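How to get the third link from BeautifulSoup results

I am using the code below to retrieve a bunch of links with BeautifulSoup. It returns all of the links, but I want to get the third link, parse that page, then get the third link there, and so on. How can I modify the code below to do that?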

import urllib 
from BeautifulSoup import * 

url = raw_input('Enter - ') 
html = urllib.urlopen(url).read() 
soup = BeautifulSoup(html) 

# Retrieve all of the anchor tags 
tags = soup('a') 
for tag in tags: 
    print tag.get('href', None) 
    print tag.contents[0]


2016-06-12 martinbshp

First, you should stop using BeautifulSoup version 3. It is quite old and no longer maintained. Switch to BeautifulSoup version 4. Install it with:

pip install beautifulsoup4

and change your import to:

from bs4 import BeautifulSoup

Then use find_all() and take the third link by index, repeating until you reach a page that has no third link. Here is one way to do it:

import urllib 
from bs4 import BeautifulSoup 

url = raw_input('Enter - ') 

while True: 
    html = urllib.urlopen(url) 
    soup = BeautifulSoup(html, "html.parser") 

    try:
        url = soup.find_all('a')[2]["href"]
        # if the link is not absolute, you might need `urljoin()` here
    except IndexError:
        break  # could not get the 3rd link - exiting the loop
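The `urljoin()` remark and the exit condition can be sketched roughly as follows. This is only an illustration written for Python 3 (an assumption, since the code above is Python 2). The names `follow_third_links`, `fetch` and `max_hops` are made up for this sketch, and unlike the loop above it skips anchors without an href and stops on a repeated URL or after a fixed number of hops.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def follow_third_links(start_url, fetch, max_hops=10):
    """Follow the third <a href> on each page and return the URLs visited.

    `fetch` is a hypothetical callable that takes a URL and returns its HTML.
    """
    visited = [start_url]
    url = start_url
    for _ in range(max_hops):
        soup = BeautifulSoup(fetch(url), "html.parser")
        links = soup.find_all("a", href=True)  # ignore anchors without href
        if len(links) < 3:
            break  # no third link on this page
        # Resolve a relative href against the page it was found on.
        url = urljoin(url, links[2]["href"])
        if url in visited:
            break  # already seen this page, stop instead of looping forever
        visited.append(url)
    return visited
```

For real pages, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`; passing it in as a parameter also makes it possible to try the function on canned HTML without touching the network.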


2016-06-12 12:17:30 alecxe

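Thank you for the reply, alecxe. In my code above, 'tags = soup('a')' returns a list, and when the 'print' runs I get many links, so it seems to give me all of the links without using 'find_all'. That is why I thought I couldn't simply print 'tags[2]', which I assumed would be the third link in the loop's iteration. – martinbshp

@martinbshp yes, 'soup()' is a shortcut for 'soup.find_all()'. And yes, you need to get the 'href' attribute value, as shown in the answer. – alecxe

Oh, I see it now. Your response prompted me to go back and rethink this, and I realized that 'tags' is what I need to index, not the loop variable in the for loop. Thank you. – martinbshp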
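What these comments settle can be checked with a small sketch (Python 3 with bs4 assumed; the HTML string and the example.com URLs are made up for illustration): calling the soup object gives the same list as find_all(), and the third link is reached by indexing that list.

```python
from bs4 import BeautifulSoup

# Made-up HTML, for illustration only.
html = """
<a href="http://example.com/one">one</a>
<a href="http://example.com/two">two</a>
<a href="http://example.com/three">three</a>
"""
soup = BeautifulSoup(html, "html.parser")

tags = soup("a")           # calling the soup is shorthand for find_all("a")
same = soup.find_all("a")  # the explicit spelling of the same search

third = tags[2]             # index the list itself; no for loop needed
third_href = third["href"]  # the attribute value, not the whole tag
print(third_href)
```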

Another option is to use a CSS selector with nth-of-type to get the third anchor, looping until the selector returns None:

import urllib 
from bs4 import BeautifulSoup 

url = raw_input('Enter - ') 
html = urllib.urlopen(url) 
soup = BeautifulSoup(html, "html.parser") 
a = soup.select_one("a:nth-of-type(3)") 
while a: 
    html = urllib.urlopen(a["href"]) 
    soup = BeautifulSoup(html, "html.parser") 
    a = soup.select_one("a:nth-of-type(3)")

If you want the third anchor to match only when it has an href attribute, you can use "a:nth-of-type(3)[href]".
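One caveat worth checking before relying on this selector: nth-of-type counts an element among its siblings of the same type, not across the whole document, so "a:nth-of-type(3)" can give a different result from find_all('a')[2] when the links sit in different parent elements. A minimal sketch (Python 3 with bs4 assumed; the HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML: the links are spread over separate paragraphs.
html = """
<p><a href="/one">one</a></p>
<p><a href="/two">two</a></p>
<p><a>no href</a> <a href="/three">three</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Sibling-relative: no <a> here is the third <a> inside its own parent,
# so the selector finds nothing.
by_selector = soup.select_one("a:nth-of-type(3)")

# Document order: the third <a> that has an href, wherever it sits.
with_href = soup.select("a[href]")
by_index = with_href[2]["href"] if len(with_href) >= 3 else None

print(by_selector, by_index)
```

On pages where all the links share one parent the two approaches agree; otherwise selecting "a[href]" and indexing the result is closer to "the third link on the page".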


2016-06-12 17:40:28
