BeautifulSoup: get all links after a given tag

I am trying to use BeautifulSoup to scrape the following pages (e.g. 1, 2) to get the list of actions for traveling from one place in Bangkok to another.

Basically, I can query the page and select the description of the trip as follows.

import requests
from bs4 import BeautifulSoup

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
descriptions = soup_route.find('div', attrs={'id': 'routeDescription'})

The HTML of descriptions looks like the following:

<div id="routeDescription"> 
... 
<br/> 
<img src="/images/walk_icon_small.PNG" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Walk by foot to <b>Sanam Luang</b> 
<br/> 
<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/> 
... 
</div>

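Basically, I am trying to get a list of actions together with the bus lines used to travel to the next location (the question was updated after an answer, but it is still not solved).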

route_descrtions = []
for description in descriptions.find_all('img'):
    action = description.next_sibling
    to_station = action.next_sibling
    n = action.find_next_siblings('a')
    if 'travel' in action.lower():
        lines = [to_station.find_next('b').text] + [a.contents[0] for a in n]
    else:
        lines = []
    desp = {'action': action,
            'to': to_station.text,
            'lines': lines}
    route_descrtions.append(desp)

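However, I don't know how to loop through only the links that follow each action (the Travel to actions) and append them to my list. I tried find_next('a') and find_next_siblings('a'), but neither did what I need.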

Output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat', '40', '48', '501', '508'], 
    'to': 'Si Phraya'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'}, 
{'action': 'Travel to ', 
    'lines': ['16', '40', '48', '501', '508'], 
    'to': 'Siam'}, 
{'action': 'Travel to ', 
    'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'], 
    'to': 'Asok'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]

Desired output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat'], 
...

Source

2017-04-09 titipata

The following should work:

from bs4 import BeautifulSoup
import requests
import pprint

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
routes = soup_route.find('div', attrs={'id': 'routeDescription'})

parsed_routes = list()
for img in routes.find_all('img'):
    # The text node after the icon is the action; the <b> after it is the station.
    action = img.next_sibling
    to_station = action.next_sibling

    # Collect sibling links, but stop at the next icon (the start of the next step).
    links = list()
    for sibling in img.next_siblings:
        if sibling.name == 'a':
            links.append(sibling)
        elif sibling.name == 'img':
            break

    lines = list()
    if 'travel' in action.lower():
        # The first line is wrapped in its own <b>, so it is not a sibling link.
        lines.append(to_station.find_next('b').text)
        lines.extend([link.contents[0] for link in links])

    parsed_route = {'action': action, 'to': to_station.text, 'lines': lines}
    parsed_routes.append(parsed_route)

pprint.pprint(parsed_routes)

This outputs:

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat'], 
    'to': 'Si Phraya'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'}, 
{'action': 'Travel to ', 'lines': ['16'], 'to': 'Siam'}, 
{'action': 'Travel to ', 
    'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'], 
    'to': 'Asok'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]

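Your key problem is n = action.find_next_siblings('a'), because it returns every link that comes after your "current" image at the same level. Since all the images and all the links sit at the same level, that is not what you want.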

You were probably thinking of the images as parent nodes of the links. Something like:

img1
- link1
img2
- link2
img3
- link3
- link4
- link5

In reality, however, it looks more like this:

img1
link1
img2
link2
img3
link3
link4
link5

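When you asked for the images, you got img1, img2 and img3 (in this example). When you asked for all the next link siblings, you got exactly that. So if you are at img2 and ask for the next link siblings, you get all of them, i.e.

img1
link1
img2 < you are here, and you get...
link2 < this
img3 - (not this, because it is not a link)
link3 < this,
link4 < this,
link5 < and this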

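I hope that explains it. The only change I made was to loop over the siblings until the next image is found and stop there, so that your outer loop over the images continues from that point. I also cleaned up the code a little, just for clarity.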
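The difference described above can be reproduced offline. The following is a minimal sketch, assuming a simplified inline HTML fragment (the image file names and hrefs are made up for illustration) and the built-in 'html.parser' instead of 'lxml'; the helper name links_until_next_img is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified stand-in for the real page: three steps, all tags at one level.
html = (
    '<div id="routeDescription">'
    '<img src="walk.png"/>Walk by foot to <b>Sanam Luang</b><br/>'
    '<img src="bus.gif"/>Travel to <b>Khok Wua</b> using the line(s): '
    '<a href="l/2">2</a> or <a href="l/15">15</a><br/>'
    '<img src="bus.gif"/>Travel to <b>Siam</b> using the line(s): '
    '<a href="l/16">16</a><br/>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
imgs = soup.find('div', id='routeDescription').find_all('img')


def links_until_next_img(img):
    """Collect the text of sibling <a> tags, stopping at the next <img>."""
    links = []
    for sibling in img.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip plain text nodes
        if sibling.name == 'img':
            break  # the next step starts here
        if sibling.name == 'a':
            links.append(sibling.get_text())
    return links


# find_next_siblings('a') runs past the next image and into later steps.
all_later = [a.get_text() for a in imgs[1].find_next_siblings('a')]
# The bounded loop only sees the links that belong to the current step.
bounded = links_until_next_img(imgs[1])

print(all_later)  # ['2', '15', '16']
print(bounded)    # ['2', '15']
```

Starting from the second image, find_next_siblings('a') also picks up line 16, which belongs to the third step, while the bounded loop stops as soon as it reaches the next image.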

Source

2017-04-09 21:37:23

Thanks Andre! The solution works for me. Thanks also for the good explanation. I have accepted the answer (and upvoted it)! – titipata

You can try find_next_siblings (using Python 2.7):

import bs4 

text = '''<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>x`x''' 

soup = bs4.BeautifulSoup(text, 'lxml') 
img = soup.find('img') 
action = img.next_sibling 
to_station = action.next_sibling 
n = to_station.find_next_siblings('a') 
d = { 
    'action': action, 
    'to': to_station.text, 
    'buses': [a.contents[0] for a in n] 
}

Result:

{'action': u'Travel to ', 'to': u'Khok Wua', 'buses': [u'15', u'44', u'47', u'59', u'201', u'203', u'512']}
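Note that line 2 is missing from this result: in the sample HTML its link is wrapped in its own <b> tag, so it is not a sibling of the station's <b> tag and find_next_siblings('a') never sees it. One possible way around this, sketched below under the assumption that each step ends at the next <img>, is to walk img.next_elements, which follows document order and descends into nested tags. The HTML is a shortened copy of the sample, the second step is added only to show where the walk stops, and the helper name lines_after is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the sample step, plus a made-up following step.
html = (
    '<img src="bus.gif"/>Travel to <b>Khok Wua</b> using the line(s): '
    '<b><a href="lines/bangkok-bus-line/2">2</a></b> or '
    '<a href="lines/bangkok-bus-line/15">15</a> or '
    '<a href="lines/bangkok-bus-line/44">44</a><br/>'
    '<img src="walk.png"/>Walk by foot to <b>Sukhumvit</b>'
)
soup = BeautifulSoup(html, 'html.parser')


def lines_after(img):
    """Collect link texts in document order until the next <img> is reached."""
    lines = []
    for element in img.next_elements:
        if not isinstance(element, Tag):
            continue  # skip plain text nodes
        if element.name == 'img':
            break  # the next step starts here
        if element.name == 'a':
            lines.append(element.get_text())
    return lines


first_img = soup.find('img')
# Sibling search misses the link nested inside <b>.
siblings_only = [a.get_text() for a in first_img.find_next_siblings('a')]
# Document-order walk finds nested and non-nested links alike.
in_document_order = lines_after(first_img)

print(siblings_only)      # ['15', '44']
print(in_document_order)  # ['2', '15', '44']
```

With this approach the first line no longer needs to be fetched separately, at the cost of relying on the next <img> as the boundary between steps.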

Source

2017-04-09 04:30:29

Hi Yohanes, I tried it, but it does not work for my particular problem. Do you have a solution that works for the full HTML given above? – titipata
