2017-04-09 47 views
1

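BeautifulSoup: get all the links after a given tag

I am trying to use BeautifulSoup to scrape pages such as the following (for example 1, 2) to get the list of actions needed to travel from one place in Bangkok to another.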

Basically, I can query the page and select the description of the trip as follows.

import requests
from bs4 import BeautifulSoup

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
# Select the <div> that holds the step-by-step route description.
descriptions = soup_route.find('div', attrs={'id': 'routeDescription'})

The HTML of descriptions looks like this:

<div id="routeDescription"> 
... 
<br/> 
<img src="/images/walk_icon_small.PNG" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Walk by foot to <b>Sanam Luang</b> 
<br/> 
<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/> 
... 
</div> 

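Basically, I am trying to get a list of the actions together with the bus lines used to travel to the next location (I updated the question after an answer was posted, but it is still not solved).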

route_descrtions = []
for description in descriptions.find_all('img'):
    action = description.next_sibling
    to_station = action.next_sibling
    n = action.find_next_siblings('a')
    if 'travel' in action.lower():
        lines = [to_station.find_next('b').text] + [a.contents[0] for a in n]
    else:
        lines = []
    desp = {'action': action,
            'to': to_station.text,
            'lines': lines}
    route_descrtions.append(desp)

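However, I don't know how to loop over the links that follow each action (the "Travel to" actions) and append them to my list. I tried find_next('a') and find_next_siblings('a'), but neither did what I need.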

Output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat', '40', '48', '501', '508'], 
    'to': 'Si Phraya'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'}, 
{'action': 'Travel to ', 
    'lines': ['16', '40', '48', '501', '508'], 
    'to': 'Siam'}, 
{'action': 'Travel to ', 
    'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'], 
    'to': 'Asok'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}] 

Desired output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat'], 
... 

Answers

1

The following should work:

from bs4 import BeautifulSoup
import requests
import pprint

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
routes = soup_route.find('div', attrs={'id': 'routeDescription'})

parsed_routes = list()
for img in routes.find_all('img'):
    action = img.next_sibling
    to_station = action.next_sibling
    # Collect the sibling links that follow this image, stopping at the next image.
    links = list()
    for sibling in img.next_siblings:
        if sibling.name == 'a':
            links.append(sibling)
        elif sibling.name == 'img':
            break

    lines = list()
    if 'travel' in action.lower():
        # The first line is wrapped in <b>, so it is not a direct sibling link.
        lines.extend([to_station.find_next('b').text])
        lines.extend([link.contents[0] for link in links])

    parsed_route = {'action': action, 'to': to_station.text, 'lines': lines}
    parsed_routes.append(parsed_route)

pprint.pprint(parsed_routes)

This outputs:

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'}, 
{'action': 'Travel to ', 
    'lines': ['Chao Phraya Express Boat'], 
    'to': 'Si Phraya'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'}, 
{'action': 'Travel to ', 'lines': ['16'], 'to': 'Siam'}, 
{'action': 'Travel to ', 
    'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'], 
    'to': 'Asok'}, 
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}] 

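Your key problem is n = action.find_next_siblings('a'), because it returns all the links at the same level that come after your "current" image. Since all the images and all the links are at the same level, that is not what you want.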
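A minimal sketch of that behaviour, using a small made-up HTML fragment (the file names, station letters and line numbers are illustrative only) and the built-in 'html.parser' rather than 'lxml':

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two "Travel to" steps, all tags at the same level.
html = (
    '<div id="routeDescription">'
    '<img src="a.png"/>Travel to <b>A</b> using: '
    '<a href="#">1</a> or <a href="#">2</a><br/>'
    '<img src="b.png"/>Travel to <b>B</b> using: <a href="#">3</a><br/>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
first_img = soup.find('img')
action = first_img.next_sibling  # the text node 'Travel to '

# find_next_siblings('a') does not stop at the second <img>:
# it returns every later <a> that shares the same parent.
all_later_links = [a.get_text() for a in action.find_next_siblings('a')]
print(all_later_links)  # ['1', '2', '3']
```

Link 3 belongs to the second step, yet it is returned for the first one.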

You were probably thinking of the images as parent nodes of the links. Something like:

  • IMG1
    • LINK1
  • IMG2
    • LINK2
  • IMG3
    • LINK3
    • LINK4
    • LINK5

In reality, however, it looks more like this:

  • IMG1
  • LINK1
  • IMG2
  • LINK2
  • IMG3
  • LINK3
  • LINK4
  • LINK5

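When you ask for the images, you get IMG1, IMG2 and IMG3 (in this example). When you ask for all the next link siblings, you get exactly that: all of them. So if you are at IMG2 and ask for the next link siblings, you get every link that follows, i.e.:

  • IMG1
  • LINK1
  • IMG2 < you are here, and you get...
  • LINK2 < this
  • IMG3 - (not this one, because it is not a link)
  • LINK3 < this,
  • LINK4 < this,
  • LINK5 < and this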

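I hope that explains it. The only change I made is to loop over the siblings until the next image is found and stop there, so your outer loop over the images picks up from that point. I also cleaned up the code a little, just for clarity.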
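The "stop at the next image" idea can be modelled without any HTML at all. The sketch below is only an illustration: it treats the flat sibling sequence from the example as a plain Python list, and the names siblings and links_for are made up for this purpose.

```python
from itertools import takewhile

# The flat sibling sequence from the example above.
siblings = ['IMG1', 'LINK1', 'IMG2', 'LINK2', 'IMG3', 'LINK3', 'LINK4', 'LINK5']


def links_for(index):
    """Return the links that follow the image at `index`, up to the next image."""
    following = siblings[index + 1:]
    # takewhile stops consuming as soon as the next image is reached.
    own = takewhile(lambda name: not name.startswith('IMG'), following)
    return [name for name in own if name.startswith('LINK')]


grouped = {
    name: links_for(i)
    for i, name in enumerate(siblings)
    if name.startswith('IMG')
}
print(grouped)
# {'IMG1': ['LINK1'], 'IMG2': ['LINK2'], 'IMG3': ['LINK3', 'LINK4', 'LINK5']}
```

The inner loop with break in the answer's code plays the same role as takewhile here.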

+0

Thanks, Andre! The solution works for me. Thanks also for the good explanation. I have accepted the answer (and upvoted it)! – titipata

0

You can try find_next_siblings (using Python 2.7):

import bs4 

text = '''<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>x`x''' 

soup = bs4.BeautifulSoup(text, 'lxml') 
img = soup.find('img') 
action = img.next_sibling 
to_station = action.next_sibling 
n = to_station.find_next_siblings('a') 
d = { 
    'action': action, 
    'to': to_station.text, 
    'buses': [a.contents[0] for a in n] 
} 

Result:

{'action': u'Travel to ', 'to': u'Khok Wua', 'buses': [u'15', u'44', u'47', u'59', u'201', u'203', u'512']} 
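Note that line 2 is missing from this result: its link is nested inside a <b> tag, so it is not a sibling of to_station. One possible way to pick up nested links while still stopping at the next image is to walk next_elements instead of the siblings. This is only a sketch under assumptions: the fragment is a shortened copy of the HTML from the question, it uses 'html.parser' and Python 3, and the helper name lines_after is made up.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the question's HTML (fewer lines, style attributes dropped).
html = '''<div id="routeDescription">
<img src="/images/walk_icon_small.PNG"/>Walk by foot to <b>Sanam Luang</b>
<br/>
<img src="/images/bus_icon_semi_small.gif"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a><br/>
</div>'''

soup = BeautifulSoup(html, 'html.parser')


def lines_after(img):
    """Collect the text of every <a> after `img`, nested or not, up to the next <img>."""
    lines = []
    for element in img.next_elements:  # document order, descends into children
        if not isinstance(element, Tag):
            continue  # skip text nodes
        if element.name == 'img':
            break  # the next step starts here
        if element.name == 'a':
            lines.append(element.get_text())
    return lines


steps = [(img.next_sibling.strip(), lines_after(img)) for img in soup.find_all('img')]
print(steps)
# [('Walk by foot to', []), ('Travel to', ['2', '15', '44'])]
```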
+0

Hi Yohanes, I tried it, but it does not work for my particular problem. Do you have a solution that works on the full HTML given? – titipata
