2016-09-30 37 views

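BeautifulSoup - misspelled class

I'm trying to scrape location data from https://www.wellstar.org/locations/pages/default.aspx. When I looked at the page source, I noticed that the class for the hospital address is sometimes spelled with an extra 'd': 'adddress' as well as 'address'. Is there a way to handle this discrepancy in the code below? I tried adding an if statement that tests the length of the address object, but I only get the addresses associated with the 'adddress' class. I feel like I'm close, but I'm out of ideas.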

import urllib
import urllib.request
from bs4 import BeautifulSoup
import re

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.wellstar.org/locations/pages/default.aspx")

for table in soup.findAll("table", class_="s4-wpTopTable"):
    for type in table.findAll("h3"):
        type = type.get_text()
    for name in table.findAll("div", class_="PurpleBackgroundHeading"):
        name = name.get_text()
    address = ""
    for address in table.findAll("div", class_="WS_Location_Adddress"):
        address = address.get_text(separator=" ")
    if len(address) == 0:
        for address in table.findAll("div", class_="WS_Location_Address"):
            address = address.get_text(separator=" ")
            print(type, name, address)

Answer

BeautifulSoup is quite accommodating here: you can pass a compiled regular expression as the class filter:

for address in table.find_all("div", class_=re.compile(r"WS_Location_Ad{2,}ress")): 

Here d{2,} matches the letter d two or more times, so the pattern covers both 'Address' and 'Adddress'.
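
To see the regular expression filter in isolation, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The markup, the sample addresses, and the helper name find_addresses are made up for illustration; only the class names come from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the two spellings described in the question.
SAMPLE_HTML = """
<div class="WS_Location_Address">1 Example Street</div>
<div class="WS_Location_Adddress">2 Sample Avenue</div>
<div class="WS_Location_Phone">555-0100</div>
"""

# d{2,} means "two or more consecutive d characters".
ADDRESS_CLASS = re.compile(r"WS_Location_Ad{2,}ress")

def find_addresses(html):
    """Return the text of every div whose class matches either spelling."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", class_=ADDRESS_CLASS)
    return [div.get_text(separator=" ").strip() for div in divs]

addresses = find_addresses(SAMPLE_HTML)
print(addresses)
```

The phone div is skipped because its class does not match the pattern, while both address spellings are picked up in document order.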


Alternatively, you can pass a list of classes:

for address in table.find_all("div", class_=["WS_Location_Address", "WS_Location_Adddress"]): 
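
As a rough sketch of how the class list could replace the if/else fallback in the question's loop, the example below parses a small inline document rather than the live site. The HTML, the location names, and the function extract_locations are hypothetical. It also assumes that names and addresses appear in matching order inside each table, which may not hold on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table per location category, using both spellings.
SAMPLE_HTML = """
<table class="s4-wpTopTable">
  <tr><td>
    <h3>Hospitals</h3>
    <div class="PurpleBackgroundHeading">Example Hospital</div>
    <div class="WS_Location_Adddress">1 Example Street</div>
  </td></tr>
</table>
<table class="s4-wpTopTable">
  <tr><td>
    <h3>Urgent Care</h3>
    <div class="PurpleBackgroundHeading">Sample Clinic</div>
    <div class="WS_Location_Address">2 Sample Avenue</div>
  </td></tr>
</table>
"""

# Either spelling of the class counts as an address.
ADDRESS_CLASSES = ["WS_Location_Address", "WS_Location_Adddress"]

def extract_locations(html):
    """Return (category, name, address) tuples, one per location."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="s4-wpTopTable"):
        heading = table.find("h3")
        category = heading.get_text(strip=True) if heading else ""
        names = table.find_all("div", class_="PurpleBackgroundHeading")
        addresses = table.find_all("div", class_=ADDRESS_CLASSES)
        # Assumption: the n-th name belongs to the n-th address in the table.
        for name, address in zip(names, addresses):
            rows.append((
                category,
                name.get_text(strip=True),
                address.get_text(separator=" ", strip=True),
            ))
    return rows

locations = extract_locations(SAMPLE_HTML)
for row in locations:
    print(*row)
```

Because a single find_all call covers both spellings, no length check or second loop is needed.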

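Two great options. To be honest, I'm both curious about and intimidated by regular expressions. This might be a reason to spend some time learning the operators. – Daniel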