2016-09-21 45 views
2

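BeautifulSoup: parsing an HTML table whose cells have no class

I have this HTML table. I need to extract specific data from it and assign each value to a variable; I don't need all of the information. For example, flag = "United Arab Emirates", home_port = "Sharjah", and so on. Since the th and td elements have no 'class' attribute, how can we extract this data?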

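import requests
from bs4 import BeautifulSoup

# imo_number is assumed to be defined earlier, e.g. imo_number = 9492749
r = requests.get('http://maritime-connector.com/ship/' + str(imo_number), headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", {"class": "ship-data-table"})
for row in table.find_all("tr"):
    tname = row.find_all("th")   # header cells of this row
    cells = row.find_all("td")   # value cells of this row

    print(type(tname))
    print(type(cells))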

I am using the Python module BeautifulSoup.

<table class="ship-data-table" style="margin-bottom:3px">
  <thead>
    <tr>
      <th>IMO number</th>
      <td>9492749</td>
    </tr>
    <tr>
      <th>Name of the ship</th>
      <td>SHARIEF PILOT</td>
    </tr>
    <tr>
      <th>Type of ship</th>
      <td>ANCHOR HANDLING VESSEL</td>
    </tr>
    <tr>
      <th>MMSI</th>
      <td>470535000</td>
    </tr>
    <tr>
      <th>Gross tonnage</th>
      <td>499 tons</td>
    </tr>
    <tr>
      <th>DWT</th>
      <td>222 tons</td>
    </tr>
    <tr>
      <th>Year of build</th>
      <td>2008</td>
    </tr>
    <tr>
      <th>Builder</th>
      <td>NANYANG SHIPBUILDING - JINGJIANG, CHINA</td>
    </tr>
    <tr>
      <th>Flag</th>
      <td>UNITED ARAB EMIRATES</td>
    </tr>
    <tr>
      <th>Home port</th>
      <td>SHARJAH</td>
    </tr>
    <tr>
      <th>Manager & owner</th>
      <td>GLOBAL MARINE SERVICES - SHARJAH, UNITED ARAB EMIRATES</td>
    </tr>
    <tr>
      <th>Former names</th>
      <td>SUPERIOR PILOT until 2008 Sep</td>
    </tr>
  </thead>
</table>
+0

I am using the Python module BeautifulSoup, without any regular expressions.

Answers

2

Go over all the th elements in the table, and for each one get its text and the text of the td sibling that follows it:

from pprint import pprint 

from bs4 import BeautifulSoup 

data = """your HTML here""" 

soup = BeautifulSoup(data, "html.parser") 

result = {header.get_text(strip=True): header.find_next_sibling("td").get_text(strip=True) 
      for header in soup.select("table.ship-data-table tr th")} 
pprint(result) 

This builds a dictionary with the header texts as keys and the corresponding td texts as values:

{'Builder': 'NANYANG SHIPBUILDING - JINGJIANG, CHINA', 
'DWT': '222 tons', 
'Flag': 'UNITED ARAB EMIRATES', 
'Former names': 'SUPERIOR PILOT until 2008 Sep', 
'Gross tonnage': '499 tons', 
'Home port': 'SHARJAH', 
'IMO number': '9492749', 
'MMSI': '470535000', 
'Manager & owner': 'GLOBAL MARINE SERVICES - SHARJAH, UNITED ARAB EMIRATES', 
'Name of the ship': 'SHARIEF PILOT', 
'Type of ship': 'ANCHOR HANDLING VESSEL', 
'Year of build': '2008'} 
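
Getting from this dictionary to the individual variables the question asks for takes only plain key lookups. A minimal sketch, assuming a shortened two-row stand-in for the real table (the `html` string below is illustrative, not the full page):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption: same structure).
html = """
<table class="ship-data-table">
  <thead>
    <tr><th>Flag</th><td>UNITED ARAB EMIRATES</td></tr>
    <tr><th>Home port</th><td>SHARJAH</td></tr>
  </thead>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same dictionary comprehension as in the answer.
result = {
    header.get_text(strip=True): header.find_next_sibling("td").get_text(strip=True)
    for header in soup.select("table.ship-data-table tr th")
}

flag = result["Flag"]                # raises KeyError if the row is absent
home_port = result.get("Home port")  # returns None if the row is absent

print(flag)
print(home_port)
```

Using `result.get(...)` rather than `result[...]` is probably the safer choice when some ships may lack a given row.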
+1

I like this solution.

+0

Thanks @alecxe. It works.

+0

@alecxe I get an error when a value is missing: AttributeError: 'NoneType' object has no attribute 'get_text'. Where can I use try and except?
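
The AttributeError reported above most likely comes from `find_next_sibling("td")` returning `None` when a `th` has no `td` after it, so that `.get_text` ends up being called on `None`. One way to handle this without try/except is to check for `None` before reading the text. A sketch, assuming missing values should be stored as `None`; the sample HTML with an empty "Former names" row is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical table in which the last row has a header but no value cell.
html = """
<table class="ship-data-table">
  <tr><th>Flag</th><td>UNITED ARAB EMIRATES</td></tr>
  <tr><th>Former names</th></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

result = {}
for header in soup.select("table.ship-data-table tr th"):
    cell = header.find_next_sibling("td")  # None when the row has no <td>
    # Only call get_text() when a cell was actually found.
    value = cell.get_text(strip=True) if cell is not None else None
    result[header.get_text(strip=True)] = value

print(result)
```

Wrapping the lookup in `try: ... except AttributeError:` would also work, but the explicit check keeps unrelated attribute errors from being silently swallowed.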

0

I would do something like this:

html = """ 
     <your table> 
    """ 

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html, 'html.parser') 

flag = soup.find("th", string="Flag").find_next("td").get_text(strip=True) 
home_port = soup.find("th", string="Home port").find_next("td").get_text(strip=True) 


print(flag) 
print(home_port) 

This way, I match the text only against th elements and then fetch the next td.
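
If several fields are needed, the same lookup can be wrapped in a small helper. This is only a sketch: `get_field` is a hypothetical name, not part of BeautifulSoup, and it assumes each label appears at most once and matches the `th` text exactly. It also swaps `find_next` for `find_next_sibling`, so that a header with no cell of its own does not pick up the `td` of a later row:

```python
from bs4 import BeautifulSoup


def get_field(soup, label):
    """Return the text of the <td> following the <th> whose text equals label, or None."""
    header = soup.find("th", string=label)
    if header is None:
        return None  # no such header in the table
    cell = header.find_next_sibling("td")
    return cell.get_text(strip=True) if cell is not None else None


# Shortened stand-in for the table in the question (assumption: same structure).
html = """
<table class="ship-data-table">
  <tr><th>Flag</th><td>UNITED ARAB EMIRATES</td></tr>
  <tr><th>Home port</th><td>SHARJAH</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

flag = get_field(soup, "Flag")
home_port = get_field(soup, "Home port")
call_sign = get_field(soup, "Call sign")  # not in the table, so None

print(flag, home_port, call_sign)
```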