2013-07-25

Parsing a nested HTML list with BeautifulSoup

I need to parse a nested HTML list and convert it into a parent-child dictionary. Given this list:

<ul>
    <li>Operating System
        <ul>
            <li>Linux
                <ul>
                    <li>Debian</li>
                    <li>Fedora</li>
                    <li>Ubuntu</li>
                </ul>
            </li>
            <li>Windows</li>
            <li>OS X</li>
        </ul>
    </li>
    <li>Programming Languages
        <ul>
            <li>Python</li>
            <li>C#</li>
            <li>Ruby</li>
        </ul>
    </li>
</ul>

I want to convert it into a dictionary like this:

{
    'Operating System': {
        'Linux': {
            'Debian': None,
            'Fedora': None,
            'Ubuntu': None,
        },
        'Windows': None,
        'OS X': None,
    },
    'Programming Languages': {
        'Python': None,
        'C#': None,
        'Ruby': None,
    }
}

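My first attempt used find_all('li', recursive=False). It returns the top-level items (Operating System and Programming Languages), but it also returns the sub-items.

How can I do this with BeautifulSoup?

Answer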


Here is one way to do it:

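def dictify(ul):
    """Convert a <ul> tag into a nested dict; leaf items map to None."""
    result = {}
    # recursive=False limits the search to direct <li> children of this <ul>
    for li in ul.find_all("li", recursive=False):
        # The first non-empty string inside the <li> is its label
        key = next(li.stripped_strings)
        # A nested <ul>, if present, holds this item's children
        sub_ul = li.find("ul")
        if sub_ul is not None:
            result[key] = dictify(sub_ul)
        else:
            result[key] = None
    return result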
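
The recursive=False argument is what keeps each level of the list separate. As a rough illustration (the shortened markup and the variable names below are made up for this sketch, and the "html.parser" backend is an assumption), compare what find_all returns on the same <ul> with and without it:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample markup used only for this illustration
html = """
<ul>
  <li>Linux
    <ul>
      <li>Debian</li>
      <li>Fedora</li>
    </ul>
  </li>
  <li>Windows</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
top = soup.ul  # the first <ul> in the document

# Default search: every <li> descendant, at any depth, in document order
all_items = top.find_all("li")
# recursive=False: only <li> tags that are direct children of `top`
direct_items = top.find_all("li", recursive=False)

all_labels = [next(li.stripped_strings) for li in all_items]
direct_labels = [next(li.stripped_strings) for li in direct_items]

print(all_labels)     # ['Linux', 'Debian', 'Fedora', 'Windows']
print(direct_labels)  # ['Linux', 'Windows']
```

Note that the result depends on which tag find_all is called on: it has to be the <ul> itself for the direct children to be the <li> items.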

Example usage:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <ul>
...   <li>Operating System
...     <ul>
...       <li>Linux
...         <ul>
...           <li>Debian</li>
...           <li>Fedora</li>
...           <li>Ubuntu</li>
...         </ul>
...       </li>
...       <li>Windows</li>
...       <li>OS X</li>
...     </ul>
...   </li>
...   <li>Programming Languages
...     <ul>
...       <li>Python</li>
...       <li>C#</li>
...       <li>Ruby</li>
...     </ul>
...   </li>
... </ul>
... """, "html.parser")
>>> ul = soup.ul
>>> from pprint import pprint
>>> pprint(dictify(ul), width=1)
{'Operating System': {'Linux': {'Debian': None,
                                'Fedora': None,
                                'Ubuntu': None},
                      'OS X': None,
                      'Windows': None},
 'Programming Languages': {'C#': None,
                           'Python': None,
                           'Ruby': None}}
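
One caveat with taking the first stripped string as the key: an <li> that has no text of its own but does contain a nested <ul> would borrow its first child's label, and a completely empty <li> would raise StopIteration. A possible variant is sketched below. The name dictify_safe and the placeholder parameter are hypothetical and not part of the answer above; the sketch builds the key only from content that sits outside any nested <ul>, and it assumes the "html.parser" backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def dictify_safe(ul, placeholder="(untitled)"):
    """Hypothetical variant of dictify that tolerates <li> tags without a label."""
    result = {}
    for li in ul.find_all("li", recursive=False):
        # Only a <ul> that is a direct child of this <li> counts as its sublist
        sub_ul = li.find("ul", recursive=False)
        parts = []
        for child in li.children:
            if isinstance(child, Tag):
                if child.name == "ul":
                    continue  # nested lists never contribute to the label
                text = child.get_text()
            else:
                text = str(child)  # a bare text node
            if text.strip():
                parts.append(text.strip())
        # Fall back to the placeholder when the item has no text of its own
        key = " ".join(parts) or placeholder
        result[key] = dictify_safe(sub_ul, placeholder) if sub_ul is not None else None
    return result


# Made-up markup exercising both edge cases
sample = BeautifulSoup(
    "<ul>"
    "<li><b>Linux</b> distros<ul><li>Debian</li></ul></li>"
    "<li><ul><li>Orphan</li></ul></li>"
    "<li>Windows</li>"
    "</ul>",
    "html.parser",
)
tree = dictify_safe(sample.ul)
print(tree)
```

Items that end up with the same key (for example two unlabeled items) would overwrite each other, so this is only a starting point if labels are not guaranteed to be unique.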