2015-08-16

Python: how to extract elements between multiple tags

Working HTML:

<h2> Heading 1 </h2> 
<h3> Subheading 1.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> 
<h3> Subheading 1.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a> 
<h3> Subheading 1.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 2 </h2> 
<h3> Subheading 2.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2</a> 
<h3> Subheading 2.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> 
<h3> Subheading 2.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 3 </h2> 

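Question: For each h2 tag, I want to extract the h3 tags that appear before the next h2, and for each of those h3 tags, all the anchor tags that appear before the next h3.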

What I have:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""<h2> Heading 1 </h2> 
<h3> Subheading 1.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> 
<h3> Subheading 1.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a> 
<h3> Subheading 1.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 2 </h2> 
<h3> Subheading 2.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2</a> 
<h3> Subheading 2.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> 
<h3> Subheading 2.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 3 </h2>""", 'html5lib') 

for row in soup.find_all("h2"): 
    print(row.text) 
    print(row.find_next('h3')) 
    print('################') 

Current result:

################ 
Heading 1 
<h3> Subheading 1.1 </h3> 
################ 
Heading 2 
<h3> Subheading 2.1 </h3> 
################ 
Heading 3 
None 
################ 

Desired result:

################ 
Heading 1 
Subheading 1.1 
Link 1 
Link 2 
Link 3 
-------- 
Subheading 1.2 
Link 1 
Link 2 
Link 3 
Link 4 
-------- 
Subheading 1.3 
Link 1 
################ 
Heading 2 
Subheading 2.1 
Link 1 
Link 2 
-------- 
Subheading 2.2 
Link 1 
Link 2 
-------- 
Subheading 2.3 
Link 1 
################ 

Or something similar.

Answer


This works:

s = """ 

<h2> Heading 1 </h2> 
<h3> Subheading 1.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> 
<h3> Subheading 1.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a> 
<h3> Subheading 1.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 2 </h2> 
<h3> Subheading 2.1 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2</a> 
<h3> Subheading 2.2 </h3> 
<a href="#">Link 1</a> | <a href="#">Link 2 </a> 
<h3> Subheading 2.3 </h3> 
<a href="#">Link 1</a> 
<h2> Heading 3 </h2> 

""" 

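from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(s, 'html.parser')

for h2 in soup.find_all('h2'):
    print(h2.get_text(strip=True))
    for sib in h2.next_siblings:
        if sib.name == 'h2':  # reached the next section
            break
        if sib.name == 'h3':
            print('\t' + sib.get_text(strip=True))
            for node in sib.next_siblings:
                # Stop at the next heading of either level, so links are
                # never picked up from a later section.
                if node.name in ('h2', 'h3'):
                    break
                if node.name == 'a':
                    print('\t\t' + node.get_text(strip=True))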
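
An alternative to walking next_siblings is a single pass over the h2, h3 and a tags in document order, attaching each tag to the most recent heading above it. The sketch below is only one way to do it. The names group_links and format_sections are made up for this example, the HTML is a shortened copy of the sample from the question, and skipping headings that have no subheadings is an assumption read off the desired output (which leaves out Heading 3).

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the question.
HTML = """
<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2</a>
<h2> Heading 3 </h2>
"""


def group_links(html):
    """Return [(h2_text, [(h3_text, [link_text, ...]), ...]), ...]."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    # Passing a list of names to find_all returns matches in document order.
    for tag in soup.find_all(["h2", "h3", "a"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            sections.append((text, []))
        elif tag.name == "h3" and sections:
            sections[-1][1].append((text, []))
        elif tag.name == "a" and sections and sections[-1][1]:
            # Attach the link to the latest h3 of the latest h2.
            sections[-1][1][-1][1].append(text)
    return sections


def format_sections(sections):
    """Render the grouping in the layout shown under 'Desired result'."""
    lines = []
    for heading, subsections in sections:
        if not subsections:
            # Assumption: headings without subheadings are omitted.
            continue
        lines.append("################")
        lines.append(heading)
        for index, (subheading, links) in enumerate(subsections):
            if index:
                lines.append("--------")
            lines.append(subheading)
            lines.extend(links)
    lines.append("################")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_sections(group_links(HTML)))
```

Keeping the grouping separate from the printing means the nested list can be reused, for example to collect the href of each link instead of its text.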