2014-01-29 49 views
2

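How do I find all list items between two tags using BeautifulSoup?

For example, I want to extract only Child1, Child2 and Child3 from the list below, i.e. the items that come after the first h3 and before the next h3 tag.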

<h3>HeaderName1</h3>
<ul class="prodoplist">
    <li>Parent</li>
    <li class="lev1">Child1</li>
    <li class="lev1">Child2</li>
    <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
    <li>Parent2</li>
    <li class="lev1">Child4</li>
    <li class="lev1">Child5</li>
    <li class="lev1">Child6</li>
</ul>

Answers

2

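Use findChildren, like this:

# Assumes `soup` is a BeautifulSoup object built from the HTML in the question.
for ul in soup.find_all('ul'):
    print('ul start')
    # Print only the first three <li> children of each list.
    for idx, li in enumerate(ul.findChildren('li')):
        if idx < 3:
            print(li)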

Output:

ul start 
<li>Parent</li> 
<li class="lev1">Child1</li> 
<li class="lev1">Child2</li> 
ul start 
<li>Parent2</li> 
<li class="lev1">Child4</li> 
<li class="lev1">Child5</li> 

However, as in most cases, lxml with XPath is the superior solution:

from lxml import html

doc = html.parse('input.html')
# For each <ul>, select its first three <li> children.
print([ul.xpath('li[1] | li[2] | li[3]') for ul in doc.xpath('//ul')])
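
The XPath above picks the first three items of every list, which includes the Parent entry. As a rough sketch of how the same approach could be narrowed to the question's goal, the expression below takes only the `lev1` items of the first list that follows the first h3. It assumes lxml is installed and that the headers are properly closed with `</h3>`; the inline `html_doc` string is an illustrative stand-in for the real page.

```python
from lxml import html

# Illustrative stand-in for the real page; headers are assumed to be closed with </h3>.
html_doc = (
    '<h3>HeaderName1</h3>'
    '<ul class="prodoplist"><li>Parent</li><li class="lev1">Child1</li>'
    '<li class="lev1">Child2</li><li class="lev1">Child3</li></ul> '
    '<h3>HeaderName2</h3>'
    '<ul class="prodoplist"><li>Parent2</li><li class="lev1">Child4</li>'
    '<li class="lev1">Child5</li><li class="lev1">Child6</li></ul>'
)

doc = html.fromstring(html_doc)

# (//h3)[1]                   -> the first h3 in the document
# following-sibling::ul[1]    -> the nearest ul that comes after it
# li[@class="lev1"]/text()    -> the text of its lev1 items only
children = doc.xpath('(//h3)[1]/following-sibling::ul[1]/li[@class="lev1"]/text()')
print(children)
```

If more than one list can appear between the two headers, `following-sibling::ul[1]` would have to be replaced by a condition that stops at the next h3.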

You can use lxml as the parser for beautifulsoup4. Use it like this: bs4.BeautifulSoup(response.text, 'lxml')
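
A minimal sketch of that suggestion follows. Here `make_soup` is a hypothetical helper, and the inline string stands in for `response.text`. Because lxml is a separate install, the sketch falls back to the built-in `html.parser` when bs4 reports that the lxml parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Stand-in for response.text; a real script would fetch the page first.
html_doc = (
    '<h3>HeaderName1</h3>'
    '<ul class="prodoplist"><li>Parent</li><li class="lev1">Child1</li></ul>'
)

def make_soup(markup):
    """Hypothetical helper: prefer the lxml parser, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")

soup = make_soup(html_doc)
first_child = soup.find("li", class_="lev1").get_text()
print(first_child)
```

The rest of the BeautifulSoup code stays the same whichever parser is used; only the second argument to the constructor changes.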

2

This should work.

import re
from bs4 import BeautifulSoup

html_doc = '<h3>HeaderName1</h3><ul class="prodoplist"><li>Parent</li><li class="lev1">Child1</li><li class="lev1">Child2</li><li class="lev1">Child3</li></ul> <h3>HeaderName2</h3><ul class="prodoplist"><li>Parent2</li><li class="lev1">Child4</li><li class="lev1">Child5</li><li class="lev1">Child6</li></ul>'
# Match from the first <h3> up to the opening of the next <h3>.
m = re.search(r'<h3>.*?<h3>', html_doc, re.DOTALL)
s = m.start()
e = m.end() - len('<h3>')  # drop the '<h3>' that opens the second header
target_html = html_doc[s:e]
new_bs = BeautifulSoup(target_html, 'html.parser')
ul_eles = new_bs.find_all('ul', attrs={'class': 'prodoplist'})
for ul_ele in ul_eles:
    # Search inside each list rather than the whole fragment.
    li_eles = ul_ele.find_all('li', attrs={'class': 'lev1'})
    for li_ele in li_eles:
        print(li_ele.text)
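
The regex slicing above depends on the exact spelling of the header tags. One possible alternative, sketched here under the assumption that the headers and lists are siblings in the parsed tree, is to walk the siblings of the first h3 and stop at the next one. `items_between_headers` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """
<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child4</li>
  <li class="lev1">Child5</li>
  <li class="lev1">Child6</li>
</ul>
"""

def items_between_headers(soup, header="h3", item_class="lev1"):
    """Hypothetical helper: text of li.<item_class> elements found between
    the first <header> tag and the next <header> tag."""
    first = soup.find(header)
    if first is None:
        return []
    found = []
    for sibling in first.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace and other text nodes
        if sibling.name == header:
            break  # reached the next header, so stop collecting
        for li in sibling.find_all("li", class_=item_class):
            found.append(li.get_text(strip=True))
    return found

soup = BeautifulSoup(html_doc, "html.parser")
children = items_between_headers(soup)
print(children)
```

Filtering on the `lev1` class is what excludes the Parent entry; stopping at the next header is what excludes Child4 to Child6.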
1
import requests
from bs4 import BeautifulSoup

children = []

url = "http://someurl.html"  # placeholder URL
r = requests.get(url)
bs = BeautifulSoup(r.text, 'html.parser')
# Collect every li.lev1 inside every ul.prodoplist on the page.
for ul in bs.find_all('ul', 'prodoplist'):
    for li in ul.find_all('li', 'lev1'):
        children.append(li.text)

print(children)
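
The loop above gathers the `lev1` items of every list on the page, so Child4 to Child6 would be included as well. A small sketch of how the result might be limited to one header is shown below. `children_under` is a hypothetical helper, the inline HTML replaces the network request, and the code assumes each list directly follows its h3 as a sibling.

```python
from bs4 import BeautifulSoup

html_doc = """
<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child4</li>
  <li class="lev1">Child5</li>
  <li class="lev1">Child6</li>
</ul>
"""

def children_under(soup, header_text):
    """Hypothetical helper: lev1 items of the list that follows the h3
    whose text equals header_text."""
    header = soup.find("h3", string=header_text)
    if header is None:
        return []
    ul = header.find_next_sibling("ul")  # nearest <ul> after this header
    if ul is None:
        return []
    return [li.get_text(strip=True) for li in ul.find_all("li", class_="lev1")]

soup = BeautifulSoup(html_doc, "html.parser")
children = children_under(soup, "HeaderName1")
print(children)
```

Looking the header up by its text rather than by position keeps the code working if the order of the sections on the page changes.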