2011-03-11 88 views
15

Using BeautifulSoup to extract text between line breaks (e.g. <br /> tags)

I have the following HTML, which is part of a larger document:

<br /> 
Important Text 1 
<br /> 
<br /> 
Not Important Text 
<br /> 
Important Text 2 
<br /> 
Important Text 3 
<br /> 
<br /> 
Non Important Text 
<br /> 
Important Text 4 
<br /> 

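I am currently using BeautifulSoup to get other elements from this HTML, but I have not been able to find a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each <br /> element, but I cannot find a way to get the text in between. Any help would be greatly appreciated. Thanks.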

Answers

21

If you just want any text that sits between two <br /> tags, you can do the following:

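# BeautifulSoup 3 (Python 2) import; with bs4 use: from bs4 import BeautifulSoup, NavigableString, Tag
from BeautifulSoup import BeautifulSoup, NavigableString, Tag

# Named html rather than input to avoid shadowing the built-in input()
html = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''

soup = BeautifulSoup(html)

for br in soup.findAll('br'):
    # The node right after a <br /> must be a text node
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s, NavigableString)):
        continue
    # ...and that text must itself be followed by another <br />
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print("Found: " + text)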

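But perhaps I misunderstood your question? The description in your question doesn't seem to match the "important" / "not important" labels in your example data, so I went with the description ;)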
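
For anyone on Python 3, here is a rough sketch of the same idea ported to bs4. It assumes the `bs4` package and the built-in `html.parser` backend; the helper name `text_between_brs` is made up for illustration and is not part of either library.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />"""


def text_between_brs(markup):
    """Return the stripped text nodes that sit directly between two <br> tags.

    Hypothetical helper: assumes the html.parser backend.
    """
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for br in soup.find_all("br"):
        node = br.next_sibling
        # Skip when the <br> is followed by another tag or by nothing at all
        if not isinstance(node, NavigableString):
            continue
        after = node.next_sibling
        # Keep the text only if another <br> closes it off
        if isinstance(after, Tag) and after.name == "br":
            text = node.strip()
            if text:
                found.append(text)
    return found


for line in text_between_brs(html):
    print("Found:", line)
```

Like the snippet above it, this keeps every non-empty text node that is enclosed by two <br /> tags, so it follows the question's description rather than the "important" labels in the sample data.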

+0

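Ah, the problem was that I was using findNextSibling(), which just skipped over the text and went to the next line break. Using nextSibling works. Thanks for your help! – maltman 2011-03-14 15:22:29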
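
A small sketch of the difference described in this comment, written against bs4 (where the names are `next_sibling` and `find_next_sibling`) and assuming the `html.parser` backend. The one-line markup is invented for the demo.

```python
from bs4 import BeautifulSoup

# Invented one-line sample: text enclosed by two <br /> tags
soup = BeautifulSoup("<br />Important Text 1<br />", "html.parser")
first_br = soup.find("br")

# next_sibling returns the very next node, which here is the text node
immediate = first_br.next_sibling

# find_next_sibling("br") searches for a matching tag, so it steps over the text
next_tag = first_br.find_next_sibling("br")

print(repr(str(immediate)))
print(next_tag.name)
```

So the attribute is what you want when the text itself is the target, while the find method is for jumping to the next matching tag.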

+0

Great answer, this was giving me a real headache! – Nick 2013-07-24 01:58:41

+0

Isn't 'next' a reserved word in Python? Maybe a different variable name would be better? (It's a small point, but things like this add up!) – duhaime 2013-10-18 02:20:50
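
For what it's worth, a quick check in Python 3 suggests `next` is a built-in function rather than a reserved word. Using it as a variable name is therefore legal, but it shadows the built-in for the rest of that scope. The variable names in this sketch are just for illustration.

```python
import builtins
import keyword

# Reserved words are listed by the keyword module; "next" is not among them
is_reserved = keyword.iskeyword("next")

# It is, however, a name defined in the builtins module
is_builtin = hasattr(builtins, "next")

print(is_reserved, is_builtin)
```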

4

So, for testing purposes, let's assume this piece of HTML is inside a span tag:

x = """<span><br /> 
Important Text 1 
<br /> 
<br /> 
Not Important Text 
<br /> 
Important Text 2 
<br /> 
Important Text 3 
<br /> 
<br /> 
Non Important Text 
<br /> 
Important Text 4 
<br /></span>""" 

Now I'll parse it and find my span tag:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)  # parse the snippet defined above
y = soup.find('span')

If you iterate over the y.childGenerator() generator, you get both the br elements and the text:

In [4]: for a in y.childGenerator(): print type(a), str(a) 
    ....: 
<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 1 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Not Important Text 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 2 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 3 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Non Important Text 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 4 

<type 'instance'> <br /> 
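
In bs4 the same walk is spelled `.children` instead of `childGenerator()`. Below is a minimal sketch that keeps only the non-empty text nodes, assuming the `html.parser` backend and a shortened version of the sample markup.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened version of the sample above
x = "<span><br />Important Text 1<br /><br />Not Important Text<br /></span>"

soup = BeautifulSoup(x, "html.parser")
span = soup.find("span")

lines = []
for child in span.children:
    # Children alternate between <br> tags and text nodes; keep only the text
    if isinstance(child, NavigableString):
        text = child.strip()
        if text:
            lines.append(text)

print(lines)
```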

0

The following worked for me:

for br in soup.findAll('br'):
    # Guard against an empty contents list, which would raise IndexError
    if br.contents and str(type(br.contents[0])) == '<class \'BeautifulSoup.NavigableString\'>':
        print br.contents[0]

+0

Please don't rely on the string representation of objects for code logic. – Sylvain 2017-05-05 10:13:07
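
One way to act on that advice is an `isinstance` check, sketched here with bs4 and the `html.parser` backend (both assumptions, as is the sample markup). With that parser the text ends up as a sibling of each br rather than inside `br.contents`, so this sketch inspects `next_sibling` instead.

```python
from bs4 import BeautifulSoup, NavigableString

# Invented sample: one <br> followed by text, one followed by a tag
soup = BeautifulSoup("<br />Important Text 1<br /><p>other</p>", "html.parser")

texts = []
for br in soup.find_all("br"):
    node = br.next_sibling
    # Compare against the class itself, not against str(type(...))
    if isinstance(node, NavigableString) and node.strip():
        texts.append(node.strip())

print(texts)
```

The class check keeps working if the module path or the repr of the type changes, which a string comparison would not.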