Using BeautifulSoup to extract text between line breaks (e.g. <br /> tags)

I have the following HTML, which is part of a larger document:

<br /> 
Important Text 1 
<br /> 
<br /> 
Not Important Text 
<br /> 
Important Text 2 
<br /> 
Important Text 3 
<br /> 
<br /> 
Non Important Text 
<br /> 
Important Text 4 
<br />

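I am currently using BeautifulSoup to get other elements from this HTML, but I have not been able to find a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each <br /> element, but I cannot find a way to get the text in between. Any help would be greatly appreciated. Thanks.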

2011-03-11 maltman

If you just want any text that sits between two <br /> tags, you could do something like the following:

# Written for BeautifulSoup 3 on Python 2.
from BeautifulSoup import BeautifulSoup, NavigableString, Tag

html = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''

soup = BeautifulSoup(html)

for br in soup.findAll('br'):
    next_s = br.nextSibling
    # Skip this <br /> unless it is immediately followed by a text node.
    if not (next_s and isinstance(next_s, NavigableString)):
        continue
    next2_s = next_s.nextSibling
    # Keep the text only if another <br /> comes right after it.
    if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print "Found:", text

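But perhaps I misunderstood your question? The description of the problem doesn't seem to match the "important" / "non important" labels in your example data, so I have gone with the description ;)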
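
For readers on Python 3: the snippet above targets BeautifulSoup 3 and Python 2. A minimal sketch of the same idea with the newer bs4 package might look like the following. It assumes bs4 is installed and uses the built-in html.parser; the function name text_between_brs is made up for illustration and is not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

HTML = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />"""


def text_between_brs(html):
    """Return the stripped text of each string that sits directly between two <br> tags."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for br in soup.find_all("br"):
        node = br.next_sibling
        # Only a plain string right after the <br> is of interest.
        if not isinstance(node, NavigableString):
            continue
        after = node.next_sibling
        # The string must itself be followed by another <br>.
        if isinstance(after, Tag) and after.name == "br":
            text = node.strip()
            if text:  # ignore whitespace-only gaps between adjacent <br> tags
                found.append(text)
    return found


if __name__ == "__main__":
    for line in text_between_brs(HTML):
        print("Found:", line)
```

Like the original, this returns every non-empty line, including the "Not Important" ones, because it follows the description rather than the labels in the sample data.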

2011-03-11 17:00:28

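Ah, the problem was that I was using findNextSibling(), which just skipped over the text and went to the next line break. Using nextSibling works. Thanks for your help! – maltman 2011-03-14 15:22:29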
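
The difference described in this comment can be seen in a small sketch. It uses the bs4 spellings next_sibling and find_next_sibling (in BeautifulSoup 3 these were nextSibling and findNextSibling), and it assumes the search was for the next br tag, which the comment does not state explicitly.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<br />Important Text 1<br /><br />", "html.parser")
first_br = soup.find("br")

# .next_sibling returns the very next node, which here is the text string.
direct = first_br.next_sibling

# .find_next_sibling("br") searches for the next sibling *tag* named br,
# so it passes over the text node in between.
next_tag = first_br.find_next_sibling("br")

print(repr(str(direct)))  # 'Important Text 1'
print(next_tag.name)      # br
```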

Great answer, this was giving me a real headache! – Nick 2013-07-24 01:58:41

Isn't 'next' a reserved word in Python? Maybe a different variable name would be better? (It's a minor point, but these things add up!) – duhaime 2013-10-18 02:20:50
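
On the reserved-word question, a quick check with the Python 3 standard library suggests that next is a built-in function rather than a reserved word, so using it as a variable name is legal but hides the built-in for the rest of that scope.

```python
import builtins
import keyword

# Reserved words cannot be used as variable names at all.
print(keyword.iskeyword("next"))  # False

# Built-in names can be reassigned, which shadows the original function.
print(hasattr(builtins, "next"))  # True
```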

So, for testing purposes, let's assume that this block of HTML is inside a span tag:

x = """<span><br /> 
Important Text 1 
<br /> 
<br /> 
Not Important Text 
<br /> 
Important Text 2 
<br /> 
Important Text 3 
<br /> 
<br /> 
Non Important Text 
<br /> 
Important Text 4 
<br /></span>"""

Now I'm going to parse it and find my span tag:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)  # parse the snippet before searching it
y = soup.find('span')

If you iterate over the y.childGenerator() generator, you get both the br tags and the text:

In [4]: for a in y.childGenerator(): print type(a), str(a) 
    ....: 
<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 1 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Not Important Text 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 2 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 3 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Non Important Text 

<type 'instance'> <br /> 
<class 'BeautifulSoup.NavigableString'> 
Important Text 4 

<type 'instance'> <br />
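
Building on the listing above, one way to keep only the text is to filter the children by type. This is a sketch with bs4 on Python 3 rather than the BeautifulSoup 3 session shown; it assumes bs4 is installed, uses a shortened copy of the sample, and reads the children through the bs4 attribute .children.

```python
from bs4 import BeautifulSoup, NavigableString

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br /></span>"""

soup = BeautifulSoup(x, "html.parser")
span = soup.find("span")

# .children yields the direct children: <br> tags and text strings alike.
# Keep only the strings, and drop the ones that are just whitespace.
texts = [
    child.strip()
    for child in span.children
    if isinstance(child, NavigableString) and child.strip()
]
print(texts)
```

For a flat structure like this one, list(span.stripped_strings) should give the same list, though it walks all descendants rather than only the direct children.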

2011-03-11 17:01:44

The following worked for me:

for br in soup.findAll('br'):
    if str(type(br.contents[0])) == '<class \'BeautifulSoup.NavigableString\'>':
        print br.contents[0]

2016-02-02 16:59:20 Pontios

Please don't rely on the string representation of objects for code logic. – Sylvain 2017-05-05 10:13:07
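
As a sketch of what this comment suggests, the type test can be done with isinstance() instead of comparing against the printed class name. The example below uses bs4 on Python 3 and a made-up helper name, leading_text; it also guards against tags with no children, which is an addition of this sketch rather than something the answer does.

```python
from bs4 import BeautifulSoup, NavigableString


def leading_text(tag):
    """Return the tag's first child if it is a text node, otherwise None."""
    # Guard against tags with no children at all, such as an empty <br />.
    if not tag.contents:
        return None
    first = tag.contents[0]
    # isinstance() tests the class itself rather than its printed name,
    # so it does not break if the module path or repr ever changes.
    return first if isinstance(first, NavigableString) else None


soup = BeautifulSoup("<p>Important Text 1<br /></p>", "html.parser")
print(leading_text(soup.p))   # Important Text 1
print(leading_text(soup.br))  # None, because <br /> has no children here
```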
