Python BeautifulSoup: extracting text between elements

I am trying to extract "THIS IS MY TEXT" from the following HTML:

<html> 
<body> 
<table> 
    <td class="MYCLASS"> 
     <!-- a comment --> 
     <a hef="xy">Text</a> 
     <p>something</p> 
     THIS IS MY TEXT 
     <p>something else</p> 
     </br> 
    </td> 
</table> 
</body> 
</html>

I tried it this way:

soup = BeautifulSoup(html) 

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): 
    print hit.text

But I get all the text between all the nested tags, plus the comment.

Can anyone help me get just "THIS IS MY TEXT" out of this?

2013-05-30 ɥɔǝnq ɹǝƃloɥ

Use .children instead:

from bs4 import NavigableString, Comment 
print ''.join(unicode(child) for child in hit.children 
    if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): 
...  print ''.join(unicode(child) for child in hit.children 
...   if isinstance(child, NavigableString) and not isinstance(child, Comment)) 
... 




     THIS IS MY TEXT
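
The snippet above is Python 2. A minimal Python 3 sketch of the same idea follows, assuming bs4 with the built-in "html.parser" backend. The HTML constant is a lightly cleaned-up copy of the question's markup, and direct_text is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Assumed sample markup, modelled on the question's HTML.
HTML = """
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a href="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     <br/>
    </td>
</table>
"""


def direct_text(tag):
    """Join the strings that are direct children of `tag`, skipping comments."""
    parts = (
        str(child).strip()
        for child in tag.children
        # Comment is a NavigableString subclass, so it must be excluded explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )
    # Drop the whitespace-only strings that sit between the child tags.
    return " ".join(part for part in parts if part)


soup = BeautifulSoup(HTML, "html.parser")
for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    print(direct_text(hit))  # THIS IS MY TEXT
```

Stripping each piece before joining avoids the blank lines visible in the output above.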

2013-05-30 11:59:13

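This returns u'\na comment\nText\nsomething\nTHIS IS MY TEXT\nsomething else\n' or u'a commentTextsomethingTHIS IS MY TEXTsomething else', which contains more text than needed. –

@CristianCiupitu: Of course, you are right, I wasn't paying attention here. Updated. –

This is the only solution that does not rely on the order or position of the text relative to specific other text, but instead extracts all the text from the specified tag/element while ignoring the text (or other content) of child tags/elements. Thanks! It's awkward, but it works and solves my problem (I'm not the OP, but I had a similar need). – geewiz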

You can use .contents:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): 
...  print hit.contents[6].strip() 
... 
THIS IS MY TEXT

2013-05-30 12:27:58 TerryA

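Thanks, but the text is not always in the same place. Would it work anyway? –

@ɥɔǝnqɹǝƃloɥ Alas, no. Perhaps use one of the other answers – TerryA

What does the number '6' represent? – User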
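
On the index question: 6 is the position of the target string in the cell's .contents list, counting the whitespace strings and the comment as children too. A rough Python 3 sketch, assuming bs4 with "html.parser" and a cleaned-up copy of the question's markup (the index can differ with other parsers or other markup):

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modelled on the question's HTML.
HTML = """
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a href="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     <br/>
    </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find(class_="MYCLASS")

# List every direct child with its position in .contents.
for index, child in enumerate(cell.contents):
    print(index, type(child).__name__, repr(str(child)))

# Look the index up instead of hard-coding it.
target_index = next(
    i for i, child in enumerate(cell.contents)
    if str(child).strip() == "THIS IS MY TEXT"
)
print(target_index)  # 6 for this markup and parser
```

Looking the index up by content only helps when the text is known in advance, which is why the positional approach is fragile for pages that vary.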

Learn more about how to navigate through the parse tree in BeautifulSoup. The parse tree is made up of Tags and NavigableStrings (since this is text). An example:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>', 
     '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.', 
     '<p id="secondpara" align="blah">This is paragraph <b>two</b>.', 
     '</html>'] 
soup = BeautifulSoup(''.join(doc)) 

print soup.prettify() 
# <html> 
# <head> 
# <title> 
# Page title 
# </title> 
# </head> 
# <body> 
# <p id="firstpara" align="center"> 
# This is paragraph 
# <b> 
#  one 
# </b> 
# . 
# </p> 
# <p id="secondpara" align="blah"> 
# This is paragraph 
# <b> 
#  two 
# </b> 
# . 
# </p> 
# </body> 
# </html>

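To move down the parse tree, you have contents and string.

contents is an ordered list of the Tag and NavigableString objects contained within a page element
If a tag has only one child node, and that child node is a string, the child node is available as tag.string as well as tag.contents[0]

For the example above, that is to say, you can get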

soup.b.string 
# u'one' 
soup.b.contents[0] 
# u'one'

For several child nodes, you can have, for instance

pTag = soup.p 
pTag.contents 
# [u'This is paragraph ', <b>one</b>, u'.']

So here you can play with contents and get the content at the index you want.

You can also iterate over a Tag, which is a shortcut. For instance,

for i in soup.body: 
    print i 
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p> 
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
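
The examples above use the BeautifulSoup 3 import and Python 2 syntax. A rough Python 3 equivalent is sketched below, assuming bs4 with "html.parser". The <p> tags are closed explicitly here, since this parser is not assumed to repair the unclosed paragraphs the way the original example relies on.

```python
from bs4 import BeautifulSoup

doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(doc, "html.parser")

# A tag with a single string child exposes it as .string and as .contents[0].
print(soup.b.string)       # one
print(soup.b.contents[0])  # one

# A tag with several children has no single string, so .string is None.
print(soup.p.string)       # None
print(soup.p.contents)     # ['This is paragraph ', <b>one</b>, '.']

# Iterating over a tag walks its direct children.
for child in soup.body:
    print(child)
```

The None result for soup.p.string is the same behaviour that makes hit.string unusable on the question's table cell, which has many children.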

2013-05-30 12:46:44 octoback

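'hit.string' is 'None' and 'hit.contents[0]' is 'u'\n'', so please provide an answer for the example in the question. –

So here you can play with contents and get the content at the index you want. – octoback

is the answer to the question – octoback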

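The BeautifulSoup documentation provides an example of removing objects from a document using the extract method. In the following example, the aim is to remove all comments from the document:

Removing elements

Once you have a reference to an element, you can rip it out of the tree with the extract method. This code removes all the comments from a document: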

from BeautifulSoup import BeautifulSoup, Comment 
soup = BeautifulSoup("""1<!--The loneliest number--> 
        <a>2<!--Can be as bad as one--><b>3""") 
comments = soup.findAll(text=lambda text:isinstance(text, Comment)) 
[comment.extract() for comment in comments] 
print soup 
# 1 
# <a>2<b>3</b></a>
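
Applied to the question's markup, the same extract technique might look like the Python 3 sketch below, assuming bs4 with "html.parser" and a cleaned-up copy of the HTML. It also shows that removing comments alone does not isolate the target, because the text of the nested <a> and <p> tags remains.

```python
from bs4 import BeautifulSoup, Comment

# Assumed sample markup, modelled on the question's HTML.
HTML = """
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a href="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     <br/>
    </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find(class_="MYCLASS")

# Find every comment node inside the cell and detach it from the tree.
for comment in cell.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

# The comment is gone, but text from child tags is still part of the result.
remaining = cell.get_text(" ", strip=True)
print(remaining)  # Text something THIS IS MY TEXT something else
```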

2013-05-30 13:10:09

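Short answer: soup.findAll('p')[0].next

Real answer: You need an invariant reference point from which you can get to your target.

You mention in your comment on Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree along that invariant path.

For example, in the HTML provided in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') finds the paragraph elements, soup.findAll('p')[0] is the first paragraph element.

You could in this case use soup.find('p'), but soup.findAll('p')[n] is more general, since your actual scenario may need the 5th paragraph or something like that.

The next attribute is the next parsed element in the tree, including children. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next returns your target in the HTML provided.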
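
The two hops described above can be sketched in Python 3 as follows, assuming bs4 with "html.parser", where the attribute is spelled next_element (the .next used in this answer is the older spelling). The markup is a cleaned-up copy of the question's HTML.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modelled on the question's HTML.
HTML = """
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a href="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     <br/>
    </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
first_p = soup.find_all("p")[0]

# First hop: into the paragraph, landing on its own text.
inside = first_p.next_element
# Second hop: the next node in parse order, the string after </p>.
after = inside.next_element

print(repr(str(inside)))  # 'something'
print(after.strip())      # THIS IS MY TEXT
```

This depends on the first paragraph holding exactly one string. A paragraph with nested tags would need more hops, so the path has to be checked against the real pages.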

2013-05-31 03:46:28

Using your own soup object:

soup.p.next_sibling.strip()

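You grab the <p> directly with soup.p *(this hinges on it being the first <p> in the parse tree)
Then use next_sibling on the tag object that soup.p returns, since the desired text is nested at the same level of the parse tree as the <p>
.strip() is just a Python str method that removes leading and trailing whitespace

*Otherwise just find the element using your choice of filter(s)

Stepping through it in the interpreter, this looks something like: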

In [4]: soup.p 
Out[4]: <p>something</p> 

In [5]: type(soup.p) 
Out[5]: bs4.element.Tag 

In [6]: soup.p.next_sibling 
Out[6]: u'\n  THIS IS MY TEXT\n  ' 

In [7]: type(soup.p.next_sibling) 
Out[7]: bs4.element.NavigableString 

In [8]: soup.p.next_sibling.strip() 
Out[8]: u'THIS IS MY TEXT' 

In [9]: type(soup.p.next_sibling.strip()) 
Out[9]: unicode
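
For the footnoted case, where the <p> has to be located with a filter rather than soup.p, a hedged Python 3 sketch could look like this. It assumes bs4 with "html.parser" and a cleaned-up copy of the question's markup; text_after is a hypothetical helper, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumed sample markup, modelled on the question's HTML.
HTML = """
<table>
    <td class="MYCLASS">
     <!-- a comment -->
     <a href="xy">Text</a>
     <p>something</p>
     THIS IS MY TEXT
     <p>something else</p>
     <br/>
    </td>
</table>
"""


def text_after(tag):
    """Return the stripped string right after `tag`, or None if there is none."""
    sibling = tag.next_sibling
    # next_sibling may be another tag or None, so check the type before stripping.
    if isinstance(sibling, NavigableString):
        return sibling.strip()
    return None


soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find("td", class_="MYCLASS")
anchor = cell.find("p")  # first <p> inside the matching cell

print(text_after(anchor))  # THIS IS MY TEXT
```

Scoping the search to the cell keeps an earlier <p> elsewhere in the page from being picked up by mistake.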

2014-07-18 21:05:58

Could you add some more explanatory text about how this answers the question? –

Glad to! (see above) –

soup = BeautifulSoup(html) 
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): 
    hit = hit.text.strip() 
    print hit

This will print: THIS IS MY TEXT. Try this.

2018-01-24 10:17:22 Naiswita
