Extracting the content of a tag named "EXTRACT" with BeautifulSoup

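I have an XML corpus in which one of the tags is named <EXTRACT>. That name clashes with an existing BeautifulSoup attribute: extract is a method on every tag. How can I get the content of this tag? Writing entry.extract.text raises an error, and using entry.extract extracts the whole entry instead.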

From what I understand of BeautifulSoup, it case-folds tag names. If there is a way around that, it might also help me.
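The case folding mentioned above can be checked with a minimal sketch. It assumes the document is parsed with the built-in html.parser, as in the answer below; the sample data string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data, not taken from the asker's corpus.
data = "<test><EXTRACT>extract text</EXTRACT></test>"
soup = BeautifulSoup(data, "html.parser")

# html.parser treats the input as HTML, so tag names are stored lowercased.
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)

# Lookups are matched against the stored (lowercased) name.
lower = soup.find("extract")   # finds the tag
upper = soup.find("EXTRACT")   # finds nothing under html.parser
print(lower.text, upper)
```

If the original case matters, parsing with the "xml" feature (which requires lxml to be installed) keeps tag names as written, so the lookup would then be find("EXTRACT").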

Note: for now I have worked around the problem like this:

extra = entry.find('extract') 
absts.write(str(extra.text))

But I would like to know whether there is a way to access it like any other tag, as in entry.tagName.

Source

2014-03-01 Amrith Krishna

According to the BS source code, tag.tagname actually calls tag.find("tagname") under the hood. Here is what the __getattr__() method of the Tag class looks like:

def __getattr__(self, tag):
    if len(tag) > 3 and tag.endswith('Tag'):
        # BS3: soup.aTag -> "soup.find("a")
        tag_name = tag[:-3]
        warnings.warn(
            '.%sTag is deprecated, use .find("%s") instead.' % (
                tag_name, tag_name))
        return self.find(tag_name)
    # We special case contents to avoid recursion.
    elif not tag.startswith("__") and not tag=="contents":
        return self.find(tag)
    raise AttributeError(
        "'%s' object has no attribute '%s'" % (self.__class__, tag))

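As you can see, it is entirely based on find(). Python only calls __getattr__() when normal attribute lookup fails, and since Tag already defines an extract() method, entry.extract resolves to that method and never reaches find(). So it is perfectly fine to use tag.find("extract") in your case: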

from bs4 import BeautifulSoup 


data = """<test><EXTRACT>extract text</EXTRACT></test>""" 
soup = BeautifulSoup(data, 'html.parser') 
test = soup.find('test') 
print(test.find("extract").text)  # prints 'extract text'

You could also use test.extractTag.text, but that form is deprecated and I would not recommend it.

Hope that helps.

Source

2014-03-01 05:46:55 alecxe
