Replacing CDATA NavigableStrings with tags in BeautifulSoup

I'm using BeautifulSoup to parse several XML document sources, and I'd like to do some preprocessing that replaces the nonstandard CDATA markup with a custom XML tag. To illustrate:

The following XML source...

<title>The end of the world as we know it</title> 
<category><![CDATA[Planking Dancing]]></category> 
<pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate> 
<dc:creator><![CDATA[Bart Simpson]]></dc:creator>

...would become:

<title>The end of the world as we know it</title> 
<category><myTag>Planking Dancing</myTag></category> 
<pubDate><myTag>Sun, 16 Sep 2012 12:00:00 EDT</myTag></pubDate> 
<dc:creator><myTag>Bart Simpson</myTag></dc:creator>

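I don't think this question has been asked on SO before (I tried several different SO searches). I've also tried a few different approaches, using .findAll('cdata', text=True) and applying BeautifulSoup's replaceWith() method to each resulting NavigableString. My attempts either resulted in no replacement at all or in what looked like a recursive loop.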

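I'm happy to post my previous attempts, but since the problem here is fairly simple, I'm hoping someone can post a clear example of how to do the search and replace described above using BeautifulSoup 3.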

Source

2012-09-16 tohster

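CData is a subclass of NavigableString, so you can find all CData elements by first searching for all NavigableString objects and then testing whether each one is an instance of CData. Once you have one, it's easy to replace it using replaceWith, as you suggested: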

>>> from BeautifulSoup import BeautifulSoup, CData, Tag 
>>> source = """ 
... <title>The end of the world as we know it</title> 
... <category><![CDATA[Planking Dancing]]></category> 
... <pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate> 
... <dc:creator><![CDATA[Bart Simpson]]></dc:creator> 
... """ 
>>> soup = BeautifulSoup(source) 
>>> for navstr in soup(text=True):  # every NavigableString in the tree
...     if isinstance(navstr, CData):
...         tag = Tag(soup, "myTag")
...         tag.insert(0, navstr[:])  # slicing yields the bare text
...         navstr.replaceWith(tag)
... 
>>> soup 

<title>The end of the world as we know it</title> 
<category><myTag>Planking Dancing</myTag></category> 
<pubdate><myTag>Sun, 16 Sep 2012 12:00:00 EDT</myTag></pubdate> 
<dc:creator><myTag>Bart Simpson</myTag></dc:creator> 

>>>

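Two things to note:

You can call a BeautifulSoup object as if it were a function, and the effect is the same as calling its .findAll() method.
The only way I know of to get the contents of a CData object in BS3 is to slice it, as in the snippet above. str(navstr) would keep all the <![CDATA[...]]> junk, which you obviously don't want. In BS4, str(navstr) gives you the content without the junk.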
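
If the substitution is only needed as a preprocessing step, a parser-independent alternative to the BeautifulSoup 3 loop above is to rewrite the raw text before parsing it. The following is a minimal sketch using only the standard library, not the method from the answer; the names replace_cdata and CDATA_RE are made up for illustration. It escapes the CDATA content, on the assumption that characters such as < and & (literal inside CDATA) should not turn into markup once the wrapper is gone.

```python
import re
from xml.sax.saxutils import escape

# Matches one CDATA section. DOTALL lets the content span several lines,
# and the non-greedy group stops at the first "]]>".
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def replace_cdata(source, tag="myTag"):
    """Wrap the content of every CDATA section in <tag>...</tag>.

    Hypothetical helper: it works on the raw string, not on a parse tree.
    """
    def wrap(match):
        # Escape &, < and > so the former CDATA content stays plain text.
        return "<%s>%s</%s>" % (tag, escape(match.group(1)), tag)

    return CDATA_RE.sub(wrap, source)


source = """
<title>The end of the world as we know it</title>
<category><![CDATA[Planking Dancing]]></category>
<dc:creator><![CDATA[Bart Simpson]]></dc:creator>
"""
print(replace_cdata(source))
```

The rewritten string can then be handed to whichever parser is in use. Unlike the tree-based loop, this approach leaves tag names untouched (no lowercasing of pubDate), but it knows nothing about document structure, so it would also rewrite CDATA sections that appear inside comments.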

Source

2012-11-18 03:13:07
