Filter an XML file to remove lines that contain specific text?

For example, suppose I have:

<div class="info"><p><b>Orange</b>, <b>One</b>, ... 
<div class="info"><p><b>Blue</b>, <b>Two</b>, ... 
<div class="info"><p><b>Red</b>, <b>Three</b>, ... 
<div class="info"><p><b>Yellow</b>, <b>Four</b>, ...

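I want to remove every line that contains a word from a given list, so that I only run XPath on the lines that meet my criteria. For example, I could use the list ['Orange', 'Red'] to mark the unwanted lines, so in the example above I would only want to use lines 2 and 4 for further processing.

How can I do this?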

2011-07-03 roni

Good question, +1. See my answer for a complete but short one-liner XPath expression solution.

Use:

//div 
    [not(p/b[contains('|Orange|Red|', 
        concat('|', ., '|') 
        ) 
      ] 
     ) 
    ]

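This selects every div element in the XML document that has no p child with a b child whose string value is one of the strings in the pipe-delimited list used as the filter.

This approach is easy to extend: just add new filter values to the pipe-delimited list, without changing anything else in the XPath expression.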
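
As a minimal sketch of that extensibility, the expression can be generated from a plain Python list. The helper name `build_exclusion_xpath` is hypothetical (it is not part of the answer above), and it assumes that no filter word contains the `|` delimiter or a single quote.

```python
def build_exclusion_xpath(words):
    """Build the div filter expression from a list of unwanted words.

    Hypothetical helper, not part of the original answer. It assumes no word
    contains the '|' delimiter or a single quote, since either would corrupt
    the pipe-delimited list or the XPath string literal.
    """
    for word in words:
        if "|" in word or "'" in word:
            raise ValueError("unsupported character in filter word: %r" % word)
    # Wrap the list in delimiters so that only whole values can match.
    delimited = "|" + "|".join(words) + "|"
    return (
        "//div[not(p/b[contains('{0}', concat('|', ., '|'))])]"
    ).format(delimited)


# Adding a word changes only the delimited list, not the rest of the expression.
expr = build_exclusion_xpath(["Orange", "Red"])
extended = build_exclusion_xpath(["Orange", "Red", "Green"])
```

With lxml, the resulting string could then be passed to the `xpath()` method of a parsed tree.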

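Note: When the structure of the XML document is statically known, always avoid the // XPath pseudo-operator, because it can cause significant inefficiency (slowdown).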
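
To illustrate the note, here is a rough sketch using Python's standard-library `xml.etree.ElementTree` rather than a full XPath engine. Its XPath subset has no `contains()` or `not()`, so the word filter is done in Python here. The sketch assumes the div rows are direct children of a hypothetical `<table>` root; under that assumption a child-step path selects the same elements as a descendant search without walking the whole tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the rows are direct children of the root.
CONTENT = """<table>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</table>"""

EXCLUDE = {"Orange", "Red"}


def kept_divs(root, path):
    """Return the divs matched by path that have no excluded <b> value."""
    kept = []
    for div in root.findall(path):
        values = {b.text for b in div.findall("p/b")}
        if not EXCLUDE & values:
            kept.append(div)
    return kept


root = ET.fromstring(CONTENT)
# Child step only: examines the root's direct children.
specific = [d.find("p/b").text for d in kept_divs(root, "div")]
# Descendant search: visits every element below the root.
general = [d.find("p/b").text for d in kept_divs(root, ".//div")]
```

Both paths select the same rows in this sample; the difference is only in how much of the tree has to be searched.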

2011-07-03 20:52:39

import lxml.html as lh

# http://lxml.de/xpathxslt.html
# http://exslt.org/regexp/functions/match/index.html
content = '''\
<table>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</table>
'''
# EXSLT regular-expressions namespace, needed for re:test() in the XPath below
NS = 'http://exslt.org/regular-expressions'
tree = lh.fromstring(content)
exclude = ['Orange', 'Red']
# Keep only the divs whose first <b> text does not match any excluded word
for elt in tree.xpath(
        "//div[not(re:test(p/b[1]/text(), '{0}'))]".format('|'.join(exclude)),
        namespaces={'re': NS}):
    # encoding='unicode' returns str instead of bytes on Python 3
    print(lh.tostring(elt, encoding='unicode'))
    print('-' * 80)

yields

<div class="info"><p><b>Blue</b>, <b>Two</b></p></div> 

-------------------------------------------------------------------------------- 
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div> 

--------------------------------------------------------------------------------
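
One thing to watch with the regular-expression approach: `re:test` succeeds if the pattern matches anywhere in the string, so `'Red'` would also exclude a row whose first `<b>` is, say, `Reddish`. A minimal sketch of building an anchored pattern follows. `build_exact_pattern` is a hypothetical helper, and the sketch assumes lxml evaluates EXSLT regular expressions with Python's `re` module, so that `re.escape` and `(?:...)` behave as they do in plain Python.

```python
import re


def build_exact_pattern(words):
    """Return a regex that matches only when the whole string is one of words.

    Hypothetical helper. Single quotes are rejected because the pattern is
    later embedded in a single-quoted XPath string literal.
    """
    for word in words:
        if "'" in word:
            raise ValueError("single quote not supported: %r" % word)
    # Escape each word so regex metacharacters are treated literally.
    return "^(?:" + "|".join(re.escape(w) for w in words) + ")$"


pattern = build_exact_pattern(["Orange", "Red"])
# Same expression shape as in the answer above, with the anchored pattern.
xpath = "//div[not(re:test(p/b[1]/text(), '{0}'))]".format(pattern)
```

The unanchored pattern `Orange|Red` is fine for the sample data; anchoring only matters when one value can occur inside another.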

2011-07-03 21:04:09 unutbu
