2010-06-09 58 views
30

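Building on another SO question: how can I check whether two well-formed XML snippets are semantically equal? All I need is "equal or not", since I'm using this for unit tests. Compare XML snippets?

In the system I want, these two would be equal (note the order of 'start' and 'end'):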

<?xml version='1.0' encoding='utf-8' standalone='yes'?> 
<Stats start="1275955200" end="1276041599"> 
</Stats> 

# Reordered start and end 

<?xml version='1.0' encoding='utf-8' standalone='yes'?> 
<Stats end="1276041599" start="1275955200" > 
</Stats> 

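I have lxml and other tools at my disposal, and a simple function that only allows for reordering of attributes would work fine too!


Working snippet based on IanB's answer: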

from formencode.doctest_xml_compare import xml_compare 
# have to strip these or fromstring carps 
xml1 = """ <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats start="1275955200" end="1276041599"></Stats>""" 
xml2 = """  <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats end="1276041599" start="1275955200"></Stats>""" 
xml3 = """ <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats start="1275955200"></Stats>""" 

from lxml import etree 
tree1 = etree.fromstring(xml1.strip()) 
tree2 = etree.fromstring(xml2.strip()) 
tree3 = etree.fromstring(xml3.strip()) 

import sys 
reporter = lambda x: sys.stdout.write(x + "\n") 

assert xml_compare(tree1,tree2,reporter) 
assert xml_compare(tree1,tree3,reporter) is False 
+1

'from formencode.doctest_xml_compare import xml_compare' – laike9m 2015-01-04 08:59:35

Answers

24

You can use formencode.doctest_xml_compare: its xml_compare function compares two ElementTree or lxml trees.

+0

Thanks Ian, I'm glad someone has already solved this one! – 2010-06-09 17:37:56

+2

This function is not correct: if you swap the attribute order in the XML, it returns False. – mnowotka 2014-05-29 13:59:08

+0

@mnowotka Not true, it does treat _attributes_ in a different order as equal – Anentropic 2015-02-06 12:44:15

2

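If you take a DOM approach, you can traverse both trees simultaneously, comparing nodes (node type, text, attributes) as you go.

A recursive solution would be the most elegant: short-circuit further comparison as soon as a pair of nodes is not "equal", or as soon as you detect a leaf in one tree where the other has a branch, and so on.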
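
A minimal sketch of that recursive walk, using the standard library's xml.etree.ElementTree instead of a DOM API. The name elements_equal and the decision to ignore whitespace around text are assumptions made for illustration; they are not taken from the answer above.

```python
import xml.etree.ElementTree as ET


def elements_equal(a, b):
    """Return True if two elements match in tag, attributes, text and children."""
    if a.tag != b.tag:
        return False
    # attrib is a plain dict, so attribute order does not matter here.
    if a.attrib != b.attrib:
        return False
    # Assumption: whitespace around text and tail is not significant.
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if (a.tail or "").strip() != (b.tail or "").strip():
        return False
    # A leaf on one side and a branch on the other is caught here.
    if len(a) != len(b):
        return False
    # Children are compared pairwise in document order; all() short-circuits
    # on the first pair that differs.
    return all(elements_equal(x, y) for x, y in zip(a, b))


same = elements_equal(
    ET.fromstring('<Stats start="1275955200" end="1276041599"></Stats>'),
    ET.fromstring('<Stats end="1276041599" start="1275955200" ></Stats>'),
)
different = elements_equal(
    ET.fromstring('<Stats start="1275955200" end="1276041599"></Stats>'),
    ET.fromstring('<Stats start="1275955200"></Stats>'),
)
```

With the two Stats documents from the question, same should come out True and different False. Child order is still significant in this sketch.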

+1

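This is the solution; I was just hoping someone had already written one. – 2010-06-11 14:25:01

5

I had the same problem: two documents I wanted to compare that had the same attributes but in different orders.

XML Canonicalization (C14N) in lxml seems to work well for this, but I'm definitely not an XML expert. I'd be curious to know whether somebody else can point out drawbacks of this approach.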

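from StringIO import StringIO  # Python 2, as is the print statement below
from lxml import etree

# xml_string1 and xml_string2 hold the two XML documents to compare
parser = etree.XMLParser(remove_blank_text=True) 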

xml1 = etree.fromstring(xml_string1, parser) 
xml2 = etree.fromstring(xml_string2, parser) 

print "xml1 == xml2: " + str(xml1 == xml2) 

ppxml1 = etree.tostring(xml1, pretty_print=True) 
ppxml2 = etree.tostring(xml2, pretty_print=True) 

print "pretty(xml1) == pretty(xml2): " + str(ppxml1 == ppxml2) 

xml_string_io1 = StringIO() 
xml1.getroottree().write_c14n(xml_string_io1) 
cxml1 = xml_string_io1.getvalue() 

xml_string_io2 = StringIO() 
xml2.getroottree().write_c14n(xml_string_io2) 
cxml2 = xml_string_io2.getvalue() 

print "canonicalize(xml1) == canonicalize(xml2): " + str(cxml1 == cxml2) 

Running this gives me:

$ python test.py 
xml1 == xml2: False 
pretty(xml1) == pretty(xml2): False 
canonicalize(xml1) == canonicalize(xml2): True 
+0

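I had the same idea and am also looking for the drawbacks of this approach, or for whether it might really be the canonical way to compare XML documents... (pun intended) – michuelnik 2014-01-29 22:02:57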

+0

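I've been using this on a running website that compares XML documents for version-control purposes. It works well, but c14n doesn't handle identical child elements that appear in a different order, so I still sometimes get spurious results. – 2014-01-30 00:55:33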

+0

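Does c14n reorder children? I guess not... You mean that when the same children are present but in a different order, you want a "no difference" result, yet this reports "difference detected"? In my opinion the order of children may well matter. ;) – michuelnik 2014-01-30 13:41:40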
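
The point made in these comments can be checked with a small sketch. It uses the standard library's xml.etree.ElementTree.canonicalize (Python 3.8+, C14N 2.0) rather than lxml's write_c14n, so it is a stand-in for the answer's code, not a copy of it; the helper name c14n_equal is an assumption.

```python
from xml.etree.ElementTree import canonicalize  # available from Python 3.8


def c14n_equal(xml_a, xml_b):
    """Compare two XML strings by their canonical (C14N 2.0) form."""
    # strip_text=True drops whitespace around text content, so differences
    # that come only from indentation are ignored.
    return canonicalize(xml_a, strip_text=True) == canonicalize(xml_b, strip_text=True)


# Same attributes, different order: canonicalization sorts them.
attrs_swapped = c14n_equal('<Stats start="1" end="2"/>', '<Stats end="2" start="1"/>')

# Same children, different order: canonicalization keeps document order.
children_swapped = c14n_equal('<r><a/><b/></r>', '<r><b/><a/></r>')
```

Here attrs_swapped should be True and children_swapped False, which matches the behaviour described above: canonicalization normalizes attribute order but leaves child order alone.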

1

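Thinking about this problem, I came up with the following solution, which makes XML elements comparable and sortable:

# Python 2 only: relies on the built-in cmp()
import xml.etree.ElementTree as ET 

def cmpElement(x, y): 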
    # compare type 
    r = cmp(type(x), type(y)) 
    if r: return r 
    # compare tag 
    r = cmp(x.tag, y.tag) 
    if r: return r 
    # compare tag attributes 
    r = cmp(x.attrib, y.attrib) 
    if r: return r 
    # compare stripped text content 
    xtext = (x.text and x.text.strip()) or None 
    ytext = (y.text and y.text.strip()) or None 
    r = cmp(xtext, ytext) 
    if r: return r 
    # compare sorted children 
    if len(x) or len(y): 
        return cmp(sorted(x.getchildren()), sorted(y.getchildren())) 
    return 0 

ET._ElementInterface.__lt__ = lambda self, other: cmpElement(self, other) == -1 
ET._ElementInterface.__gt__ = lambda self, other: cmpElement(self, other) == 1 
ET._ElementInterface.__le__ = lambda self, other: cmpElement(self, other) <= 0 
ET._ElementInterface.__ge__ = lambda self, other: cmpElement(self, other) >= 0 
ET._ElementInterface.__eq__ = lambda self, other: cmpElement(self, other) == 0 
ET._ElementInterface.__ne__ = lambda self, other: cmpElement(self, other) != 0 
14

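The order of elements can be significant in XML, which may be why most of the other suggested methods compare unequal if the order differs... even when the elements have the same attributes and text content.

But I also wanted an order-insensitive comparison, so I came up with this: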

from lxml import etree 
import xmltodict # pip install xmltodict 


def normalise_dict(d): 
    """ 
    Recursively convert dict-like object (eg OrderedDict) into plain dict. 
    Sorts list values. 
    """ 
    out = {} 
    for k, v in dict(d).iteritems(): 
        if hasattr(v, 'iteritems'): 
            out[k] = normalise_dict(v) 
        elif isinstance(v, list): 
            out[k] = [] 
            for item in sorted(v): 
                if hasattr(item, 'iteritems'): 
                    out[k].append(normalise_dict(item)) 
                else: 
                    out[k].append(item) 
        else: 
            out[k] = v 
    return out 


def xml_compare(a, b): 
    """ 
    Compares two XML documents (as string or etree) 

    Does not care about element order 
    """ 
    if not isinstance(a, basestring): 
        a = etree.tostring(a) 
    if not isinstance(b, basestring): 
        b = etree.tostring(b) 
    a = normalise_dict(xmltodict.parse(a)) 
    b = normalise_dict(xmltodict.parse(b)) 
    return a == b 
+1

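This is definitely the best answer and should be accepted. It's the only answer that actually takes into account the fact that field order in XML doesn't matter. – mnowotka 2014-05-29 13:58:02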

+3

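Two things to consider: the order of _attributes_ really doesn't matter. But the order of elements is significant in XML; this code is for the special case where you don't care about element order. – Anentropic 2014-05-29 14:15:49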
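
The distinction drawn in that last comment can be made explicit without xmltodict. The following is a rough sketch using only the standard library; element_key and unordered_equal are hypothetical names, and ignoring tail text and surrounding whitespace is an assumption, not something the answer specifies.

```python
import xml.etree.ElementTree as ET


def element_key(elem):
    """Build a sortable key that ignores attribute order and sibling order."""
    return (
        elem.tag,
        sorted(elem.attrib.items()),                    # attribute order never matters
        (elem.text or "").strip(),                      # assumption: ignore surrounding whitespace
        sorted(element_key(child) for child in elem),   # deliberately ignore sibling order
    )


def unordered_equal(xml_a, xml_b):
    """Compare two XML strings while ignoring element and attribute order."""
    return element_key(ET.fromstring(xml_a)) == element_key(ET.fromstring(xml_b))


reordered = unordered_equal('<r><a x="1"/><b/></r>', '<r><b/><a x="1"/></r>')
extra_child = unordered_equal('<r><a/></r>', '<r><a/><a/></r>')
```

Here reordered should be True and extra_child False. Dropping the inner sorted() call around the children would turn this into an order-sensitive comparison that still ignores attribute order.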

0

Adapting Anentropic's great answer to Python 3 (basically, change iteritems() to items(), and basestring to str):

import json 

from lxml import etree 
import xmltodict # pip install xmltodict 


def normalise_dict(d): 
    """ 
    Recursively convert dict-like object (eg OrderedDict) into plain dict. 
    Sorts list values. 
    """ 
    out = {} 
    for k, v in dict(d).items(): 
        if hasattr(v, 'items'): 
            out[k] = normalise_dict(v) 
        elif isinstance(v, list): 
            # Normalise the items first, then sort on their JSON dump: 
            # Python 3 cannot order dicts (or None and str) directly. 
            items = [normalise_dict(i) if hasattr(i, 'items') else i for i in v] 
            out[k] = sorted(items, key=lambda i: json.dumps(i, sort_keys=True)) 
        else: 
            out[k] = v 
    return out 


def xml_compare(a, b): 
    """ 
    Compares two XML documents (as string or etree) 

    Does not care about element order 
    """ 
    # etree.tostring() returns bytes on Python 3, so accept bytes as well 
    if not isinstance(a, (str, bytes)): 
        a = etree.tostring(a) 
    if not isinstance(b, (str, bytes)): 
        b = etree.tostring(b) 
    a = normalise_dict(xmltodict.parse(a)) 
    b = normalise_dict(xmltodict.parse(b)) 
    return a == b 
+1

You can use the 'dict_constructor=dict' option of xmltodict: 'xmltodict.parse(a, dict_constructor=dict)', so you don't need the 'normalise_dict' function. – inoks 2016-06-11 19:16:20

0

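Since the order of attributes is not significant in XML, you want to ignore differences caused by attribute ordering. XML canonicalization (C14N) orders attributes deterministically, so you can use it to test for equality: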

xml1 = b''' <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats start="1275955200" end="1276041599"></Stats>''' 
xml2 = b'''  <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats end="1276041599" start="1275955200"></Stats>''' 
xml3 = b''' <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats start="1275955200"></Stats>''' 

import lxml.etree 

tree1 = lxml.etree.fromstring(xml1.strip()) 
tree2 = lxml.etree.fromstring(xml2.strip()) 
tree3 = lxml.etree.fromstring(xml3.strip()) 

import io 

b1 = io.BytesIO() 
b2 = io.BytesIO() 
b3 = io.BytesIO() 

tree1.getroottree().write_c14n(b1) 
tree2.getroottree().write_c14n(b2) 
tree3.getroottree().write_c14n(b3) 

assert b1.getvalue() == b2.getvalue() 
assert b1.getvalue() != b3.getvalue() 

Note that this example assumes Python 3. With Python 3, the use of b'''...''' strings and io.BytesIO is mandatory, while with Python 2 this method also works with normal strings and io.StringIO.

5

Here is a simple solution: convert the XML into dictionaries (with xmltodict) and compare the dictionaries.

import json 
import xmltodict 

class XmlDiff(object): 
    def __init__(self, xml1, xml2): 
        self.dict1 = json.loads(json.dumps((xmltodict.parse(xml1)))) 
        self.dict2 = json.loads(json.dumps((xmltodict.parse(xml2)))) 

    def equal(self): 
        return self.dict1 == self.dict2 

Unit tests

import unittest 

class XMLDiffTestCase(unittest.TestCase): 

    def test_xml_equal(self): 
        xml1 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?> 
        <Stats start="1275955200" end="1276041599"> 
        </Stats>""" 
        xml2 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?> 
        <Stats end="1276041599" start="1275955200" > 
        </Stats>""" 
        self.assertTrue(XmlDiff(xml1, xml2).equal()) 

    def test_xml_not_equal(self): 
        xml1 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?> 
        <Stats start="1275955200"> 
        </Stats>""" 
        xml2 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?> 
        <Stats end="1276041599" start="1275955200" > 
        </Stats>""" 
        self.assertFalse(XmlDiff(xml1, xml2).equal()) 

Or as a simple Python function:

import json 
import xmltodict 

def xml_equal(a, b): 
    """ 
    Compares two XML documents (as strings) 

    Does not care about attribute order; repeated sibling elements 
    are still compared in document order 
    """ 
    return json.loads(json.dumps((xmltodict.parse(a)))) == json.loads(json.dumps((xmltodict.parse(b)))) 
0

What about the following code snippet? It can easily be extended to include attribs as well:

import xml.etree.ElementTree as ET 

# The functions below are methods of a class (note the `self` parameter); 
# the enclosing class is not shown. 

def separator(self): 
    return "!@#$%^&*" # Very ugly separator 

def _traverseXML(self, xmlElem, tags, xpaths): 
    tags.append(xmlElem.tag) 
    for e in xmlElem: 
        self._traverseXML(e, tags, xpaths) 

    text = '' 
    if xmlElem.text: 
        text = xmlElem.text.strip() 

    xpaths.add("/".join(tags) + self.separator() + text) 
    tags.pop() 

def _xmlToSet(self, xml): 
    xpaths = set() # output 
    tags = list() 
    root = ET.fromstring(xml) 
    self._traverseXML(root, tags, xpaths) 

    return xpaths 

def _areXMLsAlike(self, xml1, xml2): 
    xpaths1 = self._xmlToSet(xml1) 
    xpaths2 = self._xmlToSet(xml2) 

    return xpaths1 == xpaths2
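
One possible way to extend this idea to attributes, written as standalone functions instead of methods. This is a sketch under assumptions: xml_to_path_set and xmls_alike are hypothetical names, and each entry is stored as a tuple so that no separator string is needed.

```python
import xml.etree.ElementTree as ET


def xml_to_path_set(xml):
    """Collect one (path, sorted attributes, stripped text) entry per element."""
    entries = set()

    def walk(elem, tags):
        tags = tags + [elem.tag]          # path from the root down to this element
        text = (elem.text or "").strip()
        # Sorting the attribute pairs makes the entry independent of attribute order.
        entries.add(("/".join(tags), tuple(sorted(elem.attrib.items())), text))
        for child in elem:
            walk(child, tags)

    walk(ET.fromstring(xml), [])
    return entries


def xmls_alike(xml1, xml2):
    """Compare two XML strings by their sets of path entries."""
    return xml_to_path_set(xml1) == xml_to_path_set(xml2)


alike = xmls_alike('<r><a x="1">t</a><b/></r>', '<r><b/><a x="1"> t </a></r>')
attr_differs = xmls_alike('<r><a x="1"/></r>', '<r><a x="2"/></r>')
```

Here alike should be True and attr_differs False. Because the entries live in a set, identical repeated elements collapse into one entry, so '<r><a/></r>' and '<r><a/><a/></r>' compare as alike; a collections.Counter would be needed where that matters.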