Element-wise comparison of tuples, returning a set of sets - python

I am new to Python. Could someone help me with this problem? I have a dataset with the attribute names in the first row and the records in the remaining rows.

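My requirement is to compare every record with the other records and give the attribute names of the elements that differ. So in the end, I should have a set of sets as output.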

For example, suppose I have 3 records with 3 columns like this:

        Col1  Col2  Col3
tuple1  H     C     G
tuple2  H     M     G
tuple3  L     M     S

It should give me something like this:

tuple1, tuple2 = {Col2}
tuple1, tuple3 = {Col1, Col2, Col3}
tuple2, tuple3 = {Col1, Col3}

And the final output should be {{Col2}, {Col1, Col2, Col3}, {Col1, Col3}}

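Here is what I have tried.

I read each row into a list: all the attributes go into one list (named list_attr) and the records into a list of lists (named rows). Then, for each record, I loop over the other records, compare each element, and use the indices of the differing elements to look up the attribute names. Finally I convert them to sets. My code is below, but the problem is that I have 50k records and 15 attributes to process, so this loop takes a very long time to run. Is there another way to do this quickly, or to improve the performance?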

dis_sets = []
for l in rows:
    for l1 in rows:
        if l != l1:
            i = 0
            in_sets = []
            while(i < length):
                if l[i] != l1[i]:
                    in_sets.append(list_attr[i])
                i = i+1
            if in_sets != []:
                dis_sets.append(in_sets)
skt = set(frozenset(temp) for temp in dis_sets)

Source

2014-09-03 ds_user

It looks like you understand the requirements well. Now try writing a solution, and when you run into a problem you cannot solve, post it here. – alfasin 2014-09-03 02:57:43

I tried writing the code and ended up with nested loops, which take a long time, so I am looking for a better alternative. – 2014-09-03 02:58:56

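Try writing down the algorithm (the steps) needed to solve the problem. Then write a function for each step. Once each step works on its own, combine the steps into a workflow: call each one in order from a main function. Trying to "swallow" all the logic into one giant function is bad for reading, debugging, testing, and maintenance (as you have already experienced). – alfasin 2014-09-03 03:02:26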
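
To make that advice concrete, here is one way the steps might be split up. This is only a sketch: the function names and the in-memory `table` layout (header row first, then records) are assumptions made for illustration, not code from the question.

```python
def split_header(table):
    """Step 1: separate the attribute names (first row) from the records."""
    return table[0], table[1:]


def differing_attributes(names, a, b):
    """Step 2: names of the attributes whose values differ between a and b."""
    return frozenset(n for n, x, y in zip(names, a, b) if x != y)


def main(table):
    """Step 3: call the steps in order and collect the results."""
    names, records = split_header(table)
    results = set()
    for i, a in enumerate(records):
        for b in records[i + 1:]:  # visit each pair of records only once
            diff = differing_attributes(names, a, b)
            if diff:  # identical records produce an empty set; skip it
                results.add(diff)
    return results


table = [
    ['Col1', 'Col2', 'Col3'],
    ['H', 'C', 'G'],
    ['H', 'M', 'G'],
    ['L', 'M', 'S'],
]
output = main(table)
```

Each function can be tested on its own before the pieces are combined, which is the point of the comment above.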

Consider:

>>> tuple1=('H', 'C', 'G') 
>>> tuple2=('H', 'M', 'G') 
>>> tuple3=('L', 'M', 'S')

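OK, you state: "My requirement is to compare every record with the other records and give the attribute names of the elements that differ."

Put that into code: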

>>> [i for i, t in enumerate(zip(tuple1, tuple2), 1) if t[0]!=t[1]] 
[2] 
>>> [i for i, t in enumerate(zip(tuple1, tuple3), 1) if t[0]!=t[1]] 
[1, 2, 3] 
>>> [i for i, t in enumerate(zip(tuple2, tuple3), 1) if t[0]!=t[1]] 
[1, 3]

Then you state: "the final output should be {{Col2},{Col1,Col2,Col3},{Col1,Col3}}".

Since a set of sets loses its order, that does not make sense. It should be:

>>> [[i for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]] for pair in 
...  [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]] 
[[2], [1, 2, 3], [1, 3]]

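If you really want sets, you can make them the sub-elements; if you have a true set of sets, you lose the information about which pair produced which result.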
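
One practical detail, as an aside: in Python a plain `set` is unhashable and cannot be an element of another set, so a literal set of sets has to be built from `frozenset` objects. A minimal sketch using the three example tuples (the name `diff_indices` is made up for illustration):

```python
tuple1 = ('H', 'C', 'G')
tuple2 = ('H', 'M', 'G')
tuple3 = ('L', 'M', 'S')


def diff_indices(a, b):
    """Return the 1-based positions at which tuples a and b differ."""
    return frozenset(i for i, (x, y) in enumerate(zip(a, b), 1) if x != y)


pairs_to_compare = [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]

# A set of frozensets is hashable but unordered, so nothing in it records
# which pair of tuples produced which element.
set_of_sets = {diff_indices(a, b) for a, b in pairs_to_compare}
```

If two different pairs happen to differ in the same positions, they also collapse into a single element of the outer set.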

A list of sets:

>>> [{i for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]} for pair in 
...  [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]] 
[set([2]), set([1, 2, 3]), set([1, 3])]

And something nearly identical to your desired output:

>>> [{'Col{}'.format(i) for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]} for pair in 
...  [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]] 
[set(['Col2']), set(['Col2', 'Col3', 'Col1']), set(['Col3', 'Col1'])]

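(Note that because sets are unordered, the order of the strings differs. If the top-level order changes as well, what do you have?)

Note that with a list of lists, you are closer to your desired output: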

>>> [['Col{}'.format(i) for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]] for pair 
...  in [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]] 
[['Col2'], ['Col1', 'Col2', 'Col3'], ['Col1', 'Col3']]

Edit based on comments

You can do something like:

def pairs(LoT):
    # for production code, consider using a deque of tuples...
    LoT = list(LoT)  # work on a copy so the caller's list is not emptied
    seen = set()  # hold the pair combinations seen
    while LoT:
        f = LoT.pop(0)
        for e in LoT:
            se = frozenset([f, e])
            if se not in seen:
                seen.add(se)
                yield se

>>> list(pairs([('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')])) 
[frozenset([('H', 'M', 'G'), ('H', 'C', 'G')]), frozenset([('L', 'M', 'S'), ('H', 'C', 'G')]), frozenset([('H', 'M', 'G'), ('L', 'M', 'S')])]

It can then be used like this:

>>> LoT=[('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')] 
>>> [['Col{}'.format(i) for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]] for pair 
...  in pairs(LoT)] 
[['Col2'], ['Col1', 'Col2', 'Col3'], ['Col1', 'Col3']]
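
As an alternative to the hand-written generator, the standard library's `itertools.combinations` yields each unordered pair of positions exactly once, keeps the two tuples in their original order, and leaves the input list untouched. A sketch under those assumptions (the function name `diff_columns` is illustrative and not part of the answer above):

```python
from itertools import combinations


def diff_columns(a, b):
    """Return labels 'Col1', 'Col2', ... for positions where a and b differ."""
    return ['Col{}'.format(i) for i, (x, y) in enumerate(zip(a, b), 1) if x != y]


LoT = [('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')]

# combinations(LoT, 2) yields (LoT[0], LoT[1]), (LoT[0], LoT[2]), (LoT[1], LoT[2])
result = [diff_columns(a, b) for a, b in combinations(LoT, 2)]
print(result)  # [['Col2'], ['Col1', 'Col2', 'Col3'], ['Col1', 'Col3']]
```

Because `combinations` pairs items by position, duplicate tuples in the list would still be paired with each other and produce an empty list.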

Edit #2

If you want header names rather than computed column labels:

>>> theader=['tuple col 1', 'col 2', 'the third' ] 
>>> [[theader[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]] for pair 
...  in pairs(LoT)] 
[['col 2'], ['tuple col 1', 'col 2', 'the third'], ['tuple col 1', 'the third']]

If you want (what I suspect is the right answer) a list of dicts:

>>> di=[] 
>>> for pair in pairs(LoT):
...     di.append({repr(list(pair)): [theader[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]]})
...
>>> di 
[{"[('H', 'M', 'G'), ('H', 'C', 'G')]": ['col 2']}, {"[('L', 'M', 'S'), ('H', 'C', 'G')]": ['tuple col 1', 'col 2', 'the third']}, {"[('H', 'M', 'G'), ('L', 'M', 'S')]": ['tuple col 1', 'the third']}]

Or, just a straight dict of lists:

>>> di={} 
>>> for pair in pairs(LoT):
...     di[repr(list(pair))]=[theader[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]]
...
>>> di 
{"[('H', 'M', 'G'), ('L', 'M', 'S')]": ['tuple col 1', 'the third'], "[('L', 'M', 'S'), ('H', 'C', 'G')]": ['tuple col 1', 'col 2', 'the third'], "[('H', 'M', 'G'), ('H', 'C', 'G')]": ['col 2']}
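
For the 50k-record case from the question, comparing every pair means on the order of a billion comparisons, so it may help to shrink the work first. The sketch below is one possible approach, not a benchmarked solution. It drops duplicate records (identical records yield identical difference sets) and stops early once every possible non-empty attribute combination has been seen; with n attributes there are at most 2**n - 1 of them. The names `list_attr`, `rows`, and `skt` follow the question; `diff_sets` is made up for illustration.

```python
from itertools import combinations


def diff_sets(list_attr, rows):
    """Return the set of frozensets of attribute names in which some pair of rows differs."""
    # Drop duplicate records while keeping their first-seen order.
    unique_rows = list(dict.fromkeys(map(tuple, rows)))
    # Number of possible non-empty subsets of the attributes.
    max_sets = 2 ** len(list_attr) - 1
    found = set()
    for a, b in combinations(unique_rows, 2):
        found.add(frozenset(name for name, x, y in zip(list_attr, a, b) if x != y))
        if len(found) == max_sets:  # nothing new can appear, so stop early
            break
    return found


list_attr = ['Col1', 'Col2', 'Col3']
rows = [['H', 'C', 'G'], ['H', 'M', 'G'], ['L', 'M', 'S'], ['H', 'C', 'G']]
skt = diff_sets(list_attr, rows)
```

The worst case is still quadratic in the number of distinct records, so how much this helps depends on how many duplicates the data contains and how quickly the possible combinations are exhausted.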

Source

2014-09-03 03:31:35 dawg

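Hey, thanks for the reply. But I need to compare each tuple with every other tuple, and I have 50k tuples in total, so it is literally impossible to write the pairs out by hand like [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]. Also, I do not need the differing elements themselves in the output sets; I need the column names of the differing elements as sets, so I want to use the indices of the differing elements to look up the columns (attributes) in the attribute list (here, list_attr). Can you help me with that? – 2014-09-03 03:59:23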

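Hi, thanks again. This looks perfect. But I need to use the column names rather than the values in the list. The first row of the dataset holds the column names. For example, if the column names for (H, C, G) are Employee, Department, Manager, then I need to return those. – 2014-09-03 04:29:37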

I think you can take it from here. ;-) If you have problems, please ask a new question. `zip` is your friend. – dawg 2014-09-03 04:31:06
