2014-12-02 43 views
0

Advanced grouping in Python

I'm a bit stumped here. Given data like this:

data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)), 
     ('topic2', (['oranges', 'raisins'], 0.14975108213820515)), 
     ('topic3', (['grapes', 'raisins'], 0.14975108213820515)), 
     ('topic4', (['trees', 'flowers'], 0.14975108213820515))] 

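I want to connect topics if at least one of the texts in their arrays (the first element of the second element of each tuple) is shared. So, in the case above: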

topic1 is connected to topic2 
topic2 is connected to topic1 and topic3 
topic3 is connected to topic2 
topic4 is unconnected 

Ideally, my output would look like:

output = [(topic1,topic2), 
     (topic1,topic2, topic3), 
     (topic3, topic2), 
     (topic4)] 

So, given an input like data, how can I get an output like output? I think itertools might be involved somehow, but I'm really stuck at this point.

+0

Topic2 has elements in common with both Topic1 and Topic3, but Topic1 and Topic3 share no elements and are related only through Topic2. Does that matter? – MeetTitan 2014-12-02 18:14:47

Answers

2

An efficient way to do this is to use sets.

>>> set1= set(['apples', 'oranges']) 
>>> set2= set(['oranges', 'raisins']) 
>>> print(len(set1.intersection(set2)))
1 

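So, basically:

  • Make a set for each topic's text
  • For each topic's set, iterate over the other topics and check the len of the intersection of their text sets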

topic_text_sets = {topic: set(text) for topic, (text, _) in data}
topic_related = {}
for topic1, text1 in topic_text_sets.items():  # use iteritems() on Python 2
    # every other topic whose text set shares at least one item with this one
    related = [topic2 for topic2, text2 in topic_text_sets.items()
               if topic1 != topic2 and len(text1.intersection(text2)) > 0]
    topic_related[topic1] = related
    print(topic1, related)

topic1 ['topic2']
topic2 ['topic1', 'topic3']
topic3 ['topic2']
topic4 []
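Since the question mentions itertools, here is a minimal sketch of the same pairwise check using itertools.combinations, which visits each unordered pair of topics once instead of twice. The name related is an assumption made for illustration; the data is the sample from the question.

```python
from itertools import combinations

data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)),
        ('topic2', (['oranges', 'raisins'], 0.14975108213820515)),
        ('topic3', (['grapes', 'raisins'], 0.14975108213820515)),
        ('topic4', (['trees', 'flowers'], 0.14975108213820515))]

# One set of texts per topic, as in the loop above.
topic_text_sets = {topic: set(texts) for topic, (texts, _) in data}

# 'related' is a hypothetical name: topic -> list of directly connected topics.
related = {topic: [] for topic in topic_text_sets}

# combinations(..., 2) yields each unordered pair of (topic, set) items once.
for (t1, s1), (t2, s2) in combinations(topic_text_sets.items(), 2):
    if not s1.isdisjoint(s2):  # the two topics share at least one text
        related[t1].append(t2)
        related[t2].append(t1)

for topic, others in related.items():
    print(topic, others)
```

This only records direct links, so topic1 and topic3 stay unconnected even though both touch topic2.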
0

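Break it down into sub-problems. First, you need to get all the distinct texts, perhaps with a list comprehension (or a set comprehension to avoid duplicates). Then you need to iterate over them and, for each text, find every entry in data that contains it. You shouldn't need itertools - that would probably overcomplicate things.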
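The sub-problems described above could be sketched roughly as follows. The names all_texts and topics_by_text are assumptions made for illustration, and the final merging step is one possible reading of how to get from "topics per text" to the output the question asks for; the answer itself does not spell it out.

```python
data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)),
        ('topic2', (['oranges', 'raisins'], 0.14975108213820515)),
        ('topic3', (['grapes', 'raisins'], 0.14975108213820515)),
        ('topic4', (['trees', 'flowers'], 0.14975108213820515))]

# Sub-problem 1: collect every distinct text with a set comprehension.
all_texts = {text for _, (texts, _) in data for text in texts}

# Sub-problem 2: for each text, find the topics whose list contains it.
topics_by_text = {
    text: [topic for topic, (texts, _) in data if text in texts]
    for text in all_texts
}

# Assumed final step: for each topic, merge the topic lists of all its texts.
# Sorting makes the order inside each tuple deterministic.
output = []
for topic, (texts, _) in data:
    group = {t for text in texts for t in topics_by_text[text]}
    output.append(tuple(sorted(group)))

print(output)
```

Each tuple includes the topic itself, so an unconnected topic such as topic4 comes out as a one-element tuple.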

2

You'd create a dictionary of sets to capture the connections:

connections = {}
for topic, (conns, some_number) in data:
    for conn in conns:
        connections.setdefault(conn, set()).add(topic)

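This maps each connection value to a set of topics.

Now you can look up the reverse connections; just take the union of the sets for all of a topic's connection values, if order doesn't matter: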

output = [tuple(set().union(*(connections[c] for c in conns))) 
      for topic, (conns, some_number) in data] 

Demo:

>>> data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)), 
...  ('topic2', (['oranges', 'raisins'], 0.14975108213820515)), 
...  ('topic3', (['grapes', 'raisins'], 0.14975108213820515)), 
...  ('topic4', (['trees', 'flowers'], 0.14975108213820515))] 
>>> connections = {} 
>>> for topic, (conns, some_number) in data:
...     for conn in conns:
...         connections.setdefault(conn, set()).add(topic)
... 
>>> [tuple(set().union(*(connections[c] for c in conns))) 
...    for topic, (conns, some_number) in data] 
[('topic1', 'topic2'), ('topic1', 'topic3', 'topic2'), ('topic3', 'topic2'), ('topic4',)] 
>>> from pprint import pprint 
>>> pprint(_) 
[('topic1', 'topic2'),
 ('topic1', 'topic3', 'topic2'),
 ('topic3', 'topic2'),
 ('topic4',)]

To put topic at the front instead, remove it from the set first and then prepend it:

output = [(topic,) + tuple(set().union(*(connections[c] for c in conns)) - {topic}) 
      for topic, (conns, some_number) in data] 

>>> [(topic,) + tuple(set().union(*(connections[c] for c in conns)) - {topic}) 
...    for topic, (conns, some_number) in data] 
[('topic1', 'topic2'), ('topic2', 'topic1', 'topic3'), ('topic3', 'topic2'), ('topic4',)] 
>>> pprint(_) 
[('topic1', 'topic2'),
 ('topic2', 'topic1', 'topic3'),
 ('topic3', 'topic2'),
 ('topic4',)]
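Because sets are unordered, the order of the related topics inside each tuple is not guaranteed and may differ from the demo. A minimal sketch, assuming sorted output is acceptable, wraps the same idea in a function and sorts the related topics; connected_topics is a hypothetical name, and collections.defaultdict stands in for the setdefault call.

```python
from collections import defaultdict


def connected_topics(data):
    """Return one tuple per topic: the topic first, then its related topics, sorted."""
    # Map each connection value (text) to the set of topics that mention it.
    connections = defaultdict(set)
    for topic, (conns, _) in data:
        for conn in conns:
            connections[conn].add(topic)

    output = []
    for topic, (conns, _) in data:
        # Union of all topics sharing any connection value, minus the topic itself.
        others = set().union(*(connections[c] for c in conns)) - {topic}
        output.append((topic,) + tuple(sorted(others)))
    return output


data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)),
        ('topic2', (['oranges', 'raisins'], 0.14975108213820515)),
        ('topic3', (['grapes', 'raisins'], 0.14975108213820515)),
        ('topic4', (['trees', 'flowers'], 0.14975108213820515))]

print(connected_topics(data))
```

Sorting costs a little extra per topic but makes the result reproducible, which helps when comparing output in tests.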
1

A simple pair of nested for loops:

>>> for i in range(len(data)):
...     x = set(data[i][1][0])
...     for j in range(len(data)):
...         # a non-empty intersection means the two topics share a text
...         if len(x & set(data[j][1][0])) >= 1:
...             print(data[j][0], end=' ')   # Python 2: print data[j][0],
...     print()
...
topic1 topic2 
topic1 topic2 topic3 
topic2 topic3 
topic4