2016-03-04 27 views
1

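How to merge lists based on common strings in the list, Python

I have a list whose entries need to be merged, based on strings they share, so that the result fits a given structure. In this case the shared strings are 'date' and 'id', and the result should fit the 'fields' structure.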

Fields: ['date', 'id', 'impressions', 'clicks']

Before:

[('2015-11-01', 'id123', 'impressions', '8'), ('2015-11-01', 'id123', 
'clicks', '4'), ('2015-11-01', 'id456', 'impressions', '14'), 
('2015-11-01', 'id456', 'clicks', '9')] 

After:

[('2015-11-01', 'id123', '8', '4'), ('2015-11-01', 'id456', '14', '9')] 
+0

I can't understand the result; could you put it in other words? – RafaelC

+0

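The resulting list needs to follow the 'fields' structure. Rows are matched on 'date' and 'id'. The words 'impressions' and 'clicks' appear in a fixed order, so it can be assumed that '8' is the impressions value and '4' is the clicks value. – PieCharmed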

Answers

1
>>> L = [('2015-11-01', 'id123', 'impressions', '8'), ('2015-11-01', 'id123', 
... 'clicks', '4'), ('2015-11-01', 'id456', 'impressions', '14'), 
... ('2015-11-01', 'id456', 'clicks', '9')] 
>>> from collections import defaultdict 
>>> D = defaultdict(list) 
>>> for a, b, c, d in L: 
...  D[a, b].append(d) 
... 
>>> [k + tuple(D[k]) for k in D] 
[('2015-11-01', 'id456', '14', '9'), ('2015-11-01', 'id123', '8', '4')] 

For the case where impressions and clicks are not in a consistent order:

>>> L = [('2015-11-01', 'id123', 'impressions', '8'), ('2015-11-01', 'id123', 'clicks', '4'), ('2015-11-01', 'id456', 'clicks', '9'), ('2015-11-01', 'id456', 'impressions', '14')] 
>>> from collections import defaultdict 
>>> D = defaultdict(lambda: [None, None]) 
>>> for a, b, c, d in L: 
...  D[a, b][c == 'clicks'] = d 
... 
>>> [k + tuple(D[k]) for k in D] 
[('2015-11-01', 'id456', '14', '9'), ('2015-11-01', 'id123', '8', '4')] 
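
The `c == 'clicks'` index above works because there are exactly two value columns. As a rough sketch of the same idea for any number of columns, the slot can be looked up from the fields list instead. The names `metrics` and `position` are my own, not from the question, and this assumes every metric name in `L` appears in `fields`:

```python
from collections import defaultdict

fields = ['date', 'id', 'impressions', 'clicks']
metrics = fields[2:]  # value columns, in the order the output should use
position = {name: i for i, name in enumerate(metrics)}  # hypothetical helper

L = [('2015-11-01', 'id123', 'impressions', '8'),
     ('2015-11-01', 'id123', 'clicks', '4'),
     ('2015-11-01', 'id456', 'clicks', '9'),
     ('2015-11-01', 'id456', 'impressions', '14')]

# One slot per metric; a metric that never shows up for a key stays None.
D = defaultdict(lambda: [None] * len(metrics))
for date, id_, metric, value in L:
    D[date, id_][position[metric]] = value

result = [key + tuple(values) for key, values in D.items()]
print(result)
```

On Python 3.7 or later, where dicts keep insertion order, this prints the id123 row first and the id456 row second.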
+0

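This works for the case where 'L' always has 'impressions' followed by 'clicks'. What if 'L' looks like: [('2015-11-01', 'id123', 'impressions', '8'), ('2015-11-01', 'id123', 'clicks', '4'), ('2015-11-01', 'id456', 'clicks', '9'), ('2015-11-01', 'id456', 'impressions', '14')] but the result still has to follow the 'fields' order mentioned above? – PieCharmed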

+0

@PieCharmed, I've added a way to handle that to my answer –

0

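itertools.groupby can work nicely here, especially if the real data matches the sample data (already sorted, so that all rows for a given date/ID pair are adjacent):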

import itertools 
from operator import itemgetter 

outlist = [] 
for (date, ID), grp in itertools.groupby(inlist, key=itemgetter(0, 1)): 
    grp = list(grp) # Iterating twice, so convert to sequence 
    impressioncnt = sum(int(cnt) for _, _, typ, cnt in grp if typ == 'impressions') 
    clickcnt = sum(int(cnt) for _, _, typ, cnt in grp if typ == 'clicks') 
    outlist.append((date, ID, str(impressioncnt), str(clickcnt))) 

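If the data is not already sorted by date and ID, you need to sort inlist first, with inlist.sort(key=itemgetter(0, 1)). That could be expensive if the list is huge, in which case you might consider using collections.defaultdict instead: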

import collections 

dateID_cnts = collections.defaultdict({'impressions': 0, 'clicks': 0}.copy) 
for date, ID, typ, cnt in inlist: 
    dateID_cnts[date, ID][typ] += int(cnt) 

# Convert from defaultdict to desired list of tuples 
outlist = [(date, ID, str(v['impressions']), str(v['clicks'])) for (date, ID), v in dateID_cnts.items()]
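
To try the sort-then-group variant end to end, here is a small self-contained sketch. The shuffled `inlist` is sample data I made up from the question's values, and `sorted()` is used instead of `inlist.sort()` only so the input is left untouched:

```python
import itertools
from operator import itemgetter

# Question's sample values, deliberately out of order
inlist = [('2015-11-01', 'id456', 'clicks', '9'),
          ('2015-11-01', 'id123', 'impressions', '8'),
          ('2015-11-01', 'id456', 'impressions', '14'),
          ('2015-11-01', 'id123', 'clicks', '4')]

key = itemgetter(0, 1)  # group on (date, ID)
outlist = []
# groupby only merges adjacent rows, so sort on the same key first
for (date, ID), grp in itertools.groupby(sorted(inlist, key=key), key=key):
    totals = {'impressions': 0, 'clicks': 0}
    for _, _, typ, cnt in grp:
        totals[typ] += int(cnt)  # single pass over the group
    outlist.append((date, ID, str(totals['impressions']), str(totals['clicks'])))

print(outlist)
# [('2015-11-01', 'id123', '8', '4'), ('2015-11-01', 'id456', '14', '9')]
```

Because of the sort, the output comes back ordered by date and ID regardless of the input order.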
+0

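It looks like each date/ID combination can only have one impressions entry and one clicks entry. If that's the case, you can simplify this a lot –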

+0

@JohnLaRooy: Yeah, I missed that there is only one of each, and that they are strings, not ints. Consider this my "general case" solution? :-) – ShadowRanger

+1

'outlist = [k + (next(grp)[-1], next(grp)[-1]) for k, grp in itertools.groupby(L, key=itemgetter(0, 1))]' –

0

Another way:

data=[('2015-11-01', 'id123', 'impressions', '8'), 
     ('2015-11-01', 'id123','clicks', '4'), 
     ('2015-11-01', 'id456', 'impressions', '14'), 
     ('2015-11-01', 'id456', 'clicks', '9')] 

ddict={} 
for t in data: 
    key=(t[0], t[1]) 
    ddict.setdefault(key, []).append(t[2:]) 

LoT=[]  
for d, id in ddict: 
    impressions, clicks=max(ddict[(d, id)])[1], min(ddict[(d, id)])[1] 
    LoT.append(tuple([d, id, impressions, clicks])) 

>>> LoT 
[('2015-11-01', 'id123', '8', '4'), ('2015-11-01', 'id456', '14', '9')] 

If you can assume that impressions and clicks are already in order, you can drop max and min and replace that line with:

impressions, clicks=ddict[(d, id)][0][1], ddict[(d, id)][1][1] 
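
Note that the max/min version only picks the right values because 'impressions' happens to sort after 'clicks' alphabetically. If you would rather not depend on that, or on input order, one possible variation (a sketch, not part of the answer above) is to store the values by name. Here `id_` is just my rename to avoid shadowing the builtin `id`:

```python
data = [('2015-11-01', 'id123', 'impressions', '8'),
        ('2015-11-01', 'id123', 'clicks', '4'),
        ('2015-11-01', 'id456', 'clicks', '9'),
        ('2015-11-01', 'id456', 'impressions', '14')]

ddict = {}
for d, id_, name, value in data:
    # Inner dict maps metric name -> value, so row order no longer matters
    ddict.setdefault((d, id_), {})[name] = value

# .get() yields None if a metric is missing for some date/ID pair
LoT = [(d, id_, vals.get('impressions'), vals.get('clicks'))
       for (d, id_), vals in ddict.items()]

print(LoT)
```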