2016-11-10 128 views
1

I have a DataFrame like the one below, and I want to sum the amounts:

country letter    keywords     amount 
    c   y  ['fruits', 'apples', "banana"]  700 
    c   y  ["music", "dance", "banana"]   150 
    c   y  ['loud', "dance", "apples"]   350 

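I want to sum the amount associated with each keyword. Note: 'country' and 'letter' are not always the same, unlike in the sample data above. Also, the 'keywords' lists vary in length.

I have tried several solutions; the fastest one is attached below. I also tried solutions with 'apply' and 'defaultdict' ...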

keywords_list = [] 
for i in zip(*[df[c] for c in df.columns]): 
    data = list(i[0:2]) 
    for k in i[2]: 
     row = [k] + data + [i[-1]] 
     keywords_list.append(row) 

df_expanded = pd.DataFrame(keywords_list) 
df_expanded.groupby(list(range(3)))[3].sum().reset_index() 
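
For comparison, the 'defaultdict' idea mentioned above might look roughly like the sketch below. This is an assumption about what such an attempt could be, not the asker's actual code; the names totals and result are made up for illustration. Note that a keyword repeated within a single row would be counted once per occurrence.

```python
from collections import defaultdict

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "country": list("ccc"),
    "letter": list("yyy"),
    "keywords": [["fruits", "apples", "banana"],
                 ["music", "dance", "banana"],
                 ["loud", "dance", "apples"]],
    "amount": [700, 150, 350],
})

# Accumulate the amount per (country, letter, keyword) key
totals = defaultdict(int)
for country, letter, keywords, amount in zip(
        df["country"], df["letter"], df["keywords"], df["amount"]):
    for keyword in keywords:
        totals[(country, letter, keyword)] += amount

# Turn the dict back into a DataFrame, sorted by key
result = pd.DataFrame(
    [key + (total,) for key, total in sorted(totals.items())],
    columns=["country", "letter", "keywords", "amount"],
)
print(result)
```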

Goal

country letter keywords amount 
0  c  y apples 1050 
1  c  y banana  850 
2  c  y dance  500 
3  c  y fruits  700 
4  c  y  loud  350 
5  c  y music  150 

Edit: example data for the goal, after correcting an error in the data:

import pandas as pd

country = list("ccc")
letters = list("yyy") 
keywords = [['fruits', 'apples', "banana"], ["music", "dance", "banana"], ['loud', "dance", "apples"]] 
amount = [700, 150, 350] 

df = pd.DataFrame({"country" : country, "keywords": keywords, "letter" : letters, "amount" : amount}) 
df = df[['country', 'letter', 'keywords', 'amount']] 

Answers

2

You can use:

# Parentheses let the method chain span several lines
df1 = (pd.DataFrame(df.keywords.values.tolist())
         .stack()
         .reset_index(level=1, drop=True)
         .rename('keywords'))
print (df1) 
0 fruits 
0 apples 
0 banana 
1  music 
1  dance 
1 banana 
2  loud 
2  dance 
2 apples 
Name: keywords, dtype: object 

print (df.drop('keywords', axis=1).join(df1).reset_index(drop=True)) 
    country letter amount keywords 
0  c  y  700 fruits 
1  c  y  700 apples 
2  c  y  700 banana 
3  c  y  150 music 
4  c  y  150 dance 
5  c  y  150 banana 
6  c  y  350  loud 
7  c  y  350 dance 
8  c  y  350 apples 

Another solution:

df = df.set_index(['country','letter','amount']) 
df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index) \ 
     .stack() \ 
     .reset_index(name='keywords') \ 
     .drop('level_3',axis=1) 
print (df1) 
    country letter amount keywords 
0  c  y  700 fruits 
1  c  y  700 apples 
2  c  y  700 banana 
3  c  y  150 music 
4  c  y  150 dance 
5  c  y  150 banana 
6  c  y  350  loud 
7  c  y  350 dance 
8  c  y  350 apples 

Then use groupby and aggregate with sum on the expanded frame df1:

print (df1.groupby(['country','letter','keywords'], as_index=False)['amount'].sum())
    country letter keywords amount 
0  c  y apples 1050 
1  c  y banana  850 
2  c  y dance  500 
3  c  y fruits  700 
4  c  y  loud  350 
5  c  y music  150 
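
As an aside, newer pandas versions (0.25 and later) provide DataFrame.explode, which was not available when this answer was written. The sketch below is an alternative technique, not the method used in the answer above. It expands the list column and sums in one chain, using the sample data from the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "country": list("ccc"),
    "letter": list("yyy"),
    "keywords": [["fruits", "apples", "banana"],
                 ["music", "dance", "banana"],
                 ["loud", "dance", "apples"]],
    "amount": [700, 150, 350],
})

# explode() emits one row per list element and repeats the other columns;
# groupby + sum then adds up the amount per (country, letter, keyword)
result = (
    df.explode("keywords")
      .groupby(["country", "letter", "keywords"], as_index=False)["amount"]
      .sum()
)
print(result)
```

Lists of different lengths need no padding here, because each element simply becomes its own row.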

Timings

In [47]: %timeit (df.set_index(['country','letter','amount']).keywords.apply(pd.Series).stack().reset_index().drop('level_3', axis=1))
1 loop, best of 3: 4.55 s per loop 

In [48]: %timeit (jez1(df3)) 
10 loops, best of 3: 24.8 ms per loop 

In [49]: %timeit (jez2(df3)) 
10 loops, best of 3: 29.7 ms per loop 

Code for timings:

df = pd.concat([df]*10000).reset_index(drop=True) 
df3 = df.copy() 
df4 = df.copy()     

def jez1(df): 
    df1 = pd.DataFrame(df.keywords.values.tolist()).stack().reset_index(level=1, drop=True).rename('keywords') 
    return df.drop('keywords', axis=1).join(df1).reset_index(drop=True) 

def jez2(df): 
    df = df.set_index(['country','letter','amount']) 
    df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index).stack().reset_index(name='keywords').drop('level_3',axis=1) 
    return df1 

Thanks to MaxU for the improvement with 'pop'; with it, 'drop' is no longer necessary. Unfortunately the timing fails (KeyError: 'keywords'), so I cannot compare it.
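
The KeyError is presumably because 'pop' removes the column from the frame in place, so any second run on the same frame (as %timeit does) no longer finds 'keywords'. A minimal sketch of the 'pop' variant that works on a copy is shown below. The function name expand_with_pop is hypothetical, and the sketch assumes the frame has a default, unique index.

```python
import pandas as pd

def expand_with_pop(df):
    """Expand the list column 'keywords' into one row per keyword."""
    df = df.copy()  # pop() mutates the frame, so work on a copy
    keywords = (
        pd.DataFrame(df.pop("keywords").values.tolist())
          .stack()
          .reset_index(level=1, drop=True)
          .rename("keywords")
    )
    # After pop() the frame has no 'keywords' column, so join() cannot clash
    return df.join(keywords).reset_index(drop=True)

# Sample data from the question
df = pd.DataFrame({
    "country": list("ccc"),
    "letter": list("yyy"),
    "keywords": [["fruits", "apples", "banana"],
                 ["music", "dance", "banana"],
                 ["loud", "dance", "apples"]],
    "amount": [700, 150, 350],
})

expanded = expand_with_pop(df)
expanded_again = expand_with_pop(df)  # works, the original df is untouched
print(expanded)
```

The copy adds some overhead, so timings of this version are not directly comparable to jez1 and jez2.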

Yes, this solution is better. But I think we can improve it a little more: 'df.join(pd.DataFrame(df.pop('keywords').values.tolist()).stack().reset_index(level=1, drop=True).rename('keywords'))' – MaxU

@MaxU - Thanks for the improvement. 'pop' is rarely used, but it is a good idea here ;) – jezrael

1

Try this:

In [76]: (df.set_index(['country','letter','amount']) 
    ...: .keywords 
    ...: .apply(pd.Series) 
    ...: .stack() 
    ...: .reset_index(name='keywords') 
    ...: .drop('level_3', axis=1)
    ...:) 
    ...: 
Out[76]: 
    country letter amount keywords 
0  c  y  700 fruits 
1  c  y  700 apples 
2  c  y  700 banana 
3  c  y  150 music 
4  c  y  150 dance 
5  c  y  150 banana 
6  c  y  350  loud 
7  c  y  350 dance 
8  c  y  350 apples