2017-06-02

Optimizing a groupby aggregation in pandas

I have a dataset like this:

Type Word 

0 N Work 
1 N Rock 
2 N Rock 
3 Adj Rock 
4 V Rock 
5 N Work 
6 V Work 
7 V Rock 
8 Adj Like 
9 N Rock 
10 V Love 
11 V Like 
12 V Rock 
13 Adj Blue 
14 Adv Work 

I want to count the occurrences of each word and get the top 2 types for each word. The result I expect looks like this:

Word Top Count 

0 Rock N, V 7 
1 Work N, Adv 4 
2 Like Adj, V 2 
3 Blue Adj  1 
4 Love V  1 

I wrote a few lines of code and got the result I expected. Here is my code:

In [1]: 
import pandas as pd 
df = pd.DataFrame([ 
    ['N','Work'], 
    ['N','Rock'], 
    ['N','Rock'], 
    ['Adj','Rock'], 
    ['V','Rock'], 
    ['N','Work'], 
    ['V','Work'], 
    ['V','Rock'], 
    ['Adj','Like'], 
    ['N','Rock'], 
    ['V','Love'], 
    ['V','Like'], 
    ['V','Rock'], 
    ['Adj','Blue'], 
    ['Adv','Work']], columns=['Type', 'Word']) 

In [2]: # Group by "Type" and "Word" and count each pair 
df = df.groupby(["Type", "Word"])["Type"].count().reset_index(name="Count") 

In [3]: 
df 
    Type Word Count 
0 Adj Blue 1 
1 Adj Like 1 
2 Adj Rock 1 
3 Adv Work 1 
4 N Rock 3 
5 N Work 2 
6 V Like 1 
7 V Love 1 
8 V Rock 3 
9 V Work 1 

In [4]: # Sort by "Word" and "Count", then keep the top 2 rows per "Word" 
df1 = df.sort_values(["Word","Count"], ascending=False).groupby("Word").head(2) 
df1 
    Type Word Count 
5 N Work 2 
3 Adv Work 1 
4 N Rock 3 
8 V Rock 3 
7 V Love 1 
1 Adj Like 1 
6 V Like 1 
0 Adj Blue 1 

In [5]: # Group by "Word" and join the "Type" values in each group 
df1 = df1.groupby('Word')['Type'].apply(', '.join).reset_index(name='Top') 
df1 
    Word Top 
0 Blue Adj 
1 Like Adj, V 
2 Love V 
3 Rock N, V 
4 Work N, Adv 

In [6]: #Compute number of each word, save to a new dataframe 
df_sum = df.groupby('Word').sum().reset_index() 
df_sum 
    Word Count 
0 Blue 1 
1 Like 2 
2 Love 1 
3 Rock 7 
4 Work 4 

In [7]: # Merge with the dataframe containing the count of each word 
df1 = df1.merge(df_sum).sort_values("Count", ascending=False) 
df1 
    Word Top  Count 
3 Rock N, V 7 
4 Work N, Adv 4 
1 Like Adj, V 2 
0 Blue Adj  1 
2 Love V  1 

However, this code does not seem optimal: I call groupby several times and sort_values twice. On a really large dataset that would be a problem. Can you optimize it? Thanks.

Answers

df.groupby('Word').agg(dict(
    Type=lambda x: ', '.join(pd.value_counts(x).index[:2]),
    Word='size'
)).rename(columns=dict(Word='Count')).reset_index().sort_values('Count')

    Word Type Count 
0 Blue  Adj  1 
2 Love  V  1 
1 Like V, Adj  2 
4 Work N, V  4 
3 Rock N, V  7 
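A note from me, not the answerer: the top-level `pd.value_counts` used above was later deprecated, and in recent pandas (0.25+) the same single-groupby idea can be written with named aggregation. A minimal sketch, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame([
    ['N', 'Work'], ['N', 'Rock'], ['N', 'Rock'], ['Adj', 'Rock'],
    ['V', 'Rock'], ['N', 'Work'], ['V', 'Work'], ['V', 'Rock'],
    ['Adj', 'Like'], ['N', 'Rock'], ['V', 'Love'], ['V', 'Like'],
    ['V', 'Rock'], ['Adj', 'Blue'], ['Adv', 'Work']],
    columns=['Type', 'Word'])

# One groupby pass: top-2 types via Series.value_counts, total via 'size'
out = (df.groupby('Word')
         .agg(Top=('Type', lambda s: ', '.join(s.value_counts().index[:2])),
              Count=('Type', 'size'))
         .reset_index()
         .sort_values('Count', ascending=False))
print(out)
```

For tied type counts (e.g. N and V both appear 3 times for Rock), `value_counts` does not guarantee which comes first, so the order inside `Top` may vary.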

You can use agg followed by Counter to get the most common types, and len to count the occurrences of each word.

import pandas as pd 
from collections import Counter  

group_df = df.groupby('Word') 
df_summary = group_df.agg(
    lambda x: {'Type': [', '.join([e[0] for e in Counter(x.Type).most_common(2)]), len(x)]} 
) 
df_out = df_summary.Type.apply(pd.Series).reset_index().rename(columns={0: 'Top', 1: 'count'}) 
df_out.sort_values('count', ascending=False) # output 

This gives the output dataframe:

Word Top count 
3 Rock N, V 7 
4 Work N, V 4 
1 Like Adj, V 2 
0 Blue Adj 1 
2 Love V 1
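For very large inputs, the same Counter idea can also be run in a single pass over the rows without any pandas groupby at all. A minimal sketch of mine (the `per_word` name and tuple layout are my own, not from the answer):

```python
from collections import Counter, defaultdict

rows = [('N', 'Work'), ('N', 'Rock'), ('N', 'Rock'), ('Adj', 'Rock'),
        ('V', 'Rock'), ('N', 'Work'), ('V', 'Work'), ('V', 'Rock'),
        ('Adj', 'Like'), ('N', 'Rock'), ('V', 'Love'), ('V', 'Like'),
        ('V', 'Rock'), ('Adj', 'Blue'), ('Adv', 'Work')]

# One pass over the data: a Counter of types per word
per_word = defaultdict(Counter)
for typ, word in rows:
    per_word[word][typ] += 1

# Build (word, top-2 types, total count), sorted by count descending
result = sorted(
    ((word, ', '.join(t for t, _ in c.most_common(2)), sum(c.values()))
     for word, c in per_word.items()),
    key=lambda r: r[2], reverse=True)
print(result)
```

`Counter.most_common` breaks ties by insertion order, so like the pandas versions the second type in a tie is not guaranteed.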