Optimizing groupby aggregation in pandas
I have a dataset like this:
Type Word
0 N Work
1 N Rock
2 N Rock
3 Adj Rock
4 V Rock
5 N Work
6 V Work
7 V Rock
8 Adj Like
9 N Rock
10 V Love
11 V Like
12 V Rock
13 Adj Blue
14 Adv Work
I want to count the occurrences of each word and get the top 2 types for each word. The result I expect looks like this:
Word Top Count
0 Rock N, V 7
1 Work N, Adv 4
2 Like Adj, V 2
3 Blue Adj 1
4 Love V 1
I wrote a few lines of code and got the result I expected. Here is my code:
In [1]:
import pandas as pd
df = pd.DataFrame([
['N','Work'],
['N','Rock'],
['N','Rock'],
['Adj','Rock'],
['V','Rock'],
['N','Work'],
['V','Work'],
['V','Rock'],
['Adj','Like'],
['N','Rock'],
['V','Love'],
['V','Like'],
['V','Rock'],
['Adj','Blue'],
['Adv','Work']], columns=['Type', 'Word'])
In [2]: #Group by column "Word","Type" and count number of each pair
df = df.groupby(["Type", "Word"])["Type"].count().reset_index(name="Count")
In [3]:
df
Type Word Count
0 Adj Blue 1
1 Adj Like 1
2 Adj Rock 1
3 Adv Work 1
4 N Rock 3
5 N Work 2
6 V Like 1
7 V Love 1
8 V Rock 3
9 V Work 1
In [4]: #Group by "Word" and sort by "Count" in each group, get top 2
df1 = df.sort_values(["Word","Count"], ascending=False).groupby("Word").head(2)
df1
Type Word Count
5 N Work 2
3 Adv Work 1
4 N Rock 3
8 V Rock 3
7 V Love 1
1 Adj Like 1
6 V Like 1
0 Adj Blue 1
In [5]: #Groupby "Word" and union "Type" in each group
df1 = df1.groupby('Word')['Type'].apply(lambda x: "%s" % ', '.join(x)).reset_index(name='Top')
df1
Word Top
0 Blue Adj
1 Like Adj, V
2 Love V
3 Rock N, V
4 Work N, Adv
In [6]: #Compute number of each word, save to a new dataframe
df_sum = df.groupby('Word')['Count'].sum().reset_index()
df_sum
Word Count
0 Blue 1
1 Like 2
2 Love 1
3 Rock 7
4 Work 4
In [7]: #Merge to dataframe containing number of each word
df1 = df1.merge(df_sum).sort_values("Count", ascending=False)
df1
Word Top Count
3 Rock N, V 7
4 Work N, Adv 4
1 Like Adj, V 2
0 Blue Adj 1
2 Love V 1
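However, this code does not seem optimal. I call groupby many times and use sort_values twice. If the real dataset is large, this will become a problem. Can you optimize it? Thanks.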
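To make the goal concrete, here is a rough sketch of the kind of shorter pipeline I have in mind: a single groupby on "Word" that produces both the top types and the total count. The helper name `top_types` and its parameter `n` are made up for illustration, and I have not measured whether this is actually faster on large data, since it still calls a Python function once per group. The order of tied types (for example "N" and "V" for Rock) may also differ from my output above.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    "Type": ["N", "N", "N", "Adj", "V", "N", "V", "V",
             "Adj", "N", "V", "V", "V", "Adj", "Adv"],
    "Word": ["Work", "Rock", "Rock", "Rock", "Rock", "Work", "Work", "Rock",
             "Like", "Rock", "Love", "Like", "Rock", "Blue", "Work"],
})

def top_types(types, n=2):
    """Hypothetical helper: join the n most frequent types of one word."""
    # value_counts() orders the types by frequency, most frequent first
    return ", ".join(types.value_counts().index[:n])

result = (
    df.groupby("Word")["Type"]
      .agg(Top=top_types, Count="size")      # one groupby, two aggregations
      .sort_values("Count", ascending=False) # one sort, on the final counts
      .reset_index()
)
print(result)
```

This replaces the separate count, top-2, and merge steps with one named aggregation, at the cost of running `value_counts` inside every group.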