2016-08-30 135 views
4

How to group by word and aggregate in a pandas DataFrame

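Basically there are three columns: a click count and an impression count, plus the corresponding phrase. I want to split each phrase into tokens and then sum the clicks and impressions per token, to determine which tokens perform relatively well or badly.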

Expected input: a pandas DataFrame like the following

click_count impression_count text 
1 10   100     pizza 
2 20   200     pizza italian 
3 1   1     italian cheese 

Expected output:

click_count impression_count token 
1 30   300    pizza  // 30 = 20 + 10, 300 = 200+100   
2 21   201    italian // 21 = 20 + 1 
3 1   1     cheese  // cheese only appeared once in italian cheese 
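
For anyone who wants to reproduce the example, the input above can be rebuilt roughly as follows. This is only a sketch: the variable name `df` and the integer index 1 to 3 are assumptions taken from how the table is printed.

```python
import pandas as pd

# Example data copied from the question; the index 1..3 mirrors the row labels shown.
df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)
print(df)
```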

Answers

1
tokens = df.text.str.split(expand=True) 
token_cols = ['token_{}'.format(i) for i in range(tokens.shape[1])] 
tokens.columns = token_cols 

df1 = pd.concat([df.drop('text', axis=1), tokens], axis=1) 
df1 


df2 = pd.lreshape(df1, {'tokens': token_cols}) 
df2 


df2.groupby('tokens').sum() 
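
The same wide-to-long reshaping can also be sketched with `DataFrame.melt` instead of `pd.lreshape`. This is an alternative to the code above, not the answer's own method; the names `wide`, `long_df` and `result` are made up for illustration, and the example data is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

# One column per token position; shorter phrases are padded with missing values.
tokens = df["text"].str.split(expand=True)
wide = pd.concat([df.drop(columns="text"), tokens], axis=1)

# Melt the token columns into a single "token" column and drop the padding rows.
long_df = wide.melt(
    id_vars=["click_count", "impression_count"], value_name="token"
).dropna(subset=["token"])

# Sum the counts for each token.
result = long_df.groupby("token")[["click_count", "impression_count"]].sum()
print(result)
```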


1

This creates a new DataFrame like piRSquared's, but with the tokens stacked and merged with the original:

(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True) 
      .to_frame('token').merge(df, left_index=True, right_index=True) 
      .groupby('token')[['click_count', 'impression_count']].sum()) 
Out: 
     click_count impression_count 
token         
cheese    1     1 
italian   21    201 
pizza    30    300 

If you break it down, it combines this:

df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True).to_frame('token') 
Out: 
    token 
1 pizza 
2 pizza 
2 italian 
3 italian 
3 cheese 

with the original DataFrame, joined on their indexes. The resulting DataFrame is:

(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True) 
      .to_frame('token').merge(df, left_index=True, right_index=True)) 
Out: 
    token click_count impression_count   text 
1 pizza   10    100   pizza 
2 pizza   20    200 pizza italian 
2 italian   20    200 pizza italian 
3 italian   1     1 italian cheese 
3 cheese   1     1 italian cheese 

The rest is just grouping by the token column.
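
On pandas 0.25 or later, the stack-and-merge step can probably be shortened with `DataFrame.explode`. The sketch below is a different technique from the one in this answer; it assumes tokens are separated by whitespace, and the names `exploded` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

# Split each phrase into a list of tokens, then give every token its own row.
# explode() repeats the original index and the other columns for each token.
exploded = df.assign(token=df["text"].str.split()).explode("token")

# Sum only the numeric columns per token.
result = exploded.groupby("token")[["click_count", "impression_count"]].sum()
print(result)
```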

0

You can do

In [3091]: s = df.text.str.split(expand=True).stack().reset_index(drop=True, level=-1) 

In [3092]: df.loc[s.index].assign(token=s).groupby('token',sort=False,as_index=False).sum() 
Out[3092]: 
    token click_count impression_count 
0 pizza   30    300 
1 italian   21    201 
2 cheese   1     1 

Details

In [3093]: df 
Out[3093]: 
    click_count impression_count   text 
1   10    100   pizza 
2   20    200 pizza italian 
3   1     1 italian cheese 

In [3094]: s 
Out[3094]: 
1  pizza 
2  pizza 
2 italian 
3 italian 
3  cheese 
dtype: object
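
A self-contained variant of the session above is sketched below. It selects the two numeric columns before summing, because newer pandas versions may otherwise try to aggregate the `text` column as well. The extra `dropna()` is a precaution in case `stack()` keeps the padding values, and the name `result` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

# One token per row, indexed by the row label of the phrase it came from.
s = (
    df["text"].str.split(expand=True)
    .stack()
    .reset_index(level=-1, drop=True)
    .dropna()
)

# Repeat the numeric columns once per token, attach the tokens by position,
# and sum per token while keeping first-appearance order (sort=False).
result = (
    df.loc[s.index, ["click_count", "impression_count"]]
    .assign(token=s.to_numpy())
    .groupby("token", sort=False, as_index=False)
    .sum()
)
print(result)
```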