Mean of values grouped by another vector (numpy/Python)

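I want to compute the mean of one vector based on grouping information held in another vector. The two vectors have the same length. I have created a minimal example below, based on the average prediction for each user. How can I do this in NumPy?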

>>> pred
[ 0.99  0.23  0.11  0.64  0.45  0.55  0.76  0.72  0.97]
>>> users
['User2' 'User3' 'User2' 'User3' 'User0' 'User1' 'User4' 'User4' 'User4']

Source

2015-03-24 pir

Your two arrays have different lengths... Also, are you looking for a NumPy solution or a pandas solution (the latter would be simpler)? – 2015-03-24 22:24:31

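Sorry, they are the same length now. I would rather stay with NumPy, since I have only just started learning Python and decided to put off pandas for a while. – pir 2015-03-24 22:32:47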

A 'pure numpy' solution can use a combination of np.unique and np.bincount:

import numpy as np 

pred = [0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97] 
users = ['User2', 'User3', 'User2', 'User3', 'User0', 'User1', 'User4', 
     'User4', 'User4'] 

# assign integer indices to each unique user name, and get the total 
# number of occurrences for each name 
unames, idx, counts = np.unique(users, return_inverse=True, return_counts=True) 

# now sum the values of pred corresponding to each index value 
sum_pred = np.bincount(idx, weights=pred) 

# finally, divide by the number of occurrences for each user name 
mean_pred = sum_pred/counts 

print(unames) 
# ['User0' 'User1' 'User2' 'User3' 'User4'] 

print(mean_pred) 
# [ 0.45  0.55  0.55  0.435  0.81666667]
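The same three steps can be wrapped in a small helper. This is only a sketch: the name `group_mean` is hypothetical, and it assumes both inputs are 1-D and of equal length.

```python
import numpy as np

def group_mean(values, groups):
    """Return (sorted unique group labels, mean of values per label).

    Hypothetical helper; assumes 1-D inputs of equal length.
    """
    values = np.asarray(values, dtype=float)
    # inverse[i] is the position of groups[i] in the sorted unique labels
    labels, inverse, counts = np.unique(
        groups, return_inverse=True, return_counts=True)
    # bincount with weights adds up the values that share an index
    sums = np.bincount(inverse, weights=values)
    return labels, sums / counts

pred = [0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97]
users = ['User2', 'User3', 'User2', 'User3', 'User0', 'User1', 'User4',
         'User4', 'User4']

labels, means = group_mean(pred, users)
for name, mean in zip(labels, means):
    print(name, round(mean, 4))
```

For the example data, the intermediate index array is [2 3 2 3 0 1 4 4 4] and the counts are [1 1 2 2 3], so each sum is divided by the number of entries for that user.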

If you have pandas installed, DataFrames have some very nice methods for grouping and summarizing data:

import pandas as pd 

df = pd.DataFrame({'name':users, 'pred':pred}) 

print(df.groupby('name').mean()) 
#            pred
# name
# User0  0.450000
# User1  0.550000
# User2  0.550000
# User3  0.435000
# User4  0.816667
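If plain arrays are wanted back from pandas, one option is to select the column before aggregating, which gives a Series indexed by user name. A minimal sketch, assuming a pandas version that provides `to_numpy()`; the variable names are illustrative only:

```python
import pandas as pd

pred = [0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97]
users = ['User2', 'User3', 'User2', 'User3', 'User0', 'User1', 'User4',
         'User4', 'User4']
df = pd.DataFrame({'name': users, 'pred': pred})

# Selecting 'pred' first yields a Series rather than a DataFrame
mean_by_user = df.groupby('name')['pred'].mean()

# Group labels and means as NumPy arrays (groupby sorts labels by default)
names = mean_by_user.index.to_numpy()
means = mean_by_user.to_numpy()

# Look up the mean for a single user by label
print(mean_by_user['User3'])
```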

Source

2015-03-24 22:59:06

Please see the edit in my original post. – pir 2015-03-25 12:15:11

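I don't really understand what you mean by *"unique label for each user"* - in your example, 'User2' seems to have corresponding 'label' values of both 0 and 1. Also, you should post follow-up questions separately (you can include a link to the original question to provide context). – 2015-03-25 13:11:52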

OK, I will do that. Thanks. – pir 2015-03-25 13:13:16

If you want to stick with numpy, the simplest approach is to use np.unique and np.bincount:

>>> import numpy as np
>>> pred = np.array([0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97])
>>> users = np.array(['User2', 'User3', 'User2', 'User3', 'User0', 'User1', 
...     'User4', 'User4', 'User4']) 
>>> unq, idx, cnt = np.unique(users, return_inverse=True, return_counts=True) 
>>> avg = np.bincount(idx, weights=pred)/cnt 
>>> unq 
array(['User0', 'User1', 'User2', 'User3', 'User4'], 
     dtype='|S5') 
>>> avg 
array([ 0.45  , 0.55  , 0.55  , 0.435  , 0.81666667])
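A possible follow-up step, sketched here as an assumption rather than something stated in the answer: because `idx` maps every element to its group, indexing `avg` with `idx` gives each prediction its own user's mean. The names `per_element_avg` and `deviation` are illustrative only.

```python
import numpy as np

pred = np.array([0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97])
users = np.array(['User2', 'User3', 'User2', 'User3', 'User0', 'User1',
                  'User4', 'User4', 'User4'])

unq, idx, cnt = np.unique(users, return_inverse=True, return_counts=True)
avg = np.bincount(idx, weights=pred) / cnt

# Fancy indexing repeats each group mean once per member of that group,
# so the result has the same length as pred
per_element_avg = avg[idx]

# Example use: how far each prediction lies from its user's mean
deviation = pred - per_element_avg
print(per_element_avg)
```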

Source

2015-03-24 22:54:16 Jaime

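A compact solution is to use numpy_indexed (disclaimer: I am its author), which implements a vectorized solution similar to the one proposed by Jaime, but with a cleaner interface and more tests: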

import numpy_indexed as npi 
npi.group_by(users).mean(pred)

Source

2016-04-02 13:30:33
