Counting the frequency of unique combinations in market baskets

I have a set of 1,000,000 market baskets, each containing 1-4 items. I want to count how often each unique combination is purchased.

The data is organized like this:

[in] print(training_df.head(n=5)) 

[out]      product_id 
transaction_id      
0000001     [P06, P09] 
0000002   [P01, P05, P06, P09] 
0000003     [P01, P06] 
0000004     [P01, P09] 
0000005     [P06, P09]

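In this example, [P06, P09] has a frequency of 2 and every other combination has a frequency of 1. I have created a binary matrix and computed the frequency of each individual item as follows: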

# Create a binary (one-hot) matrix for the transactions
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

product_ids = ['P{:02d}'.format(i + 1) for i in range(10)]

mlb = MultiLabelBinarizer(classes=product_ids)
# drop(columns=...) replaces the positional axis argument, which newer pandas versions reject
training_df1 = training_df.drop(columns='product_id').join(
    pd.DataFrame(mlb.fit_transform(training_df['product_id']),
                 columns=mlb.classes_,
                 index=training_df.index))

# Calculate the support count for each product (frequency)
train_product_support = {}
for column in training_df1.columns:
    train_product_support[column] = sum(training_df1[column] > 0)

How can I calculate the frequency of every unique combination of 1-4 items that occurs in the data?

Source

2017-08-01 zsad512

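Well, since you can't use df.groupby('product_id').count() (the lists in that column are not hashable), this is the best I could come up with. We use the string representation of each list as a dictionary key and count how many times it occurs.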

counts = dict()
for i in df['product_id']:
    key = repr(i)  # string form of the list, usable as a dict key
    if key in counts:
        counts[key] += 1
    else:
        counts[key] = 1

Source

2017-08-01 20:17:38 jacoblaw

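This is how I would solve it too, but I'd guess that order doesn't matter. So I would throw in a 'key = sorted(key)' so that any permutation of the same items maps to the same key –

'defaultdict' might be a better fit here: https://docs.python.org/3/library/collections.html collections.defaultdict – dashiell

You might also want a 'frozenset' rather than a 'str': https://docs.python.org/3/library/stdtypes.html#frozenset – dashiell

Perhaps: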
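Putting the two suggestions above together, a minimal sketch might look like the following. It assumes the question's layout (a product_id column holding lists); the sample DataFrame is made up for illustration, and its last basket is deliberately reversed to show that item order is ignored.

```python
from collections import defaultdict

import pandas as pd

# Illustrative sample data only; the real frame has about 1,000,000 rows.
df = pd.DataFrame({
    "product_id": [
        ["P06", "P09"],
        ["P01", "P05", "P06", "P09"],
        ["P01", "P06"],
        ["P01", "P09"],
        ["P09", "P06"],  # same items as the first basket, different order
    ]
})

# defaultdict(int) starts every missing key at 0, so no "if key in counts" check is needed.
counts = defaultdict(int)
for basket in df["product_id"]:
    # frozenset is hashable and order-insensitive, so permutations share one key.
    counts[frozenset(basket)] += 1

for combo, n in counts.items():
    print(sorted(combo), n)
```

Note that a frozenset also collapses repeated items within one basket, which may or may not be what you want.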

df['frozensets'] = df.apply(lambda row: frozenset(row.product_id),axis=1) 
df['frozensets'].value_counts()

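This creates frozensets from the product_id column (they are hashable and ignore ordering), then counts the number of occurrences of each unique value.

Source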

2017-08-01 20:42:51 dashiell

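This sorts the data from highest to lowest frequency (by unique combination). How can I further sort the unique combinations by the number of items in each combination? – zsad512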
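One possible way to approach the follow-up above, sketched under the assumption that "sort by number of items" means smallest combinations first and, within each size, most frequent first. The sample data and the names summary, size, and count are illustrative and do not come from the original posts.

```python
import pandas as pd

# Illustrative sample data only, mirroring the layout in the question.
df = pd.DataFrame({
    "product_id": [
        ["P06", "P09"],
        ["P01", "P05", "P06", "P09"],
        ["P01", "P06"],
        ["P01", "P09"],
        ["P06", "P09"],
    ]
})

# Count each unique combination, ignoring item order.
combo_counts = df["product_id"].map(frozenset).value_counts()

# Build a small table with the combination, its size, and its frequency.
summary = pd.DataFrame({
    "combination": [tuple(sorted(c)) for c in combo_counts.index],
    "size": [len(c) for c in combo_counts.index],
    "count": combo_counts.to_numpy(),
})

# Sort by combination size (ascending), then by frequency (descending).
summary = summary.sort_values(
    ["size", "count"], ascending=[True, False]
).reset_index(drop=True)

print(summary)
```

Swapping the order of the two sort keys would instead rank by frequency first and use size only to break ties.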
