2016-11-14 182 views
2

Create a bigram column in a pandas DataFrame

I have this test table in a pandas DataFrame:

Leaf_category_id session_id product_id 
0    111   1   987 
3    111   4   987 
4    111   1   741 
1    222   2   654 
2    333   3   321 

This is an extension of my previous question, which was answered by @jezrael.

So let the values in the product_id column be as follows (just a hypothetical, slightly different from the output of my previous question):

|product_id    | 
    --------------------------- 
    |111,987,741,34,12  | 
    |987,1232     | 
    |654,12,324,465,342,324 | 
    |321,741,987    | 
    |324,654,862,467,243,754 | 
    |6453,123,987,741,34,12 | 

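and so on. I want to create a new column in which every value in a row is paired with the value that follows it, and the last value, which has no successor, is paired with the first one. For example: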

|product_id    |Bigram 
    ------------------------------------------------------------------------- 
    |111,987,741,34,12  |(111,987),**(987,741)**,(741,34),(34,12),(12,111) 
    |987,1232     |(987,1232),(1232,987) 
    |654,12,324,465,342,32 |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654) 
    |321,741,987    |(321,741),**(741,987)**,(987,321) 
    |324,654,862    |(324,654),(654,862),(862,324) 
    |123,987,741,34,12  |(123,987),(987,741),(34,12),(12,123) 
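
The wrap-around pairing shown in the table can be sketched roughly as below. This is only an illustration: the helper name `cyclic_bigrams` and the small sample frame are made up here, and it assumes product_id holds comma-separated strings as displayed in the table.

```python
import pandas as pd

def cyclic_bigrams(ids):
    """Pair each item with its successor; the last item wraps around to the first."""
    if len(ids) < 2:
        return []
    # ids[1:] + ids[:1] is the list rotated left by one position
    return list(zip(ids, ids[1:] + ids[:1]))

# Hypothetical frame: product_id stored as comma-separated strings (an assumption)
df = pd.DataFrame({"product_id": ["111,987,741,34,12", "987,1232", "321,741,987"]})

df["Bigram"] = df["product_id"].apply(
    lambda s: cyclic_bigrams([int(p) for p in s.split(",")])
)
print(df["Bigram"].tolist())
```

If product_id already holds Python lists, the `split` step can be dropped and `cyclic_bigrams` applied directly.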

Ignore the ** for now (I'll explain later why I starred those entries).

The code I use to get the bigrams is:

for i in df.Leaf_category_id.unique(): 
    print (df[df.Leaf_category_id == i].groupby('session_id')['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index()) 

From this df, I want to take the Bigram column and create one more column named frequency, which gives how often each bigram occurs.

Note: (987,741) and (741,987) are to be considered the same, so the duplicate entry should be removed and the frequency of (987,741) should be 2. The same applies to (34,12): it occurs twice, so its frequency should be 2.

|Bigram 
    --------------- 
    |(111,987), 
    |**(987,741)** 
    |(741,34) 
    |(34,12) 
    |(12,111) 
    |**(741,987)** 
    |(987,321) 
    |(34,12) 
    |(12,123) 

The final result should be:

|Bigram  | frequency | 
    -------------------------- 
    |(111,987) | 1 
    |(987,741) | 2 
    |(741,34)  | 1 
    |(34,12)  | 2 
    |(12,111)  | 1 
    |(987,321) | 1 
    |(12,123)  | 1 
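
One possible way to get this table from the bigram list is sketched below. It treats a pair as unordered by using the sorted tuple as the counting key, while remembering the orientation in which each pair first appeared so the output matches the table. The variable names are illustrative and the bigram list is copied from the question.

```python
from collections import Counter
import pandas as pd

# Bigrams copied from the question's list; order inside a pair should not matter
bigrams = [(111, 987), (987, 741), (741, 34), (34, 12), (12, 111),
           (741, 987), (987, 321), (34, 12), (12, 123)]

counts = Counter()
first_seen = {}  # canonical (sorted) key -> pair as it first appeared
for pair in bigrams:
    key = tuple(sorted(pair))
    counts[key] += 1
    first_seen.setdefault(key, pair)

freq = pd.DataFrame({
    "Bigram": [first_seen[k] for k in counts],
    "frequency": list(counts.values()),
})
print(freq)
```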

I hope to find an answer here. Please help me; I have explained it in as much detail as I can.

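How do you want the frequency? In a single row, the Bigram column will contain multiple tuples, so there will be multiple frequencies. – James

@James: Each tuple in a row should become a new row, as shown in the second-to-last table. Then, if there are duplicates as I mentioned, the frequency should change accordingly. – Shubham

So 'Bigram' and 'frequency' are in a separate data frame? – James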

Answers

2

Try this code:

from itertools import combinations 
import pandas as pd 

# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it
df = pd.read_csv("data.csv", index_col=0)
# consecutive
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in zip(x,x[1:])]).reset_index() 

df1=pd.DataFrame(grouped_consecutive_product_ids) 
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack() 
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna() 
df2.rename(columns = {0:'Bigram'}, inplace = True) 
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count') 
bigram_frequency_consecutive = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index() 
del bigram_frequency_consecutive["index"] 
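
On newer pandas versions (0.25 or later, where `Series.explode` is available), the unstack / dropna / transform steps above could probably be shortened. The sketch below is an alternative, not the answerer's method; the sample frame and the helper `consecutive_pairs` are made up for illustration.

```python
from collections import Counter
import pandas as pd

# Hypothetical sample with the same columns as the question's frame
df = pd.DataFrame({
    "Leaf_category_id": [111, 111, 111, 222, 222, 222],
    "session_id": [1, 1, 1, 1, 1, 1],
    "product_id": [987, 741, 34, 741, 987, 12],
})

def consecutive_pairs(series):
    """Sorted consecutive pairs, so (987, 741) and (741, 987) share one key."""
    vals = series.tolist()
    return [tuple(sorted(p)) for p in zip(vals, vals[1:])]

pairs = (
    df.groupby(["Leaf_category_id", "session_id"])["product_id"]
      .apply(consecutive_pairs)
      .explode()   # one bigram per row
      .dropna()    # groups with fewer than two products yield no bigram
)

counts = Counter(pairs.tolist())
rows = sorted(counts.items())  # sort by bigram, like sort_values("Bigram")
result = pd.DataFrame({
    "Bigram": [bigram for bigram, _ in rows],
    "freq": [n for _, n in rows],
})
print(result)
```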

For combinations (all possible bigrams):

from itertools import combinations 
import pandas as pd 

# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it
df = pd.read_csv("data.csv", index_col=0)
# combinations
grouped_combination_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x,2)]).reset_index() 

df1=pd.DataFrame(grouped_combination_product_ids) 
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack() 
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna() 
df2.rename(columns = {0:'Bigram'}, inplace = True) 
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count') 
bigram_frequency_combinations = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index() 
del bigram_frequency_combinations["index"] 

where data.csv contains:

Leaf_category_id,session_id,product_id 
0,111,1,111 
3,111,4,987 
4,111,1,741 
1,222,2,654 
2,333,3,321 
5,111,1,87 
6,111,1,34 
7,111,1,12 
8,111,1,987 
9,111,4,1232 
10,222,2,12 
11,222,2,324 
12,222,2,465 
13,222,2,342 
14,222,2,32 
15,333,3,321 
16,333,3,741 
17,333,3,987 
18,333,3,324 
19,333,3,654 
20,333,3,862 
21,222,1,123 
22,222,1,987 
23,222,1,741 
24,222,1,34 
25,222,1,12 

The resulting bigram_frequency_consecutive will be:

  Bigram freq 
0  (12, 34)  2 
1  (12, 324)  1 
2  (12, 654)  1 
3  (12, 987)  1 
4  (32, 342)  1 
5  (34, 87)  1 
6  (34, 741)  1 
7  (87, 741)  1 
8 (111, 741)  1 
9 (123, 987)  1 
10 (321, 321)  1 
11 (321, 741)  1 
12 (324, 465)  1 
13 (324, 654)  1 
14 (324, 987)  1 
15 (342, 465)  1 
16 (654, 862)  1 
17 (741, 987)  2 
18 (987, 1232)  1 

The resulting bigram_frequency_combinations will be:

  Bigram freq 
0  (12, 32)  1 
1  (12, 34)  2 
2  (12, 87)  1 
3  (12, 111)  1 
4  (12, 123)  1 
5  (12, 324)  1 
6  (12, 342)  1 
7  (12, 465)  1 
8  (12, 654)  1 
9  (12, 741)  2 
10 (12, 987)  2 
11 (32, 324)  1 
12 (32, 342)  1 
13 (32, 465)  1 
14 (32, 654)  1 
15  (34, 87)  1 
16 (34, 111)  1 
17 (34, 123)  1 
18 (34, 741)  2 
19 (34, 987)  2 
20 (87, 111)  1 
21 (87, 741)  1 
22 (87, 987)  1 
23 (111, 741)  1 
24 (111, 987)  1 
25 (123, 741)  1 
26 (123, 987)  1 
27 (321, 321)  1 
28 (321, 324)  2 
29 (321, 654)  2 
30 (321, 741)  2 
31 (321, 862)  2 
32 (321, 987)  2 
33 (324, 342)  1 
34 (324, 465)  1 
35 (324, 654)  2 
36 (324, 741)  1 
37 (324, 862)  1 
38 (324, 987)  1 
39 (342, 465)  1 
40 (342, 654)  1 
41 (465, 654)  1 
42 (654, 741)  1 
43 (654, 862)  1 
44 (654, 987)  1 
45 (741, 862)  1 
46 (741, 987)  3 
47 (862, 987)  1 
48 (987, 1232)  1 

In the cases above, it groups by both columns (Leaf_category_id and session_id).

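Very nice answer, +1 – jezrael

@Mr. What is the difference between bigram_frequency_consecutive and bigram_frequency_combinations? – Shubham

In 'bigram_frequency_consecutive', if a group has the product IDs '[27,35,99]', you get the bigrams '[(27,35),(35,99)]', whereas the bigrams formed by combinations are '[(27,35),(27,99),(35,99)]'. If you are doing any kind of product-purchase analysis, you should use the combination bigrams. Because I don't know your exact use case, I have given both solutions: the first follows the code snippet you provided, and the second is the one most likely needed. –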
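
The difference described in the comment above can be checked with a few lines of plain Python, using the same example IDs:

```python
from itertools import combinations

ids = [27, 35, 99]

# Consecutive bigrams: each item paired only with its immediate successor
consecutive = list(zip(ids, ids[1:]))

# Combination bigrams: every unordered pair of two items in the group
all_pairs = list(combinations(ids, 2))

print(consecutive)  # [(27, 35), (35, 99)]
print(all_pairs)    # [(27, 35), (27, 99), (35, 99)]
```

For a group of n products, the first yields n - 1 pairs and the second n(n - 1)/2, so the combination table grows much faster with group size.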

1

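We will extract the values from product_id, create the bigrams, sort each one so that duplicates collapse, count them to get the frequencies, and then populate a data frame.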

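from collections import Counter
import pandas as pd

# assuming your data frame is called 'df' and each product_id entry is a list of IDs

bigrams = [list(zip(x, x[1:])) for x in df.product_id.values.tolist()]
# sort each pair so (987, 741) and (741, 987) count as the same bigram
bigram_set = [tuple(sorted(xx)) for x in bigrams for xx in x]
freq_dict = Counter(bigram_set)
# items() yields (bigram, count) pairs; iterating the Counter alone gives only the keys
df_freq = pd.DataFrame(list(freq_dict.items()), columns=['bigram', 'freq'])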

When I run **freq_dict = Counter(bigram_set)** I get this error: **unhashable type: 'list'** – Shubham

The 'tuple' function should take care of that. – James

type(bigram_set) = list. – Shubham
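
The error discussed in these comments comes from the elements being counted, not from the outer container: `Counter` needs hashable items, and a list is not hashable while a tuple of integers is. A small sketch with made-up pairs:

```python
from collections import Counter

# Pairs stored as lists cannot be counted, because lists are unhashable
pairs = [[987, 741], [741, 987], [34, 12]]

try:
    Counter(pairs)
    message = None
except TypeError as exc:
    message = str(exc)  # mentions "unhashable type: 'list'"

# Converting each pair to a sorted tuple makes it hashable and order-independent
counts = Counter(tuple(sorted(p)) for p in pairs)
print(message)
print(counts)
```

So `type(bigram_set)` being `list` is fine as long as every element inside it is a tuple.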