2016-09-14 56 views
4

Pandas: how to get the unique values of a column that contains lists of values?

Consider the following DataFrame:

import pandas as pd

df = pd.DataFrame({'name': [['one two', 'three four'], ['one'], [], [], ['one two'], ['three']],
                   'col': ['A', 'B', 'A', 'B', 'A', 'B']})
df.sort_values(by='col', inplace=True)

df 
Out[62]: 
   col                   name
0    A  [one two, three four]
2    A                     []
4    A              [one two]
1    B                  [one]
3    B                     []
5    B                [three]

I want to get a column that keeps track of all the unique strings included in name for each group of col.

That is, the expected output is

df 
Out[62]: 
   col                   name            unique_list
0    A  [one two, three four]  [one two, three four]
2    A                     []  [one two, three four]
4    A              [one two]  [one two, three four]
1    B                  [one]           [one, three]
3    B                     []           [one, three]
5    B                [three]           [one, three]

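Indeed, for group A, say, you can see that the unique strings contained in [one two, three four], [] and [one two] are [one two, three four].

I am able to get the corresponding number of unique values using the approach from "Pandas : how to get the unique number of values in cells when cells contain lists?":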

df['count_unique']=df.groupby('col')['name'].transform(lambda x: list(pd.Series(x.apply(pd.Series).stack().reset_index(drop=True, level=1).nunique()))) 


df 
Out[65]: 
   col                   name  count_unique
0    A  [one two, three four]             2
2    A                     []             2
4    A              [one two]             2
1    B                  [one]             2
3    B                     []             2
5    B                [three]             2

However, replacing nunique with unique in the code above fails.

Any ideas? Thanks!

Answers

2

Here is a solution:

import numpy as np

df['unique_list'] = df.col.map(df.groupby('col')['name'].sum().apply(np.unique))
df


+0

Interesting. 'sum' on strings?! –

+1

@Noobie It's worse than that. It's a sum on lists. It produces one concatenated list, and I apply np.unique to that concatenated list – piRSquared
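
The concatenate-then-deduplicate idea described in this comment can be sketched on its own. This is only an illustration, not the answerer's exact code: it flattens with itertools.chain rather than sum, and the helper name unique_strings is made up for the example.

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': [['one two', 'three four'], ['one'], [], [], ['one two'], ['three']],
    'col': ['A', 'B', 'A', 'B', 'A', 'B'],
})

def unique_strings(lists):
    """Flatten one group's lists into a single list, then keep the sorted unique strings."""
    flat = list(chain.from_iterable(lists))
    return np.unique(flat).tolist()

# One list of unique strings per value of 'col' (hypothetical helper applied per group).
per_group = df.groupby('col')['name'].apply(unique_strings)

# Broadcast each group's result back onto the rows of that group.
df['unique_list'] = df['col'].map(per_group)
```

Flattening explicitly makes the intermediate step visible: each group first becomes one long list, and only then are duplicates dropped.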

+0

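Hehehe. I just tried it, but it seems your nice solution fails when there are missing values in col. In that case I get 'TypeError: can only concatenate list (not "int") to list'. Replacing the missing values with fillna('') or fillna('[]') does not work. Any ideas? –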
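
One possible workaround, sketched under an assumption: the comment does not show the failing data, so this guesses that the missing values are NaN entries sitting in the name column. The idea is to turn every non-list entry into an empty list before aggregating. The variable names cleaned and per_group are made up for the example.

```python
import numpy as np
import pandas as pd

# Assumed shape of the problem: one 'name' cell is NaN instead of a list.
df = pd.DataFrame({
    'name': [['one two', 'three four'], ['one'], np.nan, [], ['one two'], ['three']],
    'col': ['A', 'B', 'A', 'B', 'A', 'B'],
})

# Hypothetical cleanup step: anything that is not a list becomes an empty list.
cleaned = df['name'].apply(lambda v: v if isinstance(v, list) else [])

# Collect the unique strings per group with a set, sorted for a stable order.
per_group = cleaned.groupby(df['col']).apply(
    lambda lists: sorted({s for lst in lists for s in lst})
)

df['unique_list'] = df['col'].map(per_group)
```

An element-wise apply is used for the cleanup because the replacement value is itself a list; the string '[]' tried in the comment is a different object from an empty list.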

2

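Try:

from functools import reduce  # reduce is not a builtin in Python 3

# concatenate the lists of each group, then drop duplicates with set
uniq_df = df.groupby('col')['name'].apply(lambda x: list(set(reduce(lambda y, z: y + z, x)))).reset_index()
uniq_df.columns = ['col', 'uniq_list']
pd.merge(df, uniq_df, on='col', how='left')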

Desired output:

   col                   name              uniq_list
0    A  [one two, three four]  [one two, three four]
1    A                     []  [one two, three four]
2    A              [one two]  [one two, three four]
3    B                  [one]           [three, one]
4    B                     []           [three, one]
5    B                [three]           [three, one]
+0

Thanks @abdou! Let me try it –