2017-02-02 69 views
2

Pandas: binary encoding of a set of values in a pandas column

I have the following DataFrame my_df:

Name  cards 
------------------ 
John  {A,B} 
Mary  {B,C,A} 
Dan  {D,A} 
Peter  {C,A} 
Ed  {A,C,D} 

and I would like a binary encoding of that set of values, i.e. the output I want looks like:

Name    Card_A  Card_B  Card_C  Card_D
--------------------------------------
John         1       1       0       0
Mary         1       1       1       0
Dan          1       0       0       1
Peter        1       0       1       0
Ed           1       0       1       1

Is there an existing Python package for this? If not, what is the best way to achieve it? Thanks!

Answers

3

If the cards column contains sets:

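import pandas as pd

df = pd.DataFrame({'Name':['John','Mary','Dan','Peter','Ed'], 
        'cards':[set(['A','B']), set(['B','C','A']), 
          set(['D','A']), set(['C','A']), set(['A','C','D'])]}) 

# Count the members of each set (every count is 1), then fill the gaps with 0.
# Series.value_counts replaces the deprecated top-level pd.value_counts.
df[['Name']].join(
    df.cards.apply(
        lambda x: pd.Series(list(x)).value_counts()
    ).fillna(0).astype(int).add_prefix('Card_')
)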



If cards is a str (mostly to show off str.extractall):

Parse it with str.extractall, then apply value_counts.

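# Here df.cards holds strings such as '{A,B}'.
# extractall returns one row per matched card; count the matches per original row.
df[['Name']].join(
    df.cards.str.extractall(r'([^{}, ]+)')[0].groupby(level=0)
      .value_counts().unstack(fill_value=0).add_prefix('Card_')
)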
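
For the string case, a shorter route may be to strip the braces and let str.get_dummies do the splitting. This is only a minimal sketch: it assumes the strings look exactly like '{A,B}', with no spaces after the commas, and df_str is a hypothetical frame built here for illustration.

```python
import pandas as pd

# Hypothetical frame where cards is a plain string, formatted as in the question
df_str = pd.DataFrame({
    'Name': ['John', 'Mary', 'Dan', 'Peter', 'Ed'],
    'cards': ['{A,B}', '{B,C,A}', '{D,A}', '{C,A}', '{A,C,D}'],
})

# Drop the surrounding braces, then split on ',' into one 0/1 column per card
dummies = (df_str['cards'].str.strip('{}')
                          .str.get_dummies(sep=',')
                          .add_prefix('Card_'))

encoded_str = df_str[['Name']].join(dummies)
print(encoded_str)
```

If the real data has spaces after the commas, the separator passed to str.get_dummies would need to match them.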


3

First convert the sets to str and remove the {} with strip.

Then use str.get_dummies.

Finally, add_prefix.

df = pd.DataFrame({'Name':['John','Mary','Dan','Peter','Ed'], 
        'cards':[set(['A','B']), set(['B','C','A']), 
          set(['D','A']), set(['C','A']), set(['A','C','D'])]}) 

print (df) 
    Name  cards 
0 John  {A, B} 
1 Mary {A, C, B} 
2 Dan  {A, D} 
3 Peter  {A, C} 
4  Ed {A, D, C} 

df.cards = df.cards.astype(str).str.strip('{}') 
df = df.set_index('Name').cards.str.get_dummies(', ') 
df.columns = df.columns.str.strip("'") 
df = df.add_prefix('Card_').reset_index() 

print (df) 
    Name  Card_A  Card_B  Card_C  Card_D
0   John       1       1       0       0
1   Mary       1       1       1       0
2    Dan       1       0       0       1
3  Peter       1       0       1       0
4     Ed       1       0       1       1

Another alternative solution:

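def f(category_list): 
    # Map every card in the collection to 1
    n_categories = len(category_list) 
    return pd.Series(dict(zip(category_list, [1]*n_categories))) 

# df is the original frame whose cards column holds sets.
# The parentheses let the method chain span several lines.
df1 = (df.set_index('Name').cards 
         .apply(f) 
         .sort_index(axis=1)   # sets are unordered, so sort the columns
         .add_prefix('Card_') 
         .fillna(0) 
         .astype(int) 
         .reset_index()) 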

print (df1) 
    Name  Card_A  Card_B  Card_C  Card_D
0   John       1       1       0       0
1   Mary       1       1       1       0
2    Dan       1       0       0       1
3  Peter       1       0       1       0
4     Ed       1       0       1       1