2017-07-31

Pandas - converting a categorical column to its binary-encoded form

I have a dataset that looks like this:

    yyyy      month  tmax  tmin
0   1908    January   5.0  -1.4
1   1908   February   7.3   1.9
2   1908      March   6.2   0.3
3   1908      April   7.4   2.1
4   1908        May  16.5   7.7
5   1908       June  17.7   8.7
6   1908       July  20.1  11.0
7   1908     August  17.5   9.7
8   1908  September  16.3   8.4
9   1908    October  14.6   8.0
10  1908   November   9.6   3.4
11  1908   December   5.8  -0.3
12  1909    January   5.0   0.1
13  1909   February   5.5  -0.3
14  1909      March   5.6  -0.3
15  1909      April  12.2   3.3
16  1909        May  14.7   4.8
17  1909       June  15.0   7.5
18  1909       July  17.3  10.8
19  1909     August  18.8  10.7
20  1909  September  14.5   8.1
21  1909    October  12.9   6.9
22  1909   November   7.5   1.7
23  1909   December   5.3   0.4
24  1910    January   5.2  -0.5
...

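It has four variables: yyyy, month, tmax (maximum temperature) and tmin.

I want to use the month column as a predictor, so I need to convert it to its binary-encoded form. Essentially, I want to add 12 columns to the dataset, named January through December. If a row's month is "January", its January column should be 1 and the other 11 new columns should be 0.

I looked at pivot tables, but they did not do what I need. Any ideas on how to do this in a simple, elegant way?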

Answers

5

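I think you need get_dummies:

# stored under a separate name so that df keeps its other columns
dummies = pd.get_dummies(df['month'])

To add the new columns to the original DataFrame and drop month, use pop together with join:

# pop removes month from df in place and returns it
df2 = df.join(pd.get_dummies(df.pop('month')))
print(df2.head())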
    yyyy tmax tmin April August December February January July June \ 
0 1908 5.0 -1.4  0  0   0   0  1  0  0 
1 1908 7.3 1.9  0  0   0   1  0  0  0 
2 1908 6.2 0.3  0  0   0   0  0  0  0 
3 1908 7.4 2.1  1  0   0   0  0  0  0 
4 1908 16.5 7.7  0  0   0   0  0  0  0 

    March May November October September 
0  0 0   0  0   0 
1  0 0   0  0   0 
2  1 0   0  0   0 
3  0 0   0  0   0 
4  0 1   0  0   0 
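
For anyone who wants to run the pop/join snippet without the original file, here is a minimal sketch. The DataFrame construction is an assumption (the question does not show how df was loaded); it only copies the first five rows of the sample data. dtype=int is passed because pandas 2.0 and later return boolean dummy columns by default, while the output above shows 0/1.

```python
import pandas as pd

# Hypothetical reconstruction of the first five rows of the question's data.
df = pd.DataFrame({
    'yyyy': [1908] * 5,
    'month': ['January', 'February', 'March', 'April', 'May'],
    'tmax': [5.0, 7.3, 6.2, 7.4, 16.5],
    'tmin': [-1.4, 1.9, 0.3, 2.1, 7.7],
})

# pop() removes 'month' from df in place and returns it;
# dtype=int keeps 0/1 integers instead of True/False.
df2 = df.join(pd.get_dummies(df.pop('month'), dtype=int))
print(df2.head())
```

Only the months present in the data get a column, and the columns come out in alphabetical order.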

If you do not need to remove the month column:

df2 = df.join(pd.get_dummies(df['month']))
print(df2.head())
    yyyy  month tmax tmin April August December February January \ 
0 1908 January 5.0 -1.4  0  0   0   0  1 
1 1908 February 7.3 1.9  0  0   0   1  0 
2 1908  March 6.2 0.3  0  0   0   0  0 
3 1908  April 7.4 2.1  1  0   0   0  0 
4 1908  May 16.5 7.7  0  0   0   0  0 

    July June March May November October September 
0  0  0  0 0   0  0   0 
1  0  0  0 0   0  0   0 
2  0  0  1 0   0  0   0 
3  0  0  0 0   0  0   0 
4  0  0  0 1   0  0   0 

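If you need the columns in calendar order, there are several possible solutions. Use reindex (reindex_axis, which this answer originally also used, was removed in pandas 1.0):

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']
df1 = pd.get_dummies(df['month']).reindex(months, axis=1)
print(df1.head())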
    January February March April May June July August September \ 
0  1   0  0  0 0  0  0  0   0 
1  0   1  0  0 0  0  0  0   0 
2  0   0  1  0 0  0  0  0   0 
3  0   0  0  1 0  0  0  0   0 
4  0   0  0  0 1  0  0  0   0 

    October November December 
0  0   0   0 
1  0   0   0 
2  0   0   0 
3  0   0   0 
4  0   0   0 

# same result, using the columns keyword
df1 = pd.get_dummies(df['month']).reindex(columns=months)
print(df1.head())
    January February March April May June July August September \ 
0  1   0  0  0 0  0  0  0   0 
1  0   1  0  0 0  0  0  0   0 
2  0   0  1  0 0  0  0  0   0 
3  0   0  0  1 0  0  0  0   0 
4  0   0  0  0 1  0  0  0   0 

    October November December 
0  0   0   0 
1  0   0   0 
2  0   0   0 
3  0   0   0 
4  0   0   0 
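
One thing to watch with reindex: if a month never occurs in the data, the reindexed column is filled with NaN rather than 0. The sketch below uses a made-up three-row frame (an assumption, not the question's data) to show the difference, and passes fill_value=0 as one possible fix.

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Hypothetical data that only covers the first three months.
df = pd.DataFrame({'month': ['January', 'February', 'March']})
dummies = pd.get_dummies(df['month'], dtype=int)

# Months absent from the data become all-NaN columns ...
plain = dummies.reindex(columns=months)

# ... unless a fill value is supplied.
filled = dummies.reindex(columns=months, fill_value=0)

print(plain['April'].isna().all())
print(filled['April'].tolist())
```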

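Or convert the month column to an ordered categorical:

# astype('category', categories=...) is no longer accepted; pass a CategoricalDtype instead
month_dtype = pd.CategoricalDtype(categories=months, ordered=True)
df1 = pd.get_dummies(df['month'].astype(month_dtype))
print(df1.head())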
    January February March April May June July August September \ 
0  1   0  0  0 0  0  0  0   0 
1  0   1  0  0 0  0  0  0   0 
2  0   0  1  0 0  0  0  0   0 
3  0   0  0  1 0  0  0  0   0 
4  0   0  0  0 1  0  0  0   0 

    October November December 
0  0   0   0 
1  0   0   0 
2  0   0   0 
3  0   0   0 
4  0   0   0 
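
With the categorical approach, get_dummies creates one column per declared category, so months missing from the data still get an all-zero column. A minimal sketch, again on a made-up three-row frame (an assumption for illustration):

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Hypothetical data that only covers the first three months.
df = pd.DataFrame({'month': ['March', 'January', 'February']})

# Declaring all 12 categories fixes both the column order and the column set.
month_dtype = pd.CategoricalDtype(categories=months, ordered=True)
df1 = pd.get_dummies(df['month'].astype(month_dtype), dtype=int)

print(df1.shape)
print(df1['December'].sum())
```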

3

IIUC, you can use assign, the ** unpacking operator, and pd.get_dummies:

df.assign(**pd.get_dummies(df['month'])) 

Output:

yyyy  month tmax tmin April August December February January \ 
0 1908 January 5.0 -1.4  0  0   0   0  1 
1 1908 February 7.3 1.9  0  0   0   1  0 
2 1908  March 6.2 0.3  0  0   0   0  0 
3 1908  April 7.4 2.1  1  0   0   0  0 
4 1908  May 16.5 7.7  0  0   0   0  0 
5 1908  June 17.7 8.7  0  0   0   0  0 
6 1908  July 20.1 11.0  0  0   0   0  0 
7 1908  August 17.5 9.7  0  1   0   0  0 
8 1908 September 16.3 8.4  0  0   0   0  0 
9 1908 October 14.6 8.0  0  0   0   0  0 
10 1908 November 9.6 3.4  0  0   0   0  0 
11 1908 December 5.8 -0.3  0  0   1   0  0 
12 1909 January 5.0 0.1  0  0   0   0  1 
13 1909 February 5.5 -0.3  0  0   0   1  0 
14 1909  March 5.6 -0.3  0  0   0   0  0 
15 1909  April 12.2 3.3  1  0   0   0  0 
16 1909  May 14.7 4.8  0  0   0   0  0 
17 1909  June 15.0 7.5  0  0   0   0  0 
18 1909  July 17.3 10.8  0  0   0   0  0 
19 1909  August 18.8 10.7  0  1   0   0  0 
20 1909 September 14.5 8.1  0  0   0   0  0 
21 1909 October 12.9 6.9  0  0   0   0  0 
22 1909 November 7.5 1.7  0  0   0   0  0 
23 1909 December 5.3 0.4  0  0   1   0  0 
24 1910 January 5.2 -0.5  0  0   0   0  1 

    July June March May November October September 
0  0  0  0 0   0  0   0 
1  0  0  0 0   0  0   0 
2  0  0  1 0   0  0   0 
3  0  0  0 0   0  0   0 
4  0  0  0 1   0  0   0 
5  0  1  0 0   0  0   0 
6  1  0  0 0   0  0   0 
7  0  0  0 0   0  0   0 
8  0  0  0 0   0  0   1 
9  0  0  0 0   0  1   0 
10  0  0  0 0   1  0   0 
11  0  0  0 0   0  0   0 
12  0  0  0 0   0  0   0 
13  0  0  0 0   0  0   0 
14  0  0  1 0   0  0   0 
15  0  0  0 0   0  0   0 
16  0  0  0 1   0  0   0 
17  0  1  0 0   0  0   0 
18  1  0  0 0   0  0   0 
19  0  0  0 0   0  0   0 
20  0  0  0 0   0  0   1 
21  0  0  0 0   0  1   0 
22  0  0  0 0   1  0   0 
23  0  0  0 0   0  0   0 
24  0  0  0 0   0  0   0
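
A quick sanity check for this kind of encoding is that every row has exactly one 1 among the month columns, and that the column holding the 1 matches the original month. The sketch below assumes a small hand-built frame (not the question's full data) and uses dtype=int so the dummies are 0/1 on recent pandas versions.

```python
import pandas as pd

# Hypothetical sample: a few rows in the shape of the question's data.
df = pd.DataFrame({
    'yyyy': [1908, 1908, 1908, 1909],
    'month': ['January', 'February', 'March', 'January'],
    'tmax': [5.0, 7.3, 6.2, 5.0],
    'tmin': [-1.4, 1.9, 0.3, 0.1],
})

# ** unpacks the dummy DataFrame into column_name=Series keyword arguments.
out = df.assign(**pd.get_dummies(df['month'], dtype=int))

month_cols = sorted(df['month'].unique())

# Exactly one indicator is set per row.
one_hot_ok = bool((out[month_cols].sum(axis=1) == 1).all())

# idxmax returns the label of the column holding the 1 in each row.
recovered = out[month_cols].idxmax(axis=1)

print(one_hot_ok)
print(recovered.tolist())
```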