Problem with OneHotEncoder for categorical features

I want to encode 3 categorical features out of the 10 features in my dataset. I use preprocessing from sklearn to do so, as follows:

from sklearn import preprocessing 
cat_features = ['color', 'director_name', 'actor_2_name'] 
enc = preprocessing.OneHotEncoder(categorical_features=cat_features) 
enc.fit(dataset.values)

However, I cannot proceed because I get this error:

array = np.array(array, dtype=dtype, order=order, copy=copy) 
ValueError: could not convert string to float: PG

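I am surprised that it complains about the string, since converting it is exactly what the encoder is supposed to do! Am I missing something here?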


2017-04-24 Medo

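Apply LabelEncoder to these features first.

If you read the docs for OneHotEncoder, you'll see that the input for fit is "Input array of type int". So you need two steps to one-hot encode your data: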

from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
# Step 1: map the strings to integer labels
enc = preprocessing.LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)
print(new_cat_features)  # [1 2 0]
new_cat_features = new_cat_features.reshape(-1, 1)  # Needs to be the correct shape
# Step 2: one-hot encode the integer labels
# sparse=False gives a dense array (easier to read); renamed sparse_output in scikit-learn 1.2
ohe = preprocessing.OneHotEncoder(sparse=False)
print(ohe.fit_transform(new_cat_features))

Output:

[[ 0. 1. 0.] 
[ 0. 0. 1.] 
[ 1. 0. 0.]]
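The snippet above encodes the list of column names only to demonstrate the two steps. A minimal sketch of the same idea applied to actual column values might look like the following. The small DataFrame and its values are made up for illustration, and one LabelEncoder per column is an assumption about how to extend the answer to several columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical stand-in for the asker's dataset (values are invented).
dataset = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "director_name": ["Director A", "Director B", "Director C"],
    "duration": [120, 95, 101],
})
cat_features = ["color", "director_name"]

# Step 1: turn each string column into integer codes (one LabelEncoder per column).
int_codes = np.column_stack(
    [LabelEncoder().fit_transform(dataset[col]) for col in cat_features]
)

# Step 2: one-hot encode the integer codes; .toarray() makes the result dense.
one_hot = OneHotEncoder().fit_transform(int_codes).toarray()
print(one_hot.shape)
```

Each row of one_hot then holds one indicator per distinct value of each encoded column.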


2017-04-24 13:16:45 ncfirth

From the docs:

categorical_features : “all” or array of indices or mask 
Specify what features are treated as categorical. 
‘all’ (default): All features are treated as categorical. 
array of indices: Array of categorical feature indices. 
mask: Array of length n_features and with dtype=bool.

Column names of a pandas DataFrame won't work. If your categorical features are in columns 0, 2 and 6, use:

from sklearn import preprocessing 
cat_features = [0, 2, 6] 
enc = preprocessing.OneHotEncoder(categorical_features=cat_features) 
enc.fit(dataset.values)

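It must also be noted that if these categorical features are not label encoded, you need to apply LabelEncoder to them before using OneHotEncoder.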
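For readers on newer scikit-learn releases: the categorical_features parameter was deprecated in 0.20 and removed in 0.22, and OneHotEncoder has accepted string categories directly since 0.20. A rough sketch of index-based column selection with ColumnTransformer, assuming scikit-learn 0.20 or later and using an invented array in place of dataset.values, could be:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Invented stand-in for dataset.values: columns 0 and 2 are categorical.
X = np.array(
    [
        ["Color", 120, "PG"],
        ["Black and White", 95, "R"],
        ["Color", 101, "PG"],
    ],
    dtype=object,
)
cat_idx = [0, 2]

# One-hot encode only the selected columns and pass the rest through unchanged.
# sparse_threshold=0 forces a dense result.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), cat_idx)],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = ct.fit_transform(X)
print(encoded.shape)
```

The encoded columns come first in the result, followed by the untouched numeric column.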


2017-04-24 13:16:06

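Thank you very much. – Medo

You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

from sklearn.preprocessing import LabelBinarizer
cat_features = ['color', 'director_name', 'actor_2_name']
encoder = LabelBinarizer()
new_cat_features = encoder.fit_transform(cat_features)
new_cat_features

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.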
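As a small illustration of the dense and sparse variants on a single column of values, here is a sketch with made-up ratings. Note that LabelBinarizer works on one column at a time, and with only two distinct classes it returns a single 0/1 column rather than two.

```python
from sklearn.preprocessing import LabelBinarizer

# Invented values for one categorical column with three distinct classes.
ratings = ["PG", "R", "PG-13", "PG"]

# Default: dense NumPy array, one indicator column per class.
dense = LabelBinarizer().fit_transform(ratings)
print(dense)

# sparse_output=True returns a SciPy sparse matrix with the same content.
sparse_matrix = LabelBinarizer(sparse_output=True).fit_transform(ratings)
print(sparse_matrix.shape)
```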

Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow


2017-07-21 23:21:54

If the dataset is in a pandas DataFrame, using

pandas.get_dummies

will be simpler.

*Corrected from pandas.get_getdummies to pandas.get_dummies
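A minimal sketch of that approach, using an invented DataFrame in place of the asker's dataset:

```python
import pandas as pd

# Invented stand-in for the asker's dataset.
dataset = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "director_name": ["Director A", "Director B", "Director A"],
    "duration": [120, 95, 101],
})

# Only the listed columns are expanded into indicator columns;
# the remaining columns are kept as they are.
encoded = pd.get_dummies(dataset, columns=["color", "director_name"])
print(encoded.columns.tolist())
```

The new columns are named after the original column and the category value, for example color_Color.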


2017-11-27 09:05:14 HappyCoding

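@Medo,

I ran into the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns given in the categorical_features parameter.

Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is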

X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

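This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.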
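The failure can be reproduced without scikit-learn. The sketch below only mimics the float conversion with plain NumPy (the array is invented and this is not the library's actual code path), but it raises the same kind of ValueError as the traceback in the question.

```python
import numpy as np

# Invented row mixing strings and numbers, as dataset.values would contain.
X = np.array([["Color", 120.0, "PG"]], dtype=object)

message = None
try:
    # Converting the whole array to float fails as soon as a string is hit,
    # before any column selection could take place.
    np.asarray(X, dtype=np.float64)
except ValueError as exc:
    message = str(exc)

print(message)
```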


2018-02-15 00:03:35
