2016-07-09 38 views
-3

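I am using the Cloudera 5.2 VM with pandas 0.18.0 and want to apply k-means to my DataFrame, but I have many columns. How can k-means be used with categorical attributes in pandas?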

My DataFrame is:

adClicksPerTime.head(n=5)
Out[50]:
            timestamp   adCategory  userId  totalAdClicks
0 2016-05-26 15:00:00   automotive     355              1
1 2016-05-26 15:00:00     clothing    1027              1
2 2016-05-26 15:00:00    computers    1821              1
3 2016-05-26 15:00:00    computers    2139              1
4 2016-05-26 15:00:00  electronics     253              1

for col in adClicksPerTime: 
    print(col) 
    print(type(adClicksPerTime[col][1])) 


timestamp 
<class 'pandas.tslib.Timestamp'> 
adCategory 
<class 'str'> 
userId 
<class 'numpy.int64'> 
totalAdClicks 
<class 'numpy.int64'> 

When I run k-means I get:

ValueError: could not convert string to float: 'automotive' 

I want to convert my strings to a categorical type and then assign numeric codes:

adClicksPerTime.adCategory = pd.Categorical.from_array(adClicksPerTime.adCategory)  

adClicksPerTime.head(n=5)
Out[54]:
            timestamp   adCategory  userId  totalAdClicks
0 2016-05-26 15:00:00   automotive     355              1
1 2016-05-26 15:00:00     clothing    1027              1
2 2016-05-26 15:00:00    computers    1821              1
3 2016-05-26 15:00:00    computers    2139              1
4 2016-05-26 15:00:00  electronics     253              1

for col in adClicksPerTime: 
    print(col) 
    print(type(adClicksPerTime[col][1])) 


timestamp 
<class 'pandas.tslib.Timestamp'> 
adCategory 
<class 'str'> 
userId 
<class 'numpy.int64'> 
totalAdClicks 
<class 'numpy.int64'> 
This is wrong: adCategory is still reported as str.

How can I apply k-means to a str field?

+0

k-means is only meant for **continuous** variables. Do not use it on this kind of data!
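One way to see the concern behind this comment: k-means works with Euclidean distances and cluster means, so the way a category is turned into numbers changes which rows look "close". The sketch below is only an illustration with hypothetical encodings for three of the categories in the question; it compares integer codes with one-hot vectors. One-hot encoding removes the arbitrary ordering, but it does not turn the data into a genuinely continuous variable.

```python
import numpy as np

# Hypothetical encodings for three categories from the question.
codes = {"automotive": 0.0, "clothing": 1.0, "computers": 2.0}
one_hot = {
    "automotive": np.array([1.0, 0.0, 0.0]),
    "clothing":   np.array([0.0, 1.0, 0.0]),
    "computers":  np.array([0.0, 0.0, 1.0]),
}

# With integer codes, the distance depends on the arbitrary numbering.
code_d1 = abs(codes["automotive"] - codes["clothing"])    # 1.0
code_d2 = abs(codes["automotive"] - codes["computers"])   # 2.0
print(code_d1, code_d2)

# With one-hot vectors, every pair of distinct categories is equally far apart.
d1 = np.linalg.norm(one_hot["automotive"] - one_hot["clothing"])
d2 = np.linalg.norm(one_hot["automotive"] - one_hot["computers"])
print(d1, d2)
```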

Answers

1

pd.get_dummies converts the categories into dummy (indicator) variables.

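# One indicator column per category value (the column name must be quoted)
dummies = pd.get_dummies(adClicksPerTime['adCategory'])
# Drop one level so the remaining columns are not redundant
del dummies['automotive']
print(dummies.columns)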

Then merge this DataFrame with the adClicksPerTime DataFrame, and finally apply k-means.
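The merge-then-cluster step could look roughly like the sketch below. It assumes scikit-learn's KMeans is the k-means implementation in use (the question does not say), rebuilds a small stand-in for adClicksPerTime from the five rows shown, and picks n_clusters=2 arbitrarily. Leaving timestamp and userId out of the features is also an assumption, not something stated in the thread.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Small stand-in for adClicksPerTime, copied from the rows shown in the question.
adClicksPerTime = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-26 15:00:00"] * 5),
    "adCategory": ["automotive", "clothing", "computers", "computers", "electronics"],
    "userId": [355, 1027, 1821, 2139, 253],
    "totalAdClicks": [1, 1, 1, 1, 1],
})

# One indicator column per category, with one level dropped as the reference.
dummies = pd.get_dummies(adClicksPerTime["adCategory"]).astype(float)
del dummies["automotive"]

# Join the indicators to the numeric column chosen as a feature.
# Assumption: timestamp and userId are identifiers here, so they are left out.
features = pd.concat([adClicksPerTime[["totalAdClicks"]], dummies], axis=1)

# n_clusters=2 is an arbitrary choice for this sketch.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
adClicksPerTime["cluster"] = model.fit_predict(features)
print(adClicksPerTime[["adCategory", "cluster"]])
```

pd.concat with axis=1 aligns on the index, so it works here because dummies was built from the same DataFrame and shares its row labels.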

adClicksPerTime.info() will show you the dtypes.
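The difference between the column dtype and the type of a single element may explain the output in the question. The sketch below uses a hypothetical stand-in Series for the adCategory column and astype("category") rather than the question's Categorical.from_array call: the column becomes categorical, each element read back is still a str, and the integer codes are available through .cat.codes.

```python
import pandas as pd

# Hypothetical stand-in for the adCategory column from the question.
adCategory = pd.Series(["automotive", "clothing", "computers", "computers", "electronics"])

as_category = adCategory.astype("category")

print(as_category.dtype)         # column-level dtype: category
print(type(as_category[1]))      # a single element is still a Python str

# Integer code per row, numbered by sorted category order.
codes = as_category.cat.codes
print(codes.tolist())            # [0, 1, 2, 2, 3]
```

The codes are arbitrary labels, so whether they are suitable as k-means input is a separate question from whether the conversion worked.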