2017-09-03 53 views

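Displaying the names of 'one-hot-encoded' variables in "feature importance"

After training and validating my algorithm, how do I correctly display the names of the 'one-hot-encoded' features? I want to neatly display each feature's name together with its importance. Here is what I have tried: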

Displaying the feature importances:

grid_search.best_estimator_.feature_importances_ 
array([ 7.67359589e-02, 7.20731884e-02, 4.38667330e-02, 
     1.69222269e-02, 1.51816327e-02, 1.66947835e-02, 
     1.56858183e-02, 3.43347923e-01, 5.95555727e-02, 
     7.65422356e-02, 1.11224727e-01, 1.02677088e-02, 
     1.32720377e-01, 1.06447326e-04, 4.45207929e-03, 
     4.62258699e-03]) 

Getting the one-hot category names:

cat_one_hot_attribs = list(encoder.classes_) 
print(cat_one_hot_attribs) 
['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'] 

Getting the rest of the names (the other columns):

num_attribs = list(X_train) 

['longitude', 
'latitude', 
'housing_median_age', 
'total_rooms', 
'total_bedrooms', 
'population', 
'households', 
'median_income', 
'rooms_per_household', 
'bedrooms_per_household', 
'population_per_household', 
0, 
1, 
2, 
3, 
4] 

Now I do the following:

attributes = num_attribs + cat_one_hot_attribs 

print(pd.DataFrame(sorted(zip(feature_importance, attributes), reverse=True))) 

But I get the following:

           0                         1
0   0.343348             median_income
1   0.132720                         1
2   0.111225  population_per_household
3   0.076736                 longitude
4   0.076542    bedrooms_per_household
5   0.072073                  latitude
6   0.059556       rooms_per_household
7   0.043867        housing_median_age
8   0.016922               total_rooms
9   0.016695                population
10  0.015686                households
11  0.015182            total_bedrooms
12  0.010268                         0
13  0.004623                         4
14  0.004452                         3
15  0.000106                         2

I have tried other approaches as well, but they all failed.

Could someone please suggest a way to display this correctly? Thank you.

Edit:

Following @cᴏʟᴅsᴘᴇᴇᴅ's answer, I tried the following:

feature_importance = grid_search.best_estimator_.feature_importances_ 

cat_one_hot_attribs = list(encoder.classes_) 

num_attribs = list(X_train) 
attributes = num_attribs + cat_one_hot_attribs 

vals = sorted(zip(feature_importance, attributes), key=lambda x: x[0], reverse=True) 
df = pd.DataFrame(vals) 
print(df) 

I still get the same output as above.


How do you want it sorted? –


Highest to lowest would be best. – JohnWayne360

Answers


Break it down. First, sort with a key, making sure that only feature_importance is considered when ordering.

Setup:

import pandas as pd 
import numpy as np 

feature_importance = np.array([ 7.67359589e-02, 7.20731884e-02, 4.38667330e-02, 
    1.69222269e-02, 1.51816327e-02, 1.66947835e-02, 
    1.56858183e-02, 3.43347923e-01, 5.95555727e-02, 
    7.65422356e-02, 1.11224727e-01, 1.02677088e-02, 
    1.32720377e-01, 1.06447326e-04, 4.45207929e-03, 
    4.62258699e-03]) 

cat_one_hot_attribs = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'] 

num_attribs = ['longitude', 
'latitude', 
'housing_median_age', 
'total_rooms', 
'total_bedrooms', 
'population', 
'households', 
'median_income', 
'rooms_per_household', 
'bedrooms_per_household', 
'population_per_household', 
0, 
1, 
2, 
3, 
4] 

attributes = num_attribs 

Get a list, vals, sorted by feature_importance:

vals = sorted(zip(feature_importance, attributes), key=lambda x: x[0], reverse=True) 
df = pd.DataFrame(vals) 

Then use .replace to swap the encoded values for the names in cat_one_hot_attribs:

df.iloc[:, -1] = df.iloc[:, -1].replace({i : k for i, k in enumerate(cat_one_hot_attribs)}) 
df 

           0                         1
0   0.343348             median_income
1   0.132720                    INLAND
2   0.111225  population_per_household
3   0.076736                 longitude
4   0.076542    bedrooms_per_household
5   0.072073                  latitude
6   0.059556       rooms_per_household
7   0.043867        housing_median_age
8   0.016922               total_rooms
9   0.016695                population
10  0.015686                households
11  0.015182            total_bedrooms
12  0.010268                 <1H OCEAN
13  0.004623                NEAR OCEAN
14  0.004452                  NEAR BAY
15  0.000106                    ISLAND
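An alternative to calling .replace afterwards is to swap the integer placeholders for category names before sorting. The following is only a sketch under two assumptions that the question does not confirm: that the placeholders 0 to 4 appear in the same order as encoder.classes_, and that every integer entry in num_attribs is such a placeholder. It uses plain Python lists so that it runs without pandas; the variable name ranked is made up for illustration.

```python
import numbers

# Importances as printed in the question (16 values, in column order).
feature_importance = [
    7.67359589e-02, 7.20731884e-02, 4.38667330e-02,
    1.69222269e-02, 1.51816327e-02, 1.66947835e-02,
    1.56858183e-02, 3.43347923e-01, 5.95555727e-02,
    7.65422356e-02, 1.11224727e-01, 1.02677088e-02,
    1.32720377e-01, 1.06447326e-04, 4.45207929e-03,
    4.62258699e-03,
]

cat_one_hot_attribs = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

num_attribs = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
               'total_bedrooms', 'population', 'households', 'median_income',
               'rooms_per_household', 'bedrooms_per_household',
               'population_per_household', 0, 1, 2, 3, 4]

# Assumption: each integer entry is an index into cat_one_hot_attribs.
# String column names pass through unchanged.
attributes = [
    cat_one_hot_attribs[a] if isinstance(a, numbers.Integral) else a
    for a in num_attribs
]

# Sort on the importance only, highest first.
ranked = sorted(zip(feature_importance, attributes),
                key=lambda pair: pair[0], reverse=True)

for importance, name in ranked:
    print(f"{name:<26} {importance:.6f}")
```

The ranked list can be passed to pd.DataFrame exactly as vals is above if a DataFrame is preferred over printed lines.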

Sorry, this did not work. I still get the encoded names... – JohnWayne360


@JohnWayne360 I think you should do 'attributes = cat_one_hot_attribs + num_attribs' before doing this. Could you also include your expected output? That would help if this does not solve it. –


I am, but it is still buggy. Could it be my environment?
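One possible explanation that does not depend on the environment, offered as a guess based only on the lists shown in the question: num_attribs already has 16 entries, one per importance, so appending the five category names gives 21 names. zip stops at the shorter input, so the appended names are never paired with anything. A quick check, using a placeholder list for the importances because only its length matters here:

```python
# Placeholder importances: only the length (16) matters for this check.
feature_importance = [0.0] * 16

cat_one_hot_attribs = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

num_attribs = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
               'total_bedrooms', 'population', 'households', 'median_income',
               'rooms_per_household', 'bedrooms_per_household',
               'population_per_household', 0, 1, 2, 3, 4]

# The concatenation used in the question.
attributes = num_attribs + cat_one_hot_attribs

print(len(feature_importance), len(attributes))  # 16 21

# zip truncates to the shorter sequence, so only 16 pairs are produced.
paired = list(zip(feature_importance, attributes))
paired_names = [name for _, name in paired]

# Names that never made it into a pair.
dropped = attributes[len(paired):]
print(dropped)
```

If this is what is happening, the last five paired names are still the integers 0 to 4, which would match the output shown in the question.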