2015-09-10
6

I am building a decision tree using scikit-learn:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)

This all works fine for building the decision tree. But how can I explore it?

For example, how do I find which entries from X_train appear in a particular leaf?

+3

I ran into a similar problem. You may find my answer [here](http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/42227468#42227468) (and the walkthrough mentioned there) helpful. It uses the method 'decision_path' from the 0.18 release. Substitute 'X_train' for 'X_test' in a few spots if you are interested in seeing the training samples. – Kevin
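A minimal sketch of that decision_path approach (assuming scikit-learn ≥ 0.18 and the fitted clf from the question; decision_path and apply are standard DecisionTreeClassifier methods):

import numpy as np

# apply() gives the id of the leaf each sample ends up in,
# decision_path() the full set of nodes it passes through
leaf_ids = clf.apply(X_train)
node_indicator = clf.decision_path(X_train)

# entries of X_train that land in each particular leaf
for leaf in np.unique(leaf_ids):
    print("leaf %d holds rows %s" % (leaf, np.where(leaf_ids == leaf)[0]))

# nodes that sample 0 passes through on its way down (CSR indexing)
print(node_indicator.indices[node_indicator.indptr[0]:node_indicator.indptr[1]])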

Answers

3

The following code should produce a plot of the top ten features:

import numpy as np
import matplotlib.pyplot as plt

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking (assumes the data has at least 10 features)
print("Feature ranking:")

for f in range(10):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the tree
# (unlike the forest example this was adapted from, a single tree has no
# ensemble of estimators to take a standard deviation over, so the yerr
# error bars from the original are dropped)
plt.figure()
plt.title("Feature importances")
plt.bar(range(10), importances[indices[:10]], color="r", align="center")
plt.xticks(range(10), indices[:10])
plt.xlim([-1, 10])
plt.show()

Taken from here and slightly modified to fit the DecisionTreeClassifier.

This doesn't exactly help you explore the tree, but it does tell you something about it.

+0

Thanks, but I would like to see which training data fall into each leaf, for example. Currently I have to draw the decision tree, write down the rules, and write a script to filter the data with those rules. That can't be the right way! – eleanora

+0

Is your data small enough to run those calculations by hand or in a spreadsheet? I'm assuming this is for a class, in which case it may be better not to just run the algorithm and copy the structure over. That said, I imagine there is some way to get the structure of the tree out of sci-kit. Here is the source for DecisionTreeClassifier: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py –

+0

This isn't for a class! I have around 1,000,000 items, so I do it by writing a separate python script. However, I don't even know how to extract the rules for each leaf automatically. Is there a way? – eleanora

5

You need to use the predict method.

After training the tree, you feed in X values to predict their output.

from sklearn.datasets import load_iris 
from sklearn.tree import DecisionTreeClassifier 
clf = DecisionTreeClassifier(random_state=0) 
iris = load_iris() 
tree = clf.fit(iris.data, iris.target) 
tree.predict(iris.data) 

Output:

>>> tree.predict(iris.data) 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]) 

To get more detailed information about the tree structure, we can use tree_.__getstate__().

The tree structure, translated into an "ASCII art" picture:

   0 
     _____________ 
     1   2 
       ______________ 
       3   12 
      _______  _______ 
      4  7  13 16 
      ___ ______  _____ 
      5 6 8 9  14 15 
         _____ 
         10 11 

The tree structure as an array:

In [38]: tree.tree_.__getstate__()['nodes'] 
Out[38]: 
array([(1, 2, 3, 0.800000011920929, 0.6666666666666667, 150, 150.0), 
     (-1, -1, -2, -2.0, 0.0, 50, 50.0), 
     (3, 12, 3, 1.75, 0.5, 100, 100.0), 
     (4, 7, 2, 4.949999809265137, 0.16803840877914955, 54, 54.0), 
     (5, 6, 3, 1.6500000953674316, 0.04079861111111116, 48, 48.0), 
     (-1, -1, -2, -2.0, 0.0, 47, 47.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (8, 9, 3, 1.5499999523162842, 0.4444444444444444, 6, 6.0), 
     (-1, -1, -2, -2.0, 0.0, 3, 3.0), 
     (10, 11, 2, 5.449999809265137, 0.4444444444444444, 3, 3.0), 
     (-1, -1, -2, -2.0, 0.0, 2, 2.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (13, 16, 2, 4.850000381469727, 0.042533081285444196, 46, 46.0), 
     (14, 15, 1, 3.0999999046325684, 0.4444444444444444, 3, 3.0), 
     (-1, -1, -2, -2.0, 0.0, 2, 2.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (-1, -1, -2, -2.0, 0.0, 43, 43.0)], 
     dtype=[('left_child', '<i8'), ('right_child', '<i8'), 
      ('feature', '<i8'), ('threshold', '<f8'), 
      ('impurity', '<f8'), ('n_node_samples', '<i8'), 
      ('weighted_n_node_samples', '<f8')]) 

Where:

  • The first node [0] is the root node.
  • Internal nodes have left_child and right_child filled in with positive values, which point to nodes further along in the array (their indices are always greater than the current node's).
  • Leaves have a value of -1 for both the left and right child nodes.
  • Nodes 1, 5, 6, 8, 10, 11, 14, 15, 16 are leaves.
  • The node structure is built using a depth-first search algorithm.
  • The feature field tells us which of the iris.data features was used at a node to determine the path for a sample.
  • The threshold tells us the value used to evaluate the direction based on that feature.
  • The impurity reaches 0 at the leaves... since all samples at a leaf belong to the same class once you get there.
  • n_node_samples tells us how many samples reach each leaf.

Using this information we could trivially track each sample X to the leaf where it eventually lands, by following the classification rules and thresholds in a script (as sketched below). Additionally, n_node_samples would allow us to perform unit tests ensuring that each node gets the correct number of samples. Then, using the output of tree.predict, we could map each leaf to its associated class.
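A minimal sketch of that tracing idea (my own illustration, not code from the original answer; it reuses the nodes array from __getstate__ shown above):

from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
nodes = tree.tree_.__getstate__()['nodes']

leaf_counts = Counter()
for x in iris.data:
    i = 0  # start at the root
    while nodes[i]['left_child'] != -1:  # -1 marks a leaf
        # sklearn sends a sample left when feature value <= threshold
        if x[nodes[i]['feature']] <= nodes[i]['threshold']:
            i = nodes[i]['left_child']
        else:
            i = nodes[i]['right_child']
    leaf_counts[i] += 1

# unit test: the counts collected per leaf should match n_node_samples
for leaf, count in leaf_counts.items():
    assert count == nodes[leaf]['n_node_samples']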

+0

Thanks. This tells me the class, but not which leaf of the decision tree each item sits in. If I could somehow extract the rules needed to reach each leaf, I could re-run those rules over the data. – eleanora

+0

When you say you want to see the leaves, do you mean you want to see the rules the tree used at each node? If that is the case, then maybe this will help: http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree – PabTorre

+0

For a given leaf, I would like to see the training data that the decision tree would place at that leaf. In other words, each leaf is associated with a sequence of rules (comparisons). I would like to see the subset of the data you get when you apply those rules. – eleanora

5

Note: this is not an answer, only a hint on possible solutions.

I ran into a similar problem in my project recently. My goal is to extract the corresponding decision chain for some specific samples. I think your problem is a subset of mine, since you just need to record the last step in the decision chain.

So far, it seems the only viable solution is to write a custom predict method in Python to keep track of the decisions along the way. The reason is that the predict method provided by scikit-learn cannot do this out of the box (as far as I know). And to make it worse, it is a wrapper around a C implementation, which is pretty hard to customize.

Customization is fine for my problem, since I'm dealing with an imbalanced dataset and the samples I care about (positive ones) are rare. So I can filter them out first using the sklearn predict and then get the decision chains using my customization.

However, this may not work for you if you have a large dataset, because if you parse the tree and do the prediction in Python, it will run at Python speed and will not scale (easily). You may have to fall back to customizing the C implementation.
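For what it's worth, a minimal sketch of what such a customized Python-level predict could look like (my own illustration, assuming a fitted DecisionTreeClassifier clf; it records the full decision chain rather than just the class):

def decision_chain(clf, x):
    """Trace one sample x through a fitted tree, recording each decision
    as (node id, feature index, threshold, went_left)."""
    t = clf.tree_
    node, chain = 0, []
    while t.children_left[node] != -1:  # -1 marks a leaf
        went_left = x[t.feature[node]] <= t.threshold[node]
        chain.append((node, t.feature[node], t.threshold[node], went_left))
        node = t.children_left[node] if went_left else t.children_right[node]
    return chain, node  # node is the leaf the sample lands in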

+0

Partial answers that include as much of your research as possible are still acceptable. – Xufox

+0

Thanks. Haven't had time to implement the idea. Hope someone with code will show up soon. – zaxliu

3

This code will do exactly what you want. Here, n is the number of observations in X_train. At the end, the (n, number_of_leaves)-sized array leaf_observations holds, in each column, boolean values for indexing into X_train to get the observations in each leaf. Each column of leaf_observations corresponds to an element of leaves, which holds the node IDs of the tree's leaves.

import numpy as np  # assumes a fitted clf, a training matrix X_train, and n = X_train.shape[0]

# get the nodes which are leaves
leaves = clf.tree_.children_left == -1
leaves = np.arange(0, clf.tree_.node_count)[leaves]

# loop through each leaf and figure out the data in it
leaf_observations = np.zeros((n, len(leaves)), dtype=bool)
# build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
thistree = [clf.tree_.feature.tolist()]
thistree.append(clf.tree_.threshold.tolist())
thistree.append(clf.tree_.children_left.tolist())
thistree.append(clf.tree_.children_right.tolist())
# get the decision rules for each leaf node & apply them
for (ind, nod) in enumerate(leaves):
    # get the decision rules in numeric list form
    rules = []
    RevTraverseTree(thistree, nod, rules)
    # convert & apply to the data by sequentially &ing the rules
    thisnode = np.ones(n, dtype=bool)
    for rule in rules:
        if rule[1] == 1:
            thisnode = np.logical_and(thisnode, X_train[:, rule[0]] > rule[2])
        else:
            thisnode = np.logical_and(thisnode, X_train[:, rule[0]] <= rule[2])
    # get the observations that obey all the rules - they are the ones in this leaf node
    leaf_observations[:, ind] = thisnode

This requires the helper function defined here, which recursively traverses the tree starting from a given node, building up the decision rules:

def RevTraverseTree(tree, node, rules):
    '''
    Traverse an skl decision tree from a node (presumably a leaf node)
    up to the top, building the decision rules. The rules should be
    input as an empty list, which will be modified in place. The result
    is a nested list of tuples: (feature, direction (left=-1), threshold).
    The "tree" is a nested list of simplified tree attributes:
    [split feature, split threshold, left node, right node]
    '''
    # now find the node as either a left or right child of something
    # first try to find it as a left node
    try:
        prevnode = tree[2].index(node)
        leftright = -1
    except ValueError:
        # failed, so find it as a right node - if this also causes an exception, something's really f'd up
        prevnode = tree[3].index(node)
        leftright = 1
    # now let's get the rule that caused prevnode to -> node
    rules.append((tree[0][prevnode], leftright, tree[1][prevnode]))
    # if we've not yet reached the top, go up the tree one more step
    if prevnode != 0:
        RevTraverseTree(tree, prevnode, rules)
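For completeness, a usage sketch under the assumptions the snippet itself makes (a fitted clf, a training matrix X_train, and n = X_train.shape[0]; the iris data is only illustrative). After running the two blocks above, each column of leaf_observations can be cross-checked against clf.apply:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train = iris.data
n = X_train.shape[0]
clf = DecisionTreeClassifier(random_state=0).fit(X_train, iris.target)

# ... run the two blocks above here, then:
leaf_of_sample = clf.apply(X_train)  # leaf id of each training row
for (ind, nod) in enumerate(leaves):
    # each boolean column should select exactly the rows apply() maps to that leaf
    assert np.array_equal(np.where(leaf_observations[:, ind])[0],
                          np.where(leaf_of_sample == nod)[0])
    print("leaf %d: %d observations" % (nod, leaf_observations[:, ind].sum()))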

1

I think an easy option is to use the apply method of the trained decision tree. Train the tree, apply it to the training data, and build a lookup table from the returned leaf indices:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
clf = DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# apply training data to decision tree
leaf_indices = clf.apply(iris.data)
lookup = {}

# build lookup table
for i, leaf_index in enumerate(leaf_indices):
    try:
        lookup[leaf_index].append(iris.data[i])
    except KeyError:
        lookup[leaf_index] = []
        lookup[leaf_index].append(iris.data[i])

# test
unknown_sample = [[4., 3.1, 6.1, 1.2]]
index = clf.apply(unknown_sample)
print(lookup[index[0]])

0

Have you tried dumping your DecisionTree into a graphviz .dot file [1] and then loading it with graph_tool [2]:

import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from graph_tool.all import *

iris = load_iris() 
clf = DecisionTreeClassifier() 
clf = clf.fit(iris.data, iris.target) 

tree.export_graphviz(clf,out_file='tree.dot') 

#load graph with graph_tool and explore structure as you please 
g = load_graph('tree.dot') 

for v in g.vertices(): 
    for e in v.out_edges(): 
     print(e) 
    for w in v.out_neighbours(): 
     print(w) 

[1] http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html

[2] https://graph-tool.skewed.de/

+0

Can you make it as beautiful as the one at http://scikit-learn.org/stable/_images/iris.svg? – eleanora

+0

Once you have written the output with export_graphviz, something like that can be achieved with dot -Tpng tree.dot -o tree.png. –

2

I've changed a bit what Dr. Drew posted.
The following code, given a dataframe and a decision tree after it has been fitted, returns:

  • rules_list: a list of rules, one entry per leaf
  • values_path: a list of the class values encountered along each path

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def get_rules(dtc, df):
    rules_list = []
    values_path = []
    values = dtc.tree_.value

    def RevTraverseTree(tree, node, rules, pathValues):
        '''
        Traverse an skl decision tree from a node (presumably a leaf node)
        up to the top, building the decision rules. The rules should be
        input as an empty list, which will be modified in place. The result
        is a nested list of tuples: (feature, direction (left=-1), threshold).
        The "tree" is a nested list of simplified tree attributes:
        [split feature, split threshold, left node, right node]
        '''
        # now find the node as either a left or right child of something
        # first try to find it as a left node
        try:
            prevnode = tree[2].index(node)
            leftright = '<='
            pathValues.append(values[prevnode])
        except ValueError:
            # failed, so find it as a right node - if this also causes an exception, something's really f'd up
            prevnode = tree[3].index(node)
            leftright = '>'
            pathValues.append(values[prevnode])

        # now let's get the rule that caused prevnode to -> node
        p1 = df.columns[tree[0][prevnode]]
        p2 = tree[1][prevnode]
        rules.append(str(p1) + ' ' + leftright + ' ' + str(p2))

        # if we've not yet reached the top, go up the tree one more step
        if prevnode != 0:
            RevTraverseTree(tree, prevnode, rules, pathValues)

    # get the nodes which are leaves
    leaves = dtc.tree_.children_left == -1
    leaves = np.arange(0, dtc.tree_.node_count)[leaves]

    # build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
    thistree = [dtc.tree_.feature.tolist()]
    thistree.append(dtc.tree_.threshold.tolist())
    thistree.append(dtc.tree_.children_left.tolist())
    thistree.append(dtc.tree_.children_right.tolist())

    # get the decision rules for each leaf node & apply them
    for (ind, nod) in enumerate(leaves):

        # get the decision rules
        rules = []
        pathValues = []
        RevTraverseTree(thistree, nod, rules, pathValues)

        pathValues.insert(0, values[nod])
        pathValues = list(reversed(pathValues))

        rules = list(reversed(rules))

        rules_list.append(rules)
        values_path.append(pathValues)

    # note: the originally posted code returned an undefined name `r` here
    return (rules_list, values_path)

An example follows:

from sklearn.model_selection import train_test_split

df = pd.read_csv('df.csv')

X = df[df.columns[:-1]] 
y = df['classification'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 

dtc = DecisionTreeClassifier(max_depth=2) 
dtc.fit(X_train, y_train) 

Fitting the decision tree generated the following tree: Decision Tree with width 2

At this point, just call the function:

get_rules(dtc, df)

This is what the function returns:

rules = [ 
    ['first <= 63.5', 'first <= 43.5'], 
    ['first <= 63.5', 'first > 43.5'], 
    ['first > 63.5', 'second <= 19.700000762939453'], 
    ['first > 63.5', 'second > 19.700000762939453'] 
] 

values = [ 
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 284., 57.]])], 
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 352., 184.]])], 
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 645., 620.]])], 
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 287., 708.]])] 
] 

Obviously, in values, there are also the leaf values for each path.