2015-09-10
6

I am building a decision tree using scikit-learn:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)

This all works fine for building the decision tree. But how can I explore it?

For example, how do I find which entries from X_train appear in a particular leaf?

+3

I ran into a similar problem. You may find my answer [here](http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/42227468#42227468) (and the walkthrough mentioned there) helpful. It uses the method 'decision_path' from the 0.18 release. Substitute 'X_train' for 'X_test' in a few spots if you are interested in seeing the training samples. – Kevin
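A minimal sketch of that decision_path approach (assuming scikit-learn ≥ 0.18 and the fitted clf from the question; decision_path and apply are standard DecisionTreeClassifier methods):

import numpy as np

# apply() gives the id of the leaf each sample ends up in,
# decision_path() the full set of nodes it passes through
leaf_ids = clf.apply(X_train)
node_indicator = clf.decision_path(X_train)

# entries of X_train that land in each particular leaf
for leaf in np.unique(leaf_ids):
    print("leaf %d holds rows %s" % (leaf, np.where(leaf_ids == leaf)[0]))

# nodes that sample 0 passes through on its way down (CSR indexing)
print(node_indicator.indices[node_indicator.indptr[0]:node_indicator.indptr[1]])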

Answers

3

The following code should produce a plot of the top ten features:

import numpy as np
import matplotlib.pyplot as plt

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking (assumes the data has at least 10 features)
print("Feature ranking:")

for f in range(10):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the tree
# (unlike the forest example this was adapted from, a single tree has no
# ensemble of estimators to take a standard deviation over, so the yerr
# error bars from the original are dropped)
plt.figure()
plt.title("Feature importances")
plt.bar(range(10), importances[indices[:10]], color="r", align="center")
plt.xticks(range(10), indices[:10])
plt.xlim([-1, 10])
plt.show()

Taken from here and slightly modified to fit the DecisionTreeClassifier.

This doesn't exactly help you explore the tree, but it does tell you something about it.

+0

Thanks, but I would like to see which training data fall into each leaf, for example. Currently I have to draw the decision tree, write down the rules, and write a script to filter the data with those rules. That can't be the right way! – eleanora

+0

Is your data small enough to run those calculations by hand or in a spreadsheet? I'm assuming this is for a class, in which case it may be better not to just run the algorithm and copy the structure over. That said, I imagine there is some way to get the structure of the tree out of sci-kit. Here is the source for DecisionTreeClassifier: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py –

+0

This isn't for a class! I have around 1,000,000 items, so I do it by writing a separate python script. However, I don't even know how to extract the rules for each leaf automatically. Is there a way? – eleanora

5

You need to use the predict method.

After training the tree, you feed in X values to predict their output.

from sklearn.datasets import load_iris 
from sklearn.tree import DecisionTreeClassifier 
clf = DecisionTreeClassifier(random_state=0) 
iris = load_iris() 
tree = clf.fit(iris.data, iris.target) 
tree.predict(iris.data) 

Output:

>>> tree.predict(iris.data) 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]) 

To get more detailed information about the tree structure, we can use tree_.__getstate__().

The tree structure, translated into an "ASCII art" picture:

   0 
     _____________ 
     1   2 
       ______________ 
       3   12 
      _______  _______ 
      4  7  13 16 
      ___ ______  _____ 
      5 6 8 9  14 15 
         _____ 
         10 11 

The tree structure as an array:

In [38]: tree.tree_.__getstate__()['nodes'] 
Out[38]: 
array([(1, 2, 3, 0.800000011920929, 0.6666666666666667, 150, 150.0), 
     (-1, -1, -2, -2.0, 0.0, 50, 50.0), 
     (3, 12, 3, 1.75, 0.5, 100, 100.0), 
     (4, 7, 2, 4.949999809265137, 0.16803840877914955, 54, 54.0), 
     (5, 6, 3, 1.6500000953674316, 0.04079861111111116, 48, 48.0), 
     (-1, -1, -2, -2.0, 0.0, 47, 47.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (8, 9, 3, 1.5499999523162842, 0.4444444444444444, 6, 6.0), 
     (-1, -1, -2, -2.0, 0.0, 3, 3.0), 
     (10, 11, 2, 5.449999809265137, 0.4444444444444444, 3, 3.0), 
     (-1, -1, -2, -2.0, 0.0, 2, 2.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (13, 16, 2, 4.850000381469727, 0.042533081285444196, 46, 46.0), 
     (14, 15, 1, 3.0999999046325684, 0.4444444444444444, 3, 3.0), 
     (-1, -1, -2, -2.0, 0.0, 2, 2.0), 
     (-1, -1, -2, -2.0, 0.0, 1, 1.0), 
     (-1, -1, -2, -2.0, 0.0, 43, 43.0)], 
     dtype=[('left_child', '<i8'), ('right_child', '<i8'), 
      ('feature', '<i8'), ('threshold', '<f8'), 
      ('impurity', '<f8'), ('n_node_samples', '<i8'), 
      ('weighted_n_node_samples', '<f8')]) 

Where:

  • The first node [0] is the root node.
  • Internal nodes have left_child and right_child filled in with positive values, which point to nodes further along in the array (their indices are always greater than the current node's).
  • Leaves have a value of -1 for both the left and right child nodes.
  • Nodes 1, 5, 6, 8, 10, 11, 14, 15, 16 are leaves.
  • The node structure is built using a depth-first search algorithm.
  • The feature field tells us which of the iris.data features was used at a node to determine the path for a sample.
  • The threshold tells us the value used to evaluate the direction based on that feature.
  • The impurity reaches 0 at the leaves... since all samples at a leaf belong to the same class once you get there.
  • n_node_samples tells us how many samples reach each leaf.

Using this information we could trivially track each sample X to the leaf where it eventually lands, by following the classification rules and thresholds in a script (as sketched below). Additionally, n_node_samples would allow us to perform unit tests ensuring that each node gets the correct number of samples. Then, using the output of tree.predict, we could map each leaf to its associated class.
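A minimal sketch of that tracing idea (my own illustration, not code from the original answer; it reuses the nodes array from __getstate__ shown above):

from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
nodes = tree.tree_.__getstate__()['nodes']

leaf_counts = Counter()
for x in iris.data:
    i = 0  # start at the root
    while nodes[i]['left_child'] != -1:  # -1 marks a leaf
        # sklearn sends a sample left when feature value <= threshold
        if x[nodes[i]['feature']] <= nodes[i]['threshold']:
            i = nodes[i]['left_child']
        else:
            i = nodes[i]['right_child']
    leaf_counts[i] += 1

# unit test: the counts collected per leaf should match n_node_samples
for leaf, count in leaf_counts.items():
    assert count == nodes[leaf]['n_node_samples']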

+0

Thanks. This tells me the class, but not which leaf of the decision tree each item sits in. If I could somehow extract the rules needed to reach each leaf, I could re-run those rules over the data. – eleanora

+0

When you say you want to see the leaves, do you mean you want to see the rules the tree used at each node? If that is the case, then maybe this will help: http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree – PabTorre

+0

For a given leaf, I would like to see the training data that the decision tree would place at that leaf. In other words, each leaf is associated with a sequence of rules (comparisons). I would like to see the subset of the data you get when you apply those rules. – eleanora

5

Note: this is not an answer, only a hint on possible solutions.

I ran into a similar problem in my project recently. My goal is to extract the corresponding decision chain for some specific samples. I think your problem is a subset of mine, since you just need to record the last step in the decision chain.

So far, it seems the only viable solution is to write a custom predict method in Python to keep track of the decisions along the way. The reason is that the predict method provided by scikit-learn cannot do this out of the box (as far as I know). And to make it worse, it is a wrapper around a C implementation, which is pretty hard to customize.

Customization is fine for my problem, since I'm dealing with an imbalanced dataset and the samples I care about (positive ones) are rare. So I can filter them out first using the sklearn predict and then get the decision chains using my customization.

However, this may not work for you if you have a large dataset, because if you parse the tree and do the prediction in Python, it will run at Python speed and will not scale (easily). You may have to fall back to customizing the C implementation.
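For what it's worth, a minimal sketch of what such a customized Python-level predict could look like (my own illustration, assuming a fitted DecisionTreeClassifier clf; it records the full decision chain rather than just the class):

def decision_chain(clf, x):
    """Trace one sample x through a fitted tree, recording each decision
    as (node id, feature index, threshold, went_left)."""
    t = clf.tree_
    node, chain = 0, []
    while t.children_left[node] != -1:  # -1 marks a leaf
        went_left = x[t.feature[node]] <= t.threshold[node]
        chain.append((node, t.feature[node], t.threshold[node], went_left))
        node = t.children_left[node] if went_left else t.children_right[node]
    return chain, node  # node is the leaf the sample lands in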

+0

Partial answers that include as much of your research as possible are still acceptable. – Xufox

+0

Thanks. Haven't had time to implement the idea. Hope someone with code will show up soon. – zaxliu

3

This code will do exactly what you want. Here, n is the number of observations in X_train. At the end, the (n, number_of_leaves)-sized array leaf_observations holds, in each column, boolean values for indexing into X_train to get the observations in each leaf. Each column of leaf_observations corresponds to an element of leaves, which holds the node IDs of the tree's leaves.

import numpy as np  # assumes a fitted clf, a training matrix X_train, and n = X_train.shape[0]

# get the nodes which are leaves
leaves = clf.tree_.children_left == -1
leaves = np.arange(0, clf.tree_.node_count)[leaves]

# loop through each leaf and figure out the data in it
leaf_observations = np.zeros((n, len(leaves)), dtype=bool)
# build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
thistree = [clf.tree_.feature.tolist()]
thistree.append(clf.tree_.threshold.tolist())
thistree.append(clf.tree_.children_left.tolist())
thistree.append(clf.tree_.children_right.tolist())
# get the decision rules for each leaf node & apply them
for (ind, nod) in enumerate(leaves):
    # get the decision rules in numeric list form
    rules = []
    RevTraverseTree(thistree, nod, rules)
    # convert & apply to the data by sequentially &ing the rules
    thisnode = np.ones(n, dtype=bool)
    for rule in rules:
        if rule[1] == 1:
            thisnode = np.logical_and(thisnode, X_train[:, rule[0]] > rule[2])
        else:
            thisnode = np.logical_and(thisnode, X_train[:, rule[0]] <= rule[2])
    # get the observations that obey all the rules - they are the ones in this leaf node
    leaf_observations[:, ind] = thisnode

This requires the helper function defined here, which recursively traverses the tree starting from a given node, building up the decision rules:

def RevTraverseTree(tree, node, rules):
    '''
    Traverse an skl decision tree from a node (presumably a leaf node)
    up to the top, building the decision rules. The rules should be
    input as an empty list, which will be modified in place. The result
    is a nested list of tuples: (feature, direction (left=-1), threshold).
    The "tree" is a nested list of simplified tree attributes:
    [split feature, split threshold, left node, right node]
    '''
    # now find the node as either a left or right child of something
    # first try to find it as a left node
    try:
        prevnode = tree[2].index(node)
        leftright = -1
    except ValueError:
        # failed, so find it as a right node - if this also causes an exception, something's really f'd up
        prevnode = tree[3].index(node)
        leftright = 1
    # now let's get the rule that caused prevnode to -> node
    rules.append((tree[0][prevnode], leftright, tree[1][prevnode]))
    # if we've not yet reached the top, go up the tree one more step
    if prevnode != 0:
        RevTraverseTree(tree, prevnode, rules)
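For completeness, a usage sketch under the assumptions the snippet itself makes (a fitted clf, a training matrix X_train, and n = X_train.shape[0]; the iris data is only illustrative). After running the two blocks above, each column of leaf_observations can be cross-checked against clf.apply:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train = iris.data
n = X_train.shape[0]
clf = DecisionTreeClassifier(random_state=0).fit(X_train, iris.target)

# ... run the two blocks above here, then:
leaf_of_sample = clf.apply(X_train)  # leaf id of each training row
for (ind, nod) in enumerate(leaves):
    # each boolean column should select exactly the rows apply() maps to that leaf
    assert np.array_equal(np.where(leaf_observations[:, ind])[0],
                          np.where(leaf_of_sample == nod)[0])
    print("leaf %d: %d observations" % (nod, leaf_observations[:, ind].sum()))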

1

I think an easy option is to use the apply method of the trained decision tree. Train the tree, apply it to the training data, and build a lookup table from the returned leaf indices:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
clf = DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# apply training data to decision tree
leaf_indices = clf.apply(iris.data)
lookup = {}

# build lookup table
for i, leaf_index in enumerate(leaf_indices):
    try:
        lookup[leaf_index].append(iris.data[i])
    except KeyError:
        lookup[leaf_index] = []
        lookup[leaf_index].append(iris.data[i])

# test
unknown_sample = [[4., 3.1, 6.1, 1.2]]
index = clf.apply(unknown_sample)
print(lookup[index[0]])

0

Have you tried dumping your DecisionTree into a graphviz .dot file [1] and then loading it with graph_tool [2]:

import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from graph_tool.all import *

iris = load_iris() 
clf = DecisionTreeClassifier() 
clf = clf.fit(iris.data, iris.target) 

tree.export_graphviz(clf,out_file='tree.dot') 

#load graph with graph_tool and explore structure as you please 
g = load_graph('tree.dot') 

for v in g.vertices(): 
    for e in v.out_edges(): 
     print(e) 
    for w in v.out_neighbours(): 
     print(w) 

[1] http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html

[2] https://graph-tool.skewed.de/

+0

Can you make it as beautiful as the one at http://scikit-learn.org/stable/_images/iris.svg? – eleanora

+0

Once you have written the output with export_graphviz, something like that can be achieved with dot -Tpng tree.dot -o tree.png. –

2

I've changed a bit what Dr. Drew posted.
The following code, given a dataframe and a decision tree after it has been fitted, returns:

  • rules_list: a list of rules, one entry per leaf
  • values_path: a list of the class values encountered along each path

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def get_rules(dtc, df):
    rules_list = []
    values_path = []
    values = dtc.tree_.value

    def RevTraverseTree(tree, node, rules, pathValues):
        '''
        Traverse an skl decision tree from a node (presumably a leaf node)
        up to the top, building the decision rules. The rules should be
        input as an empty list, which will be modified in place. The result
        is a nested list of tuples: (feature, direction (left=-1), threshold).
        The "tree" is a nested list of simplified tree attributes:
        [split feature, split threshold, left node, right node]
        '''
        # now find the node as either a left or right child of something
        # first try to find it as a left node
        try:
            prevnode = tree[2].index(node)
            leftright = '<='
            pathValues.append(values[prevnode])
        except ValueError:
            # failed, so find it as a right node - if this also causes an exception, something's really f'd up
            prevnode = tree[3].index(node)
            leftright = '>'
            pathValues.append(values[prevnode])

        # now let's get the rule that caused prevnode to -> node
        p1 = df.columns[tree[0][prevnode]]
        p2 = tree[1][prevnode]
        rules.append(str(p1) + ' ' + leftright + ' ' + str(p2))

        # if we've not yet reached the top, go up the tree one more step
        if prevnode != 0:
            RevTraverseTree(tree, prevnode, rules, pathValues)

    # get the nodes which are leaves
    leaves = dtc.tree_.children_left == -1
    leaves = np.arange(0, dtc.tree_.node_count)[leaves]

    # build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
    thistree = [dtc.tree_.feature.tolist()]
    thistree.append(dtc.tree_.threshold.tolist())
    thistree.append(dtc.tree_.children_left.tolist())
    thistree.append(dtc.tree_.children_right.tolist())

    # get the decision rules for each leaf node & apply them
    for (ind, nod) in enumerate(leaves):

        # get the decision rules
        rules = []
        pathValues = []
        RevTraverseTree(thistree, nod, rules, pathValues)

        pathValues.insert(0, values[nod])
        pathValues = list(reversed(pathValues))

        rules = list(reversed(rules))

        rules_list.append(rules)
        values_path.append(pathValues)

    # note: the originally posted code returned an undefined name `r` here
    return (rules_list, values_path)

An example follows:

from sklearn.model_selection import train_test_split

df = pd.read_csv('df.csv')

X = df[df.columns[:-1]] 
y = df['classification'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 

dtc = DecisionTreeClassifier(max_depth=2) 
dtc.fit(X_train, y_train) 

Fitting the decision tree generated the following tree: Decision Tree with width 2

At this point, just call the function:

get_rules(dtc, df)

This is what the function returns:

rules = [ 
    ['first <= 63.5', 'first <= 43.5'], 
    ['first <= 63.5', 'first > 43.5'], 
    ['first > 63.5', 'second <= 19.700000762939453'], 
    ['first > 63.5', 'second > 19.700000762939453'] 
] 

values = [ 
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 284., 57.]])], 
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 352., 184.]])], 
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 645., 620.]])], 
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 287., 708.]])] 
] 

Obviously, in values, there are also the leaf values for each path.