clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)
This all works fine for building the decision tree. But how can I explore the decision tree?
For example, how do I find which entries from X_train end up in a particular leaf?
The following code should produce a plot of the top ten features:
import numpy as np
import matplotlib.pyplot as plt

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(10):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances
# (the original forest example also drew error bars from the std across
# estimators; a single tree has no ensemble to compute a std over, so
# that part is dropped here)
plt.figure()
plt.title("Feature importances")
plt.bar(range(10), importances[indices[:10]], color="r", align="center")
plt.xticks(range(10), indices[:10])
plt.xlim([-1, 10])
plt.show()
Taken from here and slightly modified to fit a DecisionTreeClassifier.

This doesn't exactly help you explore the tree, but it does tell you about it.
Thanks, but I'd like to see which training data falls into each leaf, for example. Right now I have to draw the decision tree, write out the rules, and write a script to filter the data with those rules. That can't be the right way! – eleanora
Is your data small enough to run those computations by hand or in a spreadsheet? I'm assuming this is for a class, in which case it may be better not to just run the algorithm and copy the structure. That said, I imagine there's some way to get the structure of the tree from sci-kit. Here's the source for DecisionTreeClassifier: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py –
This isn't for a class! I have about 1,000,000 items, so I do it by writing a separate python script. However, I don't even know how to automatically extract the rules for each leaf. Is there a way? – eleanora
You need to use the predict method.

After training the tree, you feed in the X values to predict their output.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
tree = clf.fit(iris.data, iris.target)
tree.predict(iris.data)
Output:
>>> tree.predict(iris.data)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
To get details about the tree structure, we can use tree_.__getstate__().

The tree structure, rendered as an "ASCII art" picture:
0
├── 1 (leaf)
└── 2
    ├── 3
    │   ├── 4
    │   │   ├── 5 (leaf)
    │   │   └── 6 (leaf)
    │   └── 7
    │       ├── 8 (leaf)
    │       └── 9
    │           ├── 10 (leaf)
    │           └── 11 (leaf)
    └── 12
        ├── 13
        │   ├── 14 (leaf)
        │   └── 15 (leaf)
        └── 16 (leaf)
And the tree structure as an array:
In [38]: tree.tree_.__getstate__()['nodes']
Out[38]:
array([(1, 2, 3, 0.800000011920929, 0.6666666666666667, 150, 150.0),
       (-1, -1, -2, -2.0, 0.0, 50, 50.0),
       (3, 12, 3, 1.75, 0.5, 100, 100.0),
       (4, 7, 2, 4.949999809265137, 0.16803840877914955, 54, 54.0),
       (5, 6, 3, 1.6500000953674316, 0.04079861111111116, 48, 48.0),
       (-1, -1, -2, -2.0, 0.0, 47, 47.0),
       (-1, -1, -2, -2.0, 0.0, 1, 1.0),
       (8, 9, 3, 1.5499999523162842, 0.4444444444444444, 6, 6.0),
       (-1, -1, -2, -2.0, 0.0, 3, 3.0),
       (10, 11, 2, 5.449999809265137, 0.4444444444444444, 3, 3.0),
       (-1, -1, -2, -2.0, 0.0, 2, 2.0),
       (-1, -1, -2, -2.0, 0.0, 1, 1.0),
       (13, 16, 2, 4.850000381469727, 0.042533081285444196, 46, 46.0),
       (14, 15, 1, 3.0999999046325684, 0.4444444444444444, 3, 3.0),
       (-1, -1, -2, -2.0, 0.0, 2, 2.0),
       (-1, -1, -2, -2.0, 0.0, 1, 1.0),
       (-1, -1, -2, -2.0, 0.0, 43, 43.0)],
      dtype=[('left_child', '<i8'), ('right_child', '<i8'),
             ('feature', '<i8'), ('threshold', '<f8'),
             ('impurity', '<f8'), ('n_node_samples', '<i8'),
             ('weighted_n_node_samples', '<f8')])
where each row is one node with the fields (left_child, right_child, feature, threshold, impurity, n_node_samples, weighted_n_node_samples), and a child index of -1 marks a leaf.

Using this information, we could simply trace each sample X to the leaf where it eventually lands by following the split features and thresholds in a script. Additionally, n_node_samples would let us run unit tests making sure that each node gets the correct number of samples. Then, using the output of tree.predict, we could map each leaf to its associated class.
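For illustration, here is a minimal sketch (my own, not part of the original answer) that follows a single sample down the nodes array shown above to its leaf, assuming the iris tree fitted earlier:

# walk one sample down the structured `nodes` array;
# a left_child of -1 marks a leaf
nodes = tree.tree_.__getstate__()['nodes']

def trace_to_leaf(nodes, x):
    node = 0
    while nodes[node]['left_child'] != -1:
        if x[nodes[node]['feature']] <= nodes[node]['threshold']:
            node = nodes[node]['left_child']
        else:
            node = nodes[node]['right_child']
    return node  # the node id of the leaf where sample x lands

print(trace_to_leaf(nodes, iris.data[0]))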
Thanks. This tells me the class, but not which leaf of the decision tree each item is in. If I could somehow extract the rules needed to reach each leaf, I could re-run those rules over the data. – eleanora
When you say you want to see the leaves, do you mean you want to see the rules the tree used at each node? If that's the case, then maybe this will help: http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree – PabTorre
For a given leaf, I'd like to see the training data that the decision tree would place in that leaf. In other words, each leaf is associated with a sequence of rules (comparisons). I'd like to see the subset of the data you get after applying those rules. – eleanora
Note: this is not an answer, only a hint at possible solutions.

I ran into a similar problem recently in my project. My goal is to extract the corresponding decision chain for some particular samples. I think your problem is a subset of mine, since you just need to record the last step in the decision chain.

Up to now, it seems the only viable solution is to write a custom predict method in Python to keep track of the decisions along the way. The reason is that the predict method provided by scikit-learn cannot do this out of the box (as far as I know). And to make it worse, it is a wrapper of a C implementation, which is pretty hard to customize.

Customization is fine for my problem, since I'm dealing with an imbalanced dataset and the samples I care about (the positive ones) are rare. So I can filter them out first using the sklearn predict and then get the decision chain using my customization.

However, this may not work for you if you have a large dataset. Because if you parse the tree and do the prediction in Python, it will run at Python speed and will not scale (easily). You may have to fall back to customizing the C implementation.
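As a rough sketch of what such a custom predict might look like (my illustration, not the scikit-learn API; clf is a fitted DecisionTreeClassifier and x a single 1-D sample):

def predict_with_chain(clf, x):
    '''Return the leaf id reached by x plus the decision chain taken.'''
    t = clf.tree_
    node, chain = 0, []
    while t.children_left[node] != -1:  # -1 marks a leaf
        feat, thresh = t.feature[node], t.threshold[node]
        went_left = x[feat] <= thresh
        chain.append((node, feat, thresh, went_left))
        node = t.children_left[node] if went_left else t.children_right[node]
    return node, chain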
This code will do exactly what you want. Here, n is the number of observations in X_train. At the end, the (n, number_of_leaves)-sized array leaf_observations holds in each column boolean values for indexing into X_train to get the observations in each leaf. Each column of leaf_observations corresponds to an element of leaves, which holds the node IDs of the leaves.
# get the nodes which are leaves
leaves = clf.tree_.children_left == -1
leaves = np.arange(0, clf.tree_.node_count)[leaves]

# loop through each leaf and figure out the data in it
leaf_observations = np.zeros((n, len(leaves)), dtype=bool)

# build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
thistree = [clf.tree_.feature.tolist()]
thistree.append(clf.tree_.threshold.tolist())
thistree.append(clf.tree_.children_left.tolist())
thistree.append(clf.tree_.children_right.tolist())

# get the decision rules for each leaf node & apply them
for (ind, nod) in enumerate(leaves):
    # get the decision rules in numeric list form
    rules = []
    RevTraverseTree(thistree, nod, rules)
    # convert & apply to the data by sequentially &ing the rules
    thisnode = np.ones(n, dtype=bool)
    for rule in rules:
        if rule[1] == 1:
            thisnode = np.logical_and(thisnode, X_train[:, rule[0]] > rule[2])
        else:
            thisnode = np.logical_and(thisnode, X_train[:, rule[0]] <= rule[2])
    # get the observations that obey all the rules - they are the ones in this leaf node
    leaf_observations[:, ind] = thisnode
This needs the helper function defined here, which recursively traverses the tree, starting from the specified node, to build up the decision rules.
def RevTraverseTree(tree, node, rules):
    '''
    Traverse an skl decision tree from a node (presumably a leaf node)
    up to the top, building the decision rules. The rules should be
    input as an empty list, which will be modified in place. The result
    is a nested list of tuples: (feature, direction (left=-1), threshold).
    The "tree" is a nested list of simplified tree attributes:
    [split feature, split threshold, left node, right node]
    '''
    # now find the node as either a left or right child of something
    # first try to find it as a left node
    try:
        prevnode = tree[2].index(node)
        leftright = -1
    except ValueError:
        # failed, so find it as a right node - if this also causes an exception, something's really f'd up
        prevnode = tree[3].index(node)
        leftright = 1
    # now let's get the rule that caused prevnode to -> node
    rules.append((tree[0][prevnode], leftright, tree[1][prevnode]))
    # if we've not yet reached the top, go up the tree one more step
    if prevnode != 0:
        RevTraverseTree(tree, prevnode, rules)
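As a quick sanity check (my suggestion, not part of the original answer), each column of leaf_observations should agree with what the tree's own apply method reports on the training data:

# clf.apply maps each sample to the id of the leaf it lands in
leaf_ids = clf.apply(X_train)
for (ind, nod) in enumerate(leaves):
    assert np.array_equal(np.where(leaf_observations[:, ind])[0],
                          np.where(leaf_ids == nod)[0])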
I think an easy option would be to use the apply method of the trained decision tree. Train the tree, apply the training data, and build a lookup table from the returned leaf indices:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
clf = DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# apply training data to decision tree
leaf_indices = clf.apply(iris.data)
lookup = {}

# build lookup table
for i, leaf_index in enumerate(leaf_indices):
    try:
        lookup[leaf_index].append(iris.data[i])
    except KeyError:
        lookup[leaf_index] = []
        lookup[leaf_index].append(iris.data[i])

# test
unknown_sample = [[4., 3.1, 6.1, 1.2]]
index = clf.apply(unknown_sample)
print(lookup[index[0]])
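The same lookup table can be built a bit more idiomatically with collections.defaultdict (a minor variation of mine, not part of the original answer):

from collections import defaultdict

# group training samples by the leaf they land in
lookup = defaultdict(list)
for i, leaf_index in enumerate(clf.apply(iris.data)):
    lookup[leaf_index].append(iris.data[i])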
Have you tried dumping the DecisionTree into a graphviz .dot file [1] and then loading it with graph_tool [2]?
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.datasets import load_iris
from graph_tool.all import *

iris = load_iris()
clf = DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
export_graphviz(clf, out_file='tree.dot')

# load graph with graph_tool and explore structure as you please
g = load_graph('tree.dot')
for v in g.vertices():
    for e in v.out_edges():
        print(e)
    for w in v.out_neighbours():
        print(w)
[1] http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
Can you make it as beautiful as this: http://scikit-learn.org/stable/_images/iris.svg ? – eleanora

Once you've exported it with export_graphviz, you can achieve something similar with dot -Tpng tree.dot -o tree.png. –
I've changed a bit what Dr. Drew posted.

The following code, given a data frame and a fitted decision tree, returns:

rules_list: a list of the decision rules leading to each leaf
values_path: a list of the node values (class counts) along each path
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def get_rules(dtc, df):
    rules_list = []
    values_path = []
    values = dtc.tree_.value

    def RevTraverseTree(tree, node, rules, pathValues):
        '''
        Traverse an skl decision tree from a node (presumably a leaf node)
        up to the top, building the decision rules. The rules should be
        input as an empty list, which will be modified in place. The result
        is a nested list of tuples: (feature, direction (left=-1), threshold).
        The "tree" is a nested list of simplified tree attributes:
        [split feature, split threshold, left node, right node]
        '''
        # now find the node as either a left or right child of something
        # first try to find it as a left node
        try:
            prevnode = tree[2].index(node)
            leftright = '<='
            pathValues.append(values[prevnode])
        except ValueError:
            # failed, so find it as a right node - if this also causes an exception, something's really f'd up
            prevnode = tree[3].index(node)
            leftright = '>'
            pathValues.append(values[prevnode])
        # now let's get the rule that caused prevnode to -> node
        p1 = df.columns[tree[0][prevnode]]
        p2 = tree[1][prevnode]
        rules.append(str(p1) + ' ' + leftright + ' ' + str(p2))
        # if we've not yet reached the top, go up the tree one more step
        if prevnode != 0:
            RevTraverseTree(tree, prevnode, rules, pathValues)

    # get the nodes which are leaves
    leaves = dtc.tree_.children_left == -1
    leaves = np.arange(0, dtc.tree_.node_count)[leaves]

    # build a simpler tree as a nested list: [split feature, split threshold, left node, right node]
    thistree = [dtc.tree_.feature.tolist()]
    thistree.append(dtc.tree_.threshold.tolist())
    thistree.append(dtc.tree_.children_left.tolist())
    thistree.append(dtc.tree_.children_right.tolist())

    # get the decision rules for each leaf node & apply them
    for (ind, nod) in enumerate(leaves):
        # get the decision rules
        rules = []
        pathValues = []
        RevTraverseTree(thistree, nod, rules, pathValues)
        pathValues.insert(0, values[nod])
        pathValues = list(reversed(pathValues))
        rules = list(reversed(rules))
        rules_list.append(rules)
        values_path.append(pathValues)

    return (rules_list, values_path)
Here is an example:

from sklearn.model_selection import train_test_split

df = pd.read_csv('df.csv')
X = df[df.columns[:-1]]
y = df['classification']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

dtc = DecisionTreeClassifier(max_depth=2)
dtc.fit(X_train, y_train)
Fitting the decision tree generated the following tree: Decision Tree with width 2

At that point, just call the function:
get_rules(dtc, df)
And this is what the function returns:
rules = [
    ['first <= 63.5', 'first <= 43.5'],
    ['first <= 63.5', 'first > 43.5'],
    ['first > 63.5', 'second <= 19.700000762939453'],
    ['first > 63.5', 'second > 19.700000762939453']
]

values = [
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 284., 57.]])],
    [array([[ 1568., 1569.]]), array([[ 636., 241.]]), array([[ 352., 184.]])],
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 645., 620.]])],
    [array([[ 1568., 1569.]]), array([[ 932., 1328.]]), array([[ 287., 708.]])]
]
Obviously, in values, for each path there are the leaf values as well.
Bumped into a similar problem. You may find my answer here: http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/42227468#42227468 (and the walkthrough mentioned there) helpful. It uses the decision_path method from release 0.18. Substitute X_train for X_test in a couple of spots if interested in seeing the training samples. – Kevin
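For reference, here is a minimal sketch of decision_path (scikit-learn >= 0.18):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# decision_path returns a sparse indicator matrix: entry (i, j) is
# nonzero if sample i passes through node j
node_indicator = clf.decision_path(iris.data)

# the nodes visited by sample 0, from the root down to its leaf
sample_id = 0
node_index = node_indicator.indices[
    node_indicator.indptr[sample_id]:node_indicator.indptr[sample_id + 1]]
print(node_index)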