2015-10-24

Scikit-learn algorithms performing extremely poorly

I'm new to scikit-learn and I'm hitting a wall. I've used both real-world and test data, and the scikit algorithms never predict anything above chance level. I've tried kNN, decision trees, SVC, and naive Bayes.

Basically, I made a test dataset with a column of 0s and 1s, where every 0 has a feature value between 0 and .5 and every 1 has a feature value between .5 and 1. This is extremely easy and should give close to 100% accuracy. However, none of the algorithms perform above chance level: accuracies range from 45% to 55%. I've tried tuning a whole bunch of parameters for every algorithm, but nothing helps. I think there is something fundamentally wrong with my implementation.
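A dataset like the one described can be sketched in NumPy (a minimal reconstruction; the exact layout of Test.xlsx is an assumption):

```python
import numpy as np

rng = np.random.RandomState(42)

# 399 rows to match the reshape(399) in the code below:
# label 0 -> feature drawn from [0, 0.5), label 1 -> feature from [0.5, 1).
n = 399
y = (rng.rand(n) > 0.5).astype(float)
x = np.where(y == 0, rng.uniform(0.0, 0.5, n), rng.uniform(0.5, 1.0, n))

# Sanity check: a 0.5 threshold recovers every label exactly, so any
# reasonable classifier should score near 100% on a proper train/test split.
print(np.mean((x >= 0.5) == y))  # 1.0
```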

Please give me a hand. Here is my code:

from sklearn.cross_validation import train_test_split 
from sklearn import preprocessing 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.metrics import accuracy_score 
import sklearn 
import pandas 
import numpy as np 


df=pandas.read_excel('Test.xlsx') 



# Make data into np arrays 
y = np.array(df[1]) 
y=y.astype(float) 
y=y.reshape(399) 

x = np.array(df[2]) 
x=x.astype(float) 
x=x.reshape(399, 1) 



# Creating training and test data 

labels_train, labels_test = train_test_split(y) 
features_train, features_test = train_test_split(x) 

##################################################################### 
# PERCEPTRON 
##################################################################### 

from sklearn import linear_model 

perceptron=linear_model.Perceptron() 

perceptron.fit(features_train, labels_train) 

perc_pred=perceptron.predict(features_test) 

print sklearn.metrics.accuracy_score(labels_test, perc_pred, normalize=True, sample_weight=None) 
print 'perceptron' 

##################################################################### 
# KNN classifier 
##################################################################### 
from sklearn.neighbors import KNeighborsClassifier 
knn = KNeighborsClassifier() 
knn.fit(features_train, labels_train) 


knn_pred = knn.predict(features_test) 


# Accuracy 

print sklearn.metrics.accuracy_score(labels_test, knn_pred, normalize=True, sample_weight=None) 
print 'knn' 


##################################################################### 
## SVC 
##################################################################### 

from sklearn.svm import SVC 

svm2 = SVC(kernel="linear") 
svm2.fit(features_train, labels_train) 



svc_pred = svm2.predict(features_test) 

print sklearn.metrics.accuracy_score(labels_test, svc_pred, normalize=True, sample_weight=None) 
print 'svc' 

##################################################################### 
# Decision tree 
##################################################################### 
from sklearn import tree 
clf = tree.DecisionTreeClassifier() 
clf = clf.fit(features_train, labels_train) 

tree_pred=clf.predict(features_test) 

# Accuracy 

print sklearn.metrics.accuracy_score(labels_test, tree_pred, normalize=True, sample_weight=None) 
print 'tree' 

##################################################################### 
# Naive bayes 
##################################################################### 


from sklearn.naive_bayes import GaussianNB 
from time import time 

clf = GaussianNB() 
t0 = time() 
clf.fit(features_train, labels_train) 

print "training time:", round(time()-t0, 3), "s" 

bayes_pred = clf.predict(features_test) 



print sklearn.metrics.accuracy_score(labels_test, bayes_pred, normalize=True, sample_weight=None) 
print 'bayes' 

Answer

You seem to be using train_test_split incorrectly:

labels_train, labels_test = train_test_split(y) 
features_train, features_test = train_test_split(x) 

Splitting your labels and your features in two separate calls is not what you want: each call shuffles independently, so the labels no longer correspond to their features. A simple way to split your data yourself is this:

randomvec = np.random.rand(len(x)) 
randomvec = randomvec > 0.5 
train_data = x[randomvec] 
train_label = y[randomvec] 
test_data = x[np.logical_not(randomvec)] 
test_label = y[np.logical_not(randomvec)] 

Or use the scikit-learn method properly:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=42) 
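To see why two separate splits collapse accuracy to chance, here is a small NumPy sketch (the synthetic data and the 0.5-threshold "classifier" are illustrative assumptions, not the asker's exact setup):

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data matching the question: label 0 -> feature in [0, 0.5),
# label 1 -> feature in [0.5, 1). A threshold at 0.5 separates them perfectly.
y = np.repeat([0.0, 1.0], 200)
x = np.where(y == 0, rng.uniform(0.0, 0.5, 400), rng.uniform(0.5, 1.0, 400))

# Correct: one random mask applied to BOTH arrays keeps each (x, y) pair together.
mask = rng.rand(400) > 0.5
acc_paired = np.mean((x[~mask] >= 0.5) == y[~mask])

# Buggy: shuffling x independently of y (which is what two separate
# train_test_split calls do) destroys the pairing.
perm = rng.permutation(400)
acc_broken = np.mean((x[perm][~mask] >= 0.5) == y[~mask])

print(acc_paired)   # 1.0 -- perfectly separable
print(acc_broken)   # ~0.5 -- chance level, like the 45-55% in the question
```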
Thank you so much! I used the last method and it works great. You helped me a lot. – Sander

You're welcome. – Cedias

@Sander you should accept this answer. –