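Different cluster labels
I am trying to cluster new data that was not seen during training and appears only in the test data. The training file has 5 classes, while the test data has 7 classes (5 + 2), 2 of which are new. I now want to run k-means to find a suitable cluster for each newly added class, or to create new clusters for them if they are not close to any existing cluster.
Here is part of my code: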
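import numpy as np
import pandas as pd

print("Reading training data...")
#mydata = pd.read_csv('.\KDDTrain.csv', header=0)
mydata = pd.read_csv(r'.\PTraining.csv', header=0)
# select all but the last column as data
# (.iloc replaces .ix, which was removed in pandas 1.0; the 1: slice skips the first data row, as before)
X_train = mydata.iloc[1:, :-1]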
X_train = np.array(X_train)
n_samples, n_features = np.shape(X_train)
# print np.shape(X_train)
# select last column as target/class
y_train = mydata.iloc[1:, -1]
y_train = np.array(y_train)
# encode target labels with numeric values from 0 to no of classes
# print "Encoding class labels..."
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(y_train)
# print list(label_encoder.classes_)
# print 'total no of classes in dataset=' + str(len(label_encoder.classes_))
y_train = label_encoder.transform(y_train)
# n_samples, n_features = data.shape
n_digits = len(np.unique(y_train))
print("Training data statistics")
print("n_attack_categories: %d, \t n_samples %d, \t n_features %d"
      % (n_digits, n_samples, n_features))
sample_size = 300
# Read test data
mytestdata = pd.read_csv(r'.\KDDTest+.csv', header=0)
print("Reading test data...")
# select all but the last column as data
X_test = mytestdata.iloc[1:, :-1]
X_test = np.array(X_test)
# print np.shape(X_test)
# select last column as target/class
y_test = mytestdata.iloc[1:, -1]
# print "actual labels"
# print y_test
y_test = label_encoder.transform(y_test)
# print "Encoded labels"
# print y_test
y_test = np.array(y_test)
n_samples_test, n_features_test = np.shape(X_test)
n_digits_test = len(np.unique(y_test))
print("Test data statistics")
print("n_attack_categories: %d, \t n_samples %d, \t n_features %d"
      % (n_digits_test, n_samples_test, n_features_test))
print(79 * '_')
File "C:/Users/aalsham4/PycharmProjects/clusteringtask/clustering.py", line 87, in <module>
y_test = label_encoder.transform(y_test)
File "C:\Users\aalsham4\AppData\Local\Continuum\Miniconda3\lib\site-packages\sklearn\preprocessing\label.py", line 153, in transform
raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: ['calss6' 'class7' ]
Now I do not know whether this is the correct way to cluster the labelled classes or not.
Any suggestions?
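For reference, the ValueError in the traceback comes from `LabelEncoder.transform`, which rejects any label that was not present when `fit` was called. Below is a minimal sketch of two possible workarounds. The label names and arrays are hypothetical stand-ins for the real files, and neither option is necessarily what the asker should do; they only show how the error can be avoided.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels: 5 classes in training, 2 extra classes in test.
y_train = np.array(["class1", "class2", "class3", "class4", "class5"])
y_test = np.array(["class1", "class6", "class7", "class3"])

encoder = LabelEncoder().fit(y_train)

# transform() raises ValueError for labels that were not seen during fit().
try:
    encoder.transform(y_test)
    raised = False
except ValueError:
    raised = True

# Option A: encode only the known labels and mark unseen ones with -1.
known = np.isin(y_test, encoder.classes_)
y_test_encoded = np.full(len(y_test), -1, dtype=int)
y_test_encoded[known] = encoder.transform(y_test[known])

# Option B: fit a separate encoder on the union of both label sets,
# so every test label receives its own integer code.
full_encoder = LabelEncoder().fit(np.concatenate([y_train, y_test]))
y_test_full = full_encoder.transform(y_test)
```

Option A keeps the training encoding unchanged and makes the new classes easy to filter out; Option B is only appropriate if the test labels are allowed to be known at encoding time, for example when they are used purely for evaluating the clusters.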
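Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation. [Minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) applies here. We cannot help you effectively until you post your MCVE code and describe the problem accurately. In particular, we cannot reproduce the problem without the data files. – Prune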
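I have one file containing the training data and another containing the test data; the test file has 7 classes, while the training file has only 5. I want to apply k-means clustering to find out whether these two extra classes are similar to any of the 5 classes my model was trained on. Is this applicable? – Adel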
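One way to read the idea in this comment: fit k-means on the training data, then measure how far each test point lies from its nearest centroid and flag points that are unusually far away as candidates for a new cluster. The sketch below uses synthetic 2-D data instead of the real files, and the 99th-percentile threshold is an assumed cut-off chosen for illustration, not a recommended value.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 5 well-separated training classes in 2-D.
train_centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]], dtype=float)
X_train = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in train_centers])

# Test data: 20 points from a known class, then 20 points from a class
# that lies far away from every training class.
X_test_known = train_centers[0] + 0.3 * rng.standard_normal((20, 2))
X_test_new = np.array([30.0, 30.0]) + 0.3 * rng.standard_normal((20, 2))
X_test = np.vstack([X_test_known, X_test_new])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)

# transform() returns the distance to every centroid; keep the nearest one.
train_dist = kmeans.transform(X_train).min(axis=1)
threshold = np.percentile(train_dist, 99)  # assumed cut-off, tune for real data

test_dist = kmeans.transform(X_test).min(axis=1)
nearest_cluster = kmeans.predict(X_test)  # closest existing cluster per point
is_novel = test_dist > threshold          # True = far from all known clusters
```

Points with `is_novel` set to False can be assigned to `nearest_cluster`; the flagged points could then be clustered separately to form the new clusters. Whether distances are meaningful at all depends on feature scaling, which this sketch does not address.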
If you have classes, use a classifier, not k-means. K-means is the wrong tool for your problem. And do not use the KDDCup99 data; it is flawed. –