
I created a Gaussian Naive Bayes classifier on an email (spam/non-spam) dataset and was able to run it successfully. I vectorized the data, divided it into training and test sets, and then computed the accuracy, all using the features available in the sklearn GaussianNB classifier. How do I use a trained NB classifier from sklearn to predict the label of an email?

Now I want to be able to use this classifier to predict the "labels" of new emails, i.e. whether or not they are spam. For example, say I have an email; I want to feed it to my classifier and get a prediction of whether it is spam or not. How can I achieve this? Please help.

Code for the classifier file:

#!/usr/bin/python 
 

 
import sys 
 
from time import time 
 
import logging 
 

 
# Display progress logs on stdout 
 
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s') 
 

 
sys.path.append("../DatasetProcessing/") 
 
from vectorize_split_dataset import preprocess 
 

 
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
 
features_train, features_test, labels_train, labels_test = preprocess() 
 

 
######################################################### 
 
from sklearn.naive_bayes import GaussianNB 
 
clf = GaussianNB() 
 
t0 = time() 
 
clf.fit(features_train, labels_train) 
 
pred = clf.predict(features_test) 
 
print("training time:", round(time() - t0, 3), "s") 
 
print(clf.score(features_test, labels_test)) 
 

 
## Printing Metrics for Training and Testing
 
print("No. of Testing Features:" + str(len(features_test))) 
 
print("No. of Testing Features Label:" + str(len(labels_test))) 
 
print("No. of Training Features:" + str(len(features_train))) 
 
print("No. of Training Features Label:" + str(len(labels_train))) 
 
print("No. of Predicted Features:" + str(len(pred))) 
 

 
## Calculating Classifier Performance 
 
from sklearn.metrics import classification_report 
 
y_true = labels_test 
 
y_pred = pred 
 
labels = ['0', '1'] 
 
target_names = ['class 0', 'class 1'] 
 
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels)) 
 

 
# How to predict label of a new text 
 
new_text = "You won a lottery at UK lottery commission. Reply to claim it"

Code for vectorization:

#!/usr/bin/python 
 

 
import os 
 
import pickle 
 
import numpy 
 
numpy.random.seed(42) 
 

 
path = os.path.dirname(os.path.abspath(__file__)) 
 

 
### The words (features) and label_data (labels), already largely processed.
### These files should have been created beforehand.
 
feature_data_file = path + "/createdDataset/dataSet.pkl"

label_data_file = path + "/createdDataset/dataLabel.pkl"
 

 
feature_data = pickle.load(open(feature_data_file, "rb")) 
 
label_data = pickle.load(open(label_data_file, "rb")) 
 

 
### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
 
from sklearn import cross_validation 
 
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42) 
 

 
from sklearn.feature_extraction.text import TfidfVectorizer 
 
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english') 
 
features_train = vectorizer.fit_transform(features_train) 
 
features_test = vectorizer.transform(features_test)#.toarray() 
 

 
## feature selection to reduce dimensionality 
 
from sklearn.feature_selection import SelectPercentile, f_classif 
 
selector = SelectPercentile(f_classif, percentile = 5) 
 
selector.fit(features_train, labels_train) 
 
features_train_transformed_reduced = selector.transform(features_train).toarray() 
 
features_test_transformed_reduced = selector.transform(features_test).toarray() 
 

 
features_train = features_train_transformed_reduced 
 
features_test = features_test_transformed_reduced 
 

 
def preprocess(): 
 
    return features_train, features_test, labels_train, labels_test

Code for dataset generation:

#!/usr/bin/python 
 

 
import os 
 
import pickle 
 
import re 
 
import sys 
 

 
# sys.path.append("../tools/") 
 

 

 
"" 
 
" 
 
    Starter code to process the texts of accuate and inaccurate category to extract 
 
    the features and get the documents ready for classification. 
 

 
    The list of all the texts from accurate category are in the accurate_files list 
 
    likewise for texts of inaccurate category are in (inaccurate_files) 
 

 
    The data is stored in lists and packed away in pickle files at the end. 
 
" 
 
"" 
 

 

 
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r") 
 
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r") 
 

 
label_data = [] 
 
feature_data = [] 
 

 
### temp_counter is a way to speed up development -- there are
### thousands of lines of accurate and inaccurate text, so running over all of
### them can take a long time
### temp_counter helps you only look at the first 200 lines in the list so you
### can iterate on your modifications quicker
 
temp_counter = 0 
 

 

 
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
    for path in from_text:
        ### only look at the first 200 texts when developing;
        ### once everything is working, remove this check to run over the full dataset
        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])
            print(path)
            text = open(path, "r")
            line = text.readline()
            while line:
                ### use a function parseOutText to extract the text from the opened file
                # stem_text = parseOutText(text)
                stem_text = text.readline().strip()
                print(stem_text)
                ### use str.replace() to remove any instances of unwanted words
                # stem_text = stem_text.replace("germani", "")
                ### append the text to feature_data
                feature_data.append(stem_text)
                ### append a "0" to label_data if the text is accurate, and a "1" if it is inaccurate
                if name == "accurate":
                    label_data.append("0")
                elif name == "inaccurate":
                    label_data.append("1")
                line = text.readline()
            text.close()
 

 
print("texts processed") 
 
accurate_files.close() 
 
inaccurate_files.close() 
 

 
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb")) 
 
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))

Additionally, I would like to know whether I can train the classifier incrementally, meaning retraining the created model with newer data so that the model improves over time.

I would be very glad if someone could help me out with this. I am really stuck at this point.


As you have already done the train_test split on the labeled set and computed the accuracy: for new test data, you have to load the new test dataset into the features_test variable. For prediction you can do two things, either fit_transform the NB every time you have new test data, or save the NB model (using sklearn.externals.joblib.dump/load) and, for each new test set, load your model and use predict. You can train the classifier incrementally, but the old classifier will have to be replaced. – pmaniyan
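
For reference, the persistence step mentioned in this comment looks roughly like the following sketch (sklearn.externals.joblib was the import path in sklearn at the time; recent versions use the standalone joblib package instead; the file name is just an example):

from sklearn.externals import joblib  # in recent sklearn: import joblib

# after training, persist the fitted classifier once:
joblib.dump(clf, "nb_model.pkl")

# later, for each new test set, reload the model and predict:
clf = joblib.load("nb_model.pkl")
pred = clf.predict(features_test)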

Answer


You are already using your model to predict the labels of the emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see those labels, do print(pred).

But perhaps you are asking how to predict the labels of emails you discover in the future, which are not currently in your test set? If so, you can think of each new email as a new test set. As with your previous test set, you will need to run several key processing steps on the data:

1) The first thing you need to do is generate features for your new email data. The feature generation step is not included in your code above, but it needs to happen.

2) You are using a Tfidf vectorizer, which converts a collection of documents into a matrix of Tfidf features based on term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that was fitted on your training data.

3) Your new email test feature data then needs to go through dimensionality reduction using the same selector that was fitted on your training data.

4) Finally, run predict on your new test data. Use print(pred) if you want to see the new labels. A minimal sketch of these steps is shown below.
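
To make steps 2) to 4) concrete, here is a minimal sketch. It assumes vectorizer, selector, and clf are the objects already fitted on your training data in the code you posted (in your layout they live in vectorize_split_dataset.py, so preprocess() would need to return them or otherwise expose them):

# Minimal sketch of steps 2)-4); `vectorizer`, `selector`, and `clf` are
# assumed to be the objects already fitted on the training data above.
new_text = "You won a lottery at UK lottery commission. Reply to claim it"

# 2) vectorize with the TfidfVectorizer fitted on the training data
new_features = vectorizer.transform([new_text])

# 3) reduce dimensionality with the SelectPercentile selector fitted on the training data
new_features = selector.transform(new_features).toarray()

# 4) predict; in this dataset the labels are '0' (accurate) and '1' (inaccurate)
pred = clf.predict(new_features)
print(pred)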

In response to your final question about iteratively retraining your model: yes, you can definitely do that. It is just a matter of picking a frequency, writing a script that expands your dataset with the incoming data, and then re-running all the steps from there, from preprocessing to Tfidf vectorization, to dimensionality reduction, to fitting and prediction.
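
One caveat worth adding (this goes beyond the code above): GaussianNB itself also supports sklearn's partial_fit method, so the classifier can be updated batch by batch instead of being refit from scratch. TfidfVectorizer and SelectPercentile have no such incremental mode, so this only works if you keep the feature space frozen to what was originally fitted. A rough sketch, where the new-batch names are placeholders:

import numpy as np
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
# on the first call, partial_fit must be told the full set of classes
clf.partial_fit(features_train, labels_train, classes=np.array(['0', '1']))

# later, for each new labeled batch (pushed through the *same* fitted
# vectorizer and selector as the original training data):
# clf.partial_fit(new_batch_features, new_batch_labels)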


Thank you for the solution, Jason. Yes, that is exactly what I wanted to ask: how do I generate features for the new email data? Could you elaborate on that a bit here? Thanks in advance. – harshlal028


Hi @user2168281, all of the feature engineering steps happen outside the code posted above, so there is no way to tell from here. The features are simply loaded with feature_data = pickle.load(open(feature_data_file, "rb")) from feature_data_file = path + "/createdDataset/dataSet.pkl". If you did not do the feature engineering yourself, you will at minimum need to trace through the source code to see what those features are and how they were built, so that you can prepare your new data the same way. Sorry I can't be of more help. If you find the source code for the feature generation, let us know. – user6275647


I should add that the feature data may really just be the email text itself, in which case it is the Tfidf vectorizer that turns this raw email text into features. If that is the case, then the feature generation for your new email data happens in the Tfidf step described above. But I can't say for sure, because we can't see what feature_data looks like at the point where feature_data = pickle.load(open(feature_data_file, "rb")) is executed. – user6275647
