SVM works on the training set but not on the test set in R (using e1071)

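I am using a support vector machine for a document classification task. It classifies all the articles in my training set, but it fails to classify the articles in my test set. trainDTM is the document-term matrix of my training set; testDTM is the one for the test set. Here is my (not very pretty) code: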

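# packages used below: xlsx (read.xlsx), tm (corpus, DTM), RWeka (NGramTokenizer), e1071 (svm)
library(xlsx)
library(tm)
library(RWeka)
library(e1071)

# create data.frame with labelled sentences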

labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header=T)) 

# create training set and test set 
traindata <- as.data.frame(labeled[1:700,c("ARTICLE","CLASS")]) 
testdata <- as.data.frame(labeled[701:1000, c("ARTICLE","CLASS")]) 

# Vector, Source Transformation 
trainvector <- as.vector(traindata$"ARTICLE") 
testvector <- as.vector(testdata$"ARTICLE") 
trainsource <- VectorSource(trainvector) 
testsource <- VectorSource(testvector) 

# CREATE CORPUS FOR DATA 
traincorpus <- Corpus(trainsource) 
testcorpus <- Corpus(testsource) 

# my own stopwords 
sw <- c("i", "me", "my") 

## CLEAN TEXT 

# FUNCTION FOR CLEANING 
cleanCorpus <- function(corpus){ 
    corpus.tmp <- tm_map(corpus, removePunctuation) 
    corpus.tmp <- tm_map(corpus.tmp,stripWhitespace) 
    corpus.tmp <- tm_map(corpus.tmp,tolower) 
    corpus.tmp <- tm_map(corpus.tmp, removeWords, sw) 
    corpus.tmp <- tm_map(corpus.tmp, removeNumbers) 
    corpus.tmp <- tm_map(corpus.tmp, stemDocument, language="en") 
    return(corpus.tmp)} 

# CLEAN CORP WITH ABOVE FUNCTION 
traincorpus.cln <- cleanCorpus(traincorpus) 
testcorpus.cln <- cleanCorpus(testcorpus) 

## CREATE N-GRAM DOCUMENT TERM MATRIX 
# CREATE N-GRAM TOKENIZER 

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1)) 

# CREATE DTM 
trainmatrix.cln.bi <- DocumentTermMatrix(traincorpus.cln, control = list(tokenize = BigramTokenizer)) 
testmatrix.cln.bi <- DocumentTermMatrix(testcorpus.cln, control = list(tokenize = BigramTokenizer)) 

# REMOVE SPARSE TERMS 
trainDTM <- removeSparseTerms(trainmatrix.cln.bi, 0.98) 
testDTM <- removeSparseTerms(testmatrix.cln.bi, 0.98) 

# train the model 
SVM <- svm(as.matrix(trainDTM), as.factor(traindata$CLASS)) 

# get classifications for training-set 
results.train <- predict(SVM, as.matrix(trainDTM)) # works fine! 

# get classifications for test-set 
results <- predict(SVM,as.matrix(testDTM)) 

Error in scale.default(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center", : 
    length of 'center' must equal the number of columns of 'x'

I don't understand this error. What is 'center'?

Thanks!

Source

2014-03-03 cptn

Why do you think this is an overfitting problem? Even if the model is overfitted, I should still be able to classify new data. – cptn

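Training and test data must live in the same feature space; building two separate DTMs this way cannot work, because each matrix ends up with its own set of term columns.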
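
To make the error message less mysterious, here is a minimal base-R sketch of what seems to be going on. Judging from the traceback in the question, predict() centers the new data with the column means stored at training time (object$x.scale$"scaled:center"), so the new matrix needs exactly as many columns as the training matrix. The matrices below are made-up toy data, not the asker's DTMs:

```r
# Toy "training" matrix with 3 features; a scaled model would store 3 column means.
train_x <- matrix(c(1, 2, 3,
                    4, 5, 6), nrow = 2, byrow = TRUE)
centers <- colMeans(train_x)

# Toy "test" matrix with only 2 features, i.e. a different feature space.
test_x <- matrix(c(1, 2,
                   3, 4), nrow = 2, byrow = TRUE)

# Centering 2 columns with 3 stored means fails; capture the message instead of stopping.
msg <- tryCatch(
  {
    scale(test_x, center = centers, scale = FALSE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

So 'center' here is just the vector of per-column means from the training data, and the complaint is about the column count, not about the classifier itself.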

A solution using RTextTools:

DocTermMatrix <- create_matrix(labeled, language="english", removeNumbers=TRUE, stemWords=TRUE, ...) 
container <- create_container(DocTermMatrix, labels, trainSize=1:700, testSize=701:1000, virgin=FALSE) 
models <- train_models(container, "SVM") 
results <- classify_models(container, models)

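Alternatively, to answer your question (with e1071), you can specify the vocabulary (the 'features') when building the projection (DocumentTermMatrix), so that the test matrix uses the training terms: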

DocTermMatrixTrain <- DocumentTermMatrix(Corpus(VectorSource(trainDoc))); 
Features <- DocTermMatrixTrain$dimnames$Terms; 

DocTermMatrixTest <- DocumentTermMatrix(Corpus(VectorSource(testDoc)),control=list(dictionary=Features));
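
For completeness, a rough end-to-end sketch of the dictionary approach with e1071, assuming the tm and e1071 packages are installed. The documents, labels, and variable names (train_docs, test_docs, vocab, and so on) are hypothetical toy values, and the preprocessing from the question (stopwords, stemming, removeSparseTerms) is left out to keep it short:

```r
library(tm)     # VCorpus, VectorSource, DocumentTermMatrix, Terms
library(e1071)  # svm and its predict method

# Hypothetical toy data: four training documents and two test documents.
train_docs   <- c("cats purr softly", "cats chase mice",
                  "dogs bark loudly", "dogs chase cars")
train_labels <- factor(c("cat", "cat", "dog", "dog"))
test_docs    <- c("cats chase birds", "dogs bark outside")

# Build the training DTM and take its terms as the shared vocabulary.
dtm_train <- DocumentTermMatrix(VCorpus(VectorSource(train_docs)))
vocab     <- Terms(dtm_train)

# Build the test DTM against that vocabulary only: words unseen in training
# are dropped, and training terms missing from the test set become zero columns.
dtm_test <- DocumentTermMatrix(VCorpus(VectorSource(test_docs)),
                               control = list(dictionary = vocab))

train_mat <- as.matrix(dtm_train)
# Reorder defensively so the columns line up exactly with the training matrix.
test_mat  <- as.matrix(dtm_test)[, vocab, drop = FALSE]

# Train on the training matrix, then predict on the aligned test matrix.
model <- svm(train_mat, train_labels)
pred  <- predict(model, test_mat)
print(pred)
```

If you prune the training matrix with removeSparseTerms, take the vocabulary from the pruned matrix and do not prune the test matrix separately; otherwise the two column sets drift apart again.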

Source

2014-03-03 14:55:03 brobertie

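I added the full code so you can see how I build the TDM. – cptn

@brobertie: Thanks, your approach works! I don't much like using RTextTools because I don't have much control (or at least I think I don't) over preprocessing steps such as stopword removal, negation handling, n-grams, etc. By the way, if I use a naive Bayes classifier, splitting the articles into two separate TDMs seems to work, but I cannot get it to work with the SVM that way. – cptn

@brobertie: Do you know how to build the SVM using e1071? – cptn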