2015-12-14

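Training and test set are not compatible in Weka

I am trying to do text classification in Weka, but I am having a big problem getting the test set to work. This is my training set (it is short because I have only just started learning Weka!):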

@relation sentiment 
@attribute phrase string 
@attribute value {pos, neg} 
@data 
'That was really unlucky', neg 
'The car crashed horribly', neg 
'The culpirit got away',neg 
'Fortunally everyone made it out', pos 
'She was glad noone was hurt',pos 
'And the sun was at least shining',pos 

I then applied StringToWordVector to this set, followed by NumericToBinary. This is the final result for the training set:

@relation 'sentiment-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"-weka.filters.unsupervised.attribute.NumericToBinary' 

@attribute value {pos,neg} 
@attribute And_binarized {0,1} 
@attribute Fortunally_binarized {0,1} 
@attribute She_binarized {0,1} 
@attribute at_binarized {0,1} 
@attribute everyone_binarized {0,1} 
@attribute glad_binarized {0,1} 
@attribute hurt_binarized {0,1} 
@attribute it_binarized {0,1} 
@attribute least_binarized {0,1} 
@attribute made_binarized {0,1} 
@attribute noone_binarized {0,1} 
@attribute out_binarized {0,1} 
@attribute shining_binarized {0,1} 
@attribute sun_binarized {0,1} 
@attribute the_binarized {0,1} 
@attribute was_binarized {0,1} 
@attribute That_binarized {0,1} 
@attribute The_binarized {0,1} 
@attribute away_binarized {0,1} 
@attribute car_binarized {0,1} 
@attribute crashed_binarized {0,1} 
@attribute culpirit_binarized {0,1} 
@attribute got_binarized {0,1} 
@attribute horribly_binarized {0,1} 
@attribute really_binarized {0,1} 
@attribute unlucky numeric 

@data 
{0 neg,16 1,17 1,25 1,26 1} 
{0 neg,18 1,20 1,21 1,24 1} 
{0 neg,18 1,19 1,22 1,23 1} 
{2 1,5 1,8 1,10 1,12 1} 
{3 1,6 1,7 1,11 1,16 1} 
{1 1,4 1,9 1,13 1,14 1,15 1,16 1} 

Next I moved on to the test set, which is the part I cannot get working:

@relation sentiment 
@attribute phrase string 
@data 
'That was really unlucky' 
'The car crashed horribly' 
'The culpirit got away' 

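My hope is that Weka will classify these texts as "negative". To make the two sets compatible, I applied the same filters to the test set as I did to the training set (StringToWordVector and NumericToBinary). This is the final result for the test set: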

@relation 'sentiment-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"-weka.filters.unsupervised.attribute.NumericToBinary' 

@attribute That_binarized {0,1} 
@attribute The_binarized {0,1} 
@attribute away_binarized {0,1} 
@attribute car_binarized {0,1} 
@attribute crashed_binarized {0,1} 
@attribute culpirit_binarized {0,1} 
@attribute got_binarized {0,1} 
@attribute horribly_binarized {0,1} 
@attribute really_binarized {0,1} 
@attribute unlucky_binarized {0,1} 
@attribute was numeric 

@data 
{0 1,8 1,9 1,10 1} 
{1 1,3 1,4 1,7 1} 
{1 1,2 1,5 1,6 1} 

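However, Weka gives me an error saying that the training set and the test set are not compatible, and I really cannot figure out why. Intuitively, this seems like something Weka should be able to understand.

Thanks for your help!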

Answers

Your training set and test set should have the same header. Right now they are different.
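To make the difference concrete: the filtered training header in the question declares 27 attributes and starts with `value`, while the filtered test header declares only 11 and has no `value` attribute at all. The sketch below is a plain-Java illustration of comparing two ARFF headers position by position. `ArffHeaderDiff` and its methods are hypothetical names, not part of Weka, and the headers in `main` are abbreviated from the question. Weka's own check (I believe `Instances.equalHeaders`) may differ in detail, so treat this only as a way to spot the first divergence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: compares the @attribute
// declarations of two ARFF headers position by position.
class ArffHeaderDiff {

    // Collects the "@attribute ..." lines that appear before "@data",
    // collapsing runs of whitespace so spacing differences do not count.
    static List<String> attributes(String arff) {
        List<String> result = new ArrayList<>();
        for (String line : arff.split("\\r?\\n")) {
            String trimmed = line.trim();
            String lower = trimmed.toLowerCase();
            if (lower.startsWith("@data")) {
                break;
            }
            if (lower.startsWith("@attribute")) {
                result.add(trimmed.replaceAll("\\s+", " "));
            }
        }
        return result;
    }

    // Returns the index of the first position where the two headers
    // disagree, or -1 if both declare the same attributes in the same order.
    static int firstMismatch(List<String> a, List<String> b) {
        int shared = Math.min(a.size(), b.size());
        for (int i = 0; i < shared; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        return a.size() == b.size() ? -1 : shared;
    }

    public static void main(String[] args) {
        // Abbreviated versions of the two filtered headers in the question.
        String train = String.join("\n",
            "@relation train",
            "@attribute value {pos,neg}",
            "@attribute And_binarized {0,1}",
            "@attribute Fortunally_binarized {0,1}",
            "@data");
        String test = String.join("\n",
            "@relation test",
            "@attribute That_binarized {0,1}",
            "@attribute The_binarized {0,1}",
            "@data");

        List<String> trainAttrs = attributes(train);
        List<String> testAttrs = attributes(test);
        System.out.println("train attributes: " + trainAttrs.size());
        System.out.println("test attributes:  " + testAttrs.size());
        System.out.println("first mismatch at index "
            + firstMismatch(trainAttrs, testAttrs));
    }
}
```

Running the same comparison on the full headers would flag the very first attribute, since the training header begins with the class attribute and the test header does not.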

See the following link for an example of text classification. Here is another link, which shows other ways to solve this problem.
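The usual idea behind such fixes is that the word dictionary is built once, from the training data, and then reused unchanged for the test data, instead of running the filter separately on each file. The sketch below illustrates that idea in plain Java. It is not Weka code: `SharedVocabulary`, `tokens`, `buildVocabulary` and `encode` are hypothetical names, and the tokenizer is only a rough stand-in for Weka's `WordTokenizer`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration, not Weka code: build the word list once from
// the training phrases, then encode every phrase against that same list.
class SharedVocabulary {

    // Splits a phrase on whitespace and basic punctuation.
    static List<String> tokens(String phrase) {
        List<String> out = new ArrayList<>();
        for (String t : phrase.split("[\\s.,;:'\"()?!]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Sorted, de-duplicated words across all training phrases.
    static List<String> buildVocabulary(List<String> phrases) {
        TreeSet<String> words = new TreeSet<>();
        for (String p : phrases) {
            words.addAll(tokens(p));
        }
        return new ArrayList<>(words);
    }

    // One 0/1 entry per vocabulary word; words that are not in the
    // vocabulary are ignored, so every row has the same length.
    static int[] encode(String phrase, List<String> vocabulary) {
        int[] row = new int[vocabulary.size()];
        for (String t : tokens(phrase)) {
            int i = vocabulary.indexOf(t);
            if (i >= 0) {
                row[i] = 1;
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Training phrases from the question.
        List<String> train = Arrays.asList(
            "That was really unlucky",
            "The car crashed horribly",
            "The culpirit got away",
            "Fortunally everyone made it out",
            "She was glad noone was hurt",
            "And the sun was at least shining");

        List<String> vocabulary = buildVocabulary(train);
        System.out.println("vocabulary size: " + vocabulary.size());

        // A test phrase is encoded with the training vocabulary, so its
        // row lines up column for column with the training rows.
        int[] row = encode("The car crashed horribly", vocabulary);
        System.out.println(Arrays.toString(row));
    }
}
```

Because the test rows are encoded against the training vocabulary, both sets end up with the same columns in the same order, which is what the compatibility check requires. In Weka itself the raw test file would also need the same `value` class attribute as the training file, with unknown labels written as `?`.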