2014-04-01

This is the Scala code I use to train on my dataset. What is the problem? Whenever I run LDA with the Stanford TMT, I get this error: "java.lang.UnsupportedOperationException: empty.max"

val tokenizer = { 
    SimpleEnglishTokenizer() ~>   // tokenize on space and punctuation 
    CaseFolder() ~>      // lowercase everything 
    WordsAndNumbersOnlyFilter() ~>   // ignore non-words and non-numbers 
    //MinimumLengthFilter(1) ~>    // take terms with >=3 characters 
    PorterStemmer() //~> 
    //StopWordFilter("en") 
} 

val text = { 
    source ~>        // read from the source file 
    Columns(4,6) ~> 
    Join(" ") ~>       // select column containing text 
    TokenizeWith(tokenizer) ~>    // tokenize with tokenizer above 
    TermCounter() //~>      // collect counts (needed below) 
    TermMinimumDocumentCountFilter(0) ~> // filter terms in <4 docs 
    TermDynamicStopListFilter(0) ~> // filter out 30 most common terms 
    TermMinimumDocumentCountFilter(0) // take only docs with >=5 terms 
} 

// define fields from the dataset we are going to slice against 
val labels = { 
    source ~>        // read from the source file 
    Column(5) ~>       // take column two, the year 
    TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array 
    TermCounter() //~>      // collect label counts 
    TermMinimumDocumentCountFilter(0)  // filter labels in < 10 docs 
} 

val dataset = LabeledLDADataset(text, labels); 

// define the model parameters 
val modelParams = LabeledLDAModelParams(dataset); 

// Name of the output model folder to generate 
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature); 

// Trains the model, writing to the given output path 
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000); 

Answer


This line is wrong:

TermDynamicStopListFilter(0) ~> // filter out 30 most common terms

It should be TermDynamicStopListFilter(30), which filters out the 30 most common terms, as the comment says.
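
For context on the error message itself: "empty.max" is what the Scala standard library reports when max is called on an empty collection. The sketch below reproduces that exception in plain Scala, without TMT. The names EmptyMaxDemo, counts, attempt and safeMax are made up for illustration, and whether TMT hits this exact path because of the filter settings above is an assumption, not something the question confirms.

```scala
import scala.util.Try

// Illustration only: plain Scala, no TMT classes involved.
object EmptyMaxDemo {
  // Hypothetical term counts. An empty sequence stands in for a
  // pipeline stage that ended up with nothing to count.
  val counts: Seq[Int] = Seq.empty

  // max on an empty collection throws
  // java.lang.UnsupportedOperationException with the message "empty.max".
  val attempt: Try[Int] = Try(counts.max)

  // A guarded alternative: reduceOption yields None instead of throwing.
  val safeMax: Option[Int] = counts.reduceOption(_ max _)

  def main(args: Array[String]): Unit = {
    println(attempt)
    println(safeMax)
  }
}
```

If the error persists after changing the filter argument, it may be worth checking that each stage of the text and labels pipelines still produces at least one term or label, since an empty intermediate result would lead to the same exception.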
