
While training data in Mallet, the process stops with an OutOfMemoryError. The MEMORY property in bin/mallet has already been set to 3GB. The training file output.mallet is only 31 MB in size. I have tried reducing the size of the training data, but it still throws the same error: Mallet: OutOfMemoryError: Java heap space

user@host:~/dev/test_models/Mallet$ bin/mallet train-classifier --input output.mallet --trainer NaiveBayes --training-portion 0.0001 --num-trials 10 
Training portion = 1.0E-4 
Unlabeled training sub-portion = 0.0 
Validation portion = 0.0 
Testing portion = 0.9999 

-------------------- Trial 0 -------------------- 

Trial 0 Training NaiveBayesTrainer with 7 instances 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space 
     at cc.mallet.types.Multinomial$Estimator.setAlphabet(Multinomial.java:309) 
     at cc.mallet.classify.NaiveBayesTrainer.setup(NaiveBayesTrainer.java:251) 
     at cc.mallet.classify.NaiveBayesTrainer.trainIncremental(NaiveBayesTrainer.java:200) 
     at cc.mallet.classify.NaiveBayesTrainer.train(NaiveBayesTrainer.java:193) 
     at cc.mallet.classify.NaiveBayesTrainer.train(NaiveBayesTrainer.java:59) 
     at cc.mallet.classify.tui.Vectors2Classify.main(Vectors2Classify.java:415) 

I would appreciate any help or insight into this problem.

Edit: here is my bin/mallet file.

#!/bin/bash 


malletdir=`dirname $0` 
malletdir=`dirname $malletdir` 

cp=$malletdir/class:$malletdir/lib/mallet-deps.jar:$CLASSPATH 
#echo $cp 

MEMORY=10g 

CMD=$1 
shift 

help() 
{ 
cat <<EOF 
Mallet 2.0 commands: 

    import-dir   load the contents of a directory into mallet instances (one per file) 
    import-file  load a single file into mallet instances (one per line) 
    import-svmlight load SVMLight format data files into Mallet instances 
    info    get information about Mallet instances 
    train-classifier train a classifier from Mallet data files 
    classify-dir  classify the contents of a directory with a saved classifier 
    classify-file  classify data from a single file with a saved classifier 
    classify-svmlight classify data from a single file in SVMLight format 
    train-topics  train a topic model from Mallet data files 
    infer-topics  use a trained topic model to infer topics for new documents 
    evaluate-topics estimate the probability of new documents under a trained model 
    prune    remove features based on frequency or information gain 
    split    divide data into testing, training, and validation portions 
    bulk-load   for big input files, efficiently prune vocabulary and import docs 

Include --help with any option for more information 
EOF 
} 

CLASS= 

case $CMD in 
     import-dir) CLASS=cc.mallet.classify.tui.Text2Vectors;; 
     import-file) CLASS=cc.mallet.classify.tui.Csv2Vectors;; 
     import-svmlight) CLASS=cc.mallet.classify.tui.SvmLight2Vectors;; 
     info) CLASS=cc.mallet.classify.tui.Vectors2Info;; 
     train-classifier) CLASS=cc.mallet.classify.tui.Vectors2Classify;; 
     classify-dir) CLASS=cc.mallet.classify.tui.Text2Classify;; 
     classify-file) CLASS=cc.mallet.classify.tui.Csv2Classify;; 
     classify-svmlight) CLASS=cc.mallet.classify.tui.SvmLight2Classify;; 
     train-topics) CLASS=cc.mallet.topics.tui.TopicTrainer;; 
     infer-topics) CLASS=cc.mallet.topics.tui.InferTopics;; 
     evaluate-topics) CLASS=cc.mallet.topics.tui.EvaluateTopics;; 
     prune) CLASS=cc.mallet.classify.tui.Vectors2Vectors;; 
     split) CLASS=cc.mallet.classify.tui.Vectors2Vectors;; 
     bulk-load) CLASS=cc.mallet.util.BulkLoader;; 
     run) CLASS=$1; shift;; 
     *) echo "Unrecognized command: $CMD"; help; exit 1;; 
esac 

java -Xmx$MEMORY -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -classpath "$cp" $CLASS "$@"

It is also worth mentioning that my original training file had 60,000 items. When I reduced the number of items (to 20,000 instances), training ran normally but used about 10 GB of RAM.
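
The stack trace above fails inside Multinomial$Estimator.setAlphabet, which suggests the size of the feature alphabet matters at least as much as the number of instances. A possible workaround is the prune command from the help text in bin/mallet; the sketch below is untested, and pruned.mallet and the count threshold are placeholders:

# Sketch: remove low-frequency features to shrink the alphabet, then retrain on the smaller file. 
# pruned.mallet and the threshold 5 are placeholders; adjust --prune-count to your data. 
bin/mallet prune --input output.mallet --output pruned.mallet --prune-count 5 
bin/mallet train-classifier --input pruned.mallet --trainer NaiveBayes --training-portion 0.0001 --num-trials 10 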


Which file exactly did you change? bin/mallet or bin/mallet.sh? – mikep


bin/mallet and bin/mallet.bat –


Is the call to java in one of them? – mikep

Answer


Check the call to Java in bin/mallet and add the flag -Xmx3g (make sure there is not another -Xmx in there; if there is, edit that one).
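
For reference, in the bin/mallet file shown above the heap is already controlled by the MEMORY variable, which is forwarded to the JVM as -Xmx$MEMORY on the launch line, so the usual fix is to edit that single assignment rather than add a second flag. A minimal sketch of the relevant lines, assuming the chosen value fits in the machine's free RAM:

# JVM heap for every Mallet command launched by this script (currently 10g in the file above). 
MEMORY=3g 
# The launch line already forwards it; do not add a second -Xmx here. 
java -Xmx$MEMORY -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -classpath "$cp" $CLASS "$@" 

Note that because the script assigns MEMORY unconditionally, setting it as an environment variable before calling bin/mallet will not override it; the line itself has to be edited.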