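Evaluating an STM model

I'm working with an STM model (topic modelling) and I would like to evaluate and validate the model, but I'm not sure how to do that. My code is: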

Corpus.STM <- readCorpus(dtm, type = "slam")

Model selection:

BestM1. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(10,20, 30, 40, 50, 60), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land) 
BestM2. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(85,110), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land) 
BestM3. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(20,21,22,23,24,25,26,27,28,29,30), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land) 

str(BestM1.) 
plot.searchK(BestM1.) 
plot.searchK(BestM2.) 
plot.searchK(BestM3.) 
#27 seems to be a good choice 
#Heldout 
set.seed(1) 
heldout<- make.heldout(Corpus.STM$documents, Corpus.STM$vocab, proportion = .5,seed = 1) 
stm.mod1 <- stm(heldout$documents, heldout$vocab, K =27, seed = 1, init.type = "Spectral", max.em.its = 100) 
heldout.evaluation <- eval.heldout(stm.mod1, heldout$missing) 
heldout.evaluation 
#evaluation heldout 
labelTopics(stm.mod1) 
plot.STM(stm.mod1, type="labels", n=5, frexweight = 0.25) 
cloud(stm.mod1, topic=5) 
plot.STM(stm.mod1, type="summary", labeltype="frex", topics=c(1:5), n=8)

I'm not sure how to interpret the output of eval.heldout. I also want to make sure the model does not overfit, but I'm not sure how to check that.

Source

2017-01-02 S.Weigel

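eval.heldout() computes the held-out log-likelihood using document completion. The number you want is heldout.evaluation$expected.heldout, which is the mean of the per-document held-out log-likelihoods. Unfortunately, there is no clear standard for whether or not the model is "overfit". The plot.searchK() call you already have will give you a plot of the held-out log-likelihood for the different values of K, and if that number goes down as K increases, then one interpretation is overfitting.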
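To make that concrete, here is a minimal sketch of pulling the numbers out of the eval.heldout() result. It uses the poliblog5k example data that ships with stm as a stand-in for Corpus.STM, and the small K, the subsample size, and the iteration cap are assumptions chosen only to keep the run short.

```r
library(stm)

# Example data shipped with stm; an assumption standing in for Corpus.STM.
set.seed(1)
prep <- prepDocuments(poliblog5k.docs, poliblog5k.voc, poliblog5k.meta,
                      subsample = 500, lower.thresh = 20, upper.thresh = 200)

# Hold out a share of the words in a subset of the documents.
heldout <- make.heldout(prep$documents, prep$vocab, seed = 1)

# Fit on the remaining words only; K and max.em.its are illustrative.
mod <- stm(heldout$documents, heldout$vocab, K = 5,
           init.type = "Spectral", max.em.its = 20, verbose = FALSE)

ev <- eval.heldout(mod, heldout$missing)

ev$expected.heldout                 # the single summary number
mean(ev$doc.heldout, na.rm = TRUE)  # same value, computed by hand
summary(ev$doc.heldout)             # spread across held-out documents
```

The value is a log-likelihood, so it is negative and values closer to zero are better. It is mainly useful for comparing models fitted on the same held-out split, not as an absolute score.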

Sorry that there isn't a more definitive answer, but unfortunately there are no hard and fast rules here.
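Since there is no hard rule, it can help to look at the searchK() diagnostics side by side instead of only in the plot. The following is a rough sketch, again on the bundled poliblog5k data with illustrative K values. The column names in the results table (K, heldout, semcoh, residual) are what I would expect from searchK(), so check names(ks$results) in your installed version.

```r
library(stm)

# Example data shipped with stm; an assumption standing in for Corpus.STM.
set.seed(1)
prep <- prepDocuments(poliblog5k.docs, poliblog5k.voc, poliblog5k.meta,
                      subsample = 500, lower.thresh = 20, upper.thresh = 200)

# Fit one model per candidate K on the same held-out split.
ks <- searchK(prep$documents, prep$vocab, K = c(3, 5, 8),
              proportion = 0.5, heldout.seed = 1,
              init.type = "Spectral", max.em.its = 20, verbose = FALSE)

# unlist() covers versions that store the columns as lists.
res <- ks$results
tab <- data.frame(K        = unlist(res$K),
                  heldout  = unlist(res$heldout),
                  semcoh   = unlist(res$semcoh),
                  residual = unlist(res$residual))
tab

# First K (if any) at which the held-out likelihood starts to fall.
drop_at <- tab$K[-1][diff(tab$heldout) < 0]
drop_at
```

A falling held-out likelihood at larger K is one possible sign of overfitting, as described above, but small dips can also be noise from the particular split, so repeating the search with a different heldout.seed is a reasonable sanity check.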

Source

2018-01-08 18:24:15 bstewart
