Training a Naive Bayes classifier on n-grams

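I've been using the Ruby Classifier library to classify privacy policies. I've come to the conclusion that the simple bag-of-words approach built into this library is not enough. To improve my classification accuracy, I want to train the classifier on n-grams in addition to individual words.

I was wondering whether there is a library for preprocessing documents to get the relevant n-grams (and handle punctuation correctly). One idea is that I could preprocess the documents and feed pseudo-n-grams into the Ruby Classifier, like: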

wordone_wordtwo_wordthree

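Or maybe there's a better way to do this, such as a library that has n-gram-based Naive Bayes classification built in from the get-go. I'm open to using languages other than Ruby if they get the job done (Python seems like a good candidate if need be).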
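The pseudo-n-gram idea from the question can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: `pseudo_ngram_tokens` is a hypothetical helper name, sentences are split naively on `.`, `!` and `?` (so abbreviations and decimals would be split too), and punctuation is simply dropped.

```python
import re

def pseudo_ngram_tokens(text, n=3):
    """Return n-grams joined with underscores, e.g. 'wordone_wordtwo_wordthree'.

    Hypothetical helper: n-grams never cross a sentence boundary, and
    punctuation is discarded rather than kept as a token.
    """
    tokens = []
    # Naive sentence split; good enough for a sketch, not for real prose.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z0-9']+", sentence)
        # Slide a window of width n over the words of this sentence.
        for i in range(len(words) - n + 1):
            tokens.append("_".join(words[i:i + n]))
    return tokens

print(pseudo_ngram_tokens("We collect data. We never sell it!", n=2))
# ['we_collect', 'collect_data', 'we_never', 'never_sell', 'sell_it']
```

The resulting strings can be appended to the ordinary word list, so a bag-of-words classifier sees each n-gram as just another token.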

2012-04-09 babonk

If you're OK with Python, I'd say nltk would be perfect for you.

For example:

>>> import nltk
>>> s = "This is some sample data. Nltk will use the words in this string to make ngrams. I hope that this is useful.".split()
>>> model = nltk.NgramModel(2, s)
>>> model._ngrams
set([('to', 'make'), ('sample', 'data.'), ('the', 'words'), ('will', 'use'), ('some', 'sample'), ('', 'This'), ('use', 'the'), ('make', 'ngrams.'), ('ngrams.', 'I'), ('hope', 'that'), ('is', 'some'), ('is', 'useful.'), ('I', 'hope'), ('this', 'string'), ('Nltk', 'will'), ('words', 'in'), ('this', 'is'), ('data.', 'Nltk'), ('that', 'this'), ('string', 'to'), ('in', 'this'), ('This', 'is')])

NLTK even ships a Naive Bayes classifier, nltk.NaiveBayesClassifier.
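To show how the two pieces could fit together, here is a minimal sketch that feeds unigram and bigram features into `nltk.NaiveBayesClassifier`. Several things here are assumptions rather than part of the answer: the helper name `ngram_features`, the regex tokenizer (used so no NLTK data download is needed), and the four toy training sentences with their `shares`/`private` labels. It uses `nltk.util.ngrams` rather than `nltk.NgramModel`, since, as far as I know, `NgramModel` is no longer available in NLTK 3.x.

```python
import re

from nltk import NaiveBayesClassifier
from nltk.util import ngrams

def ngram_features(text, max_n=2):
    """Map a document to {feature_name: True} for all 1..max_n-grams.

    Hypothetical helper: punctuation is dropped by the regex tokenizer.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    features = {}
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            features["_".join(gram)] = True
    return features

# Toy, made-up training data purely for illustration.
train_docs = [
    ("We share your data with third parties.", "shares"),
    ("We may share your data with advertisers.", "shares"),
    ("We never share your data.", "private"),
    ("We do not sell your data.", "private"),
]
train_set = [(ngram_features(text), label) for text, label in train_docs]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(ngram_features("They never share data.")))
# private
```

The bigram `never_share` is what separates the two classes here; with unigrams alone, `share` would pull the example toward the `shares` label.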

2012-04-09 20:21:11

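Great answer, +1 – Yavar 2012-04-09 20:39:41

NLTK looks awesome compared to much of what Ruby has. Python wins, thanks! – babonk 2012-04-09 21:49:47

@babonk My pleasure. I've found nltk to be a joy to use and incredibly powerful. Hope you have fun with it :D – 2012-04-09 21:50:43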

>> s = "She sells sea shells by the sea shore" 
=> "She sells sea shells by the sea shore" 
>> s.split(/ /).each_cons(2).to_a.map {|x,y| x + ' ' + y} 
=> ["She sells", "sells sea", "sea shells", "shells by", "by the", "the sea", "sea shore"]

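Ruby's Enumerable has a method called each_cons that yields every run of n consecutive items from the enumerable. With it, generating n-grams is a simple one-liner.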
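For anyone following the Python route instead, the same sliding-window idea can be sketched with `zip`. The function name `each_cons` is borrowed from Ruby for illustration only; it is not a Python built-in.

```python
def each_cons(items, n):
    """Return all runs of n consecutive items, like Ruby's each_cons(n).to_a."""
    items = list(items)
    # Zip the sequence against copies of itself shifted by 0..n-1 positions.
    return list(zip(*(items[i:] for i in range(n))))

words = "She sells sea shells by the sea shore".split()
bigrams = [" ".join(pair) for pair in each_cons(words, 2)]
print(bigrams)
# ['She sells', 'sells sea', 'sea shells', 'shells by', 'by the', 'the sea', 'sea shore']
```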

2012-04-10 04:24:06

Thx. I had to use 'each_cons' instead of 'enum_cons'. – Dru 2013-01-20 15:57:21

Dru: It seems enum_cons has been deprecated. I've replaced it with each_cons in my answer. Thanks! – 2013-01-20 17:09:10
