MATLAB - How to get the number of occurrences of each word in a string?


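Suppose we want to use MATLAB to count how many times each word occurs in a particular text file. How can we do that? I am checking whether words are SPAM or HAM words (I am doing content filtering), so I am looking for the probability that a word is spam: n(occurrences in spam) / n(total occurrences) gives that probability.

Any hints?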


2014-08-28 Priyam Soneji

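Can we assume the text file has already been imported into a string? Or have the words already been separated into a cell array of strings? – 2014-08-28 19:45:23

Not a cell array of strings; assume it has been imported from a text file – 2014-08-28 19:47:32

Then you can import it as either a cell array or a char array. – Divakar 2014-08-28 19:52:18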
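The two import options mentioned in the comment above could look roughly like the sketch below. The file name and its contents are made up for illustration (the snippet writes its own temporary file so that it is self-contained); 'fileread' returns one char array, while 'textscan' with '%s' returns the whitespace-separated words in a cell array.

```matlab
% Create a small temporary file (hypothetical contents, for illustration only)
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, 'free offer now\nmeeting at noon\n');
fclose(fid);

% Option 1: the whole file as a single char array
txtChar = fileread(fileName);

% Option 2: whitespace-separated words as a cell array of strings
fid = fopen(fileName, 'r');
c = textscan(fid, '%s');
fclose(fid);
wordsCell = c{1};   % column cell array, one word per cell

delete(fileName);   % clean up the temporary file
```

Note that 'textscan' splits on whitespace only, so punctuation stays attached to the words; the char array form leaves the splitting to a later step.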

You can use a regular expression to find the number of occurrences of a word.

For example:

txt = fileread(fileName); 
tokens = regexp(txt, string, 'tokens');

Here 'string' is the word you are looking for.


2014-08-28 19:47:01 lakesh

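Can that string be all the cells of a cell array of strings at once? That is what I am looking for – 2014-08-28 19:49:59

@PriyamSoneji - Yes, it can. 'regexp' works with either a single string or a cell array of strings. – rayryeng 2014-08-29 06:05:55

By the way, this is only part of an answer. You have the right mechanism for searching for a specific pattern in a string. You do not have the logic to count how many times the word occurs. You are heading in the right direction, though. – rayryeng 2014-08-29 06:07:14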
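The counting step that the comment above says is missing could be sketched as follows. This is one possible approach under stated assumptions, not the answerer's own code: the sample text and the list of query words are invented, and the lookaround pattern that keeps 'free' from matching inside a longer word is a choice made here.

```matlab
% Hypothetical input text and query words, for illustration only
txt = 'Free offer! Claim your free prize now. Offer ends soon.';
queryWords = {'free', 'offer', 'prize', 'spam'};

counts = zeros(size(queryWords));
for k = 1:numel(queryWords)
    % Escape regex metacharacters in the word, then require that it is
    % not preceded or followed by another letter (whole-word match)
    pattern = ['(?<![A-Za-z])' regexptranslate('escape', queryWords{k}) '(?![A-Za-z])'];
    % 'match' returns one cell per occurrence; 'ignorecase' makes it case insensitive
    matches = regexp(txt, pattern, 'match', 'ignorecase');
    counts(k) = numel(matches);
end
```

For this sample text, counts comes out as [2 2 1 0]. Looping over the words keeps one count per query word; a single call with an alternation pattern would only give the combined total.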

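As an example, consider a text file named text.txt containing the following text:

These two sentences, like all sentences, contain words. Some of those words are repeated; but not all.

One possible approach is as follows: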

s = importdata('text.txt'); % import text. Gives a 1x1 cell containing a string
% Split into words. The appended '.' makes sure there is always a final
% punctuation sign. "lower" makes the count case insensitive. The hyphen is
% placed last inside the brackets so that it is a literal separator and not
% a range. You may want to extend the list of separators.
words = regexp([lower(s{1}) '.'], '[\s\.,;:''"?!/()-]+', 'split');
words = words(1:end-1); % remove last "word", which will always be empty
% This is the important part: get unique words and an integer label for each one
[uniqueWords, ~, intLabels] = unique(words);
count = histc(intLabels, 1:numel(uniqueWords)); % occurrences of each label

The result is in uniqueWords and count:

uniqueWords = 
    'all' 'are' 'but' 'contain' 'like' 'not' 'of' 'repeated' 
    'sentences' 'some' 'these' 'those' 'two' 'words'  

count = 
     2 1 1 1 1 1 1 1 2 1 1 1 1 2
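The same unique-label idea can be extended to the ratio from the question, n(occurrences in spam) / n(total occurrences). The sketch below assumes the words have already been split and that each one is known to come from a spam or a ham message; the word lists and all variable names are hypothetical, and 'accumarray' is used here in place of 'histc' for the per-label sums.

```matlab
% Hypothetical, already-split words from spam and ham messages
spamWords = {'free', 'offer', 'free', 'prize', 'now'};
hamWords  = {'meeting', 'now', 'free', 'lunch'};

allWords = [spamWords, hamWords];
isSpam   = [true(1, numel(spamWords)), false(1, numel(hamWords))];

% Unique vocabulary and an integer label for every word occurrence
[vocab, ~, labels] = unique(allWords);
labels = labels(:);   % force a column vector for accumarray

% Per-word totals and per-word spam counts
totalCount = accumarray(labels, 1, [numel(vocab) 1]);
spamCount  = accumarray(labels, double(isSpam(:)), [numel(vocab) 1]);

% Fraction of each word's occurrences that come from spam
pSpam = spamCount ./ totalCount;
```

With these lists, vocab is 'free', 'lunch', 'meeting', 'now', 'offer', 'prize', and pSpam is 2/3, 0, 0, 0.5, 1, 1 in that order. Every word in vocab occurs at least once, so the division never hits a zero denominator.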


2014-08-28 20:59:29

+1 - Very nice concrete example – rayryeng 2014-08-29 06:06:19
