Counting the number of occurrences of a word

I'm looking for a better SAS way to count how many times a given word appears in a string. For example, searching for 'wood' in the string:

how much wood could a woodchuck chuck if a woodchuck could chuck wood

...would return a result of 2.

This is how I would normally do it, but it's a lot of code:

data _null_; 
    length sentence word $200; 

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 
    found_count = 0; 

    cnt=1; 
    word = scan(sentence,cnt); 
    do while (word ne ''); 
    num_times_found = sum(num_times_found, word eq search_term); 
    cnt = cnt + 1; 
    word = scan(sentence,cnt); 
    end; 

    put num_times_found=; 

run;

I could turn this into an FCMP function to make it more elegant, but I still feel there must be friendlier, more concise code.

2016-02-12 Robert Penridge

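I posted this here rather than on Code Review because I don't think Code Review has any SAS audience. –

Isn't this just countW? –

@data_null_ No - that was my first thought too, but 'countw()' only counts the total number of words, not the number of times a specific word appears. –

From a Code Review perspective, the above can be improved somewhat. The do loop can handle the cnt increment, and if you switch it to until, you don't need the initial assignment. You also have an extraneous variable, found_count; I'm not sure what that is for. Otherwise I think it's reasonable, at least for a non-convoluted solution.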

data _null_; 
    length sentence word $200; 

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 

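    /* The loop increments cnt itself; until is tested after each pass, */
    /* so no initial scan() call is needed before the loop.             */
    do cnt=1 by 1 until (word eq ''); 
        word = scan(sentence,cnt); 
        num_times_found = sum(num_times_found, word eq search_term); 
    end; 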

    put num_times_found=; 

run;

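It's also fairly fast - 1e6 iterations take under 9 seconds on my box. The PRX solution takes less time (6 seconds) once the o option is added to the regex string, so it may be preferable with very large datasets or large numbers of variables, but I'd expect the added time to be insignificant compared with I/O time. The FCMP solution takes about the same time as this solution (around 8-9 seconds). Finally, the FINDW solution is the fastest, at around 2 seconds.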
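For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of one way to time the SCAN loop. It is not the harness used for the figures above; the variable names iter, start and elapsed are made up for illustration, and the timings will depend on your machine.

```sas
data _null_;
    length sentence word $200;

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
    search_term = 'wood';

    /* start, iter and elapsed are illustrative names, not from the original posts */
    start = datetime();

    do iter = 1 to 1e6;
        num_times_found = 0;   /* reset the counter for each repetition */
        do cnt = 1 by 1 until (word eq '');
            word = scan(sentence, cnt);
            num_times_found = sum(num_times_found, word eq search_term);
        end;
    end;

    elapsed = datetime() - start;   /* wall-clock seconds for all repetitions */
    put elapsed= num_times_found=;
run;
```

Swapping the inner loop for one of the other approaches in this thread should give a rough like-for-like comparison.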

2016-02-12 16:35:35 Joe

Try removing 'wood' with prxchange, then use countw.

data _null_; 
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    /* Remove every 'wood', then compare the word counts before and after. */
    count = countw(sentence,' ') - countw(prxchange('s/wood//i',-1,sentence),' '); 
    put _all_; 
run;

2016-02-12 16:35:01

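Technically, this does of course turn 'woodchuck' into 'chuck', but that doesn't affect the result. – Joe

And this is exactly what I meant by a 'convoluted solution' - not because it's wrong, but because it's less straightforward, and can be avoided on that principle (since it's hard for others to see what you're doing). – Joe

You may want to add the 'o' option to your prx, otherwise it takes quite a long time to run many iterations. – Joe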
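A minimal sketch of a variant that combines the two comments above, assuming you want to leave 'woodchuck' untouched: the \b word-boundary anchors restrict the match to the standalone word, and the o suffix asks SAS to compile the pattern only once. This is an illustration, not the answerer's original code.

```sas
data _null_;
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';

    /* \bwood\b matches 'wood' only as a whole word, so 'woodchuck' is kept. */
    /* Suffixes: i = case-insensitive, o = compile the regex once.           */
    count = countw(sentence, ' ')
          - countw(prxchange('s/\bwood\b//io', -1, sentence), ' ');

    put count=;
run;
```

The counting trick is unchanged: each removed word lowers the COUNTW total by one, so the difference is the number of occurrences.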

And for the sake of completeness, here it is as an FCMP function:

FCMP definition:

options cmplib=work.temp.temp; 

proc fcmp outlib=work.temp.temp; 

    function word_freq(sentence $, search_term $) ;  
    length sentence word $200; 

    do cnt=1 by 1 until (word eq ''); 
     word = scan(sentence,cnt); 
     num_times_found = sum(num_times_found, word eq search_term); 
    end; 

    return (num_times_found); 
    endsub; 

run;

Usage:

data _null_; 
    num_times_found = word_freq('how much wood could a woodchuck chuck if a woodchuck could chuck wood','wood'); 
    put num_times_found=; 
run;

Result:

num_times_found=2

2016-02-12 17:11:43

There is no reason to scan through all the words when FINDW will efficiently scan to the word for you.

data _null_; 
    length sentence search_term $200; 
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 
    cnt=0; 
    do s=findw(sentence,strip(search_term),1) by 0 while(s); 
        cnt+1; 
        s=findw(sentence,strip(search_term),s+1); 
    end; 
    put cnt= search_term=; 
    stop; 
run; 

cnt=2 search_term=wood

2016-02-12 17:59:12

Definitely a lot faster than the SCAN approach. – Joe
