2016-02-12
2

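Counting the number of times a word appears in a string

I'm looking for a better way in SAS to count the number of times a particular word appears in a string. For example, searching for the word 'wood' in the string: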

how much wood could a woodchuck chuck if a woodchuck could chuck wood 

...would return a result of 2.

This is what I would normally do, but it's a lot of code:

data _null_; 
    length sentence word $200; 

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 
    found_count = 0; 

    cnt=1; 
    word = scan(sentence,cnt); 
    do while (word ne ''); 
        num_times_found = sum(num_times_found, word eq search_term); 
        cnt = cnt + 1; 
        word = scan(sentence,cnt); 
    end; 

    put num_times_found=; 

run; 

I could turn this into an FCMP function to make it more elegant, but I still feel there must be friendlier, more concise code.

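I posted this here rather than on Code Review because I didn't think Code Review would have any SAS audience.

Isn't this just countW?

@data_null_ No. That was my first thought too, but 'countw()' only counts the total number of words, not the number of times a specific word occurs.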
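
To make that distinction concrete, here is a minimal sketch (the variable name total_words is only illustrative and not from the thread). With its default delimiters, countw() returns the total number of words in the sentence, which says nothing about how often 'wood' occurs:

data _null_; 
    length sentence $200; 
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    /* countw() counts every word in the string, not matches of one word */ 
    total_words = countw(sentence); 
    put total_words=;   /* writes total_words=13 to the log */ 
run; 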

Answers

3

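From a Code Review standpoint, the above can be improved somewhat. The do loop can handle the cnt increment, and if you switch it to until, you don't have to do the initial assignment. You also have an extraneous variable, found_count; I'm not sure what that is for. Otherwise I think it's reasonable, at least for a non-convoluted solution.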

data _null_; 
    length sentence word $200; 

    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 

    do cnt=1 by 1 until (word eq ''); 
        word = scan(sentence,cnt); 
        num_times_found = sum(num_times_found, word eq search_term); 
    end; 

    put num_times_found=; 

run; 

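It's also fairly fast: 1e6 iterations take under 9 seconds on my box. The PRX solution takes less time (6 seconds) once the o option is added to the regex options, so it may be preferable when working with very large datasets or a large number of variables, though I believe the difference is unlikely to matter much compared to I/O time. The FCMP solution takes about the same time as this one (roughly 8-9 seconds). Finally, the FINDW solution is the fastest, at about 2 seconds.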
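
For anyone who wants to reproduce this kind of comparison, here is a rough sketch of a timing loop. It is not the harness used for the figures above; the iter and elapsed variables and the use of datetime() for wall-clock timing are my own assumptions. It repeats the SCAN-based count 1e6 times and reports the elapsed seconds:

data _null_; 
    length sentence word $200; 
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 

    start = datetime();               /* wall-clock start, in seconds */ 
    do iter = 1 to 1e6; 
        num_times_found = 0;          /* reset the count for each repetition */ 
        do cnt = 1 by 1 until (word eq ''); 
            word = scan(sentence, cnt); 
            num_times_found = sum(num_times_found, word eq search_term); 
        end; 
    end; 
    elapsed = datetime() - start;     /* total seconds for all repetitions */ 

    put num_times_found= elapsed= 8.2; 
run; 

Swapping the inner loop for one of the other approaches gives a like-for-like comparison on the same machine.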

2

Try using prxchange to drop 'wood', then countw.

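data _null_; 
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    /* Delete every 'wood' (case-insensitive), then compare word counts. */ 
    /* The replacement is empty: the pattern has no capture group for $1 to refer to. */ 
    count=countw(sentence,' ')-countw(prxchange('s/wood//i',-1,sentence),' '); 
    put _all_; 
run; 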
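
Technically, of course, this turns 'woodchuck' into 'chuck', but that doesn't affect the result. – Joe

And this is exactly what I meant by a 'convoluted solution': not because it's wrong, but because it's less straightforward, and on that principle could be avoided (since it's harder for others to see what you're doing). – Joe

You may want to add the 'o' option to your prx; otherwise running many iterations takes quite a while. – Joe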
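
A sketch of what that suggestion might look like, assuming the only change is the compile-once 'o' suffix on the regular expression. The intermediate variable without_wood is only illustrative, and the empty replacement simply deletes each match:

data _null_; 
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    /* 'i' ignores case; 'o' asks SAS to compile the pattern only once */ 
    without_wood = prxchange('s/wood//io', -1, sentence); 
    /* each removed whole word lowers the word count by one */ 
    count = countw(sentence, ' ') - countw(without_wood, ' '); 
    put count=; 
run; 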

2

And for the sake of completeness, here it is as an FCMP function:

FCMP definition:

/* CMPLIB= takes a two-level libref.dataset name; OUTLIB= adds the package as a third level */ 
options cmplib=work.temp; 

proc fcmp outlib=work.temp.temp; 

    function word_freq(sentence $, search_term $); 
        length sentence word $200; 

        do cnt=1 by 1 until (word eq ''); 
            word = scan(sentence,cnt); 
            num_times_found = sum(num_times_found, word eq search_term); 
        end; 

        return (num_times_found); 
    endsub; 

run; 

Usage:

data _null_; 
    num_times_found = word_freq('how much wood could a woodchuck chuck if a woodchuck could chuck wood','wood'); 
    put num_times_found=; 
run; 

Result:

num_times_found=2 

3

There's no reason to scan through every word when FINDW will do the scanning for you efficiently.

data _null_; 
    length sentence search_term $200; 
    sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'; 
    search_term = 'wood'; 
    cnt=0; 
    do s=findw(sentence,strip(search_term),1) by 0 while(s); 
        cnt+1; 
        s=findw(sentence,strip(search_term),s+1); 
    end; 
    put cnt= search_term=; 
    stop; 
run; 

cnt=2 search_term=wood 

Definitely a lot faster than the SCAN approach. – Joe
