Perl: removing stop words from a string

I am using this script to remove stop words in Perl. I am running on Windows, and I could not find a compatible version of:

Lingua::EN::StopWordList 
Lingua::StopWords qw(getStopWords)

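I have an array of stop words, but as soon as I use the regex below I lose essential spaces, which makes neighbouring words run together. Note that every word in the stop-word array is padded with two spaces, one on the left and one on the right.

How can I remove stop words efficiently without losing the essential whitespace?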

use strict; 
use warnings; 
use utf8; 
use IO::File; 
use String::Util 'trim'; 

my $inFile = "C:\\Users\\David\\Downloads\\InfoRet\\Explore the ways to get better grades.txt"; 
my $inFh = new IO::File $inFile, "r"; 
my $lineNum = 0; 
my $line = undef; 
my $loc = undef; 
my $str = undef; 

my @stopList = (" the ", " a ", " an ", " of ", " and ", " on ", " in ", " by ", " with ", " at ", " after ", " into ", " their ", " is ", " that ", " they ", " for ", " to ", " it ", " them ", " which "); 

for(my $i = 1; $i <= 4; $i++) { 
    <$inFh> 
} 

while($line = <$inFh>) { 
    $lineNum++; 
    chomp $line; 
    $line =~ s/[\$#@~!&*()\[\];.,:?^`\\\/]+//g; 

    for my $planet (@stopList) { 
     $loc = index($line, $planet); 
     if($loc!=(-1)) { 
      #$line =~ s/$str//g; 
      $line =~ s/$planet//g; 
     } 
    } 
    print "$line\n"; 
}

Source

2014-11-08 David Faiz

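One idea is not to remove the whitespace at all. Instead of looping over the stop list, build a hash with the stop words as keys and `""` as their values. Then do `s#(\w+)#$hash{lc($1)} // $1#ge`. Note that you have to use defined-or `//`, because `""` is a false value. Also note that you have to remove the spaces from the entries in your stop-word list. – TLP 2014-11-08 18:24:29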
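The hash idea in the comment above could look roughly like the sketch below. The sub name `remove_stop_words`, the shortened stop list, and the sample sentence are assumptions made for illustration, and the final whitespace squeeze is an optional extra that the comment does not mention. Defined-or (`//`) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Stop words stored WITHOUT padding spaces; each one maps to the empty string.
# (Shortened, illustrative list; not the full list from the question.)
my %stop = map { $_ => "" } qw(the a an of and on in by with at);

# Hypothetical helper name, introduced only for this sketch.
sub remove_stop_words {
    my ($line) = @_;

    # Look every word up in the hash. /e evaluates the replacement as code;
    # defined-or (//) is required because "" is defined but false.
    $line =~ s#(\w+)#$stop{lc($1)} // $1#ge;

    # Optional extra step: squeeze the runs of spaces left behind and trim.
    $line =~ s/\s+/ /g;
    $line =~ s/^ | $//g;

    return $line;
}

print remove_stop_words("Explore the ways to get better grades in a class"), "\n";
# prints: Explore ways to get better grades class
```

Because only whole `\w+` tokens are looked up, a stop word can never match inside a longer word, and the spaces between the surviving words are never consumed.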

my @stopList = ("the", "a", "an", "of", ..); 
my ($rx) = map qr/(?:$_)/, join "|", map qr/\b\Q$_\E\b/, @stopList;

Later,

$line =~ s/$rx//g;
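As a hedged, self-contained sketch of the word-boundary approach above: the version below uses the stop list from the question with the padding spaces removed. The sub name `strip_stop_words`, the `/i` flag, and the whitespace clean-up are assumptions added here and are not part of the original answer. Deleting a word with `\b` anchors leaves both of its neighbouring spaces in place, so the sketch squeezes them afterwards.

```perl
use strict;
use warnings;

# Stop list from the question, with the padding spaces removed.
my @stopList = qw(the a an of and on in by with at after into their
                  is that they for to it them which);

# One alternation anchored on word boundaries; quotemeta escapes any
# regex metacharacters in the words (it plays the role of \Q...\E).
my $alternation = join "|", map { quotemeta($_) } @stopList;
my $rx = qr/\b(?:$alternation)\b/i;

# Hypothetical helper name, introduced only for this sketch.
sub strip_stop_words {
    my ($line) = @_;
    $line =~ s/$rx//g;      # delete stop words, keep the surrounding spaces
    $line =~ s/\s+/ /g;     # squeeze the doubled spaces left behind
    $line =~ s/^ | $//g;    # trim a leading or trailing space
    return $line;
}

print strip_stop_words("Explore the ways to get better grades"), "\n";
# prints: Explore ways get better grades
```

The `\b` anchors are what keep "the" from matching inside "theme" or "in" from matching inside "into", so the stop words no longer need to carry their own spaces.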

Source

2014-11-08 18:26:00
