2013-10-25
2
some text and some text too bad, 
some too&nbsp; bad again some bad
and other words bad, it is too  bad 

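I'm trying to replace every occurrence of the word "bad" with "good", with one exception:

If the word "too" comes before "bad", then "bad" should not be changed to "good". There may be one or more whitespace characters between "too" and "bad", or even the HTML space `&nbsp;`.

So after the regex has processed the text, it should be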

some text and some text too bad,
some too&nbsp; bad again some good
and other words good, it is too  bad

I tried something like this, but it doesn't work properly.

$text ~= s/(too(\s+|\s*&nbsp;\s*))bad/good/ig;

Please help.

+2

Regex experts can work wonders, but in the end someone has to understand and maintain code like this. –

Answers

-1

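You can try decoding the HTML spaces and using a regex with an evaluated replacement (the `/e` modifier) that checks whether the preceding word is `too`: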

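#!/usr/bin/env perl

use strict;
use warnings;
use HTML::Entities;

while (<DATA>) {
    # Decode &nbsp; to the character \xA0, in place
    # (other named entities are left alone).
    _decode_entities($_, { nbsp => "\xA0" });
    # Keep the whole match when the preceding word is "too";
    # otherwise put back the word and whitespace and swap in "good".
    s/(\w+)(\s+)bad/$1 eq 'too' ? $& : "$1$2good"/eg;
    # Re-encode so that \xA0 goes back to &nbsp; in the output.
    encode_entities($_);
    print $_;
}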

__DATA__ 
some text and some text too bad, 
some too&nbsp; bad again some bad 
and other words bad, it is too  bad 

Run it like:

perl script.pl 

That yields:

some text and some text too bad, 
some too&nbsp; bad again some good 
and other words good, it is too  bad 
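
The same check on the preceding word can also be sketched without decoding and re-encoding entities, by matching `&nbsp;` literally. This is only a minimal sketch and not the answerer's code: the helper name `replace_bad` is hypothetical, and it assumes that only whitespace and a literal `&nbsp;` can separate `too` from `bad`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): replace "bad" with "good"
# unless "too" plus whitespace and/or a literal &nbsp; comes right before it.
sub replace_bad {
    my ($text) = @_;
    # The first alternative consumes a protected "too ... bad" and puts it
    # back unchanged; the second matches any other standalone "bad".
    $text =~ s/(\btoo(?:\s|&nbsp;)+bad\b)|\bbad\b/defined $1 ? $1 : 'good'/gie;
    return $text;
}

print replace_bad("some too&nbsp; bad again some bad\n");
```

Because the protected phrase is matched first and consumed whole, the later `bad` alternative never gets a chance to match inside it. Matching is case-insensitive here, and a replaced word always becomes lowercase `good`.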
+0

So a non-breaking space becomes breakable? – Borodin

+0

@Borodin: Thanks for spotting that bug. I've added the `encode_entities()` call to fix it. – Birei

+0

Thanks Borodin and @Birei, this really helped me a lot –

1

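I don't believe this can be done conveniently with a regex. It is made more complicated because the idea of a word is unclear: for example, you want "bad" to be treated as the word "bad" even when punctuation such as a comma follows it.

This program works by tokenizing the string into words and separators, then changing every occurrence of "bad" to "good" unless it is preceded by "too" (ignoring case). I have included commas, semicolons, periods, and colons in the list of possible separators. You may want to adjust this to get the results you expect.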

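use strict;
use warnings;

my $text = <<END;
some text and some text too bad,
some too&nbsp; bad again some bad
and other words bad, it is too  bad
END

# Split into words and separators. The capture group keeps the separators,
# so words land on even indices and separators on odd ones.
my @tokens = split /((?:[\s,;.:]|&nbsp;)+)/, $text;

for my $i (grep { lc $tokens[$_] eq 'bad' } 0 .. $#tokens) {
    # The previous word sits two tokens back; a leading "bad" has none.
    next if $i >= 2 and lc $tokens[$i-2] eq 'too';
    $tokens[$i] = 'good';
}

print join '', @tokens;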

Output

some text and some text too bad, 
some too&nbsp; bad again some good 
and other words good, it is too  bad
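
Since the separator list is the part most likely to need adjusting, one way to sketch that is to pass it in as a parameter. The wrapper name `replace_bad_tokens` and its `$sep_class` argument are hypothetical and not part of the answer above. The sketch assumes the separator characters can be written as the inside of a regex character class.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the tokenizing approach above; $sep_class is
# an assumed parameter holding the characters allowed inside a separator.
sub replace_bad_tokens {
    my ($text, $sep_class) = @_;
    $sep_class = '\s,;.:' unless defined $sep_class;

    # The capture group keeps separators, so words sit at even indices.
    my @tokens = split /((?:[$sep_class]|&nbsp;)+)/, $text;

    for my $i (0 .. $#tokens) {
        next unless lc $tokens[$i] eq 'bad';
        # Skip "bad" when the previous word (two tokens back) is "too".
        next if $i >= 2 and lc $tokens[$i - 2] eq 'too';
        $tokens[$i] = 'good';
    }
    return join '', @tokens;
}

# With "!" and "?" added, "bad!" is also recognised as the word "bad".
print replace_bad_tokens("not bad! but too bad?\n", '\s,;.:!?');
```

With the default separators, `bad!` would be a single token and would be left alone, which is why the choice of separators decides what counts as a word.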