2013-06-12 127 views
4

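Splitting a string into substrings in Ruby using a regular expression

I am trying to split a string in Ruby into smaller substrings or phrases based on a list of stop words. The split method works when I define the regex pattern directly; however, it does not work when I try to build the pattern by evaluating it inside the split call itself.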

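Ultimately, I want to read an external file of stop words and use it to split my sentences. So I want to be able to build the pattern from that external file rather than specify it directly. I have also noticed that I get very different behavior when I use 'pp' versus 'puts', and I am not sure why. I am using Ruby 2.0 and Notepad++ on Windows.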

require 'pp' 
str = "The force be with you."  
pp str.split(/(?:\bthe\b|\bwith\b)/i) 
=> ["", " force be ", " you."] 
pp str.split(/(?:\bthe\b|\bwith\b)/i).collect(&:strip).reject(&:empty?) 
=> ["force be", "you."] 

The last array above is the result I want. However, the following does not work:

require 'pp' 
stop_array = ["the", "with"] 
str = "The force be with you." 
pattern = "(?:" + stop_array.map{|i| "\b#{i}\b" }.join("|") + ")" 
puts pattern 
=> (?thwit) 
puts str.split(/#{pattern}/i) 
=> The force be with you. 
pp pattern 
=> "(?:\bthe\b|\bwith\b)" 
pp str.split(/#{pattern}/i) 
=> ["The force be with you."] 

Update: Using the comments below, I revised my original script. I also created a method to split the string.

require 'pp' 

class String
  def splitstop(stopwords=[])
    stopwords_regex = /\b(?:#{ Regexp.union(*stopwords).source })\b/i
    return split(stopwords_regex).collect(&:strip).reject(&:empty?)
  end
end

stop_array = ["the", "with", "over"] 

pp "The force be with you.".splitstop stop_array 
=> ["force be", "you."] 
pp "The quick brown fox jumps over the lazy dog.".splitstop stop_array 
=> ["quick brown fox jumps", "lazy dog."] 
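The question mentions reading the stop words from an external file. A minimal sketch of that step follows, assuming a plain-text file with one stop word per line. The helper names `load_stopwords` and `split_on_stopwords` are hypothetical, and a temporary file stands in for the real stop word file.

```ruby
require 'tempfile'

# Hypothetical helper: read one stop word per line, ignoring blank lines.
# strip also removes a trailing "\r" from Windows line endings.
def load_stopwords(path)
  File.readlines(path).map(&:strip).reject(&:empty?)
end

# Hypothetical helper: same splitting logic as splitstop above,
# written as a plain method instead of reopening String.
def split_on_stopwords(str, stopwords)
  regex = /\b(?:#{Regexp.union(stopwords).source})\b/i
  str.split(regex).map(&:strip).reject(&:empty?)
end

# A temporary file stands in for the external stop word list.
Tempfile.create('stopwords') do |f|
  f.write("the\nwith\n\nover\n")
  f.flush
  words = load_stopwords(f.path)
  p words  # => ["the", "with", "over"]
  p split_on_stopwords("The quick brown fox jumps over the lazy dog.", words)
  # => ["quick brown fox jumps", "lazy dog."]
end
```

Regexp.union escapes regex metacharacters in string arguments, so words read from a file are matched literally.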
+1

'/(?:\bthe\b|\bwith\b)/' is better written as '/\b(?:the|with)\b/'.

Answers

3

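I would do it this way:

str = "The force be with you."
stop_array = %w[the with]
stopwords_regex = /(?:#{ Regexp.union(stop_array).source })/i
str.split(stopwords_regex).map(&:strip) # => ["", "force be", "you."]

When using Regexp.union, it is important to watch out for the actual pattern that is generated:

/(?:#{ Regexp.union(stop_array) })/i
=> /(?:(?-mix:the|with))/i

The embedded (?-mix: turns off the case-insensitive flag inside the pattern, which can break the pattern and cause it to match the wrong things. Instead, you have to tell the engine to return just the pattern, without the flags: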

/(?:#{ Regexp.union(stop_array).source })/i 
=> /(?:the|with)/i 
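To make the effect of the embedded (?-mix: flags concrete, here is a small sketch comparing the two interpolations on the question's sentence. The variable names `with_flags` and `source_only` are chosen only for illustration.

```ruby
stop_array = %w[the with]
union = Regexp.union(stop_array)

# Interpolating a Regexp calls to_s, which embeds the flag group.
p union.to_s     # => "(?-mix:the|with)"
# source returns the bare pattern text with no flags attached.
p union.source   # => "the|with"

with_flags  = /(?:#{union})/i
source_only = /(?:#{union.source})/i

str = "The force be with you."
# The inner (?-mix: disables /i, so the capitalized "The" is not matched.
p str.split(with_flags)   # => ["The force be ", " you."]
# With the bare source, the outer /i applies to the whole pattern.
p str.split(source_only)  # => ["", " force be ", " you."]
```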

This is also why pattern = "(?:\bthe\b|\bwith\b)" does not work:

/#{pattern}/i # => /(?:\x08the\x08|\x08with\x08)/i 

In a double-quoted string, Ruby treats "\b" as a backspace character. Instead, use:

pattern = "(?:\\bthe\\b|\\bwith\\b)" 
/#{pattern}/i # => /(?:\bthe\b|\bwith\b)/i 
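The difference is easy to check by looking at the bytes each literal produces. The sketch below also shows single-quoted strings as one alternative to doubling the backslashes. The variable names are illustrative only.

```ruby
# "\b" in double quotes is the backspace character (byte 0x08).
double_quoted = "\b"
# '\b' in single quotes is a backslash followed by the letter b,
# the same two characters as "\\b".
single_quoted = '\b'

p double_quoted.bytes   # => [8]
p single_quoted.bytes   # => [92, 98]

stop_array = %w[the with]
# Single-quoted strings do not interpolate, so concatenate instead.
pattern = stop_array.map { |w| '\b' + w + '\b' }.join("|")
regex = Regexp.new("(?:#{pattern})", Regexp::IGNORECASE)

p "The force be with you.".split(regex)  # => ["", " force be ", " you."]
```

This also accounts for the odd `puts pattern` output in the question: puts writes the raw backspace characters to the terminal, whereas pp prints the escaped, inspected form of the string.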
0

You have to escape the backslashes:

"\\b#{i}\\b" 

pattern = "(?:" + stop_array.map{|i| "\\b#{i}\\b" }.join("|") + ")" 

And a minor improvement/simplification:

pattern = "\\b(?:" + stop_array.join("|") + ")\\b" 

Then:

str.split(/#{pattern}/i) # => ["", " force be ", " you."] 

If your stop word list is short, I think this is the right approach.
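One caveat when joining the words by hand: any regex metacharacter in a stop word becomes part of the pattern. A minimal sketch of guarding against that with Regexp.escape follows. The stop word "a.b" is a made-up example chosen only to show the effect.

```ruby
# "a.b" is a hypothetical stop word containing a regex metacharacter.
stop_array = ["a.b", "with"]

unescaped = "\\b(?:" + stop_array.join("|") + ")\\b"
escaped   = "\\b(?:" + stop_array.map { |w| Regexp.escape(w) }.join("|") + ")\\b"

str = "axb goes with a.b here"
# Unescaped, the dot matches any character, so "axb" is treated as a stop word.
p str.split(/#{unescaped}/i)  # => ["", " goes ", " ", " here"]
# Escaped, only the literal "a.b" is matched.
p str.split(/#{escaped}/i)    # => ["axb goes ", " ", " here"]
```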

+0

Using the generated pattern, show how this would solve the OP's problem.

0
stop_array = ["the", "with"] 
re = Regexp.union(stop_array.map{|w| /\s*\b#{Regexp.escape(w)}\b\s*/i}) 

"The force be with you.".split(re) # => 
[ 
    "", 
    "force be", 
    "you." 
] 
0
require 'pp'

s = "the force be with you."
stop_words = %w|the with is|
# dynamically create a case-insensitive regexp
regexp = Regexp.new stop_words.join('|'), true
result = []
while (match = regexp.match(s))
    # keep only non-empty text that precedes the stop word
    result << match.pre_match unless match.pre_match.empty?
    s = match.post_match
end
# the last unmatched content, if any
result << s
# strip whitespace and drop blank pieces; the non-bang methods never return nil
result = result.map(&:strip).reject(&:empty?)

pp result 
=> ["force be", "you."] 