2013-08-24 23 views
0

Removing duplicate text from multiple strings

a = "This is Product A with property B and propery C. Buy it now!" 
b = "This is Product B with property X and propery Y. Buy it now!" 
c = "This is Product C having no properties. Buy it now!" 

I am looking for an algorithm that can do this:

> magic(a, b, c) 
=> ['A with property B and propery C', 
    'B with property X and propery Y', 
    'C having no properties'] 

I have to find the duplicated parts across 1000+ texts. Great performance is not a must, but it would be nice.

- Update

I am looking for sequences of words. So if:

d = 'This is Product D with text engraving: "Buy". Buy it now!' 

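the first "Buy" should not count as a duplicate. I guess a match has to span at least n consecutive words before it is treated as a duplicate.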

+2

The question is unclear. How do you define duplicate text? –

+1

Why is "with property" not removed, even though it is repeated? :D – fl00r

+1

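1) What if there is a fourth string, "Bumblebee zebra"? Would `magic(a, b, c, d)` be expected to return all four strings unmodified? 2) How is positional information supposed to be used? For example, the `magic` example removes "Buy it now!" even though it sits at a different position in each string. Maybe you are looking for a `diff` function? –

Answers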

3
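# Length of the longest prefix shared by all strings.
def common_prefix_length(*args) 
    first = args.shift 
    # First index at which some string differs from the first one;
    # fall back to the full length when all strings are identical.
    (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } } || first.size 
end 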

def magic(*args) 
    i = common_prefix_length(*args) 
    args = args.map { |a| a[i..-1].reverse } 
    i = common_prefix_length(*args) 
    args.map { |a| a[i..-1].reverse } 
end 

a = "This is Product A with property B and propery C. Buy it now!" 
b = "This is Product B with property X and propery Y. Buy it now!" 
c = "This is Product C having no properties. Buy it now!" 

magic(a,b,c) 
# => ["A with property B and propery C", 
#  "B with property X and propery Y", 
#  "C having no properties"] 
+0

I like that your solution looks at sequences rather than individual words! – Willian
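The update to the question asks for matches on word sequences rather than on characters. As a rough sketch of how the same prefix/suffix idea could be applied to whole words: `common_prefix_count` and `magic_words` are hypothetical names that do not appear in the answer above, and splitting on whitespace is an assumption about what counts as a word.

```ruby
# Number of leading elements shared by every array in `lists`.
def common_prefix_count(lists)
  return 0 if lists.empty?
  shortest = lists.map(&:size).min
  (0...shortest).each do |i|
    return i unless lists.all? { |l| l[i] == lists.first[i] }
  end
  shortest
end

# Strip the leading and trailing words that all strings share.
def magic_words(*strings)
  words = strings.map(&:split)
  pre = common_prefix_count(words)
  words = words.map { |w| w.drop(pre) }
  # The common suffix is the common prefix of the reversed remainders.
  suf = common_prefix_count(words.map(&:reverse))
  words.map { |w| w.take(w.size - suf).join(' ') }
end

a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"
d = 'This is Product D with text engraving: "Buy". Buy it now!'

p magic_words(a, b, c, d)
```

Because the comparison works on whole words, punctuation attached to a word stays with it: the results here end in `C.`, `Y.`, `properties.` and `"Buy".`, and the quoted "Buy" in `d` is left alone.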

3

Your data

sentences = [ 
    "This is Product A with property B and propery C. Buy it now!", 
    "This is Product B with property X and propery Y. Buy it now!", 
    "This is Product C having no properties. Buy it now!" 
] 

Your magic

def magic(data) 
    prefix, postfix = 0, -1 
    data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break while true 
    data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break while true 
    data.map{ |d| d[prefix..postfix] } 
end 

Your output

magic(sentences) 
#=> [ 
#=> "A with property B and propery C", 
#=> "B with property X and propery Y", 
#=> "C having no properties" 
#=> ] 

Or you can use `loop` instead of `while true`:

def magic(data) 
    prefix, postfix = 0, -1 
    loop{ data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break } 
    loop{ data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break } 
    data.map{ |d| d[prefix..postfix] } 
end 
+0

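Your `magic` does not terminate when `data` happens to be an array of identical strings. You have to check that the characters of `d` at the `prefix` and `postfix` indices actually exist. – sawa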

+0

Good catch, @sawa! Fixed. – fl00r
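One way to address sawa's point is to bound both scans by the length of the shortest string, so neither index can run past the end. The following is only a sketch under that assumption, not fl00r's actual fix; `bounded_magic` is a hypothetical name.

```ruby
# Strip the common character prefix and suffix from every string,
# never scanning past the end of the shortest string.
def bounded_magic(data)
  shortest = data.map(&:size).min || 0

  prefix = 0
  while prefix < shortest && data.map { |d| d[prefix] }.uniq.size == 1
    prefix += 1
  end

  # The suffix may not overlap the prefix inside the shortest string.
  suffix = 0
  while prefix + suffix < shortest && data.map { |d| d[-1 - suffix] }.uniq.size == 1
    suffix += 1
  end

  data.map { |d| d[prefix, d.size - prefix - suffix] }
end

p bounded_magic(["same", "same"])  # terminates even for identical strings
```

Slicing with a start and a length (`d[start, length]`) also avoids having to special-case an empty suffix.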

-1

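Edit: this code has bugs. I am leaving my answer up for reference, because I don't like it when people delete their answers after being downvoted. Everybody makes mistakes :-)

I like @falsetru's approach, but found the code unnecessarily complex. Here is my attempt: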

def common_prefix_length(strings) 
    i = 0 
    i += 1 while strings.map{|s| s[i] }.uniq.size == 1 
    i 
end 

def common_suffix_length(strings) 
    common_prefix_length(strings.map(&:reverse)) 
end 

def uncommon_infixes(strings) 
    pl = common_prefix_length(strings) 
    sl = common_suffix_length(strings) 
    strings.map{|s| s[pl...-sl] } 
end 
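One pitfall in the slice `s[pl...-sl]` above: when the strings share no suffix, `sl` is 0, and `s[pl...0]` returns an empty string instead of the rest of `s`. A small sketch of slicing by explicit length instead; `slice_infix` is a hypothetical helper, not part of the answer.

```ruby
# Cut `pl` characters from the front and `sl` from the back of `s`.
def slice_infix(s, pl, sl)
  s[pl, [s.size - pl - sl, 0].max]
end

s = "prefix-middle"
p s[7...-0]             # => ""
p slice_infix(s, 7, 0)  # => "middle"
```

This only covers the slicing step; the prefix scan still needs a bound so that it stops on identical strings.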

Since the OP may be concerned about performance, I ran a quick benchmark:

require 'fruity' 
require 'securerandom' 

prefix = 'PREFIX ' 
suffix = ' SUFFIX' 
test_data = Array.new(1000) do 
    prefix + SecureRandom.hex + suffix 
end 

def fl00r_meth(data) 
    prefix, postfix = 0, -1 
    data.map{ |d| d[prefix] }.uniq.size == 1 && prefix += 1 or break while true 
    data.map{ |d| d[postfix] }.uniq.size == 1 && postfix -= 1 or break while true 
    data.map{ |d| d[prefix..postfix] } 
end 

def falsetru_common_prefix_length(*args) 
    first = args.shift 
    (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } } 
end 

def falsetru_meth(*args) 
    i = falsetru_common_prefix_length(*args) 
    args = args.map { |a| a[i..-1].reverse } 
    i = falsetru_common_prefix_length(*args) 
    args.map { |a| a[i..-1].reverse } 
end 

def padde_common_prefix_length(strings) 
    i = 0 
    i += 1 while strings.map{|s| s[i] }.uniq.size == 1 
    i 
end 

def padde_common_suffix_length(strings) 
    padde_common_prefix_length(strings.map(&:reverse)) 
end 

def padde_meth(strings) 
    pl = padde_common_prefix_length(strings) 
    sl = padde_common_suffix_length(strings) 
    strings.map{|s| s[pl...-sl] } 
end 

compare do 
    fl00r do 
    fl00r_meth(test_data.dup) 
    end 

    falsetru do 
    falsetru_meth(*test_data.dup) 
    end 

    padde do 
    padde_meth(test_data.dup) 
    end 
end 

These are the results:

Running each test once. Test will take about 1 second. 
fl00r is similar to padde 
padde is faster than falsetru by 30.000000000000004% ± 10.0% 
+1

Would the downvoter care to explain? –

+1

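Your code will not terminate when the data happens to be an array of identical strings. You have to check that a character exists at index `i` in the strings. – sawa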

+0

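Your code is similar to the first version of my answer. I switched to the current version because I thought creating and discarding the intermediate arrays (`map {..}.uniq.size`) might hurt performance. According to your benchmark, I was wrong. ;) – falsetru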