编辑Haskell中的距离算法 - 性能调优

我试图在Haskell中实现levenshtein距离（或编辑距离），但当字符串长度增加时，其性能迅速下降。编辑Haskell中的距离算法 - 性能调优

我对Haskell还是比较新的，所以如果你能给我一些关于如何改进算法的建议，那将会很好。我已经尝试过“预先计算”值（inits），但是因为它没有改变任何东西，所以我恢复了这个改变。

我知道在Hackage上已经有一个editDistance的实现，但我需要它在任意标记列表上工作，而不一定是字符串。另外，我觉得它有点复杂，至少与我的版本相比。

所以，这里是代码：

 
-- standard levenshtein distance between two lists 
editDistance  :: Eq a => [a] -> [a] -> Int 
editDistance s1 s2 = editDistance' 1 1 1 s1 s2 

-- weighted levenshtein distance 
-- ins, sub and del are the costs for the various operations 
editDistance'  :: Eq a => Int -> Int -> Int -> [a] -> [a] -> Int 
editDistance' _ _ ins s1 [] = ins * length s1 
editDistance' _ _ ins [] s2 = ins * length s2 
editDistance' del sub ins s1 s2 
    | last s1 == last s2 = editDistance' del sub ins (init s1) (init s2) 
    | otherwise   = minimum [ editDistance' del sub ins s1 (init s2)  + del -- deletion 
            , editDistance' del sub ins (init s1) (init s2) + sub -- substitution 
            , editDistance' del sub ins (init s1) s2  + ins -- insertion 
            ]

这似乎是一个正确的实现，至少它给了完全相同的结果，因为这online tool。

在此先感谢您的帮助！如果您需要任何其他信息，请告诉我。

问候， BZN

来源

2011-04-01 bzn

忽略，这是一个糟糕的算法（应memoizing，我到达那个第二）...

使用O（1）原语，而不是为O（n）

一个问题是你使用了一大堆调用O（n）的列表（haskell列表是单独链接列表）。一个更好的数据结构会给你O（1）操作，我用Vector：

import qualified Data.Vector as V 

-- standard levenshtein distance between two lists 
editDistance  :: Eq a => [a] -> [a] -> Int 
editDistance s1 s2 = editDistance' 1 1 1 (V.fromList s1) (V.fromList s2) 

-- weighted levenshtein distance 
-- ins, sub and del are the costs for the various operations 
editDistance'  :: Eq a => Int -> Int -> Int -> V.Vector a -> V.Vector a -> Int 
editDistance' del sub ins s1 s2 
    | V.null s2 = ins * V.length s1 
    | V.null s1 = ins * V.length s2 
    | V.last s1 == V.last s2 = editDistance' del sub ins (V.init s1) (V.init s2) 
    | otherwise   = minimum [ editDistance' del sub ins s1 (V.init s2)  + del -- deletion 
            , editDistance' del sub ins (V.init s1) (V.init s2) + sub -- substitution 
            , editDistance' del sub ins (V.init s1) s2  + ins -- insertion 
            ]

是O（N）的列表包括初始化，length的操作，和last（虽然INIT能够偷懒的最小）。所有这些操作都是使用Vector的O（1）。

虽然真正的标杆应该使用Criterion，一个快速和肮脏的基准：

str2 = replicate 15 'a' ++ replicate 25 'b' 
str1 = replicate 20 'a' ++ replicate 20 'b' 
main = print $ editDistance str1 str2

显示矢量版本需要0.09秒而串采取1.6秒，所以我们节省了大约一个数量级，甚至没有看你的editDistance算法。

现在怎么样记忆结果？

更大的问题显然是需要记忆。我将此作为了解monad-memo包裹的机会 - 我的上帝真的太棒了！对于一个额外的约束条件（您需要Ord a），您基本上不费力气就可以进行记忆。代码：

import qualified Data.Vector as V 
import Control.Monad.Memo 

-- standard levenshtein distance between two lists 
editDistance  :: (Eq a, Ord a) => [a] -> [a] -> Int 
editDistance s1 s2 = startEvalMemo $ editDistance' (1, 1, 1, (V.fromList s1), (V.fromList s2)) 

-- weighted levenshtein distance 
-- ins, sub and del are the costs for the various operations 
editDistance' :: (MonadMemo (Int, Int, Int, V.Vector a, V.Vector a) Int m, Eq a) => (Int, Int, Int, V.Vector a, V.Vector a) -> m Int 
editDistance' (del, sub, ins, s1, s2) 
    | V.null s2 = return $ ins * V.length s1 
    | V.null s1 = return $ ins * V.length s2 
    | V.last s1 == V.last s2 = memo editDistance' (del, sub, ins, (V.init s1), (V.init s2)) 
    | otherwise = do 
     r1 <- memo editDistance' (del, sub, ins, s1, (V.init s2)) 
     r2 <- memo editDistance' (del, sub, ins, (V.init s1), (V.init s2)) 
     r3 <- memo editDistance' (del, sub, ins, (V.init s1), s2) 
     return $ minimum [ r1 + del -- deletion 
         , r2 + sub -- substitution 
         , r3 + ins -- insertion 
            ]

您会看到memoization是如何需要一个“键”（请参阅MonadMemo类）？我将所有参数打包成一个很大的丑陋元组。它也需要一个“价值”，这是你的结果Int。然后，只需使用“备忘录”功能即可即插即用您想要记忆的值。

对于基准我用较短，但较大的距离，字符串：

$ time ./so # the memoized vector version 
12 

real 0m0.003s 

$ time ./so3 # the non-memoized vector version 
12 

real 1m33.122s

千万别想运行非memoized字符串版本，我想，这将需要大约15分钟在最低限度。至于我，我现在喜欢monad-memo - 感谢Eduard的包装！

编辑：String和Vector之间的差异在memoized版本中没有那么多，但当距离达到200左右时仍然增长到2倍，所以仍然值得。

编辑：也许我应该解释为什么更大的问题是“明显”记忆结果。好吧，如果你看一下原始算法的心脏：

[ editDistance' ... s1   (V.init s2) + del 
, editDistance' ... (V.init s1) (V.init s2) + sub 
, editDistance' ... (V.init s1) s2   + ins]

这是相当清楚的editDistance' s1 s2结果在3调用editDistance'一个电话......每一个来电editDistance'三次......还有三个时间......和AHHH！指数爆炸！幸运的是，大多数电话是相同的！例如（使用-->的“电话”和eD为editDistance'）：

eD s1 s2 --> eD s1 (init s2)    -- The parent 
      , eD (init s1) s2 
      , eD (init s1) (init s2) 
eD (init s1) s2 --> eD (init s1) (init s2)   -- The first "child" 
        , eD (init (init s1)) s2 
        , eD (init (init s1)) (init s2) 
eD s1 (init s2) --> eD s1 (init (init s2)) 
        , eD (init s1) (init s2) 
        , eD (init s1) (init (init s2))

只需通过考虑父母和两个孩子立即可以看到通话ed (init s1) (init s2)做三次。另一个孩子与父母共享呼叫，所有的孩子都与另一个孩子（和他们的孩子，提示Monty Python skit）共享许多呼叫。

这将是一个有趣的，也许有启发性的练习，使runMemo类似的函数返回所使用的缓存结果的数量。

来源

2011-04-01 15:20:32

哇，这是伟大的。我以前听说过莫诺化，但我从来没有想到这很容易！当你说“忽略这是一个不好的算法（应该记忆，我到那一秒）......”，你是指算法的整体结构还是仅仅是应该使用记忆的事实？对我来说，算法本身看起来不错。 :) – bzn 2011-04-01 21:14:57

bzn：我只是认为这不是记忆的事实。如果您之前没有看过记忆，那么请参阅[Haskell wiki]（http://www.haskell.org/haskellwiki/Memoization），CS算法手册，或两者。如果没有记忆，你可以多次计算大部分值，但是记忆只能计算一次，否则就会查找以前计算的结果。例如，要计算列表的第一个元素（'editDist s1（init s2）'），函数最终将计算'editDist（init s1）（init s2）'。这是调用者列表中的第二个元素，并且是被调用者列表中的第三个元素！ – 2011-04-01 22:09:07

@bzn我添加了一个编辑，谈论为什么这个问题是“显然”memoization。 – 2011-04-01 22:39:07

您需要记忆editDistance'。有很多方法可以做到这一点，例如递归定义的数组。

来源

2011-04-01 15:42:10 augustss

当我投票赞成你时，为什么会出现独角兽和气球？如果您在ICFP上发布太多论文，会发生什么？ – 2011-04-01 16:06:50

我希望我可以对独角兽要求任何责任。 – augustss 2011-04-01 18:41:39

@TomMD这是一个SO愚人节礼物。 – sclv 2011-04-01 20:26:35

我知道有已经是editDistance实现上Hackage，但我需要它在任意标记列表的操作，不一定字符串

是否有令牌的数量有限？我建议你试着简单地设计一个从令牌到角色的映射。毕竟有10,646 characters at your disposal。

来源

2011-04-01 16:37:30

谢谢，但现在我要使用我的解决方案，因为像TomMD提出的那样调整它，应该使其速度够快 - 毕竟这是我所需要的。：P – bzn 2011-04-01 21:20:47

如前所述，memoization是你所需要的。此外，您正在查看从右到左的编辑距离，这对字符串来说效率并不高，无论方向如何，编辑距离都是相同的。那就是：editDistance (reverse a) (reverse b) == editDistance a b

为了解决备忘录部分有很多库可以帮助你。在我的例子中，我选择了MemoTrie，因为它很容易使用并且在这里表现很好。

import Data.MemoTrie(memo2) 

editDistance' del sub ins = memf 
    where 
    memf = memo2 f 
    f s1  []  = ins * length s1 
    f []  s2  = ins * length s2 
    f (x:xs) (y:ys) 
    | x == y = memf xs ys 
    | otherwise = minimum [ del + memf xs (y:ys), 
          sub + memf (x:xs) ys, 
          ins + memf xs ys]

正如你所看到的你所需要的是添加记忆。其余的都是一样的，只不过我们从最后的名单开始。

来源

2011-04-05 07:36:36 HaskellElephant

感谢您的提示。 – bzn 2011-04-07 19:47:19

备忘录+1。这个真棒！ – Rotsor 2011-07-16 22:25:33

但为什么你在'f（x：xs）（y：ys）'的第一个方程中使用'f'而不是'memf'？ – Rotsor 2011-07-16 22:28:40

这个版本比那些记忆的版本快得多，但我仍然希望它更快。适用于100个字符长的字符串。我是用其他距离编写的（改变初始化函数和成本），并使用经典的动态编程数组技巧。长长的一行可以转换成一个单独的函数，顶部'做'，但我喜欢这种方式。

import Data.Array.IO 
import System.IO.Unsafe 

editDistance = dist ini med 

dist :: (Int -> Int -> Int) -> (a -> a -> Int) -> [a] -> [a] -> Int 
dist i f a b = unsafePerformIO $ distM i f a b 

-- easy to create other distances 
ini i 0 = i 
ini 0 j = j 
ini _ _ = 0 
med a b = if a == b then 0 else 2 


distM :: (Int -> Int -> Int) -> (a -> a -> Int) -> [a] -> [a] -> IO Int 
distM ini f a b = do 
     let la = length a 
     let lb = length b 

     arr <- newListArray ((0,0),(la,lb)) [ini i j | i<- [0..la], j<-[0..lb]] :: IO (IOArray (Int,Int) Int) 

-- all on one line 
     mapM_ (\(i,j) -> readArray arr (i-1,j-1) >>= \ld -> readArray arr (i-1,j) >>= \l -> readArray arr (i,j-1) >>= \d-> writeArray arr (i,j) $ minimum [l+1,d+1, ld + (f (a !! (i-1)) (b !! (j-1))) ]) [(i,j)| i<-[1..la], j<-[1..lb]] 

     readArray arr (la,lb)

来源

2012-03-16 23:43:19 tolysz

All on一行不是很好的布局......并且unsafePerformIO并不是真的必要，绝对不可取 - 代码可以用ST monad进行重写，只需进行很少的更改。 – Oliver 2012-05-29 11:12:08

人们推荐你使用通用的记忆化库，但定义Levenshtein距离普通动态规划的简单的任务是绰绰有余。一个非常简单的多态基于列表的实现：

distance s t = 
    d !!(length s)!!(length t) 
    where d = [ [ dist m n | n <- [0..length t] ] | m <- [0..length s] ] 
      dist i 0 = i 
      dist 0 j = j 
      dist i j = minimum [ d!!(i-1)!!j+1 
          , d!!i!!(j-1)+1 
          , d!!(i-1)!!(j-1) + (if s!!(i-1)==t!!(j-1) 
                then 0 else 1) 
          ]

或者，如果你需要长时间序列实际速度，你可以使用一个可变数组：

import Data.Array 
import qualified Data.Array.Unboxed as UA 
import Data.Array.ST 
import Control.Monad.ST 


-- Mutable unboxed and immutable boxed arrays 
distance :: Eq a => [a] -> [a] -> Int 
distance s t = d UA.! (ls , lt) 
    where s' = array (0,ls) [ (i,x) | (i,x) <- zip [0..] s ] 
      t' = array (0,lt) [ (i,x) | (i,x) <- zip [0..] t ] 
      ls = length s 
      lt = length t 
      (l,h) = ((0,0),(length s,length t)) 
      d = runSTUArray $ do 
       m <- newArray (l,h) 0 
       for_ [0..ls] $ \i -> writeArray m (i,0) i 
       for_ [0..lt] $ \j -> writeArray m (0,j) j 
       for_ [1..lt] $ \j -> do 
           for_ [1..ls] $ \i -> do 
            let c = if s'!(i-1)==t'! (j-1) 
              then 0 else 1 
            x <- readArray m (i-1,j) 
            y <- readArray m (i,j-1) 
            z <- readArray m (i-1,j-1) 
            writeArray m (i,j) $ minimum [x+1, y+1, z+c ] 
       return m 

for_ xs f = mapM_ f xs

来源

2012-03-17 14:58:35

编辑Haskell中的距离算法 - 性能调优

回答

相关问题