2014-09-10 70 views
1

我试图实现真正高效的Clojure函数来计算Damerau-Levenshtein distance。我决定使用this algorithm(附带的源代码应该是C++)来计算Levenshtein距离并添加一些行以使其适用于DLD。Damerau-Levenshtein距离的高效实现

以下是我在的Common Lisp(我希望它可以帮助)创建:(?函数式)

(defun damerau-levenshtein (x y) 
    (declare (type string x y) 
      #.*std-opts*) 
    (let* ((x-len (length x)) 
     (y-len (length y)) 
     (v0 (apply #'vector (mapa-b #'identity 0 y-len))) 
     (v1 (make-array (1+ y-len) :element-type 'integer)) 
     (v* (make-array (1+ y-len) :element-type 'integer))) 
    (do ((i 0 (1+ i))) 
     ((= i x-len) (aref v0 y-len)) 
     (setf (aref v1 0) (1+ i)) 
     (do ((j 0 (1+ j))) 
      ((= j y-len)) 
     (let* ((x-i (char x i)) 
       (y-j (char y j)) 
       (cost (if (char-equal x-i y-j) 0 1))) 
      (setf (aref v1 (1+ j)) (min (1+ (aref v1 j)) 
             (1+ (aref v0 (1+ j))) 
             (+ (aref v0 j) cost))) 
      (when (and (plusp i) (plusp j)) 
      (let ((x-i-1 (char x (1- i))) 
        (y-j-1 (char y (1- j))) 
        (val (+ (aref v* (1- j)) cost))) 
       (when (and (char-equal x-i y-j-1) 
         (char-equal x-i-1 y-j) 
         (< val (aref v1 (1+ j)))) 
       (setf (aref v1 (1+ j)) val)))))) 
     (rotatef v* v0 v1)))) 

现在,我怕我不能把它翻译成真正有效和地道的Clojure代码。我非常感谢任何建议,我认为它对未来的许多读者也可能非常有用。

P.S.我发现this implementation,但如果它是有效的,我怀疑它使用一些过时的contrib功能(deep-merge-withbool-to-binary):

(defn damerau-levenshtein-distance 
    [a b] 
    (let [m (count a) 
     n (count b) 
     init (apply deep-merge-with (fn [a b] b) 
        (concat 
        ;;deletion 
        (for [i (range 0 (+ 1 m))] 
         {i {0 i}}) 
        ;;insertion 
        (for [j (range 0 (+ 1 n))] 
         {0 {j j}}))) 
     table (reduce 
       (fn [d [i j]] 
       (deep-merge-with 
        (fn [a b] b) 
        d 
        (let [cost (bool-to-binary (not (= (nth a (- i 1)) 
              (nth b (- j 1))))) 
         x 
          (min 
          (+ ((d (- i 1)) 
           j) 1) ;;deletion 
          (+ ((d i) 
           (- j 1)) 1) ;;insertion 
          (+ ((d (- i 1)) 
           (- j 1)) cost)) ;;substitution)) 
         val (if (and (> i 1) 
           (> j 1) 
           (= (nth a (- i 1)) 
            (nth b (- j 2))) 
           (= (nth a (- i 2)) 
            (nth b (- j 1)))) 
         (min x (+ ((d (- i 2)) 
            (- j 2)) ;;transposition 
            cost)) 
         x)] 
        {i {j val}}))) 
       init 
       (for [j (range 1 (+ 1 n)) 
        i (range 1 (+ 1 m))] [i j]))] 
    ((table m) n))) 

回答

1

OK,这应该做的伎俩(基于KIMA's answer):

(defn da-lev [str1 str2] 
    (let [l1 (count str1) 
     l2 (count str2) 
     mx (new-matrix :ndarray (inc l1) (inc l2))] 
    (mset! mx 0 0 0) 
    (dotimes [i l1] 
    (mset! mx (inc i) 0 (inc i))) 
    (dotimes [j l2] 
    (mset! mx 0 (inc j) (inc j))) 
    (dotimes [i l1] 
    (dotimes [j l2] 
     (let [i+ (inc i) j+ (inc j) 
      i- (dec i) j- (dec j) 
      cost (if (= (.charAt str1 i) 
         (.charAt str2 j)) 
        0 1)] 
     (mset! mx i+ j+ 
       (min (inc (mget mx i j+)) 
        (inc (mget mx i+ j)) 
        (+ (mget mx i j) cost))) 
     (if (and (pos? i) (pos? j) 
        (= (.charAt str1 i) 
        (.charAt str2 j-)) 
        (= (.charAt str1 i-) 
        (.charAt str2 j))) 
      (mset! mx i+ j+ 
        (min (mget mx i+ j+) 
         (+ (mget mx i- j-) cost))))))) 
    (mget mx l1 l2))) 

请注意,您需要core.matrix库,它是不是标准(尽管它的名字)。

[net.mikera/core.matrix "0.29.1"] 

库生活在命名空间clojure.core.matrix:人们可以用Leiningen这种方式安装。要按原样使用这个解决方案,你应该从命名空间中'添加'符号到命名空间中。

2

最近,我用Clojure写一个高效的Levenshtein距离的函数来计算之间的编辑地面实况文本和ocr引擎结果。 递归实现的性能不足以快速计算两个整页之间的levenshtein距离,所以我的实现使用动态编程。 它使用core.matrix来处理矩阵的东西,而不是下降到Java 2D数组。 为damerau-levenshtein添加换位材料应该不难。

(defn lev [str1 str2] 
    (let [mat (new-matrix :ndarray (inc (count str1)) (inc (count str2))) 
     len1 (count str1) len2 (count str2)] 
    (mset! mat 0 0 0) 
    (dotimes [i lein1] 
    (mset! mat (inc i) 0 (inc i))) 
    (dotimes [j len2] 
    (mset! mat 0 (inc j) (inc j))) 
    (dotimes [dj len2] 
    (dotimes [di len1] 
     (let [j (inc dj) i (inc di)] 
     (mset! mat i j 
       (cond 
       (= (.charAt ^String str1 di) (.charAt ^String str2 dj)) 
       (mget mat di dj) 
       :else 
       (min (inc (mget mat di j)) (inc (mget mat i dj)) 
        (inc (mget mat di dj)))))))) 
    (mget mat len1 len2)))) 

希望这有助于