如何提高Clojure中的文本处理性能？

我正在用Clojure编写一个简单的桌面搜索引擎，作为了解更多关于该语言的一种方式。到目前为止，我的程序在文本处理阶段的表现非常糟糕。如何提高Clojure中的文本处理性能？

在文本处理我已经到：

清理不需要的字符;
将字符串转换为小写;
拆分文档以获取单词列表;
构建将每个单词与其文档中的出现相关联的映射。

下面是代码：

(ns txt-processing.core 
    (:require [clojure.java.io :as cjio]) 
    (:require [clojure.string :as cjstr]) 
    (:gen-class)) 

(defn all-files [path] 
    (let [entries (file-seq (cjio/file path))] 
    (filter (memfn isFile) entries))) 

(def char-val 
    (let [value #(Character/getNumericValue %)] 
    {:a (value \a) :z (value \z) 
    :A (value \A) :Z (value \Z) 
    :0 (value \0) :9 (value \9)})) 

(defn is-ascii-alpha-num [c] 
    (let [n (Character/getNumericValue c)] 
    (or (and (>= n (char-val :a)) (<= n (char-val :z))) 
     (and (>= n (char-val :A)) (<= n (char-val :Z))) 
     (and (>= n (char-val :0)) (<= n (char-val :9)))))) 

(defn is-valid [c] 
    (or (is-ascii-alpha-num c) 
     (Character/isSpaceChar c) 
     (.equals (str \newline) (str c)))) 

(defn lower-and-replace [c] 
    (if (.equals (str \newline) (str c)) \space (Character/toLowerCase c))) 

(defn tokenize [content] 
    (let [filtered (filter is-valid content) 
     lowered (map lower-and-replace filtered)] 
    (cjstr/split (apply str lowered) #"\s+"))) 

(defn process-content [content] 
    (let [words (tokenize content)] 
    (loop [ws words i 0 hmap (hash-map)] 
     (if (empty? ws) 
     hmap 
     (recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i))))))) 

(defn -main [& args] 
    (doseq [file (all-files (first args))] 
    (let [content (slurp file) 
      oc-list (process-content content)] 
     (println "File:" (.getPath file) 
       "| Words to be indexed:" (count oc-list)))))

当我有这个问题的another implementation在Haskell，我比较两者，你可以在下面的输出看到。

Clojure的版本：

$ lein uberjar 
Compiling txt-processing.core 
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar 
Including txt-processing-0.1.0-SNAPSHOT.jar 
Including clojure-1.5.1.jar 
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar 
$ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data 
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033 
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028 
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562 
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754 
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418 
File: ../data/.directory | Words to be indexed: 3 
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191 
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378 
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451 
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049 
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721 
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494 
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642 

real 2m2.164s 
user 2m3.868s 
sys  0m0.978s

哈斯克尔版本：

$ ghc -rtsopts --make txt-processing.hs 
[1 of 1] Compiling Main    (txt-processing.hs, txt-processing.o) 
Linking txt-processing ... 
$ time ./txt-processing ../data/ +RTS -K12m 
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033 
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028 
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562 
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754 
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418 
File: ../data/.directory | Words to be indexed: 3 
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191 
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378 
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451 
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049 
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721 
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494 
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642 

real 0m9.086s 
user 0m8.591s 
sys  0m0.463s

我认为（串 - >懒惰序列）在Clojure的实现转换查杀性能。我该如何改进它？

P.S：这些测试中使用的所有代码和数据可以下载here。

来源

2013-04-27 luisgabriel

什么jar是JVM的启动时间吗？ Haskell有类似的开销吗？ – georgek 2013-04-27 23:40:36

我机器中的JVM启动时间大约为1秒。我认为Haskell由于运行时系统（RTS）有一些开销，但我应该比JVM低得多。 – luisgabriel 2013-04-28 00:41:41

s /我应该/它应该/ – luisgabriel 2013-04-28 01:18:43

有些事情可以做，可能会加速这一代码了：

1）代替字符之间的映射你的chars到char-val，只是做直接价值的比较。这在Java中速度更快的原因会更快。

2）重复使用str可将单字符值转换为完整字符串。再次考虑直接使用字符值。同样，对象创建速度也很慢，与Java中的相同。

3）您应该用clojure.core/frequencies替换process-content。也许检查frequencies来源，看看它是如何更快。

4）如果您必须在循环中更新(hash-map)，请使用transient。请参阅：http://clojuredocs.org/clojure_core/clojure.core/transient

还要注意的是(hash-map)返回PersistentArrayMap，所以你每次调用update-in创建新实例 - 因此慢，为什么你应该使用瞬变。

5）这是你的朋友：(set! *warn-on-reflection* true) - 你有相当多的反思，可以从type hints

Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved. 
Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved. 
Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved. 
Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved. 
Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved. 
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved. 
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.

来源

2013-04-28 02:11:10 noahlz

完美！只是把类型提示总时间减少到7秒！ ;） – luisgabriel 2013-04-28 04:53:34

不错！ 8-）很高兴帮助！ – noahlz 2013-04-28 04:57:34

我不能使用'clojure.core /频率'，因为我需要单词位置来进一步阶段，如索引和查询。 – luisgabriel 2013-04-28 04:59:09

受益只是为了比较的缘故，这里是一个基于正则表达式Clojure的版本

(defn re-index 
    "Returns lazy sequence of vectors of regexp matches and their start index" 
    [^java.util.regex.Pattern re s] 
    (let [m (re-matcher re s)] 
     ((fn step [] 
      (when (. m (find)) 
      (cons (vector (re-groups m)(.start m)) (lazy-seq (step)))))))) 

(defn group-by-keep 
    "Returns a map of the elements of coll keyed by the result of 
    f on each element. The value at each key will be a vector of the 
    results of r on the corresponding elements." 
    [f r coll] 
    (persistent! 
    (reduce 
     (fn [ret x] 
     (let [k (f x)] 
      (assoc! ret k (conj (get ret k []) (r x))))) 
     (transient {}) coll))) 

(defn word-indexed 
    [s] 
    (group-by-keep 
    (comp clojure.string/lower-case first) 
    second 
    (re-index #"\w+" s)))

来源

2013-04-29 12:48:47 NielsK

如何提高Clojure中的文本处理性能？

回答

相关问题