2013-04-27 33 views
8

我正在用Clojure编写一个简单的桌面搜索引擎,作为了解更多关于该语言的一种方式。到目前为止,我的程序在文本处理阶段的表现非常糟糕。如何提高Clojure中的文本处理性能?

在文本处理我已经到:

  • 清理不需要的字符;
  • 将字符串转换为小写;
  • 拆分文档以获取单词列表;
  • 构建将每个单词与其文档中的出现相关联的映射。

下面是代码:

(ns txt-processing.core 
    (:require [clojure.java.io :as cjio]) 
    (:require [clojure.string :as cjstr]) 
    (:gen-class)) 

(defn all-files [path] 
    (let [entries (file-seq (cjio/file path))] 
    (filter (memfn isFile) entries))) 

(def char-val 
    (let [value #(Character/getNumericValue %)] 
    {:a (value \a) :z (value \z) 
    :A (value \A) :Z (value \Z) 
    :0 (value \0) :9 (value \9)})) 

(defn is-ascii-alpha-num [c] 
    (let [n (Character/getNumericValue c)] 
    (or (and (>= n (char-val :a)) (<= n (char-val :z))) 
     (and (>= n (char-val :A)) (<= n (char-val :Z))) 
     (and (>= n (char-val :0)) (<= n (char-val :9)))))) 

(defn is-valid [c] 
    (or (is-ascii-alpha-num c) 
     (Character/isSpaceChar c) 
     (.equals (str \newline) (str c)))) 

(defn lower-and-replace [c] 
    (if (.equals (str \newline) (str c)) \space (Character/toLowerCase c))) 

(defn tokenize [content] 
    (let [filtered (filter is-valid content) 
     lowered (map lower-and-replace filtered)] 
    (cjstr/split (apply str lowered) #"\s+"))) 

(defn process-content [content] 
    (let [words (tokenize content)] 
    (loop [ws words i 0 hmap (hash-map)] 
     (if (empty? ws) 
     hmap 
     (recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i))))))) 

(defn -main [& args] 
    (doseq [file (all-files (first args))] 
    (let [content (slurp file) 
      oc-list (process-content content)] 
     (println "File:" (.getPath file) 
       "| Words to be indexed:" (count oc-list))))) 

当我有这个问题的another implementation在Haskell,我比较两者,你可以在下面的输出看到。

Clojure的版本:

$ lein uberjar 
Compiling txt-processing.core 
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar 
Including txt-processing-0.1.0-SNAPSHOT.jar 
Including clojure-1.5.1.jar 
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar 
$ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data 
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033 
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028 
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562 
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754 
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418 
File: ../data/.directory | Words to be indexed: 3 
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191 
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378 
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451 
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049 
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721 
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494 
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642 

real 2m2.164s 
user 2m3.868s 
sys  0m0.978s 

哈斯克尔版本:

$ ghc -rtsopts --make txt-processing.hs 
[1 of 1] Compiling Main    (txt-processing.hs, txt-processing.o) 
Linking txt-processing ... 
$ time ./txt-processing ../data/ +RTS -K12m 
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033 
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028 
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562 
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754 
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418 
File: ../data/.directory | Words to be indexed: 3 
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191 
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378 
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451 
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049 
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721 
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494 
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642 

real 0m9.086s 
user 0m8.591s 
sys  0m0.463s 

我认为( - >懒惰序列)在Clojure的实现转换查杀性能。我该如何改进它?

P.S:这些测试中使用的所有代码和数据可以下载here

+1

什么jar是JVM的启动时间吗? Haskell有类似的开销吗? – georgek 2013-04-27 23:40:36

+0

我机器中的JVM启动时间大约为1秒。我认为Haskell由于运行时系统(RTS)有一些开销,但我应该比JVM低得多。 – luisgabriel 2013-04-28 00:41:41

+0

s /我应该/它应该/ – luisgabriel 2013-04-28 01:18:43

回答

4

有些事情可以做,可能会加速这一代码了:

1)代替字符之间的映射你的charschar-val,只是做直接价值的比较。这在Java中速度更快的原因会更快。

2)重复使用str可将单字符值转换为完整字符串。再次考虑直接使用字符值。同样,对象创建速度也很慢,与Java中的相同。

3)您应该用clojure.core/frequencies替换process-content。也许检查frequencies来源,看看它是如何更快。

4)如果您必须在循环中更新(hash-map),请使用transient。请参阅:http://clojuredocs.org/clojure_core/clojure.core/transient

还要注意的是(hash-map)返回PersistentArrayMap,所以你每次调用update-in创建新实例 - 因此慢,为什么你应该使用瞬变。

5)这是你的朋友:(set! *warn-on-reflection* true) - 你有相当多的反思,可以从type hints

Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved. 
Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved. 
Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved. 
Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved. 
Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved. 
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved. 
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved. 
+0

完美!只是把类型提示总时间减少到7秒! ;) – luisgabriel 2013-04-28 04:53:34

+0

不错! 8-)很高兴帮助! – noahlz 2013-04-28 04:57:34

+0

我不能使用'clojure.core /频率',因为我需要单词位置来进一步阶段,如索引和查询。 – luisgabriel 2013-04-28 04:59:09

0

受益只是为了比较的缘故,这里是一个基于正则表达式Clojure的版本

(defn re-index 
    "Returns lazy sequence of vectors of regexp matches and their start index" 
    [^java.util.regex.Pattern re s] 
    (let [m (re-matcher re s)] 
     ((fn step [] 
      (when (. m (find)) 
      (cons (vector (re-groups m)(.start m)) (lazy-seq (step)))))))) 

(defn group-by-keep 
    "Returns a map of the elements of coll keyed by the result of 
    f on each element. The value at each key will be a vector of the 
    results of r on the corresponding elements." 
    [f r coll] 
    (persistent! 
    (reduce 
     (fn [ret x] 
     (let [k (f x)] 
      (assoc! ret k (conj (get ret k []) (r x))))) 
     (transient {}) coll))) 

(defn word-indexed 
    [s] 
    (group-by-keep 
    (comp clojure.string/lower-case first) 
    second 
    (re-index #"\w+" s)))