优化Haskell代码

我正在努力学习Haskell，并且在关于马尔可夫文本链的reddit中的一篇文章中，我决定首先在Python中实现Markov文本生成，现在在Haskell中。但是我注意到我的python实现比Haskell版本快，即使Haskell编译为本地代码。我想知道我应该怎么做才能让Haskell代码运行得更快，现在我相信由于使用Data.Map而不是hashmaps，所以速度非常慢，但我不确定优化Haskell代码

我将发布Python代码还有Haskell。使用相同的数据，Python需要大约3秒，而Haskell接近16秒。

不言而喻，我会采取任何建设性的批评:)。

import random 
import re 
import cPickle 
class Markov: 
    def __init__(self, filenames): 
     self.filenames = filenames 
     self.cache = self.train(self.readfiles()) 
     picklefd = open("dump", "w") 
     cPickle.dump(self.cache, picklefd) 
     picklefd.close() 

    def train(self, text): 
     splitted = re.findall(r"(\w+|[.!?',])", text) 
     print "Total of %d splitted words" % (len(splitted)) 
     cache = {} 
     for i in xrange(len(splitted)-2): 
      pair = (splitted[i], splitted[i+1]) 
      followup = splitted[i+2] 
      if pair in cache: 
       if followup not in cache[pair]: 
        cache[pair][followup] = 1 
       else: 
        cache[pair][followup] += 1 
      else: 
       cache[pair] = {followup: 1} 
     return cache 

    def readfiles(self): 
     data = "" 
     for filename in self.filenames: 
      fd = open(filename) 
      data += fd.read() 
      fd.close() 
     return data 

    def concat(self, words): 
     sentence = "" 
     for word in words: 
      if word in "'\",?!:;.": 
       sentence = sentence[0:-1] + word + " " 
      else: 
       sentence += word + " " 
     return sentence 

    def pickword(self, words): 
     temp = [(k, words[k]) for k in words] 
     results = [] 
     for (word, n) in temp: 
      results.append(word) 
      if n > 1: 
       for i in xrange(n-1): 
        results.append(word) 
     return random.choice(results) 

    def gentext(self, words): 
     allwords = [k for k in self.cache] 
     (first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache])) 
     sentence = [first, second] 
     while len(sentence) < words or sentence[-1] is not ".": 
      current = (sentence[-2], sentence[-1]) 
      if current in self.cache: 
       followup = self.pickword(self.cache[current]) 
       sentence.append(followup) 
      else: 
       print "Wasn't able to. Breaking" 
       break 
     print self.concat(sentence) 

Markov(["76.txt"])

module Markov 
(train 
, fox 
) where 

import Debug.Trace 
import qualified Data.Map as M 
import qualified System.Random as R 
import qualified Data.ByteString.Char8 as B 


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int) 

train :: [B.ByteString] -> Database 
train (x:y:[]) = M.empty 
train (x:y:z:xs) = 
    let l = train (y:z:xs) 
    in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l 

main = do 
    contents <- B.readFile "76.txt" 
    print $ train $ B.words contents 

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

来源

2010-05-26 Masse

有趣的是，也在寻找答案。 16秒对3秒是一个很大的区别。 – wvd 2010-05-26 17:32:09

顺便说一下，缩进似乎已经被Python代码弄坏了...... – 2010-05-26 17:54:53

我不认为你的Haskell代码能够完成你想要的东西。如果你检查输出，你会发现'M.Map String Int'映射中没有大于2的值。你的意思是'n + o'还是'o + 1'而不是'n + 1'？ – 2010-05-26 18:18:56

我试图避免做任何奇特或微妙的事情。这只是两种方法来进行分组;第一个强调模式匹配，第二个不强调。

import Data.List (foldl') 
import qualified Data.Map as M 
import qualified Data.ByteString.Char8 as B 

type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int) 

train2 :: [B.ByteString] -> Database2 
train2 words = go words M.empty 
    where go (x:y:[]) m = m 
      go (x:y:z:xs) m = let addWord Nothing = Just $ M.singleton z 1 
           addWord (Just m') = Just $ M.alter inc z m' 
           inc Nothing = Just 1 
           inc (Just cnt) = Just $ cnt + 1 
          in go (y:z:xs) $ M.alter addWord (x,y) m 

train3 :: [B.ByteString] -> Database2 
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words)) 
    where update m (x,y,z) = M.alter (addWord z) (x,y) m 
      addWord word = Just . maybe (M.singleton word 1) (M.alter inc word) 
      inc = Just . maybe 1 (+1) 

main = do contents <- B.readFile "76.txt" 
      let db = train3 $ B.words contents 
      print $ "Built a DB of " ++ show (M.size db) ++ " words"

我认为它们都比原始版本更快，但是我承认我只是试图对付我发现的第一个合理的语料库。

编辑按特拉维斯布朗的非常有效的一点，

train4 :: [B.ByteString] -> Database2 
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words)) 
    where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m 
      inc k _ = M.insertWith (+) k 1

来源

2010-05-26 18:45:20 Anthony

就我而言，我认为最好在这里使用比'alter'更具体的内容。我们知道在这种情况下我们永远不需要删除，而且必须像这样添加'Just'会削弱可读性。 – 2010-05-26 19:25:52

对不起，以延迟回复。你能否解释_为什么这是一个更快的解决方案？基本上都是一样的，除了压缩和下降。 – Masse 2010-08-21 14:55:06

一）你是如何编制的？（ghc -O2？）

b）哪个版本的GHC？ c）Data.Map是非常高效的，但你可以被诱骗到懒惰的更新 - 使用insertWith'，而不是insertWithKey。

d）不要将字符串转换为字符串。按照字符串的形式保存它们，并将它们存储在地图

来源

2010-05-26 17:26:29

版本是6.12.1。在你的帮助下，我能够从运行时中挤出1秒，但仍然远离python版本。 – Masse 2010-05-26 17:38:26

按照Don的建议，使用函数的更严格的版本：insertWithKey'（和M.insertWith'，因为您无论如何都会忽略关键字参数）。

它看起来像你的代码可能会建立很多thunk直到它到达[String]的末尾。

退房：http://book.realworldhaskell.org/read/profiling-and-optimization.html

...尤其是试图（通过章大约一半）作图堆。有兴趣看看你想要什么。

来源

2010-05-26 17:40:39 jberryman

我做了Don Stewart建议的更改。以前代码花费了41-44兆字节的内存，现在只花费了29分钟。对内存进行图形显示，TSO占用了大部分内存，然后是GHC.types，然后是代码中使用的其他数据类型。所有部分的记忆在一秒钟内迅速增加。之后，一秒钟TSO和GHC.types不断增加，所有其他开始缓慢后退。（如果我正在读图） – Masse 2010-05-26 18:09:06

1）我不清楚你的代码。 a）你定义“狐狸”，但不要使用它。你是否意味着我们试图帮助你使用“狐狸”而不是阅读文件？ b）你声明这是“模块马尔科夫”，然后在模块中有一个“主”。 c）System.Random不是必需的。如果您在发布之前清理一下代码，它的确帮助我们提供帮助。

2）使用ByteStrings和一些严格的操作，如唐说的。

3）使用-O2编译并使用-fforce-recomp确保您实际重新编译了代码。

4）尝试这个轻微的转换，它的工作速度非常快（0.005秒）。显然，输入是荒谬的小，所以你需要提供你的文件或只是自己测试它。

{-# LANGUAGE OverloadedStrings, BangPatterns #-} 
module Main where 

import qualified Data.Map as M 
import qualified Data.ByteString.Lazy.Char8 as B 


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int) 

train :: [B.ByteString] -> Database 
train xs = go xs M.empty 
    where 
    go :: [B.ByteString] -> Database -> Database 
    go (x:y:[]) !m = m 
    go (x:y:z:xs) !m = 
    let m' = M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + 1) z 1 old) (x, y) (M.singleton z 1) m 
    in go (y:z:xs) m' 

main = print $ train $ B.words fox 

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

来源

2010-05-26 18:17:58

嗯，我是一个初学者，喜欢标签说：P。我没有意识到将模块命名为Main之外的结果。而狐狸是我用来测试算法。检查小型输入比输入整本书更容易 – Masse 2010-05-26 19:06:41

这里有一个foldl'基础的版本，这似乎是约快两倍，你train：

train' :: [B.ByteString] -> Database 
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs) 
    where 
    f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)

我试了一下Gutenberg项目Huckleberry Finn（我认为是你的76.txt），它产生与你的函数相同的输出。我的时间比较是非常不科学的，但这种方法可能值得一看。

来源

2010-05-26 18:55:42

Data.Map是根据类别Ord比较需要一定时间的假设而设计的。对于字符串键，这可能不是—的情况，并且当字符串相等时，它从来就不是这种情况。您可能会也可能不会碰到这个问题，具体取决于您的语料库有多大以及有多少词语具有共同前缀。

我会试图尝试一个数据结构，该数据结构被设计为与序列键一起操作，例如Don Stewart友好建议的bytestring-trie包。

来源

2010-05-26 22:30:21

一个字节串trie？ http://hackage.haskell.org/package/bytestring-trie – 2010-05-27 02:31:04

@don：感谢您的更新。我相信你知道至少有60％的名字是hackage的内容:-) – 2010-05-27 15:11:26

优化Haskell代码

回答

相关问题