我正在努力学习Haskell,并且在关于马尔可夫文本链的reddit中的一篇文章中,我决定首先在Python中实现Markov文本生成,现在在Haskell中。但是我注意到我的python实现比Haskell版本快,即使Haskell编译为本地代码。我想知道我应该怎么做才能让Haskell代码运行得更快,现在我相信由于使用Data.Map而不是hashmaps,所以速度非常慢,但我不确定优化Haskell代码
我将发布Python代码还有Haskell。使用相同的数据,Python需要大约3秒,而Haskell接近16秒。
不言而喻,我会采取任何建设性的批评:)。
import random
import re
import cPickle
class Markov:
def __init__(self, filenames):
self.filenames = filenames
self.cache = self.train(self.readfiles())
picklefd = open("dump", "w")
cPickle.dump(self.cache, picklefd)
picklefd.close()
def train(self, text):
splitted = re.findall(r"(\w+|[.!?',])", text)
print "Total of %d splitted words" % (len(splitted))
cache = {}
for i in xrange(len(splitted)-2):
pair = (splitted[i], splitted[i+1])
followup = splitted[i+2]
if pair in cache:
if followup not in cache[pair]:
cache[pair][followup] = 1
else:
cache[pair][followup] += 1
else:
cache[pair] = {followup: 1}
return cache
def readfiles(self):
data = ""
for filename in self.filenames:
fd = open(filename)
data += fd.read()
fd.close()
return data
def concat(self, words):
sentence = ""
for word in words:
if word in "'\",?!:;.":
sentence = sentence[0:-1] + word + " "
else:
sentence += word + " "
return sentence
def pickword(self, words):
temp = [(k, words[k]) for k in words]
results = []
for (word, n) in temp:
results.append(word)
if n > 1:
for i in xrange(n-1):
results.append(word)
return random.choice(results)
def gentext(self, words):
allwords = [k for k in self.cache]
(first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
sentence = [first, second]
while len(sentence) < words or sentence[-1] is not ".":
current = (sentence[-2], sentence[-1])
if current in self.cache:
followup = self.pickword(self.cache[current])
sentence.append(followup)
else:
print "Wasn't able to. Breaking"
break
print self.concat(sentence)
Markov(["76.txt"])
-
module Markov
(train
, fox
) where
import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) =
let l = train (y:z:xs)
in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l
main = do
contents <- B.readFile "76.txt"
print $ train $ B.words contents
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
有趣的是,也在寻找答案。 16秒对3秒是一个很大的区别。 – wvd 2010-05-26 17:32:09
顺便说一下,缩进似乎已经被Python代码弄坏了...... – 2010-05-26 17:54:53
我不认为你的Haskell代码能够完成你想要的东西。如果你检查输出,你会发现'M.Map String Int'映射中没有大于2的值。你的意思是'n + o'还是'o + 1'而不是'n + 1'? – 2010-05-26 18:18:56