Haskell和可变结构的性能

我正在研究Optimizing Haskell code中给出的答案，并注意到与Python相比，使用小输入确实会导致更快的Haskell运行。Haskell和可变结构的性能

但随着数据集的规模不断扩大，Python占据了领先地位。使用基于散列表的版本提高了性能，但仍然落后。

更糟糕的是，我尝试将Python的字典转换为哈希表，并观察到硬性能。我真的很想知道发生了什么，因为我需要为将来的应用程序使用可变结构。

这里的稍微修改的Python代码：

3210

Haskell中，与原来的响应（train4），一个HashMap变体（trainHM2）和散列表音译（trainHT）：

{-# LANGUAGE BangPatterns,DeriveGeneriC#-} 

import GHC.Generics (Generic) 

import Data.List (foldl') 
import Data.Hashable 

import qualified Data.Map as M 

import qualified Data.HashMap.Strict as HM 
import qualified Data.ByteString.Char8 as B 

import qualified Data.HashTable.IO as HT 

--Using this instead of tuples yielded a 5~10% speedup 
data StringTuple = STP !B.ByteString !B.ByteString deriving(Ord,Eq,Generic) 
instance Hashable StringTuple 

type Database3 = M.Map StringTuple (M.Map B.ByteString Int) 
type DatabaseHM = HM.HashMap StringTuple (HM.HashMap B.ByteString Int) 
type DatabaseHT = HT.BasicHashTable StringTuple DatabaseInnerHT 
type DatabaseInnerHT = (HT.BasicHashTable B.ByteString Int) 

train4 :: [B.ByteString] -> Database3 
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words)) 
    where update m (x,y,z) = M.insertWith' (inc z) (STP x y) (M.singleton z 1) m 
      inc k _ = M.insertWith' (+) k 1 

trainHM2 :: [B.ByteString] -> DatabaseHM 
trainHM2 words = trainHM2G words HM.empty 
    where 
    trainHM2G (x:y:[]) !hm = hm 
    trainHM2G (x:y:z:rem) !hm = trainHM2G (y:z:rem) (HM.insertWith (inc z) (STP x y) (HM.singleton z 1) hm) 
      where inc k _ = HM.insertWith (+) k 1 


trainHT :: [B.ByteString] -> IO (DatabaseHT) 
trainHT words = do 
hm <- HT.new 

trainHT' words hm 
where 
    trainHT' (x:y:[]) !hm = return hm 
    trainHT' (x:y:z:rem) !hm = do 
    let pair = STP x y 
    inCache <- HT.lookup hm pair 
    case inCache of 
    Nothing -> do 
    htN <- HT.new :: IO (DatabaseInnerHT) 
    HT.insert htN z $! 1 
    HT.insert hm pair $! htN 
    Just ht -> do 
    cvM <- HT.lookup ht z 
    case cvM of 
     Nothing -> HT.insert ht z 1 
     Just cv -> HT.insert ht z $! (cv+1) 
    trainHT' (y:z:rem) hm 

main = do contents <- B.readFile "76.txt" 
     let bcont = B.split ' ' $ contents 
     print $ length bcont 
     let db = train4 $ bcont 
     print $ "Built a DB of " ++ show (M.size db) ++ " words" 
     --let db = trainHM2 $ bcont 
     --print $ "Built a DB of " ++ show (HM.size db) ++ " words"   
     --db <- trainHT $ (bcont) 
     --print $ "Built a DB"

临时C++ 11音译（需要-fpermissive编译，感觉自由纠正它）：

#include <iostream> 
#include <fstream> 
#include <sstream> 
#include <vector> 
#include <unordered_map> 
#include <tuple> 

/* 
Hash stuff here 
Taken from https://stackoverflow.com/a/7111460/314327 
*/ 
size_t hash_combiner(size_t left, size_t right) //replacable 
{ return left^right;} 

template<int index, class...types> 
struct hash_impl { 
    size_t operator()(size_t a, const std::tuple<types...>& t) const { 
     typedef typename std::tuple_element<index, std::tuple<types...>>::type nexttype; 
     hash_impl<index-1, types...> next; 
     size_t b = std::hash<nexttype>()(std::get<index>(t)); 
     return next(hash_combiner(a, b), t); 
    } 
}; 
template<class...types> 
struct hash_impl<0, types...> { 
    size_t operator()(size_t a, const std::tuple<types...>& t) const { 
     typedef typename std::tuple_element<0, std::tuple<types...>>::type nexttype; 
     size_t b = std::hash<nexttype>()(std::get<0>(t)); 
     return hash_combiner(a, b); 
    } 
}; 

namespace std { 
    template<class...types> 
    struct hash<std::tuple<types...>> { 
     size_t operator()(const std::tuple<types...>& t) { 
      const size_t begin = std::tuple_size<std::tuple<types...>>::value-1; 
      return hash_impl<begin, types...>()(1, t); //1 should be some largervalue 
     } 
    }; 
} 

/* 
Hash stuff end 
*/ 

using namespace std; 

/* 
Split, from https://stackoverflow.com/a/236803/314327 
*/ 
vector<string> &split(const string &s, char delim, vector<string> &elems) { 
stringstream ss(s); 
string item; 
while (getline(ss, item, delim)) { 
elems.push_back(item); 
} 
return elems; 
} 

vector<string> split(const string &s, char delim) { 
vector<string> elems; 
split(s, delim, elems); 
return elems; 
} 
/* 
Split end 
*/ 

typedef tuple<string,string> STP; 

unordered_map< STP,unordered_map< string,int > > train(vector<string> &words) 
{ 
unordered_map< STP,unordered_map< string,int > > cache; 

for(int i=0;i<words.size()-2;i++) 
{ 
    STP tup = make_tuple(words[i],words[i+1]); 
    auto it = cache.find(tup); 
    if(it!=cache.end()) 
    { 
    auto it2 = it->second.find(words[i+2]); 
    if(it2!=it->second.end()) 
    { 
    it2->second += 1; 
    } 
    else 
    it->second[words[i+2]] = 1; 
    } 
    else 
    {  
    unordered_map< string,int > cacheInner; 
    cacheInner[words[i+2]] = 1; 
    cache[tup] = cacheInner; 
    } 
} 

return cache; 
} 

int main() 
{ 
ifstream ifs("76.txt"); 
stringstream buf; 
buf << ifs.rdbuf(); 
string contents(buf.str()); 

auto words = split(contents,' '); 
cout << words.size(); 

auto wordCache = train(words); 

cout << "\nHashtable count " << wordCache.size(); 

cout << "\n"; 
return 0; 
}

并且结果：

C++（GCC 4.6.3）

$ g++ -O3 -fpermissive -std=c++0x cpptest.cpp -o cpptest 
$ /usr/bin/time -f "%E" ./cpptest 
1255153 

Hashtable count 64442 
0:01.02

的Python（2.7）

$ /usr/bin/time -f "%E" ./pytest.py 
Total of 1255153 splitted words 
Built a db of length 64442 
0:02.62

的Haskell（GHC 7.4 .1） - “train4”

$ ghc -fllvm -O2 -rtsopts -fforce-recomp -funbox-strict-fields hasktest.hs -o hasktest 
[1 of 1] Compiling Main    (hasktest.hs, hasktest.o) 
Linking hasktest ... 
$ /usr/bin/time -f "%E" ./hasktest 
1255153 
"Built a DB of 64442 words" 
0:06.35

哈斯克尔 - “trainHM2”

$ /usr/bin/time -f "%E" ./hasktest 
1255153 
"Built a DB of 64442 words" 
0:04.23

哈斯克尔 - “trainHT” - （？这是接近到什么Python不为字典，我猜）使用基本型

$ /usr/bin/time -f "%E" ./hasktest 
1255153 
"Built a DB" 
0:10.42

使用线性或杜鹃两个表

0:06.06 
0:05.69

使用杜鹃用于最外层表和线性在里面

0:04.17

剖析表明，有相当多的GC的，所以，用+ RTS -A256M

0:02.11

_{输入数据，我选择了76.txt作为答案的一个指示和复制的全文12次。它应该达到约7 MB。}

_{测试是在一个VirtualBox容器中的Ubuntu 12.04上运行的，使用一个i5-520M内核。不止一次运行，所有结果都非常接近。}

最后的结果是这个微基准精致漂亮，但还有什么代码改善，考虑到：

杜鹃&线性可能更适合这个数据集，但“通用”Python解决方案在这方面没有多少优化就很好，Valgrind报告称C++ & Python版本大约需要60MBs，而Haskell RTS从125MBs（Cuckoo & Linear）至409MBs（基本的，较大的堆）的内存用于同一任务。不会在生产环境中调整垃圾收集器是否有害？是否有可能重构代码，所以它的内存使用量更少？

更新：

我猜 “减少垃圾” 就是我要找的。我知道Haskell不能像C++那样工作，但我想知道是否有可能减少命令代码中产生的垃圾，因为C++示例在没有任何空间泄漏的情况下消耗了一半内存。它希望是内存使用和执行时间方面的改进（因为GC会更少）。

更新2：

计算表施工期间的长度减少了内存占用肯定（下降到大约40MBs，实际上！），这将导致GC需要更长的时间，导致较慢的执行时间（由于丢弃了从列表中懒得读出的值，我认为？）。

是的，哈希表的操作需要很长时间。我会尝试模仿改动，看看它是否有进一步改善。

来源

2013-10-09 felace

“还有什么需要改进的代码”是一个很大的问题。你可以说得更详细点吗？你说有很多GC，但是别说你从分析中学到了什么，或者出现了什么问题。 – jberryman

远不是一个完整的答案，但是你通过打印长度来强制整个单词列表，并保留它在字典构造的内存中。通过不打印长度，我节省了大约100M的基本，更大的堆大小。如果你需要它，你可以在字典构造的同时创建一个长度值。 –

这不是一个真正的答案，但它太多了，所以我会把它放在这里，直到有更好的东西出现。我没有看到你的哈希表代码（我真正看过的唯一部分）有什么明显的错误，除了一些重构/打高尔夫。

首先，内存使用情况对我来说并不是很让人意外。使用-A256M，您要求RTS的最小分配区域为256M，这样可以占用您的内存使用量。如果数据在GC之后被升级或复制，内存使用量将会增加。另外请注意，相对于其他语言，Haskell数据结构往往有点缺乏内存空间，例如参见Memory footprint of Haskell data types。考虑到这两个因素，我并不感到有大量分配区域的内存使用总量。

像HashMap或字符串trie结构可以使用更少的内存，伴随着使用哈希表以外的数据结构的附带缺点。

说到分配区域，这个代码有点微的基准，因为几乎所有分配的数据（主要是字节串数据和内部散列表值）都是长期存在的（它们持续到程序结束时）。这使得您的测试程序处于一个非常大的分配区域特别有利的情况，而如果这个数据库构造只是较大程序的一部分，则较大区域的成本可能占主导地位。

对于生产环境的最佳GC设置，很难告诉外部实际完整程序的上下文。我可以说，如果表现真的很重要，值得花些时间调整GC标志。如果您启用了线程运行时，则更是如此。

除了内存问题，我强烈怀疑hashtables包在这里对你有效。根据配置文件，4个最昂贵的功能是lookup/go,lookup,insert和delete'.go。我认为如果它的价值相当于Data.Map.alter，那么你的一些业务可以合并在一起以提高性能。毕竟，如果Python字典没有针对像cache[key] += 1这样的案例进行优化，我会感到非常惊讶。

来源

2013-10-09 06:39:04

Haskell和可变结构的性能

回答

相关问题