Haskell的性能时最小/最大/总和在大列表

我一直在试验用下面的Haskell代码：调整一个numItems和numLists当Haskell的性能时最小/最大/总和在大列表

data Foo = Foo 
    { fooMin :: Float 
    , fooMax :: Float 
    , fooSum :: Float 
    } deriving Show 


getLocalFoo :: [Float] -> Foo 
getLocalFoo x = Foo a b c 
    where 
    a = minimum x 
    b = maximum x 
    c = sum x 

getGlobalFoo :: [Foo] -> Foo 
getGlobalFoo x = Foo a b c 
    where 
    a = minimum $ fmap fooMin x 
    b = maximum $ fmap fooMax x 
    c = sum $ fmap fooSum x 


main :: IO() 
main = do 
    let numItems = 2000 
    let numLists = 100000 
    putStrLn $ "numItems: " ++ show numItems 
    putStrLn $ "numLists: " ++ show numLists 

    -- Create an infinite list of lists of floats, x is [[Float]] 
    let x = take numLists $ repeat [1.0 .. numItems] 

    -- Print two first elements of each item 
    print $ take 2 (map (take 2) x) 

    -- First calculate local min/max/sum for each float list 
    -- then calculate the global min/max/sum based on the results. 
    print . getGlobalFoo $ fmap getLocalFoo x

并依次测试运行时：

低尺寸：

numItems: 4.0 
numLists: 2 
[[1.0,2.0],[1.0,2.0]] 
Foo {fooMin = 1.0, fooMax = 4.0, fooSum = 20.0} 

real 0m0.005s 
user 0m0.004s 
sys 0m0.001s

高尺寸：

numItems: 2000.0 
numLists: 100000 
[[1.0,2.0],[1.0,2.0]] 
Foo {fooMin = 1.0, fooMax = 2000.0, fooSum = 1.9999036e11} 

real 0m33.116s 
user 0m33.005s 
sys 0m0.109s

我已经在我看来直观和简单的方式写了这个代码在没有考虑到性能，但是我很担心，这是远远没有达到最佳的代码，因为我实际上可以通过列表来折叠，然后方式多次必要？

有人可以提出一个更好的实施这个测试？

来源

2014-02-08 toeplitz

您正在计算33秒内三亿个元素的三次统计。这大约每秒2000万个元素。这听起来对你来说效率低下吗？（然而，你确实有一个巨大的空间泄露，但这是另一回事，这会帮助你解决这个问题） –

虽然你可能是对的关于这一点，我的问题的主要目的是了解代码是否可以在wrt上得到改进。性能。 – toeplitz

我想upvote这个评论为“你有一个巨大的空间泄漏，但是...” – epsilonhalbe

使用foldl库可以一次高效运行多个折叠。事实上，它非常好，所以你不需要把你的列表分成子列表。您可以将所有列表拼接成一个巨型列表并直接折叠。

方法如下：

import Control.Applicative 
import qualified Control.Foldl as L 

data Foo = Foo 
    { fooMin :: Maybe Float 
    , fooMax :: Maybe Float 
    , fooSum :: Float 
    } deriving Show 

foldFloats :: L.Fold Float Foo 
foldFloats = Foo <$> L.minimum <*> L.maximum <*> L.sum 
-- or: foldFloats = liftA3 Foo L.minimum L.maximum L.sum 

main :: IO() 
main = do 
    let numItems = 2000 
    let numLists = 100000 
    putStrLn $ "numItems: " ++ show numItems 
    putStrLn $ "numLists: " ++ show numLists 

    -- Create an infinite list of lists of floats, x is [[Float]] 
    let x = replicate numLists [1.0 .. numItems] 

    -- Print two first elements of each item 
    print $ take 2 (map (take 2) x) 

    print $ L.fold foldFloats (concat x)

从您的代码的主要区别是：

我用replicate n，这是同样的事情take n . repeat。事实上，这就是replicate实际上是如何定义的
我不打扰单独处理子列表。我只是concat他们在一起，并在一次通过折叠。
我使用Maybe作为最小和最大值，因为我需要处理空列表的情况。
此代码是更快

这里是数字：

$ time ./fold 
numItems: 2000.0 
numLists: 100000 
[[1.0,2.0],[1.0,2.0]] 
Foo {fooMin = Just 1.0, fooMax = Just 2000.0, fooSum = 3.435974e10} 

real 0m5.796s 
user 0m5.756s 
sys 0m0.024s

foldl是一个非常小的，简单易学库。你可以了解更多关于它here。

来源

2014-02-09 00:08:18

谢谢，这是一个整洁的解决方案，但总数是不同的？看到我原来的帖子。 – toeplitz

如果你在你的机器上编译和运行我的例子，你会得到什么结果？ –

与您得到的数字相同，但与我原始示例中的不一样。 – toeplitz

Mon to的救援。您的所有操作 - 总和，最小值和最大值 - 都可以表示为monoids。对于最小值和最大值，我们需要从semigroups中将它包含到Option中，因为我们需要以某种方式表示空集合的最小值和最大值。（另一种方法是将自己限制为非空集合，那么我们可以使用半群而不是单体。）

我们需要的另一件事是确保在每个步骤中强制执行所有计算。为此，我们声明Foo的NFData实例，添加我们使用的monoid类型的一些缺失实例，以及在折叠操作期间强制值的辅助函数。

import Control.DeepSeq 
import qualified Data.Foldable as F 
import Data.Semigroup 

-- Declare the data type so that each field is a monoid. 
data Foo a = Foo 
    { fooMin :: Option (Min a) 
    , fooMax :: Option (Max a) 
    , fooSum :: Sum a 
    } deriving Show 

-- Make a Monoid instance - just by combining individual fields. 
instance (Ord a, Num a) => Monoid (Foo a) where 
    mempty = Foo mempty mempty mempty 
    mappend (Foo n1 x1 s1) (Foo n2 x2 s2) = Foo (n1 <> n2) (x1 <> x2) (s1 <> s2) 

-- Add missing NFData instances 
instance (NFData a) => NFData (Option a) where 
    rnf (Option x) = rnf x `seq`() 
instance (NFData a) => NFData (Min a) where 
    rnf (Min x) = rnf x `seq`() 
instance (NFData a) => NFData (Max a) where 
    rnf (Max x) = rnf x `seq`() 
instance (NFData a) => NFData (Sum a) where 
    rnf (Sum x) = rnf x `seq`() 

-- Also add an instance for Foo 
instance (NFData a) => NFData (Foo a) where 
    rnf (Foo n x s) = rnf n `seq` rnf x `seq` rnf s `seq`() 

-- Convert a single element into Foo. 
locFoo :: a -> Foo a 
locFoo x = Foo (return $ Min x) (return $ Max x) (Sum x) 

-- A variant of foldMap that uses left fold and forces monoid 
-- elements on the way. 
foldMap' :: (F.Foldable f, Monoid m, NFData m) => (a -> m) -> f a -> m 
foldMap' f = F.foldl' (\m x -> (mappend $!! m) (f x)) mempty 

main :: IO() 
main = do 
    let numItems = 2000 
    let numLists = 100000 
    putStrLn $ "numItems: " ++ show numItems 
    putStrLn $ "numLists: " ++ show numLists 

    -- Create an infinite list of lists of floats, x is [[Float]] 
    let x = take numLists $ repeat [1.0 .. numItems] :: [[Float]] 

    -- Print two first elements of each item 
    print $ take 2 (map (take 2) x) 

    -- First calculate local min/max/sum for each float list 
    -- then calculate the global min/max/sum based on the results. 
    print . foldMap' (foldMap' locFoo) $ x

来源

2014-02-08 19:59:42

感谢您的答复。不过，我对这里的优点感到困惑，因为我的示例“高尺寸”参数使用此代码提供了超过2分钟的运行时间。 – toeplitz

@toeplitz代码未针对速度进行优化。它的优点是程序只消耗固定的内存，而不依赖于列表的大小。所有三个结果（最小/最大/总和）一起计算，因此不需要将整个列表保留在内存中 - 它是同时构建和使用的。 –

也许单倍便宜。尝试运行一些类似的测试：

{-# LANGUAGE BangPatterns #-} 
import Data.List 

getLocalFoo :: [Float] -> Foo 
getLocalFoo [] = error "getLocalFoo: empty list" 
getLocalFoo (x:xs) = foldl' f (Foo x x x) xs 
    where f (Foo !min1 !max1 !sum1) y = 
      Foo (min1 `min` y) (max1 `max` y) (sum1 + y)

及其类似的getGlobalFoo。

来源

2014-02-08 21:12:18 chi

Haskell的性能时最小/最大/总和在大列表

回答

相关问题