Cross-module optimization in GHC

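I have a non-recursive function to calculate the longest common subsequence that seems to perform well (ghc 7.6.1, compiled with the -O2 -fllvm flags) if I measure it with Criterion in the same module. On the other hand, if I turn the function into a module, export just that function (as recommended here), and then measure again with Criterion, I get a ~2x slowdown (which goes away if I move the Criterion test back into the module where the function is defined). I tried marking the function with an INLINE pragma, which made no difference to the cross-module performance measurements.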

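It seems to me that GHC might be doing a strictness analysis that works well when the function and main (from which the function is reachable) are in the same module, but not when they are split across modules. I would appreciate pointers on how to modularize the function so that it performs well when called from other modules. The code in question is too big to paste; you can see it here if you want to try it out. A small example of what I am trying to do is below (with code snippets):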

-- Function to find longest common subsequence given unboxed vectors a and b 
-- It returns indices of LCS in a and b 
lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int) 
lcs a b | (U.length a > U.length b) = lcsh b a True 
     | otherwise = lcsh a b False 

-- This section below measures performance of lcs function - if I move it to 
-- a different module, performance degrades ~2x - mean goes from ~1.25us to ~2.4us 
-- on my test machine 
{-- 
config :: Config 
config = defaultConfig { cfgSamples = ljust 100 } 

a = U.fromList ['a'..'j'] :: Vector Char 
b = U.fromList ['a'..'k'] :: Vector Char 

suite :: [Benchmark] 
suite = [ 
      bench "lcs 10" $ whnf (lcs a) b 
     ] 

main :: IO() 
main = defaultMainWith config (return()) suite 
--}

Source

2013-06-04 Sal

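Try using INLINEABLE instead. It might work better. – Carl

@Carl, tried it on the lcs function. Still the same. – Sal

I suspect the problem is that when it's all in one module, GHC can specialize the type variable 'a' to 'Char', since it is never used with any other type, which eliminates the type class overhead. You could try a 'SPECIALIZE' pragma (or just manually change the type to 'Char') and see whether that helps. – hammar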

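hammar is right: the important issue is that the compiler can see the type at which lcs is used at the same time as it can see the code, so it can specialise the code to that particular type.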

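If the compiler doesn't know the type at which the code will be used, it has no choice but to produce polymorphic code. That is bad for performance; I'm rather surprised that it's only a roughly 2× difference here. Polymorphic code means that many operations need a type-class lookup, and that at least makes it impossible to inline the looked-up function or to constant-fold sizes [e.g. for unboxed array/vector access].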
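To make the cost of polymorphic code concrete, here is a rough sketch of what the type-class lookup amounts to. The names countEqual, EqDict, countEqualDict and countEqualChar are made up for illustration and do not appear in the question's code; the explicit record only models the dictionary that gets passed behind the scenes and is not GHC's actual internal representation.

```haskell
-- Polymorphic version: without specialisation, GHC compiles this to a
-- function that takes an extra hidden argument, the Eq dictionary for 'a'.
countEqual :: Eq a => [a] -> [a] -> Int
countEqual xs ys = length (filter id (zipWith (==) xs ys))

-- Hand-written model of such a dictionary (hypothetical, for illustration).
newtype EqDict a = EqDict { eqBy :: a -> a -> Bool }

-- Roughly what the unspecialised code looks like: every comparison goes
-- through a field of the dictionary, i.e. a function that is unknown at
-- compile time and therefore cannot be inlined.
countEqualDict :: EqDict a -> [a] -> [a] -> Int
countEqualDict d xs ys = length (filter id (zipWith (eqBy d) xs ys))

-- What specialisation to Char amounts to: the comparison is statically
-- known, so it can be inlined down to a primitive character comparison.
countEqualChar :: String -> String -> Int
countEqualChar xs ys = length (filter id (zipWith (==) xs ys))
```

All three functions compute the same result; the difference is only in how much the optimiser knows about (==) when it compiles the loop.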

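You cannot get performance comparable to the one-module case with implementation and use in separate modules unless you make the code that needs specialising visible at the use site (or, if you know the needed types at the implementation site, specialise there, {-# SPECIALISE foo :: Char -> Int, foo :: Bool -> Integer #-} etc.).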
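A minimal sketch of specialising at the implementation site, assuming the needed element types are known in advance. The function commonPrefixLen is a hypothetical stand-in for a polymorphic worker and is not taken from the question's code; the pragmas are written one per type.

```haskell
-- Hypothetical polymorphic worker, standing in for something like cmp.
-- It counts how many leading elements two lists have in common.
commonPrefixLen :: Eq a => [a] -> [a] -> Int
commonPrefixLen xs ys = length (takeWhile id (zipWith (==) xs ys))

-- Ask GHC to also generate copies specialised to the element types we
-- expect callers to use. Calls at these types can then use the
-- specialised copy even from another module.
{-# SPECIALISE commonPrefixLen :: String -> String -> Int #-}
{-# SPECIALISE commonPrefixLen :: [Int] -> [Int] -> Int #-}
```

The drawback of this approach is that the library author has to anticipate every type of interest; callers using any other type still get the generic code.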

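Making the code visible at the use site is usually done by exposing the unfolding in the interface file, which you get by marking the function {-# INLINABLE #-}.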

I tried marking the function with INLINE pragma which didn't make any difference in the cross-module performance measurements.

Marking only

lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int) 
lcs a b | (U.length a > U.length b) = lcsh b a True 
     | otherwise = lcsh a b False

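INLINE or INLINABLE makes no difference, of course: that function is trivial, and the compiler exposes its unfolding anyway because it is so small. Even if its unfolding were not exposed, the difference would not be measurable.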

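You need to expose the unfoldings of the functions doing the actual work too, at least those of the polymorphic ones: lcsh, findSnakes, gridWalk and cmp (cmp is the one that is crucial here, but the others are necessary to 1. see that cmp is needed, and 2. call the specialised cmp from them).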
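The shape of that fix can be sketched as follows. This is not the actual diff code: matches, matchesWorker and same are hypothetical stand-ins for lcs, lcsh and cmp, written against plain lists so that the example needs nothing beyond base. Imagine these definitions living in their own library module that exports only matches.

```haskell
-- Exported entry point. It is tiny, so GHC would expose its unfolding
-- anyway; the pragma here is harmless but not the important one.
matches :: Eq a => [a] -> [a] -> Int
matches xs ys
  | length xs > length ys = matchesWorker ys xs
  | otherwise             = matchesWorker xs ys
{-# INLINABLE matches #-}

-- Worker doing the real work. Its unfolding must be exposed as well,
-- otherwise a caller can only reach the generic, dictionary-passing code.
matchesWorker :: Eq a => [a] -> [a] -> Int
matchesWorker xs ys = length (filter (uncurry same) (zip xs ys))
{-# INLINABLE matchesWorker #-}

-- Innermost comparison, the analogue of cmp: this is where knowing the
-- concrete element type pays off most.
same :: Eq a => a -> a -> Bool
same x y = x == y
{-# INLINABLE same #-}
```

With all three unfoldings available in the interface file, a client module that calls matches on Char data and is compiled with optimisation gives GHC what it needs to create a Char-specialised copy of the whole chain on the client side.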

With lcsh, findSnakes, gridWalk and cmp marked INLINABLE, the separate-module case gives

$ ./diffBench 
warming up 
estimating clock resolution... 
mean is 1.573571 us (320001 iterations) 
found 2846 outliers among 319999 samples (0.9%) 
    2182 (0.7%) high severe 
estimating cost of a clock call... 
mean is 40.54233 ns (12 iterations) 

benchmarking lcs 10 
mean: 1.628523 us, lb 1.618721 us, ub 1.638985 us, ci 0.950 
std dev: 51.75533 ns, lb 47.04237 ns, ub 58.45611 ns, ci 0.950 
variance introduced by outliers: 26.787% 
variance is moderately inflated by outliers

and the single-module case gives

$ ./oneModule 
warming up 
estimating clock resolution... 
mean is 1.726459 us (320001 iterations) 
found 2092 outliers among 319999 samples (0.7%) 
    1608 (0.5%) high severe 
estimating cost of a clock call... 
mean is 39.98567 ns (14 iterations) 

benchmarking lcs 10 
mean: 1.523183 us, lb 1.514157 us, ub 1.533071 us, ci 0.950 
std dev: 48.48541 ns, lb 44.43230 ns, ub 55.04251 ns, ci 0.950 
variance introduced by outliers: 26.791% 
variance is moderately inflated by outliers

The difference between the two is bearably small.

Source

2013-06-04 08:01:28

Good point. I forgot about specialization when trying to analyze this. – Sal
