Performance of Haskell programs

Bibliography

"Real World Haskell", Bryan O'Sullivan, John Goerzen & Don Stewart. Chapter 25.

Haskell performance tutorial @ CUFP, Johan Tibell:

http://blog.johantibell.com/2010/07/high-performance-haskell-tutorial-at.html

Pros and cons of high-level languages

Reasoning about performance

Profiling

Example: computing averages

avg :: [Double] -> Double
avg xs = sum xs / fromIntegral (length xs)
 -- sum :: [Double] -> Double
 -- length :: [Double] -> Int

main = do
   n <- read <$> getLine
   print (avg [1..n])

How much memory does this program use for input \(n\)?

Step 1: Collecting runtime statistics

$ stack build
$ stack exec example -- +RTS -s -RTS

Sample output

$ echo 1e7 | stack exec example -- +RTS -s -RTS
   1,680,163,544 bytes allocated in the heap
   1,202,239,544 bytes copied during GC
     335,815,968 bytes maximum residency (13 sample(s))
      68,227,472 bytes maximum slop
             762 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3198 colls,     0 par    0.280s   0.303s     0.0001s    0.0003s
  Gen  1        13 colls,     0 par    0.608s   0.773s     0.0595s    0.3081s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.548s  (  0.533s elapsed)
  GC      time    0.888s  (  1.076s elapsed)
  EXIT    time    0.008s  (  0.030s elapsed)
  Total   time    1.552s  (  1.639s elapsed)

  %GC     time      57.2%  (65.7% elapsed)

  Alloc rate    3,065,991,868 bytes per MUT second

  Productivity  42.8% of total user, 34.3% of total elapsed

Overview of statistics collected

762 MB total memory in use
%GC     time      57.2%  (65.7% elapsed)

Why did this program use so much memory?

Step 2: Profiling

$ stack build --executable-profiling
$ echo 1e7 | stack exec example -- +RTS -p -hc
# generates `example.prof':
        Fri Apr 27 11:38 2018 Time and Allocation Profiling Report  (Final)

           example +RTS -p -hc -RTS

        total time  =        1.17 secs   (1171 ticks @ 1000 us, 1 processor)
        total alloc = 1,680,126,272 bytes  (excludes profiling overheads)

COST CENTRE MODULE SRC                         %time %alloc

main        Main   src/Main.hs:(10,1)-(12,20)   88.1  100.0
avg         Main   src/Main.hs:7:1-42           11.9    0.0
...
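
If the automatically generated cost centres are too coarse, explicit ones can be added with GHC's SCC pragma. A minimal sketch (the cost-centre name "avg_body" is arbitrary):

avg :: [Double] -> Double
avg xs = {-# SCC "avg_body" #-} (sum xs / fromIntegral (length xs))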

Profiling (again)

Profiling again, this time collecting a heap profile broken down by type.

$ stack build --executable-profiling
$ echo 1e7 | stack exec example -- +RTS -p -hy
# generates `example.prof' and `example.hp'
$ hp2ps -c -e5in example.hp
# generates `example.ps' chart

Understanding the profile

avg :: [Double] -> Double
avg xs = sum xs / fromIntegral (length xs)

main = print (avg [1..1e7])

Because sum xs and length xs traverse the list separately, the entire list has to be retained in memory until the second traversal completes.

Improving retention

avg :: [Double] -> Double
avg xs = worker xs (0,0)
  where
  worker :: [Double] -> (Double,Int) -> Double
  worker [] (s,n)     = s / fromIntegral n
  worker (x:xs) (s,n) = worker xs (x+s,1+n)

Profiling again

Compiled with optimizations (-O2), the program runs in constant space.

Compiled without optimizations (-O0), it performs even worse than before!

Why this difference?

Lazy evaluation

cond :: Bool -> a -> a -> a
cond True x y  = x
cond False x y = y

bottom :: Int -> Int 
bottom x = bottom x

> bottom 0
 --- infinite loop, Ctrl-C
> cond True 42 (bottom 0)
42

Lazy evaluation (2)

Lazy evaluation of average

  avg [1,2,3]
= worker [1,2,3] (0,0)
= worker [2,3] (1+0,1+0)
= worker [3] (2+1+0,1+1+0)
= worker [] (3+2+1+0,1+1+1+0)
= (3+2+1+0) / fromIntegral (1+1+1+0)
= 6 / 3
= 2.0

Eager evaluation of average

  avg [1,2,3]
= worker [1,2,3] (0,0)
= worker [2,3] (1+0, 1+0)
= worker [2,3] (1, 1)
= worker [3] (2+1,1+1)
= worker [3] (3, 2)
= worker [] (3+3,1+2)
= worker [] (6, 3)
= 6 / 3
= 2.0

Optimizing lazy evaluation

Controlling evaluation

The primitive seq function allows controlling the evaluation order.

seq :: a -> b -> b

Evaluating seq x y first evaluates x (to weak head normal form) and then returns y.
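
A quick GHCi illustration, reusing the bottom function defined above (note that seq evaluates only to weak head normal form, i.e. up to the outermost constructor):

> seq (1+1) "done"
"done"
> seq (bottom 0) "done"
 --- infinite loop, Ctrl-C
> seq (Just (bottom 0)) "done"
"done"
 --- Just (bottom 0) is already in weak head normal form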

Controlling evaluation (1)

Using seq to force evaluation of both s and n before the tail recursive call.

avg :: [Double] -> Double
avg xs = worker xs (0,0)
  where
   worker :: [Double] -> (Double,Int) -> Double
   worker [] (s,n) = s / fromIntegral n
   worker (x:xs) (s,n)
       = s `seq` n `seq` worker xs (x+s,1+n)

Controlling evaluation (2)

Bang patterns (!) also allow forcing evaluation.

{-# LANGUAGE BangPatterns #-}

avg :: [Double] -> Double
avg xs = worker xs (0,0)
  where
   worker :: [Double] -> (Double,Int) -> Double
   worker [] (s,n)       = s / fromIntegral n
   worker (x:xs) (!s,!n) = worker xs (x+s,1+n)
              -- ^^^^^^^ "bang" patterns

Results

Folds

Fold operations

Recall the higher-order fold operations (also called reduce), which process a list by repeatedly applying an argument function.

https://en.wikipedia.org/wiki/Fold_(higher-order_function)
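
For intuition, many list functions are folds; a sketch (the real Prelude versions are more general):

sum' :: [Double] -> Double
sum' = foldr (+) 0

length' :: [Double] -> Int
length' = foldr (\_ n -> 1 + n) 0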

Left and right folds

foldr (+) 0 [1, 2, 3, 4, 5]
= 1 + (2 + (3 + (4 + (5 + 0))))

foldl (+) 0 [1, 2, 3, 4, 5]
= ((((0 + 1) + 2) + 3) + 4) + 5
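
These expansions follow from the recursive definitions of the two folds (essentially the Prelude definitions, specialized to lists; foldl is revisited below):

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f z []     = z
foldr f z (x:xs) = f x (foldr f z xs)

foldl :: (b -> a -> b) -> b -> [a] -> b
foldl f z []     = z
foldl f z (x:xs) = foldl f (f z x) xs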

Left vs right folds

Accumulating using a left fold

We could rewrite the average example using a left fold.

avg :: [Double] -> Double
avg xs = s / fromIntegral (n :: Int)
   where (s, n) = foldl f (0,0) xs
         f (!s,!n) x = (x+s,1+n)

However, even with the bang patterns this still exhibits excessive space use: foldl never forces its applications of f, so the bangs only fire when the final thunk is eventually evaluated.

The problem with foldl

foldl f z [] = z
foldl f z (x:xs) = foldl f (f z x) xs


  foldl f z [x1, x2, x3, x4]
=
  f (f (f (f z x1) x2) x3) x4

For a list with \(n\) elements this builds up a thunk f (f (...)) of \(n\) applications of the function \(f\).

A strict left fold

To avoid building up the large thunk we should use foldl' (from Data.List):

foldl' f z [] = z
foldl' f z (x:xs) = let z' = f z x
                    in z' `seq` foldl' f z' xs

foldl' forces evaluation of the accumulator f z x before each recursive call.
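
A quick GHCi check of the strict fold (the sum runs in constant space; plain foldl would instead build a chain of 10^7 addition thunks before forcing them):

> import Data.List (foldl')
> foldl' (+) 0 [1..10^7 :: Int]
50000005000000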

Constant-space average

import Data.List (foldl')

avg :: [Double] -> Double
avg xs = s / fromIntegral (n :: Int)
   where (s, n) = foldl' f (0,0) xs
         f (!s,!n) x = (x+s,1+n)

Note that we still need the bang patterns (!): foldl' only forces the pair to weak head normal form, not the values inside it.

Reasoning about space usage

Knowing how GHC represents values in memory is useful for reasoning about space usage.

Memory layout

GHC represents the list [1,2] = 1 : (2 : []) as a chain of (:) cells, each holding a pointer to a boxed element and a pointer to the rest of the list.
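
A sketch of the heap objects involved, assuming a 64-bit machine (1 word = 8 bytes) and ignoring GHC's sharing of small Int literals:

xs :: [Int]
xs = 1 : (2 : [])
-- cons cell "1 : ..." : header + pointer to boxed 1 + pointer to tail = 3 words
-- cons cell "2 : []"  : header + pointer to boxed 2 + pointer to []   = 3 words
-- boxed Int (I# 1)    : header + unboxed Int#                         = 2 words
-- boxed Int (I# 2)    : header + unboxed Int#                         = 2 words
-- []                  : a single shared, statically allocated constructor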

Memory usage for constructors

   data T a b = C1 a | C2 a b
   -- C1 takes 2 words; C2 takes 3 words

Unboxed types

   data Int = I# Int#
   -- a boxed Int wraps an unboxed machine integer (Int#)
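
As a sketch of what the boxing means in practice, the GHC-specific MagicHash extension lets us pattern match on the I# constructor directly (GHC.Exts re-exports it together with primitive operations such as +#):

{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), (+#))

-- Unbox both arguments, add the raw machine integers, re-box the result.
plusInt :: Int -> Int -> Int
plusInt (I# x) (I# y) = I# (x +# y)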

Quiz

data IntPair = IP Int Int

How many words are used to store a value of this type?

Answer: 7 words: 3 for the IP constructor (a header plus two pointers) and 2 for each boxed Int (a header plus the unboxed Int#).

Why box basic types?

Because of lazy evaluation: an Int value may be an unevaluated thunk rather than a plain machine integer.

cond :: Bool -> Int -> Int -> Int
cond True x y  = x
cond False x y = y

> cond True 42 (1`div`0)
42

Unpacking

GHC allows control over data representation using the UNPACK pragma:

Add the pragma just before the strictness annotation (!):

data Foo = Foo {-# UNPACK #-} !Int

GHC will also unpack small strict fields automatically if you compile with optimizations.

Example of unpacking

data IntPair = IP Int Int              -- boxed fields: 7 words in total

data IntPair = IP {-# UNPACK #-} !Int  -- unpacked fields: the two Int# values
                  {-# UNPACK #-} !Int  -- are stored inside IP (3 words in total)

Benefits of unpacking

However, there are situations where making fields strict and unpacking them can hurt performance, e.g. when the data is passed to a lazy (non-strict) function and has to be re-boxed.
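
A sketch of how this can happen (the type P is hypothetical): the fields of P are stored as raw Int# values, but building a lazy structure such as a list out of them forces GHC to allocate fresh Int boxes again ("re-boxing"):

data P = P {-# UNPACK #-} !Int
           {-# UNPACK #-} !Int

-- Both unpacked fields must be re-boxed into I# heap objects to build the list.
toList :: P -> [Int]
toList (P x y) = [x, y]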