Purely functional data structures

Pedro Vasconcelos, DCC/FCUP

February 2021

Fundamentals

References

Purely Functional Data Structures (chapters 1 & 2), Chris Okasaki, Cambridge University Press, 1998.

Introduction to Functional Programming (Chapter 9.3), Richard Bird, Phil Wadler, 1988.

Functional vs. imperative data structures

Data structures in imperative languages are typically implemented by mutating fields
- e.g. modifing pointers
But in a purely functional paradigm we do not allow mutation
Purely functional data structures must be implemented by creating copies
However: we need only copy the affected parts — not the entire structure

Example

Haskell lists are built from constructors [] and :.

data [a] = [] | a : [a]  -- pseudo-Haskell

-- [0, 1, 2]  =  0 : (1 : (2 : []))

Pattern matching destructs one constructor at a time.

length :: [a] -> Int
length []     = 0
length (x:xs) = 1 + length xs

Operational view

Data structures are heap-allocated and passed by pointer
Pattern matching binds variables to pointers
- does not perform allocation
Evaluating a constructor allocates a new heap cell
- e.g. a “shallow copy” in C/C++/Java/Python terminology
Hence: both constructing and de-constructing cost \(O(1)\)
Assume result is fully evaluated (i.e., worst case cost for lazy evaluation)

Example 1: computing the length

length :: [a] -> Int
length []     = 0
length (x:xs) = 1 + length xs

Recursion over the list argument
Evaluating length xs takes \(O(n)\) time
(\(n\) = length xs)
Does not allocate any new list constructors

Example 2: appending lists

(++) :: [a] -> [a] -> [a]
[]     ++ ys  = ys
(x:xs) ++ ys  = x : (xs ++ ys)

Recursion over the first argument only
One new (:) allocated for each step
Evaluating xs ++ ys fully takes \(O(n)\) time/space
(where \(n\) = length xs)
The length of the second list does not matter

Functional append

Traverse the first list and copy the values
The original lists are not affected

Imperative append

Using mutation we could append in \(O(1)\) time by tracking the begining and end of the lists
However: this would invalidate the original lists

Example 3: list updating

Replace the \(n\)-th element in a list:

replace :: Int -> a -> [a] -> [a]
replace 0 v (x:xs) = v:xs
replace n v (x:xs) = x:replace (n-1) v xs

Sample:

  replace 2 7 [0,1,2,3,4]
= [0,1,7,3,4]

Example 3: list updating

The updated list shares part of the original structure
The old version is still valid

Example 4: Binary search trees

data Tree a 
   = Branch a (Tree a) (Tree a)
       -- value, left and right subtrees
   | Empty

-- insertion preserving ordering
insert :: Ord a => a -> Tree a -> Tree a
insert v Empty 
    = Branch v Empty Empty
insert v (Branch x l r)
    | v <= x  = Branch x (insert v l) r
    | v >  x  = Branch x l (insert v r)

Example 4: Binary search trees

xs = Branch 'd' (Branch 'b' ..) (Branch 'g' ..)
ys = insert 'e' xs

Persistent vs. ephemeral data structures

Imperative data structures are usually ephemeral
- updates invalidate the previous version
Functional data structures are automatically persistent
- updating creates a new version that coexists with the old one
- all nodes that are not affected can be shared between the old and new version
- sharing is safe because node fields are never mutated

Sharing can be useful even within a single data structure.

Example (from Data.List):

tails :: [a] -> [[a]]  -- list of all suffixes
tails [] = [[]]
tails xs = xs : tails (tail xs)

E.g.

   tails [1,2,3]
 = [[1,2,3],[2,3],[3],[]]

By sharing suffixes, tails uses \(O(n)\) time and space.

Advantages of functional structures

Allows safe sharing — no need for cloning operations
Allows trivial roll back of changes
- e.g. undo operations, backtracking, handling exceptions
Automatically safe from data races in concurrent / parallel programming

Disadvantages of functional structures

Higher worst-case complexity than the best imperative structures
- often incur a logarithmic overhead
Not as well studied and documented - only one dedicated textbook (Okasaki 1998)
Often only amortized bounds rather than bounds for individual operations
Requires memory management (GC or ref counts)
Causes more allocations (but not necessarily more residence)

Exercise 1

What is the complexity of the naive reverse function?

nrev :: [a] -> [a]
nrev []     = []
nrev (x:xs) = nrev xs ++ [x]

Exercise 2

What is the complexity of reverse with accumulator?

reverse :: [a] -> [a]
reverse xs = revAcc xs []
  where
  revAcc :: [a] -> [a] -> [a]
  revAcc [] ys     = ys
  revAcc (x:xs) ys = revAcc xs (x:ys)

Exercises 3

What is the complexity of tail and init?

(NB: both are undefined for [].)

tail :: [a] -> [a]
tail (_:xs) = xs 

init :: [a] -> [a]
init [x] = []
init (x:xs) = x : init xs

Example data structures

Maps

Map k v

Associations from keys k to values v (also called dictionaries.)
Basic operations:

empty :: Map k v
insert :: ... => k -> v -> Map k v -> Map k v
lookup :: ... => k -> Map k v -> Maybe v
-- NB: need some class constraints 
-- for efficient implementations

Maps implementation

Using binary search trees:

data Map k v = Empty 
             | Branch k v (Map k v) (Map k v)

Hence: most operations need an Ord k constraint on keys (but not values).

Example: lookup a key

lookup :: Ord k => k -> Map k v -> Maybe v
lookup k Empty = Nothing
lookup k (Branch k' v left right)
   | k == k'   = Just v
   | k <  k'   = lookup k left
   | otherwise = lookup k right

Exercises 1

Implement the remaining operations:

empty :: Map k v
-- construct an empty map; 
-- does *not* need the Ord constraint

insert :: Ord k => k -> v -> Map k v -> Map k v
-- insert or update a key

Exercises 2

Implement some extra operations:

fromList :: Ord k => [(k,v)] -> Map k v
-- build map from a list of pairs

keys :: Map k v -> [k]
-- get all keys in a map

assocs :: Map k v -> [(k,v)]
-- get the list of pairs in a map

NB: the last two do not need the Ord constraint (can you explain why?)

Exercise 3

Implement an instance for the Show class:

instance (Show k, Show v) => Show (Map k v)

Sugestion: show maps as lists of pairs:

> show (fromList [('a',1),('b',2)])
"fromList [('a',1),('b',2)]"

Exercise 4

This implementation does not guarantee that the search trees are balanced
This means that the height of the tree for \(n\) pairs may be \(O(n)\) instead of \(O(\log n)\)
Hence: insert and lookup are \(O(n)\) in worst case rather than \(O(\log n)\)

Exercise: modify the implemention to ensure balance — e.g. using AVL trees.

Haskell packages and libraries

Some common data structures

base: generic list functions (Prelude and Data.List)
containers: ordering-based containers (e.g. sets, maps)
unordered-containers: hashing-based containers (e.g. hash maps)
vector: purely-functional arrays with efficent loop fusion optimisations
text: efficent packed Unicode text

Fundamentals

References

Functional vs. imperative data structures

Example

Operational view

Example 1: computing the length

Example 2: appending lists

Functional append

Imperative append

Example 3: list updating

Example 3: list updating

Example 4: Binary search trees

Example 4: Binary search trees

Persistent vs. ephemeral data structures

Sharing

Sharing (cont.)

Advantages of functional structures

Disadvantages of functional structures

Exercise 1

Exercise 2

Exercises 3

Example data structures

Maps

Maps implementation

Example: lookup a key

Exercises 1

Exercises 2

Exercise 3

Exercise 4

Haskell packages and libraries

Some common data structures