# Self-Organising Lists

I was marking a Haskell assignment recently which introduced the concept of self-organising lists (specifically the “move to front” variety).

## The Context

Let’s say we have a tail-recursive list type. Haskell’s Prelude defines one, called `forall a. [a]`

. It’s conventional to leave the `forall a.`

implicit, so I’ll just write `[a]`

to mean the list type.

We can use `[]`

to construct an empty list and `:`

to prepend an element to another list.

We want the ability to look up an element which matches some predicate, eg. “has the ID ‘100’”, or “is shorter than 10”, or whatever. I’ll represent predicates using the function type `a -> Bool`

.

Implementing such a lookup function is pretty straightforward:

```
lookUp :: (a -> Bool) -> [a] -> Maybe a
lookUp p [] = Nothing
lookUp p (x:xs) = if p x
then Just x
else lookUp p xs
```

In the assignment we used `a`

as the return type and allowed errors to be thrown if no match was found, but that’s a horrible way to do it. Real Haskell code would avoid the errors completely by using the `Maybe`

type.

We can verify that this works using a few examples:

## The Problem

The `lookUp`

function steps through the list one element at a time until it finds a match. We can calculate its computational complexity by counting how many times it runs the predicate function `p`

, in terms of the length $N$ of the list. This should be roughly proportional to the amount of time it takes.

In the best case, the first element of the list satisfies the predicate, so we only run the predicate once. This gives best-case complexity of $O(1)$.

In the worst case, either *no* element satisfies the predicate, or only the *last* element of the list satisfies the predicate. In both of these cases, we must run the predicate on each element of the list, giving worst-case complexity of $O(N)$. Since the behaviour in both of these cases is identical, I’ll assume there’s always a match from now on.

In general, the complexity depends on the position of the first element which satisfies the predicate. Hence the *average* (or *expected*) complexity depends on the *average* position. With no prior knowledge of how our function will be used, the average position is $\frac{N}{2}$, hence the expected complexity is $O(\frac{N}{2})$.

If we knew more about how our function was going to be used, we could improve its performance. For example, let’s say that `lookUp`

will only be run with the constant list `myList`

. Let’s also say that one particular predicate is much more likely to be used than any other; let’s call it `p1`

.

The first thing we can do is make a `lookUp`

function which hard-codes the list `myList`

:

Any speedup this gives us will be modest (e.g. due to compiler optimisations, like loop unrolling). We can do much better by looking up a match (if any) for `p1`

ahead of time, and putting it at the start of the list:

```
-- 'acc' accumulates elements from the given list (in reverse order)
optimiseList :: [a] -> (a -> Bool) -> [a] -> [a]
optimiseList acc p [] = reverse acc -- No match
optimiseList acc p (x:xs) = if p x
then x : reverse acc ++ xs -- x matches
else optimiseList (x:acc) p xs -- Recurse on xs
```

Now we can look up matches from an optimised version of `myList`

:

The first time `lookUpOptimised`

is called, the local definition `go`

will be evaluated, resulting in a function. When that function is called, `lookUp`

will be called, which will force `newList`

to be evaluated. Since `newList`

is defined *outside* the `go`

function, it will be reused in subsequent calls to `lookUpOptimised`

, rather than being calculated again (in fact, we would probably see the same performance even if we wrote `newList`

inside `go`

, since Haskell implementations may spot that it’s a constant and “float” it outside). The `lookUp`

function will perform as usual, but in those cases where it’s called with `p1`

(which we’re assuming is quite often), and when `myList`

does actually contain an element satisfying `p1`

, then `lookUp`

will hit that element immediately since it will be at the start of `newList`

, and hence this (assumed to be common) use-case has the $O(1)$ best-case behaviour. Note that we’ll still get the $O(n)$ worst-case behaviour if there is no element in `myList`

which satisfies `p1`

, and we’ll also pay a one-time cost to do the initial search either way ($O(n)$ in the worst case).

This is all well and good, but its assumptions are quite strong. What if we *don’t* know what predicates will be common, or what if there are several common predicates? This is where self-organising lists come in.

## Self-Organising List Lookup

To make a self-organising list we just need to combine the processing performed by `lookUp`

and `optimiseList`

. When we’re asked to look up a value in a list, rather than *just* returning a matching element (if found) we *also* return a permutation of the list optimised for that predicate:

```
selfOrganisingLookUp :: (a -> Bool) -> [a] -> ([a], Maybe a)
selfOrganisingLookUp =
let go acc p [] = (reverse acc, Nothing) -- No match
go acc p (x:xs) = if p x
then (x : reverse acc ++ xs, Just x) -- x matches
else go (x:acc) p xs -- Recurse on xs
in go [] -- Start with an empty accumulator
```

To see any benefit we need to ensure that the list returned from each lookup is used as input to the following lookup. We can do this by propagating the resulting list up through our return values, all the way to our “main loop”, then pass them into the next iteration. We can do this explicitly, or get fancy with monads, applicatives, etc.

## Analysing Performance

The complexity of `selfOrganisingLookUp`

is tricky, and depends on the ratio of common-element lookups to non-common-element lookups. The best and worst case complexity match the regular list, $O(1)$ and $O(N)$, but the rearranging increases the chance that we’ll hit the best case, if some queries are more frequent than others. Hence, the self-organising list is *at least as fast* (asymptotically) as the regular list, when lookups are drawn at random from some fixed distribution. In reality there is some constant overhead caused by propagating the new lists around, and an $O(N)$ overhead from calling `reverse`

(although we’re only reversing the part of the list that we had to check, so this will match the complexity of the lookup: $O(1)$ when a match was found near the start, $O(N)$ when it was near the end or not found).

It’s also interesting to consider an “adversarial” situation, where an attacker knows that we’re using a self-organising list and wants to take down some service that we’re providing.

By sending lots of queries, an attacker can manipulate the overall order of the list, and hence come up with queries whose matches are near the end, causing us to hit the $O(N)$ worst-case performance again and again, which may be enough to slow down or disable our server. However, this isn’t too much of a worry since a simpler attack can cause the same behaviour by repeatedly sending a query which has no matches: that would also trigger the $O(N)$ behaviour, but doesn’t require manipulation of the order. Since this simpler attack works on regular lists too, I don’t see much disadvantage to using self-organising lists. Even if we limited ourselves to queries which always match, regular lists can be subject to a timing attack: timing how long different queries take, then sending the slowest ones again and again. Self-organising lists aren’t vulnerable to such repetitions of the same query, so at least they’d force attackers to work a little harder.

### A Poorly Performing Implementation

For the assignment, one student submitted an implementation something like the following:

```
selfOrganisingLookUp p xs = let notP x = not (p x)
ys = filter p xs
zs = filter notP xs
in (ys ++ zs, listToMaybe ys)
```

This approach produces *almost* the right output, so they wanted almost full marks. However, there are three problems with this implementation.

Firstly, the behaviour is not *quite* right: it moves *all* matching elements to the front of the list, but since only one is needed to speed up repeated queries, there’s nothing to be gained by looking for any others; we’re hence performing extra work (looking for more matches) which is of no benefit, slowing us down. This can actually make the implementation *slower* than regular lists, when averaged over multiple calls (laziness saves us from doing all of this extra work on the first call, but only manages to defer it to subsequent calls).

The second problem is that moving multiple matches to the front will *slow down* repetitions of *prior* queries, since they may have to skip past all of these front elements before they reach the match we previously prepended for them. For example, say our list is `[1, 2, 3, ..., 200]`

and the most common queries are for even numbers and odd numbers. Both versions would return the even number `2`

, but the fast version would return the list `[2, 1, 3, 4, 5, ..., 200]`

whilst the student’s version would return `[2, 4, 6, ..., 200, 1, 3, 5, ..., 199]`

. Querying the fast result for an odd number would quickly find `1`

, resulting in the list `[1, 2, 3, ..., 200]`

; yet the student’s version would have to check half of the list before it found any odd number, rearranging the result to be `[1, 3, 5, ..., 199, 2, 4, 6, ..., 200]`

. In other words, moving multiple matches to the front will disrupt the optimisations performed by previous queries, for no benefit and significant extra cost.

If the assignment were just about spitting out particular values then these issues wouldn’t be too bad; but since the context is that self-organising lists are an *optimisation* compared to regular lists, this implementation *isn’t* a very good solution since it’s actually *slower*.

The third problem isn’t specific to this assignment, and it isn’t related to the input/output behaviour at all: this implementation is more *complicated* than a function like `lookUp`

. I’m a big fan of simplifying code, even at the expense of speed (if speed’s an issue, profile it and benchmark). Yet this code is not only more complicated than it needs to be, it’s more complicated than the thing it’s replacing. If I came across this in the wild I wouldn’t hesitate to rip it out and replace it with built-in list functions, since they’re so much simpler.