Debugging random Haskell

Posted on 2013-01-31 by Chris Warburton

I was asked to look at some Haskell code the other day, from some kind of online roleplaying game. The code is given, along with the briefing that it should find a valid password amongst a file of about 35,000. Its creator supposedly fired, and managed to mangle it before he was kicked out. Our task is to find the correct password.

Rather than look at the file, let’s dive straight into the code. I found this particularly enjoyable, since it’s essentially an exercise in debugging and refactoring, which is how I tend to write code. I’ll write a naïve version which is generally very inefficient, over-complicated, repetitive, hard-coded and fragile. I’ll then repeatedly refactor it until it becomes efficient, simple, terse, generic and robust (I hope!). I assume no knowledge of Haskell, but knowing a language with first-class functions like Javascript may be useful.

import ParseLib

Import an unspecified parsing library; this isn’t too significant, but some Googling will find it if you really care.

runner :: [Char] -> [a]

This is a type annotation. It’s only a sanity check, since Haskell infers types. Thus there’s no point debugging it, so we remove it. I’ll ignore type annotations from now on, as a mangled annotation will just confuse us.

runner a | (length (test (a ++ "z")) == (sum [product (3..5), product (6..9), 1])) = False =
mzero
        | otherwise                                                                        =
result [papply defB a]

This is an invalid function definition. Let’s clean up the line breaks:

runner a | (length (test (a ++ "z")) == (sum [product (3..5), product (6..9), 1])) = False = mzero
         | otherwise                                                                       = result [papply defB a]

Each line is a different ‘clause’ of the function; they have the following form (| means ‘where’):

myFunction myArgument | myCondition = iWillRunIfMyConditionIsTrue
                      | otherwise   = iWillRunIfMyConditionIsFalse

This syntax can simplify complicated function definitions, but in our case I think it’s clearer to use an equivalent if condition:

runner a = if (length (test (a ++ "z")) == (sum [product (3..5), product (6..9), 1])) = False
              then mzero
              else result [papply defB a]

There’s a rookie error at the end of the if-condition, common from many languages: it uses = (definition) in the if condition, instead of == (comparison). Let’s swap it:

runner a = if (length (test (a ++ "z")) == (sum [product (3..5), product (6..9), 1])) == False
              then mzero
              else result [papply defB a]

Notice that we’re comparing something to False; we can get rid of the == False if we swap the then and else branches around:

runner a = if length (test (a ++ "z")) == (sum [product (3..5), product (6..9), 1])
              then result [papply defB a]
              else mzero

The right-hand-side of the condition doesn’t contain a, so it must be constant (due to ‘referential transparency’: the output of a function is completely determined by its input, and here there are no inputs). Let’s calculate it:

sum [product (3..5), product (6..9), 1]

There’s no such thing as (3..5) or (6..9). It could mean 3.5 and 6.9, but that wouldn’t type check, since product wants a list of numbers. This makes it more likely to be [3..5] and [6..9] which is shorthand for enumFromTo 3 5 and enumFromTo 6 9, which are, respectively, [3, 4, 5] and [6, 7, 8, 9]:

sum [product [3, 4, 5], product [6, 7, 8, 9], 1]

The products are easy to calculate:

sum [60, 3024, 1]

As is the sum:

If we put this simplified condition back into our original function we get:

runner a = if length (test (a ++ "z")) == 3084
              then result [papply defB a]
              else mzero

++ is a built-in function, defined like this (where [] is an empty list and x : y is a singly-linked list starting with element x and followed by list y):

[]             ++ list2 = list2  -- Concatenating an empty list does nothing
(elem : list1) ++ list2 = elem : (list1 ++ list2)  -- Recurse

What about test?

test x = init x

The function test, of argument x, is defined as init applied to x. Clearly:

test = init

That transformation is known as eta-reduction. The init function is built in to Haskell’s standard library too. It looks like the following:

init (elem : []) = []  -- init of a one-element list is an empty list
init (elem : xs) = elem : init xs  -- Recurse

init chops an element off the end of a list. Since we’re concatenating with a non-empty list, init will recurse over a and end up at "z". Let’s simplify:

length (init (a ++ "z"))
length (a ++ init "z")
length (a ++ init ('z' : []))  -- "z" is a one-element list of Chars
length (a ++ [])
length a

This gives us a much simpler version of the original function:

runner a = if length a == 3084
              then result [papply defB a]
              else mzero

Now what’s result?

result []          = "Nope"
result ((x,xs):ys) = x

(foo, bar) is a pair, so result is extracting the first element of a pair, which itself is the first element of a list. We know that this element will be a String, since it has to be of the same type as "Nope".

We can split this into two separate element extractions, using the built-in functions fst (get the first element of a pair) and head (get the first element of a list):

result [] = "Nope"
result xs = fst (head xs)

Notice that runner calls result on a one-element list [papply defB x], therefore:

result [papply defB x]
result ((papply defB x) : [])  -- [foo] is sugar for foo:[]
fst (head ((papply defB x) : []))  -- Bring 'result' in-line
fst (papply defB x)  -- 'head (x:y)' is just 'x'

This gives us a simplified runner:

runner x = if length x == 3084
             then fst (papply defB x)
             else mzero

mzero is funny; it’s a ‘method’ of a ‘type class’ called MonadPlus. A type class is basically an interface, and types can be instances of type classes in the same way that Object Oriented classes can implement OO interfaces (confusing re-use of existing terminology, I know!).

To know which implementation of mzero we’re dealing with, we need to know it’s type. This must be the same as the then branch, which we just discovered must be a String (otherwise we couldn’t have put "Nope" as a possible return value). As we saw above, String is just a synonym for a list of Chars (ie. [Char]), and indeed lists are an instance of the type class MonadPlus. Their implementation of mzero is the empty list [], which (since we’re dealing with [Char] AKA String) we can also write as "":

runner x = if length x == 3084
              then fst (papply defB x)
              else ""

That’s as simple as we can get with runner. Although we can rearrange it back to the two-clause version:

runner x | length x == 3084 = fst (papply defB x)
         | otherwise        = ""

Next we look at the parsers:

defB = do char 'X'
           baz <- digit
          char 'X'
          char 'W'
           foo <- defC
           bar <- defD
           char 'a'
           fooo <- many (do char 'f'
                           digit)
           char 'Y'
           char 'Z'
           return (foo ++ bar)

This is ‘do notation’, which is nice syntactic sugar for Haskell’s famously scary feature, Monads. A Monad is just a way to chain functions together. Here we’re chaining together simple parsers into more complex ones. We just write do followed by a bunch of parsers and the Monad will do all of the plumbing, composition, backtracking, shortcutting, etc. for us.

A few points to mention:

Haskell’s layout rule lets us do without braces and semicolons in favour of indenting and dedenting; the indentation above needs straightening out.
By default the intermediate return-values are ignored. To keep them we use an arrow, eg. foo <- bar will store the result of bar in foo.
Many of the results above are ignored, so there’s not point using the arrows (eg. baz and fooo are ignored, but foo and bar are used at the end).
A do block is just a regular function, hence the nested argument to many
The return value of a do block is the return value of the last function in the chain; to override this, we can use the return function which doesn’t perform any action (in our case any parsing) but does provide a return value.

defB = do char 'X'
          digit
          char 'X'
          char 'W'
          foo <- defC
          bar <- defD
          char 'a'
          many (do char 'f'
                   digit)
          char 'Y'
          char 'Z'
          return (foo ++ bar)

It turns out that defD is only used once in the whole program, so we may as well insert its definition straight in here (suitably indented):

defB = do char 'X'
          digit
          char 'X'
          char 'W'
          foo <- defC
          bar <- do foo <- defC
                   do baar <- defC
                      digit
                      fooo <- many (do digit)
                      q <- defE
                      return (q : "d")
          char 'a'
          many (do char 'f'
                   digit)
          char 'Y'
          char 'Z'
          return (foo ++ bar)

Note that nested chains can be flattened; do A; (do B; C) is the same as do A; B; C, as long as we keep the correct return values assigned to any intermediate results. Likewise do digit is the same as digit, since we’re not chaining anything on to it:

defB = do char 'X'
          digit
          char 'X'
          char 'W'
          foo <- defC
          defC
          defC
          digit
          many digit
          q <- defE
          char 'a'
          many (do char 'f'
                   digit)
          char 'Y'
          char 'Z'
          return (foo ++ q ++ "d")

Likewise, defE is never used anywhere else. We may as well inline it too:

defB = do char 'X'
          digit
          char 'X'
          char 'W'
          foo <- defC
          defC
          defC
          digit
          many digit
          q <- do foo <- defC
                  do fooo <- defC
                     do foooo <- defC
                        return "i"
          char 'a'
          many (do char 'f'
                   digit)
          char 'Y'
          char 'Z'
          return (foo ++ q ++ "d")

Simplifying out (we now know the value of q is "i"):

defB = do char 'X'
          digit
          char 'X'
          char 'W'
          foo <- defC
          defC
          defC
          digit
          many digit
          defC
          defC
          defC
          char 'a'
          many (do char 'f'
                   digit)
          char 'Y'
          char 'Z'
          return (foo ++ "id")

Now let’s look at defC. We keep this separate since it’s used many times:

defC = do char '!'
          baz <- digit
          char 's'
          do s <- many (do char 's')
             return "Val"
         +++ do char '@'
                baz <- char 's'
                digit
                do s <- many (do digit)
                   return 'Val'

This is actually in two parts, combined using +++. I’ll split them into separate functions to make it clearer:

defCA = do char '!'
           baz <- digit
           char 's'
           do s <- many (do char 's')
              return "Val"

defCB = do char '@'
           baz <- char 's'
           digit
           do s <- many (do digit)
              return "Val"

defC = defCA +++ defCB

According to ParseLib, +++ is a choice, so defC will try to match defCA first. If it fails, the input will be wound back and defCB will be tried. defC will only fail to match if both defCA and defCB fail to match. Let’s clean it up:

defCA = do char '!'
           digit
           char 's'
           many (char 's')
           return "Val"

defCB = do char '@'
           char 's'
           digit
           many digit
           return "Val"

defC = defCA +++ defCB

We can see that both branches will return "Val" if successful, so we no longer need to use the foo variable in defB, we can just use "Val" directly:

defB = do char 'X'
          digit
          char 'X'
          char 'W'
          defC
          defC
          defC
          digit
          many digit
          defC
          defC
          defC
          char 'a'
          many (do char 'f'
                   digit)
          char 'Y'
          char 'Z'
          return "Valid"

defCA = do char '!'
           digit
           char 's'
           many (char 's')

defCB = do char '@'
           char 's'
           digit
           many digit

That’s about all the ‘simplifying’ we can do, but with a little more work we can make it more succinct, if less simple.

First we can collect together individual characters into strings:

string (c:cs) = do char c
                   string cs

This reduces defB and defCB to:

defB = do char 'X'
          digit
          string "XW"
          defC
          defC
          defC
          digit
          many digit
          defC
          defC
          defC
          char 'a'
          many (do char 'f'
                   digit)
          string "YZ"
          return "Valid"
defCB = do string "@s"
           digit
           many digit

We can also special-case the foo; many foo pattern (+ in regular-expression syntax). This works by recursing on the left until the pattern p fails to match, backtracking once, then matching p without recursing:

plus p = (do p
             plus p) +++ p

This lets us write defB, 3A and 3B like this:

defB = do char 'X'
          digit
          string "XW"
          defC
          defC
          defC
          plus digit
          defC
          defC
          defC
          char 'a'
          many (do char 'f'
                   digit)
          string "YZ"
          return "Valid"
defCA = do char '!'
           digit
           plus (char 's')
defCB = do string "@s"
           plus digit

Notice that defC is always used three times in a row? We can hard-code that, to get:

defC3 = do defC
           defC
           defC
defB = do char 'X'
          digit
          string "XW"
          defC3
          plus digit
          defC3
          char 'a'
          many (do char 'f'
                   digit)
          string "YZ"
          return "Valid"

We can make this a bit more compact if we use explicit bind functions >> rather than do notation in a few places (do a; b is sugar for a >> b), and sprinkle some standard Haskell functions around to reduce redundancy. This gives the final, complete program as:

runner x | length x == 3084 = ""
         | otherwise        = fst (papply defB x)
defB = do string "X"
          digit
          string "XW"
          defC3
          plus digit
          defC3
          string "a"
          many (string "f" >> digit)
          string "YZ"
          return "Valid"
string = join . (map char)
plus p = (p >> plus p) +++ p
defC3  = defC >> defC >> defC
defC   = (string "!"  >> digit >> plus (string "s")) +++
         (string "@s" >> plus digit)

It’s straightforward enough to see what this is doing even without running it. If you do want to run it, say as a shell command, you can add this:

main = do stdInput <- getContents
          let inputLines = lines stdInput
              successes = filter (("Valid" ==) . runner) inputLines
              formatted = unlines successes
          putStr formatted

Pipe passwords in to stdin and you’ll get valid ones out of stdout. Note that Haskell is non-strict, meaning your program will handle pipes properly, and not sit there buffering.