SK logic in egglog: part 3, falsifying myself
Update: part 4 is here
In the previous
part we learned some background theory about SK logic, and built up
our notion of equality from simple identity (==
), through
equivalence of Normal
forms
(normalEq
) up to full
extensional equality (extEq
). In
this post we’ll investigate these definitions more thoroughly. Whilst
the eventual aim is to find the mistake I’d made in egglog, the best
path forward is to question and test our understanding as a whole.
False confidence
The properties we’ve written so far give us some confidence that our
definitions make sense, and our theory of their behaviour holds up,
since falsify was unable to disprove anything with a counterexample.
However, our confidence shouldn’t be too high, since most of
those properties have two branches: the situation we care about, and a
fallback which is trivially True
. If the
former is quite rare, finding no counterexamples may not tell us very
much.
For example, consider agreeOnExtensionalInputs (n, x, y, inputs)
,
which asserts that whenever extensionalInputs
claims x
and y
are extensionally equal for i
symbolic inputs, they agree on any
sequence of i
concrete input
values. This assertion has many caveats:
- Unless
x
andy
are extensionally equal, or happen to match a form thatprovablyDisagree
can recognise, thenextensionalInputs x y
will never finish. In that case we are merely asserting the fallback ofrunDelayOr True
, not learning anything about extensional equality. - In those cases that
extensionalInputs x y
would finish, it must take no more thann
steps to avoid that fallback. - When
x
andy
provablyDisagree
,extensionalInputs
givesNothing
and we’re merely asserting the other fallback:Now True
. - When
extensionalInputs
givesJust 0
,x
andy
arenormalEq
, so our assertion overlaps with other properties we’ve already established.
None of these make the property false; but they weaken the
evidence that it provides. We can see this more clearly by “labelling”
each situation, and having falsify
report statistics on how
common each was. Unfortunately this requires us to return more than just
a Bool
,
so we need to leave the world of simple predicates (and hence
cross-framework compatibility). Every property-checker provides its own
alternative implementations for such features: in falsify
it’s the Property
type.
Here’s a pretty direct translation of agreeOnExtensionalInputs
as a Property
, with
each of the above cases labelled:
=
agreeOnExtensionalInputsStats g "agreeOnExtensionalInputs" $ do
testProperty <- gen g
(n, x, y, ins) let check Nothing = Now (Left "Not equal")
Just i) = Right . (i,) . same <$> sAt i (agree x y ins)
check ("Identical" [show (x == y)]
label case runDelayOr (Left "Timeout") (extensionalInputs x y >>= check) n of
Left msg -> label "result" [msg]
Right (i, True ) -> label "i" [show i] *> label "result" ["True"]
Right (i, False) -> fail ("Disagree on " ++ show (sTake i ins))
-- | Split the first 'n' elements off a 'Stream'.
sSplitAt :: Integral n => n -> Stream a -> ([a], Stream a)
= go []
sSplitAt where go acc n xs | n <= 0 = (reverse acc, xs)
Cons x xs) = go (x:acc) (n - 1) xs
go acc n (
-- | Take the first 'n' elements of a 'Stream' as a list.
sTake :: Natural -> Stream a -> [a]
= fst . sSplitAt n sTake n
= defaultMain (agreeOnExtensionalInputsStats
main (tuple4 genFuel genCom genCom genComs))
agreeOnExtensionalInputs: OK (0.13s)
100 successful tests
Label "Identical":
98.0000% False
2.0000% True
Label "i":
8.0000% 0
Label "result":
14.0000% Not equal
78.0000% Timeout
8.0000% True
All 1 tests passed (0.13s)
The results will vary depending on the random seed (which changes
every time this page gets rendered), but I’ve seen around ⅓ resulting in
Timeout
, around ⅗ with Not equal
, and only a
handful resulting in True
. Most of the latter agree on
0 inputs, making them normalEq
and hence bypassing the use
of symbolic execution. The only good news is that x
and y
were hardly ever identical!
Here are similar results for other properties (their definitions are hidden in the following fold-out section, since the logic hasn’t changed):
Labelled properties…
=
notUnnormalEqToItselfStats g "notUnnormalEqToItself" $ do
testProperty <- gen g
(n, x) case runDelayOr Nothing (Just <$> normalEq x x) n of
Nothing -> label "result" ["Timeout"]
Just (Same _) -> label "result" ["True" ]
Just (Diff x' y') -> fail (show x' ++ " =/= " ++ show y')
=
normalEqImpliesAgreeStats g "normalEqImpliesAgree" $ do
testProperty <- gen g
(n, f, g, xs) case runDelay n (tuple2 (normalEq f g) (sHead (agree f g xs))) of
Now got@(Same _ , Diff _ _) -> fail (show got)
Now (Same _ , Same _ ) -> label "result" ["Same Same"]
Now (Diff _ _, Same _ ) -> label "result" ["Diff Same"]
Now (Diff _ _, Diff _ _) -> label "result" ["Diff Diff"]
Later _ -> label "result" ["Timeout" ]
=
skNeverDisagreesWithSKSKKKStats g "skNeverDisagreesWithSKSKKK" $ do
testProperty <- gen g
(n, xs) let f = App s k
= App (App s (App k (App s k))) (App k k)
g case runDelayOr Nothing (Just <$> sAt (2 + n) (agree f g xs)) n of
Nothing -> label "result" ["Timeout"]
Just (Same _ ) -> label "result" ["True"]
Just got@(Diff _ _) -> fail (show got)
=
agreementIsMonotonicStats g "agreementIsMonotonic" $ do
testProperty <- gen g
((n, m), (f, g), xs) case runDelay n (tuple2 (sAt n (agree f g xs))
+ m) (agree f g xs))) of
(sAt (n Now got@(Same _ , Diff _ _) -> fail (show got)
Now (Same _ , Same _ ) -> label "result" ["Same Same"]
Now (Diff _ _, Same _ ) -> label "result" ["Diff Same"]
Now (Diff _ _, Diff _ _) -> label "result" ["Diff Diff"]
Later _ -> label "result" ["Timeout" ]
=
normalEqImpliesEverAgreeStats g "normalEqImpliesEverAgree" $ do
testProperty <- gen g
(n, x, y) let go d = runDelayOr Nothing (Just <$> d)
case (go (same <$> normalEq x y) n, go (everAgree x y) (triangle n)) of
Nothing , _ ) -> label "result" ["Timeout"]
(Just False, _ ) -> label "result" ["Unequal"]
(Just True , Just True ) -> label "result" ["True" ]
(Just True , Nothing ) -> fail "everAgree timed out"
(Just True , Just False) -> fail "Didn't agree"
(
= do
commuteStats f g <- gen g
(x, y) case (f x y, f y x) of
True , True ) -> label "result" ["Both" ]
(False, False) -> label "result" ["Neither"]
(True , False) -> fail "f(x, y) but not f(y, x)"
(False, True ) -> fail "f(y, x) but not f(x, y)"
(
=
distinctSymbolicHeadsCommutesStats "distinctSymbolicHeadsCommutes" .
testProperty
commuteStats distinctSymbolicHeads
=
unequalArgCountCommutesStats "unequalArgCountCommutes" .
testProperty
commuteStats unequalArgCount
=
provablyDisagreeCommutesStats "provablyDisagreeCommutes" .
testProperty
commuteStats provablyDisagree
=
symbolGivenUnequalArgsCommutesStats g "symbolGivenUnequalArgsCommutes" $ do
testProperty <- gen g
(f, x, y) pure (x, y))
commuteStats (liftFun2 symbolGivenUnequalArgs f) (
=
extEqGeneralisesEqAndNormalEqAndEverAgreeStats g "extEqGeneralisesEqAndNormalEqAndEverAgree" $ do
testProperty <- gen g
(n, x, y) let go l d = case runDelay n d of
Now b -> label l [show b ] >> pure (Just b)
Later _ -> label l ["Timeout"] >> pure Nothing
<- go "extEq" ( extEq x y)
ext <- go "everAgree" ( everAgree x y)
evr <- go "normalEq" (same <$> normalEq x y)
nml <- go "equal" (pure $ (==) x y)
eql case ext of
Nothing -> label "result" ["Timeout"]
Just True -> label "result" ["Equal" ]
Just False -> case (evr, nml, eql) of
Just True, _ , _ ) -> fail "everAgree but not extEq"
(Just True, _ ) -> fail "normalEq but not extEq"
(_ , Just True) -> fail "== but not extEq"
(_ , _ , -> label "result" ["Unequal"] (_ , _ , _ )
= defaultMain $ testGroup "Stats"
main
[ notUnnormalEqToItselfStats
(tuple2 genFuel genCom)
, normalEqImpliesAgreeStats
(tuple4 genFuel genCom genCom genComs)
, skNeverDisagreesWithSKSKKKStats
(tuple2 genFuel genComs)
, agreementIsMonotonicStats
(tuple3 (tuple2 genFuel genFuel) (tuple2 genCom genCom) genComs)
, normalEqImpliesEverAgreeStats
(tuple3 genFuel genCom genCom)
, distinctSymbolicHeadsCommutesStats
(tuple2 genSymCom genSymCom)
, unequalArgCountCommutesStats
(tuple2 genSymCom genSymCom)
, symbolGivenUnequalArgsCommutesStatsFalse)) genSymCom genSymCom)
(tuple3 (Gen.fun (Gen.bool
, provablyDisagreeCommutesStats
(tuple2 genSymCom genSymCom)
, agreeOnExtensionalInputsStats
(tuple4 genFuel genCom genCom genComs)
, extEqGeneralisesEqAndNormalEqAndEverAgreeStats
(tuple3 genFuel genCom genCom) ]
Stats
notUnnormalEqToItself: OK (0.03s)
100 successful tests
Label "result":
5.0000% Timeout
95.0000% True
normalEqImpliesAgree: OK (0.07s)
100 successful tests
Label "result":
79.0000% Diff Diff
9.0000% Same Same
12.0000% Timeout
skNeverDisagreesWithSKSKKK: OK (1.04s)
100 successful tests
Label "result":
100.0000% Timeout
agreementIsMonotonic: OK (0.93s)
100 successful tests
Label "result":
100.0000% Timeout
normalEqImpliesEverAgree: OK (0.05s)
100 successful tests
Label "result":
9.0000% Timeout
7.0000% True
84.0000% Unequal
distinctSymbolicHeadsCommutes: OK (0.05s)
100 successful tests
Label "result":
2.0000% Both
98.0000% Neither
unequalArgCountCommutes: OK (0.03s)
100 successful tests
Label "result":
100.0000% Neither
symbolGivenUnequalArgsCommutes: OK (0.03s)
100 successful tests
Label "result":
100.0000% Neither
provablyDisagreeCommutes: OK (0.02s)
100 successful tests
Label "result":
2.0000% Both
98.0000% Neither
agreeOnExtensionalInputs: OK (0.13s)
100 successful tests
Label "Identical":
100.0000% False
Label "i":
7.0000% 0
1.0000% 1
Label "result":
11.0000% Not equal
81.0000% Timeout
8.0000% True
extEqGeneralisesEqAndNormalEqAndEverAgree: OK (0.24s)
100 successful tests
Label "equal":
98.0000% False
2.0000% True
Label "everAgree":
92.0000% Timeout
8.0000% True
Label "extEq":
12.0000% False
80.0000% Timeout
8.0000% True
Label "normalEq":
86.0000% False
6.0000% Timeout
8.0000% True
Label "result":
8.0000% Equal
80.0000% Timeout
12.0000% Unequal
All 11 tests passed (2.61s)
Again, the exact distribution of these tests will vary from run to
run; but I’ve seen some that always time-out, or never satisfy
the required preconditions, etc. Sometimes this is a legitimate problem
with our logic: for example, the properties skNeverDisagreesWithSKSKKKStats
and
agreementIsMonotonic
always hit
a timeout, since they are using the same Fuel
parameter
as the number of inputs and number of steps before timing out. We can
avoid this by using separate parameters for each:
=
skNeverDisagreesWithSKSKKKFixed g "skNeverDisagreesWithSKSKKK" $ do
testProperty <- gen g
(n, m, xs) let f = App s k
= App (App s (App k (App s k))) (App k k)
g case runDelayOr Nothing (Just <$> sAt (2 + n) (agree f g xs)) (m + n) of
Nothing -> label "result" ["Timeout"]
Just (Same _ ) -> label "result" ["True" ]
Just got@(Diff _ _) -> fail (show got)
=
agreementIsMonotonicFixed g "agreementIsMonotonic" $ do
testProperty <- gen g
((i1, i2), n, (f, g), xs) case tuple2 (runDelay n (sAt i1 (agree f g xs)))
+ i2) (agree f g xs))) of
(runDelay n (sAt (i1 Now got@(Same _ , Diff _ _) -> fail (show got)
Now (Same _ , Same _ ) -> label "result" ["Same Same"]
Now (Diff _ _, Same _ ) -> label "result" ["Diff Same"]
Now (Diff _ _, Diff _ _) -> label "result" ["Diff Diff"]
Later _ -> label "result" ["Timeout" ]
= defaultMain (skNeverDisagreesWithSKSKKKFixed
main (tuple3 genFuel genFuel genComs))
skNeverDisagreesWithSKSKKK: OK (1.28s)
100 successful tests
Label "result":
78.0000% Timeout
22.0000% True
All 1 tests passed (1.28s)
= defaultMain (agreementIsMonotonicFixed
main
(tuple4 (tuple2 genFuel genFuel)
genFuel
(tuple2 genCom genCom) genComs))
agreementIsMonotonic: OK (1.26s)
100 successful tests
Label "result":
2.0000% Diff Diff
98.0000% Timeout
All 1 tests passed (1.26s)
Discarding uninteresting cases
Property-checkers, including falsify
, allow us to
“discard” uninteresting cases, which aborts the current call; generating
another input to check instead. Discarded calls do not count towards the
reported total, so they avoid the false confidence issue we saw above.
The downside is that extra calls make the test slower; and it will be
abandoned entirely if too many are discarded (for falsify
,
the default limit is 100 in a row). Depending on the property checker
that may be considered a test failure: falsify
only
considers abandoned tests to have failed if there were no
successful calls.
In the following fold-out section we refactor our properties again, to discard any branches that are “uninteresting”, like timeouts:
Properties which discard…
=
notUnnormalEqToItselfDiscard g "notUnnormalEqToItself" $ do
testProperty <- gen g
(n, x) case runDelayOr Nothing (Just <$> normalEq x x) n of
Nothing -> discard
Just (Same _) -> pure ()
Just (Diff x' y') -> fail (show x' ++ " =/= " ++ show y')
=
normalEqImpliesAgreeDiscard g "normalEqImpliesAgree" $ do
testProperty <- gen g
(n, f, g, xs) case runDelay n (tuple2 (normalEq f g) (sHead (agree f g xs))) of
Now got@(Same _ , Diff _ _) -> fail (show got)
Now (Same _ , Same _ ) -> pure ()
Now (Diff _ _, _ ) -> discard
Later _ -> discard
=
skNeverDisagreesWithSKSKKKDiscard g "skNeverDisagreesWithSKSKKK" $ do
testProperty <- gen g
(n, m, xs) let f = App s k
= App (App s (App k (App s k))) (App k k)
g case runDelayOr Nothing (Just <$> sAt (2 + n) (agree f g xs)) (m + n) of
Nothing -> discard
Just (Same _ ) -> pure ()
Just got@(Diff _ _) -> fail (show got)
=
agreementIsMonotonicDiscard g "agreementIsMonotonic" $ do
testProperty <- gen g
((i1, i2), n, (f, g), xs) case tuple2 (runDelay n (sAt i1 (agree f g xs)))
+ i2) (agree f g xs))) of
(runDelay n (sAt (i1 Now got@(Same _ , Diff _ _) -> fail (show got)
Now (Same _ , Same _ ) -> pure ()
Now (Diff _ _, _ ) -> discard
Later _ -> discard
=
normalEqImpliesEverAgreeDiscard g "normalEqImpliesEverAgree" $ do
testProperty <- gen g
(n, x, y) let go d = runDelayOr Nothing (Just <$> d)
case (go (same <$> normalEq x y) n, go (everAgree x y) (triangle n)) of
Nothing , _ ) -> discard
(Just False, _ ) -> discard
(Just True , Just True ) -> pure ()
(Just True , Nothing ) -> fail "everAgree timed out"
(Just True , Just False) -> fail "Didn't agree"
(
=
agreeOnExtensionalInputsDiscard g "agreeOnExtensionalInputs" $ do
testProperty <- gen g
(n, x, y, ins) let check Nothing = pure discard
Just i) = pure . (i,) . same <$> sAt i (agree x y ins)
check (if x == y
then discard
else do (i, b) <- runDelayOr discard (extensionalInputs x y >>= check) n
"i" [show i]
label if b
then pure ()
else fail ("Disagree on " ++ show (sTake i ins))
=
extEqGeneralisesEqAndNormalEqAndEverAgreeDiscard g "extEqGeneralisesEqAndNormalEqAndEverAgree" $ do
testProperty <- gen g
(n, x, y) let go l d = case runDelay n d of
Now b -> label l [show b ] >> pure (Just b)
Later _ -> label l ["Timeout"] >> pure Nothing
<- go "extEq" ( extEq x y)
ext <- go "everAgree" ( everAgree x y)
evr <- go "normalEq" (same <$> normalEq x y)
nml <- go "equal" (pure $ (==) x y)
eql case ext of
Nothing -> discard
Just True -> pure ()
Just False -> case (evr, nml, eql) of
Just True, _ , _ ) -> fail "everAgree but not extEq"
(Just True, _ ) -> fail "normalEq but not extEq"
(_ , Just True) -> fail "== but not extEq"
(_ , _ , -> discard (_ , _ , _ )
= defaultMain $ testGroup "Discard"
main
[ notUnnormalEqToItselfDiscard
(tuple2 genFuel genCom)
, normalEqImpliesAgreeDiscard
(tuple4 genFuel genCom genCom genComs)
, skNeverDisagreesWithSKSKKKDiscard
(tuple3 genFuel genFuel genComs)
, agreementIsMonotonicDiscard
(tuple4 (tuple2 genFuel genFuel) genFuel (tuple2 genCom genCom) genComs)
, normalEqImpliesEverAgreeDiscard
(tuple3 genFuel genCom genCom)
, agreeOnExtensionalInputsDiscard
(tuple4 genFuel genCom genCom genComs)
, extEqGeneralisesEqAndNormalEqAndEverAgreeDiscard
(tuple3 genFuel genCom genCom) ]
Discard
notUnnormalEqToItself: OK (0.04s)
100 successful tests (discarded 8)
normalEqImpliesAgree: OK (0.92s)
100 successful tests (discarded 1365)
skNeverDisagreesWithSKSKKK: OK (5.08s)
100 successful tests (discarded 271)
agreementIsMonotonic: FAIL (0.84s)
All tests discarded (discarded 100)
Use -p '/agreementIsMonotonic/' to rerun this test only.
normalEqImpliesEverAgree: OK (0.71s)
100 successful tests (discarded 1281)
agreeOnExtensionalInputs: OK (1.42s)
59 successful tests (discarded 1201)
Label "i":
96.6102% 0
3.3898% 1
extEqGeneralisesEqAndNormalEqAndEverAgree: OK (2.85s)
100 successful tests (discarded 1251)
Label "equal":
77.0000% False
23.0000% True
Label "everAgree":
100.0000% True
Label "extEq":
100.0000% True
Label "normalEq":
3.0000% False
97.0000% True
1 out of 7 tests failed (11.87s)
Discarding works very well for some tests, e.g. notUnnormalEqToItself
only hit a few
timeouts, which caused a handful of extra calls. Other tests struggled,
making over 1000 extra calls; some were abandoned after reaching the
limit of 100 discards in a row; and some of those gave up without a
single success, causing the overall test suite to fail.
Smarter generators
One way to avoid excessive discards is to move logic into our data generators. This isn’t a magic bullet, but it can be useful if acceptable values are reasonably common; if retrying a generator is faster than discarding a test; and for retrying parts of an input, rather than starting from scratch.
For example, the following generator only produces Normal
values,
by running reduce
on generated
Com
values. That’s undecidable from within a property, but actually
quite easy in a generator, since we’re free to discard problematic
values and try again:
-- | Like genComN, but reduces its outputs to normal form. The fuel
-- | bounds the size of the initial expression (before it's reduced), and
-- | the number of steps to attempt when normalising.
genNormalN :: Fuel -> Gen Normal
= shrinkWith shrinkNormal $ do
genNormalN n <- genComN n -- Generate a Com value c
c -- Fall back to generating a fresh Com
runDelayOr (genNormalN n) pure <$> reduce c) -- Try to reduce c to a Normal value
(-- Give up after n steps
n
-- | Generates (relatively small) Normal values
genNormal :: Gen Normal
= genNormalN limit
genNormal
-- | Generates a 'Stream' from the given generator
genStream :: Gen a -> Gen (Stream a)
= Cons <$> g <*> genStream g
genStream g
-- | Generates a 'Stream' of 'Normal' values of the given size
genNormalsN :: Fuel -> Gen (Stream Normal)
= genStream . genNormalN
genNormalsN
-- | Generates a 'Stream' of reasonably-sized 'Normal' values
genNormals :: Gen (Stream Normal)
= genNormalsN limit genNormals
A minor digression about shrinking
Property-checkers which generate inputs randomly, like
falsify
, may stumble across monstrous counterexamples full
of irrelevant structure that is tedious for users to tease apart and
find the underlying problem. To reduce this burden, all such checkers
will attempt to “shrink” the counterexamples they find, in the hope of
discarding irrelevant parts: when a counterexample is found, a
“shrinker” turns it into a list of possible alternatives. If that list
is empty, shrinking stops and the counterexample is shown to the user.
If the list is not empty, the property is retried with the first
alternative: if the property holds then the next alternative is tried,
and so on. If an alternative is found for which the property does
not hold, then we’ve found a smaller counterexample. The shrinking
process is repeated for this smaller counterexample, and so on until
there are no alternatives remaining (either because the counterexample
cannot be shrunk, like an empty list; or the property held for every
smaller alternative).
There is no precise definition for what “shrinking” means, other than
a requirement to avoid cycles: if foo
can be shrunk to
bar
, then bar
should not also shrink to
foo
, or else the above algorithm can get stuck in a
loop.
QuickCheck implements shrinking alongside, but separately, to its data generators. A QuickCheck-style shrinker is a value with the following type:
type Shrink a = a -> [a]
That is, a function which takes as input the value we want to shrink,
and outputs a list (potentially empty!) of “smaller” alternatives. For
example, we could shrink a Com
value in
several ways:
- Replace leaf values with
K
(the simpler of the two basis values). We can’t replaceK
with itself, since that’s a cycle, so it has no alternatives. - Replace branches (
App
values) withS
orK
, since leaves are smaller. - Replace branches with either of their children.
- Shrink either of a branch’s children.
shrinkCom :: Shrink Com
C "K") = []
shrinkCom (C _) = [k]
shrinkCom (App l r) = interleave
shrinkCom (
[ [ k, s ]filter (`notElem` [k, s]) [ l, r ] -- Avoid duplicating k or s
, App l <$> shrinkCom r
, flip App r <$> shrinkCom l
,
]
-- | Alternate taking elements from each list.
interleave :: [[a]] -> [a]
:xs):ys) = x : interleave (ys ++ [xs])
interleave ((x:ys) = interleave ys
interleave ([] = [] interleave []
The shrinkNormal
function we
reference above is a simple wrapper around shrinkCom
, which discards non-Normal
values:
shrinkNormal :: Shrink Normal
= keep . shrinkCom . toCom
shrinkNormal where keep [] = []
:cs) = case toNormal c of
keep (cLeft _ -> keep cs
Right n -> n : keep cs
We use shrinkWith shrinkNormal
in our
definition of genNormalN
so that
the shrinking algorithm described above will get smaller alternatives by
calling our shrinkNormal
function. That would be the norm in QuickCheck (or its descendents, like
ScalaCheck), but falsify
is a bit smarter: it doesn’t need
custom shrinking functions, since its able to re-run its generators on
smaller (trees of) random numbers. Such an “integrated shrinking”
approach is usually good enough, and saves us a bunch of effort
(after all, we’ve got this far without having to know about shrinking!).
Unfortunately, integrated shrinking doesn’t work well for smarter
generators that perform a search, like genNormalN
: there’s no reason to
expect that re-running the search with some smaller random numbers will
find a smaller result. Indeed it may find no result, and get
stuck recursing forever. When writing such “smart” generators, it’s
hence worth thinking about their shrinking behaviour, and whether it
would be prudent to override.
Here are a few more general-purpose shrinkers (in a scrap your type classes style):
-- | Shrink the elements of a tuple.
shrink2 :: Shrink a -> Shrink b -> Shrink (a, b)
= interleave [(,y) <$> sA x, (x,) <$> sB y]
shrink2 sA sB (x, y)
-- | Shrink the elements of a tuple.
shrink3 :: Shrink a -> Shrink b -> Shrink c -> Shrink (a, b, c)
= unpack <$> shrink2 sA (shrink2 sB sC) (x, (y, z))
shrink3 sA sB sC (x, y, z) where unpack (x', (y', z')) = (x', y', z')
-- | Shrink a list, by dropping and shrinking its elements
shrinkL :: Shrink a -> Shrink [a]
= []
shrinkL _ [] = [] : (pure <$> sA x)
shrinkL sA [x] = [] : shrinkElems
shrinkL sA xs where len = length xs
= do
shrinkElems <- [0..len-1]
i case splitAt i xs of
-- Try dropping or shrinking the ith element
:suf) -> (pre ++ suf) : ((pre ++) . (:suf) <$> sA a)
(pre, a-> [] _
Revisiting our roots
Since Normal
values
reduce immediately, their normalEq
result is always Now
, and hence
they can be compared with any amount of Fuel
without
timing out. This fixes our very first property, that all values are
normalEq
to themselves!
normalsAreNormalEqToThemselves :: (Fuel, Normal) -> Bool
= normalEqToItself (n, toCom x) normalsAreNormalEqToThemselves (n, x)
= check normalsAreNormalEqToThemselves (tuple2 genFuel genNormal) main
normalsAreNormalEqToThemselves: OK (0.05s)
100 successful tests
All 1 tests passed (0.05s)
Generating related values
The following generator makes a Set
of values
which satisfy some given binary predicate. The most straightforward way
to implement this would be generating sets of values over and over until
we eventually find one that fits our criteria. However, it may be quite
rare for values to satisfy that predicate and attempting to generate
multiple at once will just compound that rarity.
Instead, our “smart” generator can be more efficient by accumulating values until any combination of them satisfies our criteria (exploiting the so-called “birthday paradox”):
-- | Accumulate more and more values from the given 'Gen', until we find 'n'
-- | that satisfy the given relation. The 'i' parameter allows for flexibility.
genMatchingN :: Ord o
=> Natural
-> (i -> Gen o)
-> (i -> o -> o -> Bool)
-> Shrink o
-> i
-> Gen (Set o)
= shrinkWith shrink (go Map.empty)
genMatchingN n g match shr i where go xss = do
<- g i
x let k = case filter (match i x) (Map.keys xss) of
:_ -> y
y-> x
_ -- Append x to the Set which matches k (or empty, if not present)
= Map.findWithDefault Set.empty k xss
xs case Set.insert x xs of
-> if len xs >= n
xs then pure xs
else go (Map.insert k xs xss)
-- Try shrinking the elements individually, keeping any that still match
=
shrink filter ((== n) . len) . (Set.fromList <$>) . shrink' . Set.toList
= []
shrink' [] :xs) =
shrink' (x:xs) <$> shr x, (x:) <$> shrink' xs])
keep (interleave [(
= fromIntegral . Set.size
len
= []
keep [] :xss) = keep xss
keep ( []:xs):xss) =
keep ((xif all (match i x) xs then [x:xs] else []) ++ keep xss
(
-- | Accumulate more and more values from the given 'Gen', until we find two
-- | that satisfy the given relation. The 'i' parameter allows for flexibility.
genMatching :: Ord o
=> (i -> Gen o)
-> (i -> o -> o -> Bool)
-> Shrink o
-> i
-> Gen (o, o)
= genMatchingN 2 g match shr i >>= get
genMatching g match shr i where get s = case Set.toList s of
:y:_ -> pure (x, y)
x-> genMatching g match shr i -- absurd but fall back to retrying
_
-- | Generate pairs of unequal 'Com' values which have the same 'Normal' form.
genNormalEqN :: Fuel -> Gen (Com, Com)
= genMatching genComN matchNormalEqN shrinkCom
genNormalEqN
= x /= y && runDelayOr False (same <$> normalEq x y) n
matchNormalEqN n x y
genNormalEq :: Gen (Com, Com)
= genNormalEqN limit genNormalEq
This generator greatly reduces discards for a property like normalEqImpliesAgree
:
= defaultMain . normalEqImpliesAgreeDiscard $ do
main <- genNormalEq
(x, y) pure x) (pure y) genComs tuple4 genFuel (
normalEqImpliesAgree: OK (0.59s)
100 successful tests (discarded 24)
All 1 tests passed (0.59s)
Hedging our bets
Unlike normalsAreNormalEqToThemselves
, which
relies on genNormal
to
avoid timeout counterexamples, the smart generator for normalEqImpliesAgree
is
purely an optimisation. For that reason, I actually prefer to
mix in a few inputs from the original “dumb” generator: that way, we
don’t need to trust the smart generator to completely cover the input
space; and we leave open the possibility for the dumb generator to
stumble on to a different form of counterexample. The simplest way to
mix two generators is via the Gen.choose
function, which uses each
half of the time; but that would increase the number of discards more
than I would like. Instead we’ll use the more sophisticated Gen.frequency
:
= defaultMain . normalEqImpliesAgreeDiscard $ do
main <- Gen.frequency
(x, y) 9, genNormalEq )
[ (1, (tuple2 genCom genCom)) ]
, (pure x) (pure y) genComs tuple4 genFuel (
normalEqImpliesAgree: OK (0.57s)
100 successful tests (discarded 42)
All 1 tests passed (0.57s)
A trickier case
We can use a similar approach for agreementIsMonotonic
, generating a
pair of values which only agree on a non-zero number of inputs:
-- | Generate a pair of 'Com' values which agree on the given number of inputs,
-- | within the given amount of 'Fuel'.
genAgreeFromN :: Natural -> Fuel -> Gen (Com, Com)
= genMatching genComN (matchAgreeFromN lo) shrinkCom
genAgreeFromN lo
= case runDelayOr Nothing (extensionalInputs x y) n of
matchAgreeFromN lo n x y Just i -> i >= lo
Nothing -> False
-- | Generate a pair of 'Com' values which agree on 1 or more inputs. This tends
-- | to avoid values which agree on 0 inputs, i.e. with equal 'Normal' forms,
-- | and hence exercise the symbolic execution more thoroughly.
= genAgreeFromN 1
genAgreeN
-- | Generate pairs of reasonably sized values which agree on 1 or more inputs.
= genAgreeN limit genAgree
= defaultMain . agreementIsMonotonicDiscard . Gen.frequency $
main 9, smart), (1, dumb)]
[(where dumb = tuple4 (tuple2 genNat genNat)
genFuel
(tuple2 genCom genCom)
genComs= do
smart <- (10 +) <$> genFuel
n <- genAgreeN n -- Generate values that agree on some inputs
(x, y) case runDelay n (extensionalInputs x y) of -- Find how many inputs
Now (Just i) -> tuple4
pure i) genNat)
(tuple2 (pure n)
(pure (x, y))
(fmap toCom <$> genNormals) -- Normal values don't need any Fuel
(-> smart -- absurd, but retry as a fallback
_ = fromIntegral <$> genFuel genNat
agreementIsMonotonic: OK (115.21s)
100 successful tests (discarded 189)
All 1 tests passed (115.21s)
This test now takes a lot longer to run than the others, due to the more precise precondition we require of the generator. I see that as a good investment, since we’re now maxing-out our CPU in a (reasonably targeted) attempt to find mistakes in our reasoning; which seems preferable to the original test, which quickly gave up even trying. Since the number of runs is configurable, we can choose a low number for a fast feedback cycle during development, and crank it up for a more thorough check like a pre-push git hook or a continuous integration server.
Thoroughly checking extensionality
Most of the checks so far have been focused on implementation details, like the definition of “agreement” and the particular patterns of disagreement we’re looking for. Now I want to focus on extensional equality itself, and its implications.
Extensionally equal values are indistinguishable
The main reason we care about extensional equality is that it’s broad enough to relate all values which are indistinguishable within SK. This is also known as observational equivalence (SK is such a simple language that these notions end up coinciding!).
As a consequence, extensionally equal values are interchangable: swapping any part of an SK expression for an extensionally-equal alternative should make no observable difference to the result; i.e. the results should themselves be extensionally equal. We can represent such “swapping” using a zipper datastructure:
-- | A path down a binary tree
type Path = [Either () ()]
-- | Replace a part (identified by 'Path') of the first 'Com' with the second.
swapPart :: Path -> Com -> Com -> Com
= unzipCom (part, position)
swapPart p whole part where (_, position) = focus whole p
-- | Represents a 'Com' with one of its sub-expressions "focused"
type ComZipper = (Com, [Either Com Com])
-- | Turns a 'ComZipper' back into a 'Com'
unzipCom :: ComZipper -> Com
= case xs of
unzipCom (x, xs) -> x -- x is the root, return it as-is
[] Left r:ys -> unzipCom (App x r, ys) -- x is a left child with sibling r
Right l:ys -> unzipCom (App l x, ys) -- x is a right child with sibling l
-- | Focus on a particular sub-expression of the given 'Com', at a position
-- | identified by the given 'Path' (stopping if it hits a leaf).
focus :: Com -> Path -> ComZipper
= go (x, []) -- Start focused on the root
focus x where go (App l r, xs) ( Left ():p) = go (l, Left r:xs) p -- Focus on left
App l r, xs) (Right ():p) = go (r, Right l:xs) p -- Focus on right
go (= z -- Reached a leaf or end of Path
go z _
-- | Generate a 'Path' through a binary tree (e.g. 'Com').
genPathN :: Fuel -> Gen Path
= Gen.list (to n) (go <$> Gen.bool False)
genPathN n where go False = Left ()
True = Right ()
go
-- | Generate a reasonably small 'Path' through a binary tree (e.g. 'Com')
genPath :: Gen Path
= genPathN limit genPath
Checking ComZipper
…
We can sanity-check our ComZipper
implementation using the property that turning them back into a Com
value
doesn’t depend on which part was “focused”:
= testProperty "unzipComIgnoresLocation" $ do
unzipComIgnoresLocation <- gen genCom
c <- gen genPath
p let c' = unzipCom (focus c p)
if c == c' then pure () else fail (show (c, c'))
= defaultMain unzipComIgnoresLocation main
unzipComIgnoresLocation: OK (0.02s)
100 successful tests
All 1 tests passed (0.02s)
Swapping-out part of an arbitrary expression, even one in
Normal
form, may result in a value which loops forever. For example, consider a
classic infinite loop like S(SKK)(SKK)(S(SKK)(SKK))
:
- The expression
SKK
acts as an identity function, reducing to its first input value:SKKx → Kx(Kx) → x
. - Let’s abbreviate
SKK
asI
, so our loop can be writtenSII(SII)
. - Applying the
S
rule, we getSII(SII) → I(SII)(I(SII))
. - Since
Ix
reduces tox
, both of thoseI(SII)
values reduce toSII
. - Hence
I(SII)(I(SII)) → SII(SII)
, which is what we started with! - Therefore our original expression
S(SKK)(SKK)(S(SKK)(SKK))
reduces to itself, over and over, forever.
Note that the repeated component S(SKK)(SKK)
is itself
in Normal
form,
since each S
is only applied to two args. Applying
K
to that, like K(S(SKK)(SKK))
, is also Normal
. If we
swap-out that K
for S(SKK)(SKK)
, we’ll get our
infinite loop.
Undecidability makes it impossible, in general, to avoid creating such loops; so we need our generator to retry if it can’t normalise a swapped-out expression in a reasonable number of steps:
genSwappableExtEqValsN :: Fuel -> Gen (Com, Com, [Either Com Com])
= shrinkWith zShrink $ do
genSwappableExtEqValsN n -- Generate a pair of extensionally-equal Coms. Avoid normally-equal values,
-- since they don't need symbolic execution.
<- genMatching genNormalComN matchNontriviallyExtEq shrinkCom n
(x, y) -- Generate a ComZipper whose focus can be swapped-out with 'x' or 'y'
-- without diverging
<- genZipperFor x y
zs pure (x, y, zs)
where genZipperFor x y = do
<- focus <$> genComN n <*> genPathN n
(_, zs) if checkZs (x, y, zs) then pure zs else genZipperFor x y
=
checkZs (x, y, zs) let go = isJust . countLaters n . reduce . unzipCom . (, zs)
in go x && go y
= filter checkZs
zShrink . shrink3 shrinkCom shrinkCom (shrinkL (const []))
-- | More efficient alternative to 'genComN': biased towards smaller values, and
-- | only generates 'Normal' forms.
= (toCom <$>) . genNormalN
genNormalComN
-- | Whether two 'Com' values are 'extEq' but *not* 'normalEq' (within 'Fuel').
matchNontriviallyExtEq :: Fuel -> Com -> Com -> Bool
= runDelayOr False ( extEq x y) n
matchNontriviallyExtEq n x y && runDelayOr False (diff <$> normalEq x y) n
Now we can generate extensionally-equal values, an expression to plug
them into, and a Path
describing where to plug them in, filtered such that they both reduce to
a Normal
.
However, one problem remains: we have no idea how much Fuel
will be
needed to find when those swapped-out expressions begin to agree. We can
re-use our double-negative trick of being “not unequal”:
= testProperty "extEqCannotBeDistinguished" $ do
extEqCannotBeDistinguished -- Use the smart generator 90% of the time, allow 10% to be unconstrained
<- gen $ Gen.frequency
(x, y, zs) 9, genSwappableExtEqValsN limit)
[ (1, tuple3 genCom genCom (snd <$> (focus <$> genCom <*> genPath)))
, (
]let (x', y') = (unzipCom (x, zs), unzipCom (y, zs))
case (runDelayOr False (extEq x y) limit, runDelay limit (extEq x' y')) of
True, Now True ) -> pure ()
(True, Now False) -> fail (show x' ++ " /= " ++ show y')
(-> discard _
= defaultMain extEqCannotBeDistinguished main
extEqCannotBeDistinguished: OK (19.67s)
100 successful tests (discarded 47)
All 1 tests passed (19.67s)
However, this doesn’t give us much confidence since we can only prove
inequality in certain specific cases. Expressions which are unequal in a
way we can’t prove will give a never-ending sequence of Later
, and
hence be discarded. There is a way we can assert that all the
results are extensionally equal, by removing the timeout: that’s risky,
since if we’re wrong then one of those never-ending sequences will cause
our test suite to run forever!
-- | Runs a 'Delay' value to completion. This may run forever!
unsafeRunDelay :: Delay a -> a
Now x) = x
unsafeRunDelay (Later x) = unsafeRunDelay x
unsafeRunDelay (
= testProperty "extEqAreInterchangable" $ do
extEqAreInterchangable -- For safety, we can only use the smart generator
<- gen $ genSwappableExtEqValsN limit
(x, y, zs) let (x', y') = (unzipCom (x, zs), unzipCom (y, zs))
case (unsafeRunDelay (extEq x y), unsafeRunDelay (extEq x' y')) of
True, True ) -> pure ()
(True, False) -> fail (show x' ++ " /= " ++ show y')
(-> discard _
= defaultMain extEqAreInterchangable main
extEqAreInterchangable: OK (20.09s)
100 successful tests
All 1 tests passed (20.09s)
(If you look at this page’s Markdown source you’ll see that I’m actually a coward, since I’m wrapping the above Haskell process in a timeout of a few minutes, just in case!)
Extensionally unequal values are distinguishable
When two values are extensional unequal, there exist input
values for which they disagree. Note that we can’t test this by just
generating some input values and asserting that they disagree, since
some inputs may happen to agree by coincidence. Instead, we’ll
generate a Stream
of
input Stream
s, and
check for disagreement on all of them, and interleave their execution
using race
:
-- | Generate a pair of 'Com' values which are extensionally unequal
genExtUneqN :: Fuel -> Gen (Com, Com)
= genMatching genComN uneq shrinkCom
genExtUneqN where uneq n x y = runDelayOr False (not . isJust <$> extensionalInputs x y) n
= genExtUneqN limit
genExtUneq
= testProperty "extUneqWillDisagree" $ do
extUneqWillDisagree <- gen genExtUneq
(x, y) <- gen genFuel
len <- gen genInputs
inputs let result = race (sAt len . agree x y <$> inputs) >>= findDisagreement
pure (unsafeRunDelay result)
where genInputs = Cons <$> genComs <*> genInputs
= if diff x
findDisagreement (x, _, xs) then pure ()
else race xs >>= findDisagreement
= defaultMain extUneqWillDisagree main
extUneqWillDisagree: OK (0.78s)
100 successful tests
All 1 tests passed (0.78s)
Transitivity
The transitive
property holds for all of the notions of equality we’ve seen so far:
when foo
equals bar
and bar
equals baz
, then foo
must equal
baz
:
-- | Assert that the function 'eq' is transitive for the three given args.
= case (eq x y, eq y z, eq x z) of
assertTrans eq x y z False, _, _) -> discard
(False, _) -> discard
( _, True, True, True) -> pure ()
( -> fail "Not transitive"
_
= testProperty "eqIsTransitive" $ do
eqIsTransitive -- Can't use genMatching, since Set will discard == values
let g xs = do
<- genCom
x let got = x : filter (== x) xs
if length got >= 3 then pure got else g (x:xs)
:y:z:_) <- gen $ Gen.shrinkWith (shrinkL shrinkCom) (g [])
(x==) x y z
assertTrans (where
= testProperty "normalEqIsTransitive" $ do
normalEqIsTransitive let g = genMatchingN 3 genComN matchNormalEqN shrinkCom limit
= matchNormalEqN limit
eq :y:z:_) <- gen $ Set.toList <$> g
(x
assertTrans eq x y z
= testProperty "agreeIsTransitive" $ do
everAgreeIsTransitive let g = genMatchingN 3 genNormalComN (matchAgreeFromN 1) shrinkCom limit
= runDelayOr False (everAgree x y) (limit * limit)
eq x y :y:z:_) <- gen $ Set.toList <$> g
(x
assertTrans eq x y z
= testProperty "extEqIsTransitive" $ do
extEqIsTransitive let g = genMatchingN 3 genNormalComN (matchAgreeFromN 1) shrinkCom limit
= runDelayOr False (extEq x y) (limit * limit)
eq x y :y:z:_) <- gen $ Set.toList <$> g
(x assertTrans eq x y z
= defaultMain $ testGroup "transitivity"
main
[ eqIsTransitive
, normalEqIsTransitive
, everAgreeIsTransitive
, extEqIsTransitive ]
transitivity
eqIsTransitive: OK (0.44s)
100 successful tests
normalEqIsTransitive: OK (1.50s)
100 successful tests
agreeIsTransitive: OK (65.64s)
100 successful tests
extEqIsTransitive: OK (59.20s)
100 successful tests
All 4 tests passed (126.78s)
Falsifying myself
Thanks to this in-depth testing, I eventually discovered the problem I was having in egglog: expressions containing symbolic variables were being used interchangably with concrete (variable-free) expressions. When we use symbolic execution to check for agreement, we are assuming that each symbol represents an input our expressions have been applied to: if those expressions already contain symbols, that assumption no longer holds, and our conclusions about agreement may be wrong.
Let’s walk through a simple example, with the expressions
SKK
(concrete) and Kx
(which contains the
symbol x
). These don’t agree on 0 inputs, but if we apply
them to a single symbolic input x
:
- The first expression reduces like
SKKx → Kx(Kx) → x
- The second expression reduces like
Kxx → x
Hence these seem to agree! Unfortunately, this is actually
just a coincidence, caused by the appearance of the symbol
x
in the latter expression. If we apply these expressions
to any input value other than x
, they will
disagree! To see this, notice that the first expression SKK
is an identity function: it returns its argument unchanged; whilst the
second, Kx
, ignores its argument and always returns
x
. They only agree when the argument is also
x
: but if that is the first symbol we’re using to check
agreement, we will mistakenly conclude that these expressions are
extensionally equal!
Once we’ve made such a mistake, all Com
expressions will quickly merge into one big equivalence class, due to
the properties described above. To see why, consider an arbitrary value
A
: if we apply our first expression to this we get
SKKA
which is equal to KA(KA)
and ultimately
to A
. However, we can also swap-out that SKK
for the “equal” expression Kx
, giving KxA
which is equal to x
. Hence this mistaken equality between
SKK
and Kx
can be used to “prove” that
any value is equal to x
; and therefore, by
transitivity, equal to any other value!
I believe this is why my implementation of extensional equality in egglog caused everything to collapse: once symbolic expressions started leaking into equivalence classes of concrete expressions (due to my unhygenic mixing of the two) the inevitable consequences quickly unfolded, thanks to egglog’s powerful Equation Graph machinery.
Conclusion
I’m pretty happy that I took this little diversion to double-check my
assumption about using symbolic computation to “fake” universal
quantification. It was a fun little exercise to learn the particular API
features of falsify
, which seems very promising. My initial
tests were quick and direct to write, but they didn’t yield any new
information. This didn’t give me much confidence, as I suspected they
weren’t exploring much of the search space: and adding instrumentation
proved as much. Using discards as a guard-rail against triviality, plus
some careful thought on how to avoid slamming into them repeatedly,
ultimately lead to some more in-depth understanding of what assumptions
we can and can’t make in this context.
My ultimate realisation, that it’s the “leaking” of symbolic values
that caused the collapse I was seeing, did not come directly from a
failing falsify
test. However, it did come from
clarification of my mental model, and that clarification was largely
thanks to these falsify
tests. That’s actually a truism
I’ve learned with experience: whether a test failure is caused by a bug
in an application, or “just” a bug in the tests, those are both symptoms
of a bug in our understanding; which is the most important
part.
With this Haskell digression over, future installments will switch
back to egglog and make sure that we don’t check agreement with
expressions that already contain symbolic values. There are several
mechanisms we can use to avoid this, and I’ve already confirmed that a
simple isConcrete
precondition is enough to prevent the
prior collapse. Stay tuned Go here for part
four!