Quantitative Benchmarking for Theory Exploration

Chris Warburton
University of Dundee


Why exploration?

(Automated) Theory Exploration

append(nil, ys)         = ys
append(cons(x, xs), ys) = cons(x, append(xs, ys))

map(f, nil)         = nil
map(f, cons(x, xs)) = cons(f(x), map(f, xs))

map(f, append(xs, ys)) = append(map(f, xs), map(f, ys))
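
The equations above can be tried out directly. The following is a minimal sketch, not the method of any particular tool: it transcribes the two definitions into Python (built-in lists stand in for nil/cons) and checks the final conjecture on random inputs, in the spirit of the random testing that exploration tools such as QuickSpec rely on. The names map_list, random_list and survives_testing, the test function f, and the trial count are all assumptions made for illustration.

```python
import random


def append(xs, ys):
    # append(nil, ys) = ys
    if not xs:
        return list(ys)
    # append(cons(x, xs), ys) = cons(x, append(xs, ys))
    return [xs[0]] + append(xs[1:], ys)


def map_list(f, xs):
    # map(f, nil) = nil
    if not xs:
        return []
    # map(f, cons(x, xs)) = cons(f(x), map(f, xs))
    return [f(xs[0])] + map_list(f, xs[1:])


def conjecture_holds(f, xs, ys):
    # map(f, append(xs, ys)) = append(map(f, xs), map(f, ys))
    return map_list(f, append(xs, ys)) == append(map_list(f, xs), map_list(f, ys))


def random_list(rng, max_len=8):
    # Hypothetical generator: short lists of small integers.
    return [rng.randint(-10, 10) for _ in range(rng.randint(0, max_len))]


def survives_testing(trials=200, seed=0):
    # True if no counterexample is found; this is evidence, not a proof.
    rng = random.Random(seed)
    f = lambda x: 2 * x + 1  # arbitrary function chosen for the sketch
    return all(
        conjecture_holds(f, random_list(rng), random_list(rng))
        for _ in range(trials)
    )
```

Surviving random testing only makes an equation a plausible conjecture; proving it (here, by induction on xs) is a separate step.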


Existing Approaches


  1. How do we evaluate conjectures?
    • Finding theorems is trivial (e.g. x = x); finding interesting ones is not!
  2. How do we evaluate/compare sets of conjectures?
    • We want to find more than one conjecture
    • Repeating a good conjecture 100 times isn't a good set!
  3. How do we evaluate/compare theory exploration systems?
    • Brute-force can optimise anything
    • How do we compare "practical" performance?

Q1. Evaluating Conjectures

What is "interesting"?

In theory:

In practice:

Q2. Evaluating Sets of Conjectures

Given a corpus of known theorems/conjectures and a set of proposed conjectures:

Existing Evaluations


Our Proposal
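
The summary at the end of these slides describes the proposal as precision/recall against a large corpus. A minimal sketch of that measurement follows, assuming each conjecture has already been normalised to a string so that equivalent statements compare equal (the normalisation itself is not shown and is an assumption). The function name precision_recall and the convention of returning 0.0 for an empty set are choices made for this sketch only.

```python
def precision_recall(proposed, corpus):
    # Sets are used so that repeating one good conjecture many times
    # does not improve the score.
    proposed = set(proposed)
    corpus = set(corpus)
    hits = proposed & corpus

    # Precision: fraction of proposed conjectures that appear in the corpus.
    precision = len(hits) / len(proposed) if proposed else 0.0
    # Recall: fraction of corpus statements that were proposed.
    recall = len(hits) / len(corpus) if corpus else 0.0
    return precision, recall
```

For example, if the corpus holds four statements and a system proposes three conjectures, two of which are in the corpus, precision is 2/3 and recall is 1/2.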


Q3. Evaluating Systems

Sampling theories from a large corpus provides two opportunities:

Corpus Source

We chose the TIP (Tons of Inductive Problems) theorem proving benchmark (Claessen et al. 2015)

Generating Our Corpus


We have applied our benchmarking methodology to QuickSpec



  1. How do we evaluate conjectures?
    • Use known theories and look up in corpus
  2. How do we evaluate/compare sets of conjectures?
    • Precision/recall against large corpus
  3. How do we evaluate/compare theory exploration systems?
    • Repeated sampling from a large benchmark
    • Gather statistics on runtime, precision and recall
    • Compare systems using pairwise difference
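
The third answer can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual code: a "system" is modelled as a hypothetical Python callable that takes a sampled theory (a list of definition names) and returns one score, such as recall. Both systems are run on the same samples, so the differences are paired.

```python
import random
import statistics


def compare_systems(system_a, system_b, names, sample_size, repeats, seed=0):
    # names must be a list (or other sequence) of definition names.
    rng = random.Random(seed)
    diffs = []
    for _ in range(repeats):
        # Draw one theory and give the identical sample to both systems.
        theory = sorted(rng.sample(names, sample_size))
        diffs.append(system_a(theory) - system_b(theory))

    # Summarise the paired differences; a mean far from zero relative to
    # the spread suggests one system scores consistently higher.
    mean = statistics.mean(diffs)
    spread = statistics.stdev(diffs) if len(diffs) > 1 else 0.0
    return diffs, mean, spread
```

Pairing on the same samples removes the variation caused by some sampled theories being easier than others, which would otherwise blur the comparison. The same loop could record runtime or precision instead of recall.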

Future Work


All code is available at chriswarbo.net/git and github.com/Warbo

All results (modulo hardware speed) are reproducible using Nix. Please let me know if you have any problems!



