Quantitative Benchmarking for Theory Exploration

Chris Warburton
University of Dundee
c.m.warburton@dundee.ac.uk

Motivation

Why exploration?

(Automated) Theory Exploration

append(nil, ys)          = ys
append(cons(x, xs), ys)  = cons(x, append(xs, ys))
map(f, nil)              = nil
map(f, cons(x, xs))      = cons(f(x), map(f, xs))

map(f, append(xs, ys))   = append(map(f, xs), map(f, ys))
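
Given the first four equations (the definitions of append and map), a theory exploration system should discover laws like the fifth. For instance, QuickSpec (Claessen, Smallbone, and Hughes 2010) conjectures equations from a signature of typed functions, testing candidates on random inputs. A minimal sketch against the QuickSpec 2 API (details vary between versions), using Haskell's built-in lists in place of nil/cons:

    import QuickSpec

    -- Explore the theory of list functions: QuickSpec enumerates terms
    -- over this signature and conjectures equations that survive testing.
    main :: IO ()
    main = quickSpec
      [ con "[]"  ([]   :: [A])
      , con "++"  ((++) :: [A] -> [A] -> [A])
      , con "map" (map  :: (A -> B) -> [A] -> [B])
      ]

Among its output we would expect the distributivity law above: map f (xs ++ ys) = map f xs ++ map f ys.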

Problems:

Existing Approaches

Questions

  1. How do we evaluate conjectures?
    • Finding theorems is trivial (e.g. x = x); finding interesting ones is not!
  2. How do we evaluate/compare sets of conjectures?
    • We want to find more than one conjecture
    • Repeating a good conjecture 100 times isn't a good set!
  3. How do we evaluate/compare theory exploration systems?
    • Brute-force can optimise anything
    • How do we compare "practical" performance?

Q1. Evaluating Conjectures

What is "interesting"?

In theory:

In practice:

Q2. Evaluating Sets of Conjectures

Given a corpus of known theorems/conjectures and a set of proposed conjectures:

  • Precision: the proportion of proposed conjectures that appear in the corpus
  • Recall: the proportion of the corpus that the proposed conjectures rediscover
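
A minimal sketch in Haskell, assuming conjectures have been normalised so that set membership captures the intended notion of "the same conjecture":

    import qualified Data.Set as Set

    -- Number of proposed conjectures that appear in the corpus.
    hits :: Ord a => Set.Set a -> Set.Set a -> Double
    hits corpus proposed =
      fromIntegral (Set.size (Set.intersection corpus proposed))

    -- Precision: the fraction of proposals found in the corpus. Sets
    -- collapse duplicates, so repeating one good conjecture 100 times
    -- scores no better than stating it once.
    precision :: Ord a => Set.Set a -> Set.Set a -> Double
    precision corpus proposed = hits corpus proposed
                              / fromIntegral (Set.size proposed)

    -- Recall: the fraction of the corpus that the proposals rediscover.
    recall :: Ord a => Set.Set a -> Set.Set a -> Double
    recall corpus proposed = hits corpus proposed
                           / fromIntegral (Set.size corpus)

(Both assume the relevant set is non-empty; a real implementation would handle the degenerate cases.)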

Existing Evaluations

Problems:

Our Proposal

Benefits:

Q3. Evaluating Systems

Sampling theories from a large corpus provides two opportunities:

  • Repeated samples give distributions of runtime, precision and recall, rather than single data points
  • Running every system on the same samples supports pairwise comparison, as sketched below
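
A sketch of the sampling loop (the names here, such as benchmark and the ground-truth lookup, are hypothetical illustrations rather than the interface of any existing tool):

    import qualified Data.Set as Set
    import Control.Monad (replicateM)
    import System.Random (randomRIO)

    type Definition = String
    type Conjecture = String

    -- Draw n distinct definitions uniformly at random (assumes n is at
    -- most the corpus size).
    sample :: Int -> [Definition] -> IO [Definition]
    sample 0 _    = pure []
    sample n defs = do
      i <- randomRIO (0, length defs - 1)
      let (before, d : after) = splitAt i defs
      (d :) <$> sample (n - 1) (before ++ after)

    -- Repeatedly sample a theory, run the system under test on it, and
    -- pair its output with the corpus theorems involving only those
    -- definitions; precision/recall (as above) and runtime then become
    -- distributions rather than single numbers.
    benchmark :: Int -> Int -> [Definition]
              -> ([Definition] -> IO (Set.Set Conjecture))  -- system under test
              -> ([Definition] -> Set.Set Conjecture)       -- ground truth
              -> IO [(Set.Set Conjecture, Set.Set Conjecture)]
    benchmark runs n corpus system truth =
      replicateM runs (do
        theory <- sample n corpus
        found  <- system theory
        pure (truth theory, found))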

Corpus Source

We chose the TIP ("Tons of Inductive Problems") theorem-proving benchmark (Claessen et al. 2015).

Generating Our Corpus

Example

We have applied our benchmarking methodology to QuickSpec (Claessen, Smallbone, and Hughes 2010).

Observations

Summary

  1. How do we evaluate conjectures?
    • Use known theories, and look each conjecture up in the corpus
  2. How do we evaluate/compare sets of conjectures?
    • Precision/recall against large corpus
  3. How do we evaluate/compare theory exploration systems?
    • Repeated sampling from a large benchmark
    • Gather statistics on runtime, precision and recall
    • Compare systems using pairwise difference
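
To make "pairwise difference" concrete, a sketch: run both systems on the same sampled theories, then summarise the per-sample differences in a score such as recall (pairing controls for the varying difficulty of the samples):

    -- Per-sample score differences between systems A and B, run on the
    -- same sampled theories.
    pairwiseDiffs :: [Double] -> [Double] -> [Double]
    pairwiseDiffs = zipWith (-)

    -- Mean and standard error of the differences; a mean far from zero
    -- relative to its standard error suggests a genuine gap between the
    -- systems (assumes at least two samples).
    summarise :: [Double] -> (Double, Double)
    summarise ds = (mu, sqrt (var / n))
      where
        n   = fromIntegral (length ds)
        mu  = sum ds / n
        var = sum [ (d - mu) ^ 2 | d <- ds ] / (n - 1)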

Future Work

Resources

All code available at chriswarbo.net/git and github.com/Warbo

All results (modulo hardware speed) are reproducible using Nix. Please let me know if you have any problems!

Questions?

References

Claessen, Koen, Moa Johansson, Dan Rosén, and Nicholas Smallbone. 2013. “Automating inductive proofs using theory exploration.” In Automated Deduction–CADE-24, 392–406. Springer.

———. 2015. “TIP: tons of inductive problems.” In Conferences on Intelligent Computer Mathematics, 333–37. Springer.

Claessen, Koen, Nicholas Smallbone, and John Hughes. 2010. “QuickSpec: Guessing Formal Specifications Using Testing.” In Tests and Proofs, edited by Gordon Fraser and Angelo Gargantini, 6143:6–21. Lecture Notes in Computer Science. Springer Berlin Heidelberg. doi:10.1007/978-3-642-13977-2_3.

Colton, Simon, Alan Bundy, and Toby Walsh. 2000. “On the notion of interestingness in automated mathematical discovery.” International Journal of Human-Computer Studies 53 (3). Elsevier: 351–75.

Johansson, Moa, Lucas Dixon, and Alan Bundy. 2009. “IsaCoSy: Synthesis of inductive theorems.” In Workshop on Automated Mathematical Theory Exploration (Automatheo).

Johansson, Moa, Dan Rosén, Nicholas Smallbone, and Koen Claessen. 2014. “Hipster: Integrating Theory Exploration in a Proof Assistant.” In Intelligent Computer Mathematics, edited by Stephen M. Watt, James H. Davenport, Alan P. Sexton, Petr Sojka, and Josef Urban, 8543:108–22. Lecture Notes in Computer Science. Springer International Publishing. doi:10.1007/978-3-319-08434-3_9.

Montano-Rivas, Omar, Roy McCasland, Lucas Dixon, and Alan Bundy. n.d. “Scheme-based Definition and Conjecture Synthesis for Inductive Theories.” Citeseer.