# Theory Exploration Benchmarks #

This repository contains scripts to create large(ish) benchmarks for theory
exploration systems. It's based on the Tons of Inductive Problems (TIP) problem
set, but whilst TIP focuses on depth (proving a given theorem from minimal
premises), we focus on breadth (discovering theorems from many premises).

We use Racket, since it's well suited to manipulating the s-expressions defined
by TIP.

## How it Works ##

Each TIP benchmark defines datatypes, functions and a single (negated) theorem
statement. We combine the datatypes and functions from all of these benchmarks
together into one big theory, suitable for exploration by tools like QuickSpec
and IsaCoSy, and suitable for passing into the standard `tip` tools (e.g. for
translation into various other languages).

This super-theory does not include any of the theorem statements, but we do
provide tools for rewriting the separate theorem statements in terms of the
definitions present in the super-theory.
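
For illustration, each input file looks something like the following (a
schematic sketch in TIP's SMT-LIB-like syntax; the names are ours, not taken
from any particular benchmark):

    ; Kept: datatype and function definitions like these are combined
    ; into the super-theory
    (declare-datatypes () ((Nat (Z) (S (p Nat)))))
    (define-fun-rec plus ((x Nat) (y Nat)) Nat
      (match x
        (case Z y)
        (case (S z) (S (plus z y)))))

    ; Stripped: the (negated) theorem statement is kept separate
    (assert-not (forall ((x Nat) (y Nat)) (= (plus x y) (plus y x))))
    (check-sat)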

Combining all of these theories into one causes a few problems. In particular:

 - Multiple files may declare functions or datatypes with the same name. To
   allow for this, we prefix each defined name with the path of the file it
   comes from: a function called `add` in the file
   `arithmetic/plus_commutes.smt2` becomes `arithmetic/plus_commutes.smt2add`
   (see the sketch after this list).
 - We may get multiple definitions of the same (alpha-equivalent) values, e.g.
   if multiple benchmarks define the same natural number type and functions. We
   find these by normalising each definition and looking for duplicates. When
   duplicates are found we keep only one definition, rename it to whichever of
   the duplicated names comes first lexicographically, and update all
   references elsewhere accordingly. We iterate this process until no more
   duplicates are found.
 - Many definitions have names which aren't suitable for targets like Haskell,
   especially once prefixed by their path. The `tip` program will rename such
   definitions when translating, but that renaming is difficult to reverse,
   losing the ability to relate definitions back to theorem statements. To
   prevent this, we hex-encode all names and prefix them with either `global`
   or `Global`, following Haskell's case distinction between
   types/constructors (uppercase) and functions/destructors (lowercase); see
   the sketch after this list.
 - To ease automation, we add an extra function for each constructor (prefixed
   with "constructor-") and for each destructor (prefixed with "destructor-").
   This gives us a function for every value-level entity.
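
As a rough illustration of the renaming steps, here is how the `add` example
above might end up. This sketch assumes names are encoded as lowercase ASCII
hex; the exact encoding used by the scripts may differ. Bodies are elided, and
for brevity only the function name is shown being renamed (types receive the
same treatment, with the capitalised `Global` prefix):

    ; Original, in arithmetic/plus_commutes.smt2
    (define-fun add ((x Nat) (y Nat)) Nat ...)

    ; Step 1: prefix the name with its file's path
    (define-fun arithmetic/plus_commutes.smt2add ((x Nat) (y Nat)) Nat ...)

    ; Step 2: hex-encode "arithmetic/plus_commutes.smt2add" and prefix with
    ; "global" (lowercase, since add is a function)
    (define-fun global61726974686d657469632f706c75735f636f6d6d757465732e736d7432616464
      ((x Nat) (y Nat)) Nat ...)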

## Dependencies ##

All dependencies can be satisfied automatically by the Nix package manager,
using the included `default.nix` file. This provides the following attributes:

 - `tools`, which provides our TIP-manipulation scripts
 - `tip-benchmarks`, which is a specific revision of the TIP benchmarks
 - `tip-benchmark-smtlib`, which is the result of using the `tools` scripts to
   combine all of the `tip-benchmarks` together

These definitions are parameterised by the package repository and set of Haskell
packages to use, which allows easy overriding and aids reproducibility.
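
For example, the combined benchmark can be built directly from the repository
root (this relies on `nix-build` reading `default.nix` by default; the
`result` symlink will point at the output):

    nix-build -A tip-benchmark-smtlib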

If you don't want to use Nix, you'll need (at least) `racket` in your `PATH`,
with the `shell-pipeline` and `grommet` packages available.

If you want to generate Haskell code you'll need the `tip` command, from the
`tip-lib` Haskell package.

The transformation takes place in multiple stages; the result of each stage is
cached on disk, and its filename is stored in an environment variable (these
variables have names prefixed with `BENCHMARKS`). Nix manages this for us; if
you want to avoid Nix, you can see which environment variables need to be set
for your desired script by looking at its invocation in the `*.nix` files.

You can invoke the test suite using `scripts/test.sh`, or by using the
`scripts/test.rkt` module either with `raco test` or using DrRacket. The tests
also need the `cabal` command, and a Haskell environment with the `QuickSpec`,
`QuickCheck`, `tip-lib` and `testing-feat` packages available.

The preferred way to use these tools is from Nix, by `import`ing the definitions
from `default.nix` and adding them to your own project's dependencies.
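
For example, a downstream project's `shell.nix` might look something like this
(a minimal sketch; the checkout path is a placeholder, and you may prefer to
pin `nixpkgs`):

    with import <nixpkgs> {};
    mkShell {
      buildInputs = [
        (import /path/to/theory-exploration-benchmarks {}).tools
      ];
    }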

Alternatively, you can enter a shell to run commands manually (although this
sacrifices the reproducibility offered by Nix):

    nix-shell -p '(import ./. {}).tools'

This will use the default dependencies, and enter a shell with the tools
available in `PATH`.

## Testing ##

Tests are invoked automatically during the build process, and can also be run
manually via DrRacket or by using Racket's `raco` command, e.g.

    raco test scripts/lib/foo.rkt

The Racket code makes heavy use of *contracts* to check pre- and post-conditions
of functions. These slow down the tools considerably (a factor of 5 has been
observed), so they are disabled by default. To force contract checking, set the
environment variable `PLT_TR_CONTRACTS` to `1`. A separate test derivation is
provided in `default.nix`, which tests all benchmark files with contracts
enabled. Setting this variable also causes the tests to loop over all
benchmarks, definitions, etc. (for speed, they default to testing a random
subset).
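
For example, to run the test module with contracts enabled:

    PLT_TR_CONTRACTS=1 raco test scripts/test.rkt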

## Benchmarking ##

Converting TIP into a theory exploration benchmark can be frustratingly slow. To
keep our tools as fast as possible, we include a set of benchmark scripts (yes,
it's very meta!). These make use of `asv` ("Airspeed Velocity"), which lets us
track performance across revisions, separate results from different machines,
generate pretty HTML reports, etc.

To fetch `asv`, along with our custom plugins, use:

    nix-shell asv.nix

This will drop you into a shell with `asv` available, and show instructions for
invoking it, what to do with the results, etc.

## Usage ##

To combine the TIP benchmarks into one theory, ensure the `tools` are in your
`PATH` (e.g. using `nix-shell -p '(import ./. {}).tools'`) and run
`mk_final_defs`. This will print the resulting benchmark, in TIP format, to
stdout.
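
For example (the output filename is arbitrary):

    mk_final_defs > combined.smt2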

To generate a Haskell package from a benchmark, pipe the benchmark into
`full_haskell_package`. Make sure to set the `OUT_DIR` environment variable
to a path in which the package will be created.
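
For example, using the `combined.smt2` file generated above (the directory name
is arbitrary):

    mkdir -p haskell-pkg
    OUT_DIR="$PWD/haskell-pkg" full_haskell_package < combined.smt2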

To sample a selection of function names from the combined benchmarks, use the
`choose_sample` tool. This takes two arguments: the size of the desired sample
and an "index" number, which acts as a source of entropy. The sampling process
is deterministic, so re-running with the same benchmark set, the same sample
size and the same index will produce the same sample. To get a different
sample, use a different index number.
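
For example (the sample size of 10 is arbitrary):

    choose_sample 10 0    # a deterministic sample of 10 names
    choose_sample 10 1    # a different sample of the same size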

These samples are chosen such that each contains all of the definitions
depended on by at least one theorem statement, so at least one ground-truth
theorem is always expressible using the sampled names. This allows
precision/recall experiments to avoid division by zero. It also means that an
error will be raised if the requested sample size is too small to fit any
single theorem's dependencies.

Some features default to using a set of TIP benchmarks provided by Nix. You can
override this by setting the `BENCHMARKS` environment variable to a directory
containing the desired benchmarks.
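
For example, assuming `mk_final_defs` is one such feature (the path is a
placeholder):

    BENCHMARKS=/path/to/my/benchmarks mk_final_defs > combined.smt2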

## Other Uses ##

Besides benchmarking theory exploration systems based on their resource usage,
we can also use the theorem statements from TIP problems as a ground truth for
measuring the effectiveness of each exploration. For example, if we only
compared systems based on the number of theorems they discover, it would be
trivial to score highly by generating "uninteresting" theorems such as:

    (=       Z         Z  )
    (=    (S Z)     (S Z) )
    (= (S (S Z)) (S (S Z)))
    ...

Such theorems are too dull to bother writing into a benchmark. In comparison,
more interesting properties like commutativity *are* found in benchmarks, such
as this from `benchmarks/tip2015/int_add_comm.smt2` in the TIP benchmarks repo:

    (= (plus x y) (plus y x))

Hence a theorem can be deemed interesting if it appears in a benchmark. Whilst
far from comprehensive, the set of benchmarks is easy to extend as new cases
arise.