Last updated: 2018-08-13 14:37:16 +0100
Upstream URL: git clone http://chriswarbo.net/git/theory-exploration-benchmarks.git
This repository contains scripts to create large(ish) benchmarks for theory exploration systems. It’s based on the Tons of Inductive Problems (TIP) problem set, but whilst TIP focuses on depth (proving a given theorem from minimal premises), we focus on breadth (discovering theorems from many premises).
We use Racket, since it’s well suited to manipulating the s-expressions defined by TIP.
Each TIP benchmark defines datatypes, functions and a single (negated) theorem statement. We combine the datatypes and functions from all of these benchmarks together into one big theory, suitable for exploration by tools like QuickSpec and IsaCoSy, and suitable for passing into the standard tip tools (e.g. for translation into various other languages).
This super-theory does not include any of the theorem statements, but we do provide tools for rewriting the separate theorem statements in terms of the definitions present in the super-theory.
Combining all of these theories into one causes a few problems. In particular:

- Many benchmarks define functions with the same names, so each definition must be qualified with the path of the file it came from. For example, a function `add` in the file `arithmetic/plus_commutes.smt2` will become `arithmetic/plus_commutes.smt2add`.
- Such qualified names contain characters (like `/` and `.`) which are invalid in many settings. The `tip` program will rename such definitions when translating, but this is difficult to reverse, and hence we lose the ability to relate definitions back to theorem statements. To prevent this, we hex-encode all names, and prefix with either `global` or `Global` (following Haskell’s upper/lowercase distinction between types/constructors and functions/destructors).

All dependencies can be satisfied automatically by the Nix package manager, using the included `default.nix` file. This provides the following attributes:

- `tools`, which provides our TIP-manipulation scripts
- `tip-benchmarks`, which is a specific revision of the TIP benchmarks
- `tip-benchmark-smtlib`, which is the result of using the `tools` scripts to combine all of the `tip-benchmarks` together

These definitions are parameterised by the package repository and set of Haskell packages to use, which allows easy overriding and aids reproducibility.
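Since `import ./. {}` works with an empty argument set, the parameters evidently all have defaults, so building an attribute directly should also be possible (an untested sketch, using the attribute names listed above):

nix-build -A tip-benchmark-smtlib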
If you don’t want to use Nix, you’ll need (at least) `racket` in your `PATH`, with the `shell-pipeline` and `grommet` packages available. If you want to generate Haskell code, you’ll need the `tip` command from the `tip-lib` Haskell package.
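If you go the manual route, something like the following should fetch these dependencies (a sketch; the exact commands depend on your Racket and Haskell setups):

raco pkg install shell-pipeline grommet
cabal install tip-lib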
The transformation takes place in multiple stages; the result of each stage is cached on disk, and the filename is stored in an environment variable (these have names prefixed with `BENCHMARKS`). Nix manages this for us; if you want to avoid Nix, you can see which environment variables need to be set for your desired script by looking at its invocation in the `*.nix` files.
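For example, one quick (illustrative) way to find those variables is to search the Nix files for the prefix:

grep -r 'BENCHMARKS' --include='*.nix' .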
You can invoke the test suite using `scripts/test.sh`, or by using the `scripts/test.rkt` module, either with `raco test` or using DrRacket. The tests also need the `cabal` command, and a Haskell environment with the `QuickSpec`, `QuickCheck`, `tip-lib` and `testing-feat` packages available.
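For example, either of the following should run the suite:

scripts/test.sh
raco test scripts/test.rkt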
The preferred way to use these tools is from Nix, by `import`ing the definitions from `default.nix` and adding them to your own project’s dependencies. Alternatively, you can enter a shell to run commands manually (although this sacrifices the reproducibility offered by Nix):

nix-shell -p '(import ./. {}).tools'

This will use the default dependencies, and enter a shell with the tools available in `PATH`.
Tests are invoked automatically during the build process, and can also be run manually via DrRacket or by using Racket’s `raco` command, e.g.

raco test scripts/lib/foo.rkt
The Racket code makes heavy use of contracts to check the pre- and post-conditions of functions. These slow down the tools considerably (a slowdown factor of 5 has been observed), so they are disabled by default. To force contract checking, set the environment variable `PLT_TR_CONTRACTS` to `1`. A separate test derivation is provided in `default.nix`, which tests all benchmark files with contracts enabled. Setting this variable will also cause tests to loop over all benchmarks, definitions, etc. (for speed, they default to testing a random subset).
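For example, to run one module’s tests with contract checking forced on (reusing the example module name from above):

PLT_TR_CONTRACTS=1 raco test scripts/lib/foo.rkt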
Converting TIP into a theory exploration benchmark can be frustratingly slow. To keep our tools as fast as possible, we include a set of benchmark scripts (yes, it’s very meta!). These make use of `asv` (“Airspeed Velocity”), which lets us track performance across revisions, separate results from different machines, generate pretty HTML reports, etc.

To fetch `asv`, along with our custom plugins, use:

nix-shell asv.nix

This will drop you into a shell with `asv` available, and show instructions for invoking it, what to do with the results, etc.
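The instructions shown by that shell take precedence, but typical usage looks like the following (these are standard `asv` subcommands, not specific to this repository):

asv run      # benchmark the configured revisions
asv publish  # generate the HTML report
asv preview  # serve the report locally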
To combine the TIP benchmarks into one theory, ensure the `tools` are in your `PATH` (e.g. using `nix-shell -p '(import ./. {}).tools'`) and run `mk_final_defs`. This will print the resulting benchmark, in TIP format, to stdout.
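For example, to capture the result in a file (the filename is arbitrary):

nix-shell -p '(import ./. {}).tools' --run 'mk_final_defs > combined.smt2'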
To generate a Haskell package from a benchmark, pipe the benchmark into `full_haskell_package`. Make sure to set the `OUT_DIR` environment variable to a path in which the package will be created.
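For example (the directory name is arbitrary; we create it up front in case the tool expects it to exist):

export OUT_DIR="$PWD/haskell-pkg"
mkdir -p "$OUT_DIR"
mk_final_defs | full_haskell_package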
To sample a selection of function names from the combined benchmarks, use the `choose_sample` tool. This takes two arguments: the size of the desired sample and an “index” number, used for entropy. The sampling process is deterministic, so re-running with the same benchmark set, the same sample size and the same index will produce the same sample. To get a different sample, use a different index number.
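For example (the argument values are arbitrary):

choose_sample 20 0   # a deterministic sample of 20 names
choose_sample 20 1   # a different sample of the same size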
These samples are chosen such that they contain all of the definitions used by at least one theorem statement. This allows precision/recall experiments to avoid division by zero. It also means that an error will be raised if a sample size is requested which is too small to fit any theorem’s dependencies.
Some features default to using a set of TIP benchmarks provided by Nix. You can override this by setting the `BENCHMARKS` environment variable to a directory containing the desired benchmarks.
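For example (the path is illustrative):

BENCHMARKS=/path/to/benchmarks mk_final_defs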
Besides benchmarking theory exploration systems based on their resource usage, we can also use the theorem statements from TIP problems as a ground truth for measuring the effectiveness of each exploration. For example, if we only compared systems based on the number of theorems they discover, it would be trivial to score highly by generating “uninteresting” theorems such as:
(= Z Z)
(= (S Z) (S Z))
(= (S (S Z)) (S (S Z)))
...
Such theorems are too dull to bother writing into a benchmark. By contrast, more interesting properties like commutativity do appear in benchmarks, such as this one from `benchmarks/tip2015/int_add_comm.smt2` in the TIP benchmarks repo:
(= (plus x y) (plus y x))
Hence a theorem can be deemed interesting if it appears in a benchmark. Whilst far from comprehensive, the set of benchmarks is easy to extend as new cases arise.