
     1: % Derived from the template file for the LaTeX package SVJour3
     2: % for Springer journals.          Springer Heidelberg 2010/09/16
     3: %
     4: % This template includes a few options for different layouts and
     5: % content for various journals. Please consult a previous issue of
     6: % your journal as needed.
     7: %
     8: % First comes an example EPS file -- just ignore it and
     9: % proceed on the \documentclass line
    10: % your LaTeX will extract the file if required
    11: \begin{filecontents*}{example.eps}
    12: %!PS-Adobe-3.0 EPSF-3.0
    13: %%BoundingBox: 19 19 221 221
    14: %%CreationDate: Mon Sep 29 1997
    15: %%Creator: programmed by hand (JK)
    16: %%EndComments
    17: gsave
    18: newpath
    19:   20 20 moveto
    20:   20 220 lineto
    21:   220 220 lineto
    22:   220 20 lineto
    23: closepath
    24: 2 setlinewidth
    25: gsave
    26:   .4 setgray fill
    27: grestore
    28: stroke
    29: grestore
    30: \end{filecontents*}
    31: 
    32: %INCLUDE STYLES HERE
    33: 
    34: %INCLUDE SPEEDUP CSV HERE
    35: 
    36: %INCLUDE BIBTEX HERE
    37: 
    38: \RequirePackage{fix-cm}
    39: %
    40: %\documentclass{svjour3}                     % onecolumn (standard format)
    41: \documentclass[smallcondensed]{svjour3}     % onecolumn (ditto)
    42: %\documentclass[smallextended]{svjour3}       % onecolumn (second format)
    43: %\documentclass[twocolumn]{svjour3}          % twocolumn
    44: %
    45: \smartqed  % flush right qed marks, e.g. at end of proof
    46: %
    47: \usepackage{graphicx}
    48: %
    49: % \usepackage{mathptmx}      % use Times fonts if available on your TeX system
    50: %
    51: % insert here the call for the packages your document requires
    52: \usepackage{csvsimple}
    53: \usepackage[inline, shortlabels]{enumitem}
    54: \usepackage{epstopdf}
    55: \usepackage{hyperref}
    56: \usepackage{mathtools}
    57: \usepackage{tikz}
    58: \usetikzlibrary{arrows, calc, decorations.pathreplacing, positioning,
    59:   shapes.geometric}
    60: % etc.
    61: 
    62: % please place your own definitions here and don't use \def but
    63: % \newcommand{}{}
    64: 
    65: \newcommand{\etal}{{\em et al.}}
    66: 
    67: % For example theory of lists
    68: \newcommand{\Zero}{\text{Z}}
    69: \newcommand{\Succ}{\text{S}}
    70: 
    71: \newcommand{\List}{\text{List}}
    72: \newcommand{\ListA}{\text{List} \  a}
    73: \newcommand{\Nil}{\text{Nil}}
    74: \newcommand{\Cons}{\text{Cons}}
    75: \newcommand{\Head}{\text{head}}
    76: \newcommand{\Tail}{\text{tail}}
    77: \newcommand{\Append}{\text{append}}
    78: \newcommand{\Reverse}{\text{reverse}}
    79: \newcommand{\Length}{\text{length}}
    80: \newcommand{\Map}{\text{map}}
    81: \newcommand{\Foldl}{\text{foldl}}
    82: \newcommand{\Foldr}{\text{foldr}}
    83: 
    84: % Insert the name of "your journal" with
    85: \journalname{Journal of Automated Reasoning}
    86: 
    87: \begin{document}
    88: 
    89: \title{Quantitative Benchmarking for Automatically Generated Conjectures%\thanks
    90: % {Grants or other notes about the article that should go on the front page
    91: % should be placed here. General acknowledgments should be placed at the end of
    92: % the article.}
    93: }
    94: % \subtitle{Do you have a subtitle?\\ If so, write it here}
    95: 
    96: % \titlerunning{Short form of title}        % if too long for running head
    97: 
    98: \author{Chris Warburton \and
    99:         Alison Pease    \and
   100:         Jianguo Zhang}
   101: 
   102: % \authorrunning{Short form of author list} % if too long for running head
   103: 
   104: \institute{C. Warburton \at
   105:            University of Dundee \\
   106:            \email{c.m.warburton@dundee.ac.uk} \\
   107:            ORCID: 0000-0002-4878-6319 \\
   108:            Phone: +44 (0)1382 386968
   109: %             \emph{Present address:} of F. Author  %  if needed
   110:            \and
   111:            A. Pease \at
   112:            University of Dundee \\
   113:            \email{a.pease@dundee.ac.uk} \\
   114:            ORCID: 0000-0003-1856-9599
   115:            \and
   116:            J. Zhang \at
   117:            University of Dundee \\
   118:            \email{j.n.zhang@dundee.ac.uk} \\
   119:            ORCID: 0000-0001-9317-0268
   120: }
   121: 
   122: \date{Received: date / Accepted: date}
   123: % The correct dates will be entered by the editor
   124: 
   125: \maketitle
   126: 
   127: \begin{abstract}
   128:   We propose a benchmarking methodology to evaluate the efficiency and quality
   129:   of \emph{conjecture generation} by automated tools for \emph{mathematical
   130:     theory exploration}. Our approach uses widely available theorem proving
   131:   tasks as a \emph{ground-truth} corpus, and we demonstrate its use on the
   132:   QuickSpec and IsaCoSy tools. We found that both may fail, even for small
   133:   inputs, but QuickSpec usually finishes in significantly less time than IsaCoSy
   134:   and produces significantly more ``interesting'' output. By defining a standard
   135:   benchmark we provide a measure of progress for the field, encourage
   136:   innovation through healthy competition and allow direct comparisons to be made
   137:   between the disparate approaches currently being pursued for this task.
   138:   \keywords{Theory Formation \and Theorem Proving \and Lemma Discovery \and Conjecture Generation \and Benchmarking}
   139: % \PACS{PACS code1 \and PACS code2 \and more}
   140: % \subclass{MSC code1 \and MSC code2 \and more}
   141: \end{abstract}
   142: 
   143: \begin{acknowledgements}
   144:   We are grateful to the authors of the systems we have used in our experiments,
   145:   especially for help in obtaining and configuring their tools. We would
   146:   especially like to thank Koen Claessen, Lucas Dixon, Moa Johansson, Dan
   147:   Ros\'{e}n and Nicholas Smallbone for useful discussions and help in adapting
   148:   their software to our purposes.
   149: 
   150:   This work was supported by the Engineering and Physical Sciences Research
   151:   Council grant EP/P017320/1 and studentship 1526504.
   152: \end{acknowledgements}
   153: 
   154: \pagebreak
   155: 
   156: \section{Introduction}
   157: \label{intro}
   158: 
   159: \emph{Conjecture generation} is the open-ended task of conjecturing properties
   160: of a given logical theory which are somehow ``interesting'', and is
   161: studied as a sub-field of \emph{mathematical theory exploration} (MTE). This has
   162: applications wherever we find uses for formal methods: in proof assistants and
   163: their libraries, in mathematics education and research, and in the specification
   164: and verification of software. Although formal methods, and automated reasoning
   165: more generally, have been advocated in mathematics (notably by
   166: Voevodsky~\cite{voevodsky2010univalent} and Gowers~\cite{ganesalingam2013fully})
   167: and software engineering (for example via dependently-typed
   168: programming~\cite{McKinna:2006}), they have a reputation for being difficult and
   169: expensive, which has historically limited their use to high-assurance domains
   170: such as aerospace and microprocessor design.
   171: 
   172: One path to reducing these barriers is to increase the level of automation.
   173: Software verification tools have been shown to benefit from the addition of a
   174: conjecture generation step for finding ``background lemmas'', which are useful
   175: when proving user-provided statements~\cite{Claessen.Johansson.Rosen.ea:2013}.
Another path is to tailor the division of labour between humans and machines
so that each plays to its strengths, an arrangement sometimes referred to as
``centaur teams''~\cite{harari2017reboot,davenport2015beyond}. Machines can
systematically search a much larger space of possible relationships than
humans, potentially bringing novel and surprising connections to light, whilst
leaving the user the final judgement over which of the most promising results
are worth investigating further.
   183: 
   184: Similar tasks are also studied in the domain of Computational \emph{Scientific}
   185: Discovery~\cite{king2004functional,Williams20141289,schmidt2009distilling}, for
   186: example in the search for new drugs. The major difference between scientific and
   187: mathematical discovery is that inductive reasoning and experimental testing are,
   188: in principle, more important for the former, although we note that they are also
   189: core principles in the mathematical tools we have investigated (in particular
   190: for ensuring the \emph{plausibility} of conjectures).
   191: 
   192: We limit ourselves to mathematical applications, but even here it is difficult
   193: to measure and compare the success of approaches and tools. This is partially
   194: due to their diversity, but also because of the inherent ambiguity of the task
   195: (what counts as ``interesting''?), the different goals emphasised by their
designers and the variety of evaluation methods employed. We attempt to
resolve this difficulty, at least for the foreseeable future, by defining a
standard, unambiguous benchmarking approach with which to compare the
conjecture generation of MTE tools. Our contributions include:
   200: 
   201: \begin{itemize}
   202: \item A general methodology for benchmarking conjecture generation.
\item Resolving the issue of ``interestingness'' through the use of
  theorem-proving benchmarks as a ground truth.
   205: \item A specific instantiation of this methodology, using the Tons of Inductive
   206:   Problems (TIP) benchmark as a corpus.
   207: \item Automated tooling to perform this benchmarking.
   208: \item Application of our methodology to the QuickSpec and IsaCoSy MTE tools,
   209:   and a comparison and discussion of the results.
   210: \end{itemize}
   211: 
   212: We describe the conjecture generation problem in more detail, along with the
   213: difficulty of comparing existing solutions, in $\S$\ref{sec:background}. We
   214: explain our proposal for a more general benchmark in $\S$\ref{sec:proposal} and
   215: demonstrate its application to existing MTE tools in $\S$\ref{sec:application}.
   216: Issues facing our approach, and the field in general, are discussed in
   217: $\S$\ref{sec:discussion}; related work is surveyed in
   218: $\S$\ref{sec:related-work}, including the diverse terminology found in the
   219: literature; and concluding remarks are given in $\S$\ref{sec:conclusion}.
   220: 
   221: \begin{sloppypar}
   222:   Our benchmark implementation is available at
   223:   \url{http://chriswarbo.net/projects/theory-exploration} and
   224:   \url{https://github.com/warbo}. All processes, including benchmark generation,
   225:   experimental runs and generation of this paper, are specified using the
   226:   Nix~\cite{dolstra2004nix} system to aid reproducibility.
   227: \end{sloppypar}
   228: 
   229: \section{Background}
   230: \label{sec:background}
   231: 
   232: \subsection{Motivation}
   233: \label{sec:motivation}
   234: 
   235: \begin{figure}
   236:   \begin{equation*}
   237:     \begin{split}
   238:       \forall a. \Nil            &: \ListA                                  \\
   239:       \forall a. \Cons           &: a \rightarrow \ListA \rightarrow \ListA \\
   240:       \Head(\Cons(x, xs))        &= x                                       \\
   241:       \Tail(\Cons(x, xs))        &= xs                                      \\
   242:       \Append(\Nil,         ys)  &= ys                                      \\
   243:       \Append(\Cons(x, xs), ys)  &= \Cons(x, \Append(xs, ys))               \\
   244:       \Reverse(\Nil)             &= \Nil                                    \\
   245:       \Reverse(\Cons(x, xs))     &= \Append(\Reverse(xs), \Cons(x, \Nil))   \\
   246:       \Length(\Nil)              &= \Zero                                   \\
   247:       \Length(\Cons(x, xs))      &= \Succ (\Length(xs))                     \\
   248:       \Map(f, \Nil)              &= \Nil                                    \\
   249:       \Map(f, \Cons(x, xs))      &= \Cons(f(x), \Map(f, xs))                \\
   250:       \Foldl(f, x, \Nil)         &= x                                       \\
   251:       \Foldl(f, x, \Cons(y, ys)) &= \Foldl(f, f(x, y), ys)                  \\
   252:       \Foldr(f, \Nil,         y) &= y                                       \\
   253:       \Foldr(f, \Cons(x, xs), y) &= f(x, \Foldr(f, xs, y))
   254:     \end{split}
   255:   \end{equation*}
   256:   \caption{A simple theory defining a $\List$ type and some associated
   257:     operations, taken from~\cite{Johansson.Dixon.Bundy:conjecture-generation}.
   258:     $\Zero$ and $\Succ$ are from a Peano encoding of the natural numbers}
   259:   \label{figure:list_theory}
   260: \end{figure}
   261: 
   262: Given a logical theory, such as the theory of lists shown in
   263: Figure~\ref{figure:list_theory}, we may want to find properties (e.g.
   264: invariants) describing its behaviour. This could be for mathematical curiosity,
   265: or due to the theory's importance in some domain. In particular, when dealing
   266: with definitions in a software library, we may regard a set of such properties
   267: as the code's \emph{specification}. Even those properties which only hold by
   268: coincidence, rather than by design, may still be useful for \emph{optimising}
   269: uses of this library, by rewriting expressions into equivalent forms which
   270: require less time, memory, network usage, etc. For example, our theory of lists
   271: turns out to satisfy the following invariant:
   272: 
   273: \begin{equation} \label{eq:mapreduce}
   274:   \forall f. \forall xs. \forall ys.
   275:     \Map(f, \Append(xs, ys)) = \Append(\Map(f, xs), \Map(f, ys))
   276: \end{equation}
   277: 
   278: Whilst these two expressions produce equivalent results, the form on the right
   279: can execute both calls to $\Map$ in parallel (in a pure functional language, at
   280: least); a fact which is exploited by the ``map/reduce'' parallel programming
   281: paradigm.
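
Such properties can be tested empirically on random inputs. For illustration,
the following Haskell sketch (not taken from any of the tools discussed below)
checks Equation~\ref{eq:mapreduce} with QuickCheck, fixing the element types
to \texttt{Int} for testability:

\begin{verbatim}
import Test.QuickCheck

-- Check Equation 1 on random inputs; `Fun Int Int` generates
-- random functions, which are extracted with `applyFun`.
prop_mapAppend :: Fun Int Int -> [Int] -> [Int] -> Bool
prop_mapAppend f xs ys =
  map (applyFun f) (xs ++ ys)
    == map (applyFun f) xs ++ map (applyFun f) ys

main :: IO ()
main = quickCheck prop_mapAppend
\end{verbatim}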
   282: 
   283: Such use cases require the ability to \emph{discover} true properties of some
   284: particular definitions; this requires not only \emph{proving} their correctness,
   285: but also \emph{posing} them as conjectures in the first place. This is a hard
   286: problem in general, but presents opportunities for automation due to the
   287: precise, symbolic nature of the domain. These processes can also feed into each
   288: other, since proven properties can be used as lemmas for subsequent proofs. As
   289: an example, proving Equation~\ref{eq:mapreduce} may require ``background
   290: knowledge'', such as the property $\forall x. \Append(x, \Nil) = x$, which may
   291: be tedious to produce by hand but possible to discover automatically.
   292: 
   293: \subsection{Automating Construction and Exploration of Theories}
   294: \label{sec:te}
   295: 
   296: The task of automatically proving or disproving a \emph{given} conjecture has
   297: been studied extensively in the field of Automated Theorem Proving (ATP). Less
   298: attention has been paid to \emph{generating} such conjectures automatically,
   299: either from a given set of definitions (the task we will focus on) or where the
   300: definitions are also generated. This research has been pursued under the names
   301: of Automated Theory Formation (ATF) and Mathematical Theory Exploration (MTE);
   302: we will use the MTE terminology, and discuss their relationship further in
   303: $\S$\ref{sec:related-work}.
   304: 
   305: A major challenge when generating conjectures is choosing how to narrow down the
   306: resulting set to only those deemed ``interesting'', since this is an imprecise
   307: term with many different interpretations. For example, all existing approaches
   308: agree that simple tautologies are ``uninteresting'', but differ when it comes to
   309: more complex statements.  Colton \etal{} surveyed tools for theory formation
   310: and exploration, and their associated notions of ``interestingness'' for
   311: concepts and conjectures~\cite{colton2000notion}. Six qualities were identified,
   312: which are applied to conjectured properties as follows:
   313: 
   314: {\bf Empirical plausibility} checks whether a property holds across some
   315: specific examples. This is especially useful for avoiding falsehoods, without
   316: resorting to a deductive proof search.
   317: 
{\bf Novelty} depends on whether the property, or an isomorphic or more
general one, has already been seen.

{\bf Surprisingness} captures whether or not a property is ``obvious'', for
example an instance of a tautology.
   323: 
   324: {\bf Applicability} depends on the number of models in which a
   325: property holds. High applicability makes a property more interesting, but so
   326: does zero applicability (i.e. non-existence).
   327: 
   328: {\bf Comprehensibility} depends on the syntactic complexity of the property's
   329: statement. Simpler statements are considered more interesting (which favours
   330: tools adopting a small-to-large search order).
   331: 
   332: {\bf Utility} is the relevance or usefulness of a property to the user's
   333: particular task. For example, if we want to find properties which are useful for
   334: optimisation, like Equation~\ref{eq:mapreduce}, then utility would include
   335: whether the property justifies some rewrite rule, the difference in resource
   336: usage of the expressions involved, and how common those expressions are in real
   337: usage.
   338: 
   339: Attempting to \emph{measure} such qualities is difficult, and having too many
   340: measures complicates comparisons. System developers have employed a more
   341: practical alternative in their evaluations, which is to perform
   342: \emph{precision/recall} analysis against a \emph{ground-truth}. This requires
   343: choosing a set of definitions (such as those in Figure~\ref{figure:list_theory})
   344: and a set of properties (the ground truth) which represent the ``ideal'' result
   345: of conjecture generation for those definitions (e.g. this might include
   346: Equation~\ref{eq:mapreduce}). To analyse the quality of a tool's conjectures, we
   347: run it on these chosen definitions and compare its output to the ground truth:
   348: 
   349: \begin{itemize}
   350: \item \emph{Precision} is the proportion of a tool's output which appears in
   351:   the ground truth (the ratio of true positives to all positives). This
   352:   penalises overly-liberal tools which output a large number of properties in
   353:   the hope that some turn out to be ``good''.
   354: \item \emph{Recall} is the proportion of the ground truth which appears in the
   355:   tool's output (the ratio of true positives to actual positives). This
   356:   penalises overly-conservative tools which generate very few properties as a
   357:   way to avoid ``bad'' ones.
   358: \end{itemize}
   359: 
   360: To score 100\% on precision and recall, all of the properties which appear in
   361: the ground truth must have been conjectured, and nothing else. This gives us a
   362: simple method to evaluate and compare tools without requiring a general solution
   363: to the question of what is ``interesting''; although we must still decide what
   364: to put in the ground truth set, and what to leave out, for each measured set of
   365: definitions.
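
For illustration, once conjectures and ground truth have been normalised into
a common form (represented simply as strings in this sketch, an assumption
made for brevity), both measures reduce to set operations:

\begin{verbatim}
import qualified Data.Set as Set

-- Precision and recall over normalised property representations
-- (plain Strings here). Both are undefined when the corresponding
-- denominator set is empty.
precision, recall :: Set.Set String -> Set.Set String -> Double
precision output truth = hits output truth / size output
recall    output truth = hits output truth / size truth

hits :: Set.Set String -> Set.Set String -> Double
hits a b = size (Set.intersection a b)

size :: Set.Set String -> Double
size = fromIntegral . Set.size
\end{verbatim}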
   366: 
   367: \begin{sloppypar}
   368:   The precision and recall of three MTE tools are compared
   369:   in~\cite{claessen2013automating}:
   370:   IsaCoSy~\cite{Johansson.Dixon.Bundy:conjecture-generation},
   371:   IsaScheme~\cite{Montano-Rivas.McCasland.Dixon.ea:2012} and
   372:   HipSpec~\cite{Claessen_hipspec:automating}.\footnote{Unfortunately this
   373:     comparison only reports prior results for each tool, rather than reproducing
   374:     them. This makes the numbers less comparable, since each tool was tested
   375:     with slightly different definitions (e.g. whether or not the exponential and
   376:     maximum functions were included).} The definitions and ground truths for all
   377:   three were taken from the standard library of the Isabelle proof assistant.
   378:   The fact that the library authors expended the effort of stating, proving and
   379:   bundling these properties with every copy of their software is a good
   380:   indication that they are interesting. Two sets of definitions were used: a
   381:   theory of natural numbers; and a theory of lists (which we show in
   382:   Figure~\ref{figure:list_theory} on page~\pageref{figure:list_theory}). With
   383:   only two datapoints, each containing only a few definitions and properties (4
   384:   functions and 12 properties, in the natural number example), it is difficult
   385:   to draw general conclusions; yet despite these limitations we believe that
this is a viable method for performing meaningful, objective analyses and
   387:   comparisons for such tools. It is only with larger data sets that we can
   388:   generalise the measured performance to different, especially \emph{novel},
   389:   domains (which, after all, is the purpose of MTE tools). In addition, we might
   390:   be concerned that such ``popular'' theories as the natural numbers may be
   391:   unrepresentative of typical usage. For example, we would not expect the
   392:   library functions in a typical verification problem to be as mathematically
   393:   rich as number theory.
   394: \end{sloppypar}
   395: 
   396: \subsection{Existing Tools}
   397: \label{sec:existing-tools}
   398: 
   399: Since our benchmarking methodology builds on the precision/recall analysis
described in $\S$\ref{sec:te}, we decided to use two of the same tools, IsaCoSy
and QuickSpec (the conjecture-generating component of HipSpec), for our
demonstration.
   403: 
   404: IsaCoSy (Isabelle Conjecture Synthesis) is written for the Isabelle proof
   405: assistant, mostly in Standard
   406: ML~\cite{Johansson.Dixon.Bundy:conjecture-generation}. It conjectures equations
   407: by enumerating expressions involving a given set of (typed) constants and free
   408: variables (a \emph{signature}). A constraint solver forbids certain
   409: (sub)expressions from being synthesised, and these constraints are extended
   410: whenever a new property is conjectured, to avoid generating any special-cases of
   411: this property in the future. Conjectures are tested by looking for
   412: counterexamples and, if none are found, sent to IsaPlanner which attempts to
   413: prove them.
   414: 
   415: QuickSpec emerged from work on the QuickCheck software testing framework for the
   416: Haskell programming language~\cite{claessen2011quickcheck}. As well as
   417: conjecturing properties of Haskell definitions for HipSpec, it has been applied
   418: to Isabelle definitions via the Hipster tool~\cite{Hipster}. Like IsaCoSy,
   419: QuickSpec takes a signature and enumerates well-typed terms. It collects
   420: together those of the same type into equivalence classes, assuming them to be
   421: equal, then uses QuickCheck to find counterexamples to this assumption by
   422: randomly instantiating the variables and comparing the resulting closed
expressions. Any equivalence class whose elements do not all compare equal is
split up, and the process is repeated with new random values.
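
For illustration, one such testing round can be sketched as follows; this is a
simplification rather than QuickSpec's actual implementation, and the term,
input and result types are left abstract:

\begin{verbatim}
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- One testing round: split each equivalence class according to
-- how its terms evaluate on this round's random inputs.
refine :: Ord v => [i] -> (t -> i -> v) -> [[t]] -> [[t]]
refine inputs eval = concatMap split
  where
    key t   = map (eval t) inputs
    split c = groupBy ((==) `on` key) (sortOn key c)
\end{verbatim}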
   425: 
   426: After 500 rounds of testing, any expressions still sharing the same equivalence
   427: class are conjectured to be equal for all values of their variables. Rather than
   428: using a constraint system to prevent redundancies (such as special-cases of
   429: other properties), QuickSpec instead passes its output through a congruence
   430: closure algorithm to achieve the same effect.
   431: 
   432: \section{Theory Exploration Benchmark}
   433: \label{sec:proposal}
   434: 
   435: Our main contribution is a benchmarking methodology, shown in
   436: Figure~\ref{figure:flow_chart} on page~\pageref{figure:flow_chart}, which both
   437: generates a large definition/ground-truth corpus, and provides a scalable,
   438: statistical approach to evaluating MTE tools using this corpus. We follow a
   439: precision/recall approach similar to prior work, with the main difference being
   440: the source of definitions and ground truths: we take existing problem sets
   441: designed for automated theorem proving, and adapt their content for use in the
   442: conjecture generation setting.
   443: 
   444: \subsection{Preparation}
   445: \label{section:prep}
   446: 
   447: Automated theorem proving is an active area of research, with large problem sets
   448: and regular competitions to prove as much as possible, as fast as
   449: possible~\cite{pelletier2002development}. These problem sets are an opportunity
   450: for MTE, as their definitions and theorems can be used as a corpus in the same
   451: way that Isabelle libraries have been used in the past.
   452: 
Some problem sets are more amenable to this purpose than others. The most
   454: suitable are those meeting the following criteria:
   455: 
   456: \begin{itemize}
   457: \item For each problem, there should be a clear distinction between the
   458:   statement to be proved and the definitions involved, such that the two can be
   459:   easily and meaningfully separated. This rules out problem sets like those of
   460:   SMT-COMP~\cite{barrett2005smt}, where many problems involve uninterpreted
   461:   functions, whose behaviour is \emph{implicit} in the logical structure of the
   462:   theorem statement but not separately \emph{defined}.
   463: \item Definitions should be translatable into a form suitable for the MTE tools
   464:   under study. Our application in $\S$\ref{sec:application} requires Haskell and
   465:   Isabelle translations, and also benefits from having definitions be strongly
   466:   typed.
   467: \item The problem set should be relevant to the desired domain. We focus on pure
   468:   functional programming, which requires higher-order functions and inductively
   469:   defined datatypes, which rules out first-order languages/logics (such as
   470:   TPTP~\cite{sutcliffe2009tptp}).
   471: \item The problem set should ideally contain \emph{every} ``interesting''
   472:   property involving its included definitions, since non-membership in the
   473:   ground truth will be treated as being ``uninteresting''. More realistically,
  we should aim for each definition to have \emph{multiple} properties, rather
  than a mixture of unrelated problems cherry-picked from different domains.
   476: \item Larger problem sets are preferable as they give more robust statistics, all
   477:   else being equal (i.e. when this does not sacrifice quality).
   478: \end{itemize}
   479: 
   480: Once such a problem set has been chosen, we must separate the definitions from
   481: the theorem statements which reference those definitions. The definitions will
   482: be used as input to the MTE tools, whilst the theorem statements form the ground
   483: truth corpus of properties (which tool output will be compared against).
   484: 
It is important to ensure that there are no duplicate definitions: we are only
concerned with the \emph{logical} content of the input, not more arbitrary
aspects of its presentation, such as the names of functions. For example, consider
   488: a problem set which includes a statement of commutativity for a \texttt{plus}
   489: function, and of associativity for an \texttt{add} function, where the
   490: definitions of \texttt{plus} and \texttt{add} are $\alpha$-equivalent. We would
   491: expect an MTE tool to either conjecture commutativity and associativity for
   492: \emph{both} functions, or for \emph{neither} function, since they are logically
   493: equivalent. Yet a na\"ive precision/recall analysis would treat commutativity of
   494: \texttt{add} and associativity of \texttt{plus} as \emph{uninteresting}, since
   495: they don't appear in the ground truth.
   496: 
   497: For this reason, duplicates should be removed, and any references to them
   498: updated to use the remaining definition (e.g. chosen based on lexicographic
   499: order). In the above example, the \texttt{plus} function would be removed, and
   500: the commutativity statement updated to reference \texttt{add} instead.
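
For illustration, assuming definitions have already been reduced to some
canonical, name-independent form (e.g. with identifiers replaced by de
Bruijn-style indices), the de-duplication itself is a one-liner:

\begin{verbatim}
import qualified Data.Map as Map

-- Keep one name per canonical definition; `min` retains the
-- lexicographically-smallest name when duplicates collide, so
-- `plus` would be dropped in favour of `add`.
dedupe :: Ord c => [(String, c)] -> Map.Map c String
dedupe defs = Map.fromListWith min [ (c, n) | (n, c) <- defs ]
\end{verbatim}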
   501: 
   502: \subsection{Sampling}
   503: \label{section:sampling}
   504: 
   505: We could, in theory, send these de-duplicated definitions straight into an MTE
   506: tool and use the entire set of properties (taken from the theorem statements) as
   507: the ground truth for analysis. However, this would cause two problems:
   508: 
   509: \begin{itemize}
   510: \item The result would be a single data point, which makes it difficult to
   511:   infer performance \emph{in general}.
   512: \item It is impractical to run existing MTE tools on inputs containing more
   513:   than a few dozen definitions.
   514: \end{itemize}
   515: 
   516: To solve both of these problems we instead \emph{sample} a subset of
   517: definitions.\footnote{This could be done randomly, but for reproducibility we
   518:   use a deterministic order based on cryptographic hashes.} The sample size is a
   519: free parameter so that users have some control over how long the runs will take
   520: ($\S$\ref{sec:application} shows that larger samples take longer, but the
   521: correlation is very noisy). Given a sample size, we need to choose a subset of
   522: that many definitions from the corpus, and provide only those as input to the
   523: tool.
   524: 
   525: We cannot expect an MTE tool to produce all of the properties found in the
   526: corpus, if it is only given a subset of definitions as input. Hence for each
   527: sample we generate a corresponding ground truth by selecting those properties
   528: from the corpus which ``depend on'' (contain references to) \emph{only} the
   529: definitions in that sample. Transitive dependencies aren't required (e.g. a
   530: property involving only a \texttt{times} function would not depend on a
   531: \texttt{plus} function, even if \texttt{plus} occurs in the definition of
   532: \texttt{times}).
   533: 
   534: The question remains of how to pick definitions for each sample. One simple way
   535: is to choose uniformly without replacement, but this will give many samples
   536: whose corresponding ground truths are empty. The effect is similar to a lottery,
   537: where our sample is a ticket and must contain all of some property's
   538: dependencies in order to ``win'' (i.e. contain that property in its ground
truth). In particular, if we choose a sample of size $S$ uniformly at random
from a corpus containing $N$ definitions and $P$ properties which depend on
$D$ definitions each (taking $D$ as an average, and assuming no two properties
have exactly the same dependencies), then the probability that our sample
contains all the dependencies of some particular property $p$ follows a
hypergeometric distribution (similar to the binomial distribution, but without
replacement):

\begin{equation*}
  Pr(p) = \frac{\binom{N - D}{S - D}}
               {\binom{N}{S}}
\end{equation*}
   550: 
Treating properties as independent, covering at least one property's
dependencies is a Bernoulli process with $P$ repetitions, succeeding with
probability $1 - (1 - Pr(p))^P$. This probability gets larger if we
   553: increase the sample size $S$ or the number of properties $P$, but it gets
   554: smaller if we increase the number of definitions $N$ or the average number of
   555: dependencies $D$. This is problematic because we would like large corpora for
   556: better statistics, but this would make empty ground truths more likely.
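
For concreteness, these probabilities can be computed with a few lines of
Haskell (a sketch, using exact \texttt{Integer} arithmetic for the binomial
coefficients):

\begin{verbatim}
-- Binomial coefficient, valid for 0 <= k <= n.
choose :: Integer -> Integer -> Integer
choose n k = product [n - k + 1 .. n] `div` product [1 .. k]

-- Probability that a uniform size-s sample of n definitions
-- contains all d dependencies of one particular property.
prDeps :: Integer -> Integer -> Integer -> Double
prDeps n s d =
  fromIntegral ((n - d) `choose` (s - d)) / fromIntegral (n `choose` s)

-- Probability of covering at least one of p properties, treating
-- them as independent trials.
prAny :: Integer -> Integer -> Integer -> Integer -> Double
prAny n s d p = 1 - (1 - prDeps n s d) ^ p
\end{verbatim}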
   557: 
   558: Samples whose corresponding ground truth is empty are problematic because they
   559: have $0$ precision and undefined recall (due to division by $0$). Such samples
   560: are hence unsuitable for our benchmark, especially since these results would be
   561: \emph{independent} of the tool's output (all tools would fare the same, so no
   562: comparisons could be made). Other measurements could be used for such cases, but
   563: since our choice of analysis comes from existing work we would rather avoid such
   564: samples by only allowing those with a non-empty ground truth (i.e. containing
   565: all dependencies of at least one corpus property).
   566: 
We can achieve this by performing uniform sampling \emph{with rejection},
where any sample with an empty ground truth is rejected and a replacement
sample is drawn.
   570: 
   571: One problem with performing rejection sampling is that it becomes very
   572: inefficient when the probability of rejection is high. In our case there is a
   573: more direct method of sampling, whose output is exactly equivalent to rejection
   574: sampling but is far more efficient.
   575: 
   576: Since non-rejected samples must contain the dependencies of some property, we
   577: can consider each such sample to be the union of those dependencies and some
   578: other ``padding'' definitions. By choosing these two components separately and
   579: returning their union, we can obtain an identical distribution without having to
   580: perform any rejection.
   581: 
   582: We first choose the set of dependencies by picking from those of all corpus
   583: properties, weighted according to the probability that they would appear under
   584: uniform sampling. To do this we first discard any dependency sets which are
   585: larger than the given sample size (since they cannot appear as subsets); we then
   586: discard any dependency set which is a superset of another (since only the subset
   587: is required to avoid rejection). Their probabilities under uniform sampling
   588: depend only on their size, so we assign weights by calculating the least
   589: common multiple (LCM) of their sizes, and dividing that by each set's size. For
   590: example, given dependency sets $X = \{a, b, c\}$, $Y = \{b, d\}$ and
   591: $Z = \{a, b, e, f\}$ the LCM of their lengths ($3$, $2$ and $4$) is 12, so the
   592: weighting will be:
   593: 
\begin{equation*}
  w(X) = \frac{12}{3} = 4 \qquad
  w(Y) = \frac{12}{2} = 6 \qquad
  w(Z) = \frac{12}{4} = 3
\end{equation*}
   599: 
   600: Once we've chosen a dependency set, we know how much padding to choose (sample
   601: size minus dependency set size). The padding will be uniformly distributed,
   602: since it does not affect the rejection criterion.
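
The whole procedure can be sketched as follows, assuming the dependency sets
have already been filtered and weighted as described above (a sketch, not our
exact implementation):

\begin{verbatim}
import qualified Data.Set as Set
import System.Random (randomRIO)

-- Draw one dependency set (weighted), then pad it with uniformly
-- chosen definitions up to the sample size; re-drawing on
-- collisions makes the padding uniform without replacement.
sample :: Int -> [String] -> [(Set.Set String, Int)]
       -> IO (Set.Set String)
sample wanted defs depSets = pickWeighted depSets >>= pad
  where
    pad s | Set.size s >= wanted = pure s
          | otherwise = do
              i <- randomRIO (0, length defs - 1)
              pad (Set.insert (defs !! i) s)

pickWeighted :: [(a, Int)] -> IO a
pickWeighted xs = do
  r <- randomRIO (1, sum (map snd xs))
  pure (go r xs)
  where
    go r ((x, w) : rest)
      | r <= w    = x
      | otherwise = go (r - w) rest
    go _ []       = error "pickWeighted: no dependency sets"
\end{verbatim}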
   603: 
   604: Using rejection sampling or, equivalently, our direct method, ensures that the
   605: analysis is well defined and depends meaningfully on the tool's output. One
   606: downside of applying this non-empty ground truth restriction is that we limit
   607: the number of possible samples to measure; this is significant for sample sizes
   608: less than~3.
   609: 
   610: \subsection{Evaluation}
   611: \label{section:evaluation}
   612: 
   613: Given a sample of definitions and a corresponding ground truth, the actual
   614: execution of the MTE tool proceeds as in prior work. We must translate the
   615: chosen definitions into the required input format, then we time the execution
   616: with a timeout (e.g. 5 minutes). We use wall-clock time since
\begin{enumerate*}[i)]%
   618: \item this is most relevant to a user's experience,
   619: \item it has a straightforward interpretation and
   620: \item measuring it does not require intrusive alterations to a tool's
   621:   implementation.
\end{enumerate*}
   623: The downside is that comparisons must take hardware performance into account,
   624: which would not be the case if time were measured by some proxy like number of
   625: expressions evaluated.
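
For illustration, wall-clock timing under a timeout can be captured as follows
(a sketch, assuming the tool invocation is available as an \texttt{IO} action;
it is not our actual harness):

\begin{verbatim}
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Timeout (timeout)

-- Run an action under a time limit (in seconds), returning its
-- result (Nothing on timeout) and the wall-clock time taken.
timed :: Int -> IO a -> IO (Maybe a, Double)
timed limit act = do
  start  <- getCurrentTime
  result <- timeout (limit * 1000000) act
  end    <- getCurrentTime
  pure (result, realToFrac (diffUTCTime end start))
\end{verbatim}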
   626: 
   627: In our experiments we have found that memory usage is also an important part of
   628: a tool's performance, but rather than complicating our analysis with an extra
   629: dimension, we instead allow programs to use as much memory as they like, and
   630: either get killed by the operating system or slow down so much from swapping
   631: that they time out. This is in line with the expected usage of these tools:
   632: either there is enough memory, or there isn't; implementations should not be
   633: penalised for making use of available resources.
   634: 
   635: To calculate precision and recall, the conjectures generated by each tool, as
   636: well as the ground-truth properties, need to be parsed into a common format and
   637: compared syntactically after normalising away ``irrelevant'' details. We
   638: consider variable naming to be irrelevant, which can be normalised by numbering
   639: free variables from left to right and using de Bruijn indices for bound
   640: variables. We also consider the left/right order of equations to be irrelevant,
   641: which we normalise by choosing whichever order results in the
   642: lexicographically-smallest expression.
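
For example, once both sides of an equation have been rendered into canonical
strings, the orientation step amounts to the following sketch:

\begin{verbatim}
-- Put the lexicographically-smaller rendering first, so that
-- `x = y` and `y = x` normalise to the same equation.
orient :: (String, String) -> (String, String)
orient (lhs, rhs)
  | rhs < lhs = (rhs, lhs)
  | otherwise = (lhs, rhs)
\end{verbatim}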
   643: 
   644: We specifically \emph{ignore} other logical relationships between syntactically
   645: distinct statements, such as one equation being implied by another. Whilst
   646: logically sound, second-guessing the ground truth in this way would harm other
   647: aspects which influence interestingness (for example, a more general statement
   648: is more widely applicable, but it might also be harder to comprehend).
   649: 
This procedure gives us, for each sample, a set of generated conjectures, a
set of ground truth properties (from which precision and recall can be
calculated) and a single runtime. We propose two methods for analysing this
   653: data: \emph{summarising} the performance of a single MTE tool and
   654: \emph{comparing} the performance of two tools.
   655: 
   656: \subsection{Summarising}
   657: 
   658: Each sample is only explored once by each tool, so that we cover as many
   659: independent samples as possible to better estimate how a tool's performance
   660: generalises to unseen inputs. How we combine these data into an aggregate
   661: summary depends on what we are interested in measuring. One general question we
   662: might ask is how a tool's performance scales with respect to the input size (the
   663: number of given definitions). This is straightforward to measure by varying the
   664: sample size, but we need some way to combine the results from samples of the
   665: same size.
   666: 
   667: We can summarise a set of runtimes by choosing the median, as this is more
   668: robust than the mean against long-running outliers, and hence represents
   669: performance for a ``typical'' input of that size. Since our methodology measures
   670: performance over many samples, we can also compute the \emph{spread} of our
   671: data, for example the inter-quartile range. This is not possible using the
   672: one-off measurements common in prior evaluation methods.
   673: 
   674: Precision and recall are ratios, which can be averaged (separately) in two ways,
   675: known as the \emph{Average of Ratios} (AoR) and the \emph{Ratio of Averages}
   676: (RoA)~\cite{egghe2012averages}. The Average of Ratios is the mean value. Using
   677: the definitions of precision and recall given in $\S$\ref{sec:te}, we can define
   678: their means for $S$ samples indexed by $1 \leq i \leq S$ as follows:
   679: 
   680: \begin{align*}
   681:        \overline{\text{precision}}_{AoR}
   682:     &= \frac{1}{S} \sum_i{\text{precision}_i} \\
   683:     &= \frac{1}{S} \sum_i{
   684:          \frac{\#\text{interesting}_i}
   685:               {\#\text{generated}_i}}         \\[10pt]
   686:        \overline{\text{recall}}_{AoR}
   687:     &= \frac{1}{S} \sum_i{\text{recall}_i} \\
   688:     &= \frac{1}{S} \sum_i{
   689:          \frac{\#\text{interesting}_i}
   690:               {\#\text{groundtruth}_i}}
   691: \end{align*}
   692: 
   693: We interpret these mean values as the \emph{expected} precision and recall for
   694: samples of this size. In particular, these tell us the quality of the \emph{set}
   695: of conjectures that will be generated if we were to run the tool on such a
   696: sample, but they don't give us direct information about the individual
   697: conjectures themselves. Since these are mean values, they are the reference
   698: point for measures of spread such as the variance, standard deviation and mean
   699: absolute deviation.
   700: 
   701: The Ratio of Averages is \emph{pooled}, combining the (multi)sets from each
   702: sample before taking the ratios. The size of these pooled sets is simply the sum
   703: of the individual set sizes, and ratios of these sums are equal to ratios of the
   704: means (since the normalising factor $\frac{1}{S}$ appears in both numerator and
   705: denominator):
   706: 
   707: \begin{align*}
   708:        \overline{\text{precision}}_{RoA}
   709:     &= \frac{\sum_i{\#\text{interesting}_i}}
   710:             {\sum_i{\#\text{generated}_i}}                              \\
   711:     &= \frac{\frac{1}{S} \sum_i{\#\text{interesting}_i}}
   712:             {\frac{1}{S} \sum_i{\#\text{generated}_i}}                  \\
   713:     &= \frac{\hspace{5pt} \overline{\#\text{interesting}} \hspace{5pt}}
   714:             {\hspace{5pt} \overline{\#\text{generated}}   \hspace{5pt}} \\[10pt]
   715:        \overline{\text{recall}}_{RoA}
   716:     &= \frac{\sum_i{\#\text{interesting}_i}}
   717:             {\sum_i{\#\text{groundtruth}_i}}                            \\
   718:     &= \frac{\frac{1}{S} \sum_i{\#\text{interesting}_i}}
   719:             {\frac{1}{S} \sum_i{\#\text{groundtruth}_i}}                \\
   720:     &= \frac{\hspace{5pt} \overline{\#\text{interesting}} \hspace{5pt}}
   721:             {\hspace{5pt} \overline{\#\text{groundtruth}} \hspace{5pt}}
   722: \end{align*}
   723: 
   724: Larger sets contribute more to a Ratio of Averages, unlike the mean which treats
   725: each set equally. This weighting gives us expected qualities of
   726: \emph{conjectures} rather than sets: the RoA for precision is the expected
   727: chance that some particular generated conjecture will appear in the ground
   728: truth; the RoA for recall is the expected chance that some particular ground
   729: truth property will appear in the generated output.
   730: 
   731: These averages can differ when there is a large variation in set size. For
   732: example, if we generate a single, interesting conjecture for one sample and 99
   733: uninteresting conjectures for another sample, the AoR (mean) precision will be
   734: $\frac{1}{2}$ but the RoA precision will be $\frac{1}{100}$. This highlights the
   735: importance of knowing \emph{absolute} numbers of conjectures, such as the mean
   736: output size $\overline{\#\text{generated}}$ and its spread, alongside the
   737: proportions given by precision and recall.
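
For illustration, given per-sample counts of interesting and generated
conjectures, the two averages for precision can be computed as follows (the
final comment reproduces the example above):

\begin{verbatim}
-- Per-sample (interesting, generated) counts.
aor, roa :: [(Int, Int)] -> Double
aor samples =
  sum [ fromIntegral i / fromIntegral g | (i, g) <- samples ]
    / fromIntegral (length samples)
roa samples =
  fromIntegral (sum (map fst samples))
    / fromIntegral (sum (map snd samples))

-- aor [(1, 1), (0, 99)] == 0.5; roa [(1, 1), (0, 99)] == 0.01
\end{verbatim}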
   738: 
   739: It is also possible to combine precision values \emph{with} recall values
   740: (either individually or aggregated) by calculating their \emph{F-score}, defined
   741: as:
   742: 
   743: \begin{equation*}
   744:   F = 2 \times \frac{\text{precision} \times \text{recall}}
   745:                     {\text{precision} +      \text{recall}}
   746: \end{equation*}
   747: 
   748: This is the harmonic mean of precision and recall, ranging between 0 and 1. The
   749: F-score summarises performance into a single number, which is useful for
   750: purposes like ranking, but it obscures insights we might gain from looking at
   751: precision and recall separately (e.g. whether a tool is generating too much
   752: output or too little).
   753: 
   754: \subsection{Comparison}
   755: 
   756: To measure progress and encourage competition in the field, it is important to
   757: compare the relative performance of different tools on the same task. Since the
aggregate statistics in the above summaries obscure the details of
   759: specific runs, any comparison based on them (such as comparing mean recall)
   760: would have very low statistical power. More direct comparisons can be made using
   761: \emph{paired} tests, since for each individual sample we have measurements for
   762: \emph{both} tools.
   763: 
   764: Running times do not follow a normal distribution, in particular due to the
   765: lower bound at 0, which complicates any application of standard comparisons like
   766: the paired z-test. An alternative which does not assume normality is the
   767: Wilcoxon signed-rank test~\cite{wilcoxon1945individual}, which has been applied
   768: to software benchmarking in the Speedup-Test protocol of Touati, Worms and
   769: Briais~\cite{touati2013speedup}. Our methodology differs by measuring with a
   770: different sample each time, rather than repeatedly measuring one sample of each
   771: size; measuring both tools on the \emph{same} samples allows the paired form of
   772: the Wilcoxon test to be used.
   773: 
   774: Another complication with our time measurements is the censoring effect caused
by timeouts. Paired difference tests are robust to this, since the upper bound
on total time imposes an upper bound on the observable difference; this biases
   777: our conclusions in a conservative way, \emph{towards} the null hypothesis of
   778: indistinguishable performance.
   779: 
   780: Quality can be compared using precision and recall, but we must choose how to
   781: handle samples where one tool succeeded and the other failed (e.g. by timing
   782: out). Failed runs can be treated as if they had succeeded with an empty set of
   783: conjectures; this penalises failure-prone tools and simplifies analysis by
   784: ensuring every sample produces a data point. An alternative is to discard
   785: samples which caused either tool to fail, and compare only those where both
   786: tools finished successfully. We follow this second approach, since it is more
   787: charitable to the tools being evaluated (which, after all, were not designed
   788: with our benchmark in mind). With fewer data points to analyse, we pool together
   789: samples of all sizes (allowing duplicates) and compare proportions from these
   790: aggregations.
   791: 
   792: Recall relies on a fixed set of properties, the ground truth, so we can compare
   793: tools using McNemar's test for paired samples~\cite{mcnemar1947note}. For each
   794: tool, we count how many of the ground-truth properties it found which were
   795: \emph{not} found by the other tool; i.e. those occasions where one tool was
   796: objectively better than the other. McNemar's test determines whether we can
   797: reject the null hypothesis that both counts are the same.
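
For illustration, the discordant counts entering the test can be extracted as
follows (a sketch, with each pooled ground-truth property tagged by whether
tool A and tool B found it):

\begin{verbatim}
-- For each pooled ground-truth property, whether tool A and
-- tool B found it; McNemar's test compares the two counts.
discordant :: [(Bool, Bool)] -> (Int, Int)
discordant found =
  ( length [ () | (True , False) <- found ]
  , length [ () | (False, True ) <- found ] )
\end{verbatim}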
   798: 
   799: Comparing precision cannot be done in the same way as recall, since we do not
   800: have a fixed set of properties to compare both tools on (there are an unlimited
   801: number of properties that may or may not appear in each tool's output). Instead,
   802: we can count how many of the generated conjectures were interesting and
   803: uninteresting for each tool, and use Boschloo's form of Barnard's
   804: test~\cite{lydersen2009recommended} to determine if these proportions are
   805: independent. This assumes that precision follows a binomial distribution, and
   806: the test is conditioned on the number of generated conjectures (as if this were
   807: fixed by design, rather than determined by the tool).
   808: 
   809: \begin{figure}
   810:   \centering
   811:   \tikzstyle{startstop} = [rectangle, rounded corners, minimum width=3cm,
   812:                            minimum height=1cm,text centered, draw=black]
   813: 
   814:   \tikzstyle{io} = [trapezium, trapezium left angle=70,
   815:                     trapezium right angle=110, minimum width=1cm,
   816:                     minimum height=1cm, text centered, draw=black]
   817: 
   818:   \tikzstyle{process} = [rectangle, minimum width=3cm, minimum height=1cm,
   819:                          text centered, draw=black]
   820:   \tikzstyle{decision} = [diamond, minimum width=3cm, minimum height=1cm,
   821:                           text centered, draw=black]
   822: 
   823:   \tikzstyle{arrow} = [thick,->,>=stealth]
   824: 
   825:   % Avoids too much padding in io shapes
   826:   \tikzset{trapezium stretches=true}
   827: 
   828:   \begin{tikzpicture}[node distance=1cm]
   829: 
   830:     % Preparation section
   831: 
   832:     \node (in)  [io]{Theorem Proving Benchmark};
   833:     \node (sep) [process,       below=1cm of in         ]{Separate definitions from property statements};
   834:     \node (def) [process, below  left=1.5cm and -1cm of sep]{Remove duplicate definitions};
   835:     \node (ref) [process, below right=1.5cm and -1cm of sep]{Update references};
   836: 
   837:     \node (thy)  [startstop, below=1cm of def]{Distinct Definitions};
   838:     \node (thm)  [startstop, below=1cm of ref]{Corpus of Properties};
   839: 
   840:     \draw [arrow] (in)  -- (sep);
   841:     \draw [arrow] (def) -- (ref);
   842:     \draw [arrow] (def) -- (thy);
   843:     \draw [arrow] (ref) -- (thm);
   844: 
   845:     % Arrows with labels
   846:     \path[->]
   847:         (sep) edge [arrow, sloped, above] node {definitions} (def)
   848:         (sep) edge [arrow, sloped, above] node {properties}  (ref);
   849: 
   850:     % Sampling section
   851: 
   852:     \node (choose) [process, below=of thm   ]{Choose a property};
   853:     \node (deps)   [process, below=of choose]{List property's dependencies};
   854: 
   855:     % Create dummy coordinate below deps, then use its y coordinate for pad
   856:     \coordinate [below=of deps] (padDummy);
   857:     \path let \p{dummy} = (padDummy),
   858:               \p{in}    = (in)
   859:               in coordinate (padPos) at (\x{in}, \y{dummy});
   860:     \node (pad)  [process, at=(padPos)]{Pad list to form sample};
   861: 
   862:     \node (sthm) [startstop, below right=1.5cm and -1cm of pad]{Ground Truth};
   863:     \node (sthy) [startstop, below  left=1.5cm and -1cm of pad]{Sampled Theory};
   864: 
   865: 
   866:     % Calculate position of (size) using let, then define as normal
   867:     \path let \p{pad}    = (pad),
   868:               \p{choose} = (choose)
   869:            in coordinate (sizePos) at (\x{pad},\y{choose});
   870:     \node (size) [io, at=(sizePos)] {Sample size};
   871: 
   872:     % Evaluation section
   873:     \path let \p{pad}  = (pad),
   874:               \p{sthm} = (sthm)
   875:            in coordinate (runPos) at (\x{pad}, \y{sthm});
   876:     \node (run) [process, below=of runPos]{Run MTE tool};
   877:     \node (pr)  [process, below=of run]{Analysis};
   878: 
   879:     \node (prec) [startstop, below=of pr  ]{Precision};
   880:     \node (time) [startstop,  left=of prec]{Time taken};
   881:     \node (rec)  [startstop, right=of prec]{Recall};
   882: 
   883:     \draw [arrow] (thy)    |- (pad);
   884:     \draw [arrow] (thm)    -- (choose);
   885:     \draw [arrow] (choose) -- (deps);
   886:     \draw [arrow] (deps)   |- (pad);
   887:     \draw [arrow] (size)   -- (choose);
   888:     \draw [arrow] (size)   -- (pad);
   889:     \draw [arrow] (pad)    -- (sthm);
   890:     \draw [arrow] (pad)    -- (sthy);
   891:     \draw [arrow] (sthy)   |- (run);
   892:     \draw [arrow] (run)    -- (pr);
   893:     \draw [arrow] (sthm)   |- (pr);
   894:     \draw [arrow] (pr)     -- (prec);
   895:     \draw [arrow] (pr)     -- (rec);
   896:     \draw [arrow] (pr)     -- (time);
   897: 
   898:     % Awkward arrow
   899:     \draw [arrow] (thm) -| ([shift={(7mm,-7mm)}]thm.east) |- (sthm);
   900: 
   901:     % Braces
   902: 
   903:     % Preparation
   904:     \draw
   905:       let \p{thm} = (thm.east),
   906:           \p{in}  = (in.north),
   907:           \p{rec} = (rec.south east)
   908:        in [decorate,decoration={brace, amplitude=10pt}, xshift=0.5cm]
   909:        (\x{rec}, \y{in}) -- (\x{rec}, \y{thm})
   910:          node [black, midway, right, xshift=0.3cm]
   911:               {Preparation $\S$\ref{section:prep}};
   912: 
   913:     % Sampling
   914:     \draw
   915:       let \p{thm} = (thm.east),
   916:           \p{gt}  = (sthm.east),
   917:           \p{rec} = (rec.south east)
   918:        in [decorate,decoration={brace, amplitude=10pt}, xshift=0.5cm]
   919:        (\x{rec}, \y{thm}) -- (\x{rec}, \y{gt})
   920:          node [black, midway, right, xshift=0.3cm]
   921:               {Sampling $\S$\ref{section:sampling}};
   922: 
   923:     % Evaluation
   924:     \draw
   925:       let \p{sthm} = (sthm.east),
   926:           \p{rec}  = (rec.south east)
   927:        in [decorate,decoration={brace, amplitude=10pt}, xshift=0.5cm]
   928:        (\x{rec}, \y{sthm}) -- (\x{rec}, \y{rec})
   929:          node [black, midway, right, xshift=0.3cm]
   930:               {Evaluation $\S$\ref{section:evaluation}};
   931: 
   932:   \end{tikzpicture}
   933:   \caption[]{High-level view of the benchmarking methodology described in
   934:     $\S$\ref{sec:proposal}, showing
   935:     \begin{tikzpicture}
   936:       \node [rectangle, text centered, draw=black]{processes};
   937:     \end{tikzpicture},
   938:     \begin{tikzpicture}
   939:       \node [rectangle, rounded corners, text centered, draw=black]{data};
   940:     \end{tikzpicture} and
   941:     \begin{tikzpicture}
   942:       \node [trapezium, trapezium left angle=70, trapezium right angle=110,
   943:              text centered, draw=black]{inputs};
   944:   \end{tikzpicture}}
   945:   \label{figure:flow_chart}
   946: \end{figure}
   947: 
   948: \section{Application}
   949: \label{sec:application}
   950: 
   951: We have applied our benchmarking methodology to the QuickSpec and IsaCoSy MTE
   952: tools, using version 0.2 of the TIP (Tons of Inductive Problems) theorem proving
   953: benchmark as our ground truth corpus~\cite{claessen2015tip}.
   954: 
To determine our benchmarking parameters we ran some initial tests of both
tools on a few samples of sizes between 1 and 100, allowing an hour each, on
our benchmarking machine (a 3.2GHz dual-core Intel i5 processor with
hyper-threading and 8GB of RAM). Most QuickSpec runs either finished within
200 seconds or did not finish at all, and sizes above 20 mostly timed out.
IsaCoSy mostly finished within 300 seconds on sample sizes up to 4, but by
size 8 it was mostly timing out; its few successes above this took thousands
of seconds each, which we deemed infeasibly long.
   962: 
   963: Based on this we decided to benchmark sample sizes up to 20, since neither tool
   964: seemed to perform well beyond that. The Speedup-Test protocol follows the
   965: statistical ``rule of thumb'' of treating sample sizes $\leq$ 30 as ``small'',
so we picked 31 samples of each size in order to cross this threshold. This gave a
   967: total of 1240 runs. To keep the benchmarking time down to a few days we chose a
   968: timeout of 300 seconds, since that covered most of the successful QuickSpec and
   969: IsaCoSy results we saw, and longer times gave rapidly diminishing
   970: returns. During analysis, duplicate samples (caused by our requirement that
   971: samples have a non-empty ground truth) were found and discarded, so only 14
   972: samples of size 1 were used and 30 of size 2.
   973: 
   974: \subsection{Tons of Inductive Problems}
   975: \label{sec:tip}
   976: 
   977: We chose the Tons of Inductive Problems (TIP) benchmark for our ground truth
   978: since it satisfies the criteria specified in $\S$\ref{section:prep}: each
   979: benchmark problem has standalone type and function definitions, making their
   980: separation trivial; known examples from the software verification and inductive
   981: theorem proving literature are included, ensuring relevance to those fields; the
   982: format includes the higher-order functions and inductive datatypes we are
interested in; it is large enough to pose a challenge to current MTE tools; and
it is accompanied by tooling to convert its custom format (an extension of
SMT-LIB~\cite{BarFT-SMTLIB}) into a variety of languages, including Haskell and
Isabelle.
   987: 
We use TIP version 0.2, which contains 343 problems, each stating a single
   989: property and together defining a total of 618 datatypes and 1498 functions. Most
   990: of these are duplicates, since each problem (re\nobreakdash-)defines all of the
   991: datatypes and functions it involves.
   992: 
   993: TIP datatypes can have several ``constructors'' (introduction forms) and
``destructors'' (elimination forms; field accessors). For example, the type of
   995: lists from Figure~\ref{figure:list_theory} can be defined in the TIP format as
   996: follows:
   997: 
   998: \begin{samepage}
   999: \begin{verbatim}
  1000: (declare-datatypes
  1001:   (a)                       ;; Type parameter (element type)
  1002:   ((List                    ;; Type name
  1003:      (Nil)                  ;; Constructor (nullary)
  1004:      (Cons                  ;; Constructor (binary)
  1005:        (head a)             ;; Field name and type
  1006:        (tail (List a))))))  ;; Field name and type
  1007: \end{verbatim}
  1008: \end{samepage}
  1009: 
  1010: Our target languages (Haskell and Isabelle) differ in the way they handle
constructors and destructors, which complicates comparisons. To avoid this, we
generate a new function for each constructor (via $\eta$-expansion) and each
destructor (via pattern-matching), of the following forms:
  1014: 
  1015: \begin{samepage}
  1016: \begin{verbatim}
  1017: (define-fun
  1018:   (par (a)                   ;; Type parameter
  1019:     (constructor-Cons        ;; Function name
  1020:       ((x a) (xs (List a)))  ;; Argument names and types
  1021:       (List a)               ;; Return type
  1022:       (as                    ;; Type annotation
  1023:         (Cons x xs)          ;; Return value
  1024:         (List a)))))         ;; Return type
  1025: \end{verbatim}
  1026: \end{samepage}
  1027: 
  1028: \begin{samepage}
  1029: \begin{verbatim}
  1030: (define-fun
  1031:   (par (a)                        ;; Type parameter
  1032:     (destructor-head              ;; Function name
  1033:       ((xs (List a)))             ;; Argument name and type
  1034:       a                           ;; Return type
  1035:       (match xs                   ;; Pattern-match
  1036:         (case (Cons h t) h)))))   ;; Return relevant field
  1037: \end{verbatim}
  1038: \end{samepage}
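For intuition, these wrappers translate to Haskell along the following lines (a
sketch with illustrative names; the generated code actually uses the encoded
names described later in this section):

\begin{samepage}
\begin{verbatim}
data List a = Nil | Cons a (List a)

constructorCons :: a -> List a -> List a
constructorCons x xs = Cons x xs     -- eta-expanded constructor

destructorHead :: List a -> a
destructorHead (Cons x _) = x        -- partial: undefined on Nil
\end{verbatim}
\end{samepage}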
  1039: 
  1040: \begin{sloppypar}
  1041:   We rewrite the TIP properties (our ground truth) to reference these expanded
  1042:   forms instead of the raw constructors and destructors, and use these functions
  1043:   in our samples in lieu of the raw expressions. Note that these destructor
  1044:   wrappers are \emph{partial} functions (e.g. \texttt{destructor-head} and
  1045:   \texttt{destructor-tail} are undefined for the input \texttt{Nil}), which
  1046:   complicates their translation to proof assistants like Isabelle.
  1047: \end{sloppypar}
  1048: 
  1049: Another complication is TIP's ``native'' support for booleans and integers,
  1050: which allows numerals and symbols like \texttt{+} to appear without any
  1051: accompanying definition. To ensure consistency in the translations, we replace
  1052: all occurrences of such expressions with standard definitions written with the
  1053: ``user-level'' \texttt{declare-datatypes} and \texttt{define-fun}
mechanisms.\footnote{\texttt{Boolean} has \texttt{true} and \texttt{false}
  1055:   constructors; \texttt{Natural} has \texttt{zero} and \texttt{successor};
  1056:   \texttt{Integer} has unary \texttt{positive} and \texttt{negative}
  1057:   constructors taking \texttt{Natural}s, and a nullary \texttt{zero} for
  1058:   symmetry.}
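For concreteness, the replacement types described in the footnote correspond to
the following Haskell-notation sketch (the actual replacements are written with
\texttt{declare-datatypes}, and the constructor names here are illustrative):

\begin{samepage}
\begin{verbatim}
data Boolean  = BTrue | BFalse

data Natural  = Zero | Successor Natural

data Integer' = Positive Natural  -- unary, takes a Natural
              | Negative Natural  -- likewise for negatives
              | IntZero           -- nullary zero, for symmetry
\end{verbatim}
\end{samepage}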
  1059: 
  1060: When we add all of these generated types and functions to those in TIP, we get a
  1061: total of 3598 definitions. Removing $\alpha$-equivalent duplicates leaves 269,
and we choose to sample only from those 182 functions which are referenced by at
  1063: least one property (this removes ambiguity about which \emph{definitions} count
  1064: as interesting and which are just ``implementation details'' for other
  1065: definitions).
  1066: 
  1067: TIP comes with software to translate its definitions into Haskell and Isabelle
  1068: code, including comparison functions and random data generators suitable for
  1069: QuickCheck. We translate all 269 unique definitions into a single module/theory
  1070: which is imported on each run of the tools, although only those functions which
  1071: appear in the current sample are included in the signature and explored. We also
  1072: encode all names in hexadecimal to avoid problems with language-specific naming
  1073: rules, for example \texttt{add} becomes \texttt{global616464} (the prefix
  1074: distinguishes these from local variables and prevents names from beginning with
a digit). This ensures that the generated conjectures use the same names as the
ground truth, rather than some language-specific variant.
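The encoding itself is simple; the following is a minimal Haskell sketch
(\texttt{encodeName} is a hypothetical name, and we assume lower-case
hexadecimal with a fixed \texttt{global} prefix):

\begin{samepage}
\begin{verbatim}
import Data.Char   (ord)
import Text.Printf (printf)

-- Hex-encode a name, prefixed so it remains a valid identifier.
encodeName :: String -> String
encodeName = ("global" ++) . concatMap (printf "%02x" . ord)

-- encodeName "add" == "global616464"
\end{verbatim}
\end{samepage}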
  1077: 
  1078: \subsection{QuickSpec}
  1079: 
  1080: We benchmarked QuickSpec version 0.9.6, a tool written in Haskell for
  1081: conjecturing equations involving Haskell functions, described in more detail in
$\S$\ref{sec:existing-tools}. In order to benchmark QuickSpec thoroughly, we
need to automate some of the decisions which are normally left up to the user
(the sketch following the list below illustrates the result):
  1084: 
  1085: \begin{sloppypar}
  1086:   \begin{itemize}
  1087:   \item We must decide what variables to include. We choose to add three
  1088:     variables for each type that appears as a function argument, except for
  1089:     types which have no QuickCheck data generators.
  1090:   \item We must \emph{monomorphise} all types. For example, functions like
  1091:     \texttt{constructor-Cons} are \emph{polymorphic}: they build lists of any
  1092:     element type, but we need to pick a specific type in order to know which
  1093:     random value generator to use. We resolve this (arbitrarily) by picking
    \texttt{Integer}.\footnote{We pick \texttt{Integer} for variables of kind
  1095:       \texttt{*} (types); for kind \texttt{* -> *} (type constructors) we pick
  1096:       \texttt{[]} (Haskell's list type constructor). If these violate some
  1097:       type class constraint, we pick a suitable type non-deterministically from
      those in scope during compilation; if no suitable type is found, we give
      up and do not include that function.}
  \item Haskell functions are ``black boxes'', which QuickSpec cannot compare
  1101:     during its exploration process. They are also curried, always taking one
  1102:     argument but potentially returning another function. QuickSpec lets us
  1103:     assign an arity to each function in the signature, from 0 to 5, so we pick
  1104:     the highest that is type-correct, since this avoids a proliferation of
  1105:     incomparable, partially-applied functions.
  1106:   \end{itemize}
  1107: \end{sloppypar}
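To illustrate these choices, the signatures we generate look roughly like the
following hand-written sketch against the QuickSpec 0.9 API, with readable
names in place of our hex-encoded ones:

\begin{samepage}
\begin{verbatim}
import Test.QuickSpec

main :: IO ()
main = quickSpec
  [ -- three variables per argument type, monomorphised to Integer
    ["x", "y", "z"]    `vars` (undefined :: Integer)
  , ["xs", "ys", "zs"] `vars` (undefined :: [Integer])
    -- each function at its highest type-correct arity
  , "constructor-Nil"  `fun0` ([]  :: [Integer])
  , "constructor-Cons" `fun2` ((:) :: Integer -> [Integer] -> [Integer])
  ]
\end{verbatim}
\end{samepage}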
  1108: 
  1109: \begin{figure}
  1110:   \centering
  1111:   \includegraphics[scale=1]{Fig3}
  1112:   \caption{Running times of QuickSpec on theories sampled from TIP. Each point
  1113:     is a successful run (spread out horizontally to reduce overlaps). Red
  1114:     crosses show runs which timed out, which occurred for every size. Each size
  1115:     has 31 runs total, except for size 1 (14 runs) and size 2 (30 runs) due to
  1116:     sampling restrictions. Box plots show inter-quartile range, which reaches
  1117:     the timeout for sizes above 8. Medians are marked with a line and remain
  1118:     near zero until size 10, with those of size 13 and 20 timing out. Whiskers
  1119:     show 1.5$\times$~inter-quartile range}
  1120:   \label{figure:quickspec_runtimes}
  1121: \end{figure}
  1122: 
  1123: \begin{figure}
  1124:   \centering
  1125:   \includegraphics[scale=1]{Fig4}
  1126:   \caption{Precision and recall of successful QuickSpec runs (spread
  1127:     horizontally to reduce overlaps). Lines show the mean proportion for each
    sample size (Average of Ratios) and error bars show the sample standard
  1129:     deviation. Mean precision steadily reduces from around $0.5$ at size 1 to
  1130:     around $0.1$ for size 20, whilst mean recall remains roughly flat between
  1131:     $0.2$ and $0.5$}
  1132:   \label{figure:quickspec_precRec}
  1133: \end{figure}
  1134: 
  1135: The time taken to explore samples of different sizes is shown in
  1136: Figure~\ref{figure:quickspec_runtimes}. Failed runs are shown in red: all
failures were due to timing out, and these occurred for all sample sizes but were
  1138: more common for larger samples (over half the runs for sample sizes 13 and 20
  1139: timed out). Runs which succeeded mostly finished within 30 seconds, generating
  1140: few or no conjectures; those taking longer generated more conjectures, with the
  1141: most being 133 conjectures from a sample of size 19.
  1142: 
  1143: Precision and recall are shown in Figure~\ref{figure:quickspec_precRec}.
  1144: Since QuickSpec generates monotonically more conjectures as definitions are
  1145: added to a signature (assuming sufficient testing), its decreasing precision
cannot be due to finding fewer wanted conjectures at larger sizes. This is
  1147: supported by the relative flatness of the recall results. Rather, the number of
  1148: conjectures generated is increasing at a higher rate than the size of the ground
  1149: truth. These extra conjectures may involve those ``padding'' definitions in a
  1150: sample which don't contribute to its ground truth, or may be ``uninteresting''
  1151: relationships between the dependencies of different ground truth properties.
  1152: 
  1153: This indicates two potential improvements to the QuickSpec algorithm (as far as
  1154: this benchmark is concerned). The deluge of generated conjectures could be
filtered down to a more desirable subset by another post-processing step.
  1156: Alternatively, rather than exploring all of the given definitions together,
  1157: multiple smaller signatures could be selected from the input by predicting which
  1158: combinations are likely to lead to interesting conjectures; this would avoid
  1159: both the ``padding'' and the cross-dependency relationships. Both of these
  1160: methods could improve the precision, although care would be needed to avoid a
  1161: large reduction in recall. The latter option could also improve the running
  1162: time, since (based on Figure~\ref{figure:quickspec_runtimes}) multiple smaller
  1163: signatures may be faster to explore than a single large one. Such improvements
  1164: would also need to be ``generic'', to avoid over-fitting to this particular
  1165: benchmark.
  1166: 
QuickSpec's recall is limited by two factors. First, the algorithm is unable to
synthesise some properties, such as conditional equations, inequalities, terms
larger than the search depth and those containing anonymous functions. Second,
the congruence closure algorithm used as a post-processor may remove
``interesting'' results, for example when it has already found an
``uninteresting'' result which is more general.
  1173: 
  1174: \subsection{IsaCoSy}
  1175: 
  1176: We took IsaCoSy from version 2015.0.3 of the IsaPlanner project, and ran it
  1177: with the 2015 version of Isabelle. The following issues had to be overcome to
  1178: make our benchmark applicable to IsaCoSy:
  1179: 
  1180: \begin{itemize}
  1181: \item TIP includes a benchmark called \texttt{polyrec} whose types cannot be
  1182:   encoded in Isabelle. We strip out this type and the functions which depend on
  1183:   it before translating. It still appears in samples and contributes to the
  1184:   ground truth, which penalises IsaCoSy for being unable to explore such
  1185:   definitions.
  1186: \item When using a type in an IsaCoSy signature, that type's constructors will
  1187:   automatically be included in the exploration. Since those constructors will
  1188:   not appear in the ground truth (we use $\eta$-expanded wrappers instead, and
  1189:   even those may not be present in the current sample) this will unfairly reduce
  the calculated precision. To avoid this, we add a post-processing step which
  replaces all occurrences of a constructor with the corresponding wrapper, then
  discards any conjectures which involve functions other than those in the
  current sample (see the sketch after this list). This presumably results in
  more work for IsaCoSy, exploring
  1194:   constructors unnecessarily, but it at least does not bring down the quality
  1195:   measures.
  1196: \item Since Isabelle is designed for theorem proving rather than programming, it
  1197:   requires every definition to be accompanied by proofs of exhaustiveness and
  termination. These are difficult to generate automatically, and do not exist in
  1199:   the case of partial functions like destructor wrappers. Hence we use the
  1200:   ``quick and dirty'' option in Isabelle, which lets us skip these proofs with
  1201:   the \texttt{sorry} keyword.
  1202: \item Partial functions cause problems during exploration, since they can throw
  1203:   an ``undefined'' exception which causes IsaCoSy to abort. We avoid this by
  1204:   pre-populating IsaCoSy's constraint set with these undefined expressions
  1205:   (for example \texttt{(destructor-head constructor-Nil)}), hence preventing
  1206:   IsaCoSy from ever generating them.
  1207: \end{itemize}
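The constructor-rewriting step in the second item can be sketched as follows,
using a toy term representation (our actual implementation differs in details
such as the handling of variables):

\begin{samepage}
\begin{verbatim}
-- Toy representation of conjectured terms.
data Term = App String [Term]

symbols :: Term -> [String]
symbols (App f ts) = f : concatMap symbols ts

rename :: (String -> String) -> Term -> Term
rename r (App f ts) = App (r f) (map (rename r) ts)

-- Rewrite raw constructors to wrappers, then keep only
-- conjectures whose symbols occur in the current sample.
postProcess :: [String] -> [Term] -> [Term]
postProcess sample = filter inSample . map (rename toWrapper)
  where
    toWrapper "Nil"  = "constructor-Nil"
    toWrapper "Cons" = "constructor-Cons"
    toWrapper f      = f
    inSample t = all (`elem` sample) (symbols t)
\end{verbatim}
\end{samepage}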
  1208: 
  1209: \begin{figure}
  1210:   \centering
  1211:   \includegraphics[scale=1]{Fig5}
  1212:   \caption{Running times of IsaCoSy on theories sampled from TIP. Each point
  1213:     is a successful run (spread out horizontally to reduce overlaps). Red
  1214:     crosses are failed runs, caused by timeouts or out-of-memory. All runs of
  1215:     size 1 succeeded, whilst nothing succeeded above size 6. Each size has 31
  1216:     runs total, except for size 1 (14 runs) and size 2 (30 runs) due to sampling
  1217:     restrictions. Box plots show inter-quartile range, which grows with size,
  1218:     and lines mark the median time, which increases rapidly from around 100
  1219:     seconds at size 1 to timing out at size 4 and above. Whiskers show
  1220:     1.5$\times$~inter-quartile range}
  1221:   \label{figure:isacosy_runtimes}
  1222: \end{figure}
  1223: 
  1224: \begin{figure}
  1225:   \centering
  1226:   \includegraphics[scale=1]{Fig6}
  1227:   \caption{Precision and recall of all successful IsaCoSy runs (spread
  1228:     horizontally to reduce overlaps); there were no successes above size 6.
  1229:     Lines show the mean proportion for each sample size (Average of Ratios),
  1230:     which trends downwards for precision and appears flat for recall, although
    there are too few data points to be definitive. Error bars show the sample
  1232:     standard deviation}
  1233:   \label{figure:isacosy_precRec}
  1234: \end{figure}
  1235: 
  1236: The most striking result is how rapidly IsaCoSy's running time, shown in
  1237: Figure~\ref{figure:isacosy_runtimes}, increases with sample size: all runs above
  1238: size 6 failed, with most timing out and a few aborting early due to running out
of memory. As in our preliminary testing, this increase appears exponential,
  1240: so even large increases to the timeout would not produce many more successful
  1241: observations. A significant amount of IsaCoSy's time (around 50 seconds) was
  1242: spent loading the generated theory (containing all distinct datatypes and
  1243: functions from TIP); this overhead is unfortunate, but since it is constant
  1244: across all sample sizes it does not affect our conclusions.
  1245: 
With so few successful data points it is difficult to make strong claims about
the precision/recall quality of IsaCoSy's output, shown in
Figure~\ref{figure:isacosy_precRec}. The number of non-zero proportions makes
it clear that IsaCoSy is capable of discovering interesting conjectures, and
  1250: the graphs appear to follow the same shapes as those of QuickSpec (decreasing
  1251: precision, flat recall), although the error margins are too wide to be
  1252: definitive.
  1253: 
  1254: \subsection{Comparison}
  1255: 
  1256: We compare the running times of QuickSpec and IsaCoSy using our paired variant
  1257: of the Speedup-Test protocol, where each of our sample sizes is a separate
  1258: ``benchmark''. We used version 2 of the Speedup-Test \texttt{R} implementation,
  1259: which we patched to use the \emph{paired} form of the (one-sided) Wilcoxon
  1260: signed-rank test~\cite{wilcoxon1945individual}. We use Pratt's
  1261: method~\cite{pratt1959remarks} to handle ties, which occur when both tools time
  1262: out.
  1263: 
For each sample size (``benchmark'') the Speedup-Test protocol compares the
distributions of each tool's times using a Kolmogorov-Smirnov test. If these
differ significantly (with $\alpha < 0.05$), the signed-rank test is then
performed, provided more than 30 samples are available. This was the case for all
  1268: sample sizes except for 1 and 2 (due to the removed duplicates), and hence those
  1269: two speedups were not deemed statistically significant. For all other sample
  1270: sizes QuickSpec was found to be significantly faster with $\alpha = 0.05$,
  1271: putting the proportion of sizes with faster median times between 66.9\% and
  1272: 98.2\% with 95\% confidence. The magnitude of each speedup (median IsaCoSy time
  1273: divided by median QuickSpec time) is given in table~\ref{table:speedups}. We
  1274: would predict the speedup to grow for larger sample sizes, but the increasing
  1275: proportion of timeouts causes our estimates to become more conservative: by size
  1276: 20 most runs are timing out, and hence are tied.
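Explicitly, the speedup reported for each sample size $s$ in
table~\ref{table:speedups} is the ratio of median running times (timed-out runs
enter at the 300 second limit):
\[
  \text{speedup}(s) =
    \frac{\operatorname{median}\bigl(T_{\text{IsaCoSy}}(s)\bigr)}
         {\operatorname{median}\bigl(T_{\text{QuickSpec}}(s)\bigr)}
\]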
  1277: 
  1278: \begin{table}
  1279:   \centering
  1280:   \begin{tabular}{ |r|l| }
  1281:     \hline
  1282:     \bfseries Sample Size & \bfseries Speedup
  1283:     \csvreader[]{speedups.csv}{}
  1284:     {\\\hline\csvcoli&\csvcolii} \\
  1285:     \hline
  1286:   \end{tabular}
  1287:   \caption{Speedup from using QuickSpec compared to IsaCoSy, i.e. the median
  1288:     time taken by IsaCoSy divided by that of QuickSpec. QuickSpec was found to
  1289:     be significantly faster ($\alpha = 0.05$) for all sizes, except 1 and 2 due
  1290:     to too few samples. These are conservative estimates, since the 300 second
  1291:     time limit acts to reduce the measured time difference (e.g. sizes 13 and
  1292:     20, where the median for both tools timed out)}
  1293:   \label{table:speedups}
  1294: \end{table}
  1295: 
  1296: We compare the recall proportions by applying McNemar's test to only those
  1297: samples where both tools succeeded. We pooled these together from all sample
  1298: sizes to produce table~\ref{table:recall}, and found that the recall of
QuickSpec ($37\%$) is significantly higher than that of IsaCoSy ($25\%$), with a p-value
  1300: of 0.0026.
  1301: 
  1302: \begin{table}
  1303:   \centering
  1304:   \begin{tabular}{ |r|l|l| }
  1305:     \hline
  1306:     & \bfseries Found by IsaCoSy & \bfseries Not found by IsaCoSy \\
  1307:     \hline
  1308:     \bfseries Found by QuickSpec     & 28 & 19 \\
  1309:     \hline
  1310:     \bfseries Not found by QuickSpec & 4  & 75 \\
  1311:     \hline
  1312:   \end{tabular}
  1313:   \caption{Contingency table for ground truth properties, pooled from those
  1314:     samples where both QuickSpec and IsaCoSy finished successfully. The combined
  1315:     recall for QuickSpec is $37\%$ and for IsaCoSy is $25\%$}
  1316:   \label{table:recall}
  1317: \end{table}
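As a consistency check, the pooled recalls follow directly from
table~\ref{table:recall}: $(28+19)/126 \approx 37\%$ for QuickSpec and
$(28+4)/126 \approx 25\%$ for IsaCoSy. The reported p-value is likewise
consistent with the exact (binomial) form of McNemar's test applied to the
$19$ and $4$ discordant pairs:
\[
  p = 2\sum_{k=0}^{4}\binom{23}{k}\left(\frac{1}{2}\right)^{23} \approx 0.0026
\]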
  1318: 
  1319: McNemar's test is not applicable for comparing precision, since each tool
  1320: generated different sets of conjectures. Instead, we add up the number of
  1321: results which are ``interesting'' (appear in the ground truth) and those which
  1322: are ``uninteresting'' (do not appear in the ground truth), with the results
shown in table~\ref{table:precision}.
  1324: 
  1325: \begin{table}
  1326:   \centering
  1327:   \begin{tabular}{ |r|l|l| }
  1328:     \hline
  1329:               & \bfseries Interesting & \bfseries Uninteresting \\
  1330:     \hline
  1331:     \bfseries IsaCoSy   & 32          & 137      \\
  1332:     \hline
  1333:     \bfseries QuickSpec & 47          & 301      \\
  1334:     \hline
  1335:   \end{tabular}
  1336:   \caption{Interesting and uninteresting conjectures generated by QuickSpec and
    IsaCoSy, pooled from those samples where both tools finished successfully.
  1338:     The combined precision for QuickSpec is $14\%$ and for IsaCoSy is $19\%$}
  1339:   \label{table:precision}
  1340: \end{table}
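For reference, the pooled precisions in table~\ref{table:precision} are
computed row-wise: $47/(47+301) \approx 14\%$ for QuickSpec and
$32/(32+137) \approx 19\%$ for IsaCoSy.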
  1341: 
We tested these totals for independence using Boschloo's test and found a p-value
  1343: of 0.111, which exceeds our (conventional) significance threshold of
  1344: $\alpha = 0.05$; hence we do not consider the difference between these
  1345: proportions (14\% for QuickSpec, 19\% for IsaCoSy) to be significant.
  1346: 
  1347: Note that precision is the only comparison where IsaCoSy scored higher, and even
  1348: that was not found to be significant. Also, precision on its own is not the
  1349: whole story: it can be increased at the expense of recall by making a tool more
conservative. IsaCoSy may already be overly conservative compared to QuickSpec,
  1351: given that its recall is significantly lower.
  1352: 
  1353: More importantly, the poor time and memory usage of IsaCoSy meant that very few
  1354: samples finished successfully. If we include failed runs in our tables, treating
  1355: them as if they succeeded with no output, the resulting statistics lean
  1356: overwhelmingly in favour of QuickSpec simply because it generated so much more
  1357: than IsaCoSy in the available time.
  1358: 
  1359: \section{Discussion}
  1360: \label{sec:discussion}
  1361: 
  1362: Fundamentally, when designing any mathematical reasoning system we must decide
  1363: on, and formalise, what counts as ``the good'' in mathematics.  Obvious metrics
  1364: such as ``true'' or ``provable'' include trivial tautologies, while at the same
  1365: time failing to capture the ``almost true'', which can be a valuable trigger for
  1366: theory change, as demonstrated by Lakatos in his case studies of mathematical
  1367: development~\cite{lakatos}. ``Beautiful'' is another -- albeit vague -- commonly
proposed metric. Neuroscientists such as Zeki \etal{} have attempted to shed
light on this by testing whether mathematicians' experiences of abstract beauty
correlate with the same brain activity as experiences of sensory
beauty~\cite{10.3389/fnhum.2014.00068}. Quality criteria from computer scientists like
  1372: Colton (such as those in~\cite{colton2000notion}) are based largely on
  1373: ``intuition'', plausibility arguments about why a metric would be important, and
  1374: previous use of such metrics (in the case of ``surprisingness'', from a single
  1375: quote from a mathematician). Opinions from mathematicians themselves include
  1376: Gowers' suggestion that we can identify features which are commonly associated
with good proofs~\cite{gowers2000two}, Erd\H{o}s's famous idea of ``The
  1378: Book''~\cite{aigner2010proofs} as well as McCasland's personal evaluation of the
  1379: interestingness of MATHsAiD's output~\cite{roy} (of which he was the main system
  1380: developer).
  1381: 
  1382: All of these approaches rest upon the assumption that it makes sense to speak of
  1383: ``the good'' in mathematics. However, empirical psychological studies
  1384: call into question such assumptions: for example, work by Inglis and colleagues
  1385: has shown that there is not even a single standard of \emph{validity} among
  1386: contemporary mathematicians~\cite{inglis2013mathematicians}.
  1387: 
  1388: Whilst these difficulties are real and important, we cannot ignore the fact that
  1389: mathematics is nevertheless being practised around the world; and similarly that
  1390: researchers have forged ahead to develop a variety of tools for automated
  1391: exploratory mathematics. If we wish to see these tools head in a useful,
fruitful direction, then \emph{some} method is needed to compare their
  1393: approximate ``quality'' in a concrete, measurable way.
  1394: 
  1395: The key to our benchmarking methodology is to side-step much of this
  1396: philosophical quagmire using the simplifying assumption that theorem proving
  1397: problem sets are a good proxy for desirable input/output behaviour of MTE tools.
  1398: As an extension of existing precision/recall analysis, this should hopefully not
prove too controversial, although we acknowledge that there are compelling
reasons to dispute it.
  1401: 
  1402: We do not claim that our use of corpora as a ground truth exactly captures all
  1403: interesting conjectures of their definitions, or that those definitions exactly
  1404: represent all theories we may wish to explore. Rather, we consider our approach
to offer a Pareto-optimal balance between theoretical rigour and experimental
practicality, at least in the short term. Furthermore, since research is already
ongoing in these areas, we hope to at least improve on existing evaluation
  1408: practices and offer a common ground for future endeavours.
  1409: 
  1410: One notable weakness is that our methodology does not allow negative examples,
such as particularly dull properties that tools should never produce. Everything
outside the ground truth set is treated equally, whether it is genuinely
uninteresting or was only left out due to oversight. In particular, this limits
the use of our approach to domains where existing human knowledge surpasses that
  1415: discovered by the machine. Any \emph{truly novel} insights discovered during
  1416: testing will not, by definition, appear in any existing corpus, and we would in
  1417: fact \emph{penalise} tools for such output. We do not believe this to be a
  1418: realistic problem for the time being, as long as evaluation is limited to
  1419: well-studied domains and results can be generalised to real areas of
  1420: application. This does emphasise the continued importance of testing these tools
  1421: with real human users, rather than relying solely on artificial benchmarks.
  1422: 
  1423: Another practical limitation of our benchmarking approach is that it only
  1424: applies to tools which act in ``batch mode'', i.e. those which choose to halt
  1425: after emitting some output. Whilst all of the systems we have encountered are of
  1426: this form, some (such as IsaCoSy, QuickSpec 2 and Speculate) could conceivably
  1427: be run without a depth limit, and form part of the ``scientist assistant'' role
  1428: which Lenat envisaged for AM, or McCarthy's earlier ``advice
  1429: taker''~\cite{McCarthy_Programs59}. Analogous benchmarks could be designed for
  1430: such long-running, potentially interactive programs, but that is beyond the
  1431: scope of this project.
  1432: 
  1433: \section{Related Work}
  1434: \label{sec:related-work}
  1435: 
  1436: The automation of mathematical tasks has been pursued since at least the time of
  1437: mechanical calculators like the Pascaline~\cite{d'ocagne}. A recurring theme in
  1438: these efforts is the separation between those undertaken by mathematicians like
  1439: Pascal and Babbage~\cite{bowden}, and those of engineers such as
  1440: M\"uller~\cite[p. 65]{lindgren}. This pattern continues today, with the tasks we
  1441: are concerned with (automatically constructing and evaluating concepts,
  1442: conjectures, theorems, axioms, examples, etc.) being divided into two main
  1443: fields: Mathematical Theory Exploration (MTE)~\cite{buchberger:06} (also
  1444: sometimes prefaced with ``Computer-Aided'', ``Automated'' or
  1445: ``Algorithm-Supported''), which is championed by mathematicians such as
  1446: Buchberger~\cite{buchberger}; and Automated Theory Formation
  1447: (ATF)~\cite{lenat:77,colton:book}, pursued by AI researchers including Lenat.
  1448: Other related terms include ``Automated Mathematical
  1449: Discovery''~\cite{epstein:91,colton2000notion,esarm2008},
  1450: ``Concept Formation in Discovery Systems''~\cite{haase}, and
  1451: ``Automated Theorem Discovery''~\cite{roy}.
  1452: 
  1453: Such a plethora of terminology can mask similarities and shared goals between
  1454: these fields. Even notable historical differences, such as the emphasis of MTE
  1455: on user-interaction and mathematical domains, in contrast to the full automation
  1456: and more general applications targeted by ATF, are disappearing in recent
  1457: implementations.
  1458: 
  1459: An important historical implementation of ATF is Lenat's AM (Automated
  1460: Mathematician) system. Unlike prior work, such as
  1461: Meta-Dendral~\cite{buchanan:75} and those described in~\cite{winston}, AM aims
to be a general-purpose mathematical discovery system, designed to both
  1463: construct new concepts and conjecture relationships between them. AM is a
  1464: rule-based system which represents knowledge using a frame-like scheme, enlarges
  1465: its knowledge base via a collection of heuristic rules, and controls the firing
  1466: of these rules via an agenda mechanism. Evaluation of AM considered generality
  1467: (performance in new domains) and how finely-tuned various aspects of the program
are (the agenda, the interaction of the heuristics, etc.). Most of this
  1469: evaluation was qualitative, and has subsequently been
  1470: criticised~\cite[chap.~13]{colton:book}. In their case study in methodology,
  1471: Ritchie and Hanna found a large discrepancy between the theoretical claims made
  1472: of AM and the implemented program~\cite{ritchie1984case}; for example, AM ``invented''
  1473: natural numbers from sets, but did so using a heuristic specifically designed to
  1474: make this connection.
  1475: 
  1476: The prototypical implementation of MTE is the Theorema system of Buchberger and
  1477: colleagues~\cite{buchberger,buchberger2016theorema}, which also places a strong
  1478: emphasis on user interface and output presentation. Theory exploration in the
  1479: Theorema system involves the user formalising their definitions in a consistent,
layered approach, such that reasoning algorithms can exploit this structure in
  1481: subsequent proofs, calculations, etc. The potential of this strategy was
  1482: evaluated by illustrating the automated synthesis of Buchberger's own Gr\"obner
  1483: bases algorithm~\cite{buchberger:04}.
  1484: 
  1485: A similar ``layering'' approach is found in the IsaScheme system of
  1486: Monta{\~n}o-Rivas \etal{}~\cite{Montano-Rivas.McCasland.Dixon.ea:2012}, which
  1487: has also been quantitatively compared against IsaCoSy and HipSpec using
  1488: precision/recall analysis~\cite{claessen2013automating}. The name comes from its
  1489: embedding in the Isabelle proof assistant and its use of ``schemes'':
  1490: higher-order formulae which can be used to generate new concepts and
  1491: conjectures. Variables within a scheme are instantiated automatically and this
  1492: drives the invention process. For example, the concept of ``repetition'' can be
  1493: encoded as a scheme, and instantiated with existing encodings of zero, successor
  1494: and addition to produce a definition of multiplication. The same scheme can be
  1495: instantiated with this new multiplication function to produce exponentiation.
  1496: 
  1497: IsaCoSy and QuickSpec (the conjecture generation component of HipSpec) are
  1498: described in more detail in $\S$\ref{sec:existing-tools}, since these are the
  1499: tools we chose to evaluate and compare for $\S$\ref{sec:application}. QuickSpec
  1500: has since evolved to version 2~\cite{smallbone2017quick}, which replaces the
  1501: distinct enumeration and testing steps with a single, iterative algorithm
  1502: similar to that of IsaCoSy. Generated conjectures are fed into a Knuth-Bendix
  1503: completion algorithm to form a corresponding set of rewrite rules. As
  1504: expressions are enumerated, they are simplified using these rules and discarded
  1505: if equal to a known expression. If not, QuickCheck tests whether the new
  1506: expression can be distinguished from the known expressions through random
  1507: testing: those which can are added to the set of known expressions. Those which
  1508: cannot be distinguished are conjectured to be equal, and the rewrite rules are
  1509: updated.
  1510: 
  1511: QuickSpec has also inspired another MTE tool for Haskell called
  1512: Speculate~\cite{braquehais2017speculate}, which operates in a similar way but
  1513: also makes use of the laws of total orders and Boolean algebra to conjecture
  1514: \emph{in}equalities and conditional relations between expressions.
  1515: 
  1516: Another notable MTE implementation, distinct from those based in Isabelle and
  1517: Haskell, is the MATHsAiD project (Mechanically Ascertaining Theorems from
  1518: Hypotheses, Axioms and Definitions)~\cite{roy}. Unlike the tools above, which
  1519: generate \emph{conjectures} that may later be sent to automated provers,
  1520: MATHsAiD directly generates \emph{theorems}, by making logically valid
  1521: inferences from a given set of axioms and definitions. Evaluation of the
  1522: interestingness of these theorems was performed qualitatively by the system's
  1523: developer, which highlights how these tools could benefit from the availability
  1524: of an objective, repeatable, quantitative method of evaluation and comparison
  1525: such as ours.
  1526: 
  1527: Whilst there are many reasonably objective benchmarks for mathematical tasks
  1528: such as automated theorem proving, the precision/recall analysis shown
  1529: in~\cite{claessen2013automating}, and described further in $\S$\ref{sec:te}, is
  1530: the only quantitative comparison of these recent MTE tools we have found in the
  1531: literature. Our work is essentially an extension of this approach, to a larger
  1532: and more diverse set of examples. The suitability of the TIP theorem proving
  1533: benchmark for our purposes, detailed in $\S$\ref{sec:tip}, is not coincidental,
  1534: since its developers are also those of the IsaCoSy and QuickSpec tools we have
  1535: tested. This goes some way to ensuring that our demonstration is a faithful
  1536: representation of the task these tools were intended to solve; our independent
  1537: repurposing of this problem set, in a way it was not designed for, reduces the
  1538: risk that the benchmark is tailor-made for these tools (or that the tools
  1539: over-fit to these particular problems).
  1540: 
  1541: \section{Conclusion}
  1542: \label{sec:conclusion}
  1543: 
  1544: We propose a general benchmarking methodology for measuring, summarising and
comparing the performance of diverse approaches to the conjecture generation
  1546: problem, whilst avoiding some of the philosophical complications and ambiguities
  1547: of the field. This methodology can be tailored for specific interests, such as
  1548: the choice of problem set and the focus of the resulting analysis.
  1549: 
  1550: We have also presented an example application for the domain of higher-order
  1551: inductive theories (which is immediately applicable to problems in functional
  1552: programming). Using the TIP problem set as a corpus, we have evaluated the
performance of the QuickSpec and IsaCoSy tools, demonstrating that QuickSpec is
both significantly faster than IsaCoSy and outputs more desirable conjectures,
although it may also generate more \emph{undesirable} output. We
  1556: found that they both fail, due to time and memory constraints, for a large
  1557: proportion of inputs; that this gets worse as input size increases; and that
  1558: IsaCoSy fails more often than QuickSpec. Based on these results we proposed two
  1559: possible directions to improve the QuickSpec algorithm: a post-processing filter
  1560: to remove more ``uninteresting'' conjectures and a pre-processing filter to
  1561: ``focus'' on promising subsets of the input.
  1562: 
  1563: Other promising directions for future work include the evaluation of other MTE
  1564: tools, the use of other corpora more suited to different problem domains, and
  1565: the extension of existing corpora with new definitions and properties (which
  1566: would also be of benefit to the original ATP benchmarks).
  1567: 
  1568: We believe that a standard approach to benchmarking and comparison such as ours
  1569: will ease the burden on researchers wanting to evaluate different potential
  1570: approaches to this task, and provide a common goal to pursue in the short term.
  1571: 
  1572: ``Solving'' this benchmark would not solve the problem of conjecture generation
  1573: in general, so more ambitious goals must be set in the future. For now, we
  1574: believe that our approach provides a compelling challenge and will encourage
  1575: healthy competition to improve the field.
  1576: 
  1577: \bibliographystyle{plain}
  1578: \bibliography{./Bibtex}
  1579: 
  1580: \end{document}
