writing: 6b27783d098da17a77bd74c9c5dd93e015dabcd3
1: % Derived from the template file for the LaTeX package SVJour3
2: % for Springer journals. Springer Heidelberg 2010/09/16
3: %
4: % This template includes a few options for different layouts and
5: % content for various journals. Please consult a previous issue of
6: % your journal as needed.
7: %
8: % First comes an example EPS file -- just ignore it and
9: % proceed on the \documentclass line
10: % your LaTeX will extract the file if required
11: \begin{filecontents*}{example.eps}
12: %!PS-Adobe-3.0 EPSF-3.0
13: %%BoundingBox: 19 19 221 221
14: %%CreationDate: Mon Sep 29 1997
15: %%Creator: programmed by hand (JK)
16: %%EndComments
17: gsave
18: newpath
19: 20 20 moveto
20: 20 220 lineto
21: 220 220 lineto
22: 220 20 lineto
23: closepath
24: 2 setlinewidth
25: gsave
26: .4 setgray fill
27: grestore
28: stroke
29: grestore
30: \end{filecontents*}
31:
32: %INCLUDE STYLES HERE
33:
34: %INCLUDE SPEEDUP CSV HERE
35:
36: %INCLUDE BIBTEX HERE
37:
38: \RequirePackage{fix-cm}
39: %
40: %\documentclass{svjour3} % onecolumn (standard format)
41: \documentclass[smallcondensed]{svjour3} % onecolumn (ditto)
42: %\documentclass[smallextended]{svjour3} % onecolumn (second format)
43: %\documentclass[twocolumn]{svjour3} % twocolumn
44: %
45: \smartqed % flush right qed marks, e.g. at end of proof
46: %
47: \usepackage{graphicx}
48: %
49: % \usepackage{mathptmx} % use Times fonts if available on your TeX system
50: %
51: % insert here the call for the packages your document requires
52: \usepackage{csvsimple}
53: \usepackage[inline, shortlabels]{enumitem}
54: \usepackage{epstopdf}
55: \usepackage{hyperref}
56: \usepackage{mathtools}
57: \usepackage{tikz}
58: \usetikzlibrary{arrows, calc, decorations.pathreplacing, positioning,
59: shapes.geometric}
60: % etc.
61:
62: % please place your own definitions here and don't use \def but
63: % \newcommand{}{}
64:
65: \newcommand{\etal}{{\em et al.}}
66:
67: % For example theory of lists
68: \newcommand{\Zero}{\text{Z}}
69: \newcommand{\Succ}{\text{S}}
70:
71: \newcommand{\List}{\text{List}}
72: \newcommand{\ListA}{\text{List} \ a}
73: \newcommand{\Nil}{\text{Nil}}
74: \newcommand{\Cons}{\text{Cons}}
75: \newcommand{\Head}{\text{head}}
76: \newcommand{\Tail}{\text{tail}}
77: \newcommand{\Append}{\text{append}}
78: \newcommand{\Reverse}{\text{reverse}}
79: \newcommand{\Length}{\text{length}}
80: \newcommand{\Map}{\text{map}}
81: \newcommand{\Foldl}{\text{foldl}}
82: \newcommand{\Foldr}{\text{foldr}}
83:
84: % Insert the name of "your journal" with
85: \journalname{Journal of Automated Reasoning}
86:
87: \begin{document}
88:
89: \title{Quantitative Benchmarking for Automatically Generated Conjectures%\thanks
90: % {Grants or other notes about the article that should go on the front page
91: % should be placed here. General acknowledgments should be placed at the end of
92: % the article.}
93: }
94: % \subtitle{Do you have a subtitle?\\ If so, write it here}
95:
96: % \titlerunning{Short form of title} % if too long for running head
97:
98: \author{Chris Warburton \and
99: Alison Pease \and
100: Jianguo Zhang}
101:
102: % \authorrunning{Short form of author list} % if too long for running head
103:
104: \institute{C. Warburton \at
105: University of Dundee \\
106: \email{c.m.warburton@dundee.ac.uk} \\
107: ORCID: 0000-0002-4878-6319 \\
108: Phone: +44 (0)1382 386968
109: % \emph{Present address:} of F. Author % if needed
110: \and
111: A. Pease \at
112: University of Dundee \\
113: \email{a.pease@dundee.ac.uk} \\
114: ORCID: 0000-0003-1856-9599
115: \and
116: J. Zhang \at
117: University of Dundee \\
118: \email{j.n.zhang@dundee.ac.uk} \\
119: ORCID: 0000-0001-9317-0268
120: }
121:
122: \date{Received: date / Accepted: date}
123: % The correct dates will be entered by the editor
124:
125: \maketitle
126:
127: \begin{abstract}
128: We propose a benchmarking methodology to evaluate the efficiency and quality
129: of \emph{conjecture generation} by automated tools for \emph{mathematical
130: theory exploration}. Our approach uses widely available theorem proving
131: tasks as a \emph{ground-truth} corpus, and we demonstrate its use on the
132: QuickSpec and IsaCoSy tools. We found that both may fail, even for small
133: inputs, but QuickSpec usually finishes in significantly less time than IsaCoSy
134: and produces significantly more ``interesting'' output. By defining a standard
135: benchmark we provide a measure of progress for the field, encourage
136: innovation through healthy competition and allow direct comparisons to be made
137: between the disparate approaches currently being pursued for this task.
138: \keywords{Theory Formation \and Theorem Proving \and Lemma Discovery \and Conjecture Generation \and Benchmarking}
139: % \PACS{PACS code1 \and PACS code2 \and more}
140: % \subclass{MSC code1 \and MSC code2 \and more}
141: \end{abstract}
142:
143: \begin{acknowledgements}
144: We are grateful to the authors of the systems we have used in our experiments,
145: especially for help in obtaining and configuring their tools. We would
146: especially like to thank Koen Claessen, Lucas Dixon, Moa Johansson, Dan
147: Ros\'{e}n and Nicholas Smallbone for useful discussions and help in adapting
148: their software to our purposes.
149:
150: This work was supported by the Engineering and Physical Sciences Research
151: Council grant EP/P017320/1 and studentship 1526504.
152: \end{acknowledgements}
153:
154: \pagebreak
155:
156: \section{Introduction}
157: \label{intro}
158:
159: \emph{Conjecture generation} is the open-ended task of conjecturing properties
160: of a given logical theory which are somehow ``interesting'', and is
161: studied as a sub-field of \emph{mathematical theory exploration} (MTE). This has
162: applications wherever we find uses for formal methods: in proof assistants and
163: their libraries, in mathematics education and research, and in the specification
164: and verification of software. Although formal methods, and automated reasoning
165: more generally, have been advocated in mathematics (notably by
166: Voevodsky~\cite{voevodsky2010univalent} and Gowers~\cite{ganesalingam2013fully})
167: and software engineering (for example via dependently-typed
168: programming~\cite{McKinna:2006}), they have a reputation for being difficult and
169: expensive, which has historically limited their use to high-assurance domains
170: such as aerospace and microprocessor design.
171:
172: One path to reducing these barriers is to increase the level of automation.
173: Software verification tools have been shown to benefit from the addition of a
174: conjecture generation step for finding ``background lemmas'', which are useful
175: when proving user-provided statements~\cite{Claessen.Johansson.Rosen.ea:2013}.
176: Another path is to tailor the division of labour between humans and machines
177: such that each is playing to their strengths, which has been referred to as
178: ``centaur teams''~\cite{harari2017reboot,davenport2015beyond}. Machines can
179: systematically search a much larger space of possible relationships than humans,
180: potentially bringing to light novel and surprising connections, whilst allowing
181: the user to make the ultimate judgement over which of the most promising results
182: are worth investigating further.
183:
184: Similar tasks are also studied in the domain of Computational \emph{Scientific}
185: Discovery~\cite{king2004functional,Williams20141289,schmidt2009distilling}, for
186: example in the search for new drugs. The major difference between scientific and
187: mathematical discovery is that inductive reasoning and experimental testing are,
188: in principle, more important for the former, although we note that they are also
189: core principles in the mathematical tools we have investigated (in particular
190: for ensuring the \emph{plausibility} of conjectures).
191:
192: We limit ourselves to mathematical applications, but even here it is difficult
193: to measure and compare the success of approaches and tools. This is partially
194: due to their diversity, but also because of the inherent ambiguity of the task
195: (what counts as ``interesting''?), the different goals emphasised by their
196: designers and the variety of evaluation methods employed. We attempt to solve
197: this discrepancy, at least for the foreseeable future, by defining a standard,
198: unambiguous benchmarking approach with which to compare the conjecture
199: generation of MTE tools. Our contributions include:
200:
201: \begin{itemize}
202: \item A general methodology for benchmarking conjecture generation.
203: \item Resolving the issue of ``interestingness'' through the use of
204: theorem-proving benchmarks as a ground-truth.
205: \item A specific instantiation of this methodology, using the Tons of Inductive
206: Problems (TIP) benchmark as a corpus.
207: \item Automated tooling to perform this benchmarking.
208: \item Application of our methodology to the QuickSpec and IsaCoSy MTE tools,
209: and a comparison and discussion of the results.
210: \end{itemize}
211:
212: We describe the conjecture generation problem in more detail, along with the
213: difficulty of comparing existing solutions, in $\S$\ref{sec:background}. We
214: explain our proposal for a more general benchmark in $\S$\ref{sec:proposal} and
215: demonstrate its application to existing MTE tools in $\S$\ref{sec:application}.
216: Issues facing our approach, and the field in general, are discussed in
217: $\S$\ref{sec:discussion}; related work is surveyed in
218: $\S$\ref{sec:related-work}, including the diverse terminology found in the
219: literature; and concluding remarks are given in $\S$\ref{sec:conclusion}.
220:
221: \begin{sloppypar}
222: Our benchmark implementation is available at
223: \url{http://chriswarbo.net/projects/theory-exploration} and
224: \url{https://github.com/warbo}. All processes, including benchmark generation,
225: experimental runs and generation of this paper, are specified using the
226: Nix~\cite{dolstra2004nix} system to aid reproducibility.
227: \end{sloppypar}
228:
229: \section{Background}
230: \label{sec:background}
231:
232: \subsection{Motivation}
233: \label{sec:motivation}
234:
235: \begin{figure}
236: \begin{equation*}
237: \begin{split}
238: \forall a. \Nil &: \ListA \\
239: \forall a. \Cons &: a \rightarrow \ListA \rightarrow \ListA \\
240: \Head(\Cons(x, xs)) &= x \\
241: \Tail(\Cons(x, xs)) &= xs \\
242: \Append(\Nil, ys) &= ys \\
243: \Append(\Cons(x, xs), ys) &= \Cons(x, \Append(xs, ys)) \\
244: \Reverse(\Nil) &= \Nil \\
245: \Reverse(\Cons(x, xs)) &= \Append(\Reverse(xs), \Cons(x, \Nil)) \\
246: \Length(\Nil) &= \Zero \\
247: \Length(\Cons(x, xs)) &= \Succ (\Length(xs)) \\
248: \Map(f, \Nil) &= \Nil \\
249: \Map(f, \Cons(x, xs)) &= \Cons(f(x), \Map(f, xs)) \\
250: \Foldl(f, x, \Nil) &= x \\
251: \Foldl(f, x, \Cons(y, ys)) &= \Foldl(f, f(x, y), ys) \\
252: \Foldr(f, \Nil, y) &= y \\
253: \Foldr(f, \Cons(x, xs), y) &= f(x, \Foldr(f, xs, y))
254: \end{split}
255: \end{equation*}
256: \caption{A simple theory defining a $\List$ type and some associated
257: operations, taken from~\cite{Johansson.Dixon.Bundy:conjecture-generation}.
258: $\Zero$ and $\Succ$ are from a Peano encoding of the natural numbers}
259: \label{figure:list_theory}
260: \end{figure}
261:
262: Given a logical theory, such as the theory of lists shown in
263: Figure~\ref{figure:list_theory}, we may want to find properties (e.g.
264: invariants) describing its behaviour. This could be for mathematical curiosity,
265: or due to the theory's importance in some domain. In particular, when dealing
266: with definitions in a software library, we may regard a set of such properties
267: as the code's \emph{specification}. Even those properties which only hold by
268: coincidence, rather than by design, may still be useful for \emph{optimising}
269: uses of this library, by rewriting expressions into equivalent forms which
270: require less time, memory, network usage, etc. For example, our theory of lists
271: turns out to satisfy the following invariant:
272:
273: \begin{equation} \label{eq:mapreduce}
274: \forall f. \forall xs. \forall ys.
275: \Map(f, \Append(xs, ys)) = \Append(\Map(f, xs), \Map(f, ys))
276: \end{equation}
277:
278: Whilst these two expressions produce equivalent results, the form on the right
279: can execute both calls to $\Map$ in parallel (in a pure functional language, at
280: least); a fact which is exploited by the ``map/reduce'' parallel programming
281: paradigm.
282:
283: Such use cases require the ability to \emph{discover} true properties of some
284: particular definitions; this requires not only \emph{proving} their correctness,
285: but also \emph{posing} them as conjectures in the first place. This is a hard
286: problem in general, but presents opportunities for automation due to the
287: precise, symbolic nature of the domain. These processes can also feed into each
288: other, since proven properties can be used as lemmas for subsequent proofs. As
289: an example, proving Equation~\ref{eq:mapreduce} may require ``background
290: knowledge'', such as the property $\forall x. \Append(x, \Nil) = x$, which may
291: be tedious to produce by hand but possible to discover automatically.
292:
293: \subsection{Automating Construction and Exploration of Theories}
294: \label{sec:te}
295:
296: The task of automatically proving or disproving a \emph{given} conjecture has
297: been studied extensively in the field of Automated Theorem Proving (ATP). Less
298: attention has been paid to \emph{generating} such conjectures automatically,
299: either from a given set of definitions (the task we will focus on) or where the
300: definitions are also generated. This research has been pursued under the names
301: of Automated Theory Formation (ATF) and Mathematical Theory Exploration (MTE);
302: we will use the MTE terminology, and discuss their relationship further in
303: $\S$\ref{sec:related-work}.
304:
305: A major challenge when generating conjectures is choosing how to narrow down the
306: resulting set to only those deemed ``interesting'', since this is an imprecise
307: term with many different interpretations. For example, all existing approaches
308: agree that simple tautologies are ``uninteresting'', but differ when it comes to
309: more complex statements. Colton \etal{} surveyed tools for theory formation
310: and exploration, and their associated notions of ``interestingness'' for
311: concepts and conjectures~\cite{colton2000notion}. Six qualities were identified,
312: which are applied to conjectured properties as follows:
313:
314: {\bf Empirical plausibility} checks whether a property holds across some
315: specific examples. This is especially useful for avoiding falsehoods, without
316: resorting to a deductive proof search.
317:
318: {\bf Novelty} depends on whether the property, or one isomorphic or more
319: general, has already been seen.
320:
321: {\bf Surprisingness} of a property is whether or not it is ``obvious'', for
322: example if it is an instance of a tautology.
323:
324: {\bf Applicability} depends on the number of models in which a
325: property holds. High applicability makes a property more interesting, but so
326: does zero applicability (i.e. non-existence).
327:
328: {\bf Comprehensibility} depends on the syntactic complexity of the property's
329: statement. Simpler statements are considered more interesting (which favours
330: tools adopting a small-to-large search order).
331:
332: {\bf Utility} is the relevance or usefulness of a property to the user's
333: particular task. For example, if we want to find properties which are useful for
334: optimisation, like Equation~\ref{eq:mapreduce}, then utility would include
335: whether the property justifies some rewrite rule, the difference in resource
336: usage of the expressions involved, and how common those expressions are in real
337: usage.
338:
339: Attempting to \emph{measure} such qualities is difficult, and having too many
340: measures complicates comparisons. System developers have employed a more
341: practical alternative in their evaluations, which is to perform
342: \emph{precision/recall} analysis against a \emph{ground-truth}. This requires
343: choosing a set of definitions (such as those in Figure~\ref{figure:list_theory})
344: and a set of properties (the ground truth) which represent the ``ideal'' result
345: of conjecture generation for those definitions (e.g. this might include
346: Equation~\ref{eq:mapreduce}). To analyse the quality of a tool's conjectures, we
347: run it on these chosen definitions and compare its output to the ground truth:
348:
349: \begin{itemize}
350: \item \emph{Precision} is the proportion of a tool's output which appears in
351: the ground truth (the ratio of true positives to all positives). This
352: penalises overly-liberal tools which output a large number of properties in
353: the hope that some turn out to be ``good''.
354: \item \emph{Recall} is the proportion of the ground truth which appears in the
355: tool's output (the ratio of true positives to actual positives). This
356: penalises overly-conservative tools which generate very few properties as a
357: way to avoid ``bad'' ones.
358: \end{itemize}
359:
360: To score 100\% on precision and recall, all of the properties which appear in
361: the ground truth must have been conjectured, and nothing else. This gives us a
362: simple method to evaluate and compare tools without requiring a general solution
363: to the question of what is ``interesting''; although we must still decide what
364: to put in the ground truth set, and what to leave out, for each measured set of
365: definitions.
366:
367: \begin{sloppypar}
368: The precision and recall of three MTE tools are compared
369: in~\cite{claessen2013automating}:
370: IsaCoSy~\cite{Johansson.Dixon.Bundy:conjecture-generation},
371: IsaScheme~\cite{Montano-Rivas.McCasland.Dixon.ea:2012} and
372: HipSpec~\cite{Claessen_hipspec:automating}.\footnote{Unfortunately this
373: comparison only reports prior results for each tool, rather than reproducing
374: them. This makes the numbers less comparable, since each tool was tested
375: with slightly different definitions (e.g. whether or not the exponential and
376: maximum functions were included).} The definitions and ground truths for all
377: three were taken from the standard library of the Isabelle proof assistant.
378: The fact that the library authors expended the effort of stating, proving and
379: bundling these properties with every copy of their software is a good
380: indication that they are interesting. Two sets of definitions were used: a
381: theory of natural numbers; and a theory of lists (which we show in
382: Figure~\ref{figure:list_theory} on page~\pageref{figure:list_theory}). With
383: only two datapoints, each containing only a few definitions and properties (4
384: functions and 12 properties, in the natural number example), it is difficult
385: to draw general conclusions; yet despite these limitations we believe that
386: this is a viable method to performing meaningful, objective analyses and
387: comparisons for such tools. It is only with larger data sets that we can
388: generalise the measured performance to different, especially \emph{novel},
389: domains (which, after all, is the purpose of MTE tools). In addition, we might
390: be concerned that such ``popular'' theories as the natural numbers may be
391: unrepresentative of typical usage. For example, we would not expect the
392: library functions in a typical verification problem to be as mathematically
393: rich as number theory.
394: \end{sloppypar}
395:
396: \subsection{Existing Tools}
397: \label{sec:existing-tools}
398:
399: Since our benchmarking methodology builds on the precision/recall analysis
400: described in $\S$\ref{sec:te}, we decided to use two of the same tools, IsaCoSy
401: and QuickSpec (the conjecture generating component of HipSpec), as our
402: demonstration.
403:
404: IsaCoSy (Isabelle Conjecture Synthesis) is written for the Isabelle proof
405: assistant, mostly in Standard
406: ML~\cite{Johansson.Dixon.Bundy:conjecture-generation}. It conjectures equations
407: by enumerating expressions involving a given set of (typed) constants and free
408: variables (a \emph{signature}). A constraint solver forbids certain
409: (sub)expressions from being synthesised, and these constraints are extended
410: whenever a new property is conjectured, to avoid generating any special-cases of
411: this property in the future. Conjectures are tested by looking for
412: counterexamples and, if none are found, sent to IsaPlanner which attempts to
413: prove them.
414:
415: QuickSpec emerged from work on the QuickCheck software testing framework for the
416: Haskell programming language~\cite{claessen2011quickcheck}. As well as
417: conjecturing properties of Haskell definitions for HipSpec, it has been applied
418: to Isabelle definitions via the Hipster tool~\cite{Hipster}. Like IsaCoSy,
419: QuickSpec takes a signature and enumerates well-typed terms. It collects
420: together those of the same type into equivalence classes, assuming them to be
421: equal, then uses QuickCheck to find counterexamples to this assumption by
422: randomly instantiating the variables and comparing the resulting closed
423: expressions. Any equivalence class whose elements don't compare equal are split
424: up, and the process is repeated with new random values.
425:
426: After 500 rounds of testing, any expressions still sharing the same equivalence
427: class are conjectured to be equal for all values of their variables. Rather than
428: using a constraint system to prevent redundancies (such as special-cases of
429: other properties), QuickSpec instead passes its output through a congruence
430: closure algorithm to achieve the same effect.
431:
432: \section{Theory Exploration Benchmark}
433: \label{sec:proposal}
434:
435: Our main contribution is a benchmarking methodology, shown in
436: Figure~\ref{figure:flow_chart} on page~\pageref{figure:flow_chart}, which both
437: generates a large definition/ground-truth corpus, and provides a scalable,
438: statistical approach to evaluating MTE tools using this corpus. We follow a
439: precision/recall approach similar to prior work, with the main difference being
440: the source of definitions and ground truths: we take existing problem sets
441: designed for automated theorem proving, and adapt their content for use in the
442: conjecture generation setting.
443:
444: \subsection{Preparation}
445: \label{section:prep}
446:
447: Automated theorem proving is an active area of research, with large problem sets
448: and regular competitions to prove as much as possible, as fast as
449: possible~\cite{pelletier2002development}. These problem sets are an opportunity
450: for MTE, as their definitions and theorems can be used as a corpus in the same
451: way that Isabelle libraries have been used in the past.
452:
453: Some problem sets are more amenable for this purpose than others. The most
454: suitable are those meeting the following criteria:
455:
456: \begin{itemize}
457: \item For each problem, there should be a clear distinction between the
458: statement to be proved and the definitions involved, such that the two can be
459: easily and meaningfully separated. This rules out problem sets like those of
460: SMT-COMP~\cite{barrett2005smt}, where many problems involve uninterpreted
461: functions, whose behaviour is \emph{implicit} in the logical structure of the
462: theorem statement but not separately \emph{defined}.
463: \item Definitions should be translatable into a form suitable for the MTE tools
464: under study. Our application in $\S$\ref{sec:application} requires Haskell and
465: Isabelle translations, and also benefits from having definitions be strongly
466: typed.
467: \item The problem set should be relevant to the desired domain. We focus on pure
468: functional programming, which requires higher-order functions and inductively
469: defined datatypes, which rules out first-order languages/logics (such as
470: TPTP~\cite{sutcliffe2009tptp}).
471: \item The problem set should ideally contain \emph{every} ``interesting''
472: property involving its included definitions, since non-membership in the
473: ground truth will be treated as being ``uninteresting''. More realistically,
474: we should aim for each definition to have \emph{multiple} properties; rather
475: than a mixture of unrelated problems, cherry-picked from different domains.
476: \item Larger problem sets are preferable as they give more robust statistics, all
477: else being equal (i.e. when this does not sacrifice quality).
478: \end{itemize}
479:
480: Once such a problem set has been chosen, we must separate the definitions from
481: the theorem statements which reference those definitions. The definitions will
482: be used as input to the MTE tools, whilst the theorem statements form the ground
483: truth corpus of properties (which tool output will be compared against).
484:
485: It is important to ensure that there are no duplicate definitions: we are only
486: concerned with the \emph{logical} content of the input, not the more arbitrary
487: aspects of their presentation like the names of functions. For example, consider
488: a problem set which includes a statement of commutativity for a \texttt{plus}
489: function, and of associativity for an \texttt{add} function, where the
490: definitions of \texttt{plus} and \texttt{add} are $\alpha$-equivalent. We would
491: expect an MTE tool to either conjecture commutativity and associativity for
492: \emph{both} functions, or for \emph{neither} function, since they are logically
493: equivalent. Yet a na\"ive precision/recall analysis would treat commutativity of
494: \texttt{add} and associativity of \texttt{plus} as \emph{uninteresting}, since
495: they don't appear in the ground truth.
496:
497: For this reason, duplicates should be removed, and any references to them
498: updated to use the remaining definition (e.g. chosen based on lexicographic
499: order). In the above example, the \texttt{plus} function would be removed, and
500: the commutativity statement updated to reference \texttt{add} instead.
501:
502: \subsection{Sampling}
503: \label{section:sampling}
504:
505: We could, in theory, send these de-duplicated definitions straight into an MTE
506: tool and use the entire set of properties (taken from the theorem statements) as
507: the ground truth for analysis. However, this would cause two problems:
508:
509: \begin{itemize}
510: \item The result would be a single data point, which makes it difficult to
511: infer performance \emph{in general}.
512: \item It is impractical to run existing MTE tools on inputs containing more
513: than a few dozen definitions.
514: \end{itemize}
515:
516: To solve both of these problems we instead \emph{sample} a subset of
517: definitions.\footnote{This could be done randomly, but for reproducibility we
518: use a deterministic order based on cryptographic hashes.} The sample size is a
519: free parameter so that users have some control over how long the runs will take
520: ($\S$\ref{sec:application} shows that larger samples take longer, but the
521: correlation is very noisy). Given a sample size, we need to choose a subset of
522: that many definitions from the corpus, and provide only those as input to the
523: tool.
524:
525: We cannot expect an MTE tool to produce all of the properties found in the
526: corpus, if it is only given a subset of definitions as input. Hence for each
527: sample we generate a corresponding ground truth by selecting those properties
528: from the corpus which ``depend on'' (contain references to) \emph{only} the
529: definitions in that sample. Transitive dependencies aren't required (e.g. a
530: property involving only a \texttt{times} function would not depend on a
531: \texttt{plus} function, even if \texttt{plus} occurs in the definition of
532: \texttt{times}).
533:
534: The question remains of how to pick definitions for each sample. One simple way
535: is to choose uniformly without replacement, but this will give many samples
536: whose corresponding ground truths are empty. The effect is similar to a lottery,
537: where our sample is a ticket and must contain all of some property's
538: dependencies in order to ``win'' (i.e. contain that property in its ground
539: truth). In particular, if we choose a sample of size $S$ uniformly at random
540: from a corpus containing $N$ definitions and $P$ properties which depend on an
541: average of $D$ definitions each (assuming no properties have the exact same
542: dependencies), then the probability that our sample contains the dependencies of
543: some particular property $p$ follows a hypergeometric distribution (similar to
544: the binomial distribution, but without replacement):
545:
546: \begin{equation*}
547: Pr(p) = \frac{\binom{S}{D} \binom{N - S}{S - D}}
548: {\binom{N}{S}}
549: \end{equation*}
550:
551: Choosing at least one property's dependencies is a Bernoulli process with $P$
552: repetitions, with probability $1 - (1 - Pr(p))^P$. This gets larger if we
553: increase the sample size $S$ or the number of properties $P$, but it gets
554: smaller if we increase the number of definitions $N$ or the average number of
555: dependencies $D$. This is problematic because we would like large corpora for
556: better statistics, but this would make empty ground truths more likely.
557:
558: Samples whose corresponding ground truth is empty are problematic because they
559: have $0$ precision and undefined recall (due to division by $0$). Such samples
560: are hence unsuitable for our benchmark, especially since these results would be
561: \emph{independent} of the tool's output (all tools would fare the same, so no
562: comparisons could be made). Other measurements could be used for such cases, but
563: since our choice of analysis comes from existing work we would rather avoid such
564: samples by only allowing those with a non-empty ground truth (i.e. containing
565: all dependencies of at least one corpus property).
566:
567: We can achieve this by instead performing uniform sampling \emph{with
568: rejection}, where samples with empty ground truth are rejected and a different
569: sample is drawn instead.
570:
571: One problem with performing rejection sampling is that it becomes very
572: inefficient when the probability of rejection is high. In our case there is a
573: more direct method of sampling, whose output is exactly equivalent to rejection
574: sampling but is far more efficient.
575:
576: Since non-rejected samples must contain the dependencies of some property, we
577: can consider each such sample to be the union of those dependencies and some
578: other ``padding'' definitions. By choosing these two components separately and
579: returning their union, we can obtain an identical distribution without having to
580: perform any rejection.
581:
582: We first choose the set of dependencies by picking from those of all corpus
583: properties, weighted according to the probability that they would appear under
584: uniform sampling. To do this we first discard any dependency sets which are
585: larger than the given sample size (since they cannot appear as subsets); we then
586: discard any dependency set which is a superset of another (since only the subset
587: is required to avoid rejection). Their probabilities under uniform sampling
588: depend only on their size, so we assign weights by calculating the least
589: common multiple (LCM) of their sizes, and dividing that by each set's size. For
590: example, given dependency sets $X = \{a, b, c\}$, $Y = \{b, d\}$ and
591: $Z = \{a, b, e, f\}$ the LCM of their lengths ($3$, $2$ and $4$) is 12, so the
592: weighting will be:
593:
594: \begin{equation*}
595: w(X) = \frac{12}{3} = 4
596: w(Y) = \frac{12}{2} = 6
597: w(Z) = \frac{12}{4} = 3
598: \end{equation*}
599:
600: Once we've chosen a dependency set, we know how much padding to choose (sample
601: size minus dependency set size). The padding will be uniformly distributed,
602: since it does not affect the rejection criterion.
603:
604: Using rejection sampling or, equivalently, our direct method, ensures that the
605: analysis is well defined and depends meaningfully on the tool's output. One
606: downside of applying this non-empty ground truth restriction is that we limit
607: the number of possible samples to measure; this is significant for sample sizes
608: less than~3.
609:
610: \subsection{Evaluation}
611: \label{section:evaluation}
612:
613: Given a sample of definitions and a corresponding ground truth, the actual
614: execution of the MTE tool proceeds as in prior work. We must translate the
615: chosen definitions into the required input format, then we time the execution
616: with a timeout (e.g. 5 minutes). We use wall-clock time since
617: \begin {enumerate*} [i) ]%
618: \item this is most relevant to a user's experience,
619: \item it has a straightforward interpretation and
620: \item measuring it does not require intrusive alterations to a tool's
621: implementation.
622: \end {enumerate*}
623: The downside is that comparisons must take hardware performance into account,
624: which would not be the case if time were measured by some proxy like number of
625: expressions evaluated.
626:
627: In our experiments we have found that memory usage is also an important part of
628: a tool's performance, but rather than complicating our analysis with an extra
629: dimension, we instead allow programs to use as much memory as they like, and
630: either get killed by the operating system or slow down so much from swapping
631: that they time out. This is in line with the expected usage of these tools:
632: either there is enough memory, or there isn't; implementations should not be
633: penalised for making use of available resources.
634:
635: To calculate precision and recall, the conjectures generated by each tool, as
636: well as the ground-truth properties, need to be parsed into a common format and
637: compared syntactically after normalising away ``irrelevant'' details. We
638: consider variable naming to be irrelevant, which can be normalised by numbering
639: free variables from left to right and using de Bruijn indices for bound
640: variables. We also consider the left/right order of equations to be irrelevant,
641: which we normalise by choosing whichever order results in the
642: lexicographically-smallest expression.
643:
644: We specifically \emph{ignore} other logical relationships between syntactically
645: distinct statements, such as one equation being implied by another. Whilst
646: logically sound, second-guessing the ground truth in this way would harm other
647: aspects which influence interestingness (for example, a more general statement
648: is more widely applicable, but it might also be harder to comprehend).
649:
650: This procedure gives us, for each sample, a set of generated conjectures and a
651: set of ground truth properties (from which precision and recall can be
652: calculated) and a single runtime. We propose two methods for analysing this
653: data: \emph{summarising} the performance of a single MTE tool and
654: \emph{comparing} the performance of two tools.
655:
656: \subsection{Summarising}
657:
658: Each sample is only explored once by each tool, so that we cover as many
659: independent samples as possible to better estimate how a tool's performance
660: generalises to unseen inputs. How we combine these data into an aggregate
661: summary depends on what we are interested in measuring. One general question we
662: might ask is how a tool's performance scales with respect to the input size (the
663: number of given definitions). This is straightforward to measure by varying the
664: sample size, but we need some way to combine the results from samples of the
665: same size.
666:
667: We can summarise a set of runtimes by choosing the median, as this is more
668: robust than the mean against long-running outliers, and hence represents
669: performance for a ``typical'' input of that size. Since our methodology measures
670: performance over many samples, we can also compute the \emph{spread} of our
671: data, for example the inter-quartile range. This is not possible using the
672: one-off measurements common in prior evaluation methods.
673:
674: Precision and recall are ratios, which can be averaged (separately) in two ways,
675: known as the \emph{Average of Ratios} (AoR) and the \emph{Ratio of Averages}
676: (RoA)~\cite{egghe2012averages}. The Average of Ratios is the mean value. Using
677: the definitions of precision and recall given in $\S$\ref{sec:te}, we can define
678: their means for $S$ samples indexed by $1 \leq i \leq S$ as follows:
679:
680: \begin{align*}
681: \overline{\text{precision}}_{AoR}
682: &= \frac{1}{S} \sum_i{\text{precision}_i} \\
683: &= \frac{1}{S} \sum_i{
684: \frac{\#\text{interesting}_i}
685: {\#\text{generated}_i}} \\[10pt]
686: \overline{\text{recall}}_{AoR}
687: &= \frac{1}{S} \sum_i{\text{recall}_i} \\
688: &= \frac{1}{S} \sum_i{
689: \frac{\#\text{interesting}_i}
690: {\#\text{groundtruth}_i}}
691: \end{align*}
692:
693: We interpret these mean values as the \emph{expected} precision and recall for
694: samples of this size. In particular, these tell us the quality of the \emph{set}
695: of conjectures that will be generated if we were to run the tool on such a
696: sample, but they don't give us direct information about the individual
697: conjectures themselves. Since these are mean values, they are the reference
698: point for measures of spread such as the variance, standard deviation and mean
699: absolute deviation.
700:
701: The Ratio of Averages is \emph{pooled}, combining the (multi)sets from each
702: sample before taking the ratios. The size of these pooled sets is simply the sum
703: of the individual set sizes, and ratios of these sums are equal to ratios of the
704: means (since the normalising factor $\frac{1}{S}$ appears in both numerator and
705: denominator):
706:
707: \begin{align*}
708: \overline{\text{precision}}_{RoA}
709: &= \frac{\sum_i{\#\text{interesting}_i}}
710: {\sum_i{\#\text{generated}_i}} \\
711: &= \frac{\frac{1}{S} \sum_i{\#\text{interesting}_i}}
712: {\frac{1}{S} \sum_i{\#\text{generated}_i}} \\
713: &= \frac{\hspace{5pt} \overline{\#\text{interesting}} \hspace{5pt}}
714: {\hspace{5pt} \overline{\#\text{generated}} \hspace{5pt}} \\[10pt]
715: \overline{\text{recall}}_{RoA}
716: &= \frac{\sum_i{\#\text{interesting}_i}}
717: {\sum_i{\#\text{groundtruth}_i}} \\
718: &= \frac{\frac{1}{S} \sum_i{\#\text{interesting}_i}}
719: {\frac{1}{S} \sum_i{\#\text{groundtruth}_i}} \\
720: &= \frac{\hspace{5pt} \overline{\#\text{interesting}} \hspace{5pt}}
721: {\hspace{5pt} \overline{\#\text{groundtruth}} \hspace{5pt}}
722: \end{align*}
723:
724: Larger sets contribute more to a Ratio of Averages, unlike the mean which treats
725: each set equally. This weighting gives us expected qualities of
726: \emph{conjectures} rather than sets: the RoA for precision is the expected
727: chance that some particular generated conjecture will appear in the ground
728: truth; the RoA for recall is the expected chance that some particular ground
729: truth property will appear in the generated output.
730:
731: These averages can differ when there is a large variation in set size. For
732: example, if we generate a single, interesting conjecture for one sample and 99
733: uninteresting conjectures for another sample, the AoR (mean) precision will be
734: $\frac{1}{2}$ but the RoA precision will be $\frac{1}{100}$. This highlights the
735: importance of knowing \emph{absolute} numbers of conjectures, such as the mean
736: output size $\overline{\#\text{generated}}$ and its spread, alongside the
737: proportions given by precision and recall.
738:
739: It is also possible to combine precision values \emph{with} recall values
740: (either individually or aggregated) by calculating their \emph{F-score}, defined
741: as:
742:
743: \begin{equation*}
744: F = 2 \times \frac{\text{precision} \times \text{recall}}
745: {\text{precision} + \text{recall}}
746: \end{equation*}
747:
748: This is the harmonic mean of precision and recall, ranging between 0 and 1. The
749: F-score summarises performance into a single number, which is useful for
750: purposes like ranking, but it obscures insights we might gain from looking at
751: precision and recall separately (e.g. whether a tool is generating too much
752: output or too little).
753:
754: \subsection{Comparison}
755:
756: To measure progress and encourage competition in the field, it is important to
757: compare the relative performance of different tools on the same task. Since the
758: aggregate statistics in the above summaries have obscured the details of
759: specific runs, any comparison based on them (such as comparing mean recall)
760: would have very low statistical power. More direct comparisons can be made using
761: \emph{paired} tests, since for each individual sample we have measurements for
762: \emph{both} tools.
763:
764: Running times do not follow a normal distribution, in particular due to the
765: lower bound at 0, which complicates any application of standard comparisons like
766: the paired z-test. An alternative which does not assume normality is the
767: Wilcoxon signed-rank test~\cite{wilcoxon1945individual}, which has been applied
768: to software benchmarking in the Speedup-Test protocol of Touati, Worms and
769: Briais~\cite{touati2013speedup}. Our methodology differs by measuring with a
770: different sample each time, rather than repeatedly measuring one sample of each
771: size; measuring both tools on the \emph{same} samples allows the paired form of
772: the Wilcoxon test to be used.
773:
774: Another complication with our time measurements is the censoring effect caused
775: by timeouts. Paired difference tests are robust to this, since the upper-bound
776: on total time imposes an upper-bound to the observable difference; this biases
777: our conclusions in a conservative way, \emph{towards} the null hypothesis of
778: indistinguishable performance.
779:
780: Quality can be compared using precision and recall, but we must choose how to
781: handle samples where one tool succeeded and the other failed (e.g. by timing
782: out). Failed runs can be treated as if they had succeeded with an empty set of
783: conjectures; this penalises failure-prone tools and simplifies analysis by
784: ensuring every sample produces a data point. An alternative is to discard
785: samples which caused either tool to fail, and compare only those where both
786: tools finished successfully. We follow this second approach, since it is more
787: charitable to the tools being evaluated (which, after all, were not designed
788: with our benchmark in mind). With fewer data points to analyse, we pool together
789: samples of all sizes (allowing duplicates) and compare proportions from these
790: aggregations.
791:
792: Recall relies on a fixed set of properties, the ground truth, so we can compare
793: tools using McNemar's test for paired samples~\cite{mcnemar1947note}. For each
794: tool, we count how many of the ground-truth properties it found which were
795: \emph{not} found by the other tool; i.e. those occasions where one tool was
796: objectively better than the other. McNemar's test determines whether we can
797: reject the null hypothesis that both counts are the same.
798:
799: Comparing precision cannot be done in the same way as recall, since we do not
800: have a fixed set of properties to compare both tools on (there are an unlimited
801: number of properties that may or may not appear in each tool's output). Instead,
802: we can count how many of the generated conjectures were interesting and
803: uninteresting for each tool, and use Boschloo's form of Barnard's
804: test~\cite{lydersen2009recommended} to determine if these proportions are
805: independent. This assumes that precision follows a binomial distribution, and
806: the test is conditioned on the number of generated conjectures (as if this were
807: fixed by design, rather than determined by the tool).
808:
809: \begin{figure}
810: \centering
811: \tikzstyle{startstop} = [rectangle, rounded corners, minimum width=3cm,
812: minimum height=1cm,text centered, draw=black]
813:
814: \tikzstyle{io} = [trapezium, trapezium left angle=70,
815: trapezium right angle=110, minimum width=1cm,
816: minimum height=1cm, text centered, draw=black]
817:
818: \tikzstyle{process} = [rectangle, minimum width=3cm, minimum height=1cm,
819: text centered, draw=black]
820: \tikzstyle{decision} = [diamond, minimum width=3cm, minimum height=1cm,
821: text centered, draw=black]
822:
823: \tikzstyle{arrow} = [thick,->,>=stealth]
824:
825: % Avoids too much padding in io shapes
826: \tikzset{trapezium stretches=true}
827:
828: \begin{tikzpicture}[node distance=1cm]
829:
830: % Preparation section
831:
832: \node (in) [io]{Theorem Proving Benchmark};
833: \node (sep) [process, below=1cm of in ]{Separate definitions from property statements};
834: \node (def) [process, below left=1.5cm and -1cm of sep]{Remove duplicate definitions};
835: \node (ref) [process, below right=1.5cm and -1cm of sep]{Update references};
836:
837: \node (thy) [startstop, below=1cm of def]{Distinct Definitions};
838: \node (thm) [startstop, below=1cm of ref]{Corpus of Properties};
839:
840: \draw [arrow] (in) -- (sep);
841: \draw [arrow] (def) -- (ref);
842: \draw [arrow] (def) -- (thy);
843: \draw [arrow] (ref) -- (thm);
844:
845: % Arrows with labels
846: \path[->]
847: (sep) edge [arrow, sloped, above] node {definitions} (def)
848: (sep) edge [arrow, sloped, above] node {properties} (ref);
849:
850: % Sampling section
851:
852: \node (choose) [process, below=of thm ]{Choose a property};
853: \node (deps) [process, below=of choose]{List property's dependencies};
854:
855: % Create dummy coordinate below deps, then use its y coordinate for pad
856: \coordinate [below=of deps] (padDummy);
857: \path let \p{dummy} = (padDummy),
858: \p{in} = (in)
859: in coordinate (padPos) at (\x{in}, \y{dummy});
860: \node (pad) [process, at=(padPos)]{Pad list to form sample};
861:
862: \node (sthm) [startstop, below right=1.5cm and -1cm of pad]{Ground Truth};
863: \node (sthy) [startstop, below left=1.5cm and -1cm of pad]{Sampled Theory};
864:
865:
866: % Calculate position of (size) using let, then define as normal
867: \path let \p{pad} = (pad),
868: \p{choose} = (choose)
869: in coordinate (sizePos) at (\x{pad},\y{choose});
870: \node (size) [io, at=(sizePos)] {Sample size};
871:
872: % Evaluation section
873: \path let \p{pad} = (pad),
874: \p{sthm} = (sthm)
875: in coordinate (runPos) at (\x{pad}, \y{sthm});
876: \node (run) [process, below=of runPos]{Run MTE tool};
877: \node (pr) [process, below=of run]{Analysis};
878:
879: \node (prec) [startstop, below=of pr ]{Precision};
880: \node (time) [startstop, left=of prec]{Time taken};
881: \node (rec) [startstop, right=of prec]{Recall};
882:
883: \draw [arrow] (thy) |- (pad);
884: \draw [arrow] (thm) -- (choose);
885: \draw [arrow] (choose) -- (deps);
886: \draw [arrow] (deps) |- (pad);
887: \draw [arrow] (size) -- (choose);
888: \draw [arrow] (size) -- (pad);
889: \draw [arrow] (pad) -- (sthm);
890: \draw [arrow] (pad) -- (sthy);
891: \draw [arrow] (sthy) |- (run);
892: \draw [arrow] (run) -- (pr);
893: \draw [arrow] (sthm) |- (pr);
894: \draw [arrow] (pr) -- (prec);
895: \draw [arrow] (pr) -- (rec);
896: \draw [arrow] (pr) -- (time);
897:
898: % Awkward arrow
899: \draw [arrow] (thm) -| ([shift={(7mm,-7mm)}]thm.east) |- (sthm);
900:
901: % Braces
902:
903: % Preparation
904: \draw
905: let \p{thm} = (thm.east),
906: \p{in} = (in.north),
907: \p{rec} = (rec.south east)
908: in [decorate,decoration={brace, amplitude=10pt}, xshift=0.5cm]
909: (\x{rec}, \y{in}) -- (\x{rec}, \y{thm})
910: node [black, midway, right, xshift=0.3cm]
911: {Preparation $\S$\ref{section:prep}};
912:
913: % Sampling
914: \draw
915: let \p{thm} = (thm.east),
916: \p{gt} = (sthm.east),
917: \p{rec} = (rec.south east)
918: in [decorate,decoration={brace, amplitude=10pt}, xshift=0.5cm]
919: (\x{rec}, \y{thm}) -- (\x{rec}, \y{gt})
920: node [black, midway, right, xshift=0.3cm]
921: {Sampling $\S$\ref{section:sampling}};
922:
923: % Evaluation
924: \draw
925: let \p{sthm} = (sthm.east),
926: \p{rec} = (rec.south east)
927: in [decorate,decoration={brace, amplitude=10pt}, xshift=0.5cm]
928: (\x{rec}, \y{sthm}) -- (\x{rec}, \y{rec})
929: node [black, midway, right, xshift=0.3cm]
930: {Evaluation $\S$\ref{section:evaluation}};
931:
932: \end{tikzpicture}
933: \caption[]{High-level view of the benchmarking methodology described in
934: $\S$\ref{sec:proposal}, showing
935: \begin{tikzpicture}
936: \node [rectangle, text centered, draw=black]{processes};
937: \end{tikzpicture},
938: \begin{tikzpicture}
939: \node [rectangle, rounded corners, text centered, draw=black]{data};
940: \end{tikzpicture} and
941: \begin{tikzpicture}
942: \node [trapezium, trapezium left angle=70, trapezium right angle=110,
943: text centered, draw=black]{inputs};
944: \end{tikzpicture}}
945: \label{figure:flow_chart}
946: \end{figure}
947:
948: \section{Application}
949: \label{sec:application}
950:
951: We have applied our benchmarking methodology to the QuickSpec and IsaCoSy MTE
952: tools, using version 0.2 of the TIP (Tons of Inductive Problems) theorem proving
953: benchmark as our ground truth corpus~\cite{claessen2015tip}.
954:
955: To determine our benchmarking parameters we ran some initial tests on both tools
956: for a few samples sized between 1 and 100, for an hour each on our benchmarking
957: machine with a 3.2GHz dual-core Intel i5 processor with hyper-threading and 8GB
958: of RAM. Most QuickSpec runs either finished within 200 seconds or not at all,
959: and sizes above 20 mostly timed out. IsaCoSy mostly finished within 300 seconds
960: on sample sizes up to 4, but by size 8 was mostly timing out; its few successes
961: above this took thousands of seconds each, which we deemed infeasibly long.
962:
963: Based on this we decided to benchmark sample sizes up to 20, since neither tool
964: seemed to perform well beyond that. The Speedup-Test protocol follows the
965: statistical ``rule of thumb'' of treating sample sizes $\leq$ 30 as ``small'',
966: so we pick 31 samples of each size in order to cross this threshold. This gave a
967: total of 1240 runs. To keep the benchmarking time down to a few days we chose a
968: timeout of 300 seconds, since that covered most of the successful QuickSpec and
969: IsaCoSy results we saw, and longer times gave rapidly diminishing
970: returns. During analysis, duplicate samples (caused by our requirement that
971: samples have a non-empty ground truth) were found and discarded, so only 14
972: samples of size 1 were used and 30 of size 2.
973:
974: \subsection{Tons of Inductive Problems}
975: \label{sec:tip}
976:
977: We chose the Tons of Inductive Problems (TIP) benchmark for our ground truth
978: since it satisfies the criteria specified in $\S$\ref{section:prep}: each
979: benchmark problem has standalone type and function definitions, making their
980: separation trivial; known examples from the software verification and inductive
981: theorem proving literature are included, ensuring relevance to those fields; the
982: format includes the higher-order functions and inductive datatypes we are
983: interested in; it is large enough to pose a challenge to current MTE tools; plus
984: it is accompanied by tooling to convert its custom format (an extension of
985: SMT-Lib~\cite{BarFT-SMTLIB}) into a variety of languages, including Haskell and
986: Isabelle.
987:
988: We use TIP version 0.2 which contains 343 problems, each stating a single
989: property and together defining a total of 618 datatypes and 1498 functions. Most
990: of these are duplicates, since each problem (re\nobreakdash-)defines all of the
991: datatypes and functions it involves.
992:
993: TIP datatypes can have several ``constructors'' (introduction forms) and
994: ``destructors'' (elimination forms; field accessors). For example the type of
995: lists from Figure~\ref{figure:list_theory} can be defined in the TIP format as
996: follows:
997:
998: \begin{samepage}
999: \begin{verbatim}
1000: (declare-datatypes
1001: (a) ;; Type parameter (element type)
1002: ((List ;; Type name
1003: (Nil) ;; Constructor (nullary)
1004: (Cons ;; Constructor (binary)
1005: (head a) ;; Field name and type
1006: (tail (List a)))))) ;; Field name and type
1007: \end{verbatim}
1008: \end{samepage}
1009:
1010: Our target languages (Haskell and Isabelle) differ in the way they handle
1011: constructors and destructors, which complicates comparisons. To avoid this, we
1012: generate a new function for each constructor (via $\eta$-expansion) and
1013: destructor (via pattern-matching) of the following form:
1014:
1015: \begin{samepage}
1016: \begin{verbatim}
1017: (define-fun
1018: (par (a) ;; Type parameter
1019: (constructor-Cons ;; Function name
1020: ((x a) (xs (List a))) ;; Argument names and types
1021: (List a) ;; Return type
1022: (as ;; Type annotation
1023: (Cons x xs) ;; Return value
1024: (List a))))) ;; Return type
1025: \end{verbatim}
1026: \end{samepage}
1027:
1028: \begin{samepage}
1029: \begin{verbatim}
1030: (define-fun
1031: (par (a) ;; Type parameter
1032: (destructor-head ;; Function name
1033: ((xs (List a))) ;; Argument name and type
1034: a ;; Return type
1035: (match xs ;; Pattern-match
1036: (case (Cons h t) h))))) ;; Return relevant field
1037: \end{verbatim}
1038: \end{samepage}
1039:
1040: \begin{sloppypar}
1041: We rewrite the TIP properties (our ground truth) to reference these expanded
1042: forms instead of the raw constructors and destructors, and use these functions
1043: in our samples in lieu of the raw expressions. Note that these destructor
1044: wrappers are \emph{partial} functions (e.g. \texttt{destructor-head} and
1045: \texttt{destructor-tail} are undefined for the input \texttt{Nil}), which
1046: complicates their translation to proof assistants like Isabelle.
1047: \end{sloppypar}
1048:
1049: Another complication is TIP's ``native'' support for booleans and integers,
1050: which allows numerals and symbols like \texttt{+} to appear without any
1051: accompanying definition. To ensure consistency in the translations, we replace
1052: all occurrences of such expressions with standard definitions written with the
1053: ``user-level'' \texttt{declare-datatypes} and \texttt{define-fun}
1054: mechanisms.~\footnote{\texttt{Boolean} has \texttt{true} and \texttt{false}
1055: constructors; \texttt{Natural} has \texttt{zero} and \texttt{successor};
1056: \texttt{Integer} has unary \texttt{positive} and \texttt{negative}
1057: constructors taking \texttt{Natural}s, and a nullary \texttt{zero} for
1058: symmetry.}
1059:
1060: When we add all of these generated types and functions to those in TIP, we get a
1061: total of 3598 definitions. Removing $\alpha$-equivalent duplicates leaves 269,
1062: and we choose to only sample from those 182 functions which are referenced by at
1063: least one property (this removes ambiguity about which \emph{definitions} count
1064: as interesting and which are just ``implementation details'' for other
1065: definitions).
1066:
1067: TIP comes with software to translate its definitions into Haskell and Isabelle
1068: code, including comparison functions and random data generators suitable for
1069: QuickCheck. We translate all 269 unique definitions into a single module/theory
1070: which is imported on each run of the tools, although only those functions which
1071: appear in the current sample are included in the signature and explored. We also
1072: encode all names in hexadecimal to avoid problems with language-specific naming
1073: rules, for example \texttt{add} becomes \texttt{global616464} (the prefix
1074: distinguishes these from local variables and prevents names from beginning with
1075: a digit). This ensures that the generated conjectures will be using the same
1076: names as the ground truth, rather than some language-specific variant.
1077:
1078: \subsection{QuickSpec}
1079:
1080: We benchmarked QuickSpec version 0.9.6, a tool written in Haskell for
1081: conjecturing equations involving Haskell functions, described in more detail in
1082: $\S$\ref{sec:existing-tools}. In order to thoroughly benchmark QuickSpec, we
1083: need to automate some of the decisions which are normally left up to the user:
1084:
1085: \begin{sloppypar}
1086: \begin{itemize}
1087: \item We must decide what variables to include. We choose to add three
1088: variables for each type that appears as a function argument, except for
1089: types which have no QuickCheck data generators.
1090: \item We must \emph{monomorphise} all types. For example, functions like
1091: \texttt{constructor-Cons} are \emph{polymorphic}: they build lists of any
1092: element type, but we need to pick a specific type in order to know which
1093: random value generator to use. We resolve this (arbitrarily) by picking
1094: \texttt{Integer}.~\footnote{We pick \texttt{Integer} for variables of kind
1095: \texttt{*} (types); for kind \texttt{* -> *} (type constructors) we pick
1096: \texttt{[]} (Haskell's list type constructor). If these violate some
1097: type class constraint, we pick a suitable type non-deterministically from
1098: those in scope during compilation; if no suitable type is found, we give
1099: up and don't include that function.}
1100: \item Haskell functions are ``black boxes'', which QuickSpec can't compare
1101: during its exploration process. They are also curried, always taking one
1102: argument but potentially returning another function. QuickSpec lets us
1103: assign an arity to each function in the signature, from 0 to 5, so we pick
1104: the highest that is type-correct, since this avoids a proliferation of
1105: incomparable, partially-applied functions.
1106: \end{itemize}
1107: \end{sloppypar}
1108:
1109: \begin{figure}
1110: \centering
1111: \includegraphics[scale=1]{Fig3}
1112: \caption{Running times of QuickSpec on theories sampled from TIP. Each point
1113: is a successful run (spread out horizontally to reduce overlaps). Red
1114: crosses show runs which timed out, which occurred for every size. Each size
1115: has 31 runs total, except for size 1 (14 runs) and size 2 (30 runs) due to
1116: sampling restrictions. Box plots show inter-quartile range, which reaches
1117: the timeout for sizes above 8. Medians are marked with a line and remain
1118: near zero until size 10, with those of size 13 and 20 timing out. Whiskers
1119: show 1.5$\times$~inter-quartile range}
1120: \label{figure:quickspec_runtimes}
1121: \end{figure}
1122:
1123: \begin{figure}
1124: \centering
1125: \includegraphics[scale=1]{Fig4}
1126: \caption{Precision and recall of successful QuickSpec runs (spread
1127: horizontally to reduce overlaps). Lines show the mean proportion for each
1128: sample size (Average of Ratios) and errorbars show the sample standard
1129: deviation. Mean precision steadily reduces from around $0.5$ at size 1 to
1130: around $0.1$ for size 20, whilst mean recall remains roughly flat between
1131: $0.2$ and $0.5$}
1132: \label{figure:quickspec_precRec}
1133: \end{figure}
1134:
1135: The time taken to explore samples of different sizes is shown in
1136: Figure~\ref{figure:quickspec_runtimes}. Failed runs are shown in red: all
1137: failures were due to timing out, and these occurred for all sample sizes but are
1138: more common for larger samples (over half the runs for sample sizes 13 and 20
1139: timed out). Runs which succeeded mostly finished within 30 seconds, generating
1140: few or no conjectures; those taking longer generated more conjectures, with the
1141: most being 133 conjectures from a sample of size 19.
1142:
1143: Precision and recall are shown in Figure~\ref{figure:quickspec_precRec}.
1144: Since QuickSpec generates monotonically more conjectures as definitions are
1145: added to a signature (assuming sufficient testing), its decreasing precision
1146: can't be due to finding fewer wanted conjectures at larger sizes. This is
1147: supported by the relative flatness of the recall results. Rather, the number of
1148: conjectures generated is increasing at a higher rate than the size of the ground
1149: truth. These extra conjectures may involve those ``padding'' definitions in a
1150: sample which don't contribute to its ground truth, or may be ``uninteresting''
1151: relationships between the dependencies of different ground truth properties.
1152:
1153: This indicates two potential improvements to the QuickSpec algorithm (as far as
1154: this benchmark is concerned). The deluge of generated conjectures could be
1155: filtered down to a more desirable sub-set by another post-processing step.
1156: Alternatively, rather than exploring all of the given definitions together,
1157: multiple smaller signatures could be selected from the input by predicting which
1158: combinations are likely to lead to interesting conjectures; this would avoid
1159: both the ``padding'' and the cross-dependency relationships. Both of these
1160: methods could improve the precision, although care would be needed to avoid a
1161: large reduction in recall. The latter option could also improve the running
1162: time, since (based on Figure~\ref{figure:quickspec_runtimes}) multiple smaller
1163: signatures may be faster to explore than a single large one. Such improvements
1164: would also need to be ``generic'', to avoid over-fitting to this particular
1165: benchmark.
1166:
1167: QuickSpec's recall is limited by two factors: the algorithm is unable to
1168: synthesise some properties, such as conditional equations, inequalities, terms
1169: larger than the search depth and those containing anonymous functions. The
1170: congruence closure algorithm used as a post-processor may also be removing
1171: ``interesting'' results, for example if we found an ``uninteresting'' result
1172: which is more general.
1173:
1174: \subsection{IsaCoSy}
1175:
1176: We took IsaCoSy from version 2015.0.3 of the IsaPlanner project, and ran it
1177: with the 2015 version of Isabelle. The following issues had to be overcome to
1178: make our benchmark applicable to IsaCoSy:
1179:
1180: \begin{itemize}
1181: \item TIP includes a benchmark called \texttt{polyrec} whose types cannot be
1182: encoded in Isabelle. We strip out this type and the functions which depend on
1183: it before translating. It still appears in samples and contributes to the
1184: ground truth, which penalises IsaCoSy for being unable to explore such
1185: definitions.
1186: \item When using a type in an IsaCoSy signature, that type's constructors will
1187: automatically be included in the exploration. Since those constructors will
1188: not appear in the ground truth (we use $\eta$-expanded wrappers instead, and
1189: even those may not be present in the current sample) this will unfairly reduce
1190: the calculated precision. To avoid this, we add a post-processing step which
1191: replaces all occurrences of a constructor with the corresponding wrapper, then
1192: discards any conjectures which involve functions other than those in the
1193: current sample. This presumably results in more work for IsaCoSy, exploring
1194: constructors unnecessarily, but it at least does not bring down the quality
1195: measures.
1196: \item Since Isabelle is designed for theorem proving rather than programming, it
1197: requires every definition to be accompanied by proofs of exhaustiveness and
1198: termination. These are difficult to generate automatically, and don't exist in
1199: the case of partial functions like destructor wrappers. Hence we use the
1200: ``quick and dirty'' option in Isabelle, which lets us skip these proofs with
1201: the \texttt{sorry} keyword.
1202: \item Partial functions cause problems during exploration, since they can throw
1203: an ``undefined'' exception which causes IsaCoSy to abort. We avoid this by
1204: pre-populating IsaCoSy's constraint set with these undefined expressions
1205: (for example \texttt{(destructor-head constructor-Nil)}), hence preventing
1206: IsaCoSy from ever generating them.
1207: \end{itemize}
1208:
1209: \begin{figure}
1210: \centering
1211: \includegraphics[scale=1]{Fig5}
1212: \caption{Running times of IsaCoSy on theories sampled from TIP. Each point
1213: is a successful run (spread out horizontally to reduce overlaps). Red
1214: crosses are failed runs, caused by timeouts or out-of-memory. All runs of
1215: size 1 succeeded, whilst nothing succeeded above size 6. Each size has 31
1216: runs total, except for size 1 (14 runs) and size 2 (30 runs) due to sampling
1217: restrictions. Box plots show inter-quartile range, which grows with size,
1218: and lines mark the median time, which increases rapidly from around 100
1219: seconds at size 1 to timing out at size 4 and above. Whiskers show
1220: 1.5$\times$~inter-quartile range}
1221: \label{figure:isacosy_runtimes}
1222: \end{figure}
1223:
1224: \begin{figure}
1225: \centering
1226: \includegraphics[scale=1]{Fig6}
1227: \caption{Precision and recall of all successful IsaCoSy runs (spread
1228: horizontally to reduce overlaps); there were no successes above size 6.
1229: Lines show the mean proportion for each sample size (Average of Ratios),
1230: which trends downwards for precision and appears flat for recall, although
1231: there are too few datapoints to be definitive. Errorbars show the sample
1232: standard deviation}
1233: \label{figure:isacosy_precRec}
1234: \end{figure}
1235:
1236: The most striking result is how rapidly IsaCoSy's running time, shown in
1237: Figure~\ref{figure:isacosy_runtimes}, increases with sample size: all runs above
1238: size 6 failed, with most timing out and a few aborting early due to running out
1239: of memory. Like in our preliminary testing, this increase appears exponential,
1240: so even large increases to the timeout would not produce many more successful
1241: observations. A significant amount of IsaCoSy's time (around 50 seconds) was
1242: spent loading the generated theory (containing all distinct datatypes and
1243: functions from TIP); this overhead is unfortunate, but since it is constant
1244: across all sample sizes it does not affect our conclusions.
1245:
1246: With so few successful datapoints it is difficult to make strong claims about
1247: the precision/recall quality of IsaCoSy's output, shown in
1248: Figure~\ref{figure:isacosy_precRec}. It is clear from the quantity of non-zero
1249: proportions that IsaCoSy is capable of discovering interesting conjectures, and
1250: the graphs appear to follow the same shapes as those of QuickSpec (decreasing
1251: precision, flat recall), although the error margins are too wide to be
1252: definitive.
1253:
1254: \subsection{Comparison}
1255:
1256: We compare the running times of QuickSpec and IsaCoSy using our paired variant
1257: of the Speedup-Test protocol, where each of our sample sizes is a separate
1258: ``benchmark''. We used version 2 of the Speedup-Test \texttt{R} implementation,
1259: which we patched to use the \emph{paired} form of the (one-sided) Wilcoxon
1260: signed-rank test~\cite{wilcoxon1945individual}. We use Pratt's
1261: method~\cite{pratt1959remarks} to handle ties, which occur when both tools time
1262: out.
1263:
1264: For each sample size (``benchmark'') the Speedup-Test protocol compares the
1265: distributions of each tool's times using a Kolmogorov-Smirnov test. If these are
1266: significantly different (with $\alpha < 0.05$), then the signed-rank test is
1267: only performed if more than 30 samples are available. This was the case for all
1268: sample sizes except for 1 and 2 (due to the removed duplicates), and hence those
1269: two speedups were not deemed statistically significant. For all other sample
1270: sizes QuickSpec was found to be significantly faster with $\alpha = 0.05$,
1271: putting the proportion of sizes with faster median times between 66.9\% and
1272: 98.2\% with 95\% confidence. The magnitude of each speedup (median IsaCoSy time
1273: divided by median QuickSpec time) is given in table~\ref{table:speedups}. We
1274: would predict the speedup to grow for larger sample sizes, but the increasing
1275: proportion of timeouts causes our estimates to become more conservative: by size
1276: 20 most runs are timing out, and hence are tied.
1277:
1278: \begin{table}
1279: \centering
1280: \begin{tabular}{ |r|l| }
1281: \hline
1282: \bfseries Sample Size & \bfseries Speedup
1283: \csvreader[]{speedups.csv}{}
1284: {\\\hline\csvcoli&\csvcolii} \\
1285: \hline
1286: \end{tabular}
1287: \caption{Speedup from using QuickSpec compared to IsaCoSy, i.e. the median
1288: time taken by IsaCoSy divided by that of QuickSpec. QuickSpec was found to
1289: be significantly faster ($\alpha = 0.05$) for all sizes, except 1 and 2 due
1290: to too few samples. These are conservative estimates, since the 300 second
1291: time limit acts to reduce the measured time difference (e.g. sizes 13 and
1292: 20, where the median for both tools timed out)}
1293: \label{table:speedups}
1294: \end{table}
1295:
1296: We compare the recall proportions by applying McNemar's test to only those
1297: samples where both tools succeeded. We pooled these together from all sample
1298: sizes to produce table~\ref{table:recall}, and found that the recall of
1299: QuickSpec ($37\%$) is significantly higher than IsaCoSy ($25\%$) with a p-value
1300: of 0.0026.
1301:
1302: \begin{table}
1303: \centering
1304: \begin{tabular}{ |r|l|l| }
1305: \hline
1306: & \bfseries Found by IsaCoSy & \bfseries Not found by IsaCoSy \\
1307: \hline
1308: \bfseries Found by QuickSpec & 28 & 19 \\
1309: \hline
1310: \bfseries Not found by QuickSpec & 4 & 75 \\
1311: \hline
1312: \end{tabular}
1313: \caption{Contingency table for ground truth properties, pooled from those
1314: samples where both QuickSpec and IsaCoSy finished successfully. The combined
1315: recall for QuickSpec is $37\%$ and for IsaCoSy is $25\%$}
1316: \label{table:recall}
1317: \end{table}
1318:
1319: McNemar's test is not applicable for comparing precision, since each tool
1320: generated different sets of conjectures. Instead, we add up the number of
1321: results which are ``interesting'' (appear in the ground truth) and those which
1322: are ``uninteresting'' (do not appear in the ground truth), with the results
1323: shown in table~\ref{table:precision}
1324:
1325: \begin{table}
1326: \centering
1327: \begin{tabular}{ |r|l|l| }
1328: \hline
1329: & \bfseries Interesting & \bfseries Uninteresting \\
1330: \hline
1331: \bfseries IsaCoSy & 32 & 137 \\
1332: \hline
1333: \bfseries QuickSpec & 47 & 301 \\
1334: \hline
1335: \end{tabular}
1336: \caption{Interesting and uninteresting conjectures generated by QuickSpec and
1337: IsaCoSy, pooled from those samples where both both finished successfully.
1338: The combined precision for QuickSpec is $14\%$ and for IsaCoSy is $19\%$}
1339: \label{table:precision}
1340: \end{table}
1341:
1342: We tested these totals for independence using Boschloo's test and find a p-value
1343: of 0.111, which exceeds our (conventional) significance threshold of
1344: $\alpha = 0.05$; hence we do not consider the difference between these
1345: proportions (14\% for QuickSpec, 19\% for IsaCoSy) to be significant.
1346:
1347: Note that precision is the only comparison where IsaCoSy scored higher, and even
1348: that was not found to be significant. Also, precision on its own is not the
1349: whole story: it can be increased at the expense of recall by making a tool more
1350: conservative. IsaCoSy may already be overly-conservative compared to QuickSpec,
1351: given that its recall is significantly lower.
1352:
1353: More importantly, the poor time and memory usage of IsaCoSy meant that very few
1354: samples finished successfully. If we include failed runs in our tables, treating
1355: them as if they succeeded with no output, the resulting statistics lean
1356: overwhelmingly in favour of QuickSpec simply because it generated so much more
1357: than IsaCoSy in the available time.
1358:
1359: \section{Discussion}
1360: \label{sec:discussion}
1361:
1362: Fundamentally, when designing any mathematical reasoning system we must decide
1363: on, and formalise, what counts as ``the good'' in mathematics. Obvious metrics
1364: such as ``true'' or ``provable'' include trivial tautologies, while at the same
1365: time failing to capture the ``almost true'', which can be a valuable trigger for
1366: theory change, as demonstrated by Lakatos in his case studies of mathematical
1367: development~\cite{lakatos}. ``Beautiful'' is another -- albeit vague -- commonly
1368: proposed metric. Neuro-scientists such as Zeki \etal{} have attempted to shed
1369: light on this by testing whether mathematicians' experiences of abstract beauty
1370: correlates with the same brain activity as experiences of sensory
1371: beauty~\cite{10.3389/fnhum.2014.00068}. Qualities from computer scientists like
1372: Colton (such as those in~\cite{colton2000notion}) are based largely on
1373: ``intuition'', plausibility arguments about why a metric would be important, and
1374: previous use of such metrics (in the case of ``surprisingness'', from a single
1375: quote from a mathematician). Opinions from mathematicians themselves include
1376: Gowers' suggestion that we can identify features which are commonly associated
1377: with good proofs~\cite{gowers2000two}, Erdos's famous idea of ``The
1378: Book''~\cite{aigner2010proofs} as well as McCasland's personal evaluation of the
1379: interestingness of MATHsAiD's output~\cite{roy} (of which he was the main system
1380: developer).
1381:
1382: All of these approaches rest upon the assumption that it makes sense to speak of
1383: ``the good'' in mathematics. However, empirical psychological studies
1384: call into question such assumptions: for example, work by Inglis and colleagues
1385: has shown that there is not even a single standard of \emph{validity} among
1386: contemporary mathematicians~\cite{inglis2013mathematicians}.
1387:
1388: Whilst these difficulties are real and important, we cannot ignore the fact that
1389: mathematics is nevertheless being practised around the world; and similarly that
1390: researchers have forged ahead to develop a variety of tools for automated
1391: exploratory mathematics. If we wish to see these tools head in a useful,
1392: fruitful direction then \emph{some} method is needed to compare their
1393: approximate ``quality'' in a concrete, measurable way.
1394:
1395: The key to our benchmarking methodology is to side-step much of this
1396: philosophical quagmire using the simplifying assumption that theorem proving
1397: problem sets are a good proxy for desirable input/output behaviour of MTE tools.
1398: As an extension of existing precision/recall analysis, this should hopefully not
1399: prove too controversial, although we acknowledge that there are compelling
1400: reasons to refute it.
1401:
1402: We do not claim that our use of corpora as a ground truth exactly captures all
1403: interesting conjectures of their definitions, or that those definitions exactly
1404: represent all theories we may wish to explore. Rather, we consider our approach
1405: to offer a pareto-optimal balance between theoretical rigour and experimental
1406: practicality, at least in the short term. Futhermore, since research is already
1407: on-going in these areas, we hope to at least improve on existing evaluation
1408: practices and offer a common ground for future endeavours.
1409:
1410: One notable weakness is that our methodology does not allow negative examples,
1411: such as particularly dull properties that tools should never produce. Everything
1412: outside the ground truth set is treated equally, whether it's genuinely
1413: uninteresting or was only left out due to oversight. In particular this limits
1414: the use of our approach to domains where existing human knowledge surpasses that
1415: discovered by the machine. Any \emph{truly novel} insights discovered during
1416: testing will not, by definition, appear in any existing corpus, and we would in
1417: fact \emph{penalise} tools for such output. We do not believe this to be a
1418: realistic problem for the time being, as long as evaluation is limited to
1419: well-studied domains and results can be generalised to real areas of
1420: application. This does emphasise the continued importance of testing these tools
1421: with real human users, rather than relying solely on artificial benchmarks.
1422:
1423: Another practical limitation of our benchmarking approach is that it only
1424: applies to tools which act in ``batch mode'', i.e. those which choose to halt
1425: after emitting some output. Whilst all of the systems we have encountered are of
1426: this form, some (such as IsaCoSy, QuickSpec 2 and Speculate) could conceivably
1427: be run without a depth limit, and form part of the ``scientist assistant'' role
1428: which Lenat envisaged for AM, or McCarthy's earlier ``advice
1429: taker''~\cite{McCarthy_Programs59}. Analogous benchmarks could be designed for
1430: such long-running, potentially interactive programs, but that is beyond the
1431: scope of this project.
1432:
1433: \section{Related Work}
1434: \label{sec:related-work}
1435:
1436: The automation of mathematical tasks has been pursued since at least the time of
1437: mechanical calculators like the Pascaline~\cite{d'ocagne}. A recurring theme in
1438: these efforts is the separation between those undertaken by mathematicians like
1439: Pascal and Babbage~\cite{bowden}, and those of engineers such as
1440: M\"uller~\cite[p. 65]{lindgren}. This pattern continues today, with the tasks we
1441: are concerned with (automatically constructing and evaluating concepts,
1442: conjectures, theorems, axioms, examples, etc.) being divided into two main
1443: fields: Mathematical Theory Exploration (MTE)~\cite{buchberger:06} (also
1444: sometimes prefaced with ``Computer-Aided'', ``Automated'' or
1445: ``Algorithm-Supported''), which is championed by mathematicians such as
1446: Buchberger~\cite{buchberger}; and Automated Theory Formation
1447: (ATF)~\cite{lenat:77,colton:book}, pursued by AI researchers including Lenat.
1448: Other related terms include ``Automated Mathematical
1449: Discovery''~\cite{epstein:91,colton2000notion,esarm2008},
1450: ``Concept Formation in Discovery Systems''~\cite{haase}, and
1451: ``Automated Theorem Discovery''~\cite{roy}.
1452:
1453: Such a plethora of terminology can mask similarities and shared goals between
1454: these fields. Even notable historical differences, such as the emphasis of MTE
1455: on user-interaction and mathematical domains, in contrast to the full automation
1456: and more general applications targeted by ATF, are disappearing in recent
1457: implementations.
1458:
1459: An important historical implementation of ATF is Lenat's AM (Automated
1460: Mathematician) system. Unlike prior work, such as
1461: Meta-Dendral~\cite{buchanan:75} and those described in~\cite{winston}, AM aims
1462: to be a general purpose mathematical discovery system, designed to both
1463: construct new concepts and conjecture relationships between them. AM is a
1464: rule-based system which represents knowledge using a frame-like scheme, enlarges
1465: its knowledge base via a collection of heuristic rules, and controls the firing
1466: of these rules via an agenda mechanism. Evaluation of AM considered generality
1467: (performance in new domains) and how finely-tuned various aspects of the program
1468: are (the agenda, the interaction of the heuristics, etc). Most of this
1469: evaluation was qualitative, and has subsequently been
1470: criticised~\cite[chap.~13]{colton:book}. In their case study in methodology,
1471: Ritchie and Hanna found a large discrepancy between the theoretical claims made
1472: of AM and the implemented program~\cite{ritchie1984case}; for example, AM ``invented''
1473: natural numbers from sets, but did so using a heuristic specifically designed to
1474: make this connection.
1475:
1476: The prototypical implementation of MTE is the Theorema system of Buchberger and
1477: colleagues~\cite{buchberger,buchberger2016theorema}, which also places a strong
1478: emphasis on user interface and output presentation. Theory exploration in the
1479: Theorema system involves the user formalising their definitions in a consistent,
1480: layered approach; such that reasoning algorithms can exploit this structure in
1481: subsequent proofs, calculations, etc. The potential of this strategy was
1482: evaluated by illustrating the automated synthesis of Buchberger's own Gr\"obner
1483: bases algorithm~\cite{buchberger:04}.
1484:
1485: A similar ``layering'' approach is found in the IsaScheme system of
1486: Monta{\~n}o-Rivas \etal{}~\cite{Montano-Rivas.McCasland.Dixon.ea:2012}, which
1487: has also been quantitatively compared against IsaCoSy and HipSpec using
1488: precision/recall analysis~\cite{claessen2013automating}. The name comes from its
1489: embedding in the Isabelle proof assistant and its use of ``schemes'':
1490: higher-order formulae which can be used to generate new concepts and
1491: conjectures. Variables within a scheme are instantiated automatically and this
1492: drives the invention process. For example, the concept of ``repetition'' can be
1493: encoded as a scheme, and instantiated with existing encodings of zero, successor
1494: and addition to produce a definition of multiplication. The same scheme can be
1495: instantiated with this new multiplication function to produce exponentiation.
1496:
1497: IsaCoSy and QuickSpec (the conjecture generation component of HipSpec) are
1498: described in more detail in $\S$\ref{sec:existing-tools}, since these are the
1499: tools we chose to evaluate and compare for $\S$\ref{sec:application}. QuickSpec
1500: has since evolved to version 2~\cite{smallbone2017quick}, which replaces the
1501: distinct enumeration and testing steps with a single, iterative algorithm
1502: similar to that of IsaCoSy. Generated conjectures are fed into a Knuth-Bendix
1503: completion algorithm to form a corresponding set of rewrite rules. As
1504: expressions are enumerated, they are simplified using these rules and discarded
1505: if equal to a known expression. If not, QuickCheck tests whether the new
1506: expression can be distinguished from the known expressions through random
1507: testing: those which can are added to the set of known expressions. Those which
1508: cannot be distinguished are conjectured to be equal, and the rewrite rules are
1509: updated.
1510:
1511: QuickSpec has also inspired another MTE tool for Haskell called
1512: Speculate~\cite{braquehais2017speculate}, which operates in a similar way but
1513: also makes use of the laws of total orders and Boolean algebra to conjecture
1514: \emph{in}equalities and conditional relations between expressions.
1515:
1516: Another notable MTE implementation, distinct from those based in Isabelle and
1517: Haskell, is the MATHsAiD project (Mechanically Ascertaining Theorems from
1518: Hypotheses, Axioms and Definitions)~\cite{roy}. Unlike the tools above, which
1519: generate \emph{conjectures} that may later be sent to automated provers,
1520: MATHsAiD directly generates \emph{theorems}, by making logically valid
1521: inferences from a given set of axioms and definitions. Evaluation of the
1522: interestingness of these theorems was performed qualitatively by the system's
1523: developer, which highlights how these tools could benefit from the availability
1524: of an objective, repeatable, quantitative method of evaluation and comparison
1525: such as ours.
1526:
1527: Whilst there are many reasonably objective benchmarks for mathematical tasks
1528: such as automated theorem proving, the precision/recall analysis shown
1529: in~\cite{claessen2013automating}, and described further in $\S$\ref{sec:te}, is
1530: the only quantitative comparison of these recent MTE tools we have found in the
1531: literature. Our work is essentially an extension of this approach, to a larger
1532: and more diverse set of examples. The suitability of the TIP theorem proving
1533: benchmark for our purposes, detailed in $\S$\ref{sec:tip}, is not coincidental,
1534: since its developers are also those of the IsaCoSy and QuickSpec tools we have
1535: tested. This goes some way to ensuring that our demonstration is a faithful
1536: representation of the task these tools were intended to solve; our independent
1537: repurposing of this problem set, in a way it was not designed for, reduces the
1538: risk that the benchmark is tailor-made for these tools (or that the tools
1539: over-fit to these particular problems).
1540:
1541: \section{Conclusion}
1542: \label{sec:conclusion}
1543:
1544: We propose a general benchmarking methodology for measuring, summarising and
1545: comparing performance of diverse approaches to the conjecture generation
1546: problem, whilst avoiding some of the philosophical complications and ambiguities
1547: of the field. This methodology can be tailored for specific interests, such as
1548: the choice of problem set and the focus of the resulting analysis.
1549:
1550: We have also presented an example application for the domain of higher-order
1551: inductive theories (which is immediately applicable to problems in functional
1552: programming). Using the TIP problem set as a corpus, we have evaluated the
1553: performance of the QuickSpec and IsaCoSy tools and demonstrated QuickSpec to be
1554: both significantly faster and to also output more desirable conjectures than
1555: IsaCoSy; although more \emph{undesirable} output may be generated as well. We
1556: found that they both fail, due to time and memory constraints, for a large
1557: proportion of inputs; that this gets worse as input size increases; and that
1558: IsaCoSy fails more often than QuickSpec. Based on these results we proposed two
1559: possible directions to improve the QuickSpec algorithm: a post-processing filter
1560: to remove more ``uninteresting'' conjectures and a pre-processing filter to
1561: ``focus'' on promising subsets of the input.
1562:
1563: Other promising directions for future work include the evaluation of other MTE
1564: tools, the use of other corpora more suited to different problem domains, and
1565: the extension of existing corpora with new definitions and properties (which
1566: would also be of benefit to the original ATP benchmarks).
1567:
1568: We believe that a standard approach to benchmarking and comparison such as ours
1569: will ease the burden on researchers wanting to evaluate different potential
1570: approaches to this task, and provide a common goal to pursue in the short term.
1571:
1572: ``Solving'' this benchmark would not solve the problem of conjecture generation
1573: in general, so more ambitious goals must be set in the future. For now, we
1574: believe that our approach provides a compelling challenge and will encourage
1575: healthy competition to improve the field.
1576:
1577: \bibliographystyle{plain}
1578: \bibliography{./Bibtex}
1579:
1580: \end{document}
Generated by git2html.