The Web as an Oracle
I’ve tried a few times to write up my programming thoughts for this
blog, each time from a different perspective and usually getting rather
off-topic. Those posts will probably appear eventually (having invested
the time to write them, I may as well publish them), but a thought
struck me about how to clarify my meaning concisely.
My ideas have been revolving around formal specification
and verification of programs. Since I approached the subject from a
Python perspective it meant living without the luxury of an enforced
type system, definite syntax and semantics, immutability of variables,
etc. that one gets with a functional language like ML or Haskell. Since
anything in Python can be redefined at any point, for example swapping
an object’s class via my_object.__class__ = Foo, I started by
considering a system that broke down code into black-box style atomic
operations (assuming a single thread, because I’m against multithreading
as a usable model for concurrent processing).
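As a minimal sketch of the kind of redefinition meant above (the class names are invented for illustration), reassigning `__class__` changes an object's behaviour mid-flight, invalidating anything we might have "proved" about it statically:

```python
class Greeter:
    def speak(self):
        return "hello"

class Shouter:
    def speak(self):
        return "HELLO"

obj = Greeter()
print(obj.speak())   # "hello"

# Swapping the object's class at runtime changes which method
# a later call to speak() will dispatch to.
obj.__class__ = Shouter
print(obj.speak())   # "HELLO"
```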
For example, we
could look at a function and work out conservative bounds on its
preconditions and postconditions. If the function runs
“argument.foo += 1” then we know (unless we’re catching a runtime
exception) that “argument” must have an attribute called “foo”, that
the attribute must be assignable, that its value must support addition
with 1 (Python will look for an “__iadd__” method, falling back to
“__add__”), and so on.
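A rough sketch of such a conservative check, with an invented helper name, might look like this (the class names are purely illustrative):

```python
# Hypothetical sketch: derive conservative preconditions for the
# statement "argument.foo += 1" and check them against a value.
def preconditions_hold(argument):
    # The attribute must exist, or we'd hit an AttributeError.
    if not hasattr(argument, "foo"):
        return False
    value = argument.foo
    # The value must support addition with 1: Python tries
    # __iadd__ first, then falls back to __add__.
    if not (hasattr(value, "__iadd__") or hasattr(value, "__add__")):
        return False
    return True

class Box:
    def __init__(self):
        self.foo = 41

class Empty:
    pass

print(preconditions_hold(Box()))    # True  (foo exists, ints add)
print(preconditions_hold(Empty()))  # False (no attribute "foo")
```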
It struck me that bundling such specifications with the
code would be useful, but it would be a far cry from a verified system.
For example, one can replace an object’s methods at any time, so the
specifications of code snippets like functions would have to be
self-describing only, referring to the specifications of their
arguments, global names, etc., since those can’t be relied upon to be
known before run-time. That way, the full specification can be derived
at runtime (once the arguments are definitely known), but not
before.
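The idea can be sketched roughly as follows; the spec format and helper names here are invented purely for illustration, treating an argument's "specification" as nothing more than its visible attribute names:

```python
# Hypothetical sketch: a function's spec refers to its argument's
# spec, so the full specification only emerges once the argument
# actually arrives at runtime.
def spec_of(value):
    """A crude 'specification': the attribute names we can observe."""
    return set(dir(value))

def derive_full_spec(function_spec, argument):
    """Combine a self-describing function spec with the argument's
    runtime spec; report which preconditions are actually met."""
    required = function_spec["requires_attributes"]
    available = spec_of(argument)
    return {
        "satisfied": required <= available,
        "missing": required - available,
    }

# Spec for a snippet like "argument.foo += 1": it needs "foo".
increment_foo_spec = {"requires_attributes": {"foo"}}

class WithFoo:
    foo = 0

print(derive_full_spec(increment_foo_spec, WithFoo()))
```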
In essence, a function’s specification is built upon
the specification of its arguments, accessible namespace members (eg.
globals) and the semantics of the language. If this is the case then
even having a complete, correct, verified specification of the language
(eg. Python) does not let us make assumptions about functions, modules,
etc. without knowing their arguments. The system’s state is just as
powerful as the language; in fact, the system’s state can be regarded AS
the language! Running the same function in different environments isn’t
just an impure, non-functional way of working, it may as well be running
the same function in two different languages!
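A minimal illustration of that point: the same function source, executed against two different global environments, effectively behaves as if written in two different languages (the environments here are contrived for the example):

```python
# One function body, two environments: "add" means something
# different in each, so "double" means something different too.
source = "def double(x):\n    return add(x, x)\n"

env_numeric = {"add": lambda a, b: a + b}
env_logging = {"add": lambda a, b: f"add({a},{b})"}

exec(source, env_numeric)
exec(source, env_logging)

print(env_numeric["double"](21))   # 42
print(env_logging["double"](21))   # add(21,21)
```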
With this in
mind, it became obvious that the inability to make valid full
specifications is actually a boon: the current state, libraries,
programming language, etc. combine to give a combinatorial explosion of
languages on which to predicate a function’s specification. Thus we
don’t NEED a specification for Python at all, since it would be tainted
by whatever modules we import, and those would be tainted by the running
program’s state. Instead we stick to conservative specifications like
“it must contain foo” rather than “it must be an (int x int) ->
float”, and refine them as we go by writing more of them for each
library, language, etc. we come across. We don’t end up with a formal
specification, but we do get an algorithm to piece one together given at
least one set of loose, conservative specifications. We don’t need to
verify all the way down to the metal; we can verify what we can and
leave the rest for later (ie. treat it as axioms).
What is the algorithm for doing this? It’s
reasoning. Predicate logic. It’s just the thing RDF/OWL is good for! So
what happens if we have an RDF database of code along with
specifications? First we do a massive Web crawl and fill our database
with code. Then we do a massive spec-finding session (automatically, of
course). Then we do a load of cross-referencing and such (ie. if
function foo’s spec says it requires XYZ, and function bar’s spec says
it returns XYZ, then we know that foo(bar) is a reasonable expression to
consider). Essentially we can make a program synthesis engine that,
rather than writing a program from scratch and thus searching the
entire space of possible programs, pieces one together from code people
have bothered to write and upload (which should prune a lot of the
trivial and dead-end branches from the search tree). Also, by
abstracting this to a search, rather than using a deterministic
pattern-matcher (for example), we make the algorithms needed to build
such a system much simpler, at the cost of a large resource overhead.
Worth building, in my opinion, as a more powerful evolution of the
“code snippets” databases that are springing up.
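The cross-referencing step could be sketched as follows; the function names and the flat spec format are invented stand-ins, where a real system would store these as RDF triples and query them with a reasoner:

```python
# A toy spec database mapping function names to what they
# require and return. All names here are made up for illustration.
specs = {
    "fetch_page":  {"requires": "Url",  "returns": "Html"},
    "parse_links": {"requires": "Html", "returns": "UrlList"},
    "word_count":  {"requires": "Html", "returns": "Int"},
}

def composable_pairs(specs):
    """Return (outer, inner) pairs where outer(inner(x)) type-checks:
    inner's return spec matches outer's requirement spec."""
    pairs = []
    for outer, o_spec in specs.items():
        for inner, i_spec in specs.items():
            if i_spec["returns"] == o_spec["requires"]:
                pairs.append((outer, inner))
    return pairs

for outer, inner in composable_pairs(specs):
    print(f"{outer}({inner}(x)) is worth considering")
```

Only functions requiring Html find a partner here, since nothing in this toy database returns a Url; scaling the same matching up over a Web-crawled database is the interesting part.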