The Web as an Oracle

Posted on 2010-05-09 by Chris Warburton

I’ve tried a few times to write up my programming thoughts to this blog, each time from a different perspective and getting rather off-topic most of the time. Those posts will probably appear soon, since I’ve invested the time to write them I may as well publish them, but a thought struck me about how to clarify my meaning in a concise way.

My ideas have been revolving around formal specification and verification of programs. Since I approached the subject from a Python perspective it meant living without the luxury of an enforced type system, definite syntax and semantics, immutability of variables, etc. that one gets with a functional language like ML or Haskell. Since anything in Python can be redefined at any point, for example swapping an object’s class via my_object.__class__ = Foo, I started by considering a system that broke down code into black-box style atomic operations (assuming a single thread, because I’m against multithreading as a usable model for concurrent processing).

For example, we could look at a function and work out conservative bounds on the preconditions and post conditions. For example if we have a function which runs “argument.foo += 1” then we know that the argument “argument” must contain something called “foo”, unless we’re catching a runtime exception, it must be mutable, have a runnable method “add” which accepts the argument “1” and so on.

It struck me that bundling such specifications with the code would be useful, but it would be a far cry from a verified system. For example, one can replace an object’s methods at any time, so the specifications of code snippets like functions would have to be self-describing only and make reference to the specifications of their arguments and global names, etc. since they can’t be relied upon to be known before run-time. That way, at runtime the full specification can be derived (once the arguments are definitely known), but not before.

In essence, a function’s specification is built upon the specification of its arguments, accessible namespace members (eg. globals) and the semantics of the language. If this is the case then even having a complete, correct, verified specification of the language (eg. Python) does not let us make assumptions about functions, modules, etc. without knowing their arguments. The system’s state is just as powerful as the language; in fact, the system’s state can be regarded AS the language! Running the same function in different environments isn’t just an impure, non-functional way of working, it may as well be running the same function in two different languages!

With this in mind it became obvious that the lack of ability to make valid specifications is a boon: current state, libraries, programming language, etc. can be combined to give a combinatorial explosion of languages to predicate a function’s specification on. Thus we don’t NEED a specification for Python at all, since it would be tainted by whatever modules we import, and those would be tainted by the running program’s state, so we just stick to our conservative specifications like “it must contain foo” rather than “it must be an (int x int) -> float” and we can refine these as we go by making more and more of them for each library, language, etc. we come across. Thus we don’t get a formal specification, but we do get an algorithm to piece together a specification given at least one set of loose, conservative specifications. We don’t need to verify all of the way down to the metal; we can verify what we can and leave the rest for later (ie. use them as axioms).

What is the algorithm for doing this? It’s reasoning. Predicate logic. It’s just the thing RDF/OWL is good for! So what happens if we have an RDF database of code along with specifications? First of all we do a massive Web crawl and fill our database with code. Then we do a massive spec-finding session (automatically of course). Then we do a load of cross-referencing and such (ie. if function foo’s spec says it requires XYZ and function bar’s spec says it returns XYZ then we know that foo(bar) is a reasonable expression to consider). Essentially we can make a program synthesis engine which, rather than writing a program from scratch and thus searching the entire space of possible programs, we piece one together from stuff people have bothered to write and upload (which should remove a lot of the trivial and dead-end branches from the search tree). Also, by abstracting to a search, rather than using a deterministic pattern-matcher (for example) we make the algorithms to build such a system much simpler, at the cost of large resource overhead. Worth building in my opinion, as a more powerful evolution of the “code snippets” databases that are springing up.