Some Python Stats

Posted on by Chris Warburton

I realised today that there's a really easy-to-fix bottleneck in my Python Abstract Syntax Tree decompiler. The decompiler looks for "things", and for each "thing" it checks against a list of alternatives to see what it might be (and hence how to decompile it). This list is ordered, so it checks against the first rule, if it doesn't match then it checks against the next rule and so on until it either finds a match, or it runs out of rules to check against (and throws an error).

This ordering of the checks, specifically ordered choice, is what defines Parsing Expression Grammars as a deterministic counterpart to Backus-Naur Form, along with greedy matching.

There are two quirks of my Python decompiler, however, that can be used to our advantage to make it faster really easily.

Firstly, if the choice were not ordered, as in Backus-Naur Form, there is only one instance of nondeterminism, which happens for deletion. Thus, as long as deletion comes first, we can actually rearrange the other rules however we like.

The second quirk is that I have implemented the rules pretty much in accordance with the classes of AST nodes used in Python's compiler module. This means that we can predict the chances of each match succeeding by simply parsing existing Python code and counting up the number of times each node occurs.

I've written a quick script to do this counting, which you can find here. To use it, you need to provide a list of Python files for it to check. I did this for my whole operating system by running the command 'find / -name "*.py" > ALL_PYTHON_FILES' which will (after some time and "permission denied" warnings) give you a list called "ALL_PYTHON_FILES". For me this list contains 25,087 files.

Whilst the node_counter.py script runs reasonably quickly, I keep getting a segmentation fault with around 23,700 files to go. For this reason, I've also given it an optional second argument, which is the number of files to use from the given list. These are picked at random from throughout the file, to prevent it grabbing everything from the same project and thus biasing the results.

The script outputs a spreadsheet which contains each type of node and the number of times it was found in the files it checked. I used Gnumeric to work out some percentages and collate the results from running it against 1 file, 10 files, 100 files and 1000 files, and generated the following (which you can find the source for here)

The green lines are the most accurate, since they're from the sample of 1000 files; the others are there to give an indication of the variance, because I couldn't be bothered to work out the standard deviation.

What this shows us is that we can't guess what kind of node a given "thing" will be, since none of these is above 0.5 (ie. 50%), but what it does show is that some types of node are far more common than others. We can see that Name nodes (which represent the "x" in "x = 5", for example) are used all over the place, such that around 30% of the nodes are Names. Getattr is probably the next-most-used node (which represents the ".foo" in "bar.foo = 8") and the frequency rapidly decreases until we get some very niche nodes like Ellipsis which don't occur at all. Note that some classes, like "Expr" and "Node" will always show zero since they're superclasses that aren't directly instantiated by the compiler.

So what does this mean for performance increases? It means that by putting the node types in order of frequency, we can make the most common cases match after only a few attempts, whilst sacrificing the performance of rarely used classes.

Since this script is part of the Python decompiler, it's Public Domain. If it's of any use to you, go ahead and take it :)