Some Python Stats
I realised today that there’s a really easy-to-fix bottleneck in my Python Abstract Syntax Tree decompiler. The decompiler looks for “things”, and for each “thing” it checks against a list of alternatives to see what it might be (and hence how to decompile it). This list is ordered: it checks against the first rule; if that doesn’t match, it checks against the next rule, and so on until it either finds a match or runs out of rules to check against (and throws an error).
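To illustrate, here’s a minimal sketch of that ordered-choice loop; the rule structure and names are made up for this example, they’re not the decompiler’s actual code:

    # Try each alternative in order; the first rule that matches wins.
    # 'rules' is a hypothetical list of (matches, emit) pairs.
    def decompile(node, rules):
        for matches, emit in rules:
            if matches(node):
                return emit(node)
        raise ValueError("no rule matches %r" % (node,))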
This ordering of the checks, specifically ordered choice, is what (along with greedy matching) defines Parsing Expression Grammars as a deterministic counterpart to Backus-Naur Form.
There are two quirks of my Python decompiler, however, that we can use to our advantage to make it faster really easily.
Firstly, if the choice were not ordered, as in Backus-Naur Form, there would be only one instance of nondeterminism, which happens for deletion. Thus, as long as the deletion rule comes first, we can rearrange the other rules however we like.
The second quirk is that I have implemented the rules pretty much in accordance with the classes of AST nodes used in Python’s compiler module. This means that we can predict the chances of each match succeeding by simply parsing existing Python code and counting up the number of times each type of node occurs.
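As a rough illustration of the counting idea, something like the following would work (this sketch uses the modern ast module and collections.Counter; the actual node_counter.py script is built on the older compiler module, so the class names it reports differ slightly):

    import ast
    from collections import Counter

    def count_nodes(paths):
        counts = Counter()
        for path in paths:
            try:
                with open(path) as f:
                    tree = ast.parse(f.read())
            except (SyntaxError, IOError, UnicodeDecodeError):
                continue  # skip files that won't open or parse
            # Tally the class name of every node in the tree
            counts.update(type(node).__name__ for node in ast.walk(tree))
        return counts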
I’ve written a quick script to do this counting, which you can find here.
To use it, you need to provide a list of Python files for it to check. I did this for my whole operating system by running the command 'find / -name "*.py" > ALL_PYTHON_FILES', which will (after some time and “permission denied” warnings) give you a list called “ALL_PYTHON_FILES”. For me this list contains 25,087 files.
Whilst the node_counter.py script runs reasonably quickly, I keep getting a segmentation fault with around 23,700 files to go. For this reason, I’ve also given it an optional second argument, which is the number of files to use from the given list. These are picked at random from throughout the file, to prevent it grabbing everything from the same project and thus biasing the results.
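The random selection itself is simple; a sketch of the idea (with made-up names, not the script’s own) looks like this:

    import random

    def sample_files(list_path, n=None):
        # Read the list produced by the find command above
        with open(list_path) as f:
            files = [line.strip() for line in f if line.strip()]
        if n is None or n >= len(files):
            return files
        # Pick n entries at random from throughout the list
        return random.sample(files, n)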
The script outputs a spreadsheet which contains each type of node and the number of times it was found in the files it checked. I used Gnumeric to work out some percentages and collate the results from running it against 1 file, 10 files, 100 files and 1000 files, and generated the following chart (which you can find the source for here).
The green lines are the most accurate, since they’re from the sample of 1000 files; the others are there to give an indication of the variance, because I couldn’t be bothered to work out the standard deviation.
What this shows us is that we can’t guess what kind of node a given “thing” will be, since none of these is above 0.5 (i.e. 50%), but it does show that some types of node are far more common than others. We can see that Name nodes (which represent the “x” in “x = 5”, for example) are used all over the place, such that around 30% of the nodes are Names. Getattr (which represents the “.foo” in “bar.foo = 8”) is probably the next-most-used node, and the frequency rapidly decreases until we get to some very niche nodes, like Ellipsis, which don’t occur at all. Note that some classes, like “Expr” and “Node”, will always show zero, since they’re superclasses that aren’t directly instantiated by the compiler.
So what does this mean for performance? It means that by putting the node types in order of frequency, we can make the most common cases match after only a few attempts, whilst sacrificing only the performance of rarely used classes.
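A sketch of that reordering step might look like the following, assuming each rule knows which node class it matches (again, the rule representation and the “Delete” name are hypothetical, not the decompiler’s real code):

    def reorder_rules(rules, counts):
        # The deletion rule is the one nondeterministic case, so it
        # must stay first; everything else is sorted by how often its
        # node type turned up in the counts.
        deletion = [r for r in rules if r.node_type == "Delete"]
        others = [r for r in rules if r.node_type != "Delete"]
        others.sort(key=lambda r: counts.get(r.node_type, 0), reverse=True)
        return deletion + others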
Since this script is part of the Python decompiler, it’s Public Domain. If it’s of any use to you, go ahead and take it :)