Last updated: 2016-04-23 01:38:07 +0100

Upstream URL: git clone



Contents of follows

Generic feature extraction from XML documents.

Machine learning problems often need every input to have the same, fixed size. When an input is a recursive structure, such as a tree, we can fold over the structure to obtain a single fixed-size value.

This is a very basic implementation of this idea: we take arbitrary XML documents, which are tree-structured, and assign each element a value based on the MD5 hash of its name and attributes concatenated together. We then fold sub-trees together using bitwise circular convolution to obtain a single value for the whole tree.
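The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the repository's own code. It assumes that "bitwise circular convolution" means circular convolution over 128-bit vectors with AND as the product and XOR as the sum, and that attributes are sorted and written as `key=value` pairs before hashing. The names `element_value`, `rotl`, `circular_convolve`, `fold_tree` and `extract_features` are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

BITS = 128                 # an MD5 digest is 128 bits wide
MASK = (1 << BITS) - 1


def element_value(elem):
    """MD5 of the tag name followed by the attributes, as a 128-bit integer."""
    # Assumption: attributes are sorted and written as key=value pairs.
    attrs = "".join(f"{k}={v}" for k, v in sorted(elem.attrib.items()))
    digest = hashlib.md5((elem.tag + attrs).encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def rotl(x, k):
    """Rotate a 128-bit value left by k bits."""
    k %= BITS
    return ((x << k) | (x >> (BITS - k))) & MASK


def circular_convolve(a, b):
    """Circular convolution of two bit vectors (AND as product, XOR as sum)."""
    out = 0
    for j in range(BITS):
        if (a >> j) & 1:
            # Bit j of a contributes a copy of b rotated left by j places.
            out ^= rotl(b, j)
    return out


def fold_tree(elem):
    """Combine an element's own value with the folded values of its children."""
    value = element_value(elem)
    for child in elem:
        value = circular_convolve(value, fold_tree(child))
    return value


def extract_features(xml_text):
    """Return the 128 bits of the folded tree, least significant bit first."""
    folded = fold_tree(ET.fromstring(xml_text))
    return [(folded >> i) & 1 for i in range(BITS)]
```

Calling `extract_features('<a x="1"><b/><c/></a>')` returns a list of 128 bits regardless of how large the document is. Note that the textbook operation used in this sketch is commutative, so on its own it does not record the order of siblings; the operator used in the repository may differ.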

Circular convolution is a linear operation, so it can't preserve as much information as, for example, auto-encoding. However, it is reasonably fast, requires no learning and is largely non-commutative/associative, so sub-trees should be distinguishable to a certain extent.