The Sorry State of the Web

Posted on 2016-05-30 by Chris Warburton

Being a programmer, I have a bunch of scripts to automate everyday tasks. Some of those scripts provide commandline interfaces for various Web sites, which I use both directly and from within other scripts.

Recently, one of these sites has changed its pages such that the page’s markup doesn’t occur in the result of a HTTP GET of the page’s URL. Instead the resulting markup contains Javascript, which inserts new elements into the page’s DOM when executed in a browser (e.g. Firefox).

To try and fix my scripts, I tried to replace their previous use of wget, curl, xidel, etc. with PhantomJS, which I’ve used in other projects before.

Unfortunately, even with timeouts, and spoofing the user agent and window size, the elements weren’t being fetched and inserted into the DOM, which I could verify by dumping out screenshots.

Since it works in Firefox, but not in PhantomJS, I tried the same script using SlimerJS, which is backed by the same Gecko rendering engine as Firefox, but that didn’t work either.

Since the only working browser I could find was Firefox, I tried using Selenium, but that seems to be very flaky regarding versions, and I couldn’t get it to run (Firefox would load, but Selenium wouldn’t connect to it).

I made a brief attempt at using SikuliX and xautomation, I tried using Firefox’s “Save Page As” feature, but that only stores the original HTML without the modifications.

Eventually, I cobbled together a script based on the following, which seems to work:

Use xvfb-run to launch a second script, where all the following take place
Run firefox in the background, using timeout to kill the process after 30 seconds. Firefox itself is launched in “safe mode” with the desired URL.
Use xdotool to send keyboard events which close the “safe mode” information message, open the developer console and wait long enough for the DOM to be updated.
Again, using xdotool, type a jQuery-based expression into the console, to extract the desired text. This is fed into a call to prompt, which causes the text to appear, selected, in an overlay.
One final call to xdotool presses ctrl+c to copy the text to the X clipboard.
xsel is used to paste the text to stdout, and echo adds a newline.
We kill the background firefox process using its process ID, stored from when it was invoked.

Needless to say, this is a very convoluted and fragile approach, and it speaks to the sorry state of the Web as a medium for the exchange of hyperlinked documents between a proliferation of applications on disparate machines.