The Sorry State of the Web

Posted by Chris Warburton

Being a programmer, I have a bunch of scripts to automate everyday tasks. Some of those scripts provide command-line interfaces to various Web sites, which I use both directly and from within other scripts.
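
For example (this is a made-up site and query, not one of my actual scripts), such a wrapper can be little more than a fetch and an XPath extraction:

    #!/usr/bin/env bash
    # Hypothetical wrapper: list the titles on some site's search page
    xidel -s "https://example.com/search?q=$1" -e '//h2[@class="title"]/text()'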

Recently, one of these sites has changed its pages such that the page’s markup doesn’t appear in the result of an HTTP GET of the page’s URL. Instead, the response contains Javascript which, when executed in a browser (e.g. Firefox), inserts the missing elements into the page’s DOM.
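
For illustration (the URL and markup here are invented, not the actual site), a plain GET now returns something along these lines, with none of the interesting elements present:

    $ curl -s 'https://example.com/listing'
    <html>
      <body>
        <div id="app"></div>
        <script src="/assets/app.js"></script>
      </body>
    </html>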

To fix my scripts, I tried replacing their previous use of wget, curl, xidel, etc. with PhantomJS, which I’ve used in other projects before.

Unfortunately, even with timeouts, and with the user agent and window size spoofed, the elements weren’t being fetched and inserted into the DOM, which I verified by dumping out screenshots.
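
The attempt amounted to something like the following (the URL, user agent string and delay are placeholders): a small control script that spoofs a desktop browser, waits for the page’s Javascript, then dumps a screenshot alongside whatever markup PhantomJS ends up with:

    #!/usr/bin/env bash
    # Sketch of the PhantomJS attempt; URL, user agent and delay are placeholders
    cat > scrape.js <<'EOF'
    var page = require('webpage').create();

    // Pretend to be a desktop Firefox with a sensible window size
    page.settings.userAgent =
      'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0';
    page.viewportSize = { width: 1366, height: 768 };

    page.open('https://example.com/listing', function(status) {
      // Give the page's Javascript time to insert its elements
      window.setTimeout(function() {
        page.render('screenshot.png');  // to check what actually rendered
        console.log(page.content);      // the DOM as PhantomJS sees it
        phantom.exit();
      }, 10000);
    });
    EOF

    phantomjs scrape.js > page.html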

Since the page works in Firefox but not in PhantomJS, I tried the same script with SlimerJS, which is backed by the same Gecko rendering engine as Firefox, but that didn’t work either.
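
SlimerJS aims to be script-compatible with PhantomJS, so swapping it in is mostly a matter of which binary runs the control script; it does expect an X display, which xvfb-run can provide on a headless machine:

    # Same control script as before, rendered by Gecko instead of WebKit;
    # xvfb-run supplies the X display that SlimerJS expects
    xvfb-run -a slimerjs scrape.js > page.html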

Since the only working browser I could find was Firefox, I tried driving it with Selenium, but that seems very sensitive to version mismatches, and I couldn’t get it to run (Firefox would load, but Selenium wouldn’t connect to it).

I made brief attempts at using SikuliX and xautomation, and I tried Firefox’s “Save Page As” feature, but that only stores the original HTML, without the Javascript’s modifications.

Eventually, I cobbled together a script based on the following, which seems to work:
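
Something along these lines: run Firefox against a throwaway X server, give the page’s Javascript time to run, then pull the rendered content back out with xdotool and xsel (the URL, delays and output handling are only indicative):

    #!/usr/bin/env bash

    URL='https://example.com/listing'   # placeholder for the real page

    # Run everything against a throwaway X server, so no desktop is needed
    Xvfb :99 -screen 0 1366x768x24 &
    XVFB_PID=$!
    export DISPLAY=:99
    sleep 2   # let the X server come up

    # Point Firefox at the page and give its Javascript time to run
    firefox "$URL" &
    FIREFOX_PID=$!
    trap 'kill "$FIREFOX_PID" "$XVFB_PID"' EXIT
    sleep 30

    # Focus the Firefox window, select the whole rendered page and copy it,
    # then read the clipboard back out with xsel
    xdotool search --sync --onlyvisible --class firefox windowactivate
    xdotool key --clearmodifiers ctrl+a ctrl+c
    sleep 2
    xsel --clipboard --output > rendered.txt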

Needless to say, this is a very convoluted and fragile approach, and it speaks to the sorry state of the Web as a medium for the exchange of hyperlinked documents between a proliferation of applications on disparate machines.