The Sorry State of the Web
Being a programmer, I have a bunch of scripts to automate everyday tasks. Some of those scripts provide commandline interfaces for various Web sites, which I use both directly and from within other scripts.
Recently, one of these sites has changed its pages such that the page’s markup doesn’t occur in the result of a HTTP GET of the page’s URL. Instead the resulting markup contains Javascript, which inserts new elements into the page’s DOM when executed in a browser (e.g. Firefox).
To try and fix my scripts, I tried to replace their previous use of
wget
, curl
, xidel
, etc. with
PhantomJS, which I’ve used in other projects before.
Unfortunately, even with timeouts, and spoofing the user agent and window size, the elements weren’t being fetched and inserted into the DOM, which I could verify by dumping out screenshots.
Since it works in Firefox, but not in PhantomJS, I tried the same script using SlimerJS, which is backed by the same Gecko rendering engine as Firefox, but that didn’t work either.
Since the only working browser I could find was Firefox, I tried using Selenium, but that seems to be very flaky regarding versions, and I couldn’t get it to run (Firefox would load, but Selenium wouldn’t connect to it).
I made a brief attempt at using SikuliX and xautomation, I tried using Firefox’s “Save Page As” feature, but that only stores the original HTML without the modifications.
Eventually, I cobbled together a script based on the following, which seems to work:
- Use
xvfb-run
to launch a second script, where all the following take place - Run
firefox
in the background, usingtimeout
to kill the process after 30 seconds. Firefox itself is launched in “safe mode” with the desired URL. - Use
xdotool
to send keyboard events which close the “safe mode” information message, open the developer console and wait long enough for the DOM to be updated. - Again, using
xdotool
, type a jQuery-based expression into the console, to extract the desired text. This is fed into a call toprompt
, which causes the text to appear, selected, in an overlay. - One final call to
xdotool
pressesctrl+c
to copy the text to the X clipboard. xsel
is used to paste the text to stdout, andecho
adds a newline.- We kill the background
firefox
process using its process ID, stored from when it was invoked.
Needless to say, this is a very convoluted and fragile approach, and it speaks to the sorry state of the Web as a medium for the exchange of hyperlinked documents between a proliferation of applications on disparate machines.