Recently I’ve been rewriting my web scrapers after a senior dev advised me that using regex to parse HTML is a terrible way to do it, and that I should really be using XPath, the query language designed for selecting nodes in XML documents. It takes an awful lot of code just to download a web page to a string and run an XPath query on it, so using what I’ve learnt I’ve written some C subroutines that make parsing web pages easy. These subroutines are now working in my arbitrage betting software.
The easiest way I’ve found to get an XPath query is to right-click the part of the web page I’m interested in grabbing in Chrome or Firefox and choose “Inspect element”. Just below the source code in the panel that opens, the browser shows the path to the node we selected, which we can then turn into a query.
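As a rough illustration of what running such a query looks like, here is a minimal sketch in Python rather than the C subroutines mentioned above. It uses only the standard library’s xml.etree.ElementTree, which supports a limited subset of XPath and needs well-formed markup, so real-world HTML would usually need a more forgiving parser. The page fragment, the id and class names, and the xpath_texts helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a downloaded page.
PAGE = """<html><body>
<div id="odds">
  <span class="team">Home</span><span class="price">2.10</span>
  <span class="team">Away</span><span class="price">3.40</span>
</div>
</body></html>"""


def xpath_texts(markup, query):
    """Return the text of every node matched by the query."""
    root = ET.fromstring(markup)
    return [node.text for node in root.findall(query)]


# The kind of query you might build from the node path shown by the
# browser's inspector: every price span inside the div with id "odds".
prices = xpath_texts(PAGE, ".//div[@id='odds']/span[@class='price']")
print(prices)
```

The query is the only part that changes from site to site, which is why getting it from the browser’s inspector saves so much time.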
I’ve also been working with the Perl module HTML::TreeBuilder::XPath for parsing web pages. In contrast to libxml, it only requires 5 lines of code to return the results from a web page to an array. I’ve increased the storage on the server this website is hosted on and written a script using this module to automatically download videos from TempleOS.org and upload them here, because Terry regularly deletes the videos and the YouTube re-uploaders have all stopped.
Another small script, which I originally wrote in C but then ported to Perl, scrapes the website allkeyshop.com according to a config file and sends an email for any games that are selling below a set price threshold.
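The general idea of that price alert can be sketched as follows. This is not the actual Perl script: it is a small Python stand-in using only the standard library, and the config layout, the listing markup, the game titles, and the find_deals and build_alert helpers are all assumptions made for the example. The real site’s markup will differ, and actually sending the message (for example with smtplib) is left out.

```python
import configparser
import xml.etree.ElementTree as ET
from email.message import EmailMessage

# Hypothetical config: one section per game, each with a price threshold.
CONFIG = """
[Example Game A]
max_price = 10.00

[Example Game B]
max_price = 5.00
"""

# Hypothetical, well-formed listing standing in for the scraped page.
LISTING = """<ul>
<li><span class="title">Example Game A</span><span class="price">8.49</span></li>
<li><span class="title">Example Game B</span><span class="price">7.99</span></li>
</ul>"""


def find_deals(config_text, markup):
    """Return (title, price) pairs priced below their configured threshold."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    deals = []
    for item in ET.fromstring(markup).findall(".//li"):
        title = item.findtext("span[@class='title']")
        price = float(item.findtext("span[@class='price']"))
        # Only games listed in the config are checked.
        if config.has_section(title) and price < config.getfloat(title, "max_price"):
            deals.append((title, price))
    return deals


def build_alert(deals, recipient):
    """Build (but do not send) an email summarising the deals."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"{len(deals)} game(s) below threshold"
    msg.set_content("\n".join(f"{title}: {price:.2f}" for title, price in deals))
    return msg


deals = find_deals(CONFIG, LISTING)
print(deals)
```

Keeping the thresholds in a config file means new games can be added without touching the scraping code.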
I imagine these examples will help anyone getting started with XPath. Let’s just hope Microsoft doesn’t make it illegal for us to scrape publicly available content.
The TempleOS script was only put in place on 2017 August 16th and was turned off on 2017 October 27th. More complete archives have since been compiled on archive.org.