jonrob.net


Working with XPath

Recently I've been rewriting my web scrapers after a senior dev advised me that using regex to parse HTML is a terrible way to do it, and that I should really be using XPath, the language made for querying XML. I've been busy with this and, using what I've learnt, have written some C subroutines to easily parse web pages, since it takes an awful lot of code just to download a web page to a string and run an XPath query on it. These subroutines are now working in my arbitrage betting software.

To get the XPath queries, the easiest way I've found is to right-click the bit of the web page I'm interested in grabbing in Chrome or Firefox, then choose "Inspect element". Just below the source code in the new window is the node we selected, which we can then turn into a query.

I've also been working with the Perl module HTML::TreeBuilder::XPath for parsing web pages. In contrast to libxml, this only requires 5 lines of code to return the results from a web page to an array. I've increased the storage on the server this website is hosted on and written a script using this module to automatically download videos from TempleOS.org and upload them here, because Terry regularly deletes the videos and the YouTube re-uploaders have all stopped.
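To give a feel for the "query in, array out" pattern, here is a minimal sketch in Python using only the standard library. It is not the Perl script itself: the page markup, the .mp4 filter and the function name video_links are all assumptions made up for illustration. Note that xml.etree.ElementTree only accepts well-formed markup and supports a limited subset of XPath, so real-world HTML usually needs a more forgiving parser such as libxml's HTML parser or HTML::TreeBuilder.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a downloaded page; the real
# TempleOS.org markup is not reproduced here.
PAGE = """<html><body>
  <a href="/videos/one.mp4">One</a>
  <a href="/about.html">About</a>
  <a href="/videos/two.mp4">Two</a>
</body></html>"""


def video_links(markup):
    """Return the href of every <a> element that points at an .mp4 file."""
    root = ET.fromstring(markup)
    # .//a[@href] selects every <a> element that has an href attribute.
    links = [a.get("href") for a in root.findall(".//a[@href]")]
    # The .mp4 filter is an assumption about what counts as a video link.
    return [href for href in links if href.endswith(".mp4")]


print(video_links(PAGE))
```

The result is an ordinary list of strings in document order, which a download loop could then walk through.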

Another small script, which I originally wrote in C but then ported to Perl, scrapes the website allkeyshop.com according to a config file and sends an email for any games selling below a set price threshold.
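The threshold check could look roughly like the Python sketch below. Everything specific in it is an assumption: the markup is invented (the real allkeyshop.com page structure is not shown in this post), the THRESHOLDS dictionary stands in for the config file, and the email step is left out so the example stays self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup, invented for this example.
LISTING = """<html><body>
  <div class="offer"><span class="name">Game A</span><span class="price">12.50</span></div>
  <div class="offer"><span class="name">Game B</span><span class="price">31.99</span></div>
</body></html>"""

# Stand-in for the config file: game name -> price threshold.
THRESHOLDS = {"Game A": 15.00, "Game B": 20.00}


def offers_below_threshold(markup, thresholds):
    """Return (name, price) pairs for offers priced below their threshold."""
    root = ET.fromstring(markup)
    hits = []
    for offer in root.findall(".//div[@class='offer']"):
        name = offer.findtext("span[@class='name']")
        price = float(offer.findtext("span[@class='price']"))
        limit = thresholds.get(name)
        # Games that are not listed in the config are skipped.
        if limit is not None and price < limit:
            hits.append((name, price))
    return hits


print(offers_below_threshold(LISTING, THRESHOLDS))
```

Each pair returned here is one game that would go into the notification email.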

I imagine these examples will help anyone getting started with XPath. Let's just hope Microsoft doesn't make it illegal for us to scrape publicly available content.

--note--
The TempleOS script has now been turned off, as more complete archives have since been compiled, such as the one on archive.org.
The highlighted section can be grabbed using XPath with /html/body/div/h1
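
As a small illustration of that query, here is a Python sketch using the standard library's xml.etree.ElementTree. The page markup and the function name first_heading are assumptions. ElementTree evaluates paths relative to the root element, so the absolute query /html/body/div/h1 is written as ./body/div/h1 here, whereas libxml or HTML::TreeBuilder::XPath would take the absolute form as-is.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page in question.
PAGE = """<html>
  <head><title>Example</title></head>
  <body>
    <div>
      <h1>Working with XPath</h1>
      <p>Some text.</p>
    </div>
  </body>
</html>"""


def first_heading(markup):
    """Return the text of the first /html/body/div/h1 node, or None."""
    root = ET.fromstring(markup)  # root is the <html> element itself
    # Relative form of /html/body/div/h1, since we start at <html>.
    node = root.find("./body/div/h1")
    return node.text if node is not None else None


print(first_heading(PAGE))  # prints: Working with XPath
```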