When my friend wanted to buy a house in the south of England, she began by researching average house prices and how they varied across different towns, in the hope of finding somewhere affordable. She looked at hundreds of listings online, visited twenty-odd of them in person, and eventually found something that worked for her.
The interface my friend used for her online research looked like this (screenshot from rightmove.co.uk):
I noticed three interface issues that affected the "market research" user story:
I wondered whether something better could be created, had a go at it, and made this: a map of house sale prices from 1995 to 2020 (it's interactive, by the way):
I spoke to an expert first. I shared a coworking space with a property developer, someone who buys houses and makes them nicer so they can sell them for more money, and he was my guy. He gave me a tour of the four different products he used for market research, their features and pricing tiers, along with his product-specific likes, dislikes, and productivity tricks.
Together they solved the problems I'd identified and more. However, their pricing was exorbitant for a non-professional like my friend: from £500/year to £10,000/year depending on what you wanted. All of these companies essentially did the same thing: they licensed data from the government and made it easy to access. I figured I could get hold of that same data, put it on a map, and make it free for everyone.
Of course, it wasn't that simple. The government has all that data locked away in a vault, and they do give keys out, but only to corporations with fat wallets. To build the map I made using government data, you'd need licenses for the National Polygon Service (£24,000 per year) and AddressBase. AddressBase is £16,245 for internal business use, but costs more if you want to build a product with it. There's no price listed for that license, but it includes the Royal Mail Postcode Address File, which when sold separately costs £138,500 ($187,500). There's no way for an individual tinkerer to access this data.
Undeterred, I dug around and found two free-to-use datasets that'd work:
And then I:
The HMRC Price Paid dataset (download) was pretty straightforward: it's a 4GB CSV with 26 million rows like this:
| Price | Date | Postcode | Address |
| --- | --- | --- | --- |
| 250000 | 2020-02-26 00:00 | LE8 8BX | 4 SHAKESPEARE ROAD |
| 240000 | 2019-01-09 00:00 | LE8 8BX | 5 SHAKESPEARE ROAD |
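Reading rows like these boils down to splitting each line into fields and keeping the four columns the map needs. A minimal std-only sketch, assuming the column order shown above (the real file has more columns and, as downloaded, quotes every field):

```rust
/// One row of the HMRC Price Paid CSV, keeping only the fields
/// the map needs (column order assumed from the sample above).
#[derive(Debug)]
struct Sale {
    price: u64,
    date: String,
    postcode: String,
    address: String,
}

/// Parse a single fully-quoted CSV record: strip the outer quotes,
/// then split on the `","` separators between fields.
fn parse_sale(line: &str) -> Option<Sale> {
    let fields: Vec<&str> = line.trim_matches('"').split("\",\"").collect();
    Some(Sale {
        price: fields.get(0)?.parse().ok()?,
        date: fields.get(1)?.to_string(),
        postcode: fields.get(2)?.to_string(),
        address: fields.get(3)?.to_string(),
    })
}
```

In a real run you'd stream the 4GB file line by line rather than loading it, but the per-row logic is the same.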
The OpenStreetMap dataset (download) came as a 25GB XML file that looked like this:
<osm>
  <way id="38078393">
    <tag k="addr:housenumber" v="4 Shakespeare Road"/>
    <nd ref="447864834"/>
    <nd ref="447864835"/>
    <nd ref="447864836"/>
    <nd ref="447864837"/>
    <nd ref="447864834"/>
  </way>
  <node id="447864834" lat="52.53038542" lon="-1.0436888"/>
  <node id="447864835" lat="52.53037399" lon="-1.0436391"/>
  <node id="447864836" lat="52.53037938" lon="-1.0433103"/>
  <node id="447864837" lat="52.5303908" lon="-1.0433600"/>
</osm>
It's unclear why, but OpenStreetMap doesn't embed a building outline's coordinates in its
<way /> element; the way just references node IDs, and the coordinates live in separate <node /> elements. Normally I'd just go ahead and embed them without thinking much, but the 25GB file size meant that wouldn't be so straightforward. So I began by splitting the XML out into two tables, like so:
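The splitting itself can be sketched with plain string scanning. This is a simplified, std-only illustration, not the real program: it takes the whole extract as one string, whereas the actual 25GB file would have to be streamed chunk by chunk:

```rust
use std::collections::HashMap;

/// Pull the value of `key="..."` out of a single XML tag.
fn attr<'a>(tag: &'a str, key: &str) -> Option<&'a str> {
    let pat = format!("{}=\"", key);
    let start = tag.find(&pat)? + pat.len();
    let rest = &tag[start..];
    Some(&rest[..rest.find('"')?])
}

/// Split an OSM extract into the two tables described above:
/// way address -> node ids, and node id -> (lat, lon).
/// Ignores everything except addr:housenumber tags, nd refs, and nodes.
fn split_osm(xml: &str) -> (HashMap<String, Vec<u64>>, HashMap<u64, (f64, f64)>) {
    let mut ways = HashMap::new();
    let mut nodes = HashMap::new();
    let mut current_addr: Option<String> = None;
    let mut current_refs: Vec<u64> = Vec::new();
    for chunk in xml.split('<').skip(1) {
        let tag = chunk.trim();
        if tag.starts_with("way ") {
            // A new way: reset the state we're accumulating.
            current_addr = None;
            current_refs.clear();
        } else if tag.starts_with("tag ") && attr(tag, "k") == Some("addr:housenumber") {
            current_addr = attr(tag, "v").map(str::to_string);
        } else if tag.starts_with("nd ") {
            if let Some(r) = attr(tag, "ref").and_then(|r| r.parse().ok()) {
                current_refs.push(r);
            }
        } else if tag.starts_with("/way") {
            // Way closed: keep it only if it had a house number.
            if let Some(addr) = current_addr.take() {
                ways.insert(addr, current_refs.clone());
            }
        } else if tag.starts_with("node ") {
            if let (Some(id), Some(lat), Some(lon)) = (
                attr(tag, "id").and_then(|v| v.parse().ok()),
                attr(tag, "lat").and_then(|v| v.parse().ok()),
                attr(tag, "lon").and_then(|v| v.parse().ok()),
            ) {
                nodes.insert(id, (lat, lon));
            }
        }
    }
    (ways, nodes)
}
```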
| Address | Node IDs |
| --- | --- |
| 4 Shakespeare Road | 447864834, 447864835, 447864836, 447864837 |
So this gives us a house number to connect the OpenStreetMap and HMRC Price Paid datasets, but OpenStreetMap lacks postcodes, and a house number alone isn't enough to make a match. After all, most towns have a "High Street", and there are 11 streets called "High Street" in London alone. Postcodes, in case you don't know, are areas that encompass roughly 15 neighbouring houses. So to uniquely identify a house in the UK you need both the house number and the postcode, as in "4 Shakespeare Road, LE8 8BX". There are exceptions, but this works for >99% of houses.
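In code, the composite identifier is just house number plus postcode. One wrinkle worth normalising away is casing: the two datasets spell addresses differently ("4 Shakespeare Road" vs "4 SHAKESPEARE ROAD"). A hypothetical helper, not the real program's code:

```rust
/// Build the composite key that uniquely identifies a UK house:
/// house number + postcode, uppercased and trimmed so the
/// OpenStreetMap and HMRC spellings agree.
fn join_key(house: &str, postcode: &str) -> String {
    format!(
        "{}, {}",
        house.trim().to_uppercase(),
        postcode.trim().to_uppercase()
    )
}
```

Both datasets would be keyed this way before joining, e.g. `join_key("4 Shakespeare Road", "LE8 8BX")` and `join_key("4 SHAKESPEARE ROAD", "LE8 8BX")` produce the same key.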
I used proximity to associate a post code with each OpenStreetMap building. I downloaded a list of post codes, picked out the five closest post codes to each building, and figured exactly one of them would match a HMRC Price Paid row (where the house number already matched).
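The "five closest postcodes" step can be sketched as a nearest-neighbour scan. This is a hypothetical, simplified version: it uses raw Euclidean distance on lat/lon (a degree of longitude is shorter than a degree of latitude in the UK, but at postcode scale the ranking barely changes), and a real run would use a spatial index rather than scanning every postcode for every building:

```rust
/// Given a building's position and the national postcode list as
/// (postcode, lat, lon) tuples, return the five nearest postcodes.
fn nearest_postcodes(lat: f64, lon: f64, postcodes: &[(String, f64, f64)]) -> Vec<String> {
    let mut by_dist: Vec<&(String, f64, f64)> = postcodes.iter().collect();
    // Sort by squared distance; no need for the sqrt when only ranking.
    by_dist.sort_by(|a, b| {
        let da = (a.1 - lat).powi(2) + (a.2 - lon).powi(2);
        let db = (b.1 - lat).powi(2) + (b.2 - lon).powi(2);
        da.partial_cmp(&db).unwrap()
    });
    by_dist.iter().take(5).map(|p| p.0.clone()).collect()
}
```

Each candidate postcode is then checked against the HMRC rows whose house number already matched; exactly one of the five should line up.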
With these four tables I had all the data needed to stitch things together. With just a few hundred thousand rows this would've been easy to do in, say, a Jupyter notebook with Pandas. But with a few tens of millions of rows things got a bit more complicated. If faced with this problem today I'd solve it with a data warehouse (Google BigQuery or Amazon Redshift): I'd pop the tables in, write my query, and press "run". The data warehouse would split that query across thousands of servers and be done in 30 seconds. The query would:
At the time, however, I wanted an excuse to learn a new and cool programming language: Rust. Rust had three things that made it well-suited for this project: it's fast, it has great support for multi-threading, and it uses a predictable amount of memory. That predictability meant I could design an algorithm, count up how much memory it'd need, and trust everything would fit on a single server. Rust would stick to that number, whereas other languages might have used two or five times the memory they really needed, which would've made predicting how big my server had to be tricky.
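The memory count is just arithmetic once each row has a fixed size. A back-of-the-envelope sketch with a hypothetical record layout (the field set and the 8-point outline cap are assumptions for illustration, not the real program's types):

```rust
use std::mem::size_of;

/// A hypothetical fixed-size record per matched sale.
#[allow(dead_code)]
struct Record {
    price: u64,
    lat: f64,
    lon: f64,
    node_ids: [u64; 8], // outline capped at 8 points for the estimate
}

/// Total memory is roughly rows x bytes-per-record, and Rust
/// won't silently balloon past it.
fn estimated_bytes(rows: usize) -> usize {
    rows * size_of::<Record>()
}
```

At 88 bytes per record, all 26 million sales come to about 2.3GB, comfortably inside one big server.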
So I spent a few days learning Rust and writing a program to join everything. I rented a big server with 64 CPUs (multi-threading!), ran the program, and shut the server down when it finished 40 minutes later. OpenStreetMap is far from comprehensive, but the program still matched 1,400,000 of the 26,000,000 house sales from HMRC to building outlines.
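The multi-threaded fan-out has a simple shape: split the sales into disjoint chunks, give each thread a chunk plus read-only access to the building table, and sum the results. A std-only sketch of that shape (a hypothetical `matches` predicate stands in for the real lookup):

```rust
use std::thread;

/// Count how many items satisfy `matches`, spread across 4 threads.
/// Each thread scans its own chunk; the shared predicate is read-only.
fn parallel_match_count(items: &[u64], matches: impl Fn(u64) -> bool + Sync) -> usize {
    let n_threads = 4;
    let chunk_len = (items.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk_len.max(1))
            .map(|chunk| s.spawn(|| chunk.iter().filter(|&&x| matches(x)).count()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```

`std::thread::scope` lets the workers borrow the slice directly, with no `Arc` or copying, which is part of why Rust's memory use stays predictable here.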
I used a bit of machine learning for this, specifically a random forest. Scroll down for the Jupyter notebook.
This was trivial.
To make the map I extended Leaflet, a piece of software by the same people who created OpenStreetMap. To improve the map, I'd begin by showing details about the relevant property on mouseover.