Ashton Six's pricepaid, go back to the portfolio

When my friend wanted to buy a house in South England she began by researching average house prices and how they varied across different towns, in the hope of finding someplace affordable. She looked at 100s of listings online and visited 20-ish of them in-person, and eventually found something that worked for her.

The interface my friend used for her online research looked like this (screenshot from rightmove.co.uk):

rightmove.co.uk

I noticed three interface issues that affected the "market research" user story:

  1. Users cannot get a visual overview of prices in an area, as they're required to click into properties one-by-one
  2. The map only shows what's currently active on rightmove, which reflects a small subset of house sales in the UK
  3. Users can only see the price listed, but it's the price paid that's important in a negotiation

I wondered whether something better could be created, had a go at it and created this, a map of house sale prices from 1995 to 2020 (this is interactive btw):

fullscreen

How I did it

I spoke to an expert first. I shared a coworking space with a property developer, someone who buys houses and makes them nicer so they can sell them for more money, and he was my guy. He gave me a tour of the four different products he used for market research, their features and pricing tiers, along with his product-specific likes, dislikes, and productivity tricks.

Together they solved the problems I'd identified and more. However, their pricing was exorbitant for a non-professional like my friend: from £500/year to £10,000/year depending on what you wanted. All of these companies essentially did the same thing: they licensed data from the government and made it easy to access. I figured I could get hold of that same data, put it on a map, and make it free for everyone.

Of course, it wasn't that simple. The government has all that data locked away in a vault, and they do give keys out, but only to corporations with fat wallets. To make the map I did with government data you would need licenses for the National Polygon Service (£24,000 per year) and AddressBase. AddressBase is £16,245 for internal business use, but costs more if you want to build a product with it. There's no price listed for that license, but it includes the Royal Mail Postcode Address File and when sold separately that costs £138,500 ($187,500). There's no way for an individual tinkerer to access this data.

Undettered, I dug around and found two free-to-use datasets that'd work:

And then I:

  1. Stitched the datasets together
  2. Adjusted the prices paid for inflation
  3. Split the new dataset into map tiles
  4. Made an interactive map that could load and display the data

Stitching the datasets together

The HMRC Price Paid dataset (download) was pretty straightforward, it's just a 4GB CSV with 26 million rows like this:

price_paidtransaction_datepostcodehousenumber
2500002020-02-26 00:00LE8 8BX4 SHAKESPEARE ROAD
2400002019-01-09 00:00LE8 8BX5 SHAKESPEARE ROAD

The OpenStreetMap dataset (download) came as a 25GB XML file that looked like this:

<osm>
  <way id="38078393">
    <tag k="addr:housenumber" v="4 Shakespeare Road"/>
    <nd ref="447864834"/>
    <nd ref="447864835"/>
    <nd ref="447864836"/>
    <nd ref="447864837"/>
    <nd ref="447864834"/>
  </way>
  <node id="447864834" lat="52.53038542" lon="-1.0436888"/>
  <node id="447864835" lat="52.53037399" lon="-1.0436391"/>
  <node id="447864836" lat="52.53037938" lon="-1.0433103"/>
  <node id="447864837" lat="52.5303908" lon="-1.0433600"/>
</osm>

It's unclear why, but OpenStreetMap doesn't embed building outlines into the so-called <way /> and instead references them indirectly. Normally, I'd just go ahead and embed them without thinking much, but the 25GB filesize meant that wouldn't be so straightforward. So I began by just splitting the XML out into two tables, like so:

housenumberoutline
4 Shakespeare Road447864834, 447864835, 447864836, 447864837

outline_pointlatlon
44786483452.53038542-1.0436888
44786483552.53037399-1.0436391
44786483652.53037938-1.0433103
44786483752.5303908-1.0433600

So this gives us a house number to connect the OpenStreetMap and HMRC Price Paid datasets, but OpenStreetMap lacks post codes, and a house number alone isn't sufficient to make a match. Afterall, most towns have a "High Street", and there are 11 streets called "High Street" in London alone. Post codes, in case you don't know, are areas that encompass 15-ish neighbouring houses. So in order to uniquely identify a house in the UK one needs both the house number and post code, as in "4 Shakespeare Road, LE8 8BX". There are exceptions, but this works for >99% of them.

I used proximity to associate a post code with each OpenStreetMap building. I downloaded a list of post codes, picked out the five closest post codes to each building, and figured exactly one of them would match a HMRC Price Paid row (where the house number already matched).

postcodelatlon
LE8 8BT52.530336-1.042581
LE8 8BU52.530000-1.043251
LE8 8BW52.529556-1.043954
LE8 8BX52.530478-1.043630
LE8 8BY52.528261-1.044580

With these four tables I had all the data needed to stitch things together. With just a few hundred-thousand rows this would've been easy to do, for example, in Jupyter Notebook with Pandas. But with a few ten-million rows things got a bit more complicated. If faced with this problem today I'd solve it with a data warehouse (Google BigQuery or AWS RedShift). I'd pop the spreadsheets in, write my query, and press "run". The data warehouse would split that query across 1000s of servers and be done in 30 seconds. The query would:

  1. Embed the outlines into the OpenStreetMap documents
  2. Do a geospatial query to associate five postcodes with each OpenStreetMap document
  3. Join the OpenStreetMap and HMRC Price Paid datasets on housenumber with a where clause on postcode

At the time however, I wanted an excuse to learn a new and cool programming language: Rust. Rust had three things that made it well-suited for this project: it's fast, it has great support for multi-threading, it uses a predictable amount of memory. With the predictable memory thing I was able to design an algorthim, count up how much memory it'd need, and trust everything would fit on a single server. Rust would stick to that number, but other languages may have used 200% or 500% more memory than they really need, and would've made predicting how big my server needed to be tricky.

So I spent a few days learning Rust and writing out my program to join everything. I rented a big server with 64 CPUs (multi-threading!), ran my script, it finished 40 minutes later, and I shut the server down. OpenStreetMap is far from comprehensive, but the program still matched 1,400,000 of the 26,000,000 houses sales from HMRC to building outlines.

Adjusting the prices for inflation

I used a bit of Machine Learning for this, a random forest specifically. Scroll down for the Jupyter Notebook.

Splitting the new dataset into map tiles

This was trivial.

Making an Interactive Map

I extended Leaflet to make the map. A piece of software by the same people that created OpenStreetMap. To improve the map, I would begin by showing details about the relevant property on mouseover.

Adjusting the prices for inflation: Jupyter Notebook