Developer Dan Baker on drawing on the Royal Mail’s postcode database to help users find their local court or a legal adviser in their area.
When we took over the development of the Legal Services Commission’s legal aid checker website we also inherited access to the Royal Mail’s Postcode Address File. Rather than making an additional payment for a readymade external postcode finding service, like Postcode Anywhere, we decided to build our own in-house service utilising the PAF.
“The Royal Mail’s Postcode Address File (PAF®) is the most up-to-date and complete database in the UK, containing over 28 million addresses.” (Royal Mail website)
In the modern age of celebrity, where millions and billionaires are mentioned in every other article, it’s all too easy to forget just how big a million of anything actually is. I got a stark reminder of this when I looked at the raw data in the Royal Mail’s Postcode Address File. When you need to transfer and use this data in a database, it’s big!
There’s a 205-page programmers’ guide to get you started, but ultimately it comes down to understanding just what level of detail you require for the application that’s being developed. The PAF lists every single business and residential address in the UK, right up to old Mrs. Albertson at no. 52, but do we really need that level of detail?
In short, no!
The court finder and check if you can get legal aid applications, to name but two, will already have access to a database of ‘resources’, such as courts or legal advisers, along with their full address details and geographic coordinates. We’re just interested in locating which of these is in the vicinity of the user’s postcode, town or borough. This is where the postcode/address look-up service comes into play.
The PAF is split into a series of different text files that deal with streets, buildings, sub-buildings, organisations, towns and addresses (including postcodes), all laid out in a relational format to each other.
The ‘address’ file is where the 28 million addresses are stored and each new line looks something like this:
This basically lists the individual postcode along with a series of pointers to unique identifiers that go to make up an individual delivery address point.
If we were trying to deliver goods to someone’s door that information would be invaluable. As we’re only really interested in the general area of a town or postcode we can instantly strip away most of this excess data, to leave just the postcode and the town identifier. However, that still leaves us with 28 million lines of information – just shorter, more manageable lines.
We can reduce the size of the dataset by removing details relating to any postcodes in Scotland or Northern Ireland, because the services will only cover England and Wales.
In addition, each postcode can be repeated on average 15 times for the various delivery points it covers. For example, the postcode PE1 4QD covers numbers 1 to 43 on Acacia Avenue in Peterborough. If we only need to handle unique postcodes, rather than full addresses, we can make the dataset much smaller.
(We could make the dataset smaller still by only using the ‘outward postcode’ – the first half of any postcode, such as PE1 – rather than the full postcode, PE1 4QD. This would increase query response time but at a cost of absolute accuracy.)
At the end of all this consolidation the address file is a considerably more manageable 28,000 lines of data.
The ‘towns’ file is where around 40,000 ‘localities’ are stored, an example of which is below:
002344 HUNTINGDON ST. NEOTS LITTLE BARFORD
Each line consists of a town ID followed by the name of the town it refers to, and may also include a ‘dependency’ (such as a suburb name) and ‘double dependency’ (neighbourhoods, or similarly small areas, by another name).
As with the address file this data can be rationalised, for our specific requirements, to filter out any locations in Scotland and Northern Ireland and, because we just don’t need that level of detail, the double dependencies.
The final piece of the puzzle, and most useful with regards to finding a user’s geographic location, is to utilise the Royal Mail’s Postzon™ data, which is supplied along with the PAF.
Postzon contains a list of around 1.7 million postcodes currently in use along with their easting/northing geographic coordinates (accurate to 100 metres) and a whole host of additional information about the country, county, district, and ward (to name but a few) that the postcode belongs to, an example of which is below:
Again the data is laid out in a relational format as a series of reference codes. Unlike PAF, where extra info is supplied in another text file by Royal Mail, the majority of the Postzon data is supplied by external data partners (most of it by Ordnance Survey and Office for National Statistics). To be able to get meaningful information from these numeric codes therefore requires several additional downloads from the data partners’ websites.
There’s not much you can do to consolidate this dataset, aside from removing the areas that the services don’t cover, as every possible postcode needs to return a result.
There are some anomalies in the data, where only a postcode is provided without the more useful geographic coordinates and area code information. In these cases it’s easier just to strip out the row entirely, rather than return NULL values later. Even doing that still leaves the dataset coming in at an unwieldy 1.5 million rows.
The only advice I can offer there is to split up the data into a series of smaller tables dealing with specific postcode ranges ([a-c], [d-f], etc…) or invest in a higher-spec database server. Some beasts were never meant to be tamed!
Hopefully I’ve managed to shed some light on the slightly daunting process of grappling with the Royal Mail’s datasets. Of course that’s only half the battle, the real work begins with programming the postcode look-up service, but maybe I’ll save blogging about that for another day.
In the meantime feel free to leave any feedback on your own experiences handling the PAF/Postzon or with any finished postcode look-up services you’ve worked on.