My Journey into UK Postcodes, Spatial Data, and Scaling Challenges

# My Journey into UK Postcodes, Spatial Data, and Scaling Challenges [Postcodes around me - V2.2 | Frank Observable Lab](https://fcontrepois.observablehq.cloud/lab/p) ## Introduction What started as a simple curiosity—understanding how to identify UK postcodes around me—quickly turned into a deep dive into large datasets, spatial indexing, and cloud-based optimisations. Here’s my journey, step by step. --- ## Step 0: constraints - Using Observable Framework to generate static pages - Each page must load in less than half a second - The data files per page need to be counted in Kilos not Megas (London being an exception) - Need to cover all of the UK - It should take less than 5 seconds to see my current position and the postcodes around me. --- ## 📍 Step 1: Discovering the ONS Postcode Dataset - I found the **ONS (Office for National Statistics) dataset**, which contains an incredible level of detail, including: - Postcodes - Latitude & longitude (down to the meter!) - Additional metadata like easting and northing coordinates - The dataset was massive—**2.8 million rows**. - I realized that many of these were **terminated postcodes**, so filtering for active ones brought it down to **1.7 million**. --- ## 📏 Step 2: Attempting a Simple Bounding Box Approach - I first tried a naive approach: **grouping postcodes by their prefixes** (e.g., `SW`, `N`, `IV`) and defining broad rectangular regions. - However, issues quickly arose: - **PO Boxes don’t match their postcode area**—some postcodes don’t correspond to real-world locations. - **Regions were too imprecise**—some bounding boxes were too large, making them unreliable. --- ## 🗺️ Step 3: Splitting the UK into Grid-Based Regions - I attempted a **manual division** of the UK into 20 rectangles, then subdivided further. - This quickly became **unmanageable** as the number of regions grew exponentially. - **Spatial SQL extensions** in **Postgres and SQL Server** seemed promising, but I struggled to get the results I wanted. --- ## 🔄 Step 4: Flipping the Approach – Grid First, Then Assign Postcodes - Instead of trying to infer bounding boxes from postcodes, I **flipped the problem**: 1. Generate a grid of squares over the UK. 2. Assign each postcode to its corresponding square. - I started with **20×20 (400 squares)** but quickly scaled to **40,000 squares** for better granularity. - **DuckDB** proved to be incredibly useful for processing: - Extracted only the needed columns. - Partitioned data efficiently. - Converted CSV into **Parquet** for size optimization. --- ## 🏗️ Step 5: The Observable Bottleneck > [!Update] > By using Duckdb in a file (not in memory) and using the normal parametrised approach to data loader and files everything started working great. > I also reduced the areas to 23*23 and that's enough :D - While DuckDB allowed **rapid dataset generation (10 seconds for 40,000 files!)**, Observable wasn’t scaling well. - **40,000 pages** were simply too much for the platform. - One second per page is simply **too slow**. --- ## ☁️ Step 6: Moving to Cloud Storage (S3/R2) > [!Update] > I now have only 23*23 areas and the generation is super fast. No need for this bit anymore. - The next logical step: **host the 40,000+ datasets in the cloud** and load them dynamically. - Plan: - Upload datasets to **R2 or S3**. - Serve them via a lightweight API or direct file access. - Load only **relevant data** in Observable instead of pre-generating everything. - This should significantly improve performance by **fetching data on demand** instead of loading huge datasets upfront. --- ## Final Thoughts What started as a curiosity turned into an experiment in **spatial data processing, optimisation, and cloud scalability**. Each approach taught me something new, from **ONS data structures** to **DuckDB performance tricks**. I will find a way. --- ### 🚀 Tech Stack Summary: - **Data Source:** [ONS Postcode dataset](https://www.ons.gov.uk/methodology/geography/geographicalproducts/postcodeproducts) - **Processing:** [DuckDB](https://duckdb.org/), Parquet - **Visualization:** [Observable](https://observablehq.com/framework/) - **Hosting:** [R2](https://www.cloudflare.com/en-gb/developer-platform/products/r2/)/[S3](https://aws.amazon.com/s3/) (planned) - **Optimizations:** Grid-based indexing, partitioning, and dynamic data loading