# My Journey into UK Postcodes, Spatial Data, and Scaling Challenges
[Postcodes around me - V2.2 | Frank Observable Lab](https://fcontrepois.observablehq.cloud/lab/p)
## Introduction
What started as a simple curiosity—understanding how to identify UK postcodes around me—quickly turned into a deep dive into large datasets, spatial indexing, and cloud-based optimisations. Here’s my journey, step by step.
---
## Step 0: constraints
- Using Observable Framework to generate static pages
- Each page must load in less than half a second
- The data files per page need to be counted in Kilos not Megas (London being an exception)
- Need to cover all of the UK
- It should take less than 5 seconds to see my current position and the postcodes around me.
---
## 📍 Step 1: Discovering the ONS Postcode Dataset
- I found the **ONS (Office for National Statistics) dataset**, which contains an incredible level of detail, including:
- Postcodes
- Latitude & longitude (down to the meter!)
- Additional metadata like easting and northing coordinates
- The dataset was massive—**2.8 million rows**.
- I realized that many of these were **terminated postcodes**, so filtering for active ones brought it down to **1.7 million**.
---
## 📏 Step 2: Attempting a Simple Bounding Box Approach
- I first tried a naive approach: **grouping postcodes by their prefixes** (e.g., `SW`, `N`, `IV`) and defining broad rectangular regions.
- However, issues quickly arose:
- **PO Boxes don’t match their postcode area**—some postcodes don’t correspond to real-world locations.
- **Regions were too imprecise**—some bounding boxes were too large, making them unreliable.
---
## 🗺️ Step 3: Splitting the UK into Grid-Based Regions
- I attempted a **manual division** of the UK into 20 rectangles, then subdivided further.
- This quickly became **unmanageable** as the number of regions grew exponentially.
- **Spatial SQL extensions** in **Postgres and SQL Server** seemed promising, but I struggled to get the results I wanted.
---
## 🔄 Step 4: Flipping the Approach – Grid First, Then Assign Postcodes
- Instead of trying to infer bounding boxes from postcodes, I **flipped the problem**:
1. Generate a grid of squares over the UK.
2. Assign each postcode to its corresponding square.
- I started with **20×20 (400 squares)** but quickly scaled to **40,000 squares** for better granularity.
- **DuckDB** proved to be incredibly useful for processing:
- Extracted only the needed columns.
- Partitioned data efficiently.
- Converted CSV into **Parquet** for size optimization.
---
## 🏗️ Step 5: The Observable Bottleneck
> [!Update]
> By using Duckdb in a file (not in memory) and using the normal parametrised approach to data loader and files everything started working great.
> I also reduced the areas to 23*23 and that's enough :D
- While DuckDB allowed **rapid dataset generation (10 seconds for 40,000 files!)**, Observable wasn’t scaling well.
- **40,000 pages** were simply too much for the platform.
- One second per page is simply **too slow**.
---
## ☁️ Step 6: Moving to Cloud Storage (S3/R2)
> [!Update]
> I now have only 23*23 areas and the generation is super fast. No need for this bit anymore.
- The next logical step: **host the 40,000+ datasets in the cloud** and load them dynamically.
- Plan:
- Upload datasets to **R2 or S3**.
- Serve them via a lightweight API or direct file access.
- Load only **relevant data** in Observable instead of pre-generating everything.
- This should significantly improve performance by **fetching data on demand** instead of loading huge datasets upfront.
---
## Final Thoughts
What started as a curiosity turned into an experiment in **spatial data processing, optimisation, and cloud scalability**. Each approach taught me something new, from **ONS data structures** to **DuckDB performance tricks**.
I will find a way.
---
### 🚀 Tech Stack Summary:
- **Data Source:** [ONS Postcode dataset](https://www.ons.gov.uk/methodology/geography/geographicalproducts/postcodeproducts)
- **Processing:** [DuckDB](https://duckdb.org/), Parquet
- **Visualization:** [Observable](https://observablehq.com/framework/)
- **Hosting:** [R2](https://www.cloudflare.com/en-gb/developer-platform/products/r2/)/[S3](https://aws.amazon.com/s3/) (planned)
- **Optimizations:** Grid-based indexing, partitioning, and dynamic data loading