Skip to content

prateekmulye/geo-standardization-case-study

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 

Repository files navigation

Geo‑Standardization Service — Case Study

Last updated: Aug 13, 2025

This case study covers a backward‑compatible standardization of U.S./Canada State/Province fields and a bulk migration plan for 15–20M records. Proprietary details removed; approaches and code snippets are illustrative.

Problem

  • Inconsistent State/Province values (free‑text, abbreviations, misspellings) caused joins and analytics issues.
  • Need a service that standardizes geo fields while remaining compatible with existing consumers.
  • Phase 1 targets US and CA; later phases extend globally.

Approach

  1. Standardization Service: canonical mapping (e.g., CACalifornia) with validated aliases.
  2. Backward Compatibility: dual fields (state_raw, state_std) and read‑through logic.
  3. Incremental Rollout: enable per‑country; monitor read/write ratios and error rates.
  4. Bulk Migration: standardize 15–20M existing records with strong rollback guarantees.

Architecture

flowchart LR
A[Input: raw records] --> M[Mapper: alias → canonical]
M --> W[(DB: state_std)]
W --> S[Services & APIs]
S --> C[Downstream Consumers]
Loading
  • Mapper: curated alias table + fuzzy match for common typos.
  • DB: maintain both state_raw and state_std during transition; downstreams switch gradually.

Data Model (illustrative)

ALTER TABLE companies
  ADD COLUMN state_std TEXT,
  ADD COLUMN geo_version INT DEFAULT 1;

-- idempotent backfill procedure (batched)
UPDATE companies
SET state_std = geo_map(state_raw),
    geo_version = 2
WHERE country IN ('US','CA') AND state_std IS NULL
LIMIT 10000; -- looped in batches

Bulk Migration Plan

  • Batching: process in chunks with checkpoints; record counts and mismatch rates per batch.
  • Safety: snapshot before each major phase; maintain rollback SQL.
  • Monitoring: dashboards for standardization rate, error rate, and downstream read success.

Results

  • Cleaned US/CA state data at scale with no breaking changes to consumers.
  • Prepared a repeatable playbook for additional countries.

Lessons Learned

  • Dual‑write/dual‑read periods minimize risk.
  • A curated alias list covers 80–90% of mess; the last 10% needs targeted fixes and owner feedback.

About

Geo Standardization Case-Study

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors