CRITICAL RULES:
- NEVER modify existing columns in `voters` or `voter_elections` tables
- NEVER delete or overwrite source data
- ONLY add new computed columns or create separate cache tables
- ALWAYS backup database before running optimization scripts
- All optimizations must be reversible
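For example, the backup can be a single SQL statement (SQLite 3.27+), and every change proposed below has a rollback path; the file name here is illustrative:

```sql
-- Online copy of the whole database before any optimization run
VACUUM INTO 'voters-backup.db';

-- Rollback path for the additions proposed below:
--   DROP TABLE IF EXISTS election_stats_cache;
--   DROP TABLE IF EXISTS household_groups;
--   ALTER TABLE voter_elections DROP COLUMN is_new_voter;  -- SQLite 3.35+
```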
Core Data Tables (NEVER MODIFY):
- `voters` - Voter registration data with geocoded addresses
- `voter_elections` - Voting history records
- These are the source of truth and must remain pristine
Safe Operations:
- ✅ ADD new columns for computed data
- ✅ CREATE new cache/summary tables
- ✅ CREATE indexes
- ✅ READ data for analysis
- ❌ UPDATE existing columns
- ❌ DELETE records
- ❌ DROP tables
Root Cause: Address matching with LIKE queries on 500K+ rows
Current Fix: Add indexes
Better Fix: Pre-compute household groupings
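Worth noting: an index only speeds up LIKE when the pattern is anchored at the start, so the current fix is partial at best. A sketch, assuming the relevant column is `voters.address`:

```sql
CREATE INDEX IF NOT EXISTS idx_voters_address ON voters (address);

-- Can use the index (prefix match):
--   SELECT vuid FROM voters WHERE address LIKE '123 MAIN ST%';
-- Cannot use it (leading wildcard forces a full scan):
--   SELECT vuid FROM voters WHERE address LIKE '%MAIN ST%';
-- SQLite also requires compatible case-sensitivity settings (e.g. a
-- NOCASE collation) before it rewrites LIKE as an index range scan.
```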
Root Cause: Correlated subqueries scanning the entire `voter_elections` table
- New voter detection: `NOT EXISTS (SELECT ... WHERE vuid = ve.vuid AND date < current)`
- Flip detection: `SELECT MAX(date) WHERE vuid = ve.vuid AND date < current`
Current Fix: Pre-compute and cache
Better Fix: Denormalize - store flags directly on records
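Until those flags exist, a composite index (column names assumed from the queries above) at least turns each correlated subquery into a cheap B-tree seek:

```sql
CREATE INDEX IF NOT EXISTS idx_ve_vuid_date
    ON voter_elections (vuid, election_date);
```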
Root Cause: Loading 40K+ VUIDs into a temp table, then joining
Current Fix: Already optimized with temp tables
Better Fix: Pre-compute per district, store in a cache table (see the sketch below)
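A possible shape for that cache table; the name and columns are hypothetical and would mirror whatever the district view actually reports:

```sql
CREATE TABLE IF NOT EXISTS district_stats_cache (
    district      TEXT,
    election_date TEXT,
    total_voters  INTEGER,
    new_voters    INTEGER,
    PRIMARY KEY (district, election_date)
);
```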
Add computed columns to the `voter_elections` table:

```sql
ALTER TABLE voter_elections ADD COLUMN is_new_voter INTEGER DEFAULT 0;
ALTER TABLE voter_elections ADD COLUMN previous_party TEXT;
ALTER TABLE voter_elections ADD COLUMN previous_election_date TEXT;
ALTER TABLE voter_elections ADD COLUMN has_flipped INTEGER DEFAULT 0;
```

Compute once during the scraper run:
```sql
-- After importing new election data:

-- 1. Mark new voters (no earlier vote on record)
UPDATE voter_elections SET is_new_voter = 1
WHERE election_date = '2026-03-03'
AND NOT EXISTS (
    SELECT 1 FROM voter_elections ve2
    WHERE ve2.vuid = voter_elections.vuid
    AND ve2.election_date < '2026-03-03'
);

-- 2. Set previous party and previous election date
UPDATE voter_elections
SET previous_party = (
        SELECT party_voted FROM voter_elections ve2
        WHERE ve2.vuid = voter_elections.vuid
        AND ve2.election_date < voter_elections.election_date
        ORDER BY ve2.election_date DESC LIMIT 1
    ),
    previous_election_date = (
        SELECT election_date FROM voter_elections ve2
        WHERE ve2.vuid = voter_elections.vuid
        AND ve2.election_date < voter_elections.election_date
        ORDER BY ve2.election_date DESC LIMIT 1
    )
WHERE election_date = '2026-03-03';

-- 3. Mark flips (party changed since the previous vote)
UPDATE voter_elections
SET has_flipped = 1
WHERE election_date = '2026-03-03'
AND previous_party IS NOT NULL
AND previous_party != party_voted;
```

Impact:
- Gazette queries: 5 minutes → 5 seconds
- No more correlated subqueries
- Simple WHERE clauses with indexes
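For instance, the new-voter count collapses to an indexed filter; the index name is illustrative:

```sql
CREATE INDEX IF NOT EXISTS idx_ve_date_new
    ON voter_elections (election_date, is_new_voter);

SELECT COUNT(*) AS new_voters
FROM voter_elections
WHERE election_date = '2026-03-03'
  AND is_new_voter = 1;
```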
Create pre-aggregated tables:
```sql
CREATE TABLE election_stats_cache (
    election_date TEXT,
    voting_method TEXT,
    party TEXT,
    county TEXT,
    total_voters INTEGER,
    new_voters INTEGER,
    flipped_voters INTEGER,
    female_count INTEGER,
    male_count INTEGER,
    age_18_24 INTEGER,
    age_25_34 INTEGER,
    -- etc
    computed_at TIMESTAMP,
    PRIMARY KEY (election_date, voting_method, party, county)
);
```

Rebuild after each scraper run:
```sql
-- Single INSERT with GROUP BY - fast!
INSERT OR REPLACE INTO election_stats_cache
SELECT
    ve.election_date,
    ve.voting_method,
    ve.party_voted,
    v.county,
    COUNT(*) as total_voters,
    SUM(ve.is_new_voter) as new_voters,
    SUM(ve.has_flipped) as flipped_voters,
    SUM(CASE WHEN v.sex='F' THEN 1 ELSE 0 END) as female_count,
    SUM(CASE WHEN v.sex='M' THEN 1 ELSE 0 END) as male_count,
    -- age-group CASE expressions go here, one per column
    CURRENT_TIMESTAMP
FROM voter_elections ve
JOIN voters v ON ve.vuid = v.vuid
WHERE ve.election_date = '2026-03-03'
GROUP BY ve.election_date, ve.voting_method, ve.party_voted, v.county;
```

Impact:
- Gazette: Query summary table instead of raw data
- District stats: Aggregate from summary table
- 100x faster queries
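For example, a county-by-party gazette breakdown becomes a read over a few hundred cache rows instead of 500K+ raw records (a sketch against the schema above):

```sql
SELECT county,
       party,
       SUM(total_voters)   AS voters,
       SUM(new_voters)     AS new_voters,
       SUM(flipped_voters) AS flips
FROM election_stats_cache
WHERE election_date = '2026-03-03'
GROUP BY county, party
ORDER BY county, voters DESC;
```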
Create household lookup table:
```sql
CREATE TABLE household_groups (
    address_key TEXT PRIMARY KEY,
    vuids TEXT,  -- JSON array of VUIDs at this address
    lat REAL,
    lng REAL,
    voter_count INTEGER
);
```

Rebuild after geocoding:
```sql
INSERT OR REPLACE INTO household_groups
SELECT
    UPPER(TRIM(address)) as address_key,
    json_group_array(vuid) as vuids,
    AVG(lat) as lat,
    AVG(lng) as lng,
    COUNT(*) as voter_count
FROM voters
WHERE geocoded = 1 AND address != ''
GROUP BY UPPER(TRIM(address));
```

Impact:
- Household popup: 30s → <100ms
- Single index lookup instead of LIKE scan
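The popup then resolves with one primary-key lookup, where `?` is the address bound from the clicked marker:

```sql
SELECT vuids, lat, lng, voter_count
FROM household_groups
WHERE address_key = UPPER(TRIM(?));
```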
Replace correlated subqueries with window functions:
```sql
-- OLD (slow): one correlated subquery per row
SELECT vuid, party_voted,
    (SELECT MAX(election_date) FROM voter_elections ve2
     WHERE ve2.vuid = ve.vuid AND ve2.election_date < ve.election_date) as prev_date
FROM voter_elections ve;

-- NEW (fast): window functions, supported since SQLite 3.25
WITH ranked AS (
    SELECT vuid, party_voted, election_date,
        LAG(party_voted) OVER (PARTITION BY vuid ORDER BY election_date) as prev_party,
        LAG(election_date) OVER (PARTITION BY vuid ORDER BY election_date) as prev_date
    FROM voter_elections
)
SELECT * FROM ranked WHERE election_date = '2026-03-03';
```

Impact:
- Single table scan instead of N subqueries
- 10-50x faster
```sql
-- Increase cache size
PRAGMA cache_size = -64000;      -- 64MB
-- Enable memory-mapped I/O
PRAGMA mmap_size = 268435456;    -- 256MB
-- WAL mode (already enabled; persists in the database file)
PRAGMA journal_mode = WAL;
-- Optimize for read-heavy workload
PRAGMA synchronous = NORMAL;
PRAGMA temp_store = MEMORY;
-- Note: cache_size, mmap_size, synchronous, and temp_store are
-- per-connection; set them every time the database is opened.
```

Priority order:
- Indexes (30 seconds, safe, immediate 2-5x speedup)
- Denormalization (5 minutes, one-time, 10-50x speedup for gazette)
- Summary tables (10 minutes, one-time, 100x speedup for aggregates)
- Household pre-computation (2 minutes, 100x speedup for popups)
- Query rewriting (ongoing, case-by-case)
PostgreSQL:
Pros:
- Better query optimizer
- Parallel queries
- Materialized views built-in
- Better for concurrent writes
Cons:
- More complex deployment
- Higher memory usage
- Overkill for current scale
Verdict: Stick with SQLite for now, but consider if you hit 10M+ records
DuckDB:
Pros:
- Columnar storage (10x faster aggregates)
- Built for OLAP queries
- Can query SQLite directly
Cons:
- Another database to manage
- Not for transactional data
Verdict: Interesting for gazette/analytics, but adds complexity
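The "query SQLite directly" pro works through DuckDB's sqlite extension; a sketch with an illustrative file name:

```sql
INSTALL sqlite;
LOAD sqlite;
ATTACH 'voters.db' AS src (TYPE sqlite);
SELECT county, COUNT(*) AS voters FROM src.voters GROUP BY county;
```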
Redis:
Pros:
- Sub-millisecond lookups
- Perfect for household popups
- TTL support
Cons:
- Another service to run
- Memory-only (unless persistence enabled)
Verdict: Good for production, but denormalization achieves similar results
Elasticsearch:
Pros:
- Fast full-text search
- Geospatial queries
- Aggregations
Cons:
- Heavy resource usage
- Complex setup
Verdict: Overkill for current needs
Best bang for buck:
- Indexes (do now)
- Denormalization (do next)
- Summary tables (do after scraper integration)
This gives you 100x+ speedup with minimal complexity.
Don't need:
- PostgreSQL (SQLite is fine at this scale)
- Redis (denormalization is simpler)
- Elasticsearch (not needed)
Future considerations (if you hit 10M+ records):
- PostgreSQL with read replicas
- Partitioning by year
- Separate analytics database