This guide covers database performance topics for backend interviews: query optimization, indexing, schema design, caching, partitioning, sharding, replication, and operating large tables. Each question pairs the explanation with code, configuration, or query plans.
Answer:
Measure first, then optimize. Use EXPLAIN ANALYZE (Postgres) or EXPLAIN FORMAT=JSON (MySQL) to see the actual cost. Adding an index without confirming the planner needs one is guesswork.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped';Output tells you whether it picked an index scan, sequential scan, or hash join, plus rows scanned vs returned. Optimization decisions follow the data, not intuition.
Answer:
A user-facing API endpoint should not exceed about 200 ms total DB time at p95. Breaking it down:
| Query type | Acceptable | Investigate | Critical |
|---|---|---|---|
| Primary key lookup | < 1 ms | > 5 ms | > 20 ms |
| Indexed read | < 5 ms | > 20 ms | > 100 ms |
| Aggregation (small) | < 50 ms | > 200 ms | > 1 s |
| Reporting query | < 5 s | > 30 s | move to OLAP |
If a PK lookup takes 50 ms, suspect connection latency, network, or cold cache, not the query itself.
Answer:
Most slow queries fall into a small set of patterns:
- Missing or stale index
- N+1 queries from the application
SELECT *on wide tables- Implicit type cast preventing index use
LIKE '%term%'(leading wildcard)- Function applied to a column without an expression index
- Large
OFFSETfor pagination - Stale planner statistics
- Lock contention
- Buffer pool eviction (cold data)
Example of a hidden killer:
-- "id" is INTEGER but bound as string -> implicit cast -> seq scan
SELECT * FROM users WHERE id = '42';Answer:
The logical plan describes what the query computes (joins, projections, filters). The physical plan describes how the engine will execute it: index scan vs sequential scan, hash join vs nested loop, parallel workers, sort operations.
The optimizer picks the physical plan based on table statistics. After bulk loads, statistics get stale and plans become bad.
-- Postgres: refresh stats
ANALYZE orders;
-- MySQL
ANALYZE TABLE orders;Answer:
SELECT * reads every column whether you need it or not. That means more bytes over the wire, more memory in result buffers, and you cannot use a covering index (the planner is forced to visit the heap).
-- BAD: pulls all columns
SELECT * FROM orders WHERE user_id = 42;
-- GOOD: only what the API actually returns
SELECT id, total_amount, status, created_at
FROM orders
WHERE user_id = 42;For wide tables with TEXT or JSONB columns, the difference can be 10x in I/O and bandwidth.
Answer:
A B-tree is a balanced tree of sorted keys with pointers to rows. Lookups, range scans, and ordered scans are O(log n). The leaves are linked, so range scans walk leaves directly.
CREATE INDEX idx_users_email ON users(email);A 100 million row table typically has a tree depth of 4 to 5 levels. With a warm cache, those reads are in memory, so a lookup costs microseconds.
Answer:
Common cases where the index gets bypassed:
-- Function on column - index unusable
SELECT * FROM users WHERE LOWER(email) = 'me@x.com';
-- Fix with expression index
CREATE INDEX idx_users_lower_email ON users(LOWER(email));-- Leading wildcard - cannot use B-tree
SELECT * FROM products WHERE name LIKE '%phone';
-- Type mismatch
SELECT * FROM users WHERE id = '42'; -- if id is INTEGER
-- Negation
SELECT * FROM orders WHERE status != 'paid';
-- OR across columns from different indexes
SELECT * FROM users WHERE email = ? OR phone = ?;Answer:
A covering index includes every column the query needs, so the engine never visits the heap. Postgres calls this an "index-only scan."
-- Postgres: include extra columns without making them part of key
CREATE INDEX idx_orders_user_status ON orders (user_id, status)
INCLUDE (total_amount);
-- MySQL: just put them in the index
CREATE INDEX idx_orders_user_status ON orders (user_id, status, total_amount);Query that benefits:
SELECT total_amount FROM orders
WHERE user_id = 42 AND status = 'shipped';The downside: bigger index, slower writes.
Answer:
A partial index covers only a subset of rows that match a WHERE clause. Useful when most rows are inactive but queries always filter to the small active set.
-- Only index active orders (much smaller)
CREATE INDEX idx_active_orders ON orders (created_at)
WHERE status = 'active';
-- Soft delete pattern
CREATE INDEX idx_users_email ON users (email)
WHERE deleted_at IS NULL;If only 5 percent of rows match, the partial index is 20x smaller, fits in cache, and scans faster.
Answer:
Rule: equality columns first, range columns last, sort columns aligned with the order direction.
For this query:
SELECT * FROM events
WHERE tenant_id = 7
AND type = 'click'
AND created_at > NOW() - INTERVAL '7 days'
ORDER BY created_at DESC;The right index is:
CREATE INDEX idx_events_tenant_type_created
ON events (tenant_id, type, created_at DESC);The planner seeks to (7, 'click'), then walks leaves backwards by created_at. No extra sort step.
Answer:
In Postgres, dead tuples in indexes do not auto-shrink. After heavy update or delete cycles the index becomes 3 to 5 times larger than necessary, causing slower scans and memory pressure.
Detect:
SELECT schemaname, relname, indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;Fix without downtime:
REINDEX INDEX CONCURRENTLY idx_users_email;
-- Or use pg_repack from the command lineIn MySQL InnoDB, OPTIMIZE TABLE users; rebuilds and reclaims space.
Answer:
| Aspect | B-tree | Hash |
|---|---|---|
| Equality | Yes | Yes (slightly faster) |
| Range queries | Yes | No |
| Sorting (ORDER BY) | Yes | No |
| Real-world use | 99% of cases | Rare |
Default to B-tree. Hash indexes only help when the workload is purely equality on a high-cardinality column with no ordering needs.
Answer:
GIN (Generalized Inverted Index) is best for full-text search, JSONB key lookups, and array contains. Slower writes, fast reads.
GiST (Generalized Search Tree) is best for geospatial, range types, and fuzzy matching. Faster writes, slower reads than GIN.
-- JSONB key lookup
CREATE INDEX idx_payload_gin ON events USING GIN (payload);
-- Full-text
CREATE INDEX idx_doc_fts ON documents
USING GIN (to_tsvector('english', body));
-- Geospatial
CREATE INDEX idx_location ON places USING GIST (location);Answer:
Every index must be updated on every insert, update (to indexed columns), and delete. Eight indexes on a hot write table can make inserts 3 to 5 times slower. They also consume disk and buffer pool memory.
Audit unused indexes:
SELECT schemaname, relname, indexrelname, idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;If an index has zero scans for a month under representative load, drop it.
Answer:
Each line shows a plan node with its estimated cost, estimated rows, actual time, and actual rows. The interesting numbers are actual time, actual rows, and Buffers.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42;Sample output:
Index Scan using idx_orders_user on orders
(cost=0.42..8.44 rows=10 width=64)
(actual time=0.012..0.145 rows=12 loops=1)
Buffers: shared hit=4
Planning Time: 0.123 ms
Execution Time: 0.180 ms
Watch for:
- A big gap between estimated and actual rows means stale stats. Run
ANALYZE. Seq Scanon a large table means a missing index.SortwithDisk:meanswork_memis too low.
Answer:
| Node | When you see it |
|---|---|
| Seq Scan | Reads whole table — good for small tables, bad for large |
| Index Scan | Uses index, then reads heap |
| Index Only Scan | Best — needs covering index + visibility map |
| Bitmap Index Scan | Combines multiple indexes |
| Nested Loop | For each outer row, scan inner — good with small outer |
| Hash Join | Builds hash on one side — good for unsorted equi-join |
| Merge Join | Both sides sorted |
| Sort | Explicit sort — try to align index with ORDER BY |
Answer:
Focus on the type column. From best to worst: const < eq_ref < ref < range < index < ALL. ALL means a full table scan.
EXPLAIN SELECT * FROM orders WHERE user_id = 42;| id | type | key | rows | Extra |
|---|---|---|---|---|
| 1 | ref | idx_user_id | 12 | Using where |
Goals:
- Avoid
type: ALL - Avoid
Extra: Using filesorton large data - Avoid
Extra: Using temporary - Prefer
Extra: Using index(covering index)
For more detail, use EXPLAIN FORMAT=JSON.
Answer:
For each parent row, the ORM fires a separate child query. One parent query plus N child queries equals N+1 round trips.
Bad pattern:
$users = User::all(); // 1 query
foreach ($users as $user) {
echo $user->posts->count(); // N queries
}Fix with eager loading:
$users = User::withCount('posts')->get(); // 1 queryDetection:
- Laravel: enable
Model::preventLazyLoading()in development - Rails: use the
bulletgem - APM tools group similar repeated queries
Answer:
| Strategy | Best when | Cost |
|---|---|---|
| Nested Loop | Outer is small, inner has index | O(N × log M) |
| Hash Join | Both sides unsorted, equi-join | O(N + M), uses memory |
| Merge Join | Both sides pre-sorted | O(N + M) |
The planner picks based on stats. Force a different plan in MySQL with STRAIGHT_JOIN, in Postgres rewrite the query or use pg_hint_plan.
Answer:
Denormalize when:
- The join is on a hot read path with a strict latency budget (under 50 ms)
- Reference data changes rarely, like a copy of
user_nameonorders - The join would otherwise span shards
Example: instead of joining orders and users on every read, copy user_name to orders at insert time.
-- Schema with denormalized name
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT,
user_name VARCHAR(100), -- copied from users
total NUMERIC(10,2),
created_at TIMESTAMPTZ
);The trade-off: writes must keep both copies in sync. Use a background reconciliation job to detect drift.
Answer:
-- "Does the user have any orders?" -> EXISTS short-circuits
SELECT u.id FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
-- "How many orders does each user have?" -> JOIN
SELECT u.id, COUNT(o.id)
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id;
-- "Which users have ordered specific items?" -> IN or JOIN
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders WHERE product_id = 5);Postgres rewrites most IN (subquery) into semi-joins, so performance is similar. MySQL historically prefers EXISTS.
Answer:
OR across different indexed columns often forces a full scan because the planner cannot combine indexes well.
-- Slow
SELECT * FROM users WHERE email = ? OR phone = ?;
-- Fast: rewrite as UNION ALL
SELECT * FROM users WHERE email = ?
UNION ALL
SELECT * FROM users WHERE phone = ? AND email <> ?;In Postgres the planner often handles this via Bitmap Or, so check the plan first before rewriting.
Answer:
Smaller types mean more rows fit per page, which means better cache hit rates.
| Type | Size | Notes |
|---|---|---|
| INT | 4 bytes | Use unless you exceed 2 billion |
| BIGINT | 8 bytes | For massive ID ranges |
| UUID (binary) | 16 bytes | Better than 36-char string |
| VARCHAR(N) | length | In MySQL, can move off-page if very long |
Switching a 1-billion-row table from BIGINT to INT saves 4 GB per indexed column.
-- BAD: storing UUIDs as text in MySQL
user_id CHAR(36)
-- GOOD
user_id BINARY(16)Answer:
A row that exceeds the page size (8 KB Postgres, 16 KB InnoDB) gets split off-page (TOAST in Postgres). The main row stores a pointer, and reading the large field requires extra I/O.
Strategies:
-- BAD: large blobs inline with hot data
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(100),
price NUMERIC(10,2),
description TEXT, -- can be 100 KB
full_html TEXT -- can be 1 MB
);
-- GOOD: split heavy fields
CREATE TABLE products (id, name, price);
CREATE TABLE product_content (
product_id BIGINT PRIMARY KEY REFERENCES products(id),
description TEXT,
full_html TEXT
);Answer:
| Aspect | INT/BIGINT auto | UUID v4 | UUID v7 |
|---|---|---|---|
| Insert speed | Fast (sequential) | Slow (random page splits) | Fast (sequential) |
| Index size | Small | Larger | Larger |
| Distributed-safe | No | Yes | Yes |
| Reveals counts | Yes | No | No |
For new systems, UUID v7 (time-ordered) gives you distributed IDs without the random-insertion penalty.
-- Postgres extension for UUID v7 (Postgres 18+ has built-in)
CREATE EXTENSION pg_uuidv7;
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
...
);Answer:
Use a composite PK when the natural key is short, immutable, and queries always use it. The classic example is a join table:
CREATE TABLE user_role (
user_id BIGINT,
role_id BIGINT,
PRIMARY KEY (user_id, role_id)
);Use a surrogate (auto-increment or UUID) when the natural key can change, when foreign keys would otherwise be wide, or when you need a single column for references.
Answer:
Each statement is a network round trip plus a separate transaction commit (with autocommit on). 10,000 inserts at 2 ms each is 20 seconds of pure latency.
Bad:
foreach ($events as $event) {
DB::table('events')->insert($event); // 10,000 round trips
}Good - batched insert:
DB::table('events')->insert($events); // 1 round tripEven better, single transaction:
BEGIN;
INSERT INTO events (a, b) VALUES (1,2),(3,4),(5,6),...;
COMMIT;Best for huge loads, bypass the SQL parser:
-- Postgres: 10-100x faster than INSERT
COPY events FROM '/tmp/events.csv' WITH (FORMAT CSV);
-- MySQL
LOAD DATA INFILE '/tmp/events.csv' INTO TABLE events;Answer:
A single DELETE FROM events WHERE created_at < ... on 50 million rows holds long locks, bloats WAL, and may block replication.
Chunk the delete:
DO $$
DECLARE deleted INT;
BEGIN
LOOP
DELETE FROM events
WHERE id IN (
SELECT id FROM events
WHERE created_at < NOW() - INTERVAL '1 year'
LIMIT 5000
);
GET DIAGNOSTICS deleted = ROW_COUNT;
EXIT WHEN deleted = 0;
PERFORM pg_sleep(0.1); -- breathing room
END LOOP;
END $$;For partitioned tables, dropping a partition is instant:
DROP TABLE events_2023_01; -- removes 50M rows in millisecondsAnswer:
-- Postgres
INSERT INTO counters (id, count) VALUES (1, 1)
ON CONFLICT (id) DO UPDATE
SET count = counters.count + 1;
-- MySQL
INSERT INTO counters (id, count) VALUES (1, 1)
ON DUPLICATE KEY UPDATE count = count + 1;Pitfalls:
- Postgres creates a new row version on every UPDATE, which causes bloat. Vacuum frequently or use
fillfactor=80. - A hot key (e.g., a single counter incremented thousands of times per second) becomes a serialization bottleneck. Use sharded counters.
-- Sharded counter pattern
CREATE TABLE counters (
bucket SMALLINT,
count BIGINT,
PRIMARY KEY (bucket)
);
-- 16 buckets; each writer picks a random bucketAnswer:
Use a read replica when read traffic exceeds the primary's capacity, or when you want analytics off the OLTP path.
Application pattern:
// Laravel
'mysql' => [
'read' => ['host' => ['replica1', 'replica2']],
'write' => ['host' => ['primary']],
...
]Pitfalls:
- Replication lag means replicas can be seconds behind. After a write, a read may show stale data.
- Stick the user's session to the primary for a few seconds after their write to avoid stale reads.
- Long-running queries on the replica can pause replication.
Answer:
| Layer | Hit speed | Invalidation |
|---|---|---|
Postgres shared_buffers |
< 1 ms | Automatic LRU |
| InnoDB buffer pool | < 1 ms | Automatic LRU |
| App in-process LRU | < 0.1 ms | Manual / TTL |
| Redis / Memcached | 1-2 ms | Manual / TTL |
| CDN | edge RTT | Headers / purge |
The DB caches hot data automatically. The question is whether you can skip the DB entirely with a separate cache layer.
Answer:
The most common cache pattern. The application reads from cache; on miss, reads from DB and populates the cache.
def get_user(user_id):
val = redis.get(f"user:{user_id}")
if val:
return json.loads(val)
user = db.query("SELECT * FROM users WHERE id=%s", user_id)
redis.setex(f"user:{user_id}", 300, json.dumps(user))
return user
def update_user(user_id, data):
db.execute("UPDATE users SET ... WHERE id=%s", user_id)
redis.delete(f"user:{user_id}") # invalidateSimple, but writes can race: between the DB update and the cache delete, another reader may repopulate stale data.
Answer:
| Pattern | Read path | Write path | Consistency |
|---|---|---|---|
| Cache-aside | App handles miss | App invalidates | Strong (with care) |
| Read-through | Cache fetches from DB | App writes to DB | Strong |
| Write-through | Cache fetches from DB | App writes both | Strong |
| Write-back | Cache fetches | Cache writes async to DB | Eventual |
Write-back is fastest but dangerous: a cache crash loses unflushed writes. Use only for non-critical data with replication.
Answer:
There are several strategies, in increasing complexity:
// 1. TTL only — eventual consistency
Cache::put("user:$id", $user, 300);
// 2. Explicit delete on write
DB::transaction(function() use ($user) {
$user->save();
Cache::forget("user:{$user->id}");
});
// 3. Versioned keys — bump version on write
Cache::put("user:$id:v" . $user->version, $user);
// 4. Tag-based purge
Cache::tags(["users", "user:$id"])->put($key, $val);
Cache::tags(["user:$id"])->flush();Common bug: invalidating before the DB commit, so a reader caches stale data. Always invalidate after commit.
Answer:
When a popular key expires, thousands of concurrent requests all hit the DB at once. Solutions:
# Single-flight with a lock
def get_with_lock(key, ttl=300):
val = redis.get(key)
if val:
return val
if redis.set(f"lock:{key}", "1", nx=True, ex=10):
try:
val = db.query(...)
redis.setex(key, ttl, val)
return val
finally:
redis.delete(f"lock:{key}")
else:
time.sleep(0.1)
return redis.get(key) or db.query(...)Other patterns:
- Probabilistic early refresh: refresh slightly before expiry with rising probability
- Two-tier TTL: soft expiry triggers async refresh, hard expiry forces sync
- Stale-while-revalidate: serve stale, refresh in background
Answer:
For low-cardinality public read endpoints, set the HTTP cache headers and let the CDN serve almost all traffic:
return response()->json($products)
->header('Cache-Control', 'public, max-age=60, stale-while-revalidate=300')
->header('Surrogate-Key', 'products');The CDN serves 90 percent or more of requests with no DB hit. Use the Surrogate-Key header to purge by tag when products change. Do not cache personalized content at the CDN unless you partition by user.
Answer:
Each Postgres backend uses about 10 MB of RAM. 1000 raw connections is 10 GB just for connection overhead, plus context switching kills performance.
Pool sizing rule of thumb:
pool_size = (core_count * 2) + effective_spindle_count
For an 8-core SSD server, around 20 connections per app instance. Counterintuitively, more connections often means worse throughput because of contention.
# pgbouncer.ini
[databases]
mydb = host=db port=5432
[pgbouncer]
pool_mode = transaction
default_pool_size = 20
max_client_conn = 1000Answer:
| Mode | Description | Limits |
|---|---|---|
| Session | Backend held for client's entire session | Same overhead as direct connections |
| Transaction | Backend held per transaction (default) | No session-level features (prepared statements pre-PG14, advisory locks across transactions) |
| Statement | Backend per statement | No multi-statement transactions |
Production default: transaction mode. Pool size matched to actual concurrent transactions, not concurrent users.
Answer:
- Latency spikes correlated with traffic spikes
- App logs show "timeout waiting for connection"
- DB CPU is low but the app is queueing
- A long-running transaction from a background job starves everyone
Detect:
-- Postgres: connections by state
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
-- Long-running queries
SELECT pid, now() - query_start AS dur, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY dur DESC;Kill the rogue:
SELECT pg_terminate_backend(<pid>);Answer:
Long transactions hold row and page locks, blocking other writers. In Postgres, they prevent vacuum from cleaning up dead tuples. In MySQL InnoDB they keep undo log entries around.
Bad pattern - HTTP call inside a transaction:
DB::transaction(function () use ($order) {
$order->save();
Http::post('https://payment.example.com/charge', $payload); // 2s
$order->update(['status' => 'paid']);
}); // transaction held for 2+ secondsGood - move I/O outside:
DB::transaction(fn() => $order->save());
$response = Http::post('https://payment.example.com/charge', $payload);
DB::transaction(fn() => $order->update(['status' => 'paid']));Answer:
| Level | Read consistency | Write conflicts | Performance |
|---|---|---|---|
| Read Uncommitted | Sees uncommitted writes | Lots | Fastest, dirty reads |
| Read Committed | Per-statement snapshot | Some | Default Postgres |
| Repeatable Read | Per-transaction snapshot | Some | Default MySQL |
| Serializable | Full serial order | Highest abort rate | Slowest |
In Postgres, Serializable uses SSI and aborts conflicting transactions. In MySQL, Serializable adds gap locks, hurting concurrency further.
Answer:
For pessimistic locking on mutually exclusive operations (transfers, inventory decrements):
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE; -- locks row
UPDATE accounts SET balance = balance - 50 WHERE id = 1;
COMMIT;For high-contention scenarios, prefer optimistic locking:
UPDATE accounts
SET balance = balance - 50, version = version + 1
WHERE id = 1 AND version = 7;
-- If 0 rows updated, retryAnswer:
| Aspect | Partitioning | Sharding |
|---|---|---|
| Where | One DB instance | Multiple DB instances |
| Transparency | DB sees one logical table | App or proxy routes |
| Use case | Large tables, hot/cold data | Beyond single-server capacity |
Partitioning helps a single server. Sharding helps when one server is not enough.
Answer:
Three strategies: range, list, and hash.
-- Range partitioning by time (most common)
CREATE TABLE events (
id BIGSERIAL,
created_at TIMESTAMPTZ NOT NULL,
payload JSONB
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2025_01 PARTITION OF events
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Hash partitioning
CREATE TABLE users (...) PARTITION BY HASH (user_id);
CREATE TABLE users_p0 PARTITION OF users
FOR VALUES WITH (modulus 16, remainder 0);Benefits:
- Per-partition indexes (smaller, faster)
- Drop old partitions instantly instead of deleting 100M rows
- Partition pruning skips irrelevant partitions
Use pg_partman to auto-create future partitions.
Answer:
A bad shard key destroys you. Properties of a good shard key:
- High cardinality
- Even distribution (no hot shards)
- Aligned with most queries (so single-shard reads are common)
- Stable (does not change for a row)
| App type | Good shard key | Why |
|---|---|---|
| User-data app | user_id |
Most queries are per-user |
| Multi-tenant SaaS | tenant_id |
Watch for one giant tenant |
| Time-series | composite (entity, time) | Pure time creates a hot shard |
Anti-patterns: sequential ID (last shard is always hot), country code (US dominates), status (only a few values).
Answer:
With naive hash(key) % N sharding, adding a shard remaps almost every key. With consistent hashing, only 1/N keys move.
Ring (0..2^32):
Shard A: positions 100, 5000, 8000, ...
Shard B: positions 800, 3000, 9000, ...
Key K hashes to 4500 -> belongs to Shard A's 5000 position
Used by: Redis Cluster, Cassandra, DynamoDB, Vitess. Pair with virtual nodes (each physical shard owns many ring positions) for better balance.
Answer:
Options in order of preference:
- Avoid them: design schema so 99 percent of queries hit one shard.
- Scatter-gather: query all shards and merge results. Latency equals the slowest shard.
- Materialized index: denormalize into a separate "global" lookup table.
- Search engine alongside: push searchable data to Elasticsearch.
- Analytics pipeline: ETL all shards into a warehouse like Snowflake or ClickHouse.
# Scatter-gather example
async def get_orders_for_user(user_id):
shards = [shard1, shard2, shard3, shard4]
results = await asyncio.gather(*[
s.query("SELECT * FROM orders WHERE user_id = %s", user_id)
for s in shards
])
return [row for shard in results for row in shard]Answer:
- Full scans become unacceptable (minutes to hours)
- Index size starts pressuring buffer pool
ALTER TABLErewrites become a major operation (hours)- Backup time grows linearly
- Single-row latency stays fast IF indexes are right
Operational changes:
- Use
pg_repackorpt-online-schema-changefor online migrations - Add columns with default only on engines that support instant default (Postgres 11+, MySQL 8 InnoDB)
- Backfill in chunks
- Plan partitioning before crossing 100M
Answer:
Soft delete (deleted_at column) is easy to recover from but every query must filter, and the index carries tombstones.
-- Bad: every query filters, full index includes deleted rows
SELECT * FROM users WHERE deleted_at IS NULL AND email = ?;
CREATE INDEX idx_users_email ON users(email);
-- Good: partial index excludes deleted rows
CREATE INDEX idx_users_email ON users(email)
WHERE deleted_at IS NULL;Production hybrid:
- Soft delete with a 30-day retention
- Background job hard-deletes after retention
- Move expired data to an archive table or cold storage
Answer:
hot table (last 90d) -> warm table -> cold archive (S3) -> drop partition
Time-based partitioning makes archiving trivial:
-- Export a partition, then drop it
COPY events_2023_01 TO '/archive/events_2023_01.csv';
DROP TABLE events_2023_01;For analytics, replicate to a columnar warehouse like ClickHouse, Redshift, or BigQuery.
Answer:
Don't:
-- Full rewrite, hours of locks on large tables
ALTER TABLE huge ADD COLUMN x INT NOT NULL DEFAULT 0;Do (Postgres 11+):
-- 1. Instant: add nullable column
ALTER TABLE huge ADD COLUMN x INT;
-- 2. Backfill in chunks
DO $$
DECLARE last_id BIGINT := 0;
BEGIN
LOOP
UPDATE huge SET x = 0
WHERE id > last_id AND id <= last_id + 10000;
EXIT WHEN NOT FOUND;
last_id := last_id + 10000;
PERFORM pg_sleep(0.05);
END LOOP;
END $$;
-- 3. Add NOT NULL (instant in PG12+)
ALTER TABLE huge ALTER COLUMN x SET NOT NULL;For complex changes, use pg-osc, gh-ost (MySQL), or pt-online-schema-change.
Answer:
| Mode | Latency | Durability | Use |
|---|---|---|---|
| Async streaming | Low | Loss possible on primary failure | Default |
| Synchronous | Higher | Two-node durability | Financial / critical |
| Logical | Per-table | Selective | Heterogeneous, zero-downtime upgrades |
Synchronous replication waits for at least one replica to acknowledge before commit returns:
# postgresql.conf on primary
synchronous_commit = on
synchronous_standby_names = 'replica1,replica2'This adds 1-2 ms per commit but guarantees no data loss on primary failure.
Answer:
- Statement-based: small log, but non-deterministic statements like
NOW()orUUID()misbehave. - Row-based: larger log, exact, default since 5.7.
- Mixed: statement when safe, row otherwise.
For replicas falling behind, parallelize the apply:
# my.cnf on replica
slave_parallel_workers = 8
slave_parallel_type = LOGICAL_CLOCKAnswer:
Causes:
- Long transactions on primary (replica must replay each one)
- Slow queries on replica blocking apply
- Network bottleneck
- Single-thread apply on older MySQL
Monitor:
-- Postgres on replica
SELECT now() - pg_last_xact_replay_timestamp() AS lag;
-- MySQL
SHOW SLAVE STATUS \G
-- Watch Seconds_Behind_MasterAlert if lag exceeds 5 seconds for OLTP traffic.
Answer:
| Parameter | Description | Typical |
|---|---|---|
shared_buffers |
Postgres internal cache | 25% of RAM |
effective_cache_size |
Hint of total cache (OS+PG) | 50-75% of RAM |
work_mem |
Per-operation sort/hash memory | 4-64 MB |
maintenance_work_mem |
For VACUUM, CREATE INDEX | 256 MB - 2 GB |
wal_buffers |
WAL buffer | 16 MB |
random_page_cost |
SSD = 1.1, spinning = 4 | 1.1 (SSD) |
effective_io_concurrency |
Async I/O depth | 200 (SSD) |
Example for a 32 GB server:
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 32MB
maintenance_work_mem = 2GB
random_page_cost = 1.1
effective_io_concurrency = 200Answer:
Default autovacuum is too lazy for big, high-write tables. Bloat grows.
Global tuning:
autovacuum_vacuum_scale_factor = 0.05 # default 0.2
autovacuum_analyze_scale_factor = 0.02
autovacuum_vacuum_cost_limit = 2000 # default 200
autovacuum_naptime = 30sPer-table:
ALTER TABLE big_table SET (
autovacuum_vacuum_scale_factor = 0.01,
fillfactor = 90
);Symptoms of bad autovacuum:
- Bloat growing over time
- Wraparound warnings (
age(datfrozenxid)near 200M) - Query plans degrading
Answer:
| Parameter | Description | Typical |
|---|---|---|
innodb_buffer_pool_size |
Cache for data and index | 50-75% of RAM |
innodb_log_file_size |
Redo log size | 1-4 GB |
innodb_flush_log_at_trx_commit |
1 = full ACID; 2 = slight risk, faster | 1 |
innodb_flush_method |
I/O method | O_DIRECT |
innodb_io_capacity |
Disk IOPS hint | 2000+ for SSD |
innodb_buffer_pool_instances |
For very large pools | 8 |
Example for a 32 GB server:
innodb_buffer_pool_size = 24G
innodb_log_file_size = 2G
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
innodb_io_capacity = 4000Answer:
work_mem is per-sort-or-hash, per-backend. A query with three sorts uses three times work_mem. Total possible memory is work_mem × max_connections × ops_per_query.
For 1 GB headroom, 100 connections, 3 ops per query: work_mem = 3 MB.
Better: keep work_mem modest (8-16 MB) globally and override for analytical sessions:
-- Per-session boost for a heavy report
SET work_mem = '512MB';
SELECT ... GROUP BY ... ORDER BY ...;Answer:
Postgres supports parallel sequential scan, parallel hash join, and parallel aggregate.
max_worker_processes = 16
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
parallel_setup_cost = 1000
parallel_tuple_cost = 0.1
min_parallel_table_scan_size = 8MBParallel only triggers above the table size threshold. Small tables stay serial. Inspect with EXPLAIN:
Gather (cost=... rows=...)
Workers Planned: 4
-> Parallel Seq Scan on big_table
Answer:
Postgres:
log_min_duration_statement = 200 # log queries > 200 ms
log_temp_files = 0 # any spill to disk
log_lock_waits = on
log_checkpoints = on
# Auto-explain logs the plan automatically
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 500
auto_explain.log_analyze = onMySQL:
slow_query_log = ON
long_query_time = 0.2
log_queries_not_using_indexes = ONAggregate with pg_stat_statements (Postgres) or performance_schema (MySQL).
Answer:
SELECT
query,
calls,
total_exec_time,
mean_exec_time,
rows,
100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;Focus on total_exec_time. A 1 ms query called 10 million times causes more pain than a 1-second query called 100 times.
Answer:
-- Postgres: who is blocking whom
SELECT
blocked.pid AS blocked_pid,
blocked.usename AS blocked_user,
blocking.pid AS blocking_pid,
blocking.usename AS blocking_user,
blocked.query AS blocked_query,
blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));Red flags:
- Long
idle in transactionsessions — investigate the app code AccessExclusiveLockon a popular table — DDL during traffic
Answer:
OLTP is many small, fast transactions. OLAP is few huge analytical queries. If they share a buffer pool:
- OLAP loads 100 GB into the buffer, evicting OLTP hot data, causing user-facing latency spikes
- OLAP holds long transactions, preventing vacuum, causing bloat
- OLAP locks block OLTP
Solutions:
- Dedicated read replica for analytics
- ETL to a warehouse (Snowflake, BigQuery, ClickHouse, DuckDB)
- Materialized views refreshed periodically
Answer:
Use for expensive aggregations queried many times.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day,
SUM(total) AS revenue
FROM orders
GROUP BY 1;
CREATE UNIQUE INDEX ON daily_revenue (day);
-- Refresh without locking readers
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;For continuous freshness, use incremental refresh tools like pg_ivm, or maintain a rollup table via triggers.
Answer:
pg_dump runs in a long transaction, blocking vacuum and causing bloat. For large DBs:
- Run
pg_basebackupfrom a replica, not the primary - Or
pg_dump --jobs=N --format=directoryfor parallelism - Or filesystem-level snapshots (ZFS, EBS) — fastest, but need
pg_start_backup/pg_stop_backup
For MySQL, use xtrabackup for online physical backups without locks.
Answer:
-- Postgres 12+
REINDEX INDEX CONCURRENTLY idx_orders_user;
-- Or pg_repack from the command line
pg_repack -t orders -j 4 mydbAnti-pattern: dropping and recreating an index during traffic. Queries that used the old index will degrade until the rebuild finishes.
Answer:
Checklist in order:
- Recent deploy? Roll back or check new code paths
- CPU or I/O saturation? Check
iostat,top - Connection pool exhausted? Check
pg_stat_activity - Long-running query holding locks? Kill it
- Vacuum or autovacuum running? Wait or raise resources
- Disk full? Check
df -h - Replica promoted unexpectedly? Check current primary
-- Quick health check
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
SELECT pg_size_pretty(pg_database_size(current_database()));Answer:
This is usually a plan flip. The planner picked a different plan because:
- Statistics changed (auto-analyze ran after a bulk write)
- Prepared statement parameter sniffing (Postgres uses generic plan after 5 executions)
- Different data shape — querying a user with 1M rows uses a different plan than one with 10
Fix:
-- Force custom plans every time
SET plan_cache_mode = 'force_custom_plan';
-- Refresh stats
ANALYZE big_table;
-- Add an index that's optimal for both shapes
CREATE INDEX idx_x_y ON t (x, y);Answer:
The DB is fast but the app sees latency:
- App-side connection pool too small — request queueing in the app
- Network latency between app and DB (cross-AZ, VPN)
- TLS handshake on every connection — use connection pooling and persistent connections
- DNS lookup on every connection — cache or use IP
- ORM is doing N+1 even though each query is fast
- Slow JSON serialization on the app side
// Bad: opens a new connection per request
foreach ($items as $item) {
DB::connection('reports')->select(...); // reconnects often
}Answer:
Symptoms: WAL fills disk, replication breaks, primary refuses writes.
Common causes:
- WAL retention too high or replication slot held by a dead replica
- Disabled autovacuum so bloat grows forever
- Long-running transaction prevents WAL recycle
- Logical replication slot inactive, accumulating WAL
Recovery:
-- Find inactive slots
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots WHERE NOT active;
-- Drop a dead slot (be careful)
SELECT pg_drop_replication_slot('dead_replica');
-- Find long transactions
SELECT pid, now() - xact_start AS dur, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY dur DESC;Answer:
Counter-style updates on one row create a serialization point.
Bad:
-- 10k req/s all hitting the same row -> lock queue
UPDATE counters SET count = count + 1 WHERE id = 1;Sharded counter:
CREATE TABLE counters (
bucket SMALLINT,
count BIGINT,
PRIMARY KEY (bucket)
);
-- Writer picks a random bucket
UPDATE counters SET count = count + 1
WHERE bucket = floor(random() * 16);
-- Reader sums all buckets
SELECT SUM(count) FROM counters;Or use Redis for the increment and persist asynchronously:
$total = Redis::incr("post:$id:views");
if ($total % 100 === 0) {
Posts::where('id', $id)->update(['views' => $total]);
}Answer:
Use JSONB when the schema is genuinely flexible (custom user fields, extensible event payload), or when some query reads inside the document.
CREATE TABLE events (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL,
event_type TEXT NOT NULL,
payload JSONB
);
-- GIN index for key existence and contains queries
CREATE INDEX idx_events_payload ON events USING GIN (payload);
-- Query
SELECT * FROM events WHERE payload @> '{"button": "checkout"}';Pitfalls:
- JSONB key extraction is slower than dedicated columns
- GIN indexes are heavier than B-tree
- TOAST'd JSONB causes extra reads
- ORM eager-loading JSON fields can blow up bandwidth
Best practice: extract frequently queried keys to dedicated columns, keep tail data in JSONB.
Answer:
NULL itself is cheap to store. But:
WHERE col IS NULLmay not use an index unless it's a partial indexNOT IN (subquery with NULL)returns no rows, a common bug- NULL semantics in joins surprise people
-- Common bug: returns no rows if any user_id is NULL
SELECT * FROM products
WHERE user_id NOT IN (SELECT user_id FROM blocked_users);
-- Correct
SELECT * FROM products p
WHERE NOT EXISTS (
SELECT 1 FROM blocked_users b WHERE b.user_id = p.user_id
);If a column is semantically never null, declare NOT NULL for better stats and plans.
Answer:
SELECT u.id FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.id IS NULL;It returns users with no orders. The LEFT JOIN plus IS NULL is the classic anti-join idiom. Equivalent:
SELECT id FROM users u
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id
);Postgres usually generates the same plan for both. MySQL historically prefers NOT EXISTS.
Answer:
Walk through the steps in order, from cheapest to most complex:
- Measure current bottleneck (CPU, I/O, connections, latency by query)
- Optimize the top 5 slow queries (often 80% of load)
- Cache hot reads (Redis, application cache)
- Add read replicas for read scaling
- Add a connection pooler (PgBouncer, ProxySQL)
- Vertical scale (bigger box)
- Async non-critical writes (queues)
- Partition huge tables
- Shard only when one box cannot keep up
- Move analytics off to a warehouse
Each step buys time before the next, in increasing operational complexity.
| Query pattern | Recommended index |
|---|---|
WHERE a = ? |
(a) |
WHERE a = ? AND b = ? |
(a, b) |
WHERE a = ? ORDER BY b |
(a, b) |
WHERE a = ? AND b > ? ORDER BY b |
(a, b) |
WHERE a IN (...) |
(a) |
LIKE 'abc%' |
B-tree (col text_pattern_ops) in PG |
LIKE '%abc%' |
trigram (pg_trgm) GIN |
| JSONB key | GIN on (jsonb_col) |
| Geospatial | GiST |
| Full-text | GIN on tsvector |
| Operation | Time |
|---|---|
| L1 cache | 0.5 ns |
| L2 cache | 7 ns |
| Main memory | 100 ns |
| SSD random read | 100 µs |
| Same-AZ network round-trip | 0.5 ms |
| HDD random read | 10 ms |
| Cross-region round-trip | 100+ ms |
| HTTP API call | 50-500 ms |
If your DB is on SSD in the same AZ, even a "cold" indexed lookup is under 1 ms. If a query takes 100 ms, something is doing many disk reads, scanning large data, or contending for a lock.