Free, structured prep for data engineering interviews. Chapters by round, worked examples, runnable problems.
Chapters · Study plans · Companion repos · Live sandboxes
Most "interview prep" material was written for software engineers, data scientists, or analysts. This handbook is calibrated for data engineers. Each chapter covers one round of a real DE loop with the specific patterns interviewers test, plus a practice set you can run in a browser at datadriven.io.
If you have a loop coming up soon, jump to the study plans. If you want to drill a specific topic, pick a chapter.
| Round | Chapter |
|---|---|
| SQL screen | 1. SQL |
| Coding | 2. Python |
| Modeling | 3. Data modeling and schema design |
| System design | 4. Pipeline architecture |
| System design | 5. The eight beat framework |
| Behavioral | 6. Behavioral |
| Pre-onsite | 7. Company guides |
| Timeline | Use when | Outline |
|---|---|---|
| 4 week sprint | Onsite next month | Week 1 SQL, week 2 Python, week 3 modeling, week 4 design and behavioral |
| 8 week build | Two months out | Run the sprint twice. First pass for breadth, second for the patterns you missed. Add a weekly mock from week 5. |
| 12 week ramp | Switching from analytics or SWE | Weeks 1 to 4 on foundation lessons. Weeks 5 to 8 on modeling and pipelines. Weeks 9 to 12 on company prep. |
Printable 12 week version: datadriven.io/data-engineering-study-plan.
DE SQL is not analyst SQL. The bar:
- Write a window function from memory.
- Reason about query plans, partitioning, and join strategy.
- Find the bug in someone else's query.
- Recognize when a SQL question is secretly a data quality question.
Lessons in order: joins, aggregating, window functions, filtering, dates, optimization.
| Problem | Difficulty | Tests |
|---|---|---|
| 10 Lowest Uptime Services | Easy | TOP N with ties |
| 2FA Confirmation Rate | Easy | Conditional aggregation |
| 30 Day Page View Counts | Medium | Date filtering |
| 7 Check Rolling Average | Medium | Rolling window, ROWS vs RANGE |
| Active Users by Month | Hard | Cohort logic |
| 2nd Most Common Content Type | Hard | Tie breaking |
Full set: 854 problems at datadriven.io/sql-interview-questions.
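The ROWS vs RANGE distinction tested by the rolling average problem is easiest to see when two rows tie on the ordering column. The sketch below is not taken from the problem set: the `checks` table, its columns, and the sample values are made up for illustration, and it runs through Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer).

```python
import sqlite3

# Hypothetical health-check rows. Two checks share ts = 2, which is what
# makes the ROWS and RANGE frames diverge.
CHECKS = [(1, 10.0), (2, 20.0), (2, 30.0), (3, 40.0)]

QUERY = """
SELECT
    ts,
    latency_ms,
    SUM(latency_ms) OVER (
        ORDER BY ts
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS rows_sum,
    SUM(latency_ms) OVER (
        ORDER BY ts
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS range_sum
FROM checks
ORDER BY ts, latency_ms
"""


def running_sums():
    """Return (ts, latency_ms, rows_sum, range_sum) for every check."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE checks (ts INTEGER, latency_ms REAL)")
    conn.executemany("INSERT INTO checks VALUES (?, ?)", CHECKS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

With RANGE, both `ts = 2` rows are peers and get the same running total. With ROWS, each physical row gets its own total, and which tied row comes first is not guaranteed unless the `ORDER BY` has a tiebreaker. That non-determinism is a common source of "find the bug" questions.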
DE Python is not LeetCode. It is data manipulation: chunking, sessionization, dedup, retries, interval merging, hash partitioning, schema evolution.
Lessons: foundations, collections, complexity.
| Problem | Difficulty | Pattern |
|---|---|---|
| Batch Records | Easy | Chunking iterables |
| Column Sum | Easy | Dict aggregation |
| Activity Time Ledger | Medium | Interval merging |
| Batch Partitioner | Medium | Hash bucketing |
| Batch With Metadata | Medium | Stateful iteration |
| Caesar Shift Check | Hard | String transforms |
Full set: 388 problems at datadriven.io/python-interview-questions.
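Three of the patterns named above (chunking, interval merging, hash bucketing) fit in a few lines each. This is a minimal sketch, not the reference solution to any listed problem. The function names are illustrative, and CRC32 is one assumed choice of stable hash.

```python
import zlib
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) pairs into sorted spans."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def bucket_for(key, num_buckets):
    """Map a string key to a bucket that is stable across processes.

    The built-in hash() is salted per process for strings, so it is not
    safe for partitioning data that several workers must agree on.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets
```

For example, `list(chunked(range(5), 2))` gives `[[0, 1], [2, 3], [4]]`, and `merge_intervals([(5, 7), (1, 3), (2, 4), (7, 9)])` gives `[(1, 4), (5, 9)]`.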
Senior loops are won and lost here. Interviewers reward candidates who pick the right grain for fact tables, defend an SCD type, and validate the schema with sample queries.
Lessons: keys, data types, relationships, normalization, dimensional modeling, SCD, event streams, nested data.
| Problem | Tests |
|---|---|
| A/B Experiment Assignment Schema | SCD type 2, sticky bucketing |
| Customer Address History | Effective dates |
| Insurance Claims Lifecycle | State machines |
| Clickstream and Session Schema | Sessionization, late events |
| Loan Management Schema | Bridge tables |
| Financial Trading Warehouse | Late arriving facts |
Full set: 56 problems at datadriven.io/data-modeling-interview-questions.
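To make "defend an SCD type" and "validate the schema with sample queries" concrete, here is a rough sketch of type 2 history with effective dates, in the spirit of the Customer Address History problem. The `AddressRow` shape, the half-open `[valid_from, valid_to)` convention, and both function names are assumptions made for illustration, not the handbook's schema.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AddressRow:
    customer_id: int
    address: str
    valid_from: date
    valid_to: Optional[date]  # None marks the current row


def apply_change(history: List[AddressRow], customer_id: int,
                 new_address: str, effective: date) -> List[AddressRow]:
    """Return a new history with the current row closed and a new row opened."""
    updated = []
    for row in history:
        is_current = row.customer_id == customer_id and row.valid_to is None
        if is_current and row.address == new_address:
            return list(history)  # no-op, so replaying the same change is safe
        updated.append(replace(row, valid_to=effective) if is_current else row)
    updated.append(AddressRow(customer_id, new_address, effective, None))
    return updated


def address_as_of(history: List[AddressRow], customer_id: int,
                  as_of: date) -> Optional[str]:
    """Sample query: which address was valid on a given date?"""
    for row in history:
        if row.customer_id != customer_id:
            continue
        if row.valid_from <= as_of and (row.valid_to is None or as_of < row.valid_to):
            return row.address
    return None
```

The half-open interval means a change effective on a given date is visible on that date, and no two rows for the same customer overlap. Being able to state that invariant and check it with an as-of query is the kind of validation the round looks for.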
End to end design questions ("design Netflix viewing history") reward depth in batch vs stream tradeoffs, storage choice, idempotency, late data, and on-call burden.
| Case study | Domain |
|---|---|
| Card Transaction Streaming Pipeline | Real time, exactly once |
| Cellular Connectivity and App Log Data Warehouse | High cardinality |
| Capital Markets Intraday Risk Pipeline | Regulatory lineage |
| Database Replication and Schema Normalization Pipeline | CDC |
| Cost Optimized Clickstream Data Lake | Storage tradeoffs |
| Connected Vehicle Telemetry Pipeline | High volume IoT |
Full set: 120 case studies at datadriven.io/data-pipeline-interview-questions.
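Idempotency, late data, and dedup often reduce to one rule: key every write and let the newest version win. The sketch below assumes each event carries a `txn_id` and a monotonically increasing `updated_at`. Both field names are hypothetical, and an in-memory dict stands in for the serving store.

```python
def upsert_latest(state, events):
    """Fold events into `state`, keeping the newest version per txn_id.

    Replaying a batch or receiving an older event late leaves the state
    unchanged, which is what makes the step safe under at-least-once
    delivery.
    """
    for event in events:
        key = event["txn_id"]
        current = state.get(key)
        if current is None or event["updated_at"] > current["updated_at"]:
            state[key] = event
    return state
```

In a real pipeline the same rule usually shows up as a keyed merge or upsert in the sink rather than a Python loop. The design point is that correctness comes from the write being repeatable, not from the transport delivering each event exactly once.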
Use this on every system design question. In order:
1. Clarify volume, latency, freshness, retention, read pattern.
2. Estimate records per second, bytes per record, total per day.
3. Pick a freshness target (real time, near real time, hourly, daily).
4. Pick batch vs stream, using the arithmetic from beat 2.
5. Pick storage (lake, warehouse, lakehouse, OLAP, kv).
6. Sketch topology (source, ingest, transform, serve).
7. Address failure modes (backfills, replays, late data, dedup).
8. Talk cost and operations (monthly spend, on-call burden).
Long form: datadriven.io/data-engineering-system-design. Companion repo: system-design-for-data-engineers (120 case studies).
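The estimate beat is plain arithmetic, and it helps to have the shape memorized. A minimal sketch follows. The function name and the sample inputs (50,000 records per second at 1,000 bytes each) are made-up numbers for illustration, not figures from any case study.

```python
SECONDS_PER_DAY = 86_400


def estimate_daily_volume(records_per_second, bytes_per_record):
    """Back-of-envelope sizing: records/s and bytes/record to totals per day."""
    records_per_day = records_per_second * SECONDS_PER_DAY
    bytes_per_day = records_per_day * bytes_per_record
    return {
        "records_per_day": records_per_day,
        "bytes_per_second": records_per_second * bytes_per_record,
        "bytes_per_day": bytes_per_day,
        "terabytes_per_day": bytes_per_day / 1e12,  # decimal TB, uncompressed
    }


# Hypothetical workload: 50,000 records/s at 1,000 bytes each.
example = estimate_daily_volume(50_000, 1_000)
```

For those inputs the result is 4.32 billion records and about 4.32 TB per day before compression or replication, at a sustained 50 MB/s. Numbers at that scale are what the later batch vs stream and storage choices hang on.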
Build six STAR stories before your loop, three minutes each. Cover these themes:
- Owned an ambiguous problem end to end.
- Disagreed with a stakeholder and changed their mind.
- Broke production and recovered.
- Mentored or raised the bar.
- Killed a project that was not worth doing.
- Shipped fast then cleaned up.
Practice each one out loud. Twice.
50 common DE behavioral questions: datadriven.io/behavioral-interview-questions.
| Company | Guide |
|---|---|
| Netflix | companies/netflix/interview |
| Uber | companies/uber/interview |
| Amazon | companies/amazon/interview |
| Google | companies/google/interview |
| Meta | companies/meta/interview |
Full company index: datadriven.io/companies.
- data-engineer-interview-handbook. 7 day sprint version of this handbook.
- data-engineering-interview-questions. 1418 tagged practice problems.
- system-design-for-data-engineers. 120 long form pipeline case studies.
- data-engineering-cheatsheet. One page recall reference for the night before.
- data-engineer-interview-prep. 8 week structured practice schedule.
- awesome-data-engineering-interview. Curated list of books, blogs, and tools.
- awesome-data-engineering-interviews. The DataDriven 75, a focused subset of must-do problems.
PRs welcome. Add a worked example, fix a broken link, or share a war story. Run markdownlint before opening a PR.
CC BY-SA 4.0. Linked sandboxes and lessons hosted at datadriven.io.