Data Engineering Interview Handbook

Free, structured prep for data engineering interviews. Chapters by round, worked examples, runnable problems.

Chapters · Study plans · Companion repos · Live sandboxes

Most "interview prep" material was written for software engineers, data scientists, or analysts. This handbook is calibrated for data engineers. Each chapter covers one round of a real DE loop with the specific patterns interviewers test, plus a practice set you can run in a browser at datadriven.io.

If you have a loop coming up soon, jump to the study plans. If you want to drill a specific topic, pick a chapter.

Chapters

Round	Chapter
SQL screen	1. SQL
Coding	2. Python
Modeling	3. Data modeling and schema design
System design	4. Pipeline architecture
System design	5. The eight beat framework
Behavioral	6. Behavioral
Pre-onsite	7. Company guides

Study plans

Timeline	Use when	Outline
4 week sprint	Onsite next month	Week 1 SQL, week 2 Python, week 3 modeling, week 4 design and behavioral
8 week build	Two months out	Run the sprint twice. First pass for breadth, second for the patterns you missed. Add a weekly mock from week 5.
12 week ramp	Switching from analytics or SWE	Weeks 1 to 4 on foundation lessons. Weeks 5 to 8 on modeling and pipelines. Weeks 9 to 12 on company prep.

Printable 12 week version: datadriven.io/data-engineering-study-plan.

1. SQL

DE SQL is not analyst SQL. The bar:

Write a window function from memory.
Reason about query plans, partitioning, and join strategy.
Find the bug in someone else's query.
Recognize when a SQL question is secretly a data quality question.

Lessons in order: joins, aggregating, window functions, filtering, dates, optimization.

Problem	Difficulty	Tests
10 Lowest Uptime Services	Easy	TOP N with ties
2FA Confirmation Rate	Easy	Conditional aggregation
30 Day Page View Counts	Medium	Date filtering
7 Check Rolling Average	Medium	Rolling window, ROWS vs RANGE
Active Users by Month	Hard	Cohort logic
2nd Most Common Content Type	Hard	Tie breaking
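The ROWS vs RANGE distinction tested by the rolling window problems is easiest to see when the ordering column has ties. The sketch below is a minimal illustration, not the handbook's reference solution: it uses Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer), a made-up `checks` table, and a running sum rather than a rolling average so the difference is easy to read.

```python
import sqlite3

# Hypothetical sample data: two checks share the same timestamp (ts = 2).
rows = [(1, 10), (2, 20), (2, 30), (3, 40)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checks (ts INTEGER, latency_ms INTEGER)")
conn.executemany("INSERT INTO checks VALUES (?, ?)", rows)

query = """
SELECT
    ts,
    latency_ms,
    -- ROWS counts physical rows: tied rows get different running sums.
    SUM(latency_ms) OVER (
        ORDER BY ts
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS sum_rows,
    -- RANGE includes every peer of the current row: tied rows share one sum.
    SUM(latency_ms) OVER (
        ORDER BY ts
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS sum_range
FROM checks
ORDER BY ts, latency_ms
"""
result = conn.execute(query).fetchall()
conn.close()

for row in result:
    print(row)
```

With RANGE, both rows at `ts = 2` report the same total because they are peers in the ordering. With ROWS, one of the tied rows sees only part of the tie, and which one depends on an ordering the query never specified. That nondeterminism is a typical "find the bug" prompt.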

Full set: 854 problems at datadriven.io/sql-interview-questions.

2. Python

DE Python is not LeetCode. It is data manipulation: chunking, sessionization, dedup, retries, interval merging, hash partitioning, schema evolution.
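Two of the patterns named above, chunking and interval merging, can be sketched in a few lines each. This is one possible shape, assuming integer interval endpoints and treating touching intervals as mergeable; the function names `chunked` and `merge_intervals` are illustrative and not taken from the practice set.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def merge_intervals(intervals: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching (start, end) intervals."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of appending.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


print(list(chunked(range(5), 2)))
print(merge_intervals([(5, 8), (1, 3), (2, 4), (8, 9)]))
```

`chunked` works on any iterable, including generators and file handles, which matters when the input does not fit in memory. `merge_intervals` sorts first, so it runs in O(n log n).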

Lessons: foundations, collections, complexity.

Problem	Difficulty	Pattern
Batch Records	Easy	Chunking iterables
Column Sum	Easy	Dict aggregation
Activity Time Ledger	Medium	Interval merging
Batch Partitioner	Medium	Hash bucketing
Batch With Metadata	Medium	Stateful iteration
Caesar Shift Check	Hard	String transforms
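For the hash bucketing pattern, a common trap is reaching for Python's built-in `hash()`, which is randomized per process for strings, so the same key can land in different buckets on different runs. A minimal sketch using a stable digest follows. The names `bucket_for` and `partition` and the record layout are assumptions for illustration, not the expected answer to Batch Partitioner.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket using a digest that is stable across processes."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partition(records: Iterable[dict], key_field: str, num_buckets: int) -> Dict[int, List[dict]]:
    """Group records by the bucket of their key field."""
    buckets: Dict[int, List[dict]] = defaultdict(list)
    for record in records:
        buckets[bucket_for(str(record[key_field]), num_buckets)].append(record)
    return dict(buckets)


# Hypothetical records: every row for one user_id lands in the same bucket.
events = [{"user_id": "u1", "n": 1}, {"user_id": "u2", "n": 2}, {"user_id": "u1", "n": 3}]
print(partition(events, "user_id", 4))
```

Changing `num_buckets` reshuffles most keys with plain modulo hashing. That limitation is worth naming out loud in an interview.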

Full set: 388 problems at datadriven.io/python-interview-questions.

3. Data modeling and schema design

Senior loops are won and lost here. Interviewers reward candidates who pick the right grain for fact tables, defend an SCD type, and validate the schema with sample queries.

Lessons: keys, data types, relationships, normalization, dimensional modeling, SCD, event streams, nested data.

Problem	Tests
A/B Experiment Assignment Schema	SCD type 2, sticky bucketing
Customer Address History	Effective dates
Insurance Claims Lifecycle	State machines
Clickstream and Session Schema	Sessionization, late events
Loan Management Schema	Bridge tables
Financial Trading Warehouse	Late arriving facts
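To make "defend an SCD type" concrete, here is a small in-memory sketch of SCD type 2 with effective dates, in the spirit of the Customer Address History problem. The `AddressVersion` record, the half-open `[valid_from, valid_to)` convention, and both function names are assumptions chosen for illustration; a real schema would live in SQL tables.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AddressVersion:
    customer_id: int
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[AddressVersion], customer_id: int,
                 new_address: str, effective: date) -> List[AddressVersion]:
    """Close the customer's open version and append a new one (SCD type 2)."""
    updated = []
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.address == new_address:
                return list(history)  # nothing changed, so add no new version
            row = replace(row, valid_to=effective)
        updated.append(row)
    updated.append(AddressVersion(customer_id, new_address, effective))
    return updated


def address_as_of(history: List[AddressVersion], customer_id: int,
                  as_of: date) -> Optional[str]:
    """Sample query: which address was valid on a given date?"""
    for row in history:
        if (row.customer_id == customer_id
                and row.valid_from <= as_of
                and (row.valid_to is None or as_of < row.valid_to)):
            return row.address
    return None


history = apply_change([], 1, "12 Oak St", date(2024, 1, 1))
history = apply_change(history, 1, "9 Elm Ave", date(2024, 6, 1))
print(address_as_of(history, 1, date(2024, 3, 15)))
```

The as-of lookup doubles as the "validate the schema with sample queries" step: if a point-in-time question cannot be answered from the rows, the grain or the effective dates are wrong.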

Full set: 56 problems at datadriven.io/data-modeling-interview-questions.

4. Pipeline architecture

End to end design questions ("design Netflix viewing history") reward depth in batch vs stream tradeoffs, storage choice, idempotency, late data, and on-call burden.

Case study	Domain
Card Transaction Streaming Pipeline	Real time, exactly once
Cellular Connectivity and App Log Data Warehouse	High cardinality
Capital Markets Intraday Risk Pipeline	Regulatory lineage
Database Replication and Schema Normalization Pipeline	CDC
Cost Optimized Clickstream Data Lake	Storage tradeoffs
Connected Vehicle Telemetry Pipeline	High volume IoT
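Idempotency and late data are easier to discuss with a toy model in hand. The sketch below assumes every event carries a unique `event_id`, a `version` that increases on corrections, and an `event_date` taken from event time rather than arrival time. All field names and the in-memory `store` are hypothetical stand-ins for a keyed table.

```python
from collections import defaultdict
from typing import Dict, Iterable


def load_events(store: Dict[str, dict], batch: Iterable[dict]) -> Dict[str, dict]:
    """Upsert by event_id so replaying a batch leaves the store unchanged."""
    for event in batch:
        existing = store.get(event["event_id"])
        # Keep the highest version; a replayed or stale copy is ignored.
        if existing is None or event["version"] > existing["version"]:
            store[event["event_id"]] = event
    return store


def daily_totals(store: Dict[str, dict]) -> Dict[str, int]:
    """Aggregate by event date, so late arrivals update the day they belong to."""
    totals: Dict[str, int] = defaultdict(int)
    for event in store.values():
        totals[event["event_date"]] += event["amount"]
    return dict(totals)


store: Dict[str, dict] = {}
batch = [
    {"event_id": "a", "event_date": "2024-05-01", "amount": 10, "version": 1},
    {"event_id": "b", "event_date": "2024-05-01", "amount": 5, "version": 1},
]
load_events(store, batch)
load_events(store, batch)  # replay: totals must not double
print(daily_totals(store))
```

Because the write is keyed on `event_id`, a retry or backfill can rerun the same batch safely. Grouping by event time means a record that arrives a day late still corrects the right daily total.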

Full set: 120 case studies at datadriven.io/data-pipeline-interview-questions.

5. The eight beat framework

Use this on every system design question. In order:

1. Clarify volume, latency, freshness, retention, read pattern.
2. Estimate records per second, bytes per record, total per day.
3. Pick freshness target (real time, near real time, hourly, daily).
4. Pick batch vs stream with arithmetic from beat 2.
5. Pick storage (lake, warehouse, lakehouse, OLAP, kv).
6. Sketch topology (source, ingest, transform, serve).
7. Address failure modes (backfills, replays, late data, dedup).
8. Talk cost and operations (monthly spend, on-call burden).
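The estimation beat is plain arithmetic, and it helps to have the conversion memorized. The sketch below uses made-up inputs (5,000 records per second at 1,000 bytes each) and decimal units; the function name `estimate` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(records_per_second: int, bytes_per_record: int) -> dict:
    """Back-of-envelope sizing: throughput and daily volume."""
    records_per_day = records_per_second * SECONDS_PER_DAY
    bytes_per_day = records_per_day * bytes_per_record
    return {
        "mb_per_second": records_per_second * bytes_per_record / 1e6,
        "records_per_day": records_per_day,
        "gb_per_day": bytes_per_day / 1e9,
    }


# Assumed workload for illustration only.
print(estimate(records_per_second=5_000, bytes_per_record=1_000))
# {'mb_per_second': 5.0, 'records_per_day': 432000000, 'gb_per_day': 432.0}
```

Numbers like 5 MB per second and 432 GB per day are what the batch vs stream and storage decisions should be argued from, together with the freshness target.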

Long form: datadriven.io/data-engineering-system-design. Companion repo: system-design-for-data-engineers (120 case studies).

6. Behavioral

Build six STAR stories before your loop, three minutes each. Cover these themes:

Owned an ambiguous problem end to end.
Disagreed with a stakeholder and changed their mind.
Broke production and recovered.
Mentored or raised the bar.
Killed a project that was not worth doing.
Shipped fast then cleaned up.

Practice each one out loud. Twice.

50 common DE behavioral questions: datadriven.io/behavioral-interview-questions.

7. Company guides

Company	Guide
Netflix	companies/netflix/interview
Uber	companies/uber/interview
Amazon	companies/amazon/interview
Google	companies/google/interview
Meta	companies/meta/interview

Full company index: datadriven.io/companies.

Companion repos

data-engineer-interview-handbook. 7 day sprint version of this handbook.
data-engineering-interview-questions. 1418 tagged practice problems.
system-design-for-data-engineers. 120 long form pipeline case studies.
data-engineering-cheatsheet. One page recall reference for the night before.
data-engineer-interview-prep. 8 week structured practice schedule.
awesome-data-engineering-interview. Curated list of books, blogs, and tools.
awesome-data-engineering-interviews. The DataDriven 75, a focused subset of must-do problems.

Contributing

PRs welcome. Add a worked example, fix a broken link, or share a war story. Run markdownlint before opening a PR.

License

CC BY-SA 4.0. Linked sandboxes and lessons are hosted at datadriven.io.
