datadriven-io/data-engineering-interview-handbook

Data Engineering Interview Handbook

Free, structured prep for data engineering interviews. Chapters by round, worked examples, runnable problems.

Chapters · Study plans · Companion repos · Live sandboxes


Most "interview prep" material was written for software engineers, data scientists, or analysts. This handbook is calibrated for data engineers. Each chapter covers one round of a real DE loop with the specific patterns interviewers test, plus a practice set you can run in a browser at datadriven.io.

If you have a loop coming up soon, jump to the study plans. If you want to drill a specific topic, pick a chapter.

Chapters

| Round | Chapter |
| --- | --- |
| SQL screen | 1. SQL |
| Coding | 2. Python |
| Modeling | 3. Data modeling and schema design |
| System design | 4. Pipeline architecture |
| System design | 5. The eight beat framework |
| Behavioral | 6. Behavioral |
| Pre-onsite | 7. Company guides |

Study plans

| Timeline | Use when | Outline |
| --- | --- | --- |
| 4 week sprint | Onsite next month | Week 1 SQL, week 2 Python, week 3 modeling, week 4 design and behavioral |
| 8 week build | Two months out | Run the sprint twice. First pass for breadth, second for the patterns you missed. Add a weekly mock from week 5. |
| 12 week ramp | Switching from analytics or SWE | Weeks 1 to 4 on foundation lessons. Weeks 5 to 8 on modeling and pipelines. Weeks 9 to 12 on company prep. |

Printable 12 week version: datadriven.io/data-engineering-study-plan.

1. SQL

DE SQL is not analyst SQL. The bar:

  1. Write a window function from memory.
  2. Reason about query plans, partitioning, and join strategy.
  3. Find the bug in someone else's query.
  4. Recognize when a SQL question is secretly a data quality question.
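
As a sketch of item 1, here is one window function pattern worth knowing cold: top N with ties via `DENSE_RANK`. It runs against an in-memory SQLite database from Python so it is self-contained. The table name `uptime`, its columns, and the sample rows are hypothetical, and the query assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3


def lowest_n_with_ties(rows, n):
    """Return every service whose uptime is among the n lowest distinct values."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, used only for illustration.
    conn.execute("CREATE TABLE uptime (service TEXT, pct REAL)")
    conn.executemany("INSERT INTO uptime VALUES (?, ?)", rows)
    query = """
        SELECT service, pct
        FROM (
            SELECT service,
                   pct,
                   -- DENSE_RANK gives tied rows the same rank with no gaps,
                   -- so ties at the cutoff are all kept.
                   DENSE_RANK() OVER (ORDER BY pct ASC) AS rnk
            FROM uptime
        )
        WHERE rnk <= ?
        ORDER BY pct, service
    """
    result = conn.execute(query, (n,)).fetchall()
    conn.close()
    return result


sample = [("api", 99.1), ("auth", 97.5), ("search", 97.5), ("billing", 99.9)]
print(lowest_n_with_ties(sample, 2))
```

With `n = 2` the result has three rows, because two services tie for the lowest value. A plain `LIMIT 2` would drop one of the tied rows arbitrarily, which is the usual follow-up question.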

Lessons in order: joins, aggregating, window functions, filtering, dates, optimization.

| Problem | Difficulty | Tests |
| --- | --- | --- |
| 10 Lowest Uptime Services | Easy | TOP N with ties |
| 2FA Confirmation Rate | Easy | Conditional aggregation |
| 30 Day Page View Counts | Medium | Date filtering |
| 7 Check Rolling Average | Medium | Rolling window, ROWS vs RANGE |
| Active Users by Month | Hard | Cohort logic |
| 2nd Most Common Content Type | Hard | Tie breaking |
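
The ROWS vs RANGE distinction in the table is easiest to see with tied ordering keys. The sketch below is not taken from the linked problem. It uses a hypothetical `checks` table in in-memory SQLite (3.25 or later assumed) and computes a running sum both ways.

```python
import sqlite3


def running_sums(rows):
    """Compare ROWS and RANGE frames on a running sum with tied order keys."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: id is unique, day can repeat.
    conn.execute("CREATE TABLE checks (id INTEGER, day INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO checks VALUES (?, ?, ?)", rows)
    query = """
        SELECT id,
               -- ROWS counts physical rows; id breaks ties so the order is stable.
               SUM(amount) OVER (
                   ORDER BY day, id
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS rows_sum,
               -- RANGE includes every peer row that shares the current day.
               SUM(amount) OVER (
                   ORDER BY day
                   RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS range_sum
        FROM checks
        ORDER BY id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample_checks = [(1, 1, 10), (2, 2, 20), (3, 2, 30), (4, 3, 40)]
for row in running_sums(sample_checks):
    print(row)
```

Rows 2 and 3 share `day = 2`. Under RANGE both rows see the full total for that day, while under ROWS the sum grows one row at a time. The same effect appears when a query has `ORDER BY` in the window but no explicit frame, since the default frame is RANGE based.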

Full set: 854 problems at datadriven.io/sql-interview-questions.

2. Python

DE Python is not LeetCode. It is data manipulation: chunking, sessionization, dedup, retries, interval merging, hash partitioning, schema evolution.
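
Two of these patterns, chunking and interval merging, fit in a few lines each. The following is a minimal sketch, not the reference solution to any linked problem. The function names are hypothetical, and it assumes that touching intervals such as (8, 10) and (10, 12) should merge.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without loading the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) pairs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend its end if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


print(list(chunked(range(5), 2)))
print(merge_intervals([(1, 3), (2, 6), (8, 10), (10, 12)]))
```

`chunked` works on any iterable, including generators and file handles, which is usually the point of the question: the input may not fit in memory.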

Lessons: foundations, collections, complexity.

| Problem | Difficulty | Pattern |
| --- | --- | --- |
| Batch Records | Easy | Chunking iterables |
| Column Sum | Easy | Dict aggregation |
| Activity Time Ledger | Medium | Interval merging |
| Batch Partitioner | Medium | Hash bucketing |
| Batch With Metadata | Medium | Stateful iteration |
| Caesar Shift Check | Hard | String transforms |
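
Hash bucketing has one trap worth rehearsing: Python's built-in `hash()` on strings is randomized per process by default, so it is a poor choice for partition assignment that must be stable across runs. The sketch below swaps in `zlib.crc32` as a deterministic hash. The function names and record shape are hypothetical, and this is not the solution to the Batch Partitioner problem.

```python
import zlib


def bucket_for(key, num_buckets):
    """Map a string key to a bucket index that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def partition(records, key_field, num_buckets):
    """Group records into num_buckets lists by the hash of one field."""
    buckets = [[] for _ in range(num_buckets)]
    for record in records:
        index = bucket_for(str(record[key_field]), num_buckets)
        buckets[index].append(record)
    return buckets


events = [
    {"user": "ana", "action": "view"},
    {"user": "bo", "action": "click"},
    {"user": "ana", "action": "click"},
]
for index, bucket in enumerate(partition(events, "user", 4)):
    print(index, bucket)
```

All records for the same key land in the same bucket, which is the property downstream joins and aggregations depend on.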

Full set: 388 problems at datadriven.io/python-interview-questions.

3. Data modeling and schema design

Senior loops are won and lost here. Interviewers reward candidates who pick the right grain for fact tables, defend an SCD type, and validate the schema with sample queries.
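
To make "defend an SCD type" concrete, here is a minimal in-memory sketch of SCD type 2 with effective dates. The `AddressVersion` class, its field names, and the half-open `[valid_from, valid_to)` convention are assumptions for illustration. In an interview the same logic would usually be expressed as a table plus a merge statement.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class AddressVersion:
    customer_id: int
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current row


def apply_change(history: List[AddressVersion], customer_id: int,
                 new_address: str, effective: date) -> None:
    """Close the current row and append a new version (SCD type 2)."""
    current = next(
        (r for r in history
         if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.address == new_address:
            return  # No change, so no new version.
        current.valid_to = effective
    history.append(AddressVersion(customer_id, new_address, effective))


def address_as_of(history: List[AddressVersion], customer_id: int,
                  as_of: date) -> Optional[str]:
    """Point-in-time lookup over half-open [valid_from, valid_to) ranges."""
    for r in history:
        if (r.customer_id == customer_id and r.valid_from <= as_of
                and (r.valid_to is None or as_of < r.valid_to)):
            return r.address
    return None


history: List[AddressVersion] = []
apply_change(history, 1, "12 Oak St", date(2024, 1, 1))
apply_change(history, 1, "9 Elm Ave", date(2024, 6, 1))
print(address_as_of(history, 1, date(2024, 3, 1)))
print(address_as_of(history, 1, date(2024, 6, 1)))
```

The point-in-time lookup doubles as the "validate the schema with sample queries" step: if the as-of question cannot be answered cleanly, the grain or the date columns are wrong.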

Lessons: keys, data types, relationships, normalization, dimensional modeling, SCD, event streams, nested data.

| Problem | Tests |
| --- | --- |
| A/B Experiment Assignment Schema | SCD type 2, sticky bucketing |
| Customer Address History | Effective dates |
| Insurance Claims Lifecycle | State machines |
| Clickstream and Session Schema | Sessionization, late events |
| Loan Management Schema | Bridge tables |
| Financial Trading Warehouse | Late arriving facts |

Full set: 56 problems at datadriven.io/data-modeling-interview-questions.

4. Pipeline architecture

End to end design questions ("design Netflix viewing history") reward depth in batch vs stream tradeoffs, storage choice, idempotency, late data, and on-call burden.
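
Idempotency and late data are easier to discuss with a concrete shape in mind. The sketch below models a sink as a dictionary keyed by event ID, with last-writer-wins on a version timestamp. The field names `event_id` and `updated_at` are hypothetical. Real sinks would express the same rule as a merge or upsert.

```python
def apply_batch(state, events):
    """Upsert events by event_id, keeping the newest version of each.

    Replaying a batch, or receiving an older version late, leaves the
    state unchanged, which is what makes the load idempotent.
    """
    for event in events:
        existing = state.get(event["event_id"])
        if existing is None or event["updated_at"] > existing["updated_at"]:
            state[event["event_id"]] = event
    return state


batch = [
    {"event_id": "a", "updated_at": 2, "status": "shipped"},
    {"event_id": "b", "updated_at": 1, "status": "created"},
]
late = [{"event_id": "a", "updated_at": 1, "status": "created"}]

state = {}
apply_batch(state, batch)
apply_batch(state, batch)  # Replay: no effect.
apply_batch(state, late)   # Late, older version: ignored.
print(state["a"]["status"])
```

The design choice to state out loud is the tiebreak rule. Comparing on a source-side version or timestamp lets retries and backfills run safely, at the cost of needing that field to be trustworthy.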

| Case study | Domain |
| --- | --- |
| Card Transaction Streaming Pipeline | Real time, exactly once |
| Cellular Connectivity and App Log Data Warehouse | High cardinality |
| Capital Markets Intraday Risk Pipeline | Regulatory lineage |
| Database Replication and Schema Normalization Pipeline | CDC |
| Cost Optimized Clickstream Data Lake | Storage tradeoffs |
| Connected Vehicle Telemetry Pipeline | High volume IoT |

Full set: 120 case studies at datadriven.io/data-pipeline-interview-questions.

5. The eight beat framework

Use this on every system design question. In order:

  1. Clarify volume, latency, freshness, retention, read pattern.
  2. Estimate records per second, bytes per record, total per day.
  3. Pick freshness target (real time, near real time, hourly, daily).
  4. Pick batch vs stream with arithmetic from beat 2.
  5. Pick storage (lake, warehouse, lakehouse, OLAP, kv).
  6. Sketch topology (source, ingest, transform, serve).
  7. Address failure modes (backfills, replays, late data, dedup).
  8. Talk cost and operations (monthly spend, on-call burden).
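
Beat 2 is plain arithmetic, and it helps to have the shape memorized. A minimal sketch follows, using made-up inputs of 50,000 records per second at 200 bytes each. Both numbers are assumptions for illustration, not figures from any case study.

```python
SECONDS_PER_DAY = 86_400


def daily_volume(records_per_second, bytes_per_record):
    """Back-of-envelope throughput and daily volume (decimal gigabytes)."""
    records_per_day = records_per_second * SECONDS_PER_DAY
    bytes_per_day = records_per_day * bytes_per_record
    return {
        "records_per_day": records_per_day,
        "bytes_per_second": records_per_second * bytes_per_record,
        "gb_per_day": bytes_per_day / 1e9,
    }


estimate = daily_volume(records_per_second=50_000, bytes_per_record=200)
print(estimate)
```

With these inputs the stream is 10 MB per second and 864 GB per day before compression or replication. Those two figures feed directly into beat 4 (whether a single batch job could keep up) and beat 8 (storage spend).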

Long form: datadriven.io/data-engineering-system-design. Companion repo: system-design-for-data-engineers (120 case studies).

6. Behavioral

Build six STAR stories before your loop, three minutes each. Cover these themes:

  1. Owned an ambiguous problem end to end.
  2. Disagreed with a stakeholder and changed their mind.
  3. Broke production and recovered.
  4. Mentored or raised the bar.
  5. Killed a project that was not worth doing.
  6. Shipped fast then cleaned up.

Practice each one out loud. Twice.

50 common DE behavioral questions: datadriven.io/behavioral-interview-questions.

7. Company guides

| Company | Guide |
| --- | --- |
| Netflix | companies/netflix/interview |
| Uber | companies/uber/interview |
| Amazon | companies/amazon/interview |
| Google | companies/google/interview |
| Meta | companies/meta/interview |

Full company index: datadriven.io/companies.

Companion repos

Contributing

PRs welcome. Add a worked example, fix a broken link, or share a war story. Run markdownlint before opening a PR.

License

CC BY-SA 4.0. Linked sandboxes and lessons are hosted at datadriven.io.

About

The complete free handbook for data engineering interviews. Covers SQL, Python, schema design, pipeline architecture, system design, and behavioral rounds. With study plans and 1400+ practice problems.
