Awesome Data Engineering Interviews

The DataDriven 75. A focused, hand-picked set of 75 real data engineering interview questions, organized by the patterns interviewers actually test.

SQL · Python · Schema · Pipelines · Companion repos

Every problem links to a runnable browser sandbox at datadriven.io. No login required.

If you want a wider net, the sibling repo data-engineering-interview-questions has the full 1418 question bank.

SQL (33)

Aggregation and filtering

GROUP BY, HAVING, conditional aggregation, and the difference between filtering rows and filtering groups.
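As a rough illustration of filtering rows versus filtering groups, the sketch below runs one query through Python's built-in sqlite3 module. The `orders` table and its values are made up for this example and are not taken from any question in the list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, status TEXT, amount REAL);  -- hypothetical table
    INSERT INTO orders VALUES
        ('ann', 'paid', 50), ('ann', 'paid', 70), ('ann', 'refunded', 20),
        ('bob', 'paid', 30), ('bob', 'refunded', 90);
""")

# WHERE drops individual rows before grouping (refunds never reach SUM).
# HAVING drops whole groups after aggregation (customers at or below 100 paid).
big_payers = conn.execute("""
    SELECT customer, SUM(amount) AS paid_total
    FROM orders
    WHERE status = 'paid'
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY customer
""").fetchall()

print(big_payers)
```

Moving the `status = 'paid'` condition into HAVING would not work here, because `status` is not constant within a customer group.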

Joins and subqueries

Inner, left, anti, and self joins. The choice between joining and using a subquery or EXISTS.
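One way to see the join versus EXISTS choice is an anti join written both ways. This is a minimal sketch on sqlite3 with hypothetical `customers` and `orders` tables; both queries should return the customers who have no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);   -- hypothetical tables
    CREATE TABLE orders (customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders VALUES (1), (1), (3);
""")

# Anti join via NOT EXISTS: keep customers with no matching order row.
via_not_exists = conn.execute("""
    SELECT c.name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()

# Same result via LEFT JOIN plus an IS NULL filter on the right side.
via_left_join = conn.execute("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.customer_id IS NULL
    ORDER BY c.name
""").fetchall()

print(via_not_exists, via_left_join)
```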

NULL handling and dedup

COALESCE, NULL safe comparisons, ROW_NUMBER deduplication, DISTINCT vs GROUP BY.
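The following sketch shows ROW_NUMBER deduplication together with COALESCE and a NULL safe comparison. It assumes a SQLite build with window function support (3.25 or newer, which recent Python releases normally bundle); the `users_raw` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_raw (user_id INTEGER, email TEXT, updated_at TEXT);  -- hypothetical
    INSERT INTO users_raw VALUES
        (1, 'a@old', '2024-01-01'),
        (1, 'a@new', '2024-02-01'),
        (2, NULL,    '2024-01-15');
""")

# Keep only the most recent row per user_id; replace a missing email with a default.
latest = conn.execute("""
    SELECT user_id, COALESCE(email, 'unknown') AS email
    FROM (
        SELECT user_id, email,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
        FROM users_raw
    )
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()

# NULL = NULL evaluates to NULL, while SQLite's IS operator compares NULLs safely.
null_check = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

print(latest, null_check)
```

Other engines spell the NULL safe comparison differently (for example `IS NOT DISTINCT FROM` in standard SQL), so the `IS` form here is specific to SQLite.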

Date and time

Date arithmetic, day of week and hour bucketing, time window filters, timezone gotchas.
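A small sketch of hour bucketing and one common timezone gotcha: the same events land on different calendar days depending on the zone used for bucketing. The event timestamps and the fixed UTC-5 offset are assumptions chosen for illustration; a real pipeline would more likely use named zones with daylight saving rules.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical event timestamps, stored in UTC.
events_utc = [
    datetime(2024, 3, 2, 2, 30, tzinfo=timezone.utc),
    datetime(2024, 3, 2, 14, 5, tzinfo=timezone.utc),
    datetime(2024, 3, 2, 14, 55, tzinfo=timezone.utc),
]
local_tz = timezone(timedelta(hours=-5))  # fixed offset, no DST handling


def hour_bucket(ts, tz):
    """Truncate a timestamp to the start of its hour in the given zone."""
    return ts.astimezone(tz).replace(minute=0, second=0, microsecond=0)


# Daily counts differ between UTC and local time for the same events.
utc_days = Counter(ts.date() for ts in events_utc)
local_days = Counter(ts.astimezone(local_tz).date() for ts in events_utc)
local_hours = Counter(hour_bucket(ts, local_tz) for ts in events_utc)

print(utc_days, local_days, local_hours, sep="\n")
```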

Window functions

ROW_NUMBER, RANK, LAG, LEAD, running aggregates, partitioned ORDER BY.
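To make LAG and running aggregates concrete, here is a minimal sqlite3 sketch over a hypothetical `daily_sales` table. It assumes SQLite 3.25 or newer for window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (day TEXT, amount INTEGER);  -- hypothetical table
    INSERT INTO daily_sales VALUES
        ('2024-01-01', 10), ('2024-01-02', 15), ('2024-01-03', 5);
""")

# LAG looks one row back in the window order; SUM ... OVER (ORDER BY)
# accumulates from the first row up to the current one.
rows = conn.execute("""
    SELECT day,
           amount,
           LAG(amount) OVER (ORDER BY day) AS prev_amount,
           SUM(amount) OVER (ORDER BY day) AS running_total
    FROM daily_sales
    ORDER BY day
""").fetchall()

for row in rows:
    print(row)
```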

CTEs and recursion

WITH clauses for query layering, recursive CTEs for hierarchies and graphs, multi step funnels.
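A recursive CTE over a hierarchy might look like the sketch below. The `employees` table and its rows are invented for the example; the query walks from the root (no manager) downward and records each person's depth.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);  -- hypothetical
    INSERT INTO employees VALUES
        (1, 'ceo', NULL), (2, 'vp', 1), (3, 'eng', 2), (4, 'ops', 1);
""")

org_chart = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        -- anchor member: the root of the hierarchy
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- recursive member: direct reports of rows already in the chain
        SELECT e.id, e.name, c.depth + 1
        FROM employees e
        JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()

print(org_chart)
```

On graphs that may contain cycles, the recursive member needs an extra guard (such as a depth limit or a visited path), which this sketch omits.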

Set ops and pivots

UNION, INTERSECT, EXCEPT, conditional aggregation pivots that turn rows into columns.
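The sketch below shows EXCEPT and INTERSECT on two small snapshots, followed by a conditional aggregation pivot. All table names and values are hypothetical, and the pivot assumes the set of quarters is known in advance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yesterday (id INTEGER);   -- hypothetical snapshot tables
    CREATE TABLE today (id INTEGER);
    INSERT INTO yesterday VALUES (1), (2), (3);
    INSERT INTO today VALUES (2), (3), (4);

    CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 'q1', 10), ('east', 'q2', 20), ('west', 'q1', 5);
""")

# Set operations: ids that are new today, and ids present in both snapshots.
new_ids = conn.execute(
    "SELECT id FROM today EXCEPT SELECT id FROM yesterday ORDER BY id"
).fetchall()
kept_ids = conn.execute(
    "SELECT id FROM today INTERSECT SELECT id FROM yesterday ORDER BY id"
).fetchall()

# Pivot: one CASE expression per output column turns quarter rows into columns.
pivot = conn.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

print(new_ids, kept_ids, pivot)
```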

Python and PySpark (30)

Data structures and iteration

Dicts, sets, defaultdicts, nested data, iteration patterns that turn raw records into lookups.
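As an example of turning raw records into lookups, this sketch builds a forward and an inverted index with defaultdict in a single pass. The record shape is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical raw records with nested lists.
records = [
    {"user": "ann", "tags": ["sql", "python"]},
    {"user": "bob", "tags": ["sql"]},
    {"user": "ann", "tags": ["spark"]},
]

tags_by_user = defaultdict(set)   # user -> set of tags
users_by_tag = defaultdict(set)   # tag  -> set of users (inverted index)

for rec in records:
    for tag in rec["tags"]:
        tags_by_user[rec["user"]].add(tag)
        users_by_tag[tag].add(rec["user"])

print(dict(tags_by_user))
print(dict(users_by_tag))
```

Sets make membership checks and deduplication cheap; a `defaultdict(list)` would preserve order and duplicates instead.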

Aggregation and bucketing

Group by counting, running totals, histograms, bucket assignment in pure Python.
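Bucket assignment, a histogram, and running totals can be sketched in pure Python with the standard library. The latency values, bucket edges, and labels below are made up for the example.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

latencies_ms = [12, 48, 51, 99, 100, 250]      # hypothetical measurements
edges = [50, 100, 200]                          # upper bounds are exclusive
labels = ["<50", "50-99", "100-199", "200+"]


def bucket(value):
    """Return the label of the bucket a value falls into."""
    return labels[bisect_right(edges, value)]


histogram = Counter(bucket(v) for v in latencies_ms)
running_totals = list(accumulate(latencies_ms))

print(histogram)
print(running_totals)
```

Using `bisect_right` keeps assignment at O(log n) per value and puts a value equal to an edge into the higher bucket.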

Streaming and I/O

Reading large files line by line, chunked CSV processing, parsing log lines, partitioned output.
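A minimal sketch of chunked CSV processing: a generator yields fixed-size lists of rows so that only one chunk is held in memory at a time. The helper name `read_in_chunks` and the sample data are hypothetical; an in-memory `StringIO` stands in for a large file.

```python
import csv
import io
from itertools import islice


def read_in_chunks(file_obj, chunk_size):
    """Yield lists of up to chunk_size rows from a CSV file object."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for a large file on disk.
sample = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")

chunk_totals = [
    sum(int(row["amount"]) for row in chunk)
    for chunk in read_in_chunks(sample, chunk_size=2)
]
print(chunk_totals)
```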

Merge and reconcile

Joining two record streams in Python, reconciling schema drift, computing snapshot diffs.
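Computing a snapshot diff can be sketched by keying both snapshots on an identifier and comparing key sets. The function name `diff_snapshots` and the sample records are assumptions for illustration.

```python
def diff_snapshots(old, new, key="id"):
    """Compare two lists of dict records and report added, removed, changed keys."""
    old_by_key = {rec[key]: rec for rec in old}
    new_by_key = {rec[key]: rec for rec in new}

    added = sorted(new_by_key.keys() - old_by_key.keys())
    removed = sorted(old_by_key.keys() - new_by_key.keys())
    changed = sorted(
        k for k in old_by_key.keys() & new_by_key.keys()
        if old_by_key[k] != new_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots of the same table on two days.
old_snapshot = [
    {"id": 1, "plan": "free"},
    {"id": 2, "plan": "pro"},
    {"id": 3, "plan": "free"},
]
new_snapshot = [
    {"id": 2, "plan": "team"},
    {"id": 3, "plan": "free"},
    {"id": 4, "plan": "free"},
]

diff = diff_snapshots(old_snapshot, new_snapshot)
print(diff)
```

This version holds both snapshots in memory; for inputs that do not fit, a sorted merge over two streams is the usual alternative.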

Functional patterns and generators

Generators, decorators, context managers, higher order functions for throttles and timers.
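The sketch below puts a generator, a decorator, and a context manager side by side. The names `batched`, `timed`, and `stopwatch` are hypothetical helpers written for this example, not part of any library API being described.

```python
import time
from contextlib import contextmanager
from functools import wraps


def batched(iterable, size):
    """Generator: yield lists of up to `size` items, lazily."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def timed(func):
    """Decorator: record the duration of the last call on the wrapper."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            wrapper.last_elapsed = time.perf_counter() - start
    wrapper.last_elapsed = None
    return wrapper


@contextmanager
def stopwatch(results, label):
    """Context manager: store the elapsed time of a block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


@timed
def total(values):
    return sum(values)


timings = {}
with stopwatch(timings, "batching"):
    batches = list(batched(range(5), 2))
result = total([1, 2, 3])
print(batches, result, timings, total.last_elapsed)
```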

Data processing and transforms

Multi step record transformation, schema cleanup, column wise transforms, pipeline composition.
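Pipeline composition can be sketched as a chain of small, pure record transforms. The step functions and the `compose` helper below are hypothetical, and the record fields are assumed for the example.

```python
from functools import reduce


def strip_strings(rec):
    """Trim whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}


def lowercase_email(rec):
    return {**rec, "email": rec["email"].lower()}


def cast_age(rec):
    return {**rec, "age": int(rec["age"])}


def compose(*steps):
    """Build one function that applies the steps left to right."""
    def run(rec):
        return reduce(lambda acc, step: step(acc), steps, rec)
    return run


clean = compose(strip_strings, lowercase_email, cast_age)

raw = [{"email": " Ann@Example.com ", "age": "31"}]  # hypothetical input
cleaned = [clean(rec) for rec in raw]
print(cleaned)
```

Each step returns a new dict rather than mutating its input, which keeps steps easy to test and reorder.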

Data Modeling (6)

Normalization and ERDs

Designing normalized tables, picking primary and foreign keys, modeling slowly changing dimensions.
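One common way to model a slowly changing dimension is the Type 2 approach, where each change closes the current row and appends a new one. The sketch below assumes that approach and a hypothetical customer dimension with `valid_from`, `valid_to`, and `is_current` columns; other SCD types handle history differently.

```python
def apply_scd2(dim_rows, customer_id, new_city, change_date):
    """Return a new row list with a Type 2 change applied (input is not mutated)."""
    current = next(
        (r for r in dim_rows if r["customer_id"] == customer_id and r["is_current"]),
        None,
    )
    if current is not None and current["city"] == new_city:
        return list(dim_rows)  # attribute unchanged, nothing to version

    # Close the current version, keep every other row as is.
    updated = [
        {**r, "valid_to": change_date, "is_current": False} if r is current else r
        for r in dim_rows
    ]
    # Append the new current version.
    updated.append({
        "customer_id": customer_id,
        "city": new_city,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    })
    return updated


# Hypothetical starting dimension with one current row.
dim = [{
    "customer_id": 1, "city": "Oslo",
    "valid_from": "2023-01-01", "valid_to": None, "is_current": True,
}]
history = apply_scd2(dim, 1, "Bergen", "2024-06-01")
for row in history:
    print(row)
```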

Star schemas and warehouses

Fact and dimension tables, grain selection, conformed dimensions, star vs snowflake tradeoffs.
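A tiny star schema can be sketched in sqlite3 as one fact table joined to one dimension. The table names, columns, and the chosen grain (one row per order line) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: one row per product (hypothetical).
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);

    -- Fact: grain is one row per order line (hypothetical).
    CREATE TABLE fact_order_line (
        order_id INTEGER,
        line_no INTEGER,
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_order_line VALUES
        (100, 1, 1, 2, 20.0),
        (100, 2, 2, 1, 60.0),
        (101, 1, 1, 1, 10.0);
""")

# Typical star query: aggregate facts, slice by a dimension attribute.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue) AS revenue
    FROM fact_order_line f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(revenue_by_category)
```

In a snowflake variant, `category` would move to its own table referenced from `dim_product`, trading an extra join for less repetition.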

Pipeline Architecture (6)

Batch vs streaming

Choosing between batch and streaming, handling late arriving data, designing for consistency and replay.
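Handling late arriving data is often explained with event-time windows and a watermark. The sketch below is a simplified, single-process illustration under assumed rules: tumbling windows, a watermark that trails the largest event time seen by a fixed allowed lateness, and events dropped once their window has closed. Real stream processors offer more options than this.

```python
from collections import defaultdict


def window_counts(event_times, window_size, allowed_lateness):
    """Count events per tumbling window; drop events whose window already closed.

    event_times is an iterable of integer event times, in arrival order.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for event_time in event_times:
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - allowed_lateness)
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end <= watermark:
            dropped.append(event_time)      # too late, window is closed
        else:
            counts[window_start] += 1       # on time, or late but tolerated
    return dict(counts), dropped


# Hypothetical arrival order: 8 arrives late but in time, 3 arrives too late.
counts, dropped = window_counts([1, 5, 12, 8, 25, 3], window_size=10, allowed_lateness=5)
print(counts, dropped)
```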

ETL and connectors

CDC connectors, incremental sync strategies, schema drift handling, auto scaling for variable load.
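One simple incremental sync strategy is cursor-based polling with a high-water mark, sketched below. This is not log-based CDC; it assumes every source row carries a reliable `updated_at` value, and the function name and row shape are hypothetical.

```python
def incremental_sync(source_rows, target, cursor):
    """Upsert rows newer than the cursor into target; return the new cursor."""
    new_cursor = cursor
    for row in source_rows:
        # Strict comparison: rows sharing the cursor timestamp are skipped,
        # so a real connector may need a tie-breaker or an overlap window.
        if row["updated_at"] > cursor:
            target[row["id"]] = row          # upsert keyed by id, safe to replay
            new_cursor = max(new_cursor, row["updated_at"])
    return new_cursor


target = {}
source = [
    {"id": 1, "updated_at": 10, "value": "a"},
    {"id": 2, "updated_at": 20, "value": "b"},
]
cursor = incremental_sync(source, target, cursor=0)

# A later update to id 1 is picked up on the next run; id 2 is not re-read.
source[0] = {"id": 1, "updated_at": 30, "value": "a2"}
cursor_after_update = incremental_sync(source, target, cursor)
print(cursor, cursor_after_update, target)
```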

How to use this list

You have	Strategy
4+ weeks	Work through every question in order. Each pattern builds on the last.
1 to 2 weeks	Focus on SQL and pipeline architecture, the two highest-weight rounds.
48 hours	Skim every pattern description. Solve only the patterns you have not seen before.
Stuck on a question	Read the prompt, attempt a solution, then check the discussion.

Companion repos

data-engineering-interview-questions. The full 1418 question bank.
data-engineering-interview-handbook. The flagship handbook with chapters and study plans.
awesome-data-engineering-interview. Curated resource list (singular).
system-design-for-data-engineers. 120 long form pipeline case studies.
data-engineer-interview-prep. 8 week structured practice schedule.
data-engineering-cheatsheet. One page recall reference.
data-engineer-interview-handbook. 7 day sprint version of the handbook.

Contributing

This is a curated list. To suggest a question, open an issue with:

The pattern the question tests
A real source (interview report, blog, your own experience)
Why it belongs in the 75 rather than the long tail

License

CC BY 4.0
