datadriven-io/awesome-data-engineering-interviews

Awesome Data Engineering Interviews

The DataDriven 75. A focused, hand-picked set of 75 real data engineering interview questions, organized by the patterns interviewers actually test.

SQL · Python · Schema · Pipelines · Companion repos


Every problem links to a runnable browser sandbox at datadriven.io. No login required.

If you want a wider net, the sibling repo data-engineering-interview-questions has the full bank of 1,418 questions.

Contents

SQL (33)

Aggregation and filtering

GROUP BY, HAVING, conditional aggregation, and the difference between filtering rows and filtering groups.
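As a minimal sketch of the row filter versus group filter distinction, the snippet below runs one query against an in-memory SQLite database. The `orders` table and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, status TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ann', 'paid', 50), ('ann', 'paid', 70), ('ann', 'refunded', 20),
        ('bob', 'paid', 30), ('bob', 'refunded', 40),
        ('cat', 'paid', 200);
""")

big_spenders = conn.execute("""
    SELECT customer,
           -- conditional aggregation: one pass, two differently filtered totals
           SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_total,
           SUM(CASE WHEN status = 'refunded' THEN 1 ELSE 0 END) AS refund_count
    FROM orders
    WHERE amount >= 25                      -- filters rows before grouping
    GROUP BY customer
    HAVING SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) > 100
                                            -- filters groups after aggregation
    ORDER BY customer
""").fetchall()

print(big_spenders)
```

Here `WHERE` drops the 20 unit refund before any group is formed, while `HAVING` removes the whole `bob` group because its paid total is too small.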

Joins and subqueries

Inner, left, anti, and self joins. The choice between joining and using a subquery or EXISTS.
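One way to see the join versus `EXISTS` choice is to write the same anti join both ways. This is a sketch over hypothetical `customers` and `orders` tables in SQLite, not a question from the list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Anti join via LEFT JOIN: keep customers whose join found no order row.
via_left_join = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.id IS NULL
    ORDER BY c.name
""").fetchall()

# Same result via NOT EXISTS: no risk of row duplication from the join.
via_not_exists = conn.execute("""
    SELECT c.name
    FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()

print(via_left_join, via_not_exists)
```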

NULL handling and dedup

COALESCE, NULL-safe comparisons, ROW_NUMBER deduplication, DISTINCT vs GROUP BY.
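The sketch below shows ROW_NUMBER deduplication (keep the latest row per key) together with COALESCE and a NULL-safe comparison. The `profiles` table is hypothetical, and the example assumes the SQLite bundled with Python is version 3.25 or newer, which is when window functions were added. `IS` is SQLite's NULL-safe equality operator; other engines spell it differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles (user_id INTEGER, email TEXT, updated_at TEXT);
    INSERT INTO profiles VALUES
        (1, 'a@x.com',  '2024-01-01'),
        (1, NULL,       '2024-02-01'),
        (2, 'b@x.com',  '2024-01-15'),
        (2, 'b2@x.com', '2024-01-20');
""")

latest = conn.execute("""
    WITH ranked AS (
        SELECT user_id, email, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY updated_at DESC
               ) AS rn
        FROM profiles
    )
    SELECT user_id, COALESCE(email, 'unknown'), updated_at
    FROM ranked
    WHERE rn = 1            -- one surviving row per user_id
    ORDER BY user_id
""").fetchall()

# `=` against NULL yields NULL (None in Python); `IS` yields a real boolean.
null_compare = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

print(latest, null_compare)
```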

Date and time

Date arithmetic, day of week and hour bucketing, time window filters, timezone gotchas.
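A small illustration of the timezone gotcha, in Python rather than SQL: the same instant falls on different calendar days, and therefore different day of week buckets, depending on the zone. The fixed UTC-8 offset is an assumption made to keep the sketch self-contained; real code would normally use a named zone so daylight saving is handled.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offset for illustration; it ignores daylight saving time.
UTC_MINUS_8 = timezone(timedelta(hours=-8))

event_utc = datetime(2024, 3, 1, 2, 30, tzinfo=timezone.utc)
event_local = event_utc.astimezone(UTC_MINUS_8)

# Hour bucketing: truncate to the start of the hour.
hour_bucket = event_local.replace(minute=0, second=0, microsecond=0)

# Day of week differs by zone (Monday is 0).
utc_weekday = event_utc.weekday()
local_weekday = event_local.weekday()

# Time window filter: is a candidate inside the trailing 7 days?
window_start = event_utc - timedelta(days=7)
candidate = datetime(2024, 2, 25, tzinfo=timezone.utc)
in_window = window_start <= candidate < event_utc

print(event_local.isoformat(), utc_weekday, local_weekday, in_window)
```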

Window functions

ROW_NUMBER, RANK, LAG, LEAD, running aggregates, partitioned ORDER BY.
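As a rough sketch of LAG and a running aggregate over a partitioned ORDER BY, the query below computes a day over day delta and a running total per region. The `sales` table is hypothetical, and the example assumes SQLite 3.25 or newer for window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day TEXT, region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('2024-01-01', 'east', 10), ('2024-01-02', 'east', 15),
        ('2024-01-03', 'east', 5),
        ('2024-01-01', 'west', 7),  ('2024-01-02', 'west', 3);
""")

trend = conn.execute("""
    SELECT region, day, amount,
           -- previous row within the same region; NULL on the first day
           amount - LAG(amount) OVER (PARTITION BY region ORDER BY day) AS delta,
           -- ORDER BY inside OVER turns SUM into a running total
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
    ORDER BY region, day
""").fetchall()

for row in trend:
    print(row)
```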

CTEs and recursion

WITH clauses for query layering, recursive CTEs for hierarchies and graphs, multi-step funnels.
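A recursive CTE for a hierarchy might look like the sketch below, which walks a hypothetical `employees` table from the root down and records each person's depth.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'ceo', NULL), (2, 'vp', 1), (3, 'eng', 2), (4, 'ops', 1);
""")

org_chart = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        -- anchor member: the root has no manager
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- recursive member: attach direct reports of rows found so far
        SELECT e.id, e.name, c.depth + 1
        FROM employees e
        JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()

print(org_chart)
```

On graph data with cycles, a query of this shape would not terminate on its own; a depth cap or a visited path column is the usual guard.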

Set ops and pivots

UNION, INTERSECT, EXCEPT, conditional aggregation pivots that turn rows into columns.
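To make the pivot and set operation patterns concrete, here is a sketch over a hypothetical `quarterly` table: conditional aggregation turns quarter rows into columns, and EXCEPT finds regions present in one quarter but not the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quarterly (region TEXT, quarter TEXT, amount INTEGER);
    INSERT INTO quarterly VALUES
        ('east', 'q1', 10), ('east', 'q2', 20), ('west', 'q1', 5);
""")

# Pivot: one output column per quarter value.
pivot = conn.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'q2' THEN amount ELSE 0 END) AS q2
    FROM quarterly
    GROUP BY region
    ORDER BY region
""").fetchall()

# Set difference: regions with q1 sales and no q2 sales.
q1_only = conn.execute("""
    SELECT region FROM quarterly WHERE quarter = 'q1'
    EXCEPT
    SELECT region FROM quarterly WHERE quarter = 'q2'
""").fetchall()

print(pivot, q1_only)
```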

Python and PySpark (30)

Data structures and iteration

Dicts, sets, defaultdicts, nested data, iteration patterns that turn raw records into lookups.
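A minimal sketch of turning raw records into lookups, using made-up page view records: a `defaultdict(set)` builds a user to pages index in one pass, and a plain dict comprehension derives counts from it.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a log or API.
records = [
    {"user": "ann", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "ann", "page": "/pricing"},
    {"user": "ann", "page": "/home"},
]

# One pass: missing keys start as an empty set, duplicates collapse.
pages_by_user = defaultdict(set)
for record in records:
    pages_by_user[record["user"]].add(record["page"])

# Derived lookup: distinct pages per user.
distinct_pages = {user: len(pages) for user, pages in pages_by_user.items()}

print(distinct_pages)
```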

Aggregation and bucketing

Group by counting, running totals, histograms, bucket assignment in pure Python.
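Running totals and bucket assignment can be sketched with the standard library alone. The bucket edges and labels below are arbitrary choices for illustration.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

values = [3, 12, 150, 10, 99]

# Running total: each element is the sum of everything up to that point.
running_totals = list(accumulate(values))

# Bucket assignment: edges are the inclusive lower bounds of the later buckets,
# so 10 lands in 'medium' and 100 would land in 'large'.
edges = [10, 100]
labels = ["small", "medium", "large"]

def bucket(value):
    return labels[bisect_right(edges, value)]

histogram = Counter(bucket(v) for v in values)

print(running_totals, dict(histogram))
```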

Streaming and I/O

Reading large files line by line, chunked CSV processing, parsing log lines, partitioned output.
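The sketch below combines line by line reading with chunked processing using generators, so only one chunk is held in memory at a time. The log format (timestamp, level, message separated by spaces) is an assumption, and an `io.StringIO` stands in for a real file handle.

```python
import io
from itertools import islice

def parse_lines(handle):
    """Yield one parsed record per non-empty line, lazily."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        ts, level, msg = line.split(" ", 2)
        yield {"ts": ts, "level": level, "msg": msg}

def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Stand-in for open("app.log"); the contents are made up.
log = io.StringIO(
    "2024-01-01T00:00:00 INFO started\n"
    "2024-01-01T00:00:01 ERROR disk full\n"
    "\n"
    "2024-01-01T00:00:02 INFO stopped\n"
)

chunks = list(chunked(parse_lines(log), 2))
print([len(c) for c in chunks])
```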

Merge and reconcile

Joining two record streams in Python, reconciling schema drift, computing snapshot diffs.
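A snapshot diff can be sketched as set arithmetic over keys plus a value comparison on the overlap. The two snapshots below are hypothetical and keyed by record id.

```python
def snapshot_diff(old, new):
    """Compare two {key: record} snapshots; return sorted key lists."""
    old_keys, new_keys = set(old), set(new)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }

yesterday = {
    1: {"name": "ann", "plan": "free"},
    2: {"name": "bob", "plan": "pro"},
    3: {"name": "cat", "plan": "free"},
}
today = {
    1: {"name": "ann", "plan": "pro"},   # plan changed
    3: {"name": "cat", "plan": "free"},  # unchanged
    4: {"name": "dan", "plan": "free"},  # new record
}

diff = snapshot_diff(yesterday, today)
print(diff)
```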

Functional patterns and generators

Generators, decorators, context managers, higher-order functions for throttles and timers.
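As one possible shape for the timer pattern, the sketch below implements it twice: as a decorator that records the last call's duration, and as a context manager that writes into a caller supplied dict. The names `timed`, `timer`, and `last_elapsed` are illustrative, not from any library.

```python
import functools
import time
from contextlib import contextmanager

def timed(func):
    """Decorator: store the duration of the most recent call on the wrapper."""
    @functools.wraps(func)          # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            wrapper.last_elapsed = time.perf_counter() - start
    wrapper.last_elapsed = None
    return wrapper

@contextmanager
def timer(label, sink):
    """Context manager: record elapsed seconds for a block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = time.perf_counter() - start

@timed
def total(values):
    return sum(values)

timings = {}
with timer("sum_squares", timings):
    # The generator expression feeds values lazily into total().
    result = total(x * x for x in range(1000))

print(result, total.last_elapsed, timings)
```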

Data processing and transforms

Multi-step record transformation, schema cleanup, column-wise transforms, pipeline composition.
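Pipeline composition can be sketched as a fold over small single purpose functions. The step names and the sample record below are made up for illustration.

```python
from functools import reduce

def compose(*steps):
    """Return a function that applies each step to a record, left to right."""
    return lambda record: reduce(lambda acc, step: step(acc), steps, record)

def normalize_keys(record):
    # Schema cleanup: trim and lowercase column names.
    return {key.strip().lower(): value for key, value in record.items()}

def drop_empty(record):
    # Remove columns whose value is missing.
    return {key: value for key, value in record.items() if value not in ("", None)}

def cast_amount(record):
    # Column-wise transform: parse the amount column as a float.
    return {**record, "amount": float(record["amount"])}

clean = compose(normalize_keys, drop_empty, cast_amount)

raw = {" Amount ": "12.50", "Name": "ann", "Note": ""}
cleaned = clean(raw)
print(cleaned)
```

Order matters here: `cast_amount` looks up the lowercase key, so it has to run after `normalize_keys`.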

Data Modeling (6)

Normalization and ERDs

Designing normalized tables, picking primary and foreign keys, modeling slowly changing dimensions.

Star schemas and warehouses

Fact and dimension tables, grain selection, conformed dimensions, star vs snowflake tradeoffs.
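A tiny star schema, sketched in SQLite with hypothetical table and column names: two dimensions, one fact table whose grain is one row per product per day, and a rollup that joins through the dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT, month TEXT);

    -- Grain: one row per product per day, enforced by the composite key.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units INTEGER,
        revenue INTEGER,
        PRIMARY KEY (date_key, product_key)
    );

    INSERT INTO dim_product VALUES (1, 'widget', 'tools'), (2, 'gizmo', 'toys');
    INSERT INTO dim_date VALUES
        (20240101, '2024-01-01', '2024-01'), (20240102, '2024-01-02', '2024-01');
    INSERT INTO fact_sales VALUES
        (20240101, 1, 2, 20), (20240101, 2, 1, 15), (20240102, 1, 3, 30);
""")

revenue_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category
""").fetchall()

print(revenue_by_category)
```

In a snowflake variant, `category` would move out of `dim_product` into its own table, trading an extra join for less repetition.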

Pipeline Architecture (6)

Batch vs streaming

Choosing between batch and streaming, handling late-arriving data, designing for consistency and replay.
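One common way to reason about late-arriving data is a watermark with allowed lateness. The sketch below is a simplified, single process illustration of that idea, not how any particular streaming engine implements it; the window size, lateness bound, and event tuples are all assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60       # assumed tumbling window size
ALLOWED_LATENESS = 30     # assumed lateness bound, in seconds

def aggregate(events):
    """events: iterable of (event_time_seconds, value) in arrival order."""
    totals = defaultdict(int)
    too_late = []
    watermark = float("-inf")
    for event_time, value in events:
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        window_start = event_time - event_time % WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS
        if window_end <= watermark:
            # The window is already closed; route the event aside for replay.
            too_late.append((event_time, value))
        else:
            totals[window_start] += value
    return dict(totals), too_late

arrivals = [(5, 1), (50, 1), (70, 1), (55, 1), (130, 1), (20, 1)]
totals, too_late = aggregate(arrivals)
print(totals, too_late)
```

The event at time 55 arrives out of order but is still counted, because its window had not passed the watermark; the event at time 20 arrives after the watermark reached 100 and is set aside.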

ETL and connectors

CDC connectors, incremental sync strategies, schema drift handling, autoscaling for variable load.
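An incremental sync based on a high-water mark cursor can be sketched as follows. This is a simplification under stated assumptions: rows carry an `updated_at` ISO timestamp string, the source is a plain list, and deletes are not tracked (log based CDC is the usual answer to that gap).

```python
def incremental_sync(source_rows, target, cursor):
    """Upsert rows newer than `cursor` into `target`; return the new cursor."""
    new_cursor = cursor
    for row in source_rows:
        # ISO 8601 strings in one format and zone compare correctly as text.
        if row["updated_at"] > cursor:
            target[row["id"]] = dict(row)   # upsert by primary key
            new_cursor = max(new_cursor, row["updated_at"])
    return new_cursor

source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00", "value": "a"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00", "value": "b"},
]
target = {}

# First run: an empty cursor pulls everything.
cursor = incremental_sync(source, target, "")

# A row changes upstream; the second run picks up only that row.
source[0] = {"id": 1, "updated_at": "2024-01-03T00:00:00", "value": "a2"}
cursor = incremental_sync(source, target, cursor)

print(cursor, target)
```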

How to use this list

| You have | Strategy |
| --- | --- |
| 4+ weeks | Work through every question in order. Each pattern builds on the last. |
| 1 to 2 weeks | Focus on SQL and pipeline architecture. The two highest weight rounds. |
| 48 hours | Skim every pattern description. Solve only the patterns you have not seen before. |
| Stuck on a question | Read the prompt, attempt a solution, then check the discussion. |

Companion repos

Contributing

Curated list. To suggest a question, open an issue with:

  1. The pattern the question tests
  2. A real source (interview report, blog, your own experience)
  3. Why it belongs in the 75 rather than the long tail

License

CC BY 4.0
