The DataDriven 75. A focused, hand-picked set of 75 real data engineering interview questions, organized by the patterns interviewers actually test.
SQL · Python · Schema · Pipelines · Companion repos
Every problem links to a runnable browser sandbox at datadriven.io. No login required.
If you want a wider net, the sibling repo data-engineering-interview-questions has the full 1418 question bank.
- SQL (33)
- Python and PySpark (30)
- Data Modeling (6)
- Pipeline Architecture (6)
- How to use this list
- Companion repos
GROUP BY, HAVING, conditional aggregation, and the difference between filtering rows and filtering groups.
- Spending by Account Status
- Power Users by Session Activity
- Daily Spam Impression Rate
- API Call Distribution Fraction
- Active User Penetration Rate
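As a minimal sketch of the row filter versus group filter distinction, the example below uses Python's built-in `sqlite3` module. The `orders` table and its columns are hypothetical and are not taken from any of the listed problems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (account_id INTEGER, status TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'active', 50.0), (1, 'active', 70.0),
        (2, 'closed', 20.0), (2, 'active', 10.0),
        (3, 'active', 5.0);
""")

# WHERE removes rows before grouping; HAVING removes groups after aggregation.
spend_rows = conn.execute("""
    SELECT account_id,
           SUM(amount) AS total_spend,
           -- conditional aggregation: only 'active' rows contribute
           SUM(CASE WHEN status = 'active' THEN amount ELSE 0 END) AS active_spend
    FROM orders
    WHERE amount > 5            -- row filter: account 3 loses its only order
    GROUP BY account_id
    HAVING SUM(amount) > 25     -- group filter: applied to the per-account total
    ORDER BY account_id
""").fetchall()

print(spend_rows)  # [(1, 120.0, 120.0), (2, 30.0, 10.0)]
```

Moving the `amount > 5` condition into `HAVING` would not be valid for a non-aggregated column in most engines, which is a quick way to remember which clause operates on rows and which on groups.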
Inner, left, anti, and self joins. The choice between joining and using a subquery or EXISTS.
- Average Spending by Account Status
- Content Recommendation Engine
- First Day Session Retention
- Joined Employee Details
- NULL Keys in Joins
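The sketch below illustrates an anti join written three ways, and how NULL keys change the answer. It runs on `sqlite3`; the `users` and `sessions` tables are hypothetical stand-ins, not the schemas used by the listed problems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, name TEXT);
    CREATE TABLE sessions (session_id INTEGER, user_id INTEGER);
    INSERT INTO users VALUES (1, 'ana'), (2, 'bo'), (NULL, 'ghost');
    INSERT INTO sessions VALUES (10, 1), (11, 1), (12, NULL);
""")

# Anti join via LEFT JOIN: keep users with no matching session.
left_anti = conn.execute("""
    SELECT u.name
    FROM users u
    LEFT JOIN sessions s ON s.user_id = u.user_id
    WHERE s.session_id IS NULL
    ORDER BY u.name
""").fetchall()

# Same question with NOT EXISTS.
not_exists = conn.execute("""
    SELECT u.name
    FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM sessions s WHERE s.user_id = u.user_id)
    ORDER BY u.name
""").fetchall()

# NOT IN against a subquery that contains a NULL returns no rows at all,
# because "x NOT IN (..., NULL)" evaluates to NULL rather than TRUE.
not_in = conn.execute("""
    SELECT name FROM users
    WHERE user_id NOT IN (SELECT user_id FROM sessions)
""").fetchall()

print(left_anti)   # [('bo',), ('ghost',)]
print(not_exists)  # [('bo',), ('ghost',)]
print(not_in)      # []
```

Note that the NULL user and the NULL session never match each other, since `NULL = NULL` is not true in a join condition.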
COALESCE, NULL safe comparisons, ROW_NUMBER deduplication, DISTINCT vs GROUP BY.
- Deduplicate and Keep Latest
- Deduplicated Sales Volume by Category
- Distinct Blog Referrers
- Distinct Chat Conversations
- Duplicate DQ Check Records
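A common shape for "deduplicate and keep latest" is `ROW_NUMBER` partitioned by the business key and ordered by a timestamp. The sketch below assumes a hypothetical `events` table and a SQLite build with window function support (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER, payload TEXT, updated_at TEXT);
    INSERT INTO events VALUES
        (1, 'a',  '2024-01-01'),
        (1, 'b',  '2024-01-03'),
        (2, 'c',  '2024-01-01'),
        (2, NULL, '2024-01-02');
""")

latest_rows = conn.execute("""
    SELECT id,
           COALESCE(payload, 'unknown') AS payload,  -- replace NULL with a default
           updated_at
    FROM (
        SELECT id, payload, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY id ORDER BY updated_at DESC
               ) AS rn
        FROM events
    ) AS ranked
    WHERE rn = 1          -- keep only the newest row per id
    ORDER BY id
""").fetchall()

print(latest_rows)  # [(1, 'b', '2024-01-03'), (2, 'unknown', '2024-01-02')]
```

`DISTINCT` or `GROUP BY id` alone could not pick the newest `payload`, which is why the ranking step is needed when the duplicates differ in non-key columns.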
Date arithmetic, day of week and hour bucketing, time window filters, timezone gotchas.
- 7 Day Onboarding Conversion
- Active Tokens on Target Date
- After Hours API Calls
- Average Response Time by Hour
- 30 Day Page View Counts
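The timezone gotcha usually shows up when timestamps are stored in UTC but the question is phrased in local business hours. The sketch below is illustrative only: the fixed UTC-8 offset, the 09:00 to 17:00 window, and the function name `is_after_hours` are assumptions, and a real zone with daylight saving would need `zoneinfo` instead of a fixed offset.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: a fixed offset stands in for the local zone.
LOCAL = timezone(timedelta(hours=-8))

def is_after_hours(ts_utc, start_hour=9, end_hour=17):
    """True if the call falls on a weekend or outside local business hours."""
    local = ts_utc.astimezone(LOCAL)
    return local.weekday() >= 5 or not (start_hour <= local.hour < end_hour)

calls = [
    datetime(2024, 3, 4, 1, 30, tzinfo=timezone.utc),  # Sunday 17:30 local
    datetime(2024, 3, 4, 18, 0, tzinfo=timezone.utc),  # Monday 10:00 local
    datetime(2024, 3, 5, 2, 15, tzinfo=timezone.utc),  # Monday 18:15 local
]

after_hours_flags = [is_after_hours(ts) for ts in calls]
calls_by_local_hour = Counter(ts.astimezone(LOCAL).hour for ts in calls)
local_dates = [ts.astimezone(LOCAL).date().isoformat() for ts in calls]

print(after_hours_flags)  # [True, False, True]
print(local_dates)        # ['2024-03-03', '2024-03-04', '2024-03-04']
```

Bucketing by the UTC date would place these three calls on March 4, March 4, and March 5, which is a different daily count from the local view.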
ROW_NUMBER, RANK, LAG, LEAD, running aggregates, partitioned ORDER BY.
- Cloud Cost Trend Analysis
- 7 Check Rolling Average
- Longest Visit Streaks
- Previous Day Top Service
- Cumulative Sales Per Customer
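As a small illustration of `LAG` and a running aggregate, the sketch below computes a day-over-day delta and a cumulative total. The `daily_cost` table is hypothetical, and the example assumes SQLite 3.25 or newer for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_cost (day TEXT, cost INTEGER);
    INSERT INTO daily_cost VALUES
        ('2024-01-01', 100), ('2024-01-02', 120), ('2024-01-03', 90);
""")

trend_rows = conn.execute("""
    SELECT day,
           cost,
           cost - LAG(cost) OVER (ORDER BY day) AS delta,   -- NULL on the first day
           SUM(cost) OVER (
               ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM daily_cost
    ORDER BY day
""").fetchall()

for row in trend_rows:
    print(row)
```

Spelling out the `ROWS` frame avoids relying on the default frame, which is range-based and treats tied `ORDER BY` values as a single step.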
WITH clauses for query layering, recursive CTEs for hierarchies and graphs, multi step funnels.
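A recursive CTE for a hierarchy can be sketched as follows. The `employees` table and the `chain` CTE name are hypothetical; the pattern is an anchor query for the root plus a recursive step that joins children to rows already found.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, manager_id INTEGER, name TEXT);
    INSERT INTO employees VALUES
        (1, NULL, 'ceo'), (2, 1, 'vp'), (3, 2, 'eng'), (4, 1, 'cfo');
""")

org_depths = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        -- anchor: the root of the hierarchy
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- recursive step: direct reports of rows already in the chain
        SELECT e.id, e.name, c.depth + 1
        FROM employees e
        JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()

print(org_depths)  # [('ceo', 0), ('cfo', 1), ('vp', 1), ('eng', 2)]
```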
UNION, INTERSECT, EXCEPT, conditional aggregation pivots that turn rows into columns.
- Combined Cloud Spend by Region and Service
- Experiment Conversion Pivot
- Feature Name Intersection
- Push Notification Status Pivot
- Funnel Leakage Report
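The sketch below shows a rows-to-columns pivot built from conditional aggregation, followed by a set operation. The `notifications` table and its status values are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE notifications (user_id INTEGER, status TEXT);
    INSERT INTO notifications VALUES
        (1, 'sent'), (1, 'opened'), (2, 'sent'), (2, 'sent'), (3, 'failed');
""")

# Pivot: one output column per status value.
status_pivot = conn.execute("""
    SELECT user_id,
           SUM(CASE WHEN status = 'sent'   THEN 1 ELSE 0 END) AS sent,
           SUM(CASE WHEN status = 'opened' THEN 1 ELSE 0 END) AS opened,
           SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) AS failed
    FROM notifications
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

# INTERSECT: users that appear in both result sets, duplicates removed.
sent_and_opened = conn.execute("""
    SELECT user_id FROM notifications WHERE status = 'sent'
    INTERSECT
    SELECT user_id FROM notifications WHERE status = 'opened'
""").fetchall()

print(status_pivot)     # [(1, 1, 1, 0), (2, 2, 0, 0), (3, 0, 0, 1)]
print(sent_and_opened)  # [(1,)]
```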
Dicts, sets, defaultdicts, nested data, iteration patterns that turn raw records into lookups.
- Distribute Values Into Container Types
- The Consecutive Sequence Finder
- The Eviction Policy
- The Hierarchy Builder
- The File Tree Builder
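Two of the building blocks named above, a `defaultdict` lookup and a set used for constant-time membership checks, can be sketched like this. The function names and record fields are illustrative and are not the reference solutions for the listed problems.

```python
from collections import defaultdict

def build_children_index(records):
    """Turn flat parent/child records into a parent -> [children] lookup."""
    children = defaultdict(list)
    for rec in records:
        children[rec["parent"]].append(rec["id"])
    return dict(children)

def longest_consecutive_run(values):
    """Length of the longest run of consecutive integers, in O(n) using a set."""
    seen = set(values)
    best = 0
    for v in seen:
        if v - 1 not in seen:          # only start counting at the beginning of a run
            length = 1
            while v + length in seen:
                length += 1
            best = max(best, length)
    return best

records = [
    {"id": "b", "parent": "a"},
    {"id": "c", "parent": "a"},
    {"id": "d", "parent": "b"},
]
print(build_children_index(records))                   # {'a': ['b', 'c'], 'b': ['d']}
print(longest_consecutive_run([100, 4, 200, 1, 3, 2]))  # 4
```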
Group by counting, running totals, histograms, bucket assignment in pure Python.
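A rough sketch of these four operations using only the standard library follows. The bucket edges and labels are made-up values chosen for the example.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

EDGES = [10, 100]                       # assumed boundaries: <10, 10..99, >=100
LABELS = ["small", "medium", "large"]

def bucket_for(amount):
    """Assign a label by binary search over the sorted bucket edges."""
    return LABELS[bisect_right(EDGES, amount)]

amounts = [5, 10, 250, 99]

running_totals = list(accumulate(amounts))             # cumulative sum
histogram = Counter(bucket_for(a) for a in amounts)    # group-by count per bucket

print(running_totals)   # [5, 15, 265, 364]
print(dict(histogram))  # {'small': 1, 'medium': 2, 'large': 1}
```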
Reading large files line by line, chunked CSV processing, parsing log lines, partitioned output.
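Chunked CSV processing can be sketched with a generator that pulls a fixed number of rows at a time, so memory use depends on the chunk size rather than the file size. The function name `read_in_chunks` is hypothetical, and an in-memory `StringIO` stands in for a large file.

```python
import csv
import io
from itertools import islice

def read_in_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size parsed rows from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

# Stand-in for open("big.csv", newline=""); any line iterator works.
source = io.StringIO("id,amount\n1,10\n2,20\n3,30\n")

chunk_sizes = []
total_amount = 0
for chunk in read_in_chunks(source, chunk_size=2):
    chunk_sizes.append(len(chunk))
    total_amount += sum(int(row["amount"]) for row in chunk)

print(chunk_sizes, total_amount)  # [2, 1] 60
```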
Joining two record streams in Python, reconciling schema drift, computing snapshot diffs.
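One way to compute a snapshot diff is to index both snapshots by key and compare them. This is a sketch under the assumption that each record is a dict with a unique `id` field; the function name and output layout are illustrative.

```python
def snapshot_diff(old, new, key="id"):
    """Classify records as inserted, deleted, or updated between two snapshots."""
    old_by_key = {rec[key]: rec for rec in old}
    new_by_key = {rec[key]: rec for rec in new}
    return {
        "inserted": [r for k, r in new_by_key.items() if k not in old_by_key],
        "deleted": [r for k, r in old_by_key.items() if k not in new_by_key],
        "updated": [
            r for k, r in new_by_key.items()
            if k in old_by_key and r != old_by_key[k]
        ],
    }

yesterday = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
today = [{"id": 2, "city": "Milan"}, {"id": 3, "city": "Lima"}]

diff = snapshot_diff(yesterday, today)
print(diff)
```

Because whole dicts are compared, a record that gains or loses a column between snapshots is reported as updated, which is one simple way to surface schema drift.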
Generators, decorators, context managers, higher order functions for throttles and timers.
- The Chunked Reader
- The Dependency Resolver
- Execution Timer Wrapper
- The Throttle Wall
- The Timing Decorator
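A timing decorator along these lines is one possible shape; the name `timed` and the `durations` attribute are assumptions made for the example, not the expected interface of the listed problems.

```python
import functools
import time

def timed(func):
    """Record the wall-clock duration of every call on wrapper.durations."""
    @functools.wraps(func)              # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # finally ensures the duration is recorded even if func raises
            wrapper.durations.append(time.perf_counter() - start)
    wrapper.durations = []
    return wrapper

@timed
def add(a, b):
    return a + b

result = add(2, 3)
print(result, add.__name__, len(add.durations))
```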
Multi step record transformation, schema cleanup, column wise transforms, pipeline composition.
- The Record Reconciler
- Merge Overlapping Time Ranges
- The Change Data Capture
- Stream Process a Large CSV
- The Column Transformer
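Pipeline composition can be sketched as a chain of small, single-purpose functions applied left to right. All step names and the sample record below are illustrative.

```python
from functools import reduce

def compose(*steps):
    """Build a pipeline that feeds each step's output into the next step."""
    def pipeline(record):
        return reduce(lambda acc, step: step(acc), steps, record)
    return pipeline

def normalize_keys(record):
    # schema cleanup: trim and lowercase column names
    return {k.strip().lower(): v for k, v in record.items()}

def strip_strings(record):
    # column-wise transform: trim whitespace from string values
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def cast_amount(record):
    # type conversion for a single known column
    return {**record, "amount": float(record["amount"])}

clean = compose(normalize_keys, strip_strings, cast_amount)
cleaned = clean({" Amount ": " 12.50 ", "Name": " Ana "})
print(cleaned)  # {'amount': 12.5, 'name': 'Ana'}
```

Keeping each step pure (it returns a new dict instead of mutating its input) makes the steps easy to test and reorder independently.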
Designing normalized tables, picking primary and foreign keys, modeling slowly changing dimensions.
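A Type 2 slowly changing dimension keeps history by closing the current row and appending a new one when an attribute changes. The sketch below models that in plain Python; the column names `valid_from` and `valid_to` and the function `apply_scd2` are assumptions for the example.

```python
def apply_scd2(history, customer_id, city, effective_date):
    """Return a new history list with a Type 2 change applied for one customer."""
    updated = []
    needs_new_row = True
    for row in history:
        is_current = row["customer_id"] == customer_id and row["valid_to"] is None
        if is_current and row["city"] == city:
            needs_new_row = False                                 # no change, keep as is
            updated.append(row)
        elif is_current:
            updated.append({**row, "valid_to": effective_date})   # close the old version
        else:
            updated.append(row)
    if needs_new_row:
        updated.append({
            "customer_id": customer_id,
            "city": city,
            "valid_from": effective_date,
            "valid_to": None,                                     # None marks the current version
        })
    return updated

history = apply_scd2([], 1, "Oslo", "2024-01-01")
history = apply_scd2(history, 1, "Oslo", "2024-02-01")     # same value, no new row
history = apply_scd2(history, 1, "Bergen", "2024-03-01")   # change, new version
for row in history:
    print(row)
```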
Fact and dimension tables, grain selection, conformed dimensions, star vs snowflake tradeoffs.
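As a minimal star schema sketch, the example below declares one fact table at a stated grain (one row per product per day) and one dimension, then aggregates across the join. Table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        category    TEXT NOT NULL
    );
    -- Grain: one row per product per day.
    CREATE TABLE fact_sales (
        product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
        date_key    INTEGER NOT NULL,
        quantity    INTEGER NOT NULL,
        revenue     REAL NOT NULL,
        PRIMARY KEY (product_key, date_key)
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES
        (1, 20240101, 2, 30.0),
        (1, 20240102, 1, 15.0),
        (2, 20240101, 1, 60.0);
""")

revenue_by_category = conn.execute("""
    SELECT d.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

print(revenue_by_category)  # [('books', 45.0), ('games', 60.0)]
```

Making the grain the primary key of the fact table is one way to have the database reject rows that would silently double count.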
Choosing between batch and streaming, handling late arriving data, designing for consistency and replay.
- Hourly ETL Pipeline with Consistency
- Database Replication and Schema Normalization Pipeline
- Gaming Event Pipeline: Streaming vs Batch Architecture Decision
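Late-arriving data is often handled with a watermark plus an allowed lateness. The sketch below is a simplified, single-process model of that idea with tumbling windows; the parameter values and the function name are assumptions, and real stream processors implement this with considerably more machinery.

```python
from collections import defaultdict

def assign_windows(event_times, window_size=60, allowed_lateness=30):
    """Count events per tumbling window, diverting events that arrive too late.

    event_times is the sequence of event timestamps (in seconds) in arrival order.
    The watermark trails the largest timestamp seen by allowed_lateness.
    """
    counts = defaultdict(int)
    too_late = []
    max_seen = float("-inf")
    for t in event_times:
        max_seen = max(max_seen, t)
        watermark = max_seen - allowed_lateness
        window_start = (t // window_size) * window_size
        if window_start + window_size <= watermark:
            too_late.append(t)          # window already closed: send to a side output
        else:
            counts[window_start] += 1
    return dict(counts), too_late

window_counts, dropped = assign_windows([5, 50, 70, 55, 130, 58])
print(window_counts)  # {0: 3, 60: 1, 120: 1}
print(dropped)        # [58]
```

The event at 55 arrives out of order but is still counted because its window has not passed the watermark; the event at 58 arrives after the watermark reached 100 and is diverted. Keeping the diverted events, rather than discarding them, is what makes a later batch replay or correction possible.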
CDC connectors, incremental sync strategies, schema drift handling, auto scaling for variable load.
- CDC Connector: Log Based vs Trigger Based
- AWS Pipeline Auto Scaling for Variable Volume
- City Wide Bicycle Demand Forecasting Pipeline
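An incremental sync driven by a change log can be sketched as applying ordered changes past a stored position. The change format (`lsn`, `op`, `key`, `row`) and the function `apply_changes` are hypothetical; real CDC connectors define their own event shapes.

```python
def apply_changes(target, changes, last_applied_lsn=0):
    """Apply insert/update/delete changes to target in log order.

    Changes at or below last_applied_lsn are skipped, so replaying the
    same batch twice leaves the target unchanged.
    """
    for change in sorted(changes, key=lambda c: c["lsn"]):
        if change["lsn"] <= last_applied_lsn:
            continue                                  # already applied
        if change["op"] == "delete":
            target.pop(change["key"], None)
        else:                                         # insert and update are both upserts
            target[change["key"]] = change["row"]
        last_applied_lsn = change["lsn"]
    return last_applied_lsn

changes = [
    {"lsn": 1, "op": "insert", "key": 1, "row": {"v": "a"}},
    {"lsn": 2, "op": "update", "key": 1, "row": {"v": "b"}},
    {"lsn": 3, "op": "insert", "key": 2, "row": {"v": "c"}},
    {"lsn": 4, "op": "delete", "key": 2, "row": None},
]

target = {}
position = apply_changes(target, changes)
replayed_position = apply_changes(target, changes, last_applied_lsn=position)
print(target, position)  # {1: {'v': 'b'}} 4
```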
| You have | Strategy |
|---|---|
| 4+ weeks | Work through every question in order. Each pattern builds on the last. |
| 1 to 2 weeks | Focus on SQL and pipeline architecture, the two highest-weight rounds. |
| 48 hours | Skim every pattern description. Solve only the patterns you have not seen before. |
| Stuck on a question | Read the prompt, attempt a solution, then check the discussion. |
- data-engineering-interview-questions. The full 1418 question bank.
- data-engineering-interview-handbook. The flagship handbook with chapters and study plans.
- awesome-data-engineering-interview. Curated resource list (singular).
- system-design-for-data-engineers. 120 long form pipeline case studies.
- data-engineer-interview-prep. 8 week structured practice schedule.
- data-engineering-cheatsheet. One page recall reference.
- data-engineer-interview-handbook. 7 day sprint version of the handbook.
Curated list. To suggest a question, open an issue with:
- The pattern the question tests
- A real source (interview report, blog, your own experience)
- Why it belongs in the 75 rather than the long tail