You are joining the data team at Venatus, a programmatic advertising company. Raw data from our ad-serving platform lands as files in a data lake landing zone.
A previous engineer started building the ingestion pipeline and some dbt staging models, but left before the work was tested. The code in dagster_project/ and dbt/models/staging/ has not been run against the actual data.
Your task is to:
- Get the existing pipeline working — the Python ingestion code has bugs that you'll need to find and fix by running it.
- Extend the dbt layer — staging models are started but incomplete. Build the mart models.
- Investigate the data — find and document as many data quality issues as you can. There are over 20.
- Document your design decisions and trade-offs in DESIGN.md.
- Commit frequently — we want to observe the process of discovery, not just a single final commit.
AI use is encouraged, but this challenge requires running code, hitting real errors, and debugging iteratively. Generating code without testing it will not work. AI will not find all of the data quality issues on its own.
This exercise is designed to assess the core skills of a data engineer: debugging existing code, handling messy real-world data, data modeling, and writing clean Python and SQL.
| Tool | Purpose | Access |
|---|---|---|
| ClickHouse | Columnar data warehouse (empty tables pre-created) | http://localhost:8123/play |
| Dagster | Pipeline orchestration framework | Runs locally via dagster dev |
| dbt (dbt-clickhouse) | Data transformation framework | Runs via Docker |
| Python | Ingestion pipeline code | Local environment |
Everything runs locally or in Docker. A few commands to get started.
New to ClickHouse or Dagster? That's completely fine — we don't expect prior experience with either.
- ClickHouse is very similar to PostgreSQL for SQL purposes. The main difference is the table engine syntax (already set up for you in `clickhouse/init-db.sql`). The SQL dialect is nearly standard — most things you'd write in Postgres will work. See the ClickHouse SQL Reference if you get stuck.
- Dagster is a Python-first orchestration framework similar to Airflow but with a more modern API. Instead of DAGs and operators, you define assets (decorated Python functions). The Dagster tutorial covers the basics in ~20 minutes. If you're more comfortable with Airflow concepts, think of assets as tasks that produce named outputs.
- The code in `dagster_project/` is a partially-complete implementation. The ClickHouse resource (`resources.py`) and Dagster wiring (`definitions.py`) are working — the bugs are in the data handling code (`assets/ingest.py`). You should not need to learn ClickHouse internals to fix them.
- Docker and Docker Compose (v2+)
- Python 3.10+
- `make` (standard on macOS / Linux)
- Git
- DBeaver — optional UI tool for ClickHouse, if desired
```bash
git clone <this-repo>
cd data-coding-challenge

# Start ClickHouse
make up

# Install Python dependencies
make setup
```

Before writing any code, explore the raw data files:
```bash
# See what's in the raw data
ls -la data/raw/
ls -la data/raw/events/ | head -20

# Look at the structure of event files
head -5 data/raw/events/*.jsonl | head -20

# Look at dimension files
head -5 data/raw/publishers.csv
head -5 data/raw/campaigns/campaigns_export.csv
head -5 data/raw/ad_units.csv
```

Also verify ClickHouse has empty tables ready:
```sql
-- Open http://localhost:8123/play
SHOW TABLES FROM raw;
-- Should show: ad_events, ad_units, campaigns, publishers (all empty)
```

The ingestion pipeline has already been started. Try running it:
```bash
# Start Dagster dev server
make dagster
# Open http://localhost:3000 and try to materialize the assets
# Some will fail — that's expected. Debug and fix the issues.

# After data is loaded, run dbt
make dbt-deps
make dbt-run   # Some models may fail — fix them
make dbt-test  # Some tests may fail — investigate why
```

The following files are provided in data/raw/:
Daily ad-serving event files in JSON Lines format. One file per day, covering ~30 days.
Each line is a JSON object representing one ad event. Inspect the files carefully — the schema is not entirely consistent across all files.
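The schema-drift warning above is easy to verify programmatically before writing any pipeline code. A minimal sketch of a field census over JSONL lines (the sample lines and nested `geo` field are hypothetical, not taken from the actual files):

```python
import json
from collections import Counter


def field_census(jsonl_lines: list[str]) -> dict[str, int]:
    """Return top-level fields that appear in only SOME lines.

    A field present on every line is boring; a field with a partial
    count is evidence of schema drift between files or days.
    """
    counts: Counter[str] = Counter()
    total = 0
    for line in jsonl_lines:
        if line.strip():
            counts.update(json.loads(line).keys())
            total += 1
    return {field: n for field, n in counts.items() if n < total}


# Hypothetical sample: the second line nests geography differently.
sample = [
    '{"event_id": "a1", "country": "GB"}',
    '{"event_id": "a2", "geo": {"country": "GB"}}',
]
print(field_census(sample))  # {'country': 1, 'geo': 1}
```

Running the same census per file (e.g., over `data/raw/events/*.jsonl`) shows exactly which days the structure changes.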
Typical fields include:
| Field | Type | Notes |
|---|---|---|
| `event_id` | string | UUID event identifier |
| `event_type` | string | impression, click, viewable_impression |
| `timestamp` | string | ISO 8601 timestamp |
| `publisher_id` | integer | Publisher (website) |
| `ad_unit_id` | string | Ad placement identifier |
| `campaign_id` | integer | Advertiser campaign — absent for unfilled events |
| `advertiser_id` | integer | Advertiser — absent for unfilled events |
| `device_type` | string | desktop, mobile, tablet, ctv |
| `browser` | string | Browser name |
| `revenue_usd` | float | Revenue in USD |
| `bid_floor_usd` | float | Minimum bid price |
| `is_filled` | boolean | Whether an ad was served |
| `site_domain` | string | Domain where ad was served |
| Geography fields | varies | Inspect the data — structure changes between files |
Publisher / website dimension data (20 publishers). Standard CSV format.
Advertiser campaign dimension data (27 campaigns). Exported from a different system — inspect the format carefully.
Ad unit / placement configuration (60 ad units across all publishers). Standard CSV format.
In ad-tech, revenue figures for recent days are not final when first reported. SSP/DSP reconciliation means that the last 2–3 days of data routinely get corrected as final figures settle. Our production pipeline re-ingests the most recent days every run.
The redelivery/ directory simulates this. It contains a second delivery of data that arrived one day later:
- `redelivery/events/` — The last 3 days of event files, re-delivered with corrections. Some events have updated revenue figures. Some new late-arriving events appear that weren't in the initial batch.
- `redelivery/publishers.csv` — Updated publisher data where some attributes have changed (e.g., account manager reassignment, category reclassification).
- `redelivery/campaigns/campaigns_export.csv` — Updated campaign data where some attributes have changed (e.g., budget increase, status change from completed to active when a campaign is extended).
- `redelivery/ad_units.csv` — Unchanged (included for completeness).
Your pipeline should be designed to handle being run against both the initial data and the redelivery data. Think carefully about how your dbt models should handle:
- Event data that gets corrected and re-delivered (same `event_id`, different values)
- Dimension data where attributes change over time
This is a core design decision — document your reasoning in DESIGN.md.
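One common approach to the corrected-events question (not necessarily the one you should pick) is "keep the latest version per key", which is also the semantic that ClickHouse's ReplacingMergeTree or a dbt incremental model with a `unique_key` gives you. A sketch of the idea, assuming a `batch_id` marker you would attach at ingest time (the field name and sample values are hypothetical):

```python
def latest_version(rows: list[dict]) -> list[dict]:
    """Collapse re-delivered events to one row per event_id,
    keeping the row from the most recent batch."""
    best: dict[str, dict] = {}
    for row in rows:
        current = best.get(row["event_id"])
        if current is None or row["batch_id"] > current["batch_id"]:
            best[row["event_id"]] = row
    return list(best.values())


rows = [
    {"event_id": "e1", "revenue_usd": 0.10, "batch_id": 1},
    {"event_id": "e1", "revenue_usd": 0.12, "batch_id": 2},  # corrected
    {"event_id": "e2", "revenue_usd": 0.05, "batch_id": 2},  # late-arriving
]
print(latest_version(rows))  # e1 at 0.12, plus the new e2
```

Whether you implement this in Python, in the table engine, or in dbt is itself a trade-off worth a paragraph in DESIGN.md.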
Note: Treat this data as you would data from a production system. Explore it thoroughly before writing your pipeline — not all of it is clean and not all files have identical schemas.
The ingestion code in dagster_project/assets/ingest.py was started by a previous engineer but has never been tested against the actual data files. The ClickHouse resource (resources.py) and Dagster definitions (definitions.py) are working — the bugs are in the data handling logic.
Required:
- Get all four ingestion assets (publishers, ad_units, campaigns, ad_events) materializing successfully
- Data must be loaded correctly — not just without errors, but with accurate values
- Handle both the initial delivery and the redelivery batch
- Document what you fixed and why in DESIGN.md
Expectations:
- Debug by running the pipeline, reading error messages, and inspecting the data
- Clean, readable Python fixes with proper typing
- Verify loaded data is correct (not just that the pipeline didn't crash)
- Idempotent loading (running the pipeline twice should be safe)
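One way to reason about the idempotency requirement: scope each load to a natural unit (e.g., one day's file) and replace that unit wholesale before inserting. The sketch below uses an in-memory stand-in for the ClickHouse client purely to illustrate the pattern; the real implementation would issue the equivalent delete/insert through the project's resource wrapper:

```python
class FakeClient:
    """Minimal in-memory stand-in for a warehouse client (illustration only)."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def delete_day(self, day: str) -> None:
        self.rows = [r for r in self.rows if r["day"] != day]

    def insert(self, rows: list[dict]) -> None:
        self.rows.extend(rows)


def load_day(client: FakeClient, day: str, rows: list[dict]) -> None:
    # Delete-then-insert makes re-running the same day safe:
    # the second run replaces rather than duplicates.
    client.delete_day(day)
    client.insert(rows)


client = FakeClient()
batch = [{"day": "2024-01-01", "event_id": "a"}]
load_day(client, "2024-01-01", batch)
load_day(client, "2024-01-01", batch)  # second run does not duplicate
print(len(client.rows))  # 1
```

Delete-then-insert is only one option; dedup-on-read via the table engine is another, with different consistency trade-offs.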
Nice to have (if time permits):
- Partition-aware assets (one partition per event day file)
- Data quality checks within Dagster (row counts, schema validation)
- A sensor or schedule for detecting new files
Staging models have been started in dbt/models/staging/ but may have issues. Fix any problems, then build the mart layer.
Staging models (models/staging/) — provided, may need fixes:
- `stg_ad_events` — Deduplicated, cleaned ad events with consistent schema
- `stg_publishers` — Cleaned publisher dimension
- `stg_campaigns` — Cleaned campaign dimension with normalized dates
- `stg_ad_units` — Cleaned ad unit dimension
Required mart model (models/marts/) — you build this:
- `fct_ad_events_daily` — Daily aggregated ad metrics at the grain of your choosing (document it)
  - Required metrics: `impressions`, `clicks`, `revenue_usd`, `fill_rate`
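The mart itself belongs in SQL, but it helps to pin down the metric definitions before writing it. A reference calculation in plain Python, where `fill_rate` is taken as filled events over total events (an assumed definition; document whichever one you choose):

```python
def daily_metrics(events: list[dict]) -> dict:
    """Reference definitions for the required metrics over cleaned events."""
    impressions = sum(1 for e in events if e["event_type"] == "impression")
    clicks = sum(1 for e in events if e["event_type"] == "click")
    # Guard against missing revenue on unfilled events.
    revenue_usd = sum(e.get("revenue_usd") or 0.0 for e in events)
    filled = sum(1 for e in events if e["is_filled"])
    fill_rate = filled / len(events) if events else 0.0
    return {
        "impressions": impressions,
        "clicks": clicks,
        "revenue_usd": revenue_usd,
        "fill_rate": fill_rate,
    }


events = [
    {"event_type": "impression", "revenue_usd": 0.10, "is_filled": True},
    {"event_type": "impression", "revenue_usd": None, "is_filled": False},
    {"event_type": "click", "revenue_usd": 0.0, "is_filled": True},
]
print(daily_metrics(events))
```

Whatever definitions you pick, the dbt model and any tests should agree with them exactly.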
Materialization strategy — a key design decision:
The redelivery batch means your pipeline runs multiple times with updated data. Consider carefully which dbt materialization strategy is appropriate for each model type:
- How should event models handle re-delivered events with corrected revenue?
- How should dimension models handle attributes that change over time?
- Document your choices and reasoning in DESIGN.md.
Nice to have (if time permits):
- `dim_publishers`, `dim_campaigns`, `dim_ad_units`
- Additional metrics: `viewable_impressions`, `ctr`, `viewability_rate`
- `fct_publisher_performance` — Publisher-level daily summary
Expectations:
- Follow dbt best practices: `source()`, `ref()`, staging → marts pattern, consistent naming
- Handle data quality issues you discover — and document your decisions
- Write tests in `schema.yml` files:
  - `unique` and `not_null` tests on primary / surrogate keys
  - At least one other test type (`accepted_values`, `relationships`, or a custom singular test)
- Write documentation:
  - Descriptions for all models and key columns
  - Document any assumptions or business logic applied
Write a DESIGN.md in the repo root covering:
- What you fixed — What bugs did you find in the existing pipeline? How did you identify and fix each one?
- Data quality issues — What problems did you find in the raw data? How did you handle each?
- Trade-offs — What shortcuts did you take? What would you change with more time?
While exploring the raw data and building your pipeline, you should notice anomalous patterns that suggest data quality or business integrity issues.
For each issue you find, document in your DESIGN.md:
- What the issue is — with specific evidence: exact row counts, affected IDs, date ranges, or example values. Vague descriptions (e.g., "there might be duplicates") are not sufficient.
- Why it matters from a business/revenue perspective
- How you handled it in your pipeline (or would handle it in production)
- The query or command you used to detect it — include the actual SQL or Python snippet
We expect candidates to find issues across multiple data sources and types. Some issues are visible from inspecting individual files; others require loading the data and running analytical queries across tables. The strongest candidates will find patterns that are only visible through aggregation or cross-referencing between different data sources.
Hint: Not all issues are in the event files. Inspect everything. Look at distributions, not just individual records. Cross-reference between sources.
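As a concrete example of the evidence-backed shape we're after, a tiny duplicate-count helper (the IDs below are made up; over real data you would feed it `event_id` values from the loaded table, or run the equivalent `GROUP BY ... HAVING count() > 1` in SQL):

```python
from collections import Counter


def duplicated(values: list[str]) -> dict[str, int]:
    """Return values occurring more than once, with exact counts.

    Exact counts per ID are the kind of evidence the brief asks for,
    rather than "there might be duplicates".
    """
    counts = Counter(values)
    return {v: n for v, n in counts.items() if n > 1}


# Hypothetical event_ids for illustration.
event_ids = ["e1", "e2", "e1", "e3", "e1"]
print(duplicated(event_ids))  # {'e1': 3}
```

The same pattern (count, group, compare against an expectation) generalises to orphaned foreign keys, impossible timestamps, and skewed distributions.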
2–4 hours. We are not judging line count — focus on clarity, pipeline design, correctness, and simplicity.
| Area | Weight | What we look for |
|---|---|---|
| Python / Dagster | 40% | Debugging the existing code, clean fixes, proper typing, error handling, idempotency, verifying data correctness |
| dbt | 40% | Best practices (sources, staging, marts), tests, documentation, data quality handling |
| Design Document | 20% | Clarity of reasoning, data quality findings, production readiness thinking |
We evaluate your engineering thinking, not your familiarity with specific tools. If you've never used ClickHouse or Dagster before, that's fine — we're looking at how you approach data problems, write clean Python, and model data in SQL. Candidates who come from a Postgres + Airflow background (or any other stack) are not at a disadvantage.
- ClickHouse-aware design (MergeTree engine choice, ordering key selection)
- Incremental loading in dbt
- Dagster partitions for daily event files
- Data quality checks in both Dagster and dbt
- Clear, well-structured code that's easy to review
```text
.
├── Makefile
├── README.md
├── DESIGN.md              ← your design document
├── pyproject.toml
├── docker-compose.yml
├── clickhouse/
│   ├── init-db.sql
│   └── users.xml
├── data/
│   └── raw/               # raw data files (pre-populated)
├── dagster_project/       ← partially-built pipeline (has bugs)
│   ├── __init__.py
│   ├── definitions.py
│   ├── resources.py
│   └── assets/
│       ├── __init__.py
│       └── ingest.py
└── dbt/                   ← your dbt project
    ├── Dockerfile
    ├── dbt_project.yml
    ├── profiles.yml
    ├── packages.yml
    ├── models/
    │   ├── sources.yml
    │   ├── example/
    │   ├── staging/       ← started (may have issues)
    │   └── marts/         ← you build these
    ├── tests/
    ├── macros/
    └── snapshots/
```
```bash
make up        # start ClickHouse
make setup     # install Python deps
make dagster   # start Dagster → materialize assets from initial data
make dbt-deps  # install dbt packages
make dbt-run   # build models
make dbt-test  # run tests

# Then we'll re-run your pipeline against the redelivery data and
# verify that corrected events and changed dimensions are handled correctly.
```

Then we'll verify:
- Raw tables in ClickHouse contain rows
- dbt models build and tests pass
- Re-delivered events correctly update (not duplicate) existing data
- Dimension attribute changes are handled appropriately
- Code is clean, well-structured, and handles the data quality issues
- Design document shows thoughtful engineering decisions