You are joining the data team at Venatus, a programmatic advertising company. Raw data from our ad-serving platform lands as files in a data lake landing zone.
A previous engineer started building the ingestion pipeline and some dbt staging models, but left before the work was tested. The code in dagster_project/ and dbt/models/staging/ has not been run against the actual data.
Your task is to:
- Get the existing pipeline working — the Python ingestion code has bugs that you'll need to find and fix by running it.
- Extend the dbt layer — staging models are started but incomplete. Build the mart models.
- Investigate the data — find and document as many data quality issues as you can. There are over 20.
- Document your design decisions and trade-offs in DESIGN.md.
- Commit frequently — we want to observe the process of discovery, not just a single final commit.
AI use is encouraged, but this challenge requires running code, hitting real errors, and debugging iteratively. Generating code without testing it will not work. AI will not find all of the data quality issues on its own.
This exercise is designed to assess the core skills of a data engineer: debugging existing code, handling messy real-world data, data modeling, and writing clean Python and SQL.
| Tool | Purpose | Access |
|---|---|---|
| ClickHouse | Columnar data warehouse (empty tables pre-created) | http://localhost:8123/play |
| Dagster | Pipeline orchestration framework | Runs locally via dagster dev |
| dbt (dbt-clickhouse) | Data transformation framework | Runs via Docker |
| Python | Ingestion pipeline code | Local environment |
Everything runs locally or in Docker. A few commands to get started.
New to ClickHouse or Dagster? That's completely fine — we don't expect prior experience with either.
- ClickHouse is very similar to PostgreSQL for SQL purposes. The main difference is the table engine syntax (already set up for you in `clickhouse/init-db.sql`). The SQL dialect is nearly standard — most things you'd write in Postgres will work. See the ClickHouse SQL Reference if you get stuck.
- Dagster is a Python-first orchestration framework similar to Airflow but with a more modern API. Instead of DAGs and operators, you define assets (decorated Python functions). The Dagster tutorial covers the basics in ~20 minutes. If you're more comfortable with Airflow concepts, think of assets as tasks that produce named outputs.
- The code in `dagster_project/` is a partially-complete implementation. The ClickHouse resource (`resources.py`) and Dagster wiring (`definitions.py`) are working — the bugs are in the data handling code (`assets/ingest.py`). You should not need to learn ClickHouse internals to fix them.
- Docker and Docker Compose (v2+)
- Python 3.10+
- `make` (standard on macOS / Linux)
- Git
- DBeaver — optional UI tool for ClickHouse, if desired
```bash
git clone <this-repo>
cd data-coding-challenge

# Start ClickHouse
make up

# Install Python dependencies
make setup
```

Before writing any code, explore the raw data files:
```bash
# See what's in the raw data
ls -la data/raw/
ls -la data/raw/events/ | head -20

# Look at the structure of event files
head -5 data/raw/events/*.jsonl | head -20

# Look at dimension files
head -5 data/raw/publishers.csv
head -5 data/raw/campaigns/campaigns_export.csv
head -5 data/raw/ad_units.csv
```

Also verify ClickHouse has empty tables ready:
```sql
-- Open http://localhost:8123/play
SHOW TABLES FROM raw;
-- Should show: ad_events, ad_units, campaigns, publishers (all empty)
```

The ingestion pipeline has already been started. Try running it:
```bash
# Start Dagster dev server
make dagster
# Open http://localhost:3000 and try to materialize the assets
# Some will fail — that's expected. Debug and fix the issues.

# After data is loaded, run dbt
make dbt-deps
make dbt-run   # Some models may fail — fix them
make dbt-test  # Some tests may fail — investigate why
```

The following files are provided in data/raw/:
Daily ad-serving event files in JSON Lines format. One file per day, covering ~30 days.
Each line is a JSON object representing one ad event. Inspect the files carefully — the schema is not entirely consistent across all files.
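The schema-drift warning above is easy to verify programmatically before writing any pipeline code. A minimal sketch of a field census over JSONL lines (the sample lines and nested `geo` field are hypothetical, not taken from the actual files):

```python
import json
from collections import Counter


def field_census(jsonl_lines: list[str]) -> dict[str, int]:
    """Return top-level fields that appear in only SOME lines.

    A field present on every line is boring; a field with a partial
    count is evidence of schema drift between files or days.
    """
    counts: Counter[str] = Counter()
    total = 0
    for line in jsonl_lines:
        if line.strip():
            counts.update(json.loads(line).keys())
            total += 1
    return {field: n for field, n in counts.items() if n < total}


# Hypothetical sample: the second line nests geography differently.
sample = [
    '{"event_id": "a1", "country": "GB"}',
    '{"event_id": "a2", "geo": {"country": "GB"}}',
]
print(field_census(sample))  # {'country': 1, 'geo': 1}
```

Running the same census per file (e.g., over `data/raw/events/*.jsonl`) shows exactly which days the structure changes.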
Typical fields include:
| Field | Type | Notes |
|---|---|---|
| `event_id` | string | UUID event identifier |
| `event_type` | string | impression, click, viewable_impression |
| `timestamp` | string | ISO 8601 timestamp |
| `publisher_id` | integer | Publisher (website) |
| `ad_unit_id` | string | Ad placement identifier |
| `campaign_id` | integer | Advertiser campaign — absent for unfilled events |
| `advertiser_id` | integer | Advertiser — absent for unfilled events |
| `device_type` | string | desktop, mobile, tablet, ctv |
| `browser` | string | Browser name |
| `revenue_usd` | float | Revenue in USD |
| `bid_floor_usd` | float | Minimum bid price |
| `is_filled` | boolean | Whether an ad was served |
| `site_domain` | string | Domain where ad was served |
| Geography fields | varies | Inspect the data — structure changes between files |
Publisher / website dimension data (20 publishers). Standard CSV format.
Advertiser campaign dimension data (27 campaigns). Exported from a different system — inspect the format carefully.
Ad unit / placement configuration (60 ad units across all publishers). Standard CSV format.
In ad-tech, revenue figures for recent days are not final when first reported. SSP/DSP reconciliation means that the last 2–3 days of data routinely get corrected as final figures settle. Our production pipeline re-ingests the most recent days every run.
The redelivery/ directory simulates this. It contains a second delivery of data that arrived one day later:
- `redelivery/events/` — The last 3 days of event files, re-delivered with corrections. Some events have updated revenue figures. Some new late-arriving events appear that weren't in the initial batch.
- `redelivery/publishers.csv` — Updated publisher data where some attributes have changed (e.g., account manager reassignment, category reclassification).
- `redelivery/campaigns/campaigns_export.csv` — Updated campaign data where some attributes have changed (e.g., budget increase, status change from completed to active when a campaign is extended).
- `redelivery/ad_units.csv` — Unchanged (included for completeness).
Your pipeline should be designed to handle being run against both the initial data and the redelivery data. Think carefully about how your dbt models should handle:
- Event data that gets corrected and re-delivered (same `event_id`, different values)
- Dimension data where attributes change over time
This is a core design decision — document your reasoning in DESIGN.md.
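One common approach to the corrected-events question (not necessarily the one you should pick) is "keep the latest version per key", which is also the semantic that ClickHouse's ReplacingMergeTree or a dbt incremental model with a `unique_key` gives you. A sketch of the idea, assuming a `batch_id` marker you would attach at ingest time (the field name and sample values are hypothetical):

```python
def latest_version(rows: list[dict]) -> list[dict]:
    """Collapse re-delivered events to one row per event_id,
    keeping the row from the most recent batch."""
    best: dict[str, dict] = {}
    for row in rows:
        current = best.get(row["event_id"])
        if current is None or row["batch_id"] > current["batch_id"]:
            best[row["event_id"]] = row
    return list(best.values())


rows = [
    {"event_id": "e1", "revenue_usd": 0.10, "batch_id": 1},
    {"event_id": "e1", "revenue_usd": 0.12, "batch_id": 2},  # corrected
    {"event_id": "e2", "revenue_usd": 0.05, "batch_id": 2},  # late-arriving
]
print(latest_version(rows))  # e1 at 0.12, plus the new e2
```

Whether you implement this in Python, in the table engine, or in dbt is itself a trade-off worth a paragraph in DESIGN.md.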
Note: Treat this data as you would data from a production system. Explore it thoroughly before writing your pipeline — not all of it is clean and not all files have identical schemas.
The ingestion code in dagster_project/assets/ingest.py was started by a previous engineer but has never been tested against the actual data files. The ClickHouse resource (resources.py) and Dagster definitions (definitions.py) are working — the bugs are in the data handling logic.
Required:
- Get all four ingestion assets (publishers, ad_units, campaigns, ad_events) materializing successfully
- Data must be loaded correctly — not just without errors, but with accurate values
- Handle both the initial delivery and the redelivery batch
- Document what you fixed and why in DESIGN.md
Expectations:
- Debug by running the pipeline, reading error messages, and inspecting the data
- Clean, readable Python fixes with proper typing
- Verify loaded data is correct (not just that the pipeline didn't crash)
- Idempotent loading (running the pipeline twice should be safe)
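One way to reason about the idempotency requirement: scope each load to a natural unit (e.g., one day's file) and replace that unit wholesale before inserting. The sketch below uses an in-memory stand-in for the ClickHouse client purely to illustrate the pattern; the real implementation would issue the equivalent delete/insert through the project's resource wrapper:

```python
class FakeClient:
    """Minimal in-memory stand-in for a warehouse client (illustration only)."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def delete_day(self, day: str) -> None:
        self.rows = [r for r in self.rows if r["day"] != day]

    def insert(self, rows: list[dict]) -> None:
        self.rows.extend(rows)


def load_day(client: FakeClient, day: str, rows: list[dict]) -> None:
    # Delete-then-insert makes re-running the same day safe:
    # the second run replaces rather than duplicates.
    client.delete_day(day)
    client.insert(rows)


client = FakeClient()
batch = [{"day": "2024-01-01", "event_id": "a"}]
load_day(client, "2024-01-01", batch)
load_day(client, "2024-01-01", batch)  # second run does not duplicate
print(len(client.rows))  # 1
```

Delete-then-insert is only one option; dedup-on-read via the table engine is another, with different consistency trade-offs.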
Nice to have (if time permits):
- Partition-aware assets (one partition per event day file)
- Data quality checks within Dagster (row counts, schema validation)
- A sensor or schedule for detecting new files
Staging models have been started in dbt/models/staging/ but may have issues. Fix any problems, then build the mart layer.
Staging models (models/staging/) — provided, may need fixes:
- `stg_ad_events` — Deduplicated, cleaned ad events with consistent schema
- `stg_publishers` — Cleaned publisher dimension
- `stg_campaigns` — Cleaned campaign dimension with normalized dates
- `stg_ad_units` — Cleaned ad unit dimension
Required mart model (models/marts/) — you build this:
- `fct_ad_events_daily` — Daily aggregated ad metrics at the grain of your choosing (document it)
  - Required metrics: `impressions`, `clicks`, `revenue_usd`, `fill_rate`
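The mart itself belongs in SQL, but it helps to pin down the metric definitions before writing it. A reference calculation in plain Python, where `fill_rate` is taken as filled events over total events (an assumed definition; document whichever one you choose):

```python
def daily_metrics(events: list[dict]) -> dict:
    """Reference definitions for the required metrics over cleaned events."""
    impressions = sum(1 for e in events if e["event_type"] == "impression")
    clicks = sum(1 for e in events if e["event_type"] == "click")
    # Guard against missing revenue on unfilled events.
    revenue_usd = sum(e.get("revenue_usd") or 0.0 for e in events)
    filled = sum(1 for e in events if e["is_filled"])
    fill_rate = filled / len(events) if events else 0.0
    return {
        "impressions": impressions,
        "clicks": clicks,
        "revenue_usd": revenue_usd,
        "fill_rate": fill_rate,
    }


events = [
    {"event_type": "impression", "revenue_usd": 0.10, "is_filled": True},
    {"event_type": "impression", "revenue_usd": None, "is_filled": False},
    {"event_type": "click", "revenue_usd": 0.0, "is_filled": True},
]
print(daily_metrics(events))
```

Whatever definitions you pick, the dbt model and any tests should agree with them exactly.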
Materialization strategy — a key design decision:
The redelivery batch means your pipeline runs multiple times with updated data. Consider carefully which dbt materialization strategy is appropriate for each model type:
- How should event models handle re-delivered events with corrected revenue?
- How should dimension models handle attributes that change over time?
- Document your choices and reasoning in DESIGN.md.
Nice to have (if time permits):
- `dim_publishers`, `dim_campaigns`, `dim_ad_units`
- Additional metrics: `viewable_impressions`, `ctr`, `viewability_rate`
- `fct_publisher_performance` — Publisher-level daily summary
Expectations:
- Follow dbt best practices: `source()`, `ref()`, staging → marts pattern, consistent naming
- Handle data quality issues you discover — and document your decisions
- Write tests in `schema.yml` files:
  - `unique` and `not_null` tests on primary / surrogate keys
  - At least one other test type (`accepted_values`, `relationships`, or a custom singular test)
- Write documentation:
  - Descriptions for all models and key columns
  - Document any assumptions or business logic applied
Write a DESIGN.md in the repo root covering:
- What you fixed — What bugs did you find in the existing pipeline? How did you identify and fix each one?
- Data quality issues — What problems did you find in the raw data? How did you handle each?
- Trade-offs — What shortcuts did you take? What would you change with more time?
While exploring the raw data and building your pipeline, you should notice anomalous patterns that suggest data quality or business integrity issues.
For each issue you find, document in your DESIGN.md:
- What the issue is — with specific evidence: exact row counts, affected IDs, date ranges, or example values. Vague descriptions (e.g., "there might be duplicates") are not sufficient.
- Why it matters from a business/revenue perspective
- How you handled it in your pipeline (or would handle it in production)
- The query or command you used to detect it — include the actual SQL or Python snippet
We expect candidates to find issues across multiple data sources and types. Some issues are visible from inspecting individual files; others require loading the data and running analytical queries across tables. The strongest candidates will find patterns that are only visible through aggregation or cross-referencing between different data sources.
Hint: Not all issues are in the event files. Inspect everything. Look at distributions, not just individual records. Cross-reference between sources.
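As a concrete example of the evidence-backed shape we're after, a tiny duplicate-count helper (the IDs below are made up; over real data you would feed it `event_id` values from the loaded table, or run the equivalent `GROUP BY ... HAVING count() > 1` in SQL):

```python
from collections import Counter


def duplicated(values: list[str]) -> dict[str, int]:
    """Return values occurring more than once, with exact counts.

    Exact counts per ID are the kind of evidence the brief asks for,
    rather than "there might be duplicates".
    """
    counts = Counter(values)
    return {v: n for v, n in counts.items() if n > 1}


# Hypothetical event_ids for illustration.
event_ids = ["e1", "e2", "e1", "e3", "e1"]
print(duplicated(event_ids))  # {'e1': 3}
```

The same pattern (count, group, compare against an expectation) generalises to orphaned foreign keys, impossible timestamps, and skewed distributions.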
2–4 hours. We are not judging line count — focus on clarity, pipeline design, correctness, and simplicity.
| Area | Weight | What we look for |
|---|---|---|
| Python / Dagster | 40% | Debugging the existing code, clean fixes, proper typing, error handling, idempotency, verifying data correctness |
| dbt | 40% | Best practices (sources, staging, marts), tests, documentation, data quality handling |
| Design Document | 20% | Clarity of reasoning, data quality findings, production readiness thinking |
We evaluate your engineering thinking, not your familiarity with specific tools. If you've never used ClickHouse or Dagster before, that's fine — we're looking at how you approach data problems, write clean Python, and model data in SQL. Candidates who come from a Postgres + Airflow background (or any other stack) are not at a disadvantage.
- ClickHouse-aware design (MergeTree engine choice, ordering key selection)
- Incremental loading in dbt
- Dagster partitions for daily event files
- Data quality checks in both Dagster and dbt
- Clear, well-structured code that's easy to review
```text
.
├── Makefile
├── README.md
├── DESIGN.md              ← your design document
├── pyproject.toml
├── docker-compose.yml
├── clickhouse/
│   ├── init-db.sql
│   └── users.xml
├── data/
│   └── raw/               # raw data files (pre-populated)
├── dagster_project/       ← partially-built pipeline (has bugs)
│   ├── __init__.py
│   ├── definitions.py
│   ├── resources.py
│   └── assets/
│       ├── __init__.py
│       └── ingest.py
└── dbt/                   ← your dbt project
    ├── Dockerfile
    ├── dbt_project.yml
    ├── profiles.yml
    ├── packages.yml
    ├── models/
    │   ├── sources.yml
    │   ├── example/
    │   ├── staging/       ← started (may have issues)
    │   └── marts/         ← you build these
    ├── tests/
    ├── macros/
    └── snapshots/
```
```bash
make up        # start ClickHouse
make setup     # install Python deps
make dagster   # start Dagster → materialize assets from initial data
make dbt-deps  # install dbt packages
make dbt-run   # build models
make dbt-test  # run tests

# Then we'll re-run your pipeline against the redelivery data and
# verify that corrected events and changed dimensions are handled correctly.
```

Then we'll verify:
- Raw tables in ClickHouse contain rows
- dbt models build and tests pass
- Re-delivered events correctly update (not duplicate) existing data
- Dimension attribute changes are handled appropriately
- Code is clean, well-structured, and handles the data quality issues
- Design document shows thoughtful engineering decisions