# Data Pipeline Testing and Validation Framework

[CI status badge]

A recruiter-friendly, end-to-end data quality project that turns a multi-table e-commerce dataset into a tested SQLite pipeline with automated validation in CI.

## Why This Project Stands Out

Most student data projects stop at analysis notebooks. This one is built like a small production-facing data workflow:

- multi-table relational data instead of a single flat CSV
- reusable validation functions instead of one-off assertions
- 89 automated Pytest checks covering schema, integrity, and business rules
- deterministic CI that reruns the pipeline and validation suite on every push and pull request
- a bad-data demo that proves the framework catches real failures

## At A Glance

| Area | What this project does |
| --- | --- |
| Dataset | Curated subset of the Olist Brazilian e-commerce dataset |
| Pipeline | CSV -> Pandas -> SQLite |
| Tables | `customers`, `orders`, `order_items`, `products` |
| Testing | 89 Pytest checks with reusable validators |
| Validation scope | Schema, nulls, duplicates, FKs, dtypes, dates, business rules |
| CI | GitHub Actions on every push and PR |
| Failure demo | Intentional bad data in `data/bad/` |

## What It Does

This project simulates a lightweight analytics pipeline with a strong testing layer:

- extracts curated Olist CSV files from `data/raw/`
- applies light cleaning and type normalization in Pandas
- loads four related tables into SQLite (sketched after this list)
- validates the data with reusable validator functions in `src/validation/validators.py`
- runs the full pipeline and test suite automatically through GitHub Actions
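
In outline, the load step can be as small as the sketch below. This is illustrative only: the real implementation lives in `src/pipeline/`, and the database filename and per-table CSV names here are assumptions.

```python
import sqlite3

import pandas as pd

TABLES = ["customers", "orders", "order_items", "products"]

def run_pipeline(raw_dir: str = "data/raw", db_path: str = "database/olist.db") -> None:
    """Read each curated CSV, apply light cleaning, and load it into SQLite."""
    conn = sqlite3.connect(db_path)
    try:
        for table in TABLES:
            # Assumed layout: one CSV per table, named after the table.
            df = pd.read_csv(f"{raw_dir}/{table}.csv")
            # Light cleaning: trim stray whitespace in string columns.
            for col in df.select_dtypes(include="object"):
                df[col] = df[col].str.strip()
            # Replace tables wholesale so reruns are deterministic.
            df.to_sql(table, conn, if_exists="replace", index=False)
    finally:
        conn.close()
```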

## Tech Stack

- Python
- Pandas
- Pytest
- SQLite
- GitHub Actions

## Dataset

The repository uses a curated subset of the Olist Brazilian E-Commerce Public Dataset from Kaggle. The original dataset contains roughly 100k Brazilian e-commerce orders from 2016 to 2018 across multiple related marketplace tables.

To keep the repo lightweight and CI-friendly, this project commits a deterministic subset of four tables:

- `customers`
- `orders`
- `order_items`
- `products`

Committed subset sizes:

- `customers`: 3,000 rows
- `orders`: 3,000 rows
- `order_items`: 10,171 rows
- `products`: 3,309 rows

Note: in this Olist slice, customer_id is effectively order-scoped, so preserving referential integrity for 3,000 orders also means keeping 3,000 customer rows.
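
One plausible way to build such a subset, sampling orders first and then filtering every related table by the surviving keys, is sketched below. This is a hypothetical reconstruction, not the repo's actual curation script; the input filenames are the Kaggle dataset's published ones.

```python
import pandas as pd

# Hypothetical curation sketch: a fixed random_state keeps the subset deterministic.
orders = pd.read_csv("olist_orders_dataset.csv").sample(n=3000, random_state=42)

customers = pd.read_csv("olist_customers_dataset.csv")
customers = customers[customers["customer_id"].isin(orders["customer_id"])]

order_items = pd.read_csv("olist_order_items_dataset.csv")
order_items = order_items[order_items["order_id"].isin(orders["order_id"])]

products = pd.read_csv("olist_products_dataset.csv")
products = products[products["product_id"].isin(order_items["product_id"])]

# Because customer_id is order-scoped in this slice, the customers frame
# ends up with exactly one row per sampled order: 3,000 rows for 3,000 orders.
```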

## Architecture


```mermaid
flowchart LR
    A["Curated CSVs in data/raw"] --> B["Pandas extraction"]
    B --> C["Light transforms and type normalization"]
    C --> D["SQLite warehouse tables"]
    D --> E["Reusable validation functions"]
    E --> F["Pytest suite"]
    F --> G["GitHub Actions CI"]
```

Additional docs live in the `docs/` directory.

## Validation Coverage

The framework includes reusable checks for the following (one validator's shape is sketched after the list):

- table existence
- schema and column order
- not-null enforcement
- uniqueness and composite uniqueness
- non-negative numeric fields
- allowed categorical values
- dtype-family validation
- foreign-key validation
- date ordering rules
- blank-string detection
- minimum row-count thresholds
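
The actual signatures live in `src/validation/validators.py`; as a sketch of the reusable shape, a foreign-key validator might look like this (the function name and return convention are assumptions):

```python
import sqlite3

def check_foreign_key(
    conn: sqlite3.Connection,
    child_table: str,
    child_column: str,
    parent_table: str,
    parent_column: str,
) -> list:
    """Return orphan values in child_table.child_column with no match in
    parent_table.parent_column. An empty list means the check passed."""
    # Identifiers are interpolated from trusted test config, never user input.
    query = f"""
        SELECT DISTINCT c.{child_column}
        FROM {child_table} AS c
        LEFT JOIN {parent_table} AS p
          ON c.{child_column} = p.{parent_column}
        WHERE c.{child_column} IS NOT NULL
          AND p.{parent_column} IS NULL
    """
    return [row[0] for row in conn.execute(query)]
```

Returning the offending values rather than a bare boolean keeps test failure messages actionable.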

## Example Quality Rules

A few representative rules enforced by the suite (a parameterized test sketch follows the list):

- `orders.customer_id` must exist in `customers.customer_id`
- `order_items.product_id` must exist in `products.product_id`
- `order_status` must be in the allowed status set
- `price >= 0` and `freight_value >= 0`
- `order_approved_at >= order_purchase_timestamp` when present
- `(order_id, order_item_id)` must be unique in `order_items`
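
Rules like these map naturally onto parameterized Pytest checks. A hedged sketch, reusing the `check_foreign_key` shape from above; the `db_conn` fixture is an assumption about the suite's plumbing:

```python
import pytest

# check_foreign_key: the validator sketched in Validation Coverage above.
# Each (child, fk, parent, pk) tuple becomes its own collected test case.
FK_RULES = [
    ("orders", "customer_id", "customers", "customer_id"),
    ("order_items", "product_id", "products", "product_id"),
]

@pytest.mark.parametrize("child, fk, parent, pk", FK_RULES)
def test_foreign_keys(db_conn, child, fk, parent, pk):
    # db_conn: hypothetical fixture yielding a sqlite3 connection to the built DB.
    orphans = check_foreign_key(db_conn, child, fk, parent, pk)
    assert not orphans, f"{child}.{fk} has orphan values, e.g. {orphans[:5]}"
```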

## Project Structure

```text
.
├── data/
│   ├── raw/
│   ├── bad/
│   └── processed/
├── database/
├── docs/
├── src/
│   ├── pipeline/
│   ├── validation/
│   └── utils/
├── tests/
└── .github/workflows/
```

## How To Run

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the pipeline:

```bash
python -m src.pipeline.run_pipeline
```

Run the validation suite:

```bash
pytest -v
```

Run the standalone validation summary:

```bash
python -m src.validation.validation_runner
```
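
The summary runner's internals aren't documented here; one illustrative shape, reusing the validator sketch from earlier, is a registry of named checks printed as PASS/FAIL lines (all names below are assumptions):

```python
import sqlite3

# check_foreign_key: the validator sketched in Validation Coverage above.
def main(db_path: str = "database/olist.db") -> None:  # db filename is an assumption
    conn = sqlite3.connect(db_path)
    # Each entry pairs a label with a zero-argument check returning failures.
    checks = [
        ("orders.customer_id -> customers.customer_id",
         lambda: check_foreign_key(conn, "orders", "customer_id",
                                   "customers", "customer_id")),
        ("order_items.product_id -> products.product_id",
         lambda: check_foreign_key(conn, "order_items", "product_id",
                                   "products", "product_id")),
    ]
    for name, check in checks:
        failures = check()
        print(f"{'PASS' if not failures else 'FAIL'}  {name}")
    conn.close()

if __name__ == "__main__":
    main()
```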

## Bad Data Demo

The `data/bad/` folder contains intentionally corrupted copies of the same tables, including:

- null primary-key values
- duplicate IDs and composite keys
- orphan foreign keys
- invalid `order_status` values
- negative numeric values
- invalid date ordering
- schema mismatch from a removed column

To reproduce a failing run against the corrupted dataset:

```bash
PIPELINE_RAW_DIR=data/bad pytest -v
```
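
This override works because the pipeline resolves its input directory from the environment. A minimal pattern for consuming the variable (an assumption about the implementation detail, not a documented API):

```python
import os

# Fall back to the curated data when no override is set.
RAW_DIR = os.environ.get("PIPELINE_RAW_DIR", "data/raw")
```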


## CI

GitHub Actions runs the full workflow on every push and pull request:

1. checks out the repository
2. sets up Python 3.12 with pip caching
3. installs dependencies from `requirements.txt`
4. runs `python -m src.pipeline.run_pipeline`
5. runs `pytest -v`

## GitHub Pinned Blurb

Use this short blurb for the repo description or a pinned-project summary:

> Built a reusable data validation framework for a multi-table e-commerce pipeline using Pandas, SQLite, Pytest, and GitHub Actions. The project loads a curated Olist dataset subset into SQLite, runs 89 automated quality checks for schema, nulls, duplicates, referential integrity, and business rules, and includes a bad-data demo that proves the framework catches real failures.

## Interview Summary

A concise way to describe the project:

> I used a multi-table subset of the Olist Kaggle e-commerce dataset to build a realistic pipeline instead of validating a single flat CSV. I ingested the data with Pandas, loaded it into SQLite, built reusable validation functions, and exercised them through parameterized Pytest checks for schema, nulls, duplicates, foreign keys, data types, and business rules. I then integrated GitHub Actions so every push and pull request reruns the pipeline and test suite automatically.

## Resume Version

**Data Pipeline Testing and Validation Framework** | Python, Pandas, Pytest, SQLite, GitHub Actions

- Built a reusable validation framework for a multi-table e-commerce pipeline using Pandas and SQLite, implementing 89 parameterized Pytest checks for schema enforcement, null detection, duplicate detection, referential integrity, and business-rule validation
- Integrated a GitHub Actions CI pipeline that runs automated data quality tests on every push and pull request, making validation reproducible across code changes
