A recruiter-friendly, end-to-end data quality project that turns a multi-table e-commerce dataset into a tested SQLite pipeline with automated validation in CI.
Most student data projects stop at analysis notebooks. This one is built like a small production-facing data workflow:
- multi-table relational data instead of a single flat CSV
- reusable validation functions instead of one-off assertions
- 89 automated Pytest checks covering schema, integrity, and business rules
- deterministic CI that reruns the pipeline and validation suite on every push and pull request
- a bad-data demo that proves the framework catches real failures
| Area | What this project does |
|---|---|
| Dataset | Curated subset of the Olist Brazilian e-commerce dataset |
| Pipeline | CSV -> Pandas -> SQLite |
| Tables | customers, orders, order_items, products |
| Testing | 89 Pytest checks with reusable validators |
| Validation scope | schema, nulls, duplicates, FKs, dtypes, dates, business rules |
| CI | GitHub Actions on every push and PR |
| Failure demo | Intentional bad data in data/bad/ |
This project simulates a lightweight analytics pipeline with a strong testing layer:
- extracts curated Olist CSV files from `data/raw/`
- applies light cleaning and type normalization in Pandas
- loads four related tables into SQLite
- validates the data with reusable validator functions in `src/validation/validators.py`
- runs the full pipeline and test suite automatically through GitHub Actions
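As a rough sketch of the extract-and-load step (illustrative only; the real logic lives in `src/pipeline/`, and the file names and database path below are assumptions):

```python
import sqlite3

import pandas as pd

# Hypothetical sketch of the CSV -> Pandas -> SQLite flow; table names match
# the repo, but file names and the database path are assumptions.
TABLES = ["customers", "orders", "order_items", "products"]

conn = sqlite3.connect("database/warehouse.db")
for table in TABLES:
    df = pd.read_csv(f"data/raw/{table}.csv")
    if table == "orders":
        # Light type normalization: parse the order timestamp columns.
        for col in ["order_purchase_timestamp", "order_approved_at"]:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    df.to_sql(table, conn, if_exists="replace", index=False)
conn.close()
```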
Tech stack:
- Python
- Pandas
- Pytest
- SQLite
- GitHub Actions
The repository uses a curated subset of the Olist Brazilian E-Commerce Public Dataset from Kaggle. The original dataset contains roughly 100k Brazilian e-commerce orders from 2016 to 2018 across multiple related marketplace tables.
To keep the repo lightweight and CI-friendly, this project commits a deterministic subset of four tables:
- `customers`
- `orders`
- `order_items`
- `products`
Committed subset sizes:
- `customers`: 3,000 rows
- `orders`: 3,000 rows
- `order_items`: 10,171 rows
- `products`: 3,309 rows
Note: in this Olist slice, customer_id is effectively order-scoped, so preserving referential integrity for 3,000 orders also means keeping 3,000 customer rows.
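One way such a deterministic subset can be produced (an assumption about the approach, not the committed script; the Kaggle file names below are illustrative) is to sample orders with a fixed seed and filter the related tables by key:

```python
import pandas as pd

# Illustrative subsetting logic; the committed subset may have been built differently.
orders = pd.read_csv("olist_orders_dataset.csv").sample(n=3000, random_state=42)
customers = pd.read_csv("olist_customers_dataset.csv")
order_items = pd.read_csv("olist_order_items_dataset.csv")
products = pd.read_csv("olist_products_dataset.csv")

# Keep only rows that preserve referential integrity for the sampled orders.
# Because customer_id is order-scoped here, 3,000 orders keep 3,000 customers.
customers = customers[customers["customer_id"].isin(orders["customer_id"])]
order_items = order_items[order_items["order_id"].isin(orders["order_id"])]
products = products[products["product_id"].isin(order_items["product_id"])]
```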
```mermaid
flowchart LR
    A["Curated CSVs in data/raw"] --> B["Pandas extraction"]
    B --> C["Light transforms and type normalization"]
    C --> D["SQLite warehouse tables"]
    D --> E["Reusable validation functions"]
    E --> F["Pytest suite"]
    F --> G["GitHub Actions CI"]
```
The framework includes reusable checks for:
- table existence
- schema and column order
- not-null enforcement
- uniqueness and composite uniqueness
- non-negative numeric fields
- allowed categorical values
- dtype-family validation
- foreign-key validation
- date ordering rules
- blank-string detection
- minimum row-count thresholds
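In spirit, a couple of these validators might look like the sketch below (the actual functions in `src/validation/validators.py` may differ in names and signatures):

```python
import pandas as pd

def check_no_nulls(df: pd.DataFrame, columns: list[str]) -> list[str]:
    """Return the names of the given columns that contain null values."""
    return [col for col in columns if df[col].isna().any()]

def check_foreign_key(child: pd.DataFrame, child_col: str,
                      parent: pd.DataFrame, parent_col: str) -> pd.DataFrame:
    """Return child rows whose non-null key has no match in the parent table."""
    orphan_mask = ~child[child_col].isin(parent[parent_col])
    return child[orphan_mask & child[child_col].notna()]
```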
A few representative rules enforced by the suite:
- `orders.customer_id` must exist in `customers.customer_id`
- `order_items.product_id` must exist in `products.product_id`
- `order_status` must be in the allowed status set
- `price >= 0` and `freight_value >= 0`
- `order_approved_at >= order_purchase_timestamp` when present
- `(order_id, order_item_id)` must be unique in `order_items`
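The foreign-key rules, for example, lend themselves to a parameterized test along these lines (illustrative; the `tables` fixture, assumed to map table names to DataFrames, is a hypothetical stand-in for the suite's real fixtures):

```python
import pytest

# Illustrative parameterized FK test; fixture and rule names are assumptions.
FK_RULES = [
    ("orders", "customer_id", "customers", "customer_id"),
    ("order_items", "product_id", "products", "product_id"),
]

@pytest.mark.parametrize("child, child_col, parent, parent_col", FK_RULES)
def test_foreign_keys(tables, child, child_col, parent, parent_col):
    # Every non-null child key must appear in the parent key column.
    orphans = ~tables[child][child_col].dropna().isin(tables[parent][parent_col])
    assert not orphans.any(), f"orphan {child_col} values found in {child}"
```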
```
.
├── data/
│   ├── raw/
│   ├── bad/
│   └── processed/
├── database/
├── docs/
├── src/
│   ├── pipeline/
│   ├── validation/
│   └── utils/
├── tests/
└── .github/workflows/
```
Install dependencies:

```bash
pip install -r requirements.txt
```

Run the pipeline:

```bash
python -m src.pipeline.run_pipeline
```

Run the validation suite:

```bash
pytest -v
```

Run the standalone validation summary:

```bash
python -m src.validation.validation_runner
```

The `data/bad/` folder contains intentionally corrupted copies of the same tables, including:
- null primary-key values
- duplicate IDs and composite keys
- orphan foreign keys
- invalid `order_status` values
- negative numeric values
- invalid date ordering
- schema mismatch from a removed column
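To give a flavor of the corruption (an illustrative sketch, not the actual generation script; the CSV file names are assumptions), an orphan foreign key and a negative price can be injected like this:

```python
import pandas as pd

# Illustrative corruption of a clean table; the committed data/bad/ files
# were prepared ahead of time and need not have been generated this way.
items = pd.read_csv("data/raw/order_items.csv")
items.loc[0, "product_id"] = "nonexistent_product"  # orphan foreign key
items.loc[1, "price"] = -10.0                       # violates price >= 0
items.to_csv("data/bad/order_items.csv", index=False)
```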
To reproduce a failing run against the corrupted dataset:
```bash
PIPELINE_RAW_DIR=data/bad pytest -v
```
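This works because the pipeline resolves its input directory from the environment. A minimal sketch of that pattern (the actual config code may differ):

```python
import os
from pathlib import Path

# Assumed pattern: default to data/raw, but let PIPELINE_RAW_DIR override it,
# which is how the bad-data demo points the same pipeline at data/bad.
RAW_DIR = Path(os.environ.get("PIPELINE_RAW_DIR", "data/raw"))
```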
GitHub Actions runs the full workflow on every push and pull request:
- checks out the repository
- sets up Python 3.12 with pip caching
- installs dependencies from `requirements.txt`
- runs `python -m src.pipeline.run_pipeline`
- runs `pytest -v`
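A workflow implementing those steps would look roughly like this (a sketch; the committed workflow in `.github/workflows/` may differ in details):

```yaml
name: data-quality
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
      - run: pip install -r requirements.txt
      - run: python -m src.pipeline.run_pipeline
      - run: pytest -v
```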
Use this short blurb for the repo description or a pinned-project summary:
Built a reusable data validation framework for a multi-table e-commerce pipeline using Pandas, SQLite, Pytest, and GitHub Actions. The project loads a curated Olist dataset subset into SQLite, runs 89 automated quality checks for schema, nulls, duplicates, referential integrity, and business rules, and includes a bad-data demo that proves the framework catches real failures.
A concise way to describe the project:
I used a multi-table subset of the Olist Kaggle e-commerce dataset to build a realistic pipeline instead of validating a single flat CSV. I ingested the data with Pandas, loaded it into SQLite, built reusable validation functions, and exercised them through parameterized Pytest checks for schema, nulls, duplicates, foreign keys, data types, and business rules. I then integrated GitHub Actions so every push and pull request reruns the pipeline and test suite automatically.
Data Pipeline Testing and Validation Framework | Python, Pandas, Pytest, SQLite, GitHub Actions
- Built a reusable validation framework for a multi-table e-commerce pipeline using Pandas and SQLite, implementing 89 parameterized Pytest checks for schema enforcement, null detection, duplicate detection, referential integrity, and business-rule validation
- Integrated a GitHub Actions CI pipeline to run automated data quality tests on every push and pull request, making validation reproducible across code changes
