Merged
Commits
19 commits
ab5c6b8
feat: UV migration, integration test refactor, ClickBench/ELTBench su…
mwc360 Feb 24, 2026
7384b51
test: use scale_factor=0.1 for faster CI integration runs
mwc360 Feb 24, 2026
c0c3b21
docs: add integration test README with uv sync commands per engine
mwc360 Feb 24, 2026
88e5dc6
fix: DaftELTBench path handling and API compatibility
mwc360 Feb 24, 2026
2cdd2fa
feat: add ClickBench support for Polars and Daft engines
mwc360 Feb 24, 2026
9555ee2
feat: auto-generate per-engine benchmark reports via pytest_sessionfi…
mwc360 Feb 24, 2026
4c4d25f
refactor: rename docs/benchmarks -> reports/coverage
mwc360 Feb 24, 2026
20fbb48
fix local issues with spark tests
mwc360 Feb 25, 2026
ebb9e36
refactor: rename test_tpc_* -> test_* integration test files
mwc360 Feb 25, 2026
aeec74d
fix version
mwc360 Feb 25, 2026
d22c2d1
use DeltaTable merge buillder due to error on local spark: DELTA_MERG…
mwc360 Feb 25, 2026
9f41676
uv lock
mwc360 Feb 25, 2026
393d5ae
chore: bump version
mwc360 Feb 25, 2026
9ad3969
fix: Python 3.8 type hint compat + Daft URI and glob fixes
mwc360 Feb 25, 2026
e1ef3f2
fix: cast Decimal columns to Float64 in Polars load_parquet_to_delta
mwc360 Feb 25, 2026
e3e02bb
fix: add 'from __future__ import annotations' to duckdb, polars, sail…
mwc360 Feb 25, 2026
325d4fe
fix: revert to_file_uri on Daft engine working dir in test_daft.py
mwc360 Feb 25, 2026
1d7290c
fix: use strict=False on Polars Decimal->Float64 cast
mwc360 Feb 25, 2026
a33235a
ci: shorten integration test job names to just the engine
mwc360 Feb 25, 2026
180 changes: 180 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,180 @@
# LakeBench Codebase Reference

> Quick-reference for Copilot and contributors. Keep this in sync when adding major features.

---

## What is LakeBench?

LakeBench is a **Python-native, multi-modal benchmarking framework** for evaluating performance across multiple lakehouse compute engines and ELT scenarios. It supports industry-standard benchmarks (TPC-DS, TPC-H, ClickBench) and a novel ELT-focused benchmark (ELTBench), all installable via `pip`.

---

## Project Layout

```
src/lakebench/
├── __init__.py
├── benchmarks/
│   ├── base.py           # BaseBenchmark ABC — result schema, timing, post_results()
│   ├── elt_bench/        # ELTBench: load, transform, merge, maintain, query
│   ├── tpcds/            # TPC-DS: 99 queries, 24 tables
│   ├── tpch/             # TPC-H: 22 queries, 8 tables
│   └── clickbench/       # ClickBench: 43 queries on clickstream data
├── datagen/
│   ├── tpch.py           # TPCHDataGenerator (uses tpchgen-rs, ~10x faster than alternatives)
│   ├── tpcds.py          # TPCDSDataGenerator (wraps DuckDB TPC-DS extension)
│   └── clickbench.py     # Downloads dataset from ClickHouse host
├── engines/
│   ├── base.py           # BaseEngine ABC — fsspec, runtime detection, result writing
│   ├── spark.py          # Generic Spark engine
│   ├── fabric_spark.py   # Microsoft Fabric Spark (auto-authenticates via notebookutils)
│   ├── synapse_spark.py  # Azure Synapse Spark
│   ├── hdi_spark.py      # HDInsight Spark
│   ├── duckdb.py         # DuckDB
│   ├── polars.py         # Polars
│   ├── daft.py           # Daft
│   ├── sail.py           # Sail (PySpark-compatible engine)
│   └── delta_rs.py       # Shared DeltaRs write helper (used by non-Spark engines)
└── utils/
    ├── query_utils.py    # transpile_and_qualify_query(), get_table_name_from_ddl()
    ├── path_utils.py     # abfss_to_https(), to_unix_path()
    └── timer.py          # Context-manager timer; stores results for post_results()
```

---

## Core Abstractions

### `BaseEngine` (`engines/base.py`)
Abstract base for all compute engines.

| Attribute | Description |
|---|---|
| `SQLGLOT_DIALECT` | SQLGlot dialect string for auto-transpilation (e.g. `"duckdb"`) |
| `SUPPORTS_SCHEMA_PREP` | Whether the engine can create an empty schema-defined table before data load |
| `SUPPORTS_MOUNT_PATH` | Whether the engine can use mount-style URIs (`/mnt/...`) |
| `TABLE_FORMAT` | Always `'delta'` |
| `schema_or_working_directory_uri` | Base path where Delta tables are stored |
| `storage_options` | Dict passed through to DeltaRs / fsspec for cloud auth |
| `extended_engine_metadata` | Dict of key/value pairs appended to benchmark results |

Key methods: `get_total_cores()`, `get_compute_size()`, `get_job_cost(duration_ms)`, `create_schema_if_not_exists()`, `_append_results_to_delta()`.

Runtime is auto-detected at init via `_detect_runtime()` — returns `"fabric"`, `"synapse"`, `"databricks"`, `"colab"`, or `"local_unknown"`.
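Detection amounts to checking runtime-specific environment markers. A minimal sketch of the pattern — the variable names checked here are illustrative assumptions for demonstration, not the actual signals `_detect_runtime()` inspects:

```python
import os

def detect_runtime(env=None) -> str:
    """Sketch of runtime detection. The markers below are illustrative
    assumptions, not the exact signals LakeBench uses."""
    env = os.environ if env is None else env
    if "MSFABRIC_RUNTIME" in env:            # hypothetical Fabric marker
        return "fabric"
    if "SYNAPSE_RUNTIME" in env:             # hypothetical Synapse marker
        return "synapse"
    if "DATABRICKS_RUNTIME_VERSION" in env:  # present on Databricks clusters
        return "databricks"
    if "COLAB_RELEASE_TAG" in env:           # present in Google Colab
        return "colab"
    return "local_unknown"
```

The first matching marker wins, and `"local_unknown"` is the fallback when nothing matches.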

### `BaseBenchmark` (`benchmarks/base.py`)
Abstract base for all benchmarks.

| Attribute | Description |
|---|---|
| `BENCHMARK_IMPL_REGISTRY` | `Dict[EngineClass → ImplClass]` — maps engines to optional engine-specific implementations |
| `RESULT_SCHEMA` | Canonical 22-column result schema (see below) |
| `VERSION` | Benchmark version string |

The result schema includes: `run_id`, `run_datetime`, `lakebench_version`, `engine`, `engine_version`, `benchmark`, `benchmark_version`, `mode`, `scale_factor`, `scenario`, `total_cores`, `compute_size`, `phase`, `test_item`, `start_datetime`, `duration_ms`, `estimated_retail_job_cost`, `iteration`, `success`, `error_message`, `engine_properties` (MAP), `execution_telemetry` (MAP).

`post_results()` collects timer results → builds result rows → optionally appends to a Delta table via `engine._append_results_to_delta()`.

---

## Engine & Benchmark Registration

Benchmarks declare engine support via `BENCHMARK_IMPL_REGISTRY`. If an engine uses only shared `BaseEngine` methods, the value is `None`; otherwise it maps to a specialized implementation class.

```python
# Register a custom engine with an existing benchmark
from lakebench.benchmarks import TPCDS
TPCDS.register_engine(MyNewEngine, None) # use shared methods
TPCDS.register_engine(MyNewEngine, MyTPCDSImpl) # use custom impl class
```

To add a new engine, subclass an existing one:
```python
from lakebench.engines import BaseEngine

class MyEngine(BaseEngine):
    SQLGLOT_DIALECT = "duckdb"  # or whichever dialect applies
    ...

from lakebench.benchmarks.elt_bench import ELTBench
ELTBench.register_engine(MyEngine, None)
benchmark = ELTBench(engine=MyEngine(...), ...)
benchmark.run()
```

---

## Query Resolution Strategy (3-Tier Fallback)

For each query, LakeBench resolves in this order:

1. **Engine-specific override** — `resources/queries/<engine_name>/q14.sql` (rare; e.g. Daft decimal casting)
2. **Parent engine class override** — `resources/queries/<parent_class>/q14.sql` (rare; e.g. Spark family)
3. **Canonical + auto-transpilation** — `resources/queries/canonical/q14.sql` transpiled via SQLGlot using the engine's `SQLGLOT_DIALECT`

Tables are automatically qualified with catalog and schema when applicable. To inspect the resolved query:

```python
benchmark = TPCH(engine=MyEngine(...))
print(benchmark._return_query_definition('q14'))
```

---

## Optional Dependency Groups

Install only what you need:

| Extra | Installs |
|---|---|
| `duckdb` | `duckdb`, `deltalake`, `pyarrow` |
| `polars` | `polars`, `deltalake`, `pyarrow` |
| `daft` | `daft`, `deltalake`, `pyarrow` |
| `tpcds_datagen` | `duckdb`, `pyarrow` |
| `tpch_datagen` | `tpchgen-cli` |
| `sparkmeasure` | `sparkmeasure` |
| `sail` | `pysail`, `pyspark[connect]`, `deltalake`, `pyarrow` |

```bash
pip install lakebench[duckdb,polars,tpch_datagen]
```
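Because engines live behind extras, imports inside engine modules are typically guarded so a missing extra fails with an actionable message. A sketch of that lazy-import pattern (the helper name and error text are illustrative, not LakeBench's actual code):

```python
import importlib

def require_extra(module_name: str, extra: str):
    """Import an optional dependency, pointing at the right extra on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this engine. "
            f"Install it with: pip install lakebench[{extra}]"
        ) from exc
```

For example, a DuckDB engine would call `require_extra("duckdb", "duckdb")` at init rather than importing at module load time.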

---

## Supported Runtimes & Storage

**Runtimes**: Local (Windows), Microsoft Fabric, Azure Synapse, HDInsight, Google Colab (experimental)

**Storage**: Local filesystem, OneLake, ADLS Gen2 (Fabric/Synapse/HDInsight), S3 (experimental), GCS (experimental)

**Table format**: Delta Lake only (via `delta-rs` for non-Spark engines)

---

## Timer (`utils/timer.py`)

`timer` is a context-manager function with a `.results` list attached. Use it inside benchmark `run()` implementations to time each phase/test item:

```python
with self.timer(phase="load", test_item="q1", engine=self.engine) as t:
    t.execution_telemetry = {"rows": 1000}  # optional metadata
    do_work()

self.post_results()  # flush timer.results → self.results → optionally Delta
```
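The shape of that helper can be sketched as a generator-based context manager with a results list attached to the function object — illustrative only; the real implementation in `utils/timer.py` records the full result schema:

```python
import time
from contextlib import contextmanager

class TimerResult:
    """Minimal stand-in for a timer record; the real schema has more fields."""
    def __init__(self, phase: str, test_item: str):
        self.phase = phase
        self.test_item = test_item
        self.duration_ms = 0
        self.success = True
        self.execution_telemetry = {}

@contextmanager
def timer(phase: str, test_item: str, **_ignored):
    result = TimerResult(phase, test_item)
    start = time.perf_counter()
    try:
        yield result
    except Exception:
        result.success = False
        raise
    finally:  # record duration and stash the result even on failure
        result.duration_ms = int((time.perf_counter() - start) * 1000)
        timer.results.append(result)  # collected later by post_results()

timer.results = []  # the `.results` list attached to the function itself
```

`post_results()` then drains this list into result rows, which is why each benchmark phase only needs the `with` block and nothing else.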

---

## Key Conventions

- **All Delta writes for non-Spark engines** go through `engines/delta_rs.py` (`DeltaRs().write_deltalake(...)`).
- **SQLGlot transpilation** is the default path; engine-specific SQL files are exceptions, not the rule.
- **`storage_options`** on `BaseEngine` is the single place for cloud auth credentials (bearer token, SAS, etc.).
- **`extended_engine_metadata`** on `BaseEngine` is the right place to attach runtime-specific metadata that ends up in the `engine_properties` MAP column of results.
- **TPC-DS / TPC-H spec compliance**: LakeBench intentionally diverges from `spark-sql-perf` to follow the official specs (see `customer.c_last_review_date_sk` and `store.s_tax_percentage` fixes in README).
- **New benchmarks** should subclass `BaseBenchmark`, define `RESULT_SCHEMA`, `BENCHMARK_IMPL_REGISTRY`, `VERSION`, and implement `run()`.
12 changes: 4 additions & 8 deletions .github/workflows/publish_to_pypi.yml
@@ -8,15 +8,11 @@ jobs:
   build-and-publish:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
-      - name: Set up Python
-        uses: actions/setup-python@v2
-        with:
-          python-version: '3.x'
-      - name: Install build dependencies
-        run: python -m pip install build twine
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
       - name: Build package
-        run: python -m build
+        run: uv build
       - name: Publish package to PyPI
         uses: pypa/gh-action-pypi-publish@v1.4.2
         with:
74 changes: 74 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,74 @@
name: Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: uv sync --group dev

      - name: Run unit tests
        run: uv run pytest tests/ --ignore=tests/integration -v --tb=short

  integration-tests:
    name: integration (${{ matrix.engine }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          - engine: duckdb
            extras_flags: "--extra duckdb --extra tpcds_datagen --extra tpch_datagen"
            test_file: "tests/integration/test_duckdb.py"
          - engine: daft
            extras_flags: "--extra daft --extra tpcds_datagen --extra tpch_datagen"
            test_file: "tests/integration/test_daft.py"
          - engine: polars
            extras_flags: "--extra polars --extra tpcds_datagen --extra tpch_datagen"
            test_file: "tests/integration/test_polars.py"
          - engine: spark
            extras_flags: "--extra spark --extra tpcds_datagen --extra tpch_datagen"
            test_file: "tests/integration/test_spark.py"
            java: "17"
          - engine: sail
            extras_flags: "--extra sail --extra tpcds_datagen --extra tpch_datagen"
            test_file: "tests/integration/test_sail.py"

    steps:
      - uses: actions/checkout@v4

      - name: Set up Java ${{ matrix.java }}
        if: matrix.java != ''
        uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: ${{ matrix.java }}

      - name: Install uv
        uses: astral-sh/setup-uv@v5
        with:
          python-version: "3.11"

      - name: Install dependencies (${{ matrix.engine }})
        run: uv sync --group dev ${{ matrix.extras_flags }}

      - name: Run integration tests (${{ matrix.engine }})
        run: uv run pytest ${{ matrix.test_file }} -v -s --tb=short -W always
7 changes: 7 additions & 0 deletions .gitignore
@@ -5,6 +5,9 @@ __pycache__/
*.pyd
*.so

# Development artifacts
dev/

# Virtual environment
.venv/
env/
@@ -34,6 +37,10 @@ build/
.DS_Store
Thumbs.db

# Spark metastore (Derby embedded DB)
metastore_db/
derby.log

# Logs
*.log

1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.11
21 changes: 18 additions & 3 deletions README.md
@@ -70,15 +70,30 @@ LakeBench supports multiple lakehouse compute engines. Each benchmark scenario d
| Synapse Spark | ✅ | ✅ | ✅ | ✅ |
| HDInsight Spark | ✅ | ✅ | ✅ | ✅ |
| DuckDB | ✅ | ✅ | ✅ | ✅ |
| Polars | ✅ | ⚠️ | ⚠️ | 🔜 |
| Daft | ✅ | ⚠️ | ⚠️ | 🔜 |
| Polars | ✅ | ⚠️ | ⚠️ | ⚠️ |
| Daft | ✅ | ⚠️ | ⚠️ | ⚠️ |
| Sail | ✅ | ✅ | ✅ | ✅ |

> **Legend:**
> ✅ = Supported
> ⚠️ = Some queries fail due to syntax issues (e.g. Polars doesn't support SQL non-equi joins; Daft is missing many standard SQL constructs such as DATE_ADD, CROSS JOIN, subqueries, non-equi joins, and CASE with operand).
> 🔜 = Coming Soon
> (Blank) = Not currently supported

For detailed pass rates and per-query failure analysis, see the [coverage reports](reports/coverage/).

## 📊 Engine Coverage Reports

Per-engine coverage reports are auto-generated by the integration test suite and show pass rates with individual query failure details.
To refresh: run the integration tests for your engine of choice (see [`tests/integration/README.md`](tests/integration/README.md)).

| Engine | Report |
|--------|--------|
| DuckDB | [reports/coverage/duckdb.md](reports/coverage/duckdb.md) |
| Polars | [reports/coverage/polars.md](reports/coverage/polars.md) |
| Daft | [reports/coverage/daft.md](reports/coverage/daft.md) |
| Spark | [reports/coverage/spark.md](reports/coverage/spark.md) |
| Sail | [reports/coverage/sail.md](reports/coverage/sail.md) |

## Where Can I Run LakeBench?
Multi-modality doesn't stop at benchmarks and engines: LakeBench also supports multiple runtimes and storage backends: