
Benchmark: azula sampler comparison on two_moons #1825

Open
JiwaniZakir wants to merge 2 commits into sbi-dev:main from JiwaniZakir:benchmark/azula-sampler-comparison

Conversation

@JiwaniZakir

Summary

Benchmark comparing azula's diffusion samplers (DDIM, DDPM, Euler, Heun) against sbi's default Euler-Maruyama on the two_moons task from mini-sbibm, for both VE and VP SDE types.

Building on the architectural design proposed by @satwiksps in #1468, this implements the score-to-denoiser bridge and addresses noise schedule alignment. Responds to @janfb's request for performance benchmarks before proceeding with the full azula integration.

Related to #1468

Approach

  • SBISchedule: Custom noise schedule adapter that delegates to sbi's mean_t_fn/std_fn, resolving the parameterization mismatch between sbi's VP-SDE (beta_min/beta_max) and azula's VPSchedule (alpha_min). Includes numerical validation showing the schedules match exactly.
  • SBIDenoiser: Adapter converting sbi's score output to azula's denoiser protocol via Tweedie's formula: θ̂₀ = (θₜ + σₜ² · score) / αₜ
  • Evaluation: C2ST + MMD against mini-sbibm reference posteriors, across 4 step counts, 3 seeds, and 3 observations
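The score-to-denoiser conversion in SBIDenoiser reduces to one application of Tweedie's formula. A minimal sketch of that step (the function name and signature are illustrative, not sbi's or azula's actual API):

```python
import torch

def tweedie_denoise(theta_t, score, mean_t, std_t):
    """Convert a score estimate into a denoised-sample estimate via
    Tweedie's formula: theta_hat_0 = (theta_t + std_t**2 * score) / mean_t,
    where mean_t is the signal scaling alpha_t and std_t the noise level sigma_t.
    """
    return (theta_t + std_t**2 * score) / mean_t
```

As a sanity check, plugging in the exact score of a known Gaussian perturbation recovers the clean sample exactly, which is the property the adapter relies on.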

Technical Note: Noise Schedule Mismatch

A key finding during implementation: sbi's VP-SDE and azula's VPSchedule use mathematically incompatible formulas for α(t):

  • sbi: α(t) = exp(−0.25·t²·(β_max − β_min) − 0.5·t·β_min), giving α(1) = 0.082
  • azula: α(t) = exp(t²·log(α_min)), giving α(1) = 0.001

Using azula's built-in VPSchedule directly would produce incorrect samples. The SBISchedule adapter resolves this by delegating to the score estimator's own mean_t_fn and std_fn, which is also the approach I'd recommend for the full integration. (For VE-SDE, both libraries use the same geometric sigma schedule, so this mismatch only affects VP/SubVP.)

Note: mean_t_fn (the raw signal scaling) must be used rather than approx_marginal_mean (which multiplies by mean_0). After z-scoring, mean_0 ≈ 0, so the latter collapses to zero and breaks the Tweedie division.
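To make the mismatch concrete, here is a small sketch of the two α(t) parameterizations side by side. The β_min/β_max/α_min values are placeholders for illustration (not sbi's or azula's actual defaults, so the α(1) numbers differ from those quoted above); the point is that the curves disagree for every t > 0:

```python
import math

def alpha_sbi_style(t, beta_min=0.1, beta_max=20.0):
    # sbi-style VP signal scaling: exp(-0.25 t^2 (beta_max - beta_min) - 0.5 t beta_min)
    # beta_min/beta_max are placeholder values for illustration.
    return math.exp(-0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min)

def alpha_azula_style(t, alpha_min=0.001):
    # azula-style VPSchedule: exp(t^2 log(alpha_min)), so alpha(1) = alpha_min exactly.
    return math.exp(t**2 * math.log(alpha_min))
```

Both start at α(0) = 1, but they decay at different rates, so swapping one schedule for the other silently changes the noise level the denoiser sees at every intermediate t.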

Results

VE-SDE (default)

| Sampler | Steps | NFE | C2ST (mean ± std) | Time (s) |
| --- | --- | --- | --- | --- |
| Euler-Maruyama (sbi) | 50 | 50 | 0.637 ± 0.018 | 0.16 |
| DDIM (η=0) | 50 | 50 | 0.627 ± 0.023 | 0.16 |
| DDPM | 50 | 50 | 0.614 ± 0.016 | 0.16 |
| Heun | 50 | 100 | 0.624 ± 0.022 | 0.32 |
| Euler-Maruyama (sbi) | 500 | 500 | 0.630 ± 0.015 | 1.32 |
| DDIM (η=0) | 500 | 500 | 0.620 ± 0.024 | 1.41 |
| DDPM | 500 | 500 | 0.628 ± 0.012 | 1.40 |
| Heun | 500 | 1000 | 0.625 ± 0.024 | 2.75 |

All samplers converge to similar quality for VE, confirming the adapter works correctly.

VP-SDE

| Sampler | Steps | NFE | C2ST (mean ± std) | Time (s) |
| --- | --- | --- | --- | --- |
| Euler-Maruyama (sbi) | 50 | 50 | 0.855 ± 0.007 | 0.15 |
| DDIM (η=0) | 50 | 50 | 0.801 ± 0.008 | 0.24 |
| DDPM | 50 | 50 | 0.787 ± 0.006 | 0.15 |
| Heun | 50 | 100 | 0.794 ± 0.021 | 0.36 |
| Euler-Maruyama (sbi) | 500 | 500 | 0.782 ± 0.007 | 1.88 |
| DDIM (η=0) | 500 | 500 | 0.787 ± 0.013 | 1.92 |
| DDPM | 500 | 500 | 0.774 ± 0.010 | 1.76 |
| Heun | 500 | 1000 | 0.792 ± 0.019 | 3.19 |

Key observations:

  • At low step counts (50), azula's DDPM and DDIM outperform Euler-Maruyama — C2ST of 0.787/0.801 vs 0.855 (lower is better, 0.5 = perfect).
  • Heun converges fastest per step (2nd-order accuracy) but at 2x NFE cost — at 50 steps (100 NFE) it matches DDIM at 100 steps (100 NFE).
  • All methods converge at high step counts, confirming schedule alignment is correct.
  • VP overall has higher C2ST than VE, consistent with VP being a harder optimization landscape for this task.

NFE = number of function (score) evaluations. Heun uses 2 per step; all others use 1. For fair efficiency comparisons, NFE is the relevant cost metric.

Reproducibility

```shell
pip install azula
python benchmarks/azula_sampler_benchmark.py --task two_moons --sde-types ve vp

# Quick run:
python benchmarks/azula_sampler_benchmark.py --sde-types ve --steps 50 100 --seeds 0
```

Test plan

  • Script runs end-to-end on CPU for both VE and VP
  • Schedule validation assertions pass (α/σ alignment < 1e-7)
  • All sampler configurations produce finite, correctly-shaped output
  • C2ST values converge at high step counts (confirms correctness)
  • ruff check + ruff format + pre-commit hooks all pass
  • No modifications to sbi core files — benchmark is fully self-contained
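The schedule-validation bullet above can be sketched as a simple grid comparison. The function name and the schedule callables are placeholders; the 1e-7 tolerance is the one stated in the test plan:

```python
import torch

def assert_schedules_aligned(alpha_a, alpha_b, sigma_a, sigma_b, tol=1e-7):
    """Check that two schedules agree on alpha(t) and sigma(t) over a time grid.

    alpha_* and sigma_* are callables mapping a tensor of times in [0, 1]
    to the corresponding signal scaling / noise level.
    """
    t = torch.linspace(0.0, 1.0, 101)
    assert torch.max(torch.abs(alpha_a(t) - alpha_b(t))) < tol, "alpha(t) mismatch"
    assert torch.max(torch.abs(sigma_a(t) - sigma_b(t))) < tol, "sigma(t) mismatch"
```

A check like this catches the VP parameterization mismatch described above immediately, since the two α(t) curves diverge well beyond 1e-7 at intermediate t.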

Zakir Jiwani added 2 commits March 24, 2026 16:17
Add benchmark script that trains NPSE on two_moons (mini-sbibm) and compares
sampling performance between sbi's default Euler-Maruyama predictor and azula's
DDIM, DDPM, Euler, and Heun samplers via a score-to-denoiser adapter.

Key components:
- SBISchedule: custom noise schedule adapter resolving the parameterization
  mismatch between sbi's VP-SDE (beta_min/beta_max) and azula's VPSchedule
  (alpha_min), with numerical validation
- SBIDenoiser: score-to-denoiser conversion via Tweedie's formula
- Evaluation: C2ST + MMD against mini-sbibm reference posteriors across
  multiple step counts, seeds, and observations

Related to sbi-dev#1468

- Replace sys.path.insert hack with importlib-based loader for mini_sbibm,
  registering the module in sys.modules so relative imports work without
  polluting the global path
- Add smoke test suite (tests/test_azula_benchmark.py) verifying schedule
  alignment, denoiser output shape, and finite samples from all 4 azula
  samplers — all 6 tests pass
@JiwaniZakir
Author

Just pushed a follow-up addressing a couple things I noticed after the initial commit:

Import cleanup — Replaced the sys.path.insert hack for loading mini_sbibm with an importlib-based loader that registers the module in sys.modules without polluting the global path. Feels less brittle if the repo structure changes.

Smoke tests — Added tests/test_azula_benchmark.py with 6 tests covering schedule alignment validation, denoiser output shape, and finite sample generation from all 4 azula samplers (DDIM, DDPM, Euler, Heun). All pass in ~1.5s. Marked @pytest.mark.slow so they don't run in standard CI unless opted in, and @pytest.mark.skipif so they're skipped gracefully if azula isn't installed.
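The marker setup described above would look roughly like this (the test name is illustrative; only the two markers and the import guard are the point):

```python
import importlib.util

import pytest

# Detect azula without importing it, so collection works when it's absent.
azula_installed = importlib.util.find_spec("azula") is not None

@pytest.mark.slow
@pytest.mark.skipif(not azula_installed, reason="azula is not installed")
def test_azula_samplers_produce_finite_samples():
    # Placeholder body: the real test would draw samples from each azula
    # sampler and assert they are finite and correctly shaped.
    ...
```

With this guard, standard CI skips the suite cleanly instead of failing on a missing optional dependency.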

A few notes on scope:

  • Results shown are for two_moons since that's what was suggested in #1468 (Adding interface to azula samplers). The script also supports --task slcp if results on a higher-dimensional task would be useful — happy to add those.
  • azula is only imported inside the benchmark script and test file (behind try/except), not added to pyproject.toml — keeping it out of core dependencies. If this lands, it could be added as an optional extra (sbi[azula]) during the full integration.
