Benchmark: azula sampler comparison on two_moons #1825
Open
JiwaniZakir wants to merge 2 commits into sbi-dev:main from
Conversation
added 2 commits
March 24, 2026 16:17
Add benchmark script that trains NPSE on two_moons (mini-sbibm) and compares sampling performance between sbi's default Euler-Maruyama predictor and azula's DDIM, DDPM, Euler, and Heun samplers via a score-to-denoiser adapter.

Key components:
- SBISchedule: custom noise schedule adapter resolving the parameterization mismatch between sbi's VP-SDE (beta_min/beta_max) and azula's VPSchedule (alpha_min), with numerical validation
- SBIDenoiser: score-to-denoiser conversion via Tweedie's formula
- Evaluation: C2ST + MMD against mini-sbibm reference posteriors across multiple step counts, seeds, and observations

Related to sbi-dev#1468
- Replace sys.path.insert hack with importlib-based loader for mini_sbibm, registering the module in sys.modules so relative imports work without polluting the global path
- Add smoke test suite (tests/test_azula_benchmark.py) verifying schedule alignment, denoiser output shape, and finite samples from all 4 azula samplers — all 6 tests pass
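The importlib-based loading described in the second commit can be sketched roughly as follows. This is my own illustration, not the PR's actual code: the function name and layout are hypothetical, but the technique (building a module spec from a file path and registering it in sys.modules before execution, so the package's relative imports resolve without touching sys.path) is the standard one.

```python
import importlib.util
import sys
from pathlib import Path


def load_package_from_path(name: str, pkg_dir: Path):
    """Load a package from a directory without modifying sys.path.

    Registering the module in sys.modules under its package name *before*
    executing it lets the package's own relative imports resolve normally.
    """
    spec = importlib.util.spec_from_file_location(
        name,
        pkg_dir / "__init__.py",
        submodule_search_locations=[str(pkg_dir)],
    )
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # must happen before exec_module
    spec.loader.exec_module(module)
    return module
```

The benchmark would then obtain mini_sbibm via something like `load_package_from_path("mini_sbibm", repo_root / "mini_sbibm")`, leaving the interpreter's global import path untouched.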
Author

Just pushed a follow-up addressing a couple of things I noticed after the initial commit:

- Import cleanup — replaced the sys.path.insert hack with an importlib-based loader for mini_sbibm
- Smoke tests — added tests/test_azula_benchmark.py covering schedule alignment, denoiser output shape, and finite samples from all four azula samplers

A few notes on scope:
Summary
Benchmark comparing azula's diffusion samplers (DDIM, DDPM, Euler, Heun) against sbi's default Euler-Maruyama on the two_moons task from mini-sbibm, for both VE and VP SDE types.
Building on the architectural design proposed by @satwiksps in #1468, this implements the score-to-denoiser bridge and addresses noise schedule alignment. Responds to @janfb's request for performance benchmarks before proceeding with the full azula integration.
Related to #1468
Approach
- SBISchedule: Custom noise schedule adapter that delegates to sbi's mean_t_fn/std_fn, resolving the parameterization mismatch between sbi's VP-SDE (beta_min/beta_max) and azula's VPSchedule (alpha_min). Includes numerical validation showing the schedules match exactly.
- SBIDenoiser: Adapter converting sbi's score output to azula's denoiser protocol via Tweedie's formula: θ̂₀ = (θₜ + σₜ² · score) / αₜ

Technical Note: Noise Schedule Mismatch
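As a minimal sketch of the score-to-denoiser conversion (function name and argument order are my own; the PR's actual SBIDenoiser wraps sbi's score estimator), Tweedie's formula recovers the posterior mean of the clean parameter from the score of the noised marginal:

```python
def score_to_denoiser(theta_t: float, score: float, alpha_t: float, sigma_t: float) -> float:
    """Tweedie's formula: E[theta_0 | theta_t] = (theta_t + sigma_t**2 * score) / alpha_t."""
    return (theta_t + sigma_t**2 * score) / alpha_t


# Sanity check against a case with a closed form: for a standard-normal prior,
# theta_t = alpha_t * theta_0 + sigma_t * eps has marginal N(0, alpha_t^2 + sigma_t^2),
# so the score is -theta_t / (alpha_t^2 + sigma_t^2) and the posterior mean is
# alpha_t * theta_t / (alpha_t^2 + sigma_t^2).
alpha_t, sigma_t, theta_t = 0.8, 0.6, 1.5
score = -theta_t / (alpha_t**2 + sigma_t**2)
print(score_to_denoiser(theta_t, score, alpha_t, sigma_t))  # 1.2 (= 0.8 * 1.5 / 1.0)
```

The same arithmetic applies elementwise to batched tensors, which is presumably how the adapter uses it.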
A key finding during implementation: sbi's VP-SDE and azula's VPSchedule use mathematically incompatible formulas for α(t):

- sbi: α(t) = exp(−0.25·t²·(β_max − β_min) − 0.5·t·β_min) → α(1) ≈ 0.082
- azula: α(t) = exp(t²·log(α_min)) → α(1) = 0.001

Using azula's built-in VPSchedule directly would produce incorrect samples. The SBISchedule adapter resolves this by delegating to the score estimator's own mean_t_fn and std_fn, which is also the approach I'd recommend for the full integration. (For VE-SDE, both libraries use the same geometric sigma schedule, so this mismatch only affects VP/SubVP.)

Note: mean_t_fn (the raw signal scaling) must be used rather than approx_marginal_mean (which multiplies by mean_0). After z-scoring, mean_0 ≈ 0, so the latter collapses to zero and breaks the Tweedie division.

Results
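The schedule incompatibility described in the technical note can be reproduced numerically. The β values below are my assumption, chosen to match the α(1) ≈ 0.082 quoted above, so check them against the actual sbi defaults; α_min = 0.001 follows directly from the azula formula, since exp(1²·log(α_min)) = α_min.

```python
import math

# Assumed parameter values (not taken from the PR; verify against library defaults).
beta_min, beta_max = 0.01, 10.0  # sbi VP-SDE drift coefficients
alpha_min = 0.001                # azula VPSchedule endpoint


def alpha_sbi(t: float) -> float:
    """sbi VP-SDE signal scaling (Song et al. parameterization)."""
    return math.exp(-0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min)


def alpha_azula(t: float) -> float:
    """azula VPSchedule signal scaling."""
    return math.exp(t**2 * math.log(alpha_min))


print(f"alpha_sbi(1)   = {alpha_sbi(1.0):.3f}")    # ~0.082
print(f"alpha_azula(1) = {alpha_azula(1.0):.3f}")  # 0.001
```

Both schedules agree at t = 0 (α(0) = 1) but diverge by nearly two orders of magnitude at t = 1, which is why plugging sbi's score model into an unmodified azula VPSchedule produces incorrect samples.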
VE-SDE (default)
All samplers converge to similar quality for VE, confirming the adapter works correctly.
VP-SDE
Key observations:
NFE = number of function (score) evaluations. Heun uses 2 per step; all others use 1. For fair efficiency comparisons, NFE is the relevant cost metric.
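The cost accounting above can be made explicit with a small helper (hypothetical, not part of the benchmark script): Heun is a second-order method that evaluates the score network twice per step, so at equal step counts it costs twice as many NFE as the single-evaluation samplers.

```python
def nfe(num_steps: int, sampler: str) -> int:
    """Number of score-network evaluations for a sampling run.

    Heun performs a predictor and a corrector evaluation per step;
    DDIM, DDPM, Euler, and Euler-Maruyama each use one.
    """
    evals_per_step = 2 if sampler == "heun" else 1
    return num_steps * evals_per_step


print(nfe(100, "heun"), nfe(100, "ddim"))  # 200 100
```

So a 100-step Heun run should be compared against 200-step runs of the other samplers, not 100-step ones.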
Reproducibility
pip install azula
python benchmarks/azula_sampler_benchmark.py --task two_moons --sde-types ve vp

# Quick run:
python benchmarks/azula_sampler_benchmark.py --sde-types ve --steps 50 100 --seeds 0

Test plan